[RFC PATCH 57/86] coccinelle: script to remove cond_resched()

[RFC PATCH 57/86] coccinelle: script to remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
Rudimentary script to remove the straightforward subset of
cond_resched() and allies:

1)  if (need_resched())
	  cond_resched()

2)  expression*;
    cond_resched();  /* or in the reverse order */

3)  if (expression)
	statement
    cond_resched();  /* or in the reverse order */

The last two patterns depend on the control flow level to ensure
that the complex cond_resched() patterns (e.g. conditional ones)
are left alone and we only pick up ones that are minimally
related to the neighbouring code.
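
For illustration, a hypothetical call site of the sort patterns 2) and 3)
are meant to pick up (the loop and helper names are made up):

	for (i = 0; i < nr; i++) {
		put_entry(entries[i]);	/* plain expression statement... */
		cond_resched();		/* ...so the script removes it */
	}

One way to run the script over a subtree is something like:

  make coccicheck MODE=patch COCCI=scripts/coccinelle/api/cond_resched.cocci M=mm/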

Cc: Julia Lawall <Julia.Lawall@inria.fr>
Cc: Nicolas Palix <nicolas.palix@imag.fr>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 scripts/coccinelle/api/cond_resched.cocci | 53 +++++++++++++++++++++++
 1 file changed, 53 insertions(+)
 create mode 100644 scripts/coccinelle/api/cond_resched.cocci

diff --git a/scripts/coccinelle/api/cond_resched.cocci b/scripts/coccinelle/api/cond_resched.cocci
new file mode 100644
index 000000000000..bf43768a8f8c
--- /dev/null
+++ b/scripts/coccinelle/api/cond_resched.cocci
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/// Remove naked cond_resched() statements
+///
+//# Remove cond_resched() statements when:
+//#   - they execute at the same control flow level as the previous or the
+//#     next statement (this lets us avoid complicated conditionals in
+//#     the neighbourhood), or
+//#   - they are of the form "if (need_resched()) cond_resched()", which
+//#     is always safe.
+//#
+//# Coccinelle generally takes care of comments in the immediate neighbourhood,
+//# but other comments alluding to rescheduling may need manual attention.
+//#
+virtual patch
+virtual context
+
+@ r1 @
+identifier r;
+@@
+
+(
+ r = cond_resched();
+|
+-if (need_resched())
+-	cond_resched();
+)
+
+@ r2 @
+expression E;
+statement S,T;
+@@
+(
+ E;
+|
+ if (E) S
+|
+ if (E) S else T
+|
+)
+-cond_resched();
+
+@ r3 @
+expression E;
+statement S,T;
+@@
+-cond_resched();
+(
+ E;
+|
+ if (E) S
+|
+ if (E) S else T
+)
-- 
2.31.1
Re: [RFC PATCH 57/86] coccinelle: script to remove cond_resched()
Posted by Julia Lawall 2 years, 1 month ago

On Tue, 7 Nov 2023, Ankur Arora wrote:

> Rudimentary script to remove the straight-forward subset of
> cond_resched() and allies:
>
> 1)  if (need_resched())
> 	  cond_resched()
>
> 2)  expression*;
>     cond_resched();  /* or in the reverse order */
>
> 3)  if (expression)
> 	statement
>     cond_resched();  /* or in the reverse order */
>
> The last two patterns depend on the control flow level to ensure
> that the complex cond_resched() patterns (ex. conditioned ones)
> are left alone and we only pick up ones which are only minimally
> related the neighbouring code.
>
> Cc: Julia Lawall <Julia.Lawall@inria.fr>
> Cc: Nicolas Palix <nicolas.palix@imag.fr>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  scripts/coccinelle/api/cond_resched.cocci | 53 +++++++++++++++++++++++
>  1 file changed, 53 insertions(+)
>  create mode 100644 scripts/coccinelle/api/cond_resched.cocci
>
> diff --git a/scripts/coccinelle/api/cond_resched.cocci b/scripts/coccinelle/api/cond_resched.cocci
> new file mode 100644
> index 000000000000..bf43768a8f8c
> --- /dev/null
> +++ b/scripts/coccinelle/api/cond_resched.cocci
> @@ -0,0 +1,53 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/// Remove naked cond_resched() statements
> +///
> +//# Remove cond_resched() statements when:
> +//#   - executing at the same control flow level as the previous or the
> +//#     next statement (this lets us avoid complicated conditionals in
> +//#     the neighbourhood.)
> +//#   - they are of the form "if (need_resched()) cond_resched()" which
> +//#     is always safe.
> +//#
> +//# Coccinelle generally takes care of comments in the immediate neighbourhood
> +//# but might need to handle other comments alluding to rescheduling.
> +//#
> +virtual patch
> +virtual context
> +
> +@ r1 @
> +identifier r;
> +@@
> +
> +(
> + r = cond_resched();
> +|
> +-if (need_resched())
> +-	cond_resched();
> +)

This rule doesn't make sense.  The first branch of the disjunction will
never match a place where the second branch matches.  Anyway, in the
second branch there is no assignment, so I don't see what the first branch
is protecting against.

The disjunction is just useless.  Whether it is there, or whether only
the second branch is there, doesn't have any impact on the result.
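
For reference, with the disjunction dropped the rule reduces to just
(sketch, same effect as the rule above):

@ r1 @
@@

-if (need_resched())
-	cond_resched();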

> +
> +@ r2 @
> +expression E;
> +statement S,T;
> +@@
> +(
> + E;
> +|
> + if (E) S

This case is not needed.  It will be matched by the next case.

> +|
> + if (E) S else T
> +|
> +)
> +-cond_resched();
> +
> +@ r3 @
> +expression E;
> +statement S,T;
> +@@
> +-cond_resched();
> +(
> + E;
> +|
> + if (E) S

As above.

> +|
> + if (E) S else T
> +)

I have the impression that you are trying to retain some cond_rescheds.
Could you send an example of one that you are trying to keep?  Overall,
the above rules seem a bit ad hoc.  You may be keeping some cases you
don't want to, or removing some cases that you want to keep.

Of course, if you are confident that the job is done with this semantic
patch as it is, then that's fine too.

julia
Re: [RFC PATCH 57/86] coccinelle: script to remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
Julia Lawall <julia.lawall@inria.fr> writes:

> On Tue, 7 Nov 2023, Ankur Arora wrote:
>
>> Rudimentary script to remove the straight-forward subset of
>> cond_resched() and allies:
>>
>> 1)  if (need_resched())
>> 	  cond_resched()
>>
>> 2)  expression*;
>>     cond_resched();  /* or in the reverse order */
>>
>> 3)  if (expression)
>> 	statement
>>     cond_resched();  /* or in the reverse order */
>>
>> The last two patterns depend on the control flow level to ensure
>> that the complex cond_resched() patterns (ex. conditioned ones)
>> are left alone and we only pick up ones which are only minimally
>> related the neighbouring code.
>>
>> Cc: Julia Lawall <Julia.Lawall@inria.fr>
>> Cc: Nicolas Palix <nicolas.palix@imag.fr>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  scripts/coccinelle/api/cond_resched.cocci | 53 +++++++++++++++++++++++
>>  1 file changed, 53 insertions(+)
>>  create mode 100644 scripts/coccinelle/api/cond_resched.cocci
>>
>> diff --git a/scripts/coccinelle/api/cond_resched.cocci b/scripts/coccinelle/api/cond_resched.cocci
>> new file mode 100644
>> index 000000000000..bf43768a8f8c
>> --- /dev/null
>> +++ b/scripts/coccinelle/api/cond_resched.cocci
>> @@ -0,0 +1,53 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/// Remove naked cond_resched() statements
>> +///
>> +//# Remove cond_resched() statements when:
>> +//#   - executing at the same control flow level as the previous or the
>> +//#     next statement (this lets us avoid complicated conditionals in
>> +//#     the neighbourhood.)
>> +//#   - they are of the form "if (need_resched()) cond_resched()" which
>> +//#     is always safe.
>> +//#
>> +//# Coccinelle generally takes care of comments in the immediate neighbourhood
>> +//# but might need to handle other comments alluding to rescheduling.
>> +//#
>> +virtual patch
>> +virtual context
>> +
>> +@ r1 @
>> +identifier r;
>> +@@
>> +
>> +(
>> + r = cond_resched();
>> +|
>> +-if (need_resched())
>> +-	cond_resched();
>> +)
>
> This rule doesn't make sense.  The first branch of the disjunction will
> never match a a place where the second branch matches.  Anyway, in the
> second branch there is no assignment, so I don't see what the first branch
> is protecting against.
>
> The disjunction is just useless.  Whether it is there or or whether only
> the second brancha is there, doesn't have any impact on the result.
>
>> +
>> +@ r2 @
>> +expression E;
>> +statement S,T;
>> +@@
>> +(
>> + E;
>> +|
>> + if (E) S
>
> This case is not needed.  It will be matched by the next case.
>
>> +|
>> + if (E) S else T
>> +|
>> +)
>> +-cond_resched();
>> +
>> +@ r3 @
>> +expression E;
>> +statement S,T;
>> +@@
>> +-cond_resched();
>> +(
>> + E;
>> +|
>> + if (E) S
>
> As above.
>
>> +|
>> + if (E) S else T
>> +)
>
> I have the impression that you are trying to retain some cond_rescheds.
> Could you send an example of one that you are trying to keep?  Overall,
> the above rules seem a bit ad hoc.  You may be keeping some cases you
> don't want to, or removing some cases that you want to keep.

Right. I was trying to ensure that the script only handled the cases
that didn't have any "interesting" connections to the surrounding code.

Just to give you an example of the kind of constructs that I wanted
to avoid:

mm/memory.c::zap_pmd_range():

                if (addr != next)
                        pmd--;
        } while (pmd++, cond_resched(), addr != end);

mm/backing-dev.c::cleanup_offline_cgwbs_workfn()

                while (cleanup_offline_cgwb(wb))
                        cond_resched();

But from a quick check the simplest coccinelle script does a much
better job than my overly complex (and incorrect) one:

@r1@
@@
-       cond_resched();

It avoids the first one. And transforms the second to:

                while (cleanup_offline_cgwb(wb))
                        {}

which is exactly what I wanted.
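
(For a quick check like that, something along the lines of

    spatch --sp-file min.cocci --in-place mm/memory.c mm/backing-dev.c

should do, with min.cocci holding just the rule above.)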

> Of course, if you are confident that the job is done with this semantic
> patch as it is, then that's fine too.

Not at all. Thanks for pointing out the mistakes.



--
ankur
Re: [RFC PATCH 57/86] coccinelle: script to remove cond_resched()
Posted by Julia Lawall 2 years, 1 month ago

On Wed, 8 Nov 2023, Ankur Arora wrote:

>
> Julia Lawall <julia.lawall@inria.fr> writes:
>
> > On Tue, 7 Nov 2023, Ankur Arora wrote:
> >
> >> Rudimentary script to remove the straight-forward subset of
> >> cond_resched() and allies:
> >>
> >> 1)  if (need_resched())
> >> 	  cond_resched()
> >>
> >> 2)  expression*;
> >>     cond_resched();  /* or in the reverse order */
> >>
> >> 3)  if (expression)
> >> 	statement
> >>     cond_resched();  /* or in the reverse order */
> >>
> >> The last two patterns depend on the control flow level to ensure
> >> that the complex cond_resched() patterns (ex. conditioned ones)
> >> are left alone and we only pick up ones which are only minimally
> >> related the neighbouring code.
> >>
> >> Cc: Julia Lawall <Julia.Lawall@inria.fr>
> >> Cc: Nicolas Palix <nicolas.palix@imag.fr>
> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> >> ---
> >>  scripts/coccinelle/api/cond_resched.cocci | 53 +++++++++++++++++++++++
> >>  1 file changed, 53 insertions(+)
> >>  create mode 100644 scripts/coccinelle/api/cond_resched.cocci
> >>
> >> diff --git a/scripts/coccinelle/api/cond_resched.cocci b/scripts/coccinelle/api/cond_resched.cocci
> >> new file mode 100644
> >> index 000000000000..bf43768a8f8c
> >> --- /dev/null
> >> +++ b/scripts/coccinelle/api/cond_resched.cocci
> >> @@ -0,0 +1,53 @@
> >> +// SPDX-License-Identifier: GPL-2.0-only
> >> +/// Remove naked cond_resched() statements
> >> +///
> >> +//# Remove cond_resched() statements when:
> >> +//#   - executing at the same control flow level as the previous or the
> >> +//#     next statement (this lets us avoid complicated conditionals in
> >> +//#     the neighbourhood.)
> >> +//#   - they are of the form "if (need_resched()) cond_resched()" which
> >> +//#     is always safe.
> >> +//#
> >> +//# Coccinelle generally takes care of comments in the immediate neighbourhood
> >> +//# but might need to handle other comments alluding to rescheduling.
> >> +//#
> >> +virtual patch
> >> +virtual context
> >> +
> >> +@ r1 @
> >> +identifier r;
> >> +@@
> >> +
> >> +(
> >> + r = cond_resched();
> >> +|
> >> +-if (need_resched())
> >> +-	cond_resched();
> >> +)
> >
> > This rule doesn't make sense.  The first branch of the disjunction will
> > never match a a place where the second branch matches.  Anyway, in the
> > second branch there is no assignment, so I don't see what the first branch
> > is protecting against.
> >
> > The disjunction is just useless.  Whether it is there or or whether only
> > the second brancha is there, doesn't have any impact on the result.
> >
> >> +
> >> +@ r2 @
> >> +expression E;
> >> +statement S,T;
> >> +@@
> >> +(
> >> + E;
> >> +|
> >> + if (E) S
> >
> > This case is not needed.  It will be matched by the next case.
> >
> >> +|
> >> + if (E) S else T
> >> +|
> >> +)
> >> +-cond_resched();
> >> +
> >> +@ r3 @
> >> +expression E;
> >> +statement S,T;
> >> +@@
> >> +-cond_resched();
> >> +(
> >> + E;
> >> +|
> >> + if (E) S
> >
> > As above.
> >
> >> +|
> >> + if (E) S else T
> >> +)
> >
> > I have the impression that you are trying to retain some cond_rescheds.
> > Could you send an example of one that you are trying to keep?  Overall,
> > the above rules seem a bit ad hoc.  You may be keeping some cases you
> > don't want to, or removing some cases that you want to keep.
>
> Right. I was trying to ensure that the script only handled the cases
> that didn't have any "interesting" connections to the surrounding code.
>
> Just to give you an example of the kind of constructs that I wanted
> to avoid:
>
> mm/memoy.c::zap_pmd_range():
>
>                 if (addr != next)
>                         pmd--;
>         } while (pmd++, cond_resched(), addr != end);
>
> mm/backing-dev.c::cleanup_offline_cgwbs_workfn()
>
>                 while (cleanup_offline_cgwb(wb))
>                         cond_resched();
>
>
>                 while (cleanup_offline_cgwb(wb))
>                         cond_resched();
>
> But from a quick check the simplest coccinelle script does a much
> better job than my overly complex (and incorrect) one:
>
> @r1@
> @@
> -       cond_resched();
>
> It avoids the first one. And transforms the second to:
>
>                 while (cleanup_offline_cgwb(wb))
>                         {}
>
> which is exactly what I wanted.

Perfect!

It could be good to run both scripts and compare the results.

julia

>
> > Of course, if you are confident that the job is done with this semantic
> > patch as it is, then that's fine too.
>
> Not at all. Thanks for pointing out the mistakes.
>
>
>
> --
> ankur
>
Re: [RFC PATCH 57/86] coccinelle: script to remove cond_resched()
Posted by Paul E. McKenney 2 years ago
On Tue, Nov 07, 2023 at 03:07:53PM -0800, Ankur Arora wrote:
> Rudimentary script to remove the straight-forward subset of
> cond_resched() and allies:
> 
> 1)  if (need_resched())
> 	  cond_resched()
> 
> 2)  expression*;
>     cond_resched();  /* or in the reverse order */
> 
> 3)  if (expression)
> 	statement
>     cond_resched();  /* or in the reverse order */
> 
> The last two patterns depend on the control flow level to ensure
> that the complex cond_resched() patterns (ex. conditioned ones)
> are left alone and we only pick up ones which are only minimally
> related the neighbouring code.

This series looks to get rid of stall warnings for long in-kernel
preempt-enabled code paths, which is of course a very good thing.
But removing all of the cond_resched() calls can actually increase
scheduling latency compared to the current CONFIG_PREEMPT_NONE=y state,
correct?

If so, it would be good to take a measured approach.  For example, it
is clear that a loop that does a cond_resched() every (say) ten jiffies
can remove that cond_resched() without penalty, at least in kernels built
with either CONFIG_NO_HZ_FULL=n or CONFIG_PREEMPT=y.  But this is not so
clear for a loop that does a cond_resched() every (say) ten microseconds.
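
Hypothetical shapes of the two loops, with made-up helpers, just to make
the contrast concrete:

	/* ~ten jiffies of work per cond_resched(): the tick will set
	 * need_resched() during the body anyway, so dropping the call
	 * changes little. */
	while (have_work()) {
		do_ten_jiffies_of_work();
		cond_resched();
	}

	/* ~ten microseconds of work per cond_resched(): today a waking
	 * task waits at most ~10us here; with the call gone it may wait
	 * until the next tick or the lazy-preemption deadline. */
	while (have_work()) {
		do_ten_microseconds_of_work();
		cond_resched();
	}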

Or am I missing something here?

							Thanx, Paul

> Cc: Julia Lawall <Julia.Lawall@inria.fr>
> Cc: Nicolas Palix <nicolas.palix@imag.fr>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  scripts/coccinelle/api/cond_resched.cocci | 53 +++++++++++++++++++++++
>  1 file changed, 53 insertions(+)
>  create mode 100644 scripts/coccinelle/api/cond_resched.cocci
> 
> diff --git a/scripts/coccinelle/api/cond_resched.cocci b/scripts/coccinelle/api/cond_resched.cocci
> new file mode 100644
> index 000000000000..bf43768a8f8c
> --- /dev/null
> +++ b/scripts/coccinelle/api/cond_resched.cocci
> @@ -0,0 +1,53 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/// Remove naked cond_resched() statements
> +///
> +//# Remove cond_resched() statements when:
> +//#   - executing at the same control flow level as the previous or the
> +//#     next statement (this lets us avoid complicated conditionals in
> +//#     the neighbourhood.)
> +//#   - they are of the form "if (need_resched()) cond_resched()" which
> +//#     is always safe.
> +//#
> +//# Coccinelle generally takes care of comments in the immediate neighbourhood
> +//# but might need to handle other comments alluding to rescheduling.
> +//#
> +virtual patch
> +virtual context
> +
> +@ r1 @
> +identifier r;
> +@@
> +
> +(
> + r = cond_resched();
> +|
> +-if (need_resched())
> +-	cond_resched();
> +)
> +
> +@ r2 @
> +expression E;
> +statement S,T;
> +@@
> +(
> + E;
> +|
> + if (E) S
> +|
> + if (E) S else T
> +|
> +)
> +-cond_resched();
> +
> +@ r3 @
> +expression E;
> +statement S,T;
> +@@
> +-cond_resched();
> +(
> + E;
> +|
> + if (E) S
> +|
> + if (E) S else T
> +)
> -- 
> 2.31.1
>
Re: [RFC PATCH 57/86] coccinelle: script to remove cond_resched()
Posted by Ankur Arora 2 years ago
Paul E. McKenney <paulmck@kernel.org> writes:

> On Tue, Nov 07, 2023 at 03:07:53PM -0800, Ankur Arora wrote:
>> Rudimentary script to remove the straight-forward subset of
>> cond_resched() and allies:
>>
>> 1)  if (need_resched())
>> 	  cond_resched()
>>
>> 2)  expression*;
>>     cond_resched();  /* or in the reverse order */
>>
>> 3)  if (expression)
>> 	statement
>>     cond_resched();  /* or in the reverse order */
>>
>> The last two patterns depend on the control flow level to ensure
>> that the complex cond_resched() patterns (ex. conditioned ones)
>> are left alone and we only pick up ones which are only minimally
>> related the neighbouring code.
>
> This series looks to get rid of stall warnings for long in-kernel
> preempt-enabled code paths, which is of course a very good thing.
> But removing all of the cond_resched() calls can actually increase
> scheduling latency compared to the current CONFIG_PREEMPT_NONE=y state,
> correct?

Not necessarily.

If TIF_NEED_RESCHED_LAZY is set, then we let the current task finish
before preempting. If that task runs for arbitrarily long (what Thomas
calls the hog problem), we currently allow it to run for up to one
extra tick (which might shorten or become a tunable.)

If TIF_NEED_RESCHED is set, then it gets folded the same way it does now
and preemption happens at the next safe preemption point.
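
Very roughly, and with made-up helper names, the tick-side part of that
policy might look like (a sketch, not the series' code):

	/*
	 * On the tick: a deferred (lazy) reschedule that has been pending
	 * for about a tick is upgraded to an immediate one.
	 */
	static void tick_upgrade_lazy_resched(struct task_struct *curr)
	{
		if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY) &&
		    ran_for_an_extra_tick(curr))
			set_tsk_thread_flag(curr, TIF_NEED_RESCHED);
	}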

So, I guess the scheduling latency would always be bounded but how much
latency a task would incur would be scheduler policy dependent.

This is early days, so the policy (or really the rest of it) isn't set
in stone but having two levels of preemption -- immediate and
deferred -- does seem to give the scheduler greater freedom of policy.

Btw, are you concerned about the scheduling latencies in general or the
scheduling latency of a particular set of tasks?

> If so, it would be good to take a measured approach.  For example, it
> is clear that a loop that does a cond_resched() every (say) ten jiffies
> can remove that cond_resched() without penalty, at least in kernels built
> with either CONFIG_NO_HZ_FULL=n or CONFIG_PREEMPT=y.  But this is not so
> clear for a loop that does a cond_resched() every (say) ten microseconds.

True. Though both of those loops sound bad :).

Yeah, and as we were discussing offlist, the question is how the density
of points where preempt_dec_and_test() returns true compares with that of
calls to cond_resched().

And if they are similar then we could replace the quiescence reporting
in cond_resched() with reporting in preempt_enable() (as you mention
elsewhere in the thread.)


Thanks

--
ankur
Re: [RFC PATCH 57/86] coccinelle: script to remove cond_resched()
Posted by Paul E. McKenney 2 years ago
On Mon, Nov 20, 2023 at 09:16:19PM -0800, Ankur Arora wrote:
> 
> Paul E. McKenney <paulmck@kernel.org> writes:
> 
> > On Tue, Nov 07, 2023 at 03:07:53PM -0800, Ankur Arora wrote:
> >> Rudimentary script to remove the straight-forward subset of
> >> cond_resched() and allies:
> >>
> >> 1)  if (need_resched())
> >> 	  cond_resched()
> >>
> >> 2)  expression*;
> >>     cond_resched();  /* or in the reverse order */
> >>
> >> 3)  if (expression)
> >> 	statement
> >>     cond_resched();  /* or in the reverse order */
> >>
> >> The last two patterns depend on the control flow level to ensure
> >> that the complex cond_resched() patterns (ex. conditioned ones)
> >> are left alone and we only pick up ones which are only minimally
> >> related the neighbouring code.
> >
> > This series looks to get rid of stall warnings for long in-kernel
> > preempt-enabled code paths, which is of course a very good thing.
> > But removing all of the cond_resched() calls can actually increase
> > scheduling latency compared to the current CONFIG_PREEMPT_NONE=y state,
> > correct?
> 
> Not necessarily.
> 
> If TIF_NEED_RESCHED_LAZY is set, then we let the current task finish
> before preempting. If that task runs for arbitrarily long (what Thomas
> calls the hog problem) -- currently we allow them to run for upto one
> extra tick (which might shorten/become a tunable.)

Agreed, and that is the easy case.  But getting rid of the cond_resched()
calls really can increase scheduling latency under this patchset compared
to status-quo mainline.

> If TIF_NEED_RESCHED is set, then it gets folded the same it does now
> and preemption happens at the next safe preemption point.
> 
> So, I guess the scheduling latency would always be bounded but how much
> latency a task would incur would be scheduler policy dependent.
> 
> This is early days, so the policy (or really the rest of it) isn't set
> in stone but having two levels of preemption -- immediate and
> deferred -- does seem to give the scheduler greater freedom of policy.

"Give the scheduler freedom!" is a wonderful slogan, but not necessarily
a useful one-size-fits-all design principle.  The scheduler does not
and cannot know everything, after all.

> Btw, are you concerned about the scheduling latencies in general or the
> scheduling latency of a particular set of tasks?

There are a lot of workloads out there with a lot of objective functions
and constraints, but it is safe to say that both will be important, as
will other things, depending on the workload.

But you knew that already, right?  ;-)

> > If so, it would be good to take a measured approach.  For example, it
> > is clear that a loop that does a cond_resched() every (say) ten jiffies
> > can remove that cond_resched() without penalty, at least in kernels built
> > with either CONFIG_NO_HZ_FULL=n or CONFIG_PREEMPT=y.  But this is not so
> > clear for a loop that does a cond_resched() every (say) ten microseconds.
> 
> True. Though both of those loops sound bad :).

Yes, but do they sound bad enough to be useful in the real world?  ;-)

> Yeah, and as we were discussing offlist, the question is the comparative
> density of preempt_dec_and_test() is true vs calls to cond_resched().
> 
> And if they are similar then we could replace cond_resched() quiescence
> reporting with ones in preempt_enable() (as you mention elsewhere in the
> thread.)

Here is hoping that something like that can help.

I am quite happy with the thought of reducing the number of cond_resched()
invocations, but not at the expense of the Linux kernel failing to do
its job.

							Thanx, Paul
[RFC PATCH 58/86] treewide: x86: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for say code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given it's NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well
be -- for most workloads the scheduler does not have an interminable
supply of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.
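
A hypothetical set-3 shape (names made up), just to show the intended
substitution:

	while (resource_busy(dev)) {	/* retry until the device settles */
		/* was: cond_resched(); */
		cond_resched_stall();
	}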

Most of the instances of cond_resched() here are from set-1 or set-2.
Remove them.

There's one set-3 case, where kvm_recalculate_apic_map() sees an
unexpected APIC ID; there we now use cond_resched_stall() to delay
the retry.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Thomas Gleixner <tglx@linutronix.de> 
Cc: Ingo Molnar <mingo@redhat.com> 
Cc: Borislav Petkov <bp@alien8.de> 
Cc: Dave Hansen <dave.hansen@linux.intel.com> 
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> 
Cc: "David S. Miller" <davem@davemloft.net> 
Cc: David Ahern <dsahern@kernel.org> 
Cc: x86@kernel.org 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/kernel/alternative.c   | 10 ----------
 arch/x86/kernel/cpu/sgx/encl.c  | 14 +++++++-------
 arch/x86/kernel/cpu/sgx/ioctl.c |  3 ---
 arch/x86/kernel/cpu/sgx/main.c  |  5 -----
 arch/x86/kernel/cpu/sgx/virt.c  |  4 ----
 arch/x86/kvm/lapic.c            |  6 +++++-
 arch/x86/kvm/mmu/mmu.c          |  2 +-
 arch/x86/kvm/svm/sev.c          |  5 +++--
 arch/x86/net/bpf_jit_comp.c     |  1 -
 arch/x86/net/bpf_jit_comp32.c   |  1 -
 arch/x86/xen/mmu_pv.c           |  1 -
 11 files changed, 16 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 73be3931e4f0..3d0b6a606852 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -2189,16 +2189,6 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	 */
 	atomic_set_release(&bp_desc.refs, 1);
 
-	/*
-	 * Function tracing can enable thousands of places that need to be
-	 * updated. This can take quite some time, and with full kernel debugging
-	 * enabled, this could cause the softlockup watchdog to trigger.
-	 * This function gets called every 256 entries added to be patched.
-	 * Call cond_resched() here to make sure that other tasks can get scheduled
-	 * while processing all the functions being patched.
-	 */
-	cond_resched();
-
 	/*
 	 * Corresponding read barrier in int3 notifier for making sure the
 	 * nr_entries and handler are correctly ordered wrt. patching.
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..05afb4e2f552 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -549,14 +549,15 @@ int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
 			break;
 		}
 
-		/* Reschedule on every XA_CHECK_SCHED iteration. */
+		/*
+		 * Drop the lock every XA_CHECK_SCHED iteration so the
+		 * scheduler can preempt if needed.
+		 */
 		if (!(++count % XA_CHECK_SCHED)) {
 			xas_pause(&xas);
 			xas_unlock(&xas);
 			mutex_unlock(&encl->lock);
 
-			cond_resched();
-
 			mutex_lock(&encl->lock);
 			xas_lock(&xas);
 		}
@@ -723,16 +724,15 @@ void sgx_encl_release(struct kref *ref)
 		}
 
 		kfree(entry);
+
 		/*
-		 * Invoke scheduler on every XA_CHECK_SCHED iteration
-		 * to prevent soft lockups.
+		 * Drop the lock every XA_CHECK_SCHED iteration so the
+		 * scheduler can preempt if needed.
 		 */
 		if (!(++count % XA_CHECK_SCHED)) {
 			xas_pause(&xas);
 			xas_unlock(&xas);
 
-			cond_resched();
-
 			xas_lock(&xas);
 		}
 	}
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 5d390df21440..2b899569bb60 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -439,9 +439,6 @@ static long sgx_ioc_enclave_add_pages(struct sgx_encl *encl, void __user *arg)
 			break;
 		}
 
-		if (need_resched())
-			cond_resched();
-
 		ret = sgx_encl_add_page(encl, add_arg.src + c, add_arg.offset + c,
 					&secinfo, add_arg.flags);
 		if (ret)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 166692f2d501..f8bd01e56b72 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -98,8 +98,6 @@ static unsigned long __sgx_sanitize_pages(struct list_head *dirty_page_list)
 			list_move_tail(&page->list, &dirty);
 			left_dirty++;
 		}
-
-		cond_resched();
 	}
 
 	list_splice(&dirty, dirty_page_list);
@@ -413,8 +411,6 @@ static int ksgxd(void *p)
 
 		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
 			sgx_reclaim_pages();
-
-		cond_resched();
 	}
 
 	return 0;
@@ -581,7 +577,6 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 		}
 
 		sgx_reclaim_pages();
-		cond_resched();
 	}
 
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
index 7aaa3652e31d..6ce0983c6249 100644
--- a/arch/x86/kernel/cpu/sgx/virt.c
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -175,7 +175,6 @@ static long sgx_vepc_remove_all(struct sgx_vepc *vepc)
 				return -EBUSY;
 			}
 		}
-		cond_resched();
 	}
 
 	/*
@@ -204,7 +203,6 @@ static int sgx_vepc_release(struct inode *inode, struct file *file)
 			continue;
 
 		xa_erase(&vepc->page_array, index);
-		cond_resched();
 	}
 
 	/*
@@ -223,7 +221,6 @@ static int sgx_vepc_release(struct inode *inode, struct file *file)
 			list_add_tail(&epc_page->list, &secs_pages);
 
 		xa_erase(&vepc->page_array, index);
-		cond_resched();
 	}
 
 	/*
@@ -245,7 +242,6 @@ static int sgx_vepc_release(struct inode *inode, struct file *file)
 
 		if (sgx_vepc_free_page(epc_page))
 			list_add_tail(&epc_page->list, &secs_pages);
-		cond_resched();
 	}
 
 	if (!list_empty(&secs_pages))
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 3e977dbbf993..dd87a8214c80 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -435,7 +435,11 @@ void kvm_recalculate_apic_map(struct kvm *kvm)
 			kvfree(new);
 			new = NULL;
 			if (r == -E2BIG) {
-				cond_resched();
+				/*
+				 * A vCPU was just added or enabled its APIC.
+				 * Give things time to settle before retrying.
+				 */
+				cond_resched_stall();
 				goto retry;
 			}
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f7901cb4d2fa..58efaca73dd4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6431,8 +6431,8 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
 	}
 
 	if (need_topup_split_caches_or_resched(kvm)) {
+		/* The preemption point in write_unlock() reschedules if needed. */
 		write_unlock(&kvm->mmu_lock);
-		cond_resched();
 		/*
 		 * If the topup succeeds, return -EAGAIN to indicate that the
 		 * rmap iterator should be restarted because the MMU lock was
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 4900c078045a..a98f29692a29 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -476,7 +476,6 @@ static void sev_clflush_pages(struct page *pages[], unsigned long npages)
 		page_virtual = kmap_local_page(pages[i]);
 		clflush_cache_range(page_virtual, PAGE_SIZE);
 		kunmap_local(page_virtual);
-		cond_resched();
 	}
 }
 
@@ -2157,12 +2156,14 @@ void sev_vm_destroy(struct kvm *kvm)
 	/*
 	 * if userspace was terminated before unregistering the memory regions
 	 * then lets unpin all the registered memory.
+	 *
+	 * This might take a while but we are preemptible so the scheduler can
+	 * always preempt if needed.
 	 */
 	if (!list_empty(head)) {
 		list_for_each_safe(pos, q, head) {
 			__unregister_enc_region_locked(kvm,
 				list_entry(pos, struct enc_region, list));
-			cond_resched();
 		}
 	}
 
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index a5930042139d..bae5b39810bb 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -2819,7 +2819,6 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 			prog->aux->extable = (void *) image + roundup(proglen, align);
 		}
 		oldproglen = proglen;
-		cond_resched();
 	}
 
 	if (bpf_jit_enable > 1)
diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index 429a89c5468b..03566f031b23 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -2594,7 +2594,6 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 			}
 		}
 		oldproglen = proglen;
-		cond_resched();
 	}
 
 	if (bpf_jit_enable > 1)
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index b6830554ff69..a046cde342b1 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2510,7 +2510,6 @@ int xen_remap_pfn(struct vm_area_struct *vma, unsigned long addr,
 		addr += range;
 		if (err_ptr)
 			err_ptr += batch;
-		cond_resched();
 	}
 out:
 
-- 
2.31.1
[RFC PATCH 59/86] treewide: rcu: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
All the cond_resched() calls in the RCU interfaces here are there to
drive preemption once a potentially quiescent state has been reported,
or to exit the grace period. With PREEMPTION=y that should happen
implicitly.

So we can remove these.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: "Paul E. McKenney" <paulmck@kernel.org> 
Cc: Frederic Weisbecker <frederic@kernel.org> 
Cc: Ingo Molnar <mingo@redhat.com> 
Cc: Peter Zijlstra <peterz@infradead.org> 
Cc: Juri Lelli <juri.lelli@redhat.com> 
Cc: Vincent Guittot <vincent.guittot@linaro.org> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/rcupdate.h | 6 ++----
 include/linux/sched.h    | 7 ++++++-
 kernel/hung_task.c       | 6 +++---
 kernel/rcu/tasks.h       | 5 +----
 4 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 7246ee602b0b..58f8c7faaa52 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -238,14 +238,12 @@ static inline bool rcu_trace_implies_rcu_gp(void) { return true; }
 /**
  * cond_resched_tasks_rcu_qs - Report potential quiescent states to RCU
  *
- * This macro resembles cond_resched(), except that it is defined to
- * report potential quiescent states to RCU-tasks even if the cond_resched()
- * machinery were to be shut off, as some advocate for PREEMPTION kernels.
+ * This macro resembles cond_resched(), in that it reports potential
+ * quiescent states to RCU-tasks.
  */
 #define cond_resched_tasks_rcu_qs() \
 do { \
 	rcu_tasks_qs(current, false); \
-	cond_resched(); \
 } while (0)
 
 /*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 199f8f7211f2..bae6eed534dd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2145,7 +2145,12 @@ static inline void cond_resched_rcu(void)
 {
 #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
 	rcu_read_unlock();
-	cond_resched();
+
+	/*
+	 * Might reschedule here as we exit the RCU read-side
+	 * critical section.
+	 */
+
 	rcu_read_lock();
 #endif
 }
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 9a24574988d2..4bdfad08a2e8 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -153,8 +153,8 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
  * To avoid extending the RCU grace period for an unbounded amount of time,
  * periodically exit the critical section and enter a new one.
  *
- * For preemptible RCU it is sufficient to call rcu_read_unlock in order
- * to exit the grace period. For classic RCU, a reschedule is required.
+ * Under a preemptible kernel, or with preemptible RCU, it is sufficient to
+ * call rcu_read_unlock in order to exit the grace period.
  */
 static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
 {
@@ -163,7 +163,7 @@ static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
 	get_task_struct(g);
 	get_task_struct(t);
 	rcu_read_unlock();
-	cond_resched();
+
 	rcu_read_lock();
 	can_cont = pid_alive(g) && pid_alive(t);
 	put_task_struct(t);
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 8d65f7d576a3..fa1d9aa31b36 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -541,7 +541,6 @@ static void rcu_tasks_invoke_cbs(struct rcu_tasks *rtp, struct rcu_tasks_percpu
 		local_bh_disable();
 		rhp->func(rhp);
 		local_bh_enable();
-		cond_resched();
 	}
 	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
 	rcu_segcblist_add_len(&rtpcp->cblist, -len);
@@ -974,10 +973,8 @@ static void check_all_holdout_tasks(struct list_head *hop,
 {
 	struct task_struct *t, *t1;
 
-	list_for_each_entry_safe(t, t1, hop, rcu_tasks_holdout_list) {
+	list_for_each_entry_safe(t, t1, hop, rcu_tasks_holdout_list)
 		check_holdout_task(t, needreport, firstreport);
-		cond_resched();
-	}
 }
 
 /* Finish off the Tasks-RCU grace period. */
-- 
2.31.1
Re: [RFC PATCH 59/86] treewide: rcu: remove cond_resched()
Posted by Paul E. McKenney 2 years ago
On Tue, Nov 07, 2023 at 03:07:55PM -0800, Ankur Arora wrote:
> All the cond_resched() calls in the RCU interfaces here are to
> drive preemption once it has reported a potentially quiescent
> state, or to exit the grace period. With PREEMPTION=y that should
> happen implicitly.
> 
> So we can remove these.
> 
> [1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/
> 
> Cc: "Paul E. McKenney" <paulmck@kernel.org> 
> Cc: Frederic Weisbecker <frederic@kernel.org> 
> Cc: Ingo Molnar <mingo@redhat.com> 
> Cc: Peter Zijlstra <peterz@infradead.org> 
> Cc: Juri Lelli <juri.lelli@redhat.com> 
> Cc: Vincent Guittot <vincent.guittot@linaro.org> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  include/linux/rcupdate.h | 6 ++----
>  include/linux/sched.h    | 7 ++++++-
>  kernel/hung_task.c       | 6 +++---
>  kernel/rcu/tasks.h       | 5 +----
>  4 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 7246ee602b0b..58f8c7faaa52 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -238,14 +238,12 @@ static inline bool rcu_trace_implies_rcu_gp(void) { return true; }
>  /**
>   * cond_resched_tasks_rcu_qs - Report potential quiescent states to RCU
>   *
> - * This macro resembles cond_resched(), except that it is defined to
> - * report potential quiescent states to RCU-tasks even if the cond_resched()
> - * machinery were to be shut off, as some advocate for PREEMPTION kernels.
> + * This macro resembles cond_resched(), in that it reports potential
> + * quiescent states to RCU-tasks.
>   */
>  #define cond_resched_tasks_rcu_qs() \
>  do { \
>  	rcu_tasks_qs(current, false); \
> -	cond_resched(); \

I am a bit nervous about dropping the cond_resched() in a few cases,
for example, the call from rcu_tasks_trace_pregp_step() only momentarily
enables interrupts.  This should be OK given a scheduling-clock interrupt,
except that nohz_full CPUs don't necessarily have these.  At least not
unless RCU happens to be in a grace period at the time.

>  } while (0)
>  
>  /*
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 199f8f7211f2..bae6eed534dd 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2145,7 +2145,12 @@ static inline void cond_resched_rcu(void)
>  {
>  #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
>  	rcu_read_unlock();
> -	cond_resched();
> +
> +	/*
> +	 * Might reschedule here as we exit the RCU read-side
> +	 * critical section.
> +	 */
> +
>  	rcu_read_lock();

And here I am wondering if some of my nervousness about increased
grace-period latency due to removing cond_resched() might be addressed
by making preempt_enable() take over the help-RCU functionality currently
being provided by cond_resched()...
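
Sketch only of that idea -- not an existing interface; rcu_all_qs() is
what cond_resched() uses today on !PREEMPT_RCU kernels:

	#define preempt_enable_and_qs()				\
	do {							\
		if (!irqs_disabled() && preempt_count() == 1)	\
			rcu_all_qs();				\
		preempt_enable();				\
	} while (0)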

>  #endif
>  }
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index 9a24574988d2..4bdfad08a2e8 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -153,8 +153,8 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
>   * To avoid extending the RCU grace period for an unbounded amount of time,
>   * periodically exit the critical section and enter a new one.
>   *
> - * For preemptible RCU it is sufficient to call rcu_read_unlock in order
> - * to exit the grace period. For classic RCU, a reschedule is required.
> + * Under a preemptive kernel, or with preemptible RCU, it is sufficient to
> + * call rcu_read_unlock in order to exit the grace period.
>   */
>  static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
>  {
> @@ -163,7 +163,7 @@ static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
>  	get_task_struct(g);
>  	get_task_struct(t);
>  	rcu_read_unlock();
> -	cond_resched();
> +
>  	rcu_read_lock();
>  	can_cont = pid_alive(g) && pid_alive(t);
>  	put_task_struct(t);
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index 8d65f7d576a3..fa1d9aa31b36 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -541,7 +541,6 @@ static void rcu_tasks_invoke_cbs(struct rcu_tasks *rtp, struct rcu_tasks_percpu
>  		local_bh_disable();
>  		rhp->func(rhp);
>  		local_bh_enable();
> -		cond_resched();

...and by local_bh_enable().

						Thanx, Paul

>  	}
>  	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
>  	rcu_segcblist_add_len(&rtpcp->cblist, -len);
> @@ -974,10 +973,8 @@ static void check_all_holdout_tasks(struct list_head *hop,
>  {
>  	struct task_struct *t, *t1;
>  
> -	list_for_each_entry_safe(t, t1, hop, rcu_tasks_holdout_list) {
> +	list_for_each_entry_safe(t, t1, hop, rcu_tasks_holdout_list)
>  		check_holdout_task(t, needreport, firstreport);
> -		cond_resched();
> -	}
>  }
>  
>  /* Finish off the Tasks-RCU grace period. */
> -- 
> 2.31.1
>
[RFC PATCH 60/86] treewide: torture: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
Some cases are changed to cond_resched_stall() to avoid changing
the behaviour of the test too drastically.

Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/rcu/rcuscale.c   | 2 --
 kernel/rcu/rcutorture.c | 8 ++++----
 kernel/scftorture.c     | 1 -
 kernel/torture.c        | 1 -
 4 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
index ffdb30495e3c..737620bbec83 100644
--- a/kernel/rcu/rcuscale.c
+++ b/kernel/rcu/rcuscale.c
@@ -672,8 +672,6 @@ kfree_scale_thread(void *arg)
 			else
 				kfree_rcu(alloc_ptr, rh);
 		}
-
-		cond_resched();
 	} while (!torture_must_stop() && ++loop < kfree_loops);
 
 	if (atomic_inc_return(&n_kfree_scale_thread_ended) >= kfree_nrealthreads) {
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index ade42d6a9d9b..158d58710b51 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -81,7 +81,7 @@ torture_param(int, fqs_stutter, 3, "Wait time between fqs bursts (s)");
 torture_param(int, fwd_progress, 1, "Number of grace-period forward progress tasks (0 to disable)");
 torture_param(int, fwd_progress_div, 4, "Fraction of CPU stall to wait");
 torture_param(int, fwd_progress_holdoff, 60, "Time between forward-progress tests (s)");
-torture_param(bool, fwd_progress_need_resched, 1, "Hide cond_resched() behind need_resched()");
+torture_param(bool, fwd_progress_need_resched, 1, "Hide cond_resched_stall() behind need_resched()");
 torture_param(bool, gp_cond, false, "Use conditional/async GP wait primitives");
 torture_param(bool, gp_cond_exp, false, "Use conditional/async expedited GP wait primitives");
 torture_param(bool, gp_cond_full, false, "Use conditional/async full-state GP wait primitives");
@@ -2611,7 +2611,7 @@ static void rcu_torture_fwd_prog_cond_resched(unsigned long iter)
 		return;
 	}
 	// No userspace emulation: CB invocation throttles call_rcu()
-	cond_resched();
+	cond_resched_stall();
 }
 
 /*
@@ -2691,7 +2691,7 @@ static void rcu_torture_fwd_prog_nr(struct rcu_fwd *rfp,
 		udelay(10);
 		cur_ops->readunlock(idx);
 		if (!fwd_progress_need_resched || need_resched())
-			cond_resched();
+			cond_resched_stall();
 	}
 	(*tested_tries)++;
 	if (!time_before(jiffies, stopat) &&
@@ -3232,7 +3232,7 @@ static int rcu_torture_read_exit(void *unused)
 				errexit = true;
 				break;
 			}
-			cond_resched();
+			cond_resched_stall();
 			kthread_stop(tsp);
 			n_read_exits++;
 		}
diff --git a/kernel/scftorture.c b/kernel/scftorture.c
index 59032aaccd18..24192fe01125 100644
--- a/kernel/scftorture.c
+++ b/kernel/scftorture.c
@@ -487,7 +487,6 @@ static int scftorture_invoker(void *arg)
 			set_cpus_allowed_ptr(current, cpumask_of(cpu));
 			was_offline = false;
 		}
-		cond_resched();
 		stutter_wait("scftorture_invoker");
 	} while (!torture_must_stop());
 
diff --git a/kernel/torture.c b/kernel/torture.c
index b28b05bbef02..0c0224c76275 100644
--- a/kernel/torture.c
+++ b/kernel/torture.c
@@ -747,7 +747,6 @@ bool stutter_wait(const char *title)
 			while (READ_ONCE(stutter_pause_test)) {
 				if (!(i++ & 0xffff))
 					torture_hrtimeout_us(10, 0, NULL);
-				cond_resched();
 			}
 		} else {
 			torture_hrtimeout_jiffies(round_jiffies_relative(HZ), NULL);
-- 
2.31.1
Re: [RFC PATCH 60/86] treewide: torture: remove cond_resched()
Posted by Paul E. McKenney 2 years ago
On Tue, Nov 07, 2023 at 03:07:56PM -0800, Ankur Arora wrote:
> Some cases changed to cond_resched_stall() to avoid changing
> the behaviour of the test too drastically.
> 
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

Given lazy preemption, I am OK with dropping the cond_resched()
invocations from the various torture tests.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

> ---
>  kernel/rcu/rcuscale.c   | 2 --
>  kernel/rcu/rcutorture.c | 8 ++++----
>  kernel/scftorture.c     | 1 -
>  kernel/torture.c        | 1 -
>  4 files changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
> index ffdb30495e3c..737620bbec83 100644
> --- a/kernel/rcu/rcuscale.c
> +++ b/kernel/rcu/rcuscale.c
> @@ -672,8 +672,6 @@ kfree_scale_thread(void *arg)
>  			else
>  				kfree_rcu(alloc_ptr, rh);
>  		}
> -
> -		cond_resched();
>  	} while (!torture_must_stop() && ++loop < kfree_loops);
>  
>  	if (atomic_inc_return(&n_kfree_scale_thread_ended) >= kfree_nrealthreads) {
> diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> index ade42d6a9d9b..158d58710b51 100644
> --- a/kernel/rcu/rcutorture.c
> +++ b/kernel/rcu/rcutorture.c
> @@ -81,7 +81,7 @@ torture_param(int, fqs_stutter, 3, "Wait time between fqs bursts (s)");
>  torture_param(int, fwd_progress, 1, "Number of grace-period forward progress tasks (0 to disable)");
>  torture_param(int, fwd_progress_div, 4, "Fraction of CPU stall to wait");
>  torture_param(int, fwd_progress_holdoff, 60, "Time between forward-progress tests (s)");
> -torture_param(bool, fwd_progress_need_resched, 1, "Hide cond_resched() behind need_resched()");
> +torture_param(bool, fwd_progress_need_resched, 1, "Hide cond_resched_stall() behind need_resched()");
>  torture_param(bool, gp_cond, false, "Use conditional/async GP wait primitives");
>  torture_param(bool, gp_cond_exp, false, "Use conditional/async expedited GP wait primitives");
>  torture_param(bool, gp_cond_full, false, "Use conditional/async full-state GP wait primitives");
> @@ -2611,7 +2611,7 @@ static void rcu_torture_fwd_prog_cond_resched(unsigned long iter)
>  		return;
>  	}
>  	// No userspace emulation: CB invocation throttles call_rcu()
> -	cond_resched();
> +	cond_resched_stall();
>  }
>  
>  /*
> @@ -2691,7 +2691,7 @@ static void rcu_torture_fwd_prog_nr(struct rcu_fwd *rfp,
>  		udelay(10);
>  		cur_ops->readunlock(idx);
>  		if (!fwd_progress_need_resched || need_resched())
> -			cond_resched();
> +			cond_resched_stall();
>  	}
>  	(*tested_tries)++;
>  	if (!time_before(jiffies, stopat) &&
> @@ -3232,7 +3232,7 @@ static int rcu_torture_read_exit(void *unused)
>  				errexit = true;
>  				break;
>  			}
> -			cond_resched();
> +			cond_resched_stall();
>  			kthread_stop(tsp);
>  			n_read_exits++;
>  		}
> diff --git a/kernel/scftorture.c b/kernel/scftorture.c
> index 59032aaccd18..24192fe01125 100644
> --- a/kernel/scftorture.c
> +++ b/kernel/scftorture.c
> @@ -487,7 +487,6 @@ static int scftorture_invoker(void *arg)
>  			set_cpus_allowed_ptr(current, cpumask_of(cpu));
>  			was_offline = false;
>  		}
> -		cond_resched();
>  		stutter_wait("scftorture_invoker");
>  	} while (!torture_must_stop());
>  
> diff --git a/kernel/torture.c b/kernel/torture.c
> index b28b05bbef02..0c0224c76275 100644
> --- a/kernel/torture.c
> +++ b/kernel/torture.c
> @@ -747,7 +747,6 @@ bool stutter_wait(const char *title)
>  			while (READ_ONCE(stutter_pause_test)) {
>  				if (!(i++ & 0xffff))
>  					torture_hrtimeout_us(10, 0, NULL);
> -				cond_resched();
>  			}
>  		} else {
>  			torture_hrtimeout_jiffies(round_jiffies_relative(HZ), NULL);
> -- 
> 2.31.1
>
[RFC PATCH 61/86] treewide: bpf: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for say code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All the uses of cond_resched() here are from set-1, so we can trivially
remove them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/bpf/arraymap.c | 3 ---
 kernel/bpf/bpf_iter.c | 7 +------
 kernel/bpf/btf.c      | 9 ---------
 kernel/bpf/cpumap.c   | 2 --
 kernel/bpf/hashtab.c  | 7 -------
 kernel/bpf/syscall.c  | 3 ---
 kernel/bpf/verifier.c | 5 -----
 7 files changed, 1 insertion(+), 35 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 2058e89b5ddd..cb0d626038b4 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -25,7 +25,6 @@ static void bpf_array_free_percpu(struct bpf_array *array)
 
 	for (i = 0; i < array->map.max_entries; i++) {
 		free_percpu(array->pptrs[i]);
-		cond_resched();
 	}
 }
 
@@ -42,7 +41,6 @@ static int bpf_array_alloc_percpu(struct bpf_array *array)
 			return -ENOMEM;
 		}
 		array->pptrs[i] = ptr;
-		cond_resched();
 	}
 
 	return 0;
@@ -423,7 +421,6 @@ static void array_map_free(struct bpf_map *map)
 
 				for_each_possible_cpu(cpu) {
 					bpf_obj_free_fields(map->record, per_cpu_ptr(pptr, cpu));
-					cond_resched();
 				}
 			}
 		} else {
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index 96856f130cbf..dfb24f76ccf7 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -73,7 +73,7 @@ static inline bool bpf_iter_target_support_resched(const struct bpf_iter_target_
 	return tinfo->reg_info->feature & BPF_ITER_RESCHED;
 }
 
-static bool bpf_iter_support_resched(struct seq_file *seq)
+static bool __maybe_unused bpf_iter_support_resched(struct seq_file *seq)
 {
 	struct bpf_iter_priv_data *iter_priv;
 
@@ -97,7 +97,6 @@ static ssize_t bpf_seq_read(struct file *file, char __user *buf, size_t size,
 	struct seq_file *seq = file->private_data;
 	size_t n, offs, copied = 0;
 	int err = 0, num_objs = 0;
-	bool can_resched;
 	void *p;
 
 	mutex_lock(&seq->lock);
@@ -150,7 +149,6 @@ static ssize_t bpf_seq_read(struct file *file, char __user *buf, size_t size,
 		goto done;
 	}
 
-	can_resched = bpf_iter_support_resched(seq);
 	while (1) {
 		loff_t pos = seq->index;
 
@@ -196,9 +194,6 @@ static ssize_t bpf_seq_read(struct file *file, char __user *buf, size_t size,
 			}
 			break;
 		}
-
-		if (can_resched)
-			cond_resched();
 	}
 stop:
 	offs = seq->count;
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 8090d7fb11ef..fe560f80e230 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -5361,8 +5361,6 @@ btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)
 		if (!__btf_type_is_struct(t))
 			continue;
 
-		cond_resched();
-
 		for_each_member(j, t, member) {
 			if (btf_id_set_contains(&aof.set, member->type))
 				goto parse;
@@ -5427,8 +5425,6 @@ static int btf_check_type_tags(struct btf_verifier_env *env,
 		if (!btf_type_is_modifier(t))
 			continue;
 
-		cond_resched();
-
 		in_tags = btf_type_is_type_tag(t);
 		while (btf_type_is_modifier(t)) {
 			if (!chain_limit--) {
@@ -8296,11 +8292,6 @@ bpf_core_add_cands(struct bpf_cand_cache *cands, const struct btf *targ_btf,
 		if (!targ_name)
 			continue;
 
-		/* the resched point is before strncmp to make sure that search
-		 * for non-existing name will have a chance to schedule().
-		 */
-		cond_resched();
-
 		if (strncmp(cands->name, targ_name, cands->name_len) != 0)
 			continue;
 
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index e42a1bdb7f53..0aed2a6ef262 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -290,8 +290,6 @@ static int cpu_map_kthread_run(void *data)
 			} else {
 				__set_current_state(TASK_RUNNING);
 			}
-		} else {
-			sched = cond_resched();
 		}
 
 		/*
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index a8c7e1c5abfa..17ed14d2dd44 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -142,7 +142,6 @@ static void htab_init_buckets(struct bpf_htab *htab)
 		raw_spin_lock_init(&htab->buckets[i].raw_lock);
 		lockdep_set_class(&htab->buckets[i].raw_lock,
 					  &htab->lockdep_key);
-		cond_resched();
 	}
 }
 
@@ -232,7 +231,6 @@ static void htab_free_prealloced_timers(struct bpf_htab *htab)
 
 		elem = get_htab_elem(htab, i);
 		bpf_obj_free_timer(htab->map.record, elem->key + round_up(htab->map.key_size, 8));
-		cond_resched();
 	}
 }
 
@@ -255,13 +253,10 @@ static void htab_free_prealloced_fields(struct bpf_htab *htab)
 
 			for_each_possible_cpu(cpu) {
 				bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
-				cond_resched();
 			}
 		} else {
 			bpf_obj_free_fields(htab->map.record, elem->key + round_up(htab->map.key_size, 8));
-			cond_resched();
 		}
-		cond_resched();
 	}
 }
 
@@ -278,7 +273,6 @@ static void htab_free_elems(struct bpf_htab *htab)
 		pptr = htab_elem_get_ptr(get_htab_elem(htab, i),
 					 htab->map.key_size);
 		free_percpu(pptr);
-		cond_resched();
 	}
 free_elems:
 	bpf_map_area_free(htab->elems);
@@ -337,7 +331,6 @@ static int prealloc_init(struct bpf_htab *htab)
 			goto free_elems;
 		htab_elem_set_ptr(get_htab_elem(htab, i), htab->map.key_size,
 				  pptr);
-		cond_resched();
 	}
 
 skip_percpu_elems:
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index d77b2f8b9364..8762c3d678be 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1695,7 +1695,6 @@ int generic_map_delete_batch(struct bpf_map *map,
 		bpf_enable_instrumentation();
 		if (err)
 			break;
-		cond_resched();
 	}
 	if (copy_to_user(&uattr->batch.count, &cp, sizeof(cp)))
 		err = -EFAULT;
@@ -1752,7 +1751,6 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file,
 
 		if (err)
 			break;
-		cond_resched();
 	}
 
 	if (copy_to_user(&uattr->batch.count, &cp, sizeof(cp)))
@@ -1849,7 +1847,6 @@ int generic_map_lookup_batch(struct bpf_map *map,
 		swap(prev_key, key);
 		retry = MAP_LOOKUP_RETRIES;
 		cp++;
-		cond_resched();
 	}
 
 	if (err == -EFAULT)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 873ade146f3d..25e6f318c561 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -16489,9 +16489,6 @@ static int do_check(struct bpf_verifier_env *env)
 		if (signal_pending(current))
 			return -EAGAIN;
 
-		if (need_resched())
-			cond_resched();
-
 		if (env->log.level & BPF_LOG_LEVEL2 && do_print_state) {
 			verbose(env, "\nfrom %d to %d%s:",
 				env->prev_insn_idx, env->insn_idx,
@@ -18017,7 +18014,6 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 			err = -ENOTSUPP;
 			goto out_free;
 		}
-		cond_resched();
 	}
 
 	/* at this point all bpf functions were successfully JITed
@@ -18061,7 +18057,6 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 			err = -ENOTSUPP;
 			goto out_free;
 		}
-		cond_resched();
 	}
 
 	/* finally lock prog and jit images for all functions and
-- 
2.31.1
[RFC PATCH 62/86] treewide: trace: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for say code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All the cond_resched() calls here are from set-1. Remove them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/trace/ftrace.c                |  4 ----
 kernel/trace/ring_buffer.c           |  4 ----
 kernel/trace/ring_buffer_benchmark.c | 13 -------------
 kernel/trace/trace.c                 | 11 -----------
 kernel/trace/trace_events.c          |  1 -
 kernel/trace/trace_selftest.c        |  9 ---------
 6 files changed, 42 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 8de8bec5f366..096ebb608610 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -2723,7 +2723,6 @@ void __weak ftrace_replace_code(int mod_flags)
 	struct dyn_ftrace *rec;
 	struct ftrace_page *pg;
 	bool enable = mod_flags & FTRACE_MODIFY_ENABLE_FL;
-	int schedulable = mod_flags & FTRACE_MODIFY_MAY_SLEEP_FL;
 	int failed;
 
 	if (unlikely(ftrace_disabled))
@@ -2740,8 +2739,6 @@ void __weak ftrace_replace_code(int mod_flags)
 			/* Stop processing */
 			return;
 		}
-		if (schedulable)
-			cond_resched();
 	} while_for_each_ftrace_rec();
 }
 
@@ -4363,7 +4360,6 @@ match_records(struct ftrace_hash *hash, char *func, int len, char *mod)
 			}
 			found = 1;
 		}
-		cond_resched();
 	} while_for_each_ftrace_rec();
  out_unlock:
 	mutex_unlock(&ftrace_lock);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 515cafdb18d9..5c5eb6a8c7db 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1996,8 +1996,6 @@ rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned long nr_pages)
 	tmp_iter_page = first_page;
 
 	do {
-		cond_resched();
-
 		to_remove_page = tmp_iter_page;
 		rb_inc_page(&tmp_iter_page);
 
@@ -2206,8 +2204,6 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
 				err = -ENOMEM;
 				goto out_err;
 			}
-
-			cond_resched();
 		}
 
 		cpus_read_lock();
diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index aef34673d79d..8d1c23d135cb 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -267,19 +267,6 @@ static void ring_buffer_producer(void)
 		if (consumer && !(cnt % wakeup_interval))
 			wake_up_process(consumer);
 
-#ifndef CONFIG_PREEMPTION
-		/*
-		 * If we are a non preempt kernel, the 10 seconds run will
-		 * stop everything while it runs. Instead, we will call
-		 * cond_resched and also add any time that was lost by a
-		 * reschedule.
-		 *
-		 * Do a cond resched at the same frequency we would wake up
-		 * the reader.
-		 */
-		if (cnt % wakeup_interval)
-			cond_resched();
-#endif
 	} while (ktime_before(end_time, timeout) && !break_test());
 	trace_printk("End ring buffer hammer\n");
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 0776dba32c2d..1efb69423818 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2052,13 +2052,6 @@ static int do_run_tracer_selftest(struct tracer *type)
 {
 	int ret;
 
-	/*
-	 * Tests can take a long time, especially if they are run one after the
-	 * other, as does happen during bootup when all the tracers are
-	 * registered. This could cause the soft lockup watchdog to trigger.
-	 */
-	cond_resched();
-
 	tracing_selftest_running = true;
 	ret = run_tracer_selftest(type);
 	tracing_selftest_running = false;
@@ -2083,10 +2076,6 @@ static __init int init_trace_selftests(void)
 
 	tracing_selftest_running = true;
 	list_for_each_entry_safe(p, n, &postponed_selftests, list) {
-		/* This loop can take minutes when sanitizers are enabled, so
-		 * lets make sure we allow RCU processing.
-		 */
-		cond_resched();
 		ret = run_tracer_selftest(p->type);
 		/* If the test fails, then warn and remove from available_tracers */
 		if (ret < 0) {
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index f49d6ddb6342..91951d038ba4 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2770,7 +2770,6 @@ void trace_event_eval_update(struct trace_eval_map **map, int len)
 				update_event_fields(call, map[i]);
 			}
 		}
-		cond_resched();
 	}
 	up_write(&trace_event_sem);
 }
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 529590499b1f..07cfad8ce16f 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -848,11 +848,6 @@ trace_selftest_startup_function_graph(struct tracer *trace,
 	}
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
-	/*
-	 * These tests can take some time to run. Make sure on non PREEMPT
-	 * kernels, we do not trigger the softlockup detector.
-	 */
-	cond_resched();
 
 	tracing_reset_online_cpus(&tr->array_buffer);
 	set_graph_array(tr);
@@ -875,8 +870,6 @@ trace_selftest_startup_function_graph(struct tracer *trace,
 	if (ret)
 		goto out;
 
-	cond_resched();
-
 	ret = register_ftrace_graph(&fgraph_ops);
 	if (ret) {
 		warn_failed_init_tracer(trace, ret);
@@ -899,8 +892,6 @@ trace_selftest_startup_function_graph(struct tracer *trace,
 	if (ret)
 		goto out;
 
-	cond_resched();
-
 	tracing_start();
 
 	if (!ret && !count) {
-- 
2.31.1
[RFC PATCH 63/86] treewide: futex: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for say code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.
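
A minimal sketch of the intended semantics, going by the conversions
elsewhere in the series (for instance the cgroup rstat flush path, which
previously open coded "if (!cond_resched()) cpu_relax()"); the real
helper is introduced earlier in the series and may well differ:

	/*
	 * Sketch only: try to reschedule, and if nothing was switched to
	 * (or the preemption model does not allow it here), at least relax
	 * the CPU so a retry loop does not spin at full tilt.
	 */
	static inline void cond_resched_stall(void)
	{
		if (!cond_resched())
			cpu_relax();
	}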

Most cases here are from set-3. Replace with cond_resched_stall().
There were a few cases (__fixup_pi_state_owner() and futex_requeue())
where we had given up a spinlock or mutex, and so a resched, if one
was needed, would have happened already.

Replace with cpu_relax() in one case, with nothing in the other.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Thomas Gleixner <tglx@linutronix.de> 
Cc: Ingo Molnar <mingo@redhat.com> 
Cc: Peter Zijlstra <peterz@infradead.org> 
Cc: Darren Hart <dvhart@infradead.org> 
Cc: Davidlohr Bueso <dave@stgolabs.net> 
Cc: "André Almeida" <andrealmeid@igalia.com> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/futex/core.c     | 6 +-----
 kernel/futex/pi.c       | 6 +++---
 kernel/futex/requeue.c  | 1 -
 kernel/futex/waitwake.c | 2 +-
 4 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index f10587d1d481..4821931fb19d 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -724,7 +724,7 @@ static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr,
 			goto retry;
 
 		case -EAGAIN:
-			cond_resched();
+			cond_resched_stall();
 			goto retry;
 
 		default:
@@ -822,8 +822,6 @@ static void exit_robust_list(struct task_struct *curr)
 		 */
 		if (!--limit)
 			break;
-
-		cond_resched();
 	}
 
 	if (pending) {
@@ -922,8 +920,6 @@ static void compat_exit_robust_list(struct task_struct *curr)
 		 */
 		if (!--limit)
 			break;
-
-		cond_resched();
 	}
 	if (pending) {
 		void __user *uaddr = futex_uaddr(pending, futex_offset);
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index ce2889f12375..e3f6ca4cd875 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -809,7 +809,7 @@ static int __fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
 		break;
 
 	case -EAGAIN:
-		cond_resched();
+		cpu_relax();
 		err = 0;
 		break;
 
@@ -981,7 +981,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 			 * this task might loop forever, aka. live lock.
 			 */
 			wait_for_owner_exiting(ret, exiting);
-			cond_resched();
+			cond_resched_stall();
 			goto retry;
 		default:
 			goto out_unlock_put_key;
@@ -1219,7 +1219,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 	return ret;
 
 pi_retry:
-	cond_resched();
+	cond_resched_stall();
 	goto retry;
 
 pi_faulted:
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index cba8b1a6a4cc..9f916162ef6e 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -560,7 +560,6 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 			 * this task might loop forever, aka. live lock.
 			 */
 			wait_for_owner_exiting(ret, exiting);
-			cond_resched();
 			goto retry;
 		default:
 			goto out_unlock;
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index ba01b9408203..801b1ec3625a 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -277,7 +277,7 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 				return ret;
 		}
 
-		cond_resched();
+		cond_resched_stall();
 		if (!(flags & FLAGS_SHARED))
 			goto retry_private;
 		goto retry;
-- 
2.31.1

[RFC PATCH 64/86] treewide: printk: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
The printk code goes to great lengths to ensure that there are no
scheduling stalls which would cause softlockup/RCU splats and make
things worse.

With PREEMPT_COUNT=y and PREEMPTION=y, this should be a non-issue as
the scheduler can determine when this logic can be preempted.

So, remove cond_resched() and related code.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> 
Cc: Petr Mladek <pmladek@suse.com> 
Cc: Steven Rostedt <rostedt@goodmis.org> 
Cc: John Ogness <john.ogness@linutronix.de> 
Cc: Sergey Senozhatsky <senozhatsky@chromium.org> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/console.h |  2 +-
 kernel/printk/printk.c  | 65 +++++++++--------------------------------
 2 files changed, 15 insertions(+), 52 deletions(-)

diff --git a/include/linux/console.h b/include/linux/console.h
index 7de11c763eb3..db418dab5674 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -347,7 +347,7 @@ extern int unregister_console(struct console *);
 extern void console_lock(void);
 extern int console_trylock(void);
 extern void console_unlock(void);
-extern void console_conditional_schedule(void);
+static inline void console_conditional_schedule(void) { }
 extern void console_unblank(void);
 extern void console_flush_on_panic(enum con_flush_mode mode);
 extern struct tty_driver *console_device(int *);
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 0b3af1529778..2708d9f499a3 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -375,9 +375,6 @@ static int preferred_console = -1;
 int console_set_on_cmdline;
 EXPORT_SYMBOL(console_set_on_cmdline);
 
-/* Flag: console code may call schedule() */
-static int console_may_schedule;
-
 enum con_msg_format_flags {
 	MSG_FORMAT_DEFAULT	= 0,
 	MSG_FORMAT_SYSLOG	= (1 << 0),
@@ -2651,7 +2648,6 @@ void console_lock(void)
 
 	down_console_sem();
 	console_locked = 1;
-	console_may_schedule = 1;
 }
 EXPORT_SYMBOL(console_lock);
 
@@ -2671,7 +2667,6 @@ int console_trylock(void)
 	if (down_trylock_console_sem())
 		return 0;
 	console_locked = 1;
-	console_may_schedule = 0;
 	return 1;
 }
 EXPORT_SYMBOL(console_trylock);
@@ -2922,9 +2917,6 @@ static bool console_emit_next_record(struct console *con, bool *handover, int co
 /*
  * Print out all remaining records to all consoles.
  *
- * @do_cond_resched is set by the caller. It can be true only in schedulable
- * context.
- *
  * @next_seq is set to the sequence number after the last available record.
  * The value is valid only when this function returns true. It means that all
  * usable consoles are completely flushed.
@@ -2942,7 +2934,7 @@ static bool console_emit_next_record(struct console *con, bool *handover, int co
  *
  * Requires the console_lock.
  */
-static bool console_flush_all(bool do_cond_resched, u64 *next_seq, bool *handover)
+static bool console_flush_all(u64 *next_seq, bool *handover)
 {
 	bool any_usable = false;
 	struct console *con;
@@ -2983,9 +2975,6 @@ static bool console_flush_all(bool do_cond_resched, u64 *next_seq, bool *handove
 			/* Allow panic_cpu to take over the consoles safely. */
 			if (other_cpu_in_panic())
 				goto abandon;
-
-			if (do_cond_resched)
-				cond_resched();
 		}
 		console_srcu_read_unlock(cookie);
 	} while (any_progress);
@@ -3011,28 +3000,26 @@ static bool console_flush_all(bool do_cond_resched, u64 *next_seq, bool *handove
  */
 void console_unlock(void)
 {
-	bool do_cond_resched;
 	bool handover;
 	bool flushed;
 	u64 next_seq;
 
 	/*
-	 * Console drivers are called with interrupts disabled, so
-	 * @console_may_schedule should be cleared before; however, we may
+	 * Console drivers are called with interrupts disabled, so in
+	 * general we cannot schedule. There are also cases where we will
 	 * end up dumping a lot of lines, for example, if called from
-	 * console registration path, and should invoke cond_resched()
-	 * between lines if allowable.  Not doing so can cause a very long
-	 * scheduling stall on a slow console leading to RCU stall and
-	 * softlockup warnings which exacerbate the issue with more
-	 * messages practically incapacitating the system. Therefore, create
-	 * a local to use for the printing loop.
+	 * console registration path.
+	 *
+	 * Not scheduling while working on a slow console could lead to
+	 * RCU stalls and softlockup warnings which exacerbate the issue
+	 * with more messages practically incapacitating the system.
+	 *
+	 * However, most of the console code is preemptible, so the scheduler
+	 * should be able to preempt us and make forward progress.
 	 */
-	do_cond_resched = console_may_schedule;
 
 	do {
-		console_may_schedule = 0;
-
-		flushed = console_flush_all(do_cond_resched, &next_seq, &handover);
+		flushed = console_flush_all(&next_seq, &handover);
 		if (!handover)
 			__console_unlock();
 
@@ -3055,22 +3042,6 @@ void console_unlock(void)
 }
 EXPORT_SYMBOL(console_unlock);
 
-/**
- * console_conditional_schedule - yield the CPU if required
- *
- * If the console code is currently allowed to sleep, and
- * if this CPU should yield the CPU to another task, do
- * so here.
- *
- * Must be called within console_lock();.
- */
-void __sched console_conditional_schedule(void)
-{
-	if (console_may_schedule)
-		cond_resched();
-}
-EXPORT_SYMBOL(console_conditional_schedule);
-
 void console_unblank(void)
 {
 	bool found_unblank = false;
@@ -3118,7 +3089,6 @@ void console_unblank(void)
 		console_lock();
 
 	console_locked = 1;
-	console_may_schedule = 0;
 
 	cookie = console_srcu_read_lock();
 	for_each_console_srcu(c) {
@@ -3154,13 +3124,6 @@ void console_flush_on_panic(enum con_flush_mode mode)
 	 *   - semaphores are not NMI-safe
 	 */
 
-	/*
-	 * If another context is holding the console lock,
-	 * @console_may_schedule might be set. Clear it so that
-	 * this context does not call cond_resched() while flushing.
-	 */
-	console_may_schedule = 0;
-
 	if (mode == CONSOLE_REPLAY_ALL) {
 		struct console *c;
 		int cookie;
@@ -3179,7 +3142,7 @@ void console_flush_on_panic(enum con_flush_mode mode)
 		console_srcu_read_unlock(cookie);
 	}
 
-	console_flush_all(false, &next_seq, &handover);
+	console_flush_all(&next_seq, &handover);
 }
 
 /*
@@ -3364,7 +3327,7 @@ static void console_init_seq(struct console *newcon, bool bootcon_registered)
 			 * Flush all consoles and set the console to start at
 			 * the next unprinted sequence number.
 			 */
-			if (!console_flush_all(true, &newcon->seq, &handover)) {
+			if (!console_flush_all(&newcon->seq, &handover)) {
 				/*
 				 * Flushing failed. Just choose the lowest
 				 * sequence of the enabled boot consoles.
-- 
2.31.1
[RFC PATCH 65/86] treewide: task_work: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
The cond_resched() call was added in commit f341861fb0b7 ("task_work:
add a scheduling point in task_work_run()") because of softlockups
when processes with a large number of open sockets would exit.

Given the always-on PREEMPTION, we should be able to remove it
without much concern. However, task_work_run() does get called
from some "interesting" places: one of them being the
exit_to_user_loop() itself.

That means that if TIF_NEED_RESCHED (or TIF_NEED_RESCHED_LAZY) were
to be set once we were in a potentially long running task_work_run()
call, then we would ignore the need-resched flags and there would
be no call to schedule().

However, in that case, the next timer tick should cause rescheduling
in irqentry_exit_cond_resched(), since by then the TIF_NEED_RESCHED
flag would be set (even if the original flag was TIF_NEED_RESCHED_LAZY,
the tick would have upgraded it).
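
A simplified sketch of the interrupt-exit path being relied on here (not
the exact mainline code; need_resched(), preempt_count() and
preempt_schedule_irq() are the real interfaces, everything else is
elided):

	/*
	 * Sketch only: on return from the timer interrupt, the generic
	 * entry code preempts if the flag is set and the kernel is
	 * preemptible at that point, so a long task_work_run() still gets
	 * rescheduled without an explicit cond_resched().
	 */
	static void irqentry_exit_cond_resched_sketch(void)
	{
		if (!preempt_count() && need_resched())
			preempt_schedule_irq();
	}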

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/task_work.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index 95a7e1b7f1da..6a891465c8e1 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -179,7 +179,6 @@ void task_work_run(void)
 			next = work->next;
 			work->func(work);
 			work = next;
-			cond_resched();
 		} while (work);
 	}
 }
-- 
2.31.1
[RFC PATCH 66/86] treewide: kernel: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for say code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All of these are from set-1, except for the retry loops in
task_function_call() and the mutex testing logic.

Replace these with cond_resched_stall(). The others can be removed.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Tejun Heo <tj@kernel.org> 
Cc: Zefan Li <lizefan.x@bytedance.com> 
Cc: Johannes Weiner <hannes@cmpxchg.org> 
Cc: Peter Oberparleiter <oberpar@linux.ibm.com> 
Cc: Eric Biederman <ebiederm@xmission.com> 
Cc: Will Deacon <will@kernel.org> 
Cc: Luis Chamberlain <mcgrof@kernel.org> 
Cc: Oleg Nesterov <oleg@redhat.com> 
Cc: Juri Lelli <juri.lelli@redhat.com> 
Cc: Vincent Guittot <vincent.guittot@linaro.org> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/sched/cond_resched.h | 1 -
 kernel/auditsc.c                   | 2 --
 kernel/cgroup/rstat.c              | 3 +--
 kernel/dma/debug.c                 | 2 --
 kernel/events/core.c               | 2 +-
 kernel/gcov/base.c                 | 1 -
 kernel/kallsyms.c                  | 4 +---
 kernel/kexec_core.c                | 6 ------
 kernel/locking/test-ww_mutex.c     | 4 ++--
 kernel/module/main.c               | 1 -
 kernel/ptrace.c                    | 2 --
 kernel/sched/core.c                | 1 -
 kernel/sched/fair.c                | 4 ----
 13 files changed, 5 insertions(+), 28 deletions(-)
 delete mode 100644 include/linux/sched/cond_resched.h

diff --git a/include/linux/sched/cond_resched.h b/include/linux/sched/cond_resched.h
deleted file mode 100644
index 227f5be81bcd..000000000000
--- a/include/linux/sched/cond_resched.h
+++ /dev/null
@@ -1 +0,0 @@
-#include <linux/sched.h>
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 6f0d6fb6523f..47abfc1e6c75 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -2460,8 +2460,6 @@ void __audit_inode_child(struct inode *parent,
 		}
 	}
 
-	cond_resched();
-
 	/* is there a matching child entry? */
 	list_for_each_entry(n, &context->names_list, list) {
 		/* can only match entries that have a name */
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d80d7a608141..d61dc98d1d2f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -210,8 +210,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 		/* play nice and yield if necessary */
 		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
 			spin_unlock_irq(&cgroup_rstat_lock);
-			if (!cond_resched())
-				cpu_relax();
+			cond_resched_stall();
 			spin_lock_irq(&cgroup_rstat_lock);
 		}
 	}
diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
index 06366acd27b0..fb8e7aed9751 100644
--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -543,8 +543,6 @@ void debug_dma_dump_mappings(struct device *dev)
 			}
 		}
 		spin_unlock_irqrestore(&bucket->lock, flags);
-
-		cond_resched();
 	}
 }
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a2f2a9525d72..02330c190472 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -125,7 +125,7 @@ task_function_call(struct task_struct *p, remote_function_f func, void *info)
 		if (ret != -EAGAIN)
 			break;
 
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	return ret;
diff --git a/kernel/gcov/base.c b/kernel/gcov/base.c
index 073a3738c5e6..3c22a15065b3 100644
--- a/kernel/gcov/base.c
+++ b/kernel/gcov/base.c
@@ -43,7 +43,6 @@ void gcov_enable_events(void)
 	/* Perform event callback for previously registered entries. */
 	while ((info = gcov_info_next(info))) {
 		gcov_event(GCOV_ADD, info);
-		cond_resched();
 	}
 
 	mutex_unlock(&gcov_lock);
diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index 18edd57b5fe8..a3c5ce9246cd 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -19,7 +19,7 @@
 #include <linux/kdb.h>
 #include <linux/err.h>
 #include <linux/proc_fs.h>
-#include <linux/sched.h>	/* for cond_resched */
+#include <linux/sched.h>
 #include <linux/ctype.h>
 #include <linux/slab.h>
 #include <linux/filter.h>
@@ -295,7 +295,6 @@ int kallsyms_on_each_symbol(int (*fn)(void *, const char *, unsigned long),
 		ret = fn(data, namebuf, kallsyms_sym_address(i));
 		if (ret != 0)
 			return ret;
-		cond_resched();
 	}
 	return 0;
 }
@@ -312,7 +311,6 @@ int kallsyms_on_each_match_symbol(int (*fn)(void *, unsigned long),
 
 	for (i = start; !ret && i <= end; i++) {
 		ret = fn(data, kallsyms_sym_address(get_symbol_seq(i)));
-		cond_resched();
 	}
 
 	return ret;
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 9dc728982d79..40699ea33034 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -452,8 +452,6 @@ static struct page *kimage_alloc_crash_control_pages(struct kimage *image,
 	while (hole_end <= crashk_res.end) {
 		unsigned long i;
 
-		cond_resched();
-
 		if (hole_end > KEXEC_CRASH_CONTROL_MEMORY_LIMIT)
 			break;
 		/* See if I overlap any of the segments */
@@ -832,8 +830,6 @@ static int kimage_load_normal_segment(struct kimage *image,
 		else
 			buf += mchunk;
 		mbytes -= mchunk;
-
-		cond_resched();
 	}
 out:
 	return result;
@@ -900,8 +896,6 @@ static int kimage_load_crash_segment(struct kimage *image,
 		else
 			buf += mchunk;
 		mbytes -= mchunk;
-
-		cond_resched();
 	}
 out:
 	return result;
diff --git a/kernel/locking/test-ww_mutex.c b/kernel/locking/test-ww_mutex.c
index 93cca6e69860..b1bb683274f8 100644
--- a/kernel/locking/test-ww_mutex.c
+++ b/kernel/locking/test-ww_mutex.c
@@ -46,7 +46,7 @@ static void test_mutex_work(struct work_struct *work)
 
 	if (mtx->flags & TEST_MTX_TRY) {
 		while (!ww_mutex_trylock(&mtx->mutex, NULL))
-			cond_resched();
+			cond_resched_stall();
 	} else {
 		ww_mutex_lock(&mtx->mutex, NULL);
 	}
@@ -84,7 +84,7 @@ static int __test_mutex(unsigned int flags)
 				ret = -EINVAL;
 				break;
 			}
-			cond_resched();
+			cond_resched_stall();
 		} while (time_before(jiffies, timeout));
 	} else {
 		ret = wait_for_completion_timeout(&mtx.done, TIMEOUT);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 98fedfdb8db5..03f6fcfa87f8 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1908,7 +1908,6 @@ static int copy_chunked_from_user(void *dst, const void __user *usrc, unsigned l
 
 		if (copy_from_user(dst, usrc, n) != 0)
 			return -EFAULT;
-		cond_resched();
 		dst += n;
 		usrc += n;
 		len -= n;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 443057bee87c..83a65a3c614a 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -798,8 +798,6 @@ static int ptrace_peek_siginfo(struct task_struct *child,
 
 		if (signal_pending(current))
 			break;
-
-		cond_resched();
 	}
 
 	if (i > 0)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3467a3a7d4bf..691b50791e04 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -25,7 +25,6 @@
 #include <linux/refcount_api.h>
 #include <linux/topology.h>
 #include <linux/sched/clock.h>
-#include <linux/sched/cond_resched.h>
 #include <linux/sched/cputime.h>
 #include <linux/sched/debug.h>
 #include <linux/sched/hotplug.h>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 448fe36e7bbb..4e67e88282a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -33,7 +33,6 @@
 #include <linux/refcount_api.h>
 #include <linux/topology.h>
 #include <linux/sched/clock.h>
-#include <linux/sched/cond_resched.h>
 #include <linux/sched/cputime.h>
 #include <linux/sched/isolation.h>
 #include <linux/sched/nohz.h>
@@ -51,8 +50,6 @@
 
 #include <asm/switch_to.h>
 
-#include <linux/sched/cond_resched.h>
-
 #include "sched.h"
 #include "stats.h"
 #include "autogroup.h"
@@ -3374,7 +3371,6 @@ static void task_numa_work(struct callback_head *work)
 			if (pages <= 0 || virtpages <= 0)
 				goto out;
 
-			cond_resched();
 		} while (end != vma->vm_end);
 	} for_each_vma(vmi, vma);
 
-- 
2.31.1
Re: [RFC PATCH 66/86] treewide: kernel: remove cond_resched()
Posted by Luis Chamberlain 2 years ago
On Tue, Nov 07, 2023 at 03:08:02PM -0800, Ankur Arora wrote:
> There are broadly three sets of uses of cond_resched():
> 
> 1.  Calls to cond_resched() out of the goodness of our heart,
>     otherwise known as avoiding lockup splats.
> 
> 2.  Open coded variants of cond_resched_lock() which call
>     cond_resched().
> 
> 3.  Retry or error handling loops, where cond_resched() is used as a
>     quick alternative to spinning in a tight-loop.
> 
> When running under a full preemption model, the cond_resched() reduces
> to a NOP (not even a barrier) so removing it obviously cannot matter.
> 
> But considering only voluntary preemption models (for say code that
> has been mostly tested under those), for set-1 and set-2 the
> scheduler can now preempt kernel tasks running beyond their time
> quanta anywhere they are preemptible() [1]. Which removes any need
> for these explicitly placed scheduling points.
> 
> The cond_resched() calls in set-3 are a little more difficult.
> To start with, given it's NOP character under full preemption, it
> never actually saved us from a tight loop.
> With voluntary preemption, it's not a NOP, but it might as well be --
> for most workloads the scheduler does not have an interminable supply
> of runnable tasks on the runqueue.
> 
> So, cond_resched() is useful to not get softlockup splats, but not
> terribly good for error handling. Ideally, these should be replaced
> with some kind of timed or event wait.
> For now we use cond_resched_stall(), which tries to schedule if
> possible, and executes a cpu_relax() if not.
> 
> All of these are from set-1 except for the retry loops in
> task_function_call() or the mutex testing logic.
> 
> Replace these with cond_resched_stall(). The others can be removed.
> 
> [1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/
> 
> Cc: Tejun Heo <tj@kernel.org> 
> Cc: Zefan Li <lizefan.x@bytedance.com> 
> Cc: Johannes Weiner <hannes@cmpxchg.org> 
> Cc: Peter Oberparleiter <oberpar@linux.ibm.com> 
> Cc: Eric Biederman <ebiederm@xmission.com> 
> Cc: Will Deacon <will@kernel.org> 
> Cc: Luis Chamberlain <mcgrof@kernel.org> 
> Cc: Oleg Nesterov <oleg@redhat.com> 
> Cc: Juri Lelli <juri.lelli@redhat.com> 
> Cc: Vincent Guittot <vincent.guittot@linaro.org> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

Sounds like the sort of test which should be put into linux-next to get
test coverage right away. To see what really blows up.

 Luis
Re: [RFC PATCH 66/86] treewide: kernel: remove cond_resched()
Posted by Steven Rostedt 2 years ago
On Fri, 17 Nov 2023 10:14:33 -0800
Luis Chamberlain <mcgrof@kernel.org> wrote:

> Sounds like the sort of test which should be put into linux-next to get
> test coverage right away. To see what really blows up.

No, it shouldn't have been added this early in the development. It depends
on the first part of this patch series, which needs to be finished first.

Ankur just gave his full vision of the RFC. You can ignore the removal of
the cond_resched() for the time being.

Thanks for looking at it Luis!

-- Steve
[RFC PATCH 67/86] treewide: kernel: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for say code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All of these are set-1 or set-2. Replace the call in stop_one_cpu()
with cond_resched_stall() to allow it a chance to schedule.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Tejun Heo <tj@kernel.org> 
Cc: Lai Jiangshan <jiangshanlai@gmail.com> 
Cc: Andrew Morton <akpm@linux-foundation.org> 
Cc: Nicholas Piggin <npiggin@gmail.com> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/kthread.c      |  1 -
 kernel/softirq.c      |  1 -
 kernel/stop_machine.c |  2 +-
 kernel/workqueue.c    | 10 ----------
 4 files changed, 1 insertion(+), 13 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 1eea53050bab..e111eebee240 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -830,7 +830,6 @@ int kthread_worker_fn(void *worker_ptr)
 		schedule();
 
 	try_to_freeze();
-	cond_resched();
 	goto repeat;
 }
 EXPORT_SYMBOL_GPL(kthread_worker_fn);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 210cf5f8d92c..c80237cbcb3d 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -920,7 +920,6 @@ static void run_ksoftirqd(unsigned int cpu)
 		 */
 		__do_softirq();
 		ksoftirqd_run_end();
-		cond_resched();
 		return;
 	}
 	ksoftirqd_run_end();
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index cedb17ba158a..1929fe8ecd70 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -148,7 +148,7 @@ int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg)
 	 * In case @cpu == smp_proccessor_id() we can avoid a sleep+wakeup
 	 * cycle by doing a preemption:
 	 */
-	cond_resched();
+	cond_resched_stall();
 	wait_for_completion(&done.completion);
 	return done.ret;
 }
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a3522b70218d..be5080e1b7d6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2646,16 +2646,6 @@ __acquires(&pool->lock)
 		dump_stack();
 	}
 
-	/*
-	 * The following prevents a kworker from hogging CPU on !PREEMPTION
-	 * kernels, where a requeueing work item waiting for something to
-	 * happen could deadlock with stop_machine as such work item could
-	 * indefinitely requeue itself while all other CPUs are trapped in
-	 * stop_machine. At the same time, report a quiescent RCU state so
-	 * the same condition doesn't freeze RCU.
-	 */
-	cond_resched();
-
 	raw_spin_lock_irq(&pool->lock);
 
 	/*
-- 
2.31.1
[RFC PATCH 68/86] treewide: mm: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for say code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most of the cond_resched() cases here are from set-1, we are
executing in long loops and want to see if rescheduling is
needed.

Now the scheduler can handle rescheduling for those.

There are a few set-2 cases where we give up a lock and reacquire
it.  The unlock will take care of the preemption, but maybe there
should be a cpu_relax() before reacquiring?
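
For illustration, a sketch of that set-2 shape, modelled on the kasan
quarantine hunk further down; more_to_flush() and flush_one_batch() are
made-up placeholders, and the cpu_relax() is the tentative addition
being mused about:

	static void flush_with_lock_breaks(void)
	{
		unsigned long flags;

		raw_spin_lock_irqsave(&quarantine_lock, flags);
		while (more_to_flush()) {		/* placeholder */
			flush_one_batch();		/* placeholder */
			/*
			 * Drop the lock between batches: the unlock is a
			 * natural preemption point once this series lands.
			 */
			raw_spin_unlock_irqrestore(&quarantine_lock, flags);
			cpu_relax();	/* maybe: pause before retaking the lock */
			raw_spin_lock_irqsave(&quarantine_lock, flags);
		}
		raw_spin_unlock_irqrestore(&quarantine_lock, flags);
	}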

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Andrew Morton <akpm@linux-foundation.org> 
Cc: SeongJae Park <sj@kernel.org> 
Cc: "Matthew Wilcox 
Cc: Mike Kravetz <mike.kravetz@oracle.com> 
Cc: Muchun Song <muchun.song@linux.dev> 
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> 
Cc: Marco Elver <elver@google.com> 
Cc: Catalin Marinas <catalin.marinas@arm.com> 
Cc: Johannes Weiner <hannes@cmpxchg.org> 
Cc: Michal Hocko <mhocko@kernel.org> 
Cc: Roman Gushchin <roman.gushchin@linux.dev> 
Cc: Shakeel Butt <shakeelb@google.com> 
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> 
Cc: Miaohe Lin <linmiaohe@huawei.com> 
Cc: David Hildenbrand <david@redhat.com> 
Cc: Oscar Salvador <osalvador@suse.de> 
Cc: Mike Rapoport <rppt@kernel.org> 
Cc: Will Deacon <will@kernel.org> 
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> 
Cc: Nick Piggin <npiggin@gmail.com> 
Cc: Peter Zijlstra <peterz@infradead.org> 
Cc: Dennis Zhou <dennis@kernel.org> 
Cc: Tejun Heo <tj@kernel.org> 
Cc: Christoph Lameter <cl@linux.com> 
Cc: Hugh Dickins <hughd@google.com> 
Cc: Pekka Enberg <penberg@kernel.org> 
Cc: David Rientjes <rientjes@google.com> 
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> 
Cc: Vlastimil Babka <vbabka@suse.cz> 
Cc: Vitaly Wool <vitaly.wool@konsulko.com> 
Cc: Minchan Kim <minchan@kernel.org> 
Cc: Sergey Senozhatsky <senozhatsky@chromium.org> 
Cc: Seth Jennings <sjenning@redhat.com> 
Cc: Dan Streetman <ddstreet@ieee.org> 
Cc: linux-mm@kvack.org

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/backing-dev.c        |  8 +++++++-
 mm/compaction.c         | 23 ++++++-----------------
 mm/damon/paddr.c        |  1 -
 mm/dmapool_test.c       |  2 --
 mm/filemap.c            |  6 ------
 mm/gup.c                |  1 -
 mm/huge_memory.c        |  3 ---
 mm/hugetlb.c            | 12 ------------
 mm/hugetlb_cgroup.c     |  1 -
 mm/kasan/quarantine.c   |  6 ++++--
 mm/kfence/kfence_test.c | 22 +---------------------
 mm/khugepaged.c         |  5 -----
 mm/kmemleak.c           |  8 --------
 mm/ksm.c                | 21 ++++-----------------
 mm/madvise.c            |  3 ---
 mm/memcontrol.c         |  4 ----
 mm/memory-failure.c     |  1 -
 mm/memory.c             | 12 +-----------
 mm/memory_hotplug.c     |  6 ------
 mm/mempolicy.c          |  1 -
 mm/migrate.c            |  6 ------
 mm/mincore.c            |  1 -
 mm/mlock.c              |  2 --
 mm/mm_init.c            | 13 +++----------
 mm/mmap.c               |  1 -
 mm/mmu_gather.c         |  2 --
 mm/mprotect.c           |  1 -
 mm/mremap.c             |  1 -
 mm/nommu.c              |  1 -
 mm/page-writeback.c     |  1 -
 mm/page_alloc.c         | 13 ++-----------
 mm/page_counter.c       |  1 -
 mm/page_ext.c           |  1 -
 mm/page_idle.c          |  2 --
 mm/page_io.c            |  2 --
 mm/page_owner.c         |  1 -
 mm/percpu.c             |  5 -----
 mm/rmap.c               |  2 --
 mm/shmem.c              |  9 ---------
 mm/shuffle.c            |  6 ++++--
 mm/slab.c               |  3 ---
 mm/swap_cgroup.c        |  4 ----
 mm/swapfile.c           | 14 --------------
 mm/truncate.c           |  4 ----
 mm/userfaultfd.c        |  3 ---
 mm/util.c               |  1 -
 mm/vmalloc.c            |  5 -----
 mm/vmscan.c             | 29 ++---------------------------
 mm/vmstat.c             |  4 ----
 mm/workingset.c         |  1 -
 mm/z3fold.c             | 15 ++++-----------
 mm/zsmalloc.c           |  1 -
 mm/zswap.c              |  1 -
 53 files changed, 38 insertions(+), 264 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 1e3447bccdb1..22ca90addb35 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -816,8 +816,14 @@ static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
 			continue;
 
 		spin_unlock_irq(&cgwb_lock);
+
+		/*
+		 * cleanup_offline_cgwb() can implicitly reschedule
+		 * on unlock when needed, so just loop here.
+		 */
 		while (cleanup_offline_cgwb(wb))
-			cond_resched();
+			;
+
 		spin_lock_irq(&cgwb_lock);
 
 		wb_put(wb);
diff --git a/mm/compaction.c b/mm/compaction.c
index 38c8d216c6a3..5bca34760fec 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -395,8 +395,6 @@ static void __reset_isolation_suitable(struct zone *zone)
 	 */
 	for (; migrate_pfn < free_pfn; migrate_pfn += pageblock_nr_pages,
 					free_pfn -= pageblock_nr_pages) {
-		cond_resched();
-
 		/* Update the migrate PFN */
 		if (__reset_isolation_pfn(zone, migrate_pfn, true, source_set) &&
 		    migrate_pfn < reset_migrate) {
@@ -571,8 +569,6 @@ static bool compact_unlock_should_abort(spinlock_t *lock,
 		return true;
 	}
 
-	cond_resched();
-
 	return false;
 }
 
@@ -874,8 +870,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			return -EINTR;
 	}
 
-	cond_resched();
-
 	if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC)) {
 		skip_on_failure = true;
 		next_skip_pfn = block_end_pfn(low_pfn, cc->order);
@@ -923,8 +917,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 				goto fatal_pending;
 			}
-
-			cond_resched();
 		}
 
 		nr_scanned++;
@@ -1681,11 +1673,10 @@ static void isolate_freepages(struct compact_control *cc)
 		unsigned long nr_isolated;
 
 		/*
-		 * This can iterate a massively long zone without finding any
-		 * suitable migration targets, so periodically check resched.
+		 * We can iterate over a massively long zone without finding
+		 * any suitable migration targets. Since we don't disable
+		 * preemption while doing so, expect to be preempted.
 		 */
-		if (!(block_start_pfn % (COMPACT_CLUSTER_MAX * pageblock_nr_pages)))
-			cond_resched();
 
 		page = pageblock_pfn_to_page(block_start_pfn, block_end_pfn,
 									zone);
@@ -2006,12 +1997,10 @@ static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
 			block_end_pfn += pageblock_nr_pages) {
 
 		/*
-		 * This can potentially iterate a massively long zone with
-		 * many pageblocks unsuitable, so periodically check if we
-		 * need to schedule.
+		 * We can potentially iterate a massively long zone with
+		 * many pageblocks unsuitable. Since we don't disable
+		 * preemption while doing so, expect to be preempted.
 		 */
-		if (!(low_pfn % (COMPACT_CLUSTER_MAX * pageblock_nr_pages)))
-			cond_resched();
 
 		page = pageblock_pfn_to_page(block_start_pfn,
 						block_end_pfn, cc->zone);
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 909db25efb35..97eed5e0f89b 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -251,7 +251,6 @@ static unsigned long damon_pa_pageout(struct damon_region *r, struct damos *s)
 		folio_put(folio);
 	}
 	applied = reclaim_pages(&folio_list);
-	cond_resched();
 	return applied * PAGE_SIZE;
 }
 
diff --git a/mm/dmapool_test.c b/mm/dmapool_test.c
index 370fb9e209ef..c519475310e4 100644
--- a/mm/dmapool_test.c
+++ b/mm/dmapool_test.c
@@ -82,8 +82,6 @@ static int dmapool_test_block(const struct dmapool_parms *parms)
 		ret = dmapool_test_alloc(p, blocks);
 		if (ret)
 			goto free_pool;
-		if (need_resched())
-			cond_resched();
 	}
 	end_time = ktime_get();
 
diff --git a/mm/filemap.c b/mm/filemap.c
index dc4dcc5eaf5e..e3c9cf5b33b4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -526,7 +526,6 @@ static void __filemap_fdatawait_range(struct address_space *mapping,
 			folio_clear_error(folio);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 }
 
@@ -2636,8 +2635,6 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 	folio_batch_init(&fbatch);
 
 	do {
-		cond_resched();
-
 		/*
 		 * If we've already successfully copied some data, then we
 		 * can no longer safely return -EIOCBQUEUED. Hence mark
@@ -2910,8 +2907,6 @@ ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
 	folio_batch_init(&fbatch);
 
 	do {
-		cond_resched();
-
 		if (*ppos >= i_size_read(in->f_mapping->host))
 			break;
 
@@ -3984,7 +3979,6 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
 			if (unlikely(status < 0))
 				break;
 		}
-		cond_resched();
 
 		if (unlikely(status == 0)) {
 			/*
diff --git a/mm/gup.c b/mm/gup.c
index 2f8a2d89fde1..f6d913e97d71 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1232,7 +1232,6 @@ static long __get_user_pages(struct mm_struct *mm,
 			ret = -EINTR;
 			goto out;
 		}
-		cond_resched();
 
 		page = follow_page_mask(vma, start, foll_flags, &ctx);
 		if (!page || PTR_ERR(page) == -EMLINK) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 064fbd90822b..6d48ee94a8c8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2954,7 +2954,6 @@ static void split_huge_pages_all(void)
 			folio_unlock(folio);
 next:
 			folio_put(folio);
-			cond_resched();
 		}
 	}
 
@@ -3044,7 +3043,6 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		folio_unlock(folio);
 next:
 		folio_put(folio);
-		cond_resched();
 	}
 	mmap_read_unlock(mm);
 	mmput(mm);
@@ -3101,7 +3099,6 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 		folio_unlock(folio);
 next:
 		folio_put(folio);
-		cond_resched();
 	}
 
 	filp_close(candidate, NULL);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1301ba7b2c9a..d611d256ebc2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1830,8 +1830,6 @@ static void free_hpage_workfn(struct work_struct *work)
 		h = size_to_hstate(page_size(page));
 
 		__update_and_free_hugetlb_folio(h, page_folio(page));
-
-		cond_resched();
 	}
 }
 static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
@@ -1869,7 +1867,6 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
 	list_for_each_entry_safe(page, t_page, list, lru) {
 		folio = page_folio(page);
 		update_and_free_hugetlb_folio(h, folio, false);
-		cond_resched();
 	}
 }
 
@@ -2319,7 +2316,6 @@ int dissolve_free_huge_page(struct page *page)
 		 */
 		if (unlikely(!folio_test_hugetlb_freed(folio))) {
 			spin_unlock_irq(&hugetlb_lock);
-			cond_resched();
 
 			/*
 			 * Theoretically, we should return -EBUSY when we
@@ -2563,7 +2559,6 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 			break;
 		}
 		list_add(&folio->lru, &surplus_list);
-		cond_resched();
 	}
 	allocated += i;
 
@@ -2961,7 +2956,6 @@ static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
 		 * we retry.
 		 */
 		spin_unlock_irq(&hugetlb_lock);
-		cond_resched();
 		goto retry;
 	} else {
 		/*
@@ -3233,7 +3227,6 @@ static void __init gather_bootmem_prealloc(void)
 		 * other side-effects, like CommitLimit going negative.
 		 */
 		adjust_managed_page_count(page, pages_per_huge_page(h));
-		cond_resched();
 	}
 }
 static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
@@ -3255,7 +3248,6 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 				break;
 			free_huge_folio(folio); /* free it into the hugepage allocator */
 		}
-		cond_resched();
 	}
 	if (i == h->max_huge_pages_node[nid])
 		return;
@@ -3317,7 +3309,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 					 &node_states[N_MEMORY],
 					 node_alloc_noretry))
 			break;
-		cond_resched();
 	}
 	if (i < h->max_huge_pages) {
 		char buf[32];
@@ -3536,9 +3527,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		 */
 		spin_unlock_irq(&hugetlb_lock);
 
-		/* yield cpu to avoid soft lockup */
-		cond_resched();
-
 		ret = alloc_pool_huge_page(h, nodes_allowed,
 						node_alloc_noretry);
 		spin_lock_irq(&hugetlb_lock);
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index dedd2edb076e..a4441f328752 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -235,7 +235,6 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
 
 			spin_unlock_irq(&hugetlb_lock);
 		}
-		cond_resched();
 	} while (hugetlb_cgroup_have_usage(h_cg));
 }
 
diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c
index 152dca73f398..1a1edadbeb39 100644
--- a/mm/kasan/quarantine.c
+++ b/mm/kasan/quarantine.c
@@ -374,9 +374,11 @@ void kasan_quarantine_remove_cache(struct kmem_cache *cache)
 		if (qlist_empty(&global_quarantine[i]))
 			continue;
 		qlist_move_cache(&global_quarantine[i], &to_free, cache);
-		/* Scanning whole quarantine can take a while. */
+		/*
+		 * Scanning whole quarantine can take a while so check if need
+		 * to reschedule after giving up the lock.
+		 */
 		raw_spin_unlock_irqrestore(&quarantine_lock, flags);
-		cond_resched();
 		raw_spin_lock_irqsave(&quarantine_lock, flags);
 	}
 	raw_spin_unlock_irqrestore(&quarantine_lock, flags);
diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c
index 95b2b84c296d..29fbc24046b9 100644
--- a/mm/kfence/kfence_test.c
+++ b/mm/kfence/kfence_test.c
@@ -244,7 +244,7 @@ enum allocation_policy {
 static void *test_alloc(struct kunit *test, size_t size, gfp_t gfp, enum allocation_policy policy)
 {
 	void *alloc;
-	unsigned long timeout, resched_after;
+	unsigned long timeout;
 	const char *policy_name;
 
 	switch (policy) {
@@ -265,17 +265,6 @@ static void *test_alloc(struct kunit *test, size_t size, gfp_t gfp, enum allocat
 	kunit_info(test, "%s: size=%zu, gfp=%x, policy=%s, cache=%i\n", __func__, size, gfp,
 		   policy_name, !!test_cache);
 
-	/*
-	 * 100x the sample interval should be more than enough to ensure we get
-	 * a KFENCE allocation eventually.
-	 */
-	timeout = jiffies + msecs_to_jiffies(100 * kfence_sample_interval);
-	/*
-	 * Especially for non-preemption kernels, ensure the allocation-gate
-	 * timer can catch up: after @resched_after, every failed allocation
-	 * attempt yields, to ensure the allocation-gate timer is scheduled.
-	 */
-	resched_after = jiffies + msecs_to_jiffies(kfence_sample_interval);
 	do {
 		if (test_cache)
 			alloc = kmem_cache_alloc(test_cache, gfp);
@@ -307,8 +296,6 @@ static void *test_alloc(struct kunit *test, size_t size, gfp_t gfp, enum allocat
 
 		test_free(alloc);
 
-		if (time_after(jiffies, resched_after))
-			cond_resched();
 	} while (time_before(jiffies, timeout));
 
 	KUNIT_ASSERT_TRUE_MSG(test, false, "failed to allocate from KFENCE");
@@ -628,7 +615,6 @@ static void test_gfpzero(struct kunit *test)
 			kunit_warn(test, "giving up ... cannot get same object back\n");
 			return;
 		}
-		cond_resched();
 	}
 
 	for (i = 0; i < size; i++)
@@ -755,12 +741,6 @@ static void test_memcache_alloc_bulk(struct kunit *test)
 			}
 		}
 		kmem_cache_free_bulk(test_cache, num, objects);
-		/*
-		 * kmem_cache_alloc_bulk() disables interrupts, and calling it
-		 * in a tight loop may not give KFENCE a chance to switch the
-		 * static branch. Call cond_resched() to let KFENCE chime in.
-		 */
-		cond_resched();
 	} while (!pass && time_before(jiffies, timeout));
 
 	KUNIT_EXPECT_TRUE(test, pass);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4025225ef434..ebec87db5cc1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2361,7 +2361,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 	for_each_vma(vmi, vma) {
 		unsigned long hstart, hend;
 
-		cond_resched();
 		if (unlikely(hpage_collapse_test_exit(mm))) {
 			progress++;
 			break;
@@ -2382,7 +2381,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		while (khugepaged_scan.address < hend) {
 			bool mmap_locked = true;
 
-			cond_resched();
 			if (unlikely(hpage_collapse_test_exit(mm)))
 				goto breakouterloop;
 
@@ -2488,8 +2486,6 @@ static void khugepaged_do_scan(struct collapse_control *cc)
 	lru_add_drain_all();
 
 	while (true) {
-		cond_resched();
-
 		if (unlikely(kthread_should_stop() || try_to_freeze()))
 			break;
 
@@ -2721,7 +2717,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		int result = SCAN_FAIL;
 
 		if (!mmap_locked) {
-			cond_resched();
 			mmap_read_lock(mm);
 			mmap_locked = true;
 			result = hugepage_vma_revalidate(mm, addr, false, &vma,
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 54c2c90d3abc..9092941cb259 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1394,7 +1394,6 @@ static void scan_large_block(void *start, void *end)
 		next = min(start + MAX_SCAN_SIZE, end);
 		scan_block(start, next, NULL);
 		start = next;
-		cond_resched();
 	}
 }
 #endif
@@ -1439,7 +1438,6 @@ static void scan_object(struct kmemleak_object *object)
 				break;
 
 			raw_spin_unlock_irqrestore(&object->lock, flags);
-			cond_resched();
 			raw_spin_lock_irqsave(&object->lock, flags);
 		} while (object->flags & OBJECT_ALLOCATED);
 	} else
@@ -1466,8 +1464,6 @@ static void scan_gray_list(void)
 	 */
 	object = list_entry(gray_list.next, typeof(*object), gray_list);
 	while (&object->gray_list != &gray_list) {
-		cond_resched();
-
 		/* may add new objects to the list */
 		if (!scan_should_stop())
 			scan_object(object);
@@ -1501,7 +1497,6 @@ static void kmemleak_cond_resched(struct kmemleak_object *object)
 	raw_spin_unlock_irq(&kmemleak_lock);
 
 	rcu_read_unlock();
-	cond_resched();
 	rcu_read_lock();
 
 	raw_spin_lock_irq(&kmemleak_lock);
@@ -1584,9 +1579,6 @@ static void kmemleak_scan(void)
 		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 			struct page *page = pfn_to_online_page(pfn);
 
-			if (!(pfn & 63))
-				cond_resched();
-
 			if (!page)
 				continue;
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 981af9c72e7a..df5bca0af731 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -492,7 +492,6 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr, bool lock_v
 	do {
 		int ksm_page;
 
-		cond_resched();
 		ksm_page = walk_page_range_vma(vma, addr, addr + 1, ops, NULL);
 		if (WARN_ON_ONCE(ksm_page < 0))
 			return ksm_page;
@@ -686,7 +685,6 @@ static void remove_node_from_stable_tree(struct ksm_stable_node *stable_node)
 		stable_node->rmap_hlist_len--;
 		put_anon_vma(rmap_item->anon_vma);
 		rmap_item->address &= PAGE_MASK;
-		cond_resched();
 	}
 
 	/*
@@ -813,6 +811,10 @@ static struct page *get_ksm_page(struct ksm_stable_node *stable_node,
  */
 static void remove_rmap_item_from_tree(struct ksm_rmap_item *rmap_item)
 {
+	/*
+	 * We are called from many long loops, and for the most part don't
+	 * disable preemption. So expect to be preempted occasionally.
+	 */
 	if (rmap_item->address & STABLE_FLAG) {
 		struct ksm_stable_node *stable_node;
 		struct page *page;
@@ -858,7 +860,6 @@ static void remove_rmap_item_from_tree(struct ksm_rmap_item *rmap_item)
 		rmap_item->address &= PAGE_MASK;
 	}
 out:
-	cond_resched();		/* we're called from many long loops */
 }
 
 static void remove_trailing_rmap_items(struct ksm_rmap_item **rmap_list)
@@ -1000,13 +1001,11 @@ static int remove_all_stable_nodes(void)
 				err = -EBUSY;
 				break;	/* proceed to next nid */
 			}
-			cond_resched();
 		}
 	}
 	list_for_each_entry_safe(stable_node, next, &migrate_nodes, list) {
 		if (remove_stable_node(stable_node))
 			err = -EBUSY;
-		cond_resched();
 	}
 	return err;
 }
@@ -1452,7 +1451,6 @@ static struct page *stable_node_dup(struct ksm_stable_node **_stable_node_dup,
 
 	hlist_for_each_entry_safe(dup, hlist_safe,
 				  &stable_node->hlist, hlist_dup) {
-		cond_resched();
 		/*
 		 * We must walk all stable_node_dup to prune the stale
 		 * stable nodes during lookup.
@@ -1654,7 +1652,6 @@ static struct page *stable_tree_search(struct page *page)
 		struct page *tree_page;
 		int ret;
 
-		cond_resched();
 		stable_node = rb_entry(*new, struct ksm_stable_node, node);
 		stable_node_any = NULL;
 		tree_page = chain_prune(&stable_node_dup, &stable_node,	root);
@@ -1899,7 +1896,6 @@ static struct ksm_stable_node *stable_tree_insert(struct page *kpage)
 		struct page *tree_page;
 		int ret;
 
-		cond_resched();
 		stable_node = rb_entry(*new, struct ksm_stable_node, node);
 		stable_node_any = NULL;
 		tree_page = chain(&stable_node_dup, stable_node, root);
@@ -2016,7 +2012,6 @@ struct ksm_rmap_item *unstable_tree_search_insert(struct ksm_rmap_item *rmap_ite
 		struct page *tree_page;
 		int ret;
 
-		cond_resched();
 		tree_rmap_item = rb_entry(*new, struct ksm_rmap_item, node);
 		tree_page = get_mergeable_page(tree_rmap_item);
 		if (!tree_page)
@@ -2350,7 +2345,6 @@ static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
 						    GET_KSM_PAGE_NOLOCK);
 				if (page)
 					put_page(page);
-				cond_resched();
 			}
 		}
 
@@ -2396,7 +2390,6 @@ static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
 			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
 			if (IS_ERR_OR_NULL(*page)) {
 				ksm_scan.address += PAGE_SIZE;
-				cond_resched();
 				continue;
 			}
 			if (is_zone_device_page(*page))
@@ -2418,7 +2411,6 @@ static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
 next_page:
 			put_page(*page);
 			ksm_scan.address += PAGE_SIZE;
-			cond_resched();
 		}
 	}
 
@@ -2489,7 +2481,6 @@ static void ksm_do_scan(unsigned int scan_npages)
 	unsigned int npages = scan_npages;
 
 	while (npages-- && likely(!freezing(current))) {
-		cond_resched();
 		rmap_item = scan_get_next_rmap_item(&page);
 		if (!rmap_item)
 			return;
@@ -2858,7 +2849,6 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		cond_resched();
 		if (!anon_vma_trylock_read(anon_vma)) {
 			if (rwc->try_lock) {
 				rwc->contended = true;
@@ -2870,7 +2860,6 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 					       0, ULONG_MAX) {
 			unsigned long addr;
 
-			cond_resched();
 			vma = vmac->vma;
 
 			/* Ignore the stable/unstable/sqnr flags */
@@ -3046,14 +3035,12 @@ static void ksm_check_stable_tree(unsigned long start_pfn,
 				node = rb_first(root_stable_tree + nid);
 			else
 				node = rb_next(node);
-			cond_resched();
 		}
 	}
 	list_for_each_entry_safe(stable_node, next, &migrate_nodes, list) {
 		if (stable_node->kpfn >= start_pfn &&
 		    stable_node->kpfn < end_pfn)
 			remove_node_from_stable_tree(stable_node);
-		cond_resched();
 	}
 }
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 4dded5d27e7e..3aa53f2e70e2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -225,7 +225,6 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 	if (ptep)
 		pte_unmap_unlock(ptep, ptl);
 	swap_read_unplug(splug);
-	cond_resched();
 
 	return 0;
 }
@@ -531,7 +530,6 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	}
 	if (pageout)
 		reclaim_pages(&folio_list);
-	cond_resched();
 
 	return 0;
 }
@@ -755,7 +753,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		arch_leave_lazy_mmu_mode();
 		pte_unmap_unlock(start_pte, ptl);
 	}
-	cond_resched();
 
 	return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5b009b233ab8..4bccab7df97f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5650,7 +5650,6 @@ static int mem_cgroup_do_precharge(unsigned long count)
 		if (ret)
 			return ret;
 		mc.precharge++;
-		cond_resched();
 	}
 	return 0;
 }
@@ -6035,7 +6034,6 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 		if (get_mctgt_type(vma, addr, ptep_get(pte), NULL))
 			mc.precharge++;	/* increment precharge temporarily */
 	pte_unmap_unlock(pte - 1, ptl);
-	cond_resched();
 
 	return 0;
 }
@@ -6303,7 +6301,6 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		}
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	cond_resched();
 
 	if (addr != end) {
 		/*
@@ -6345,7 +6342,6 @@ static void mem_cgroup_move_charge(void)
 		 * feature anyway, so it wouldn't be a big problem.
 		 */
 		__mem_cgroup_clear_mc();
-		cond_resched();
 		goto retry;
 	}
 	/*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4d6e43c88489..f291bb06c37c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -804,7 +804,6 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
 	}
 	pte_unmap_unlock(mapped_pte, ptl);
 out:
-	cond_resched();
 	return ret;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 517221f01303..faa36db93f80 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1104,7 +1104,6 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	pte_unmap_unlock(orig_src_pte, src_ptl);
 	add_mm_rss_vec(dst_mm, rss);
 	pte_unmap_unlock(orig_dst_pte, dst_ptl);
-	cond_resched();
 
 	if (ret == -EIO) {
 		VM_WARN_ON_ONCE(!entry.val);
@@ -1573,7 +1572,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 		addr = zap_pte_range(tlb, vma, pmd, addr, next, details);
 		if (addr != next)
 			pmd--;
-	} while (pmd++, cond_resched(), addr != end);
+	} while (pmd++, addr != end);
 
 	return addr;
 }
@@ -1601,7 +1600,6 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 			continue;
 		next = zap_pmd_range(tlb, vma, pud, addr, next, details);
 next:
-		cond_resched();
 	} while (pud++, addr = next, addr != end);
 
 	return addr;
@@ -5926,7 +5924,6 @@ static inline int process_huge_page(
 		l = n;
 		/* Process subpages at the end of huge page */
 		for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
-			cond_resched();
 			ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
 			if (ret)
 				return ret;
@@ -5937,7 +5934,6 @@ static inline int process_huge_page(
 		l = pages_per_huge_page - n;
 		/* Process subpages at the begin of huge page */
 		for (i = 0; i < base; i++) {
-			cond_resched();
 			ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
 			if (ret)
 				return ret;
@@ -5951,11 +5947,9 @@ static inline int process_huge_page(
 		int left_idx = base + i;
 		int right_idx = base + 2 * l - 1 - i;
 
-		cond_resched();
 		ret = process_subpage(addr + left_idx * PAGE_SIZE, left_idx, arg);
 		if (ret)
 			return ret;
-		cond_resched();
 		ret = process_subpage(addr + right_idx * PAGE_SIZE, right_idx, arg);
 		if (ret)
 			return ret;
@@ -5973,7 +5967,6 @@ static void clear_gigantic_page(struct page *page,
 	might_sleep();
 	for (i = 0; i < pages_per_huge_page; i++) {
 		p = nth_page(page, i);
-		cond_resched();
 		clear_user_highpage(p, addr + i * PAGE_SIZE);
 	}
 }
@@ -6013,7 +6006,6 @@ static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
 		dst_page = folio_page(dst, i);
 		src_page = folio_page(src, i);
 
-		cond_resched();
 		if (copy_mc_user_highpage(dst_page, src_page,
 					  addr + i*PAGE_SIZE, vma)) {
 			memory_failure_queue(page_to_pfn(src_page), 0);
@@ -6085,8 +6077,6 @@ long copy_folio_from_user(struct folio *dst_folio,
 			break;
 
 		flush_dcache_page(subpage);
-
-		cond_resched();
 	}
 	return ret_val;
 }
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1b03f4ec6fd2..2a621f00db1a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -402,7 +402,6 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 					 params->pgmap);
 		if (err)
 			break;
-		cond_resched();
 	}
 	vmemmap_populate_print_last();
 	return err;
@@ -532,8 +531,6 @@ void __ref remove_pfn_range_from_zone(struct zone *zone,
 
 	/* Poison struct pages because they are now uninitialized again. */
 	for (pfn = start_pfn; pfn < end_pfn; pfn += cur_nr_pages) {
-		cond_resched();
-
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages =
 			min(end_pfn - pfn, SECTION_ALIGN_UP(pfn + 1) - pfn);
@@ -580,7 +577,6 @@ void __remove_pages(unsigned long pfn, unsigned long nr_pages,
 	}
 
 	for (; pfn < end_pfn; pfn += cur_nr_pages) {
-		cond_resched();
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages = min(end_pfn - pfn,
 				   SECTION_ALIGN_UP(pfn + 1) - pfn);
@@ -1957,8 +1953,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 				goto failed_removal_isolated;
 			}
 
-			cond_resched();
-
 			ret = scan_movable_pages(pfn, end_pfn, &pfn);
 			if (!ret) {
 				/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 29ebf1e7898c..fa201f89568e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -554,7 +554,6 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 			break;
 	}
 	pte_unmap_unlock(mapped_pte, ptl);
-	cond_resched();
 
 	return addr != end ? -EIO : 0;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index 06086dc9da28..6b0d0d4f07d8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1528,8 +1528,6 @@ static int migrate_hugetlbs(struct list_head *from, new_folio_t get_new_folio,
 
 			nr_pages = folio_nr_pages(folio);
 
-			cond_resched();
-
 			/*
 			 * Migratability of hugepages depends on architectures and
 			 * their size.  This check is necessary because some callers
@@ -1633,8 +1631,6 @@ static int migrate_pages_batch(struct list_head *from,
 			is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
 			nr_pages = folio_nr_pages(folio);
 
-			cond_resched();
-
 			/*
 			 * Large folio migration might be unsupported or
 			 * the allocation might be failed so we should retry
@@ -1754,8 +1750,6 @@ static int migrate_pages_batch(struct list_head *from,
 			is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
 			nr_pages = folio_nr_pages(folio);
 
-			cond_resched();
-
 			rc = migrate_folio_move(put_new_folio, private,
 						folio, dst, mode,
 						reason, ret_folios);
diff --git a/mm/mincore.c b/mm/mincore.c
index dad3622cc963..46a1716621d1 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -151,7 +151,6 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_unmap_unlock(ptep - 1, ptl);
 out:
 	walk->private += nr;
-	cond_resched();
 	return 0;
 }
 
diff --git a/mm/mlock.c b/mm/mlock.c
index 06bdfab83b58..746ca30145b5 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -351,7 +351,6 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_unmap(start_pte);
 out:
 	spin_unlock(ptl);
-	cond_resched();
 	return 0;
 }
 
@@ -696,7 +695,6 @@ static int apply_mlockall_flags(int flags)
 		/* Ignore errors */
 		mlock_fixup(&vmi, vma, &prev, vma->vm_start, vma->vm_end,
 			    newflags);
-		cond_resched();
 	}
 out:
 	return 0;
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 50f2f34745af..88d27009800e 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -892,10 +892,8 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
 		 * such that unmovable allocations won't be scattered all
 		 * over the place during system boot.
 		 */
-		if (pageblock_aligned(pfn)) {
+		if (pageblock_aligned(pfn))
 			set_pageblock_migratetype(page, migratetype);
-			cond_resched();
-		}
 		pfn++;
 	}
 }
@@ -996,10 +994,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
 	 * because this is done early in section_activate()
 	 */
-	if (pageblock_aligned(pfn)) {
+	if (pageblock_aligned(pfn))
 		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-		cond_resched();
-	}
 
 	/*
 	 * ZONE_DEVICE pages are released directly to the driver page allocator
@@ -2163,10 +2159,8 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
 	 * Initialize and free pages in MAX_ORDER sized increments so that we
 	 * can avoid introducing any issues with the buddy allocator.
 	 */
-	while (spfn < end_pfn) {
+	while (spfn < end_pfn)
 		deferred_init_maxorder(&i, zone, &spfn, &epfn);
-		cond_resched();
-	}
 }
 
 /* An arch may override for more concurrency. */
@@ -2365,7 +2359,6 @@ void set_zone_contiguous(struct zone *zone)
 		if (!__pageblock_pfn_to_page(block_start_pfn,
 					     block_end_pfn, zone))
 			return;
-		cond_resched();
 	}
 
 	/* We confirm that there is no hole */
diff --git a/mm/mmap.c b/mm/mmap.c
index 9e018d8dd7d6..436c255f4f45 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3253,7 +3253,6 @@ void exit_mmap(struct mm_struct *mm)
 			nr_accounted += vma_pages(vma);
 		remove_vma(vma, true);
 		count++;
-		cond_resched();
 	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
 
 	BUG_ON(count != mm->map_count);
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 4f559f4ddd21..dbf660a14469 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -98,8 +98,6 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 			free_pages_and_swap_cache(pages, nr);
 			pages += nr;
 			batch->nr -= nr;
-
-			cond_resched();
 		} while (batch->nr);
 	}
 	tlb->active = &tlb->local;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index b94fbb45d5c7..45af8b1aac59 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -423,7 +423,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 			goto again;
 		pages += ret;
 next:
-		cond_resched();
 	} while (pmd++, addr = next, addr != end);
 
 	if (range.start)
diff --git a/mm/mremap.c b/mm/mremap.c
index 382e81c33fc4..26f06349558e 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -514,7 +514,6 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	mmu_notifier_invalidate_range_start(&range);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
-		cond_resched();
 		/*
 		 * If extent is PUD-sized try to speed up the move by moving at the
 		 * PUD level if possible.
diff --git a/mm/nommu.c b/mm/nommu.c
index 7f9e9e5a0e12..54cb28e9919d 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1525,7 +1525,6 @@ void exit_mmap(struct mm_struct *mm)
 	for_each_vma(vmi, vma) {
 		cleanup_vma_from_mm(vma);
 		delete_vma(mm, vma);
-		cond_resched();
 	}
 	__mt_destroy(&mm->mm_mt);
 	mmap_write_unlock(mm);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 61a190b9d83c..582cb5a72467 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2510,7 +2510,6 @@ int write_cache_pages(struct address_space *mapping,
 			}
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 85741403948f..c7e7a236de3d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3418,8 +3418,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	 */
 	count_vm_event(COMPACTFAIL);
 
-	cond_resched();
-
 	return NULL;
 }
 
@@ -3617,8 +3615,6 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 	unsigned int noreclaim_flag;
 	unsigned long progress;
 
-	cond_resched();
-
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
 	fs_reclaim_acquire(gfp_mask);
@@ -3630,8 +3626,6 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 	memalloc_noreclaim_restore(noreclaim_flag);
 	fs_reclaim_release(gfp_mask);
 
-	cond_resched();
-
 	return progress;
 }
 
@@ -3852,13 +3846,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	 * Memory allocation/reclaim might be called from a WQ context and the
 	 * current implementation of the WQ concurrency control doesn't
 	 * recognize that a particular WQ is congested if the worker thread is
-	 * looping without ever sleeping. Therefore we have to do a short sleep
-	 * here rather than calling cond_resched().
+	 * looping without ever sleeping. Therefore do a short sleep here.
 	 */
 	if (current->flags & PF_WQ_WORKER)
 		schedule_timeout_uninterruptible(1);
-	else
-		cond_resched();
+
 	return ret;
 }
 
@@ -4162,7 +4154,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		if (page)
 			goto got_pg;
 
-		cond_resched();
 		goto retry;
 	}
 fail:
diff --git a/mm/page_counter.c b/mm/page_counter.c
index db20d6452b71..c15befd5b02a 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -196,7 +196,6 @@ int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages)
 			return 0;
 
 		counter->max = old;
-		cond_resched();
 	}
 }
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 4548fcc66d74..855271588c8c 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -472,7 +472,6 @@ void __init page_ext_init(void)
 				continue;
 			if (init_section_page_ext(pfn, nid))
 				goto oom;
-			cond_resched();
 		}
 	}
 	hotplug_memory_notifier(page_ext_callback, DEFAULT_CALLBACK_PRI);
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 41ea77f22011..694eb1b14a66 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -151,7 +151,6 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 		}
 		if (bit == BITMAP_CHUNK_BITS - 1)
 			out++;
-		cond_resched();
 	}
 	return (char *)out - buf;
 }
@@ -188,7 +187,6 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
 		}
 		if (bit == BITMAP_CHUNK_BITS - 1)
 			in++;
-		cond_resched();
 	}
 	return (char *)in - buf;
 }
diff --git a/mm/page_io.c b/mm/page_io.c
index fe4c21af23f2..02bbc8165400 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -106,8 +106,6 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
 		unsigned block_in_page;
 		sector_t first_block;
 
-		cond_resched();
-
 		first_block = probe_block;
 		ret = bmap(inode, &first_block);
 		if (ret || !first_block)
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 4e2723e1b300..72278db2f01c 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -680,7 +680,6 @@ static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone)
 ext_put_continue:
 			page_ext_put(page_ext);
 		}
-		cond_resched();
 	}
 
 	pr_info("Node %d, zone %8s: page owner found early allocated %lu pages\n",
diff --git a/mm/percpu.c b/mm/percpu.c
index a7665de8485f..538b63f399ae 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -2015,7 +2015,6 @@ static void pcpu_balance_free(bool empty_only)
 			spin_unlock_irq(&pcpu_lock);
 		}
 		pcpu_destroy_chunk(chunk);
-		cond_resched();
 	}
 	spin_lock_irq(&pcpu_lock);
 }
@@ -2083,7 +2082,6 @@ static void pcpu_balance_populated(void)
 
 			spin_unlock_irq(&pcpu_lock);
 			ret = pcpu_populate_chunk(chunk, rs, rs + nr, gfp);
-			cond_resched();
 			spin_lock_irq(&pcpu_lock);
 			if (!ret) {
 				nr_to_pop -= nr;
@@ -2101,7 +2099,6 @@ static void pcpu_balance_populated(void)
 		/* ran out of chunks to populate, create a new one and retry */
 		spin_unlock_irq(&pcpu_lock);
 		chunk = pcpu_create_chunk(gfp);
-		cond_resched();
 		spin_lock_irq(&pcpu_lock);
 		if (chunk) {
 			pcpu_chunk_relocate(chunk, -1);
@@ -2186,7 +2183,6 @@ static void pcpu_reclaim_populated(void)
 
 			spin_unlock_irq(&pcpu_lock);
 			pcpu_depopulate_chunk(chunk, i + 1, end + 1);
-			cond_resched();
 			spin_lock_irq(&pcpu_lock);
 
 			pcpu_chunk_depopulated(chunk, i + 1, end + 1);
@@ -2203,7 +2199,6 @@ static void pcpu_reclaim_populated(void)
 			pcpu_post_unmap_tlb_flush(chunk,
 						  freed_page_start,
 						  freed_page_end);
-			cond_resched();
 			spin_lock_irq(&pcpu_lock);
 		}
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 9f795b93cf40..c7aec4516309 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2434,7 +2434,6 @@ static void rmap_walk_anon(struct folio *folio,
 		unsigned long address = vma_address(&folio->page, vma);
 
 		VM_BUG_ON_VMA(address == -EFAULT, vma);
-		cond_resched();
 
 		if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
 			continue;
@@ -2495,7 +2494,6 @@ static void rmap_walk_file(struct folio *folio,
 		unsigned long address = vma_address(&folio->page, vma);
 
 		VM_BUG_ON_VMA(address == -EFAULT, vma);
-		cond_resched();
 
 		if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
 			continue;
diff --git a/mm/shmem.c b/mm/shmem.c
index 112172031b2c..0280fe449ad8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -939,7 +939,6 @@ void shmem_unlock_mapping(struct address_space *mapping)
 	       filemap_get_folios(mapping, &index, ~0UL, &fbatch)) {
 		check_move_unevictable_folios(&fbatch);
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 }
 
@@ -1017,7 +1016,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		}
 		folio_batch_remove_exceptionals(&fbatch);
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 
 	/*
@@ -1058,8 +1056,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 
 	index = start;
 	while (index < end) {
-		cond_resched();
-
 		if (!find_get_entries(mapping, &index, end - 1, &fbatch,
 				indices)) {
 			/* If all gone or hole-punch or unfalloc, we're done */
@@ -1394,7 +1390,6 @@ int shmem_unuse(unsigned int type)
 		mutex_unlock(&shmem_swaplist_mutex);
 
 		error = shmem_unuse_inode(&info->vfs_inode, type);
-		cond_resched();
 
 		mutex_lock(&shmem_swaplist_mutex);
 		next = list_next_entry(info, swaplist);
@@ -2832,7 +2827,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			error = -EFAULT;
 			break;
 		}
-		cond_resched();
 	}
 
 	*ppos = ((loff_t) index << PAGE_SHIFT) + offset;
@@ -2986,8 +2980,6 @@ static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
 		in->f_ra.prev_pos = *ppos;
 		if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
 			break;
-
-		cond_resched();
 	} while (len);
 
 	if (folio)
@@ -3155,7 +3147,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		folio_mark_dirty(folio);
 		folio_unlock(folio);
 		folio_put(folio);
-		cond_resched();
 	}
 
 	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
diff --git a/mm/shuffle.c b/mm/shuffle.c
index fb1393b8b3a9..f78f201c773b 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -136,10 +136,12 @@ void __meminit __shuffle_zone(struct zone *z)
 
 		pr_debug("%s: swap: %#lx -> %#lx\n", __func__, i, j);
 
-		/* take it easy on the zone lock */
+		/*
+		 * Drop the zone lock occasionally to allow the scheduler to
+		 * preempt us if needed.
+		 */
 		if ((i % (100 * order_pages)) == 0) {
 			spin_unlock_irqrestore(&z->lock, flags);
-			cond_resched();
 			spin_lock_irqsave(&z->lock, flags);
 		}
 	}
diff --git a/mm/slab.c b/mm/slab.c
index 9ad3d0f2d1a5..7681d2cb5e64 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2196,8 +2196,6 @@ static int drain_freelist(struct kmem_cache *cache,
 		raw_spin_unlock_irq(&n->list_lock);
 		slab_destroy(cache, slab);
 		nr_freed++;
-
-		cond_resched();
 	}
 out:
 	return nr_freed;
@@ -3853,7 +3851,6 @@ static void cache_reap(struct work_struct *w)
 			STATS_ADD_REAPED(searchp, freed);
 		}
 next:
-		cond_resched();
 	}
 	check_irq_on();
 	mutex_unlock(&slab_mutex);
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index db6c4a26cf59..20d2aefbefd6 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -50,8 +50,6 @@ static int swap_cgroup_prepare(int type)
 			goto not_enough_page;
 		ctrl->map[idx] = page;
 
-		if (!(idx % SWAP_CLUSTER_MAX))
-			cond_resched();
 	}
 	return 0;
 not_enough_page:
@@ -223,8 +221,6 @@ void swap_cgroup_swapoff(int type)
 			struct page *page = map[i];
 			if (page)
 				__free_page(page);
-			if (!(i % SWAP_CLUSTER_MAX))
-				cond_resched();
 		}
 		vfree(map);
 	}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e52f486834eb..27db3dcec1a2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -190,7 +190,6 @@ static int discard_swap(struct swap_info_struct *si)
 				nr_blocks, GFP_KERNEL);
 		if (err)
 			return err;
-		cond_resched();
 	}
 
 	for (se = next_se(se); se; se = next_se(se)) {
@@ -201,8 +200,6 @@ static int discard_swap(struct swap_info_struct *si)
 				nr_blocks, GFP_KERNEL);
 		if (err)
 			break;
-
-		cond_resched();
 	}
 	return err;		/* That will often be -EOPNOTSUPP */
 }
@@ -864,7 +861,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 				goto checks;
 			}
 			if (unlikely(--latency_ration < 0)) {
-				cond_resched();
 				latency_ration = LATENCY_LIMIT;
 			}
 		}
@@ -931,7 +927,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 		if (n_ret)
 			goto done;
 		spin_unlock(&si->lock);
-		cond_resched();
 		spin_lock(&si->lock);
 		latency_ration = LATENCY_LIMIT;
 	}
@@ -974,7 +969,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	spin_unlock(&si->lock);
 	while (++offset <= READ_ONCE(si->highest_bit)) {
 		if (unlikely(--latency_ration < 0)) {
-			cond_resched();
 			latency_ration = LATENCY_LIMIT;
 			scanned_many = true;
 		}
@@ -984,7 +978,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	offset = si->lowest_bit;
 	while (offset < scan_base) {
 		if (unlikely(--latency_ration < 0)) {
-			cond_resched();
 			latency_ration = LATENCY_LIMIT;
 			scanned_many = true;
 		}
@@ -1099,7 +1092,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 		spin_unlock(&si->lock);
 		if (n_ret || size == SWAPFILE_CLUSTER)
 			goto check_out;
-		cond_resched();
 
 		spin_lock(&swap_avail_lock);
 nextsi:
@@ -1914,7 +1906,6 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		cond_resched();
 		next = pmd_addr_end(addr, end);
 		ret = unuse_pte_range(vma, pmd, addr, next, type);
 		if (ret)
@@ -1997,8 +1988,6 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type)
 			if (ret)
 				break;
 		}
-
-		cond_resched();
 	}
 	mmap_read_unlock(mm);
 	return ret;
@@ -2025,8 +2014,6 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
 		count = READ_ONCE(si->swap_map[i]);
 		if (count && swap_count(count) != SWAP_MAP_BAD)
 			break;
-		if ((i % LATENCY_LIMIT) == 0)
-			cond_resched();
 	}
 
 	if (i == si->max)
@@ -2079,7 +2066,6 @@ static int try_to_unuse(unsigned int type)
 		 * Make sure that we aren't completely killing
 		 * interactive performance.
 		 */
-		cond_resched();
 		spin_lock(&mmlist_lock);
 	}
 	spin_unlock(&mmlist_lock);
diff --git a/mm/truncate.c b/mm/truncate.c
index 8e3aa9e8618e..9efcec90f24d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -369,7 +369,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		for (i = 0; i < folio_batch_count(&fbatch); i++)
 			folio_unlock(fbatch.folios[i]);
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 
 	same_folio = (lstart >> PAGE_SHIFT) == (lend >> PAGE_SHIFT);
@@ -399,7 +398,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 
 	index = start;
 	while (index < end) {
-		cond_resched();
 		if (!find_get_entries(mapping, &index, end - 1, &fbatch,
 				indices)) {
 			/* If all gone from start onwards, we're done */
@@ -533,7 +531,6 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
 		}
 		folio_batch_remove_exceptionals(&fbatch);
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 	return count;
 }
@@ -677,7 +674,6 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		}
 		folio_batch_remove_exceptionals(&fbatch);
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 	/*
 	 * For DAX we invalidate page tables after invalidating page cache.  We
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 96d9eae5c7cc..89127f6b8bd7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -459,8 +459,6 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 		hugetlb_vma_unlock_read(dst_vma);
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 
-		cond_resched();
-
 		if (unlikely(err == -ENOENT)) {
 			mmap_read_unlock(dst_mm);
 			BUG_ON(!folio);
@@ -677,7 +675,6 @@ static __always_inline ssize_t mfill_atomic(struct mm_struct *dst_mm,
 
 		err = mfill_atomic_pte(dst_pmd, dst_vma, dst_addr,
 				       src_addr, flags, &folio);
-		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
 			void *kaddr;
diff --git a/mm/util.c b/mm/util.c
index 8cbbfd3a3d59..3bc08be921fa 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -796,7 +796,6 @@ void folio_copy(struct folio *dst, struct folio *src)
 		copy_highpage(folio_page(dst, i), folio_page(src, i));
 		if (++i == nr)
 			break;
-		cond_resched();
 	}
 }
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a3fedb3ee0db..7d2b76cde1a7 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -351,8 +351,6 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		vunmap_pte_range(pmd, addr, next, mask);
-
-		cond_resched();
 	} while (pmd++, addr = next, addr != end);
 }
 
@@ -2840,7 +2838,6 @@ void vfree(const void *addr)
 		 * can be freed as an array of order-0 allocations
 		 */
 		__free_page(page);
-		cond_resched();
 	}
 	atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
 	kvfree(vm->pages);
@@ -3035,7 +3032,6 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 							pages + nr_allocated);
 
 			nr_allocated += nr;
-			cond_resched();
 
 			/*
 			 * If zero or pages were obtained partly,
@@ -3091,7 +3087,6 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 		for (i = 0; i < (1U << order); i++)
 			pages[nr_allocated + i] = page + i;
 
-		cond_resched();
 		nr_allocated += 1U << order;
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6f13394b112e..e12f9fd27002 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -905,8 +905,6 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
 		total_scan -= shrinkctl->nr_scanned;
 		scanned += shrinkctl->nr_scanned;
-
-		cond_resched();
 	}
 
 	/*
@@ -1074,7 +1072,6 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 
 	up_read(&shrinker_rwsem);
 out:
-	cond_resched();
 	return freed;
 }
 
@@ -1204,7 +1201,6 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 	 */
 	if (!current_is_kswapd() &&
 	    current->flags & (PF_USER_WORKER|PF_KTHREAD)) {
-		cond_resched();
 		return;
 	}
 
@@ -1232,7 +1228,6 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 		fallthrough;
 	case VMSCAN_THROTTLE_NOPROGRESS:
 		if (skip_throttle_noprogress(pgdat)) {
-			cond_resched();
 			return;
 		}
 
@@ -1715,7 +1710,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	struct swap_iocb *plug = NULL;
 
 	memset(stat, 0, sizeof(*stat));
-	cond_resched();
 	do_demote_pass = can_demote(pgdat->node_id, sc);
 
 retry:
@@ -1726,8 +1720,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		bool dirty, writeback;
 		unsigned int nr_pages;
 
-		cond_resched();
-
 		folio = lru_to_folio(folio_list);
 		list_del(&folio->lru);
 
@@ -2719,7 +2711,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	while (!list_empty(&l_hold)) {
 		struct folio *folio;
 
-		cond_resched();
 		folio = lru_to_folio(&l_hold);
 		list_del(&folio->lru);
 
@@ -4319,8 +4310,6 @@ static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_
 			reset_batch_size(lruvec, walk);
 			spin_unlock_irq(&lruvec->lru_lock);
 		}
-
-		cond_resched();
 	} while (err == -EAGAIN);
 }
 
@@ -4455,7 +4444,6 @@ static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
 			continue;
 
 		spin_unlock_irq(&lruvec->lru_lock);
-		cond_resched();
 		goto restart;
 	}
 
@@ -4616,8 +4604,6 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 			mem_cgroup_iter_break(NULL, memcg);
 			return;
 		}
-
-		cond_resched();
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
 
 	/*
@@ -5378,8 +5364,6 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 
 		if (sc->nr_reclaimed >= nr_to_reclaim)
 			break;
-
-		cond_resched();
 	}
 
 	/* whether try_to_inc_max_seq() was successful */
@@ -5718,14 +5702,11 @@ static void lru_gen_change_state(bool enabled)
 
 			while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
 				spin_unlock_irq(&lruvec->lru_lock);
-				cond_resched();
 				spin_lock_irq(&lruvec->lru_lock);
 			}
 
 			spin_unlock_irq(&lruvec->lru_lock);
 		}
-
-		cond_resched();
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
 unlock:
 	mutex_unlock(&state_mutex);
@@ -6026,8 +6007,6 @@ static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_co
 
 		if (!evict_folios(lruvec, sc, swappiness))
 			return 0;
-
-		cond_resched();
 	}
 
 	return -EINTR;
@@ -6321,8 +6300,6 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 			}
 		}
 
-		cond_resched();
-
 		if (nr_reclaimed < nr_to_reclaim || proportional_reclaim)
 			continue;
 
@@ -6473,10 +6450,9 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		 * This loop can become CPU-bound when target memcgs
 		 * aren't eligible for reclaim - either because they
 		 * don't have any reclaimable pages, or because their
-		 * memory is explicitly protected. Avoid soft lockups.
+		 * memory is explicitly protected. We don't disable
+		 * preemption, so expect to be preempted.
 		 */
-		cond_resched();
-
 		mem_cgroup_calculate_protection(target_memcg, memcg);
 
 		if (mem_cgroup_below_min(target_memcg, memcg)) {
@@ -8024,7 +8000,6 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
 					   sc.gfp_mask);
 
-	cond_resched();
 	psi_memstall_enter(&pflags);
 	fs_reclaim_acquire(sc.gfp_mask);
 	/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 00e81e99c6ee..de61cc004865 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -835,7 +835,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 #ifdef CONFIG_NUMA
 
 		if (do_pagesets) {
-			cond_resched();
 			/*
 			 * Deal with draining the remote pageset of this
 			 * processor
@@ -1525,7 +1524,6 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 			}
 			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
 			spin_unlock_irq(&zone->lock);
-			cond_resched();
 			spin_lock_irq(&zone->lock);
 		}
 		seq_putc(m, '\n');
@@ -2041,8 +2039,6 @@ static void vmstat_shepherd(struct work_struct *w)
 
 		if (!delayed_work_pending(dw) && need_update(cpu))
 			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
-
-		cond_resched();
 	}
 	cpus_read_unlock();
 
diff --git a/mm/workingset.c b/mm/workingset.c
index da58a26d0d4d..ba94e5fb8390 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -750,7 +750,6 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	}
 	ret = LRU_REMOVED_RETRY;
 out:
-	cond_resched();
 	spin_lock_irq(lru_lock);
 	return ret;
 }
diff --git a/mm/z3fold.c b/mm/z3fold.c
index 7c76b396b74c..2614236c2212 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -175,7 +175,7 @@ enum z3fold_handle_flags {
 /*
  * Forward declarations
  */
-static struct z3fold_header *__z3fold_alloc(struct z3fold_pool *, size_t, bool);
+static struct z3fold_header *__z3fold_alloc(struct z3fold_pool *, size_t);
 static void compact_page_work(struct work_struct *w);
 
 /*****************
@@ -504,7 +504,6 @@ static void free_pages_work(struct work_struct *w)
 		spin_unlock(&pool->stale_lock);
 		cancel_work_sync(&zhdr->work);
 		free_z3fold_page(page, false);
-		cond_resched();
 		spin_lock(&pool->stale_lock);
 	}
 	spin_unlock(&pool->stale_lock);
@@ -629,7 +628,7 @@ static struct z3fold_header *compact_single_buddy(struct z3fold_header *zhdr)
 		short chunks = size_to_chunks(sz);
 		void *q;
 
-		new_zhdr = __z3fold_alloc(pool, sz, false);
+		new_zhdr = __z3fold_alloc(pool, sz);
 		if (!new_zhdr)
 			return NULL;
 
@@ -783,7 +782,7 @@ static void compact_page_work(struct work_struct *w)
 
 /* returns _locked_ z3fold page header or NULL */
 static inline struct z3fold_header *__z3fold_alloc(struct z3fold_pool *pool,
-						size_t size, bool can_sleep)
+						   size_t size)
 {
 	struct z3fold_header *zhdr = NULL;
 	struct page *page;
@@ -811,8 +810,6 @@ static inline struct z3fold_header *__z3fold_alloc(struct z3fold_pool *pool,
 			spin_unlock(&pool->lock);
 			zhdr = NULL;
 			migrate_enable();
-			if (can_sleep)
-				cond_resched();
 			goto lookup;
 		}
 		list_del_init(&zhdr->buddy);
@@ -825,8 +822,6 @@ static inline struct z3fold_header *__z3fold_alloc(struct z3fold_pool *pool,
 			z3fold_page_unlock(zhdr);
 			zhdr = NULL;
 			migrate_enable();
-			if (can_sleep)
-				cond_resched();
 			goto lookup;
 		}
 
@@ -869,8 +864,6 @@ static inline struct z3fold_header *__z3fold_alloc(struct z3fold_pool *pool,
 			    test_bit(PAGE_CLAIMED, &page->private)) {
 				z3fold_page_unlock(zhdr);
 				zhdr = NULL;
-				if (can_sleep)
-					cond_resched();
 				continue;
 			}
 			kref_get(&zhdr->refcount);
@@ -1016,7 +1009,7 @@ static int z3fold_alloc(struct z3fold_pool *pool, size_t size, gfp_t gfp,
 		bud = HEADLESS;
 	else {
 retry:
-		zhdr = __z3fold_alloc(pool, size, can_sleep);
+		zhdr = __z3fold_alloc(pool, size);
 		if (zhdr) {
 			bud = get_free_buddy(zhdr, chunks);
 			if (bud == HEADLESS) {
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index b58f957429f0..e6fe6522c845 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -2029,7 +2029,6 @@ static unsigned long __zs_compact(struct zs_pool *pool,
 			dst_zspage = NULL;
 
 			spin_unlock(&pool->lock);
-			cond_resched();
 			spin_lock(&pool->lock);
 		}
 	}
diff --git a/mm/zswap.c b/mm/zswap.c
index 37d2b1cb2ecb..ad6d67ebbf70 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -704,7 +704,6 @@ static void shrink_worker(struct work_struct *w)
 			if (++failures == MAX_RECLAIM_RETRIES)
 				break;
 		}
-		cond_resched();
 	} while (!zswap_can_accept());
 	zswap_pool_put(pool);
 }
-- 
2.31.1
Re: [RFC PATCH 68/86] treewide: mm: remove cond_resched()
Posted by Sergey Senozhatsky 2 years, 1 month ago
On (23/11/07 15:08), Ankur Arora wrote:
[..]
> +++ b/mm/zsmalloc.c
> @@ -2029,7 +2029,6 @@ static unsigned long __zs_compact(struct zs_pool *pool,
>  			dst_zspage = NULL;
>  
>  			spin_unlock(&pool->lock);
> -			cond_resched();
>  			spin_lock(&pool->lock);
>  		}
>  	}

I'd personally prefer to have a comment explaining why we do that
spin_unlock/spin_lock sequence, which may look confusing to people.

Maybe would make sense to put a nice comment in all similar cases.
For instance:

  	rcu_read_unlock();
 -	cond_resched();
  	rcu_read_lock();
Re: [RFC PATCH 68/86] treewide: mm: remove cond_resched()
Posted by Vlastimil Babka 2 years, 1 month ago
On 11/8/23 02:28, Sergey Senozhatsky wrote:
> On (23/11/07 15:08), Ankur Arora wrote:
> [..]
>> +++ b/mm/zsmalloc.c
>> @@ -2029,7 +2029,6 @@ static unsigned long __zs_compact(struct zs_pool *pool,
>>  			dst_zspage = NULL;
>>  
>>  			spin_unlock(&pool->lock);
>> -			cond_resched();
>>  			spin_lock(&pool->lock);
>>  		}
>>  	}
> 
> I'd personally prefer to have a comment explaining why we do that
> spin_unlock/spin_lock sequence, which may look confusing to people.

Wonder if it would make sense to have a lock operation that does the
unlock/lock as a self-documenting thing, and maybe could also be optimized
to first check if there's actually a need for it (because TIF_NEED_RESCHED
or lock is contended).
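
Roughly, such a helper would boil down to something like the below
(sketch only; the exact checks and the naming are to be decided):

	/* drop and re-take the lock only if there is a reason to */
	if (need_resched() || spin_is_contended(&pool->lock)) {
		spin_unlock(&pool->lock);
		spin_lock(&pool->lock);
	}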

> Maybe would make sense to put a nice comment in all similar cases.
> For instance:
> 
>   	rcu_read_unlock();
>  -	cond_resched();
>   	rcu_read_lock();
Re: [RFC PATCH 68/86] treewide: mm: remove cond_resched()
Posted by Yosry Ahmed 2 years, 1 month ago
On Tue, Nov 7, 2023 at 11:49 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 11/8/23 02:28, Sergey Senozhatsky wrote:
> > On (23/11/07 15:08), Ankur Arora wrote:
> > [..]
> >> +++ b/mm/zsmalloc.c
> >> @@ -2029,7 +2029,6 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> >>                      dst_zspage = NULL;
> >>
> >>                      spin_unlock(&pool->lock);
> >> -                    cond_resched();
> >>                      spin_lock(&pool->lock);
> >>              }
> >>      }
> >
> > I'd personally prefer to have a comment explaining why we do that
> > spin_unlock/spin_lock sequence, which may look confusing to people.
>
> Wonder if it would make sense to have a lock operation that does the
> unlock/lock as a self-documenting thing, and maybe could also be optimized
> to first check if there's a actually a need for it (because TIF_NEED_RESCHED
> or lock is contended).

+1, I was going to suggest this as well. It can be extended to other
locking types that disable preemption as well like RCU. Something like
spin_lock_relax() or something.

>
> > Maybe would make sense to put a nice comment in all similar cases.
> > For instance:
> >
> >       rcu_read_unlock();
> >  -    cond_resched();
> >       rcu_read_lock();
>
>
Re: [RFC PATCH 68/86] treewide: mm: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
Yosry Ahmed <yosryahmed@google.com> writes:

> On Tue, Nov 7, 2023 at 11:49 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 11/8/23 02:28, Sergey Senozhatsky wrote:
>> > On (23/11/07 15:08), Ankur Arora wrote:
>> > [..]
>> >> +++ b/mm/zsmalloc.c
>> >> @@ -2029,7 +2029,6 @@ static unsigned long __zs_compact(struct zs_pool *pool,
>> >>                      dst_zspage = NULL;
>> >>
>> >>                      spin_unlock(&pool->lock);
>> >> -                    cond_resched();
>> >>                      spin_lock(&pool->lock);
>> >>              }
>> >>      }
>> >
>> > I'd personally prefer to have a comment explaining why we do that
>> > spin_unlock/spin_lock sequence, which may look confusing to people.
>>
>> Wonder if it would make sense to have a lock operation that does the
>> unlock/lock as a self-documenting thing, and maybe could also be optimized
>> to first check if there's a actually a need for it (because TIF_NEED_RESCHED
>> or lock is contended).
>
> +1, I was going to suggest this as well. It can be extended to other
> locking types that disable preemption as well like RCU. Something like
> spin_lock_relax() or something.

Good point. We actually do have exactly that: cond_resched_lock(). (And
similar RW lock variants.)

>> > Maybe would make sense to put a nice comment in all similar cases.
>> > For instance:
>> >
>> >       rcu_read_unlock();
>> >  -    cond_resched();
>> >       rcu_read_lock();

And we have this construct as well: cond_resched_rcu().
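
For the two patterns quoted in this thread that would amount to,
roughly (sketch):

	-	spin_unlock(&pool->lock);
	-	cond_resched();
	-	spin_lock(&pool->lock);
	+	cond_resched_lock(&pool->lock);

	-	rcu_read_unlock();
	-	cond_resched();
	-	rcu_read_lock();
	+	cond_resched_rcu();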

I can switch to the alternatives when I send out the next version of
the series.

Thanks

--
ankur
Re: [RFC PATCH 68/86] treewide: mm: remove cond_resched()
Posted by Matthew Wilcox 2 years, 1 month ago
On Wed, Nov 08, 2023 at 12:54:19AM -0800, Ankur Arora wrote:
> Yosry Ahmed <yosryahmed@google.com> writes:
> > On Tue, Nov 7, 2023 at 11:49 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> On 11/8/23 02:28, Sergey Senozhatsky wrote:
> >> > I'd personally prefer to have a comment explaining why we do that
> >> > spin_unlock/spin_lock sequence, which may look confusing to people.
> >>
> >> Wonder if it would make sense to have a lock operation that does the
> >> unlock/lock as a self-documenting thing, and maybe could also be optimized
> >> to first check if there's a actually a need for it (because TIF_NEED_RESCHED
> >> or lock is contended).
> >
> > +1, I was going to suggest this as well. It can be extended to other
> > locking types that disable preemption as well like RCU. Something like
> > spin_lock_relax() or something.
> 
> Good point. We actually do have exactly that: cond_resched_lock(). (And
> similar RW lock variants.)

That's a shame; I was going to suggest calling it spin_cycle() ...
Re: [RFC PATCH 68/86] treewide: mm: remove cond_resched()
Posted by Steven Rostedt 2 years, 1 month ago
On Wed, 8 Nov 2023 12:58:49 +0000
Matthew Wilcox <willy@infradead.org> wrote:

> > Good point. We actually do have exactly that: cond_resched_lock(). (And
> > similar RW lock variants.)  
> 
> That's a shame; I was going to suggest calling it spin_cycle() ...

  Then wash.. rinse.. repeat!

-- Steve
[RFC PATCH 69/86] treewide: io_uring: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for say code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.
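
As an illustration, a typical set-2 site is just an open-coded
cond_resched_lock() (sketch, not lifted from any particular file):

	spin_unlock(&lock);
	cond_resched();
	spin_lock(&lock);

With the lock dropped the task is preemptible anyway, so the explicit
call adds nothing.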

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All of the uses of cond_resched() are from set-1 or set-2.
Remove them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 io_uring/io-wq.c    |  4 +---
 io_uring/io_uring.c | 21 ++++++++++++---------
 io_uring/kbuf.c     |  2 --
 io_uring/sqpoll.c   |  6 ++++--
 io_uring/tctx.c     |  4 +---
 5 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..fcaf9161be03 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -532,10 +532,8 @@ static struct io_wq_work *io_get_next_work(struct io_wq_acct *acct,
 static void io_assign_current_work(struct io_worker *worker,
 				   struct io_wq_work *work)
 {
-	if (work) {
+	if (work)
 		io_run_task_work();
-		cond_resched();
-	}
 
 	raw_spin_lock(&worker->lock);
 	worker->cur_work = work;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 8d1bc6cdfe71..547b7c6bdc68 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1203,9 +1203,14 @@ static unsigned int handle_tw_list(struct llist_node *node,
 		node = next;
 		count++;
 		if (unlikely(need_resched())) {
+
+			/*
+			 * Depending on whether we have PREEMPT_RCU or not, the
+			 * mutex_unlock() or percpu_ref_put() should cause us to
+			 * reschedule.
+			 */
 			ctx_flush_and_put(*ctx, ts);
 			*ctx = NULL;
-			cond_resched();
 		}
 	}
 
@@ -1611,7 +1616,6 @@ static __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
 		 */
 		if (need_resched()) {
 			mutex_unlock(&ctx->uring_lock);
-			cond_resched();
 			mutex_lock(&ctx->uring_lock);
 		}
 	}
@@ -1977,7 +1981,6 @@ void io_wq_submit_work(struct io_wq_work *work)
 				break;
 			if (io_wq_worker_stopped())
 				break;
-			cond_resched();
 			continue;
 		}
 
@@ -2649,7 +2652,6 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 			ret = 0;
 			break;
 		}
-		cond_resched();
 	} while (1);
 
 	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
@@ -3096,8 +3098,12 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 		if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
 			io_move_task_work_from_local(ctx);
 
+		/*
+		 * io_uring_try_cancel_requests() will reschedule when needed
+		 * in the mutex_unlock().
+		 */
 		while (io_uring_try_cancel_requests(ctx, NULL, true))
-			cond_resched();
+			;
 
 		if (ctx->sq_data) {
 			struct io_sq_data *sqd = ctx->sq_data;
@@ -3313,7 +3319,6 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 		while (!wq_list_empty(&ctx->iopoll_list)) {
 			io_iopoll_try_reap_events(ctx);
 			ret = true;
-			cond_resched();
 		}
 	}
 
@@ -3382,10 +3387,8 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
 								     cancel_all);
 		}
 
-		if (loop) {
-			cond_resched();
+		if (loop)
 			continue;
-		}
 
 		prepare_to_wait(&tctx->wait, &wait, TASK_INTERRUPTIBLE);
 		io_run_task_work();
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 9123138aa9f4..ef94a7c76d9a 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -246,7 +246,6 @@ static int __io_remove_buffers(struct io_ring_ctx *ctx,
 		list_move(&nxt->list, &ctx->io_buffers_cache);
 		if (++i == nbufs)
 			return i;
-		cond_resched();
 	}
 
 	return i;
@@ -421,7 +420,6 @@ static int io_add_buffers(struct io_ring_ctx *ctx, struct io_provide_buf *pbuf,
 		buf->bgid = pbuf->bgid;
 		addr += pbuf->len;
 		bid++;
-		cond_resched();
 	}
 
 	return i ? 0 : -ENOMEM;
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index bd6c2c7959a5..b297b7b8047e 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -212,7 +212,6 @@ static bool io_sqd_handle_event(struct io_sq_data *sqd)
 		mutex_unlock(&sqd->lock);
 		if (signal_pending(current))
 			did_sig = get_signal(&ksig);
-		cond_resched();
 		mutex_lock(&sqd->lock);
 	}
 	return did_sig || test_bit(IO_SQ_THREAD_SHOULD_STOP, &sqd->state);
@@ -258,8 +257,11 @@ static int io_sq_thread(void *data)
 			if (sqt_spin)
 				timeout = jiffies + sqd->sq_thread_idle;
 			if (unlikely(need_resched())) {
+				/*
+				 * Drop the mutex and reacquire so a reschedule can
+				 * happen on unlock.
+				 */
 				mutex_unlock(&sqd->lock);
-				cond_resched();
 				mutex_lock(&sqd->lock);
 			}
 			continue;
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index c043fe93a3f2..1bf58f01e50c 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -181,10 +181,8 @@ __cold void io_uring_clean_tctx(struct io_uring_task *tctx)
 	struct io_tctx_node *node;
 	unsigned long index;
 
-	xa_for_each(&tctx->xa, index, node) {
+	xa_for_each(&tctx->xa, index, node)
 		io_uring_del_tctx_node(index);
-		cond_resched();
-	}
 	if (wq) {
 		/*
 		 * Must be after io_uring_del_tctx_node() (removes nodes under
-- 
2.31.1
[RFC PATCH 70/86] treewide: ipc: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for say code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All calls to cond_resched() are from set-1, from potentially long
running loops. Remove them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Davidlohr Bueso <dave@stgolabs.net> 
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> 
Cc: Manfred Spraul <manfred@colorfullife.com> 
Cc: Andrew Morton <akpm@linux-foundation.org> 
Cc: Jann Horn <jannh@google.com> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 ipc/msgutil.c | 3 ---
 ipc/sem.c     | 2 --
 2 files changed, 5 deletions(-)

diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index d0a0e877cadd..d9d1b7957bb6 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -62,8 +62,6 @@ static struct msg_msg *alloc_msg(size_t len)
 	while (len > 0) {
 		struct msg_msgseg *seg;
 
-		cond_resched();
-
 		alen = min(len, DATALEN_SEG);
 		seg = kmalloc(sizeof(*seg) + alen, GFP_KERNEL_ACCOUNT);
 		if (seg == NULL)
@@ -177,7 +175,6 @@ void free_msg(struct msg_msg *msg)
 	while (seg != NULL) {
 		struct msg_msgseg *tmp = seg->next;
 
-		cond_resched();
 		kfree(seg);
 		seg = tmp;
 	}
diff --git a/ipc/sem.c b/ipc/sem.c
index a39cdc7bf88f..e12ab01161f6 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -2350,8 +2350,6 @@ void exit_sem(struct task_struct *tsk)
 		int semid, i;
 		DEFINE_WAKE_Q(wake_q);
 
-		cond_resched();
-
 		rcu_read_lock();
 		un = list_entry_rcu(ulp->list_proc.next,
 				    struct sem_undo, list_proc);
-- 
2.31.1
[RFC PATCH 71/86] treewide: lib: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for say code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Almost all the cond_resched() calls are from set-1. Remove them.
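
For illustration, the shape being removed is just a bare call inside a
potentially long-running loop, e.g. (made-up loop, not from the files
below):

	for (i = 0; i < nr_tests; i++) {
		run_one_test(i);
		cond_resched();	/* only here to keep the watchdog quiet */
	}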

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net> 
Cc: Kees Cook <keescook@chromium.org> 
Cc: Eric Dumazet <edumazet@google.com> 
Cc: Jakub Kicinski <kuba@kernel.org> 
Cc: Paolo Abeni <pabeni@redhat.com> 
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 lib/crc32test.c          |  2 --
 lib/crypto/mpi/mpi-pow.c |  1 -
 lib/memcpy_kunit.c       |  5 -----
 lib/random32.c           |  1 -
 lib/rhashtable.c         |  2 --
 lib/test_bpf.c           |  3 ---
 lib/test_lockup.c        |  2 +-
 lib/test_maple_tree.c    |  8 --------
 lib/test_rhashtable.c    | 10 ----------
 9 files changed, 1 insertion(+), 33 deletions(-)

diff --git a/lib/crc32test.c b/lib/crc32test.c
index 9b4af79412c4..3eee90482e9a 100644
--- a/lib/crc32test.c
+++ b/lib/crc32test.c
@@ -729,7 +729,6 @@ static int __init crc32c_combine_test(void)
 			      crc_full == test[i].crc32c_le))
 				errors++;
 			runs++;
-			cond_resched();
 		}
 	}
 
@@ -817,7 +816,6 @@ static int __init crc32_combine_test(void)
 			      crc_full == test[i].crc_le))
 				errors++;
 			runs++;
-			cond_resched();
 		}
 	}
 
diff --git a/lib/crypto/mpi/mpi-pow.c b/lib/crypto/mpi/mpi-pow.c
index 2fd7a46d55ec..074534900b7e 100644
--- a/lib/crypto/mpi/mpi-pow.c
+++ b/lib/crypto/mpi/mpi-pow.c
@@ -242,7 +242,6 @@ int mpi_powm(MPI res, MPI base, MPI exp, MPI mod)
 				}
 				e <<= 1;
 				c--;
-				cond_resched();
 			}
 
 			i--;
diff --git a/lib/memcpy_kunit.c b/lib/memcpy_kunit.c
index 440aee705ccc..c2a6b09fe93a 100644
--- a/lib/memcpy_kunit.c
+++ b/lib/memcpy_kunit.c
@@ -361,8 +361,6 @@ static void copy_large_test(struct kunit *test, bool use_memmove)
 			/* Zero out what we copied for the next cycle. */
 			memset(large_dst + offset, 0, bytes);
 		}
-		/* Avoid stall warnings if this loop gets slow. */
-		cond_resched();
 	}
 }
 
@@ -489,9 +487,6 @@ static void memmove_overlap_test(struct kunit *test)
 			for (int s_off = s_start; s_off < s_end;
 			     s_off = next_step(s_off, s_start, s_end, window_step))
 				inner_loop(test, bytes, d_off, s_off);
-
-			/* Avoid stall warnings. */
-			cond_resched();
 		}
 	}
 }
diff --git a/lib/random32.c b/lib/random32.c
index 32060b852668..10bc804d99d6 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -287,7 +287,6 @@ static int __init prandom_state_selftest(void)
 			errors++;
 
 		runs++;
-		cond_resched();
 	}
 
 	if (errors)
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 6ae2ba8e06a2..5ff0f521bf29 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -328,7 +328,6 @@ static int rhashtable_rehash_table(struct rhashtable *ht)
 		err = rhashtable_rehash_chain(ht, old_hash);
 		if (err)
 			return err;
-		cond_resched();
 	}
 
 	/* Publish the new table pointer. */
@@ -1147,7 +1146,6 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
 		for (i = 0; i < tbl->size; i++) {
 			struct rhash_head *pos, *next;
 
-			cond_resched();
 			for (pos = rht_ptr_exclusive(rht_bucket(tbl, i)),
 			     next = !rht_is_a_nulls(pos) ?
 					rht_dereference(pos->next, ht) : NULL;
diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index ecde4216201e..15b4d32712d8 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -14758,7 +14758,6 @@ static __init int test_skb_segment(void)
 	for (i = 0; i < ARRAY_SIZE(skb_segment_tests); i++) {
 		const struct skb_segment_test *test = &skb_segment_tests[i];
 
-		cond_resched();
 		if (exclude_test(i))
 			continue;
 
@@ -14787,7 +14786,6 @@ static __init int test_bpf(void)
 		struct bpf_prog *fp;
 		int err;
 
-		cond_resched();
 		if (exclude_test(i))
 			continue;
 
@@ -15171,7 +15169,6 @@ static __init int test_tail_calls(struct bpf_array *progs)
 		u64 duration;
 		int ret;
 
-		cond_resched();
 		if (exclude_test(i))
 			continue;
 
diff --git a/lib/test_lockup.c b/lib/test_lockup.c
index c3fd87d6c2dd..9af5d34c98f6 100644
--- a/lib/test_lockup.c
+++ b/lib/test_lockup.c
@@ -381,7 +381,7 @@ static void test_lockup(bool master)
 			touch_nmi_watchdog();
 
 		if (call_cond_resched)
-			cond_resched();
+			cond_resched_stall();
 
 		test_wait(cooldown_secs, cooldown_nsecs);
 
diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
index 464eeb90d5ad..321fd5d8aef3 100644
--- a/lib/test_maple_tree.c
+++ b/lib/test_maple_tree.c
@@ -2672,7 +2672,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
 		rcu_barrier();
 	}
 
-	cond_resched();
 	mt_cache_shrink();
 	/* Check with a value at zero, no gap */
 	for (i = 1000; i < 2000; i++) {
@@ -2682,7 +2681,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
 		rcu_barrier();
 	}
 
-	cond_resched();
 	mt_cache_shrink();
 	/* Check with a value at zero and unreasonably large */
 	for (i = big_start; i < big_start + 10; i++) {
@@ -2692,7 +2690,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
 		rcu_barrier();
 	}
 
-	cond_resched();
 	mt_cache_shrink();
 	/* Small to medium size not starting at zero*/
 	for (i = 200; i < 1000; i++) {
@@ -2702,7 +2699,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
 		rcu_barrier();
 	}
 
-	cond_resched();
 	mt_cache_shrink();
 	/* Unreasonably large not starting at zero*/
 	for (i = big_start; i < big_start + 10; i++) {
@@ -2710,7 +2706,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
 		check_dup_gaps(mt, i, false, 5);
 		mtree_destroy(mt);
 		rcu_barrier();
-		cond_resched();
 		mt_cache_shrink();
 	}
 
@@ -2720,7 +2715,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
 		check_dup_gaps(mt, i, false, 5);
 		mtree_destroy(mt);
 		rcu_barrier();
-		cond_resched();
 		if (i % 2 == 0)
 			mt_cache_shrink();
 	}
@@ -2732,7 +2726,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
 		check_dup_gaps(mt, i, true, 5);
 		mtree_destroy(mt);
 		rcu_barrier();
-		cond_resched();
 	}
 
 	mt_cache_shrink();
@@ -2743,7 +2736,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
 		mtree_destroy(mt);
 		rcu_barrier();
 		mt_cache_shrink();
-		cond_resched();
 	}
 }
 
diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index c20f6cb4bf55..e5d1f272f2c6 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -119,7 +119,6 @@ static int insert_retry(struct rhashtable *ht, struct test_obj *obj,
 
 	do {
 		retries++;
-		cond_resched();
 		err = rhashtable_insert_fast(ht, &obj->node, params);
 		if (err == -ENOMEM && enomem_retry) {
 			enomem_retries++;
@@ -253,8 +252,6 @@ static s64 __init test_rhashtable(struct rhashtable *ht, struct test_obj *array,
 
 			rhashtable_remove_fast(ht, &obj->node, test_rht_params);
 		}
-
-		cond_resched();
 	}
 
 	end = ktime_get_ns();
@@ -371,8 +368,6 @@ static int __init test_rhltable(unsigned int entries)
 		u32 i = get_random_u32_below(entries);
 		u32 prand = get_random_u32_below(4);
 
-		cond_resched();
-
 		err = rhltable_remove(&rhlt, &rhl_test_objects[i].list_node, test_rht_params);
 		if (test_bit(i, obj_in_table)) {
 			clear_bit(i, obj_in_table);
@@ -412,7 +407,6 @@ static int __init test_rhltable(unsigned int entries)
 	}
 
 	for (i = 0; i < entries; i++) {
-		cond_resched();
 		err = rhltable_remove(&rhlt, &rhl_test_objects[i].list_node, test_rht_params);
 		if (test_bit(i, obj_in_table)) {
 			if (WARN(err, "cannot remove element at slot %d", i))
@@ -607,8 +601,6 @@ static int thread_lookup_test(struct thread_data *tdata)
 			       obj->value.tid, obj->value.id, key.tid, key.id);
 			err++;
 		}
-
-		cond_resched();
 	}
 	return err;
 }
@@ -660,8 +652,6 @@ static int threadfunc(void *data)
 				goto out;
 			}
 			tdata->objs[i].value.id = TEST_INSERT_FAIL;
-
-			cond_resched();
 		}
 		err = thread_lookup_test(tdata);
 		if (err) {
-- 
2.31.1
Re: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
Posted by Kees Cook 2 years, 1 month ago
On Tue, Nov 07, 2023 at 03:08:07PM -0800, Ankur Arora wrote:
> There are broadly three sets of uses of cond_resched():
> 
> 1.  Calls to cond_resched() out of the goodness of our heart,
>     otherwise known as avoiding lockup splats.
> 
> 2.  Open coded variants of cond_resched_lock() which call
>     cond_resched().
> 
> 3.  Retry or error handling loops, where cond_resched() is used as a
>     quick alternative to spinning in a tight-loop.
> 
> When running under a full preemption model, the cond_resched() reduces
> to a NOP (not even a barrier) so removing it obviously cannot matter.
> 
> But considering only voluntary preemption models (for say code that
> has been mostly tested under those), for set-1 and set-2 the
> scheduler can now preempt kernel tasks running beyond their time
> quanta anywhere they are preemptible() [1]. Which removes any need
> for these explicitly placed scheduling points.
> 
> The cond_resched() calls in set-3 are a little more difficult.
> To start with, given it's NOP character under full preemption, it
> never actually saved us from a tight loop.
> With voluntary preemption, it's not a NOP, but it might as well be --
> for most workloads the scheduler does not have an interminable supply
> of runnable tasks on the runqueue.
> 
> So, cond_resched() is useful to not get softlockup splats, but not
> terribly good for error handling. Ideally, these should be replaced
> with some kind of timed or event wait.
> For now we use cond_resched_stall(), which tries to schedule if
> possible, and executes a cpu_relax() if not.
> 
> Almost all the cond_resched() calls are from set-1. Remove them.

For the memcpy_kunit.c cases, I don't think there are preemption
locations in its loops. Perhaps I'm misunderstanding something? Why will
the memcpy test no longer produce softlockup splats?

-Kees

> 
> [1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/
> 
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: "David S. Miller" <davem@davemloft.net> 
> Cc: Kees Cook <keescook@chromium.org> 
> Cc: Eric Dumazet <edumazet@google.com> 
> Cc: Jakub Kicinski <kuba@kernel.org> 
> Cc: Paolo Abeni <pabeni@redhat.com> 
> Cc: Thomas Graf <tgraf@suug.ch>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  lib/crc32test.c          |  2 --
>  lib/crypto/mpi/mpi-pow.c |  1 -
>  lib/memcpy_kunit.c       |  5 -----
>  lib/random32.c           |  1 -
>  lib/rhashtable.c         |  2 --
>  lib/test_bpf.c           |  3 ---
>  lib/test_lockup.c        |  2 +-
>  lib/test_maple_tree.c    |  8 --------
>  lib/test_rhashtable.c    | 10 ----------
>  9 files changed, 1 insertion(+), 33 deletions(-)
> 
> diff --git a/lib/crc32test.c b/lib/crc32test.c
> index 9b4af79412c4..3eee90482e9a 100644
> --- a/lib/crc32test.c
> +++ b/lib/crc32test.c
> @@ -729,7 +729,6 @@ static int __init crc32c_combine_test(void)
>  			      crc_full == test[i].crc32c_le))
>  				errors++;
>  			runs++;
> -			cond_resched();
>  		}
>  	}
>  
> @@ -817,7 +816,6 @@ static int __init crc32_combine_test(void)
>  			      crc_full == test[i].crc_le))
>  				errors++;
>  			runs++;
> -			cond_resched();
>  		}
>  	}
>  
> diff --git a/lib/crypto/mpi/mpi-pow.c b/lib/crypto/mpi/mpi-pow.c
> index 2fd7a46d55ec..074534900b7e 100644
> --- a/lib/crypto/mpi/mpi-pow.c
> +++ b/lib/crypto/mpi/mpi-pow.c
> @@ -242,7 +242,6 @@ int mpi_powm(MPI res, MPI base, MPI exp, MPI mod)
>  				}
>  				e <<= 1;
>  				c--;
> -				cond_resched();
>  			}
>  
>  			i--;
> diff --git a/lib/memcpy_kunit.c b/lib/memcpy_kunit.c
> index 440aee705ccc..c2a6b09fe93a 100644
> --- a/lib/memcpy_kunit.c
> +++ b/lib/memcpy_kunit.c
> @@ -361,8 +361,6 @@ static void copy_large_test(struct kunit *test, bool use_memmove)
>  			/* Zero out what we copied for the next cycle. */
>  			memset(large_dst + offset, 0, bytes);
>  		}
> -		/* Avoid stall warnings if this loop gets slow. */
> -		cond_resched();
>  	}
>  }
>  
> @@ -489,9 +487,6 @@ static void memmove_overlap_test(struct kunit *test)
>  			for (int s_off = s_start; s_off < s_end;
>  			     s_off = next_step(s_off, s_start, s_end, window_step))
>  				inner_loop(test, bytes, d_off, s_off);
> -
> -			/* Avoid stall warnings. */
> -			cond_resched();
>  		}
>  	}
>  }
> diff --git a/lib/random32.c b/lib/random32.c
> index 32060b852668..10bc804d99d6 100644
> --- a/lib/random32.c
> +++ b/lib/random32.c
> @@ -287,7 +287,6 @@ static int __init prandom_state_selftest(void)
>  			errors++;
>  
>  		runs++;
> -		cond_resched();
>  	}
>  
>  	if (errors)
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 6ae2ba8e06a2..5ff0f521bf29 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -328,7 +328,6 @@ static int rhashtable_rehash_table(struct rhashtable *ht)
>  		err = rhashtable_rehash_chain(ht, old_hash);
>  		if (err)
>  			return err;
> -		cond_resched();
>  	}
>  
>  	/* Publish the new table pointer. */
> @@ -1147,7 +1146,6 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
>  		for (i = 0; i < tbl->size; i++) {
>  			struct rhash_head *pos, *next;
>  
> -			cond_resched();
>  			for (pos = rht_ptr_exclusive(rht_bucket(tbl, i)),
>  			     next = !rht_is_a_nulls(pos) ?
>  					rht_dereference(pos->next, ht) : NULL;
> diff --git a/lib/test_bpf.c b/lib/test_bpf.c
> index ecde4216201e..15b4d32712d8 100644
> --- a/lib/test_bpf.c
> +++ b/lib/test_bpf.c
> @@ -14758,7 +14758,6 @@ static __init int test_skb_segment(void)
>  	for (i = 0; i < ARRAY_SIZE(skb_segment_tests); i++) {
>  		const struct skb_segment_test *test = &skb_segment_tests[i];
>  
> -		cond_resched();
>  		if (exclude_test(i))
>  			continue;
>  
> @@ -14787,7 +14786,6 @@ static __init int test_bpf(void)
>  		struct bpf_prog *fp;
>  		int err;
>  
> -		cond_resched();
>  		if (exclude_test(i))
>  			continue;
>  
> @@ -15171,7 +15169,6 @@ static __init int test_tail_calls(struct bpf_array *progs)
>  		u64 duration;
>  		int ret;
>  
> -		cond_resched();
>  		if (exclude_test(i))
>  			continue;
>  
> diff --git a/lib/test_lockup.c b/lib/test_lockup.c
> index c3fd87d6c2dd..9af5d34c98f6 100644
> --- a/lib/test_lockup.c
> +++ b/lib/test_lockup.c
> @@ -381,7 +381,7 @@ static void test_lockup(bool master)
>  			touch_nmi_watchdog();
>  
>  		if (call_cond_resched)
> -			cond_resched();
> +			cond_resched_stall();
>  
>  		test_wait(cooldown_secs, cooldown_nsecs);
>  
> diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
> index 464eeb90d5ad..321fd5d8aef3 100644
> --- a/lib/test_maple_tree.c
> +++ b/lib/test_maple_tree.c
> @@ -2672,7 +2672,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
>  		rcu_barrier();
>  	}
>  
> -	cond_resched();
>  	mt_cache_shrink();
>  	/* Check with a value at zero, no gap */
>  	for (i = 1000; i < 2000; i++) {
> @@ -2682,7 +2681,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
>  		rcu_barrier();
>  	}
>  
> -	cond_resched();
>  	mt_cache_shrink();
>  	/* Check with a value at zero and unreasonably large */
>  	for (i = big_start; i < big_start + 10; i++) {
> @@ -2692,7 +2690,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
>  		rcu_barrier();
>  	}
>  
> -	cond_resched();
>  	mt_cache_shrink();
>  	/* Small to medium size not starting at zero*/
>  	for (i = 200; i < 1000; i++) {
> @@ -2702,7 +2699,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
>  		rcu_barrier();
>  	}
>  
> -	cond_resched();
>  	mt_cache_shrink();
>  	/* Unreasonably large not starting at zero*/
>  	for (i = big_start; i < big_start + 10; i++) {
> @@ -2710,7 +2706,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
>  		check_dup_gaps(mt, i, false, 5);
>  		mtree_destroy(mt);
>  		rcu_barrier();
> -		cond_resched();
>  		mt_cache_shrink();
>  	}
>  
> @@ -2720,7 +2715,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
>  		check_dup_gaps(mt, i, false, 5);
>  		mtree_destroy(mt);
>  		rcu_barrier();
> -		cond_resched();
>  		if (i % 2 == 0)
>  			mt_cache_shrink();
>  	}
> @@ -2732,7 +2726,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
>  		check_dup_gaps(mt, i, true, 5);
>  		mtree_destroy(mt);
>  		rcu_barrier();
> -		cond_resched();
>  	}
>  
>  	mt_cache_shrink();
> @@ -2743,7 +2736,6 @@ static noinline void __init check_dup(struct maple_tree *mt)
>  		mtree_destroy(mt);
>  		rcu_barrier();
>  		mt_cache_shrink();
> -		cond_resched();
>  	}
>  }
>  
> diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
> index c20f6cb4bf55..e5d1f272f2c6 100644
> --- a/lib/test_rhashtable.c
> +++ b/lib/test_rhashtable.c
> @@ -119,7 +119,6 @@ static int insert_retry(struct rhashtable *ht, struct test_obj *obj,
>  
>  	do {
>  		retries++;
> -		cond_resched();
>  		err = rhashtable_insert_fast(ht, &obj->node, params);
>  		if (err == -ENOMEM && enomem_retry) {
>  			enomem_retries++;
> @@ -253,8 +252,6 @@ static s64 __init test_rhashtable(struct rhashtable *ht, struct test_obj *array,
>  
>  			rhashtable_remove_fast(ht, &obj->node, test_rht_params);
>  		}
> -
> -		cond_resched();
>  	}
>  
>  	end = ktime_get_ns();
> @@ -371,8 +368,6 @@ static int __init test_rhltable(unsigned int entries)
>  		u32 i = get_random_u32_below(entries);
>  		u32 prand = get_random_u32_below(4);
>  
> -		cond_resched();
> -
>  		err = rhltable_remove(&rhlt, &rhl_test_objects[i].list_node, test_rht_params);
>  		if (test_bit(i, obj_in_table)) {
>  			clear_bit(i, obj_in_table);
> @@ -412,7 +407,6 @@ static int __init test_rhltable(unsigned int entries)
>  	}
>  
>  	for (i = 0; i < entries; i++) {
> -		cond_resched();
>  		err = rhltable_remove(&rhlt, &rhl_test_objects[i].list_node, test_rht_params);
>  		if (test_bit(i, obj_in_table)) {
>  			if (WARN(err, "cannot remove element at slot %d", i))
> @@ -607,8 +601,6 @@ static int thread_lookup_test(struct thread_data *tdata)
>  			       obj->value.tid, obj->value.id, key.tid, key.id);
>  			err++;
>  		}
> -
> -		cond_resched();
>  	}
>  	return err;
>  }
> @@ -660,8 +652,6 @@ static int threadfunc(void *data)
>  				goto out;
>  			}
>  			tdata->objs[i].value.id = TEST_INSERT_FAIL;
> -
> -			cond_resched();
>  		}
>  		err = thread_lookup_test(tdata);
>  		if (err) {
> -- 
> 2.31.1
> 

-- 
Kees Cook
Re: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
Posted by Steven Rostedt 2 years, 1 month ago
On Wed, 8 Nov 2023 11:15:37 -0800
Kees Cook <keescook@chromium.org> wrote:

> FOr the memcpy_kunit.c cases, I don't think there are preemption
> locations in its loops. Perhaps I'm misunderstanding something? Why will
> the memcpy test no longer produce softlockup splats?

This patchset will switch over to a NEED_RESCHED_LAZY mechanism, so that
the VOLUNTARY and NONE preemption models will be forced to preempt if a
task stays in the kernel for too long.

Time slice is over: set NEED_RESCHED_LAZY

For VOLUNTARY and NONE, NEED_RESCHED_LAZY will not preempt the kernel (but
will preempt user space).

If the task stays in the kernel for over one tick (1ms for 1000Hz, 4ms for
250Hz, etc.) and NEED_RESCHED_LAZY is still set, then set NEED_RESCHED.

NEED_RESCHED will now schedule in the kernel once it is able to regardless
of preemption model. (PREEMPT_NONE will now use preempt_disable()).

This allows us to get rid of all cond_resched()s throughout the kernel as
this will be the new mechanism to keep from running inside the kernel for
too long. The watchdog is always longer than one tick.
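
A rough sketch of that tick-time flow (the helper and flag names below
are stand-ins, not the identifiers used by the series):

	/* Called from the scheduler tick; names are illustrative only. */
	void tick_check_resched(struct task_struct *curr)
	{
		if (!task_slice_expired(curr))
			return;

		if (!test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY)) {
			/* First tick past the slice: set the lazy flag.
			 * Under NONE/VOLUNTARY only the return to user
			 * space acts on it.
			 */
			set_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY);
			return;
		}

		/* Still in the kernel a full tick later: force it. */
		set_tsk_need_resched(curr);
	}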

-- Steve
RE: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
Posted by David Laight 2 years, 1 month ago
From: Steven Rostedt
> Sent: 08 November 2023 19:42
> 
> On Wed, 8 Nov 2023 11:15:37 -0800
> Kees Cook <keescook@chromium.org> wrote:
> 
> > FOr the memcpy_kunit.c cases, I don't think there are preemption
> > locations in its loops. Perhaps I'm misunderstanding something? Why will
> > the memcpy test no longer produce softlockup splats?
> 
> This patchset will switch over to a NEED_RESCHED_LAZY routine, so that
> VOLUNTARY and NONE preemption models will be forced to preempt if its in
> the kernel for too long.
> 
> Time slice is over: set NEED_RESCHED_LAZY
> 
> For VOLUNTARY and NONE, NEED_RESCHED_LAZY will not preempt the kernel (but
> will preempt user space).
> 
> If in the kernel for over 1 tick (1ms for 1000Hz, 4ms for 250Hz, etc),
> if NEED_RESCHED_LAZY is still set after one tick, then set NEED_RESCHED.

Delaying the reschedule that long seems like a regression.
I'm sure a lot of the cond_resched() calls were added to cause
pre-emption much earlier than 1 tick.

I doubt the distributions will change from VOLUNTARY any time soon.
So that is what most people will be using.

	David.

> 
> NEED_RESCHED will now schedule in the kernel once it is able to regardless
> of preemption model. (PREEMPT_NONE will now use preempt_disable()).
> 
> This allows us to get rid of all cond_resched()s throughout the kernel as
> this will be the new mechanism to keep from running inside the kernel for
> too long. The watchdog is always longer than one tick.
> 
> -- Steve

Re: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
Posted by Kees Cook 2 years, 1 month ago
On Wed, Nov 08, 2023 at 02:41:44PM -0500, Steven Rostedt wrote:
> On Wed, 8 Nov 2023 11:15:37 -0800
> Kees Cook <keescook@chromium.org> wrote:
> 
> > FOr the memcpy_kunit.c cases, I don't think there are preemption
> > locations in its loops. Perhaps I'm misunderstanding something? Why will
> > the memcpy test no longer produce softlockup splats?
> 
> This patchset will switch over to a NEED_RESCHED_LAZY routine, so that
> VOLUNTARY and NONE preemption models will be forced to preempt if its in
> the kernel for too long.
> 
> Time slice is over: set NEED_RESCHED_LAZY
> 
> For VOLUNTARY and NONE, NEED_RESCHED_LAZY will not preempt the kernel (but
> will preempt user space).
> 
> If in the kernel for over 1 tick (1ms for 1000Hz, 4ms for 250Hz, etc),
> if NEED_RESCHED_LAZY is still set after one tick, then set NEED_RESCHED.
> 
> NEED_RESCHED will now schedule in the kernel once it is able to regardless
> of preemption model. (PREEMPT_NONE will now use preempt_disable()).
> 
> This allows us to get rid of all cond_resched()s throughout the kernel as
> this will be the new mechanism to keep from running inside the kernel for
> too long. The watchdog is always longer than one tick.

Okay, it sounds like it's taken care of. :)

Acked-by: Kees Cook <keescook@chromium.org> # for lib/memcpy_kunit.c

-- 
Kees Cook
Re: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
Posted by Steven Rostedt 2 years, 1 month ago
On Wed, 8 Nov 2023 14:16:25 -0800
Kees Cook <keescook@chromium.org> wrote:

> Okay, it sounds like it's taken care of. :)
> 
> Acked-by: Kees Cook <keescook@chromium.org> # for lib/memcpy_kunit.c

Thanks Kees,

But I have to admit (and Ankur is now aware) that it was premature to send
the cond_resched() removal patches with this RFC. It may be a year before
we get everything straightened out with the new preempt models.

So, expect to see this patch again sometime next year ;-)

Hopefully, you can ack it then.

-- Steve
Re: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
Posted by Herbert Xu 2 years, 1 month ago
On Tue, Nov 07, 2023 at 03:08:07PM -0800, Ankur Arora wrote:
> There are broadly three sets of uses of cond_resched():
> 
> 1.  Calls to cond_resched() out of the goodness of our heart,
>     otherwise known as avoiding lockup splats.
> 
> 2.  Open coded variants of cond_resched_lock() which call
>     cond_resched().
> 
> 3.  Retry or error handling loops, where cond_resched() is used as a
>     quick alternative to spinning in a tight-loop.
> 
> When running under a full preemption model, the cond_resched() reduces
> to a NOP (not even a barrier) so removing it obviously cannot matter.
> 
> But considering only voluntary preemption models (for say code that
> has been mostly tested under those), for set-1 and set-2 the
> scheduler can now preempt kernel tasks running beyond their time
> quanta anywhere they are preemptible() [1]. Which removes any need
> for these explicitly placed scheduling points.
> 
> The cond_resched() calls in set-3 are a little more difficult.
> To start with, given it's NOP character under full preemption, it
> never actually saved us from a tight loop.
> With voluntary preemption, it's not a NOP, but it might as well be --
> for most workloads the scheduler does not have an interminable supply
> of runnable tasks on the runqueue.
> 
> So, cond_resched() is useful to not get softlockup splats, but not
> terribly good for error handling. Ideally, these should be replaced
> with some kind of timed or event wait.
> For now we use cond_resched_stall(), which tries to schedule if
> possible, and executes a cpu_relax() if not.
> 
> Almost all the cond_resched() calls are from set-1. Remove them.
> 
> [1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/
> 
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: "David S. Miller" <davem@davemloft.net> 
> Cc: Kees Cook <keescook@chromium.org> 
> Cc: Eric Dumazet <edumazet@google.com> 
> Cc: Jakub Kicinski <kuba@kernel.org> 
> Cc: Paolo Abeni <pabeni@redhat.com> 
> Cc: Thomas Graf <tgraf@suug.ch>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  lib/crc32test.c          |  2 --
>  lib/crypto/mpi/mpi-pow.c |  1 -
>  lib/memcpy_kunit.c       |  5 -----
>  lib/random32.c           |  1 -
>  lib/rhashtable.c         |  2 --
>  lib/test_bpf.c           |  3 ---
>  lib/test_lockup.c        |  2 +-
>  lib/test_maple_tree.c    |  8 --------
>  lib/test_rhashtable.c    | 10 ----------
>  9 files changed, 1 insertion(+), 33 deletions(-)

Nack.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
Posted by Steven Rostedt 2 years, 1 month ago
On Wed, 8 Nov 2023 17:15:11 +0800
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> >  lib/crc32test.c          |  2 --
> >  lib/crypto/mpi/mpi-pow.c |  1 -
> >  lib/memcpy_kunit.c       |  5 -----
> >  lib/random32.c           |  1 -
> >  lib/rhashtable.c         |  2 --
> >  lib/test_bpf.c           |  3 ---
> >  lib/test_lockup.c        |  2 +-
> >  lib/test_maple_tree.c    |  8 --------
> >  lib/test_rhashtable.c    | 10 ----------
> >  9 files changed, 1 insertion(+), 33 deletions(-)  
> 
> Nack.

A "Nack" with no commentary is completely useless and borderline offensive.

What is your rationale for the Nack?

The cond_resched() is going away if the patches earlier in the series get
implemented. So either it is removed from your code, or it will become a
nop, just wasting bits in the source tree. Your choice.

-- Steve
Re: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
Posted by Herbert Xu 2 years, 1 month ago
On Wed, Nov 08, 2023 at 10:08:18AM -0500, Steven Rostedt wrote:
>
> A "Nack" with no commentary is completely useless and borderline offensive.

Well, you just sent me an email out of the blue, with zero context
about what you were doing, and you're complaining to me about giving
you a curt response?

> What is your rationale for the Nack?

Next time perhaps consider sending the cover letter and the important
patches to everyone rather than the mailing list.

> The cond_resched() is going away if the patches earlier in the series gets
> implemented. So either it is removed from your code, or it will become a
> nop, and just wasting bits in the source tree. Your choice.

This is exactly what I should have received.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [RFC PATCH 71/86] treewide: lib: remove cond_resched()
Posted by Steven Rostedt 2 years, 1 month ago
On Thu, 9 Nov 2023 12:19:55 +0800
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Wed, Nov 08, 2023 at 10:08:18AM -0500, Steven Rostedt wrote:
> >
> > A "Nack" with no commentary is completely useless and borderline offensive.  
> 
> Well you just sent me an email out of the blue, with zero context
> about what you were doing, and you're complaining to me about giving
> your a curt response?

First, I didn't send the email, and your "Nack" wasn't directed at me.

Second, with lore and lei, it's trivial today to find the cover letter from
the message id. But I get it. It's annoying when you have to do that.

> 
> > What is your rationale for the Nack?  
> 
> Next time perhaps consider sending the cover letter and the important
> patches to everyone rather than the mailing list.

Then that is how you should have responded. I see other maintainers respond
as such. A "Nack" is still meaningless. You could have responded with:

 "What is this? And why are you doing it?"

Which is a much better and a more meaningful response than a "Nack".

> 
> > The cond_resched() is going away if the patches earlier in the series gets
> > implemented. So either it is removed from your code, or it will become a
> > nop, and just wasting bits in the source tree. Your choice.  
> 
> This is exactly what I should have received.

Which is why I replied, as the original email author is still new at this,
but is learning.

-- Steve
[RFC PATCH 72/86] treewide: crypto: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful for avoiding softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All the cond_resched() calls are from set-1. Remove them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-crypto@vger.kernel.org
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 crypto/internal.h |  2 +-
 crypto/tcrypt.c   |  5 -----
 crypto/testmgr.c  | 10 ----------
 3 files changed, 1 insertion(+), 16 deletions(-)

diff --git a/crypto/internal.h b/crypto/internal.h
index 63e59240d5fb..930f8f5fad39 100644
--- a/crypto/internal.h
+++ b/crypto/internal.h
@@ -203,7 +203,7 @@ static inline void crypto_notify(unsigned long val, void *v)
 static inline void crypto_yield(u32 flags)
 {
 	if (flags & CRYPTO_TFM_REQ_MAY_SLEEP)
-		cond_resched();
+		cond_resched_stall();
 }
 
 static inline int crypto_is_test_larval(struct crypto_larval *larval)
diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 202ca1a3105d..9f33b9724a2e 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -414,7 +414,6 @@ static void test_mb_aead_speed(const char *algo, int enc, int secs,
 			if (secs) {
 				ret = test_mb_aead_jiffies(data, enc, bs,
 							   secs, num_mb);
-				cond_resched();
 			} else {
 				ret = test_mb_aead_cycles(data, enc, bs,
 							  num_mb);
@@ -667,7 +666,6 @@ static void test_aead_speed(const char *algo, int enc, unsigned int secs,
 			if (secs) {
 				ret = test_aead_jiffies(req, enc, bs,
 							secs);
-				cond_resched();
 			} else {
 				ret = test_aead_cycles(req, enc, bs);
 			}
@@ -923,7 +921,6 @@ static void test_ahash_speed_common(const char *algo, unsigned int secs,
 		if (secs) {
 			ret = test_ahash_jiffies(req, speed[i].blen,
 						 speed[i].plen, output, secs);
-			cond_resched();
 		} else {
 			ret = test_ahash_cycles(req, speed[i].blen,
 						speed[i].plen, output);
@@ -1182,7 +1179,6 @@ static void test_mb_skcipher_speed(const char *algo, int enc, int secs,
 				ret = test_mb_acipher_jiffies(data, enc,
 							      bs, secs,
 							      num_mb);
-				cond_resched();
 			} else {
 				ret = test_mb_acipher_cycles(data, enc,
 							     bs, num_mb);
@@ -1397,7 +1393,6 @@ static void test_skcipher_speed(const char *algo, int enc, unsigned int secs,
 			if (secs) {
 				ret = test_acipher_jiffies(req, enc,
 							   bs, secs);
-				cond_resched();
 			} else {
 				ret = test_acipher_cycles(req, enc,
 							  bs);
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 216878c8bc3d..2909c5aa4b8b 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -1676,7 +1676,6 @@ static int test_hash_vec(const struct hash_testvec *vec, unsigned int vec_num,
 						req, desc, tsgl, hashstate);
 			if (err)
 				return err;
-			cond_resched();
 		}
 	}
 #endif
@@ -1837,7 +1836,6 @@ static int test_hash_vs_generic_impl(const char *generic_driver,
 					req, desc, tsgl, hashstate);
 		if (err)
 			goto out;
-		cond_resched();
 	}
 	err = 0;
 out:
@@ -1966,7 +1964,6 @@ static int __alg_test_hash(const struct hash_testvec *vecs,
 		err = test_hash_vec(&vecs[i], i, req, desc, tsgl, hashstate);
 		if (err)
 			goto out;
-		cond_resched();
 	}
 	err = test_hash_vs_generic_impl(generic_driver, maxkeysize, req,
 					desc, tsgl, hashstate);
@@ -2246,7 +2243,6 @@ static int test_aead_vec(int enc, const struct aead_testvec *vec,
 						&cfg, req, tsgls);
 			if (err)
 				return err;
-			cond_resched();
 		}
 	}
 #endif
@@ -2476,7 +2472,6 @@ static int test_aead_inauthentic_inputs(struct aead_extra_tests_ctx *ctx)
 			if (err)
 				return err;
 		}
-		cond_resched();
 	}
 	return 0;
 }
@@ -2580,7 +2575,6 @@ static int test_aead_vs_generic_impl(struct aead_extra_tests_ctx *ctx)
 			if (err)
 				goto out;
 		}
-		cond_resched();
 	}
 	err = 0;
 out:
@@ -2659,7 +2653,6 @@ static int test_aead(int enc, const struct aead_test_suite *suite,
 		err = test_aead_vec(enc, &suite->vecs[i], i, req, tsgls);
 		if (err)
 			return err;
-		cond_resched();
 	}
 	return 0;
 }
@@ -3006,7 +2999,6 @@ static int test_skcipher_vec(int enc, const struct cipher_testvec *vec,
 						    &cfg, req, tsgls);
 			if (err)
 				return err;
-			cond_resched();
 		}
 	}
 #endif
@@ -3203,7 +3195,6 @@ static int test_skcipher_vs_generic_impl(const char *generic_driver,
 					    cfg, req, tsgls);
 		if (err)
 			goto out;
-		cond_resched();
 	}
 	err = 0;
 out:
@@ -3236,7 +3227,6 @@ static int test_skcipher(int enc, const struct cipher_test_suite *suite,
 		err = test_skcipher_vec(enc, &suite->vecs[i], i, req, tsgls);
 		if (err)
 			return err;
-		cond_resched();
 	}
 	return 0;
 }
-- 
2.31.1
[RFC PATCH 73/86] treewide: security: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful for avoiding softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All the cond_resched() calls are to avoid monopolizing the CPU while
executing in long loops (set-1 or set-2).

Remove them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: David Howells <dhowells@redhat.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 security/keys/gc.c             | 1 -
 security/landlock/fs.c         | 1 -
 security/selinux/ss/hashtab.h  | 2 --
 security/selinux/ss/policydb.c | 6 ------
 security/selinux/ss/services.c | 1 -
 security/selinux/ss/sidtab.c   | 1 -
 6 files changed, 12 deletions(-)

diff --git a/security/keys/gc.c b/security/keys/gc.c
index 3c90807476eb..edb886df2d82 100644
--- a/security/keys/gc.c
+++ b/security/keys/gc.c
@@ -265,7 +265,6 @@ static void key_garbage_collector(struct work_struct *work)
 
 maybe_resched:
 	if (cursor) {
-		cond_resched();
 		spin_lock(&key_serial_lock);
 		goto continue_scanning;
 	}
diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index 1c0c198f6fdb..e7ecd8cca418 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -1013,7 +1013,6 @@ static void hook_sb_delete(struct super_block *const sb)
 			 * previous loop walk, which is not needed anymore.
 			 */
 			iput(prev_inode);
-			cond_resched();
 			spin_lock(&sb->s_inode_list_lock);
 		}
 		prev_inode = inode;
diff --git a/security/selinux/ss/hashtab.h b/security/selinux/ss/hashtab.h
index f9713b56d3d0..1e297dd83b3e 100644
--- a/security/selinux/ss/hashtab.h
+++ b/security/selinux/ss/hashtab.h
@@ -64,8 +64,6 @@ static inline int hashtab_insert(struct hashtab *h, void *key, void *datum,
 	u32 hvalue;
 	struct hashtab_node *prev, *cur;
 
-	cond_resched();
-
 	if (!h->size || h->nel == HASHTAB_MAX_NODES)
 		return -EINVAL;
 
diff --git a/security/selinux/ss/policydb.c b/security/selinux/ss/policydb.c
index 2d528f699a22..2737b753d9da 100644
--- a/security/selinux/ss/policydb.c
+++ b/security/selinux/ss/policydb.c
@@ -336,7 +336,6 @@ static int filenametr_destroy(void *key, void *datum, void *p)
 		kfree(d);
 		d = next;
 	} while (unlikely(d));
-	cond_resched();
 	return 0;
 }
 
@@ -348,7 +347,6 @@ static int range_tr_destroy(void *key, void *datum, void *p)
 	ebitmap_destroy(&rt->level[0].cat);
 	ebitmap_destroy(&rt->level[1].cat);
 	kfree(datum);
-	cond_resched();
 	return 0;
 }
 
@@ -786,7 +784,6 @@ void policydb_destroy(struct policydb *p)
 	struct role_allow *ra, *lra = NULL;
 
 	for (i = 0; i < SYM_NUM; i++) {
-		cond_resched();
 		hashtab_map(&p->symtab[i].table, destroy_f[i], NULL);
 		hashtab_destroy(&p->symtab[i].table);
 	}
@@ -802,7 +799,6 @@ void policydb_destroy(struct policydb *p)
 	avtab_destroy(&p->te_avtab);
 
 	for (i = 0; i < OCON_NUM; i++) {
-		cond_resched();
 		c = p->ocontexts[i];
 		while (c) {
 			ctmp = c;
@@ -814,7 +810,6 @@ void policydb_destroy(struct policydb *p)
 
 	g = p->genfs;
 	while (g) {
-		cond_resched();
 		kfree(g->fstype);
 		c = g->head;
 		while (c) {
@@ -834,7 +829,6 @@ void policydb_destroy(struct policydb *p)
 	hashtab_destroy(&p->role_tr);
 
 	for (ra = p->role_allow; ra; ra = ra->next) {
-		cond_resched();
 		kfree(lra);
 		lra = ra;
 	}
diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c
index 1eeffc66ea7d..0cb652456256 100644
--- a/security/selinux/ss/services.c
+++ b/security/selinux/ss/services.c
@@ -2790,7 +2790,6 @@ int security_get_user_sids(u32 fromsid,
 					  &dummy_avd);
 		if (!rc)
 			mysids2[j++] = mysids[i];
-		cond_resched();
 	}
 	kfree(mysids);
 	*sids = mysids2;
diff --git a/security/selinux/ss/sidtab.c b/security/selinux/ss/sidtab.c
index d8ead463b8df..c5537cecb755 100644
--- a/security/selinux/ss/sidtab.c
+++ b/security/selinux/ss/sidtab.c
@@ -415,7 +415,6 @@ static int sidtab_convert_tree(union sidtab_entry_inner *edst,
 			(*pos)++;
 			i++;
 		}
-		cond_resched();
 	}
 	return 0;
 }
-- 
2.31.1
[RFC PATCH 74/86] treewide: fs: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful for avoiding softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most uses here are from set-1 or can be converted to set-2. A few cases
in retry loops have cond_resched() replaced with cpu_relax() or
cond_resched_stall(), as illustrated below.
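
For illustration, a retry-loop conversion of that kind looks roughly
like this (made-up helper, not taken from the files below):

	before:
		while (try_claim_resource(r) == -EBUSY)
			cond_resched();

	after:
		while (try_claim_resource(r) == -EBUSY)
			cond_resched_stall();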

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Chris Mason <clm@fb.com> 
Cc: Josef Bacik <josef@toxicpanda.com> 
Cc: David Sterba <dsterba@suse.com> 
Cc: Alexander Viro <viro@zeniv.linux.org.uk> 
Cc: Christian Brauner <brauner@kernel.org> 
Cc: Gao Xiang <xiang@kernel.org> 
Cc: Chao Yu <chao@kernel.org> 
Cc: "Theodore Ts'o" <tytso@mit.edu> 
Cc: Andreas Dilger <adilger.kernel@dilger.ca> 
Cc: Jaegeuk Kim <jaegeuk@kernel.org> 
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> 
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> 
Cc: Mike Kravetz <mike.kravetz@oracle.com> 
Cc: Muchun Song <muchun.song@linux.dev> 
Cc: Trond Myklebust <trond.myklebust@hammerspace.com> 
Cc: Anna Schumaker <anna@kernel.org> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 fs/afs/write.c                     |  2 --
 fs/btrfs/backref.c                 |  6 ------
 fs/btrfs/block-group.c             |  3 ---
 fs/btrfs/ctree.c                   |  1 -
 fs/btrfs/defrag.c                  |  1 -
 fs/btrfs/disk-io.c                 |  3 ---
 fs/btrfs/extent-io-tree.c          |  5 -----
 fs/btrfs/extent-tree.c             |  8 --------
 fs/btrfs/extent_io.c               |  9 ---------
 fs/btrfs/file-item.c               |  1 -
 fs/btrfs/file.c                    |  4 ----
 fs/btrfs/free-space-cache.c        |  4 ----
 fs/btrfs/inode.c                   |  9 ---------
 fs/btrfs/ordered-data.c            |  2 --
 fs/btrfs/qgroup.c                  |  1 -
 fs/btrfs/reflink.c                 |  2 --
 fs/btrfs/relocation.c              |  9 ---------
 fs/btrfs/scrub.c                   |  3 ---
 fs/btrfs/send.c                    |  1 -
 fs/btrfs/space-info.c              |  1 -
 fs/btrfs/tests/extent-io-tests.c   |  1 -
 fs/btrfs/transaction.c             |  3 ---
 fs/btrfs/tree-log.c                | 12 ------------
 fs/btrfs/uuid-tree.c               |  1 -
 fs/btrfs/volumes.c                 |  2 --
 fs/buffer.c                        |  1 -
 fs/cachefiles/cache.c              |  4 +---
 fs/cachefiles/namei.c              |  1 -
 fs/cachefiles/volume.c             |  1 -
 fs/ceph/addr.c                     |  1 -
 fs/dax.c                           |  1 -
 fs/dcache.c                        |  2 --
 fs/dlm/ast.c                       |  1 -
 fs/dlm/dir.c                       |  2 --
 fs/dlm/lock.c                      |  3 ---
 fs/dlm/lowcomms.c                  |  3 ---
 fs/dlm/recover.c                   |  1 -
 fs/drop_caches.c                   |  1 -
 fs/erofs/utils.c                   |  1 -
 fs/erofs/zdata.c                   |  8 ++++++--
 fs/eventpoll.c                     |  3 ---
 fs/exec.c                          |  4 ----
 fs/ext4/block_validity.c           |  2 --
 fs/ext4/dir.c                      |  1 -
 fs/ext4/extents.c                  |  1 -
 fs/ext4/ialloc.c                   |  1 -
 fs/ext4/inode.c                    |  1 -
 fs/ext4/mballoc.c                  | 12 ++++--------
 fs/ext4/namei.c                    |  3 ---
 fs/ext4/orphan.c                   |  1 -
 fs/ext4/super.c                    |  2 --
 fs/f2fs/checkpoint.c               | 16 ++++++----------
 fs/f2fs/compress.c                 |  1 -
 fs/f2fs/data.c                     |  3 ---
 fs/f2fs/dir.c                      |  1 -
 fs/f2fs/extent_cache.c             |  1 -
 fs/f2fs/f2fs.h                     |  6 +++++-
 fs/f2fs/file.c                     |  3 ---
 fs/f2fs/node.c                     |  4 ----
 fs/f2fs/super.c                    |  1 -
 fs/fat/fatent.c                    |  2 --
 fs/file.c                          |  7 +------
 fs/fs-writeback.c                  |  3 ---
 fs/gfs2/aops.c                     |  1 -
 fs/gfs2/bmap.c                     |  2 --
 fs/gfs2/glock.c                    |  2 +-
 fs/gfs2/log.c                      |  1 -
 fs/gfs2/ops_fstype.c               |  1 -
 fs/hpfs/buffer.c                   |  8 --------
 fs/hugetlbfs/inode.c               |  3 ---
 fs/inode.c                         |  3 ---
 fs/iomap/buffered-io.c             |  7 +------
 fs/jbd2/checkpoint.c               |  2 --
 fs/jbd2/commit.c                   |  3 ---
 fs/jbd2/recovery.c                 |  2 --
 fs/jffs2/build.c                   |  6 +-----
 fs/jffs2/erase.c                   |  3 ---
 fs/jffs2/gc.c                      |  2 --
 fs/jffs2/nodelist.c                |  1 -
 fs/jffs2/nodemgmt.c                | 11 ++++++++---
 fs/jffs2/readinode.c               |  2 --
 fs/jffs2/scan.c                    |  4 ----
 fs/jffs2/summary.c                 |  2 --
 fs/jfs/jfs_txnmgr.c                | 14 ++++----------
 fs/libfs.c                         |  5 ++---
 fs/mbcache.c                       |  1 -
 fs/namei.c                         |  1 -
 fs/netfs/io.c                      |  1 -
 fs/nfs/delegation.c                |  3 ---
 fs/nfs/pnfs.c                      |  2 --
 fs/nfs/write.c                     |  4 ----
 fs/nilfs2/btree.c                  |  1 -
 fs/nilfs2/inode.c                  |  1 -
 fs/nilfs2/page.c                   |  4 ----
 fs/nilfs2/segment.c                |  4 ----
 fs/notify/fanotify/fanotify_user.c |  1 -
 fs/notify/fsnotify.c               |  1 -
 fs/ntfs/attrib.c                   |  3 ---
 fs/ntfs/file.c                     |  2 --
 fs/ntfs3/file.c                    |  9 ---------
 fs/ntfs3/frecord.c                 |  2 --
 fs/ocfs2/alloc.c                   |  4 +---
 fs/ocfs2/cluster/tcp.c             |  8 ++++++--
 fs/ocfs2/dlm/dlmthread.c           |  7 +++----
 fs/ocfs2/file.c                    | 10 ++++------
 fs/proc/base.c                     |  1 -
 fs/proc/fd.c                       |  1 -
 fs/proc/kcore.c                    |  1 -
 fs/proc/page.c                     |  6 ------
 fs/proc/task_mmu.c                 |  7 -------
 fs/quota/dquot.c                   |  1 -
 fs/reiserfs/journal.c              |  2 --
 fs/select.c                        |  1 -
 fs/smb/client/file.c               |  2 --
 fs/splice.c                        |  1 -
 fs/ubifs/budget.c                  |  1 -
 fs/ubifs/commit.c                  |  1 -
 fs/ubifs/debug.c                   |  5 -----
 fs/ubifs/dir.c                     |  1 -
 fs/ubifs/gc.c                      |  5 -----
 fs/ubifs/io.c                      |  2 --
 fs/ubifs/lprops.c                  |  2 --
 fs/ubifs/lpt_commit.c              |  3 ---
 fs/ubifs/orphan.c                  |  1 -
 fs/ubifs/recovery.c                |  4 ----
 fs/ubifs/replay.c                  |  7 -------
 fs/ubifs/scan.c                    |  2 --
 fs/ubifs/shrinker.c                |  1 -
 fs/ubifs/super.c                   |  2 --
 fs/ubifs/tnc_commit.c              |  2 --
 fs/ubifs/tnc_misc.c                |  1 -
 fs/userfaultfd.c                   |  9 ---------
 fs/verity/enable.c                 |  1 -
 fs/verity/read_metadata.c          |  1 -
 fs/xfs/scrub/common.h              |  7 -------
 fs/xfs/scrub/xfarray.c             |  7 -------
 fs/xfs/xfs_aops.c                  |  1 -
 fs/xfs/xfs_icache.c                |  2 --
 fs/xfs/xfs_iwalk.c                 |  1 -
 139 files changed, 54 insertions(+), 396 deletions(-)

diff --git a/fs/afs/write.c b/fs/afs/write.c
index e1c45341719b..6b2bc1dad8e0 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -568,7 +568,6 @@ static void afs_extend_writeback(struct address_space *mapping,
 		}
 
 		folio_batch_release(&fbatch);
-		cond_resched();
 	} while (!stop);
 
 	*_len = len;
@@ -790,7 +789,6 @@ static int afs_writepages_region(struct address_space *mapping,
 		}
 
 		folio_batch_release(&fbatch);
-		cond_resched();
 	} while (wbc->nr_to_write > 0);
 
 	*_next = start;
diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index a4a809efc92f..2adaabd18b6e 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -823,7 +823,6 @@ static int resolve_indirect_refs(struct btrfs_backref_walk_ctx *ctx,
 		prelim_ref_insert(ctx->fs_info, &preftrees->direct, ref, NULL);
 
 		ulist_reinit(parents);
-		cond_resched();
 	}
 out:
 	/*
@@ -879,7 +878,6 @@ static int add_missing_keys(struct btrfs_fs_info *fs_info,
 			btrfs_tree_read_unlock(eb);
 		free_extent_buffer(eb);
 		prelim_ref_insert(fs_info, &preftrees->indirect, ref, NULL);
-		cond_resched();
 	}
 	return 0;
 }
@@ -1676,7 +1674,6 @@ static int find_parent_nodes(struct btrfs_backref_walk_ctx *ctx,
 			 */
 			ref->inode_list = NULL;
 		}
-		cond_resched();
 	}
 
 out:
@@ -1784,7 +1781,6 @@ static int btrfs_find_all_roots_safe(struct btrfs_backref_walk_ctx *ctx)
 		if (!node)
 			break;
 		ctx->bytenr = node->val;
-		cond_resched();
 	}
 
 	ulist_free(ctx->refs);
@@ -1993,7 +1989,6 @@ int btrfs_is_data_extent_shared(struct btrfs_inode *inode, u64 bytenr,
 		}
 		shared.share_count = 0;
 		shared.have_delayed_delete_refs = false;
-		cond_resched();
 	}
 
 	/*
@@ -3424,7 +3419,6 @@ int btrfs_backref_add_tree_node(struct btrfs_trans_handle *trans,
 		struct btrfs_key key;
 		int type;
 
-		cond_resched();
 		eb = btrfs_backref_get_eb(iter);
 
 		key.objectid = iter->bytenr;
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index b2e5107b7cec..fe9f0a23dbb2 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -769,7 +769,6 @@ static int load_extent_tree_free(struct btrfs_caching_control *caching_ctl)
 				btrfs_release_path(path);
 				up_read(&fs_info->commit_root_sem);
 				mutex_unlock(&caching_ctl->mutex);
-				cond_resched();
 				mutex_lock(&caching_ctl->mutex);
 				down_read(&fs_info->commit_root_sem);
 				goto next;
@@ -4066,8 +4065,6 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 			wait_for_alloc = false;
 			spin_unlock(&space_info->lock);
 		}
-
-		cond_resched();
 	} while (wait_for_alloc);
 
 	mutex_lock(&fs_info->chunk_mutex);
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 617d4827eec2..09b70b271cd2 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -5052,7 +5052,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 				 */
 				free_extent_buffer(next);
 				btrfs_release_path(path);
-				cond_resched();
 				goto again;
 			}
 			if (!ret)
diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index f2ff4cbe8656..2219c3ccb863 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -1326,7 +1326,6 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
 			ret = 0;
 			break;
 		}
-		cond_resched();
 	}
 
 	if (ra_allocated)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 68f60d50e1fd..e9d1cef7d030 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4561,7 +4561,6 @@ static void btrfs_destroy_all_ordered_extents(struct btrfs_fs_info *fs_info)
 		spin_unlock(&fs_info->ordered_root_lock);
 		btrfs_destroy_ordered_extents(root);
 
-		cond_resched();
 		spin_lock(&fs_info->ordered_root_lock);
 	}
 	spin_unlock(&fs_info->ordered_root_lock);
@@ -4643,7 +4642,6 @@ static void btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
 		}
 		btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
 		btrfs_put_delayed_ref_head(head);
-		cond_resched();
 		spin_lock(&delayed_refs->lock);
 	}
 	btrfs_qgroup_destroy_extent_records(trans);
@@ -4759,7 +4757,6 @@ static void btrfs_destroy_pinned_extent(struct btrfs_fs_info *fs_info,
 		free_extent_state(cached_state);
 		btrfs_error_unpin_extent_range(fs_info, start, end);
 		mutex_unlock(&fs_info->unused_bg_unpin_mutex);
-		cond_resched();
 	}
 }
 
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index ff8e117a1ace..39aa803cbb13 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -695,8 +695,6 @@ int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (start > end)
 		goto out;
 	spin_unlock(&tree->lock);
-	if (gfpflags_allow_blocking(mask))
-		cond_resched();
 	goto again;
 
 out:
@@ -1189,8 +1187,6 @@ static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (start > end)
 		goto out;
 	spin_unlock(&tree->lock);
-	if (gfpflags_allow_blocking(mask))
-		cond_resched();
 	goto again;
 
 out:
@@ -1409,7 +1405,6 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (start > end)
 		goto out;
 	spin_unlock(&tree->lock);
-	cond_resched();
 	first_iteration = false;
 	goto again;
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fc313fce5bbd..33be7bb96872 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1996,7 +1996,6 @@ static int btrfs_run_delayed_refs_for_head(struct btrfs_trans_handle *trans,
 		}
 
 		btrfs_put_delayed_ref(ref);
-		cond_resched();
 
 		spin_lock(&locked_ref->lock);
 		btrfs_merge_delayed_refs(fs_info, delayed_refs, locked_ref);
@@ -2074,7 +2073,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 		 */
 
 		locked_ref = NULL;
-		cond_resched();
 	} while ((nr != -1 && count < nr) || locked_ref);
 
 	return 0;
@@ -2183,7 +2181,6 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 		mutex_unlock(&head->mutex);
 
 		btrfs_put_delayed_ref_head(head);
-		cond_resched();
 		goto again;
 	}
 out:
@@ -2805,7 +2802,6 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 		unpin_extent_range(fs_info, start, end, true);
 		mutex_unlock(&fs_info->unused_bg_unpin_mutex);
 		free_extent_state(cached_state);
-		cond_resched();
 	}
 
 	if (btrfs_test_opt(fs_info, DISCARD_ASYNC)) {
@@ -4416,7 +4412,6 @@ static noinline int find_free_extent(struct btrfs_root *root,
 			goto have_block_group;
 		}
 		release_block_group(block_group, ffe_ctl, ffe_ctl->delalloc);
-		cond_resched();
 	}
 	up_read(&space_info->groups_sem);
 
@@ -5037,7 +5032,6 @@ static noinline void reada_walk_down(struct btrfs_trans_handle *trans,
 		if (nread >= wc->reada_count)
 			break;
 
-		cond_resched();
 		bytenr = btrfs_node_blockptr(eb, slot);
 		generation = btrfs_node_ptr_generation(eb, slot);
 
@@ -6039,8 +6033,6 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed)
 			ret = -ERESTARTSYS;
 			break;
 		}
-
-		cond_resched();
 	}
 
 	return ret;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index caccd0376342..209911d0e873 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -227,7 +227,6 @@ static void __process_pages_contig(struct address_space *mapping,
 					 page_ops, start, end);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 }
 
@@ -291,7 +290,6 @@ static noinline int lock_delalloc_pages(struct inode *inode,
 			processed_end = page_offset(page) + PAGE_SIZE - 1;
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 
 	return 0;
@@ -401,7 +399,6 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 			      &cached_state);
 		__unlock_for_delalloc(inode, locked_page,
 			      delalloc_start, delalloc_end);
-		cond_resched();
 		goto again;
 	}
 	free_extent_state(cached_state);
@@ -1924,7 +1921,6 @@ int btree_write_cache_pages(struct address_space *mapping,
 			nr_to_write_done = wbc->nr_to_write <= 0;
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 	if (!scanned && !done) {
 		/*
@@ -2116,7 +2112,6 @@ static int extent_write_cache_pages(struct address_space *mapping,
 					    wbc->nr_to_write <= 0);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 	if (!scanned && !done) {
 		/*
@@ -2397,8 +2392,6 @@ int try_release_extent_mapping(struct page *page, gfp_t mask)
 
 			/* once for us */
 			free_extent_map(em);
-
-			cond_resched(); /* Allow large-extent preemption. */
 		}
 	}
 	return try_release_extent_state(tree, page, mask);
@@ -2698,7 +2691,6 @@ static int fiemap_process_hole(struct btrfs_inode *inode,
 		last_delalloc_end = delalloc_end;
 		cur_offset = delalloc_end + 1;
 		extent_offset += cur_offset - delalloc_start;
-		cond_resched();
 	}
 
 	/*
@@ -2986,7 +2978,6 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 			/* No more file extent items for this inode. */
 			break;
 		}
-		cond_resched();
 	}
 
 check_eof_delalloc:
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 1ce5dd154499..12cc0cfde0ff 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -1252,7 +1252,6 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
 	btrfs_mark_buffer_dirty(path->nodes[0]);
 	if (total_bytes < sums->len) {
 		btrfs_release_path(path);
-		cond_resched();
 		goto again;
 	}
 out:
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 361535c71c0f..541b6c87ddf3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1405,8 +1405,6 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 
 		btrfs_drop_pages(fs_info, pages, num_pages, pos, copied);
 
-		cond_resched();
-
 		pos += copied;
 		num_written += copied;
 	}
@@ -3376,7 +3374,6 @@ bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
 
 		prev_delalloc_end = delalloc_end;
 		cur_offset = delalloc_end + 1;
-		cond_resched();
 	}
 
 	return ret;
@@ -3654,7 +3651,6 @@ static loff_t find_desired_extent(struct file *file, loff_t offset, int whence)
 			ret = -EINTR;
 			goto out;
 		}
-		cond_resched();
 	}
 
 	/* We have an implicit hole from the last extent found up to i_size. */
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 27fad70451aa..c9606fcdc310 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3807,8 +3807,6 @@ static int trim_no_bitmap(struct btrfs_block_group *block_group,
 			ret = -ERESTARTSYS;
 			break;
 		}
-
-		cond_resched();
 	}
 
 	return ret;
@@ -4000,8 +3998,6 @@ static int trim_bitmaps(struct btrfs_block_group *block_group,
 			ret = -ERESTARTSYS;
 			break;
 		}
-
-		cond_resched();
 	}
 
 	if (offset >= end)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7814b9d654ce..789569e135cf 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1021,7 +1021,6 @@ static void compress_file_range(struct btrfs_work *work)
 			 nr_pages, compress_type);
 	if (start + total_in < end) {
 		start += total_in;
-		cond_resched();
 		goto again;
 	}
 	return;
@@ -3376,7 +3375,6 @@ void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
 		run_delayed_iput_locked(fs_info, inode);
 		if (need_resched()) {
 			spin_unlock_irq(&fs_info->delayed_iput_lock);
-			cond_resched();
 			spin_lock_irq(&fs_info->delayed_iput_lock);
 		}
 	}
@@ -4423,7 +4421,6 @@ static void btrfs_prune_dentries(struct btrfs_root *root)
 			 * cache when its usage count hits zero.
 			 */
 			iput(inode);
-			cond_resched();
 			spin_lock(&root->inode_lock);
 			goto again;
 		}
@@ -5135,7 +5132,6 @@ static void evict_inode_truncate_pages(struct inode *inode)
 				 EXTENT_CLEAR_ALL_BITS | EXTENT_DO_ACCOUNTING,
 				 &cached_state);
 
-		cond_resched();
 		spin_lock(&io_tree->lock);
 	}
 	spin_unlock(&io_tree->lock);
@@ -7209,8 +7205,6 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 
 		if (ret)
 			break;
-
-		cond_resched();
 	}
 
 	return ret;
@@ -9269,7 +9263,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
 			if (ret || wbc->nr_to_write <= 0)
 				goto out;
 		}
-		cond_resched();
 		spin_lock(&root->delalloc_lock);
 	}
 	spin_unlock(&root->delalloc_lock);
@@ -10065,7 +10058,6 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 			break;
 		btrfs_put_ordered_extent(ordered);
 		unlock_extent(io_tree, start, lockend, &cached_state);
-		cond_resched();
 	}
 
 	em = btrfs_get_extent(inode, NULL, 0, start, lockend - start + 1);
@@ -10306,7 +10298,6 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
 		if (ordered)
 			btrfs_put_ordered_extent(ordered);
 		unlock_extent(io_tree, start, end, &cached_state);
-		cond_resched();
 	}
 
 	/*
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 345c449d588c..58463c479c91 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -715,7 +715,6 @@ u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr,
 		list_add_tail(&ordered->work_list, &works);
 		btrfs_queue_work(fs_info->flush_workers, &ordered->flush_work);
 
-		cond_resched();
 		spin_lock(&root->ordered_extent_lock);
 		if (nr != U64_MAX)
 			nr--;
@@ -729,7 +728,6 @@ u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr,
 		list_del_init(&ordered->work_list);
 		wait_for_completion(&ordered->completion);
 		btrfs_put_ordered_extent(ordered);
-		cond_resched();
 	}
 	mutex_unlock(&root->ordered_extent_mutex);
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index b99230db3c82..c483648be366 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1926,7 +1926,6 @@ int btrfs_qgroup_trace_leaf_items(struct btrfs_trans_handle *trans,
 		if (ret)
 			return ret;
 	}
-	cond_resched();
 	return 0;
 }
 
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 65d2bd6910f2..6f599c275dc7 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -569,8 +569,6 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 			ret = -EINTR;
 			goto out;
 		}
-
-		cond_resched();
 	}
 	ret = 0;
 
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index c6d4bb8cbe29..7e16a6d953d9 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1094,7 +1094,6 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
 	for (i = 0; i < nritems; i++) {
 		struct btrfs_ref ref = { 0 };
 
-		cond_resched();
 		btrfs_item_key_to_cpu(leaf, &key, i);
 		if (key.type != BTRFS_EXTENT_DATA_KEY)
 			continue;
@@ -1531,7 +1530,6 @@ static int invalidate_extent_cache(struct btrfs_root *root,
 	while (1) {
 		struct extent_state *cached_state = NULL;
 
-		cond_resched();
 		iput(inode);
 
 		if (objectid > max_key->objectid)
@@ -2163,7 +2161,6 @@ struct btrfs_root *select_reloc_root(struct btrfs_trans_handle *trans,
 
 	next = node;
 	while (1) {
-		cond_resched();
 		next = walk_up_backref(next, edges, &index);
 		root = next->root;
 
@@ -2286,7 +2283,6 @@ struct btrfs_root *select_one_root(struct btrfs_backref_node *node)
 
 	next = node;
 	while (1) {
-		cond_resched();
 		next = walk_up_backref(next, edges, &index);
 		root = next->root;
 
@@ -2331,7 +2327,6 @@ u64 calcu_metadata_size(struct reloc_control *rc,
 	BUG_ON(reserve && node->processed);
 
 	while (next) {
-		cond_resched();
 		while (1) {
 			if (next->processed && (reserve || next != node))
 				break;
@@ -2426,8 +2421,6 @@ static int do_relocation(struct btrfs_trans_handle *trans,
 	list_for_each_entry(edge, &node->upper, list[LOWER]) {
 		struct btrfs_ref ref = { 0 };
 
-		cond_resched();
-
 		upper = edge->node[UPPER];
 		root = select_reloc_root(trans, rc, upper, edges);
 		if (IS_ERR(root)) {
@@ -2609,7 +2602,6 @@ static void update_processed_blocks(struct reloc_control *rc,
 	int index = 0;
 
 	while (next) {
-		cond_resched();
 		while (1) {
 			if (next->processed)
 				break;
@@ -3508,7 +3500,6 @@ int find_next_extent(struct reloc_control *rc, struct btrfs_path *path,
 	while (1) {
 		bool block_found;
 
-		cond_resched();
 		if (rc->search_start >= last) {
 			ret = 1;
 			break;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b877203f1dc5..4dba0e3b6887 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2046,9 +2046,6 @@ static int scrub_simple_mirror(struct scrub_ctx *sctx,
 			break;
 
 		cur_logical = found_logical + BTRFS_STRIPE_LEN;
-
-		/* Don't hold CPU for too long time */
-		cond_resched();
 	}
 	return ret;
 }
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 3a566150c531..503782af0b35 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -7778,7 +7778,6 @@ static int btrfs_compare_trees(struct btrfs_root *left_root,
 		if (need_resched() ||
 		    rwsem_is_contended(&fs_info->commit_root_sem)) {
 			up_read(&fs_info->commit_root_sem);
-			cond_resched();
 			down_read(&fs_info->commit_root_sem);
 		}
 
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index d7e8cd4f140c..e597c5365c71 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1211,7 +1211,6 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
 		if (!to_reclaim)
 			to_reclaim = btrfs_calc_insert_metadata_size(fs_info, 1);
 		flush_space(fs_info, space_info, to_reclaim, flush, true);
-		cond_resched();
 		spin_lock(&space_info->lock);
 	}
 
diff --git a/fs/btrfs/tests/extent-io-tests.c b/fs/btrfs/tests/extent-io-tests.c
index 1cc86af97dc6..7021025d8535 100644
--- a/fs/btrfs/tests/extent-io-tests.c
+++ b/fs/btrfs/tests/extent-io-tests.c
@@ -45,7 +45,6 @@ static noinline int process_page_range(struct inode *inode, u64 start, u64 end,
 				folio_put(folio);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 		loops++;
 		if (loops > 100000) {
 			printk(KERN_ERR
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index c780d3729463..ce5cbc12e041 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1115,7 +1115,6 @@ int btrfs_write_marked_extents(struct btrfs_fs_info *fs_info,
 			werr = filemap_fdatawait_range(mapping, start, end);
 		free_extent_state(cached_state);
 		cached_state = NULL;
-		cond_resched();
 		start = end + 1;
 	}
 	return werr;
@@ -1157,7 +1156,6 @@ static int __btrfs_wait_marked_extents(struct btrfs_fs_info *fs_info,
 			werr = err;
 		free_extent_state(cached_state);
 		cached_state = NULL;
-		cond_resched();
 		start = end + 1;
 	}
 	if (err)
@@ -1507,7 +1505,6 @@ int btrfs_defrag_root(struct btrfs_root *root)
 
 		btrfs_end_transaction(trans);
 		btrfs_btree_balance_dirty(info);
-		cond_resched();
 
 		if (btrfs_fs_closing(info) || ret != -EAGAIN)
 			break;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index cbb17b542131..3c215762a07f 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2657,11 +2657,9 @@ static noinline int walk_down_log_tree(struct btrfs_trans_handle *trans,
 		path->nodes[*level-1] = next;
 		*level = btrfs_header_level(next);
 		path->slots[*level] = 0;
-		cond_resched();
 	}
 	path->slots[*level] = btrfs_header_nritems(path->nodes[*level]);
 
-	cond_resched();
 	return 0;
 }
 
@@ -3898,7 +3896,6 @@ static noinline int log_dir_items(struct btrfs_trans_handle *trans,
 		}
 		if (need_resched()) {
 			btrfs_release_path(path);
-			cond_resched();
 			goto search;
 		}
 	}
@@ -5037,7 +5034,6 @@ static int btrfs_log_all_xattrs(struct btrfs_trans_handle *trans,
 		ins_nr++;
 		path->slots[0]++;
 		found_xattrs = true;
-		cond_resched();
 	}
 	if (ins_nr > 0) {
 		ret = copy_items(trans, inode, dst_path, path,
@@ -5135,7 +5131,6 @@ static int btrfs_log_holes(struct btrfs_trans_handle *trans,
 
 		prev_extent_end = btrfs_file_extent_end(path);
 		path->slots[0]++;
-		cond_resched();
 	}
 
 	if (prev_extent_end < i_size) {
@@ -5919,13 +5914,6 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
 		} else {
 			break;
 		}
-
-		/*
-		 * We may process many leaves full of items for our inode, so
-		 * avoid monopolizing a cpu for too long by rescheduling while
-		 * not holding locks on any tree.
-		 */
-		cond_resched();
 	}
 	if (ins_nr) {
 		ret = copy_items(trans, inode, dst_path, path, ins_start_slot,
diff --git a/fs/btrfs/uuid-tree.c b/fs/btrfs/uuid-tree.c
index 7c7001f42b14..98890e0d7b24 100644
--- a/fs/btrfs/uuid-tree.c
+++ b/fs/btrfs/uuid-tree.c
@@ -324,7 +324,6 @@ int btrfs_uuid_tree_iterate(struct btrfs_fs_info *fs_info)
 			ret = -EINTR;
 			goto out;
 		}
-		cond_resched();
 		leaf = path->nodes[0];
 		slot = path->slots[0];
 		btrfs_item_key_to_cpu(leaf, &key, slot);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b9ef6f54635c..ceda63fcc721 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1689,7 +1689,6 @@ static int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
 			search_start = extent_end;
 next:
 		path->slots[0]++;
-		cond_resched();
 	}
 
 	/*
@@ -4756,7 +4755,6 @@ int btrfs_uuid_scan_kthread(void *data)
 		} else {
 			break;
 		}
-		cond_resched();
 	}
 
 out:
diff --git a/fs/buffer.c b/fs/buffer.c
index 12e9a71c693d..a362b42bc63d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1743,7 +1743,6 @@ void clean_bdev_aliases(struct block_device *bdev, sector_t block, sector_t len)
 			folio_unlock(folio);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 		/* End of range already reached? */
 		if (index > end || !index)
 			break;
diff --git a/fs/cachefiles/cache.c b/fs/cachefiles/cache.c
index 7077f72e6f47..7f078244cc0a 100644
--- a/fs/cachefiles/cache.c
+++ b/fs/cachefiles/cache.c
@@ -299,9 +299,7 @@ static void cachefiles_withdraw_objects(struct cachefiles_cache *cache)
 		fscache_withdraw_cookie(object->cookie);
 		count++;
 		if ((count & 63) == 0) {
-			spin_unlock(&cache->object_list_lock);
-			cond_resched();
-			spin_lock(&cache->object_list_lock);
+			cond_resched_lock(&cache->object_list_lock);
 		}
 	}
 
diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index 7bf7a5fcc045..3fa8a2ecb299 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -353,7 +353,6 @@ int cachefiles_bury_object(struct cachefiles_cache *cache,
 		unlock_rename(cache->graveyard, dir);
 		dput(grave);
 		grave = NULL;
-		cond_resched();
 		goto try_again;
 	}
 
diff --git a/fs/cachefiles/volume.c b/fs/cachefiles/volume.c
index 89df0ba8ba5e..6a4d9d87c68c 100644
--- a/fs/cachefiles/volume.c
+++ b/fs/cachefiles/volume.c
@@ -62,7 +62,6 @@ void cachefiles_acquire_volume(struct fscache_volume *vcookie)
 			cachefiles_bury_object(cache, NULL, cache->store, vdentry,
 					       FSCACHE_VOLUME_IS_WEIRD);
 			cachefiles_put_directory(volume->dentry);
-			cond_resched();
 			goto retry;
 		}
 	}
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index f4863078f7fe..f2be2adf5d41 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1375,7 +1375,6 @@ static int ceph_writepages_start(struct address_space *mapping,
 					wait_on_page_writeback(page);
 				}
 				folio_batch_release(&fbatch);
-				cond_resched();
 			}
 		}
 
diff --git a/fs/dax.c b/fs/dax.c
index 93cf6e8d8990..f68e026e6ec4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -986,7 +986,6 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, end) {
 		pfn_mkclean_range(pfn, count, index, vma);
-		cond_resched();
 	}
 	i_mmap_unlock_read(mapping);
 
diff --git a/fs/dcache.c b/fs/dcache.c
index 25ac74d30bff..3f5b4adba111 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -619,7 +619,6 @@ static void __dentry_kill(struct dentry *dentry)
 	spin_unlock(&dentry->d_lock);
 	if (likely(can_free))
 		dentry_free(dentry);
-	cond_resched();
 }
 
 static struct dentry *__lock_parent(struct dentry *dentry)
@@ -1629,7 +1628,6 @@ void shrink_dcache_parent(struct dentry *parent)
 			continue;
 		}
 
-		cond_resched();
 		if (!data.found)
 			break;
 		data.victim = NULL;
diff --git a/fs/dlm/ast.c b/fs/dlm/ast.c
index 1f2f70a1b824..d6f36527814f 100644
--- a/fs/dlm/ast.c
+++ b/fs/dlm/ast.c
@@ -261,7 +261,6 @@ void dlm_callback_resume(struct dlm_ls *ls)
 	sum += count;
 	if (!empty) {
 		count = 0;
-		cond_resched();
 		goto more;
 	}
 
diff --git a/fs/dlm/dir.c b/fs/dlm/dir.c
index f6acba4310a7..d8b24f9bb744 100644
--- a/fs/dlm/dir.c
+++ b/fs/dlm/dir.c
@@ -94,8 +94,6 @@ int dlm_recover_directory(struct dlm_ls *ls, uint64_t seq)
 			if (error)
 				goto out_free;
 
-			cond_resched();
-
 			/*
 			 * pick namelen/name pairs out of received buffer
 			 */
diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c
index 652c51fbbf76..6bf02cbc5550 100644
--- a/fs/dlm/lock.c
+++ b/fs/dlm/lock.c
@@ -1713,7 +1713,6 @@ void dlm_scan_rsbs(struct dlm_ls *ls)
 		shrink_bucket(ls, i);
 		if (dlm_locking_stopped(ls))
 			break;
-		cond_resched();
 	}
 }
 
@@ -5227,7 +5226,6 @@ void dlm_recover_purge(struct dlm_ls *ls)
 		}
 		unlock_rsb(r);
 		unhold_rsb(r);
-		cond_resched();
 	}
 	up_write(&ls->ls_root_sem);
 
@@ -5302,7 +5300,6 @@ void dlm_recover_grant(struct dlm_ls *ls)
 		confirm_master(r, 0);
 		unlock_rsb(r);
 		put_rsb(r);
-		cond_resched();
 	}
 
 	if (lkb_count)
diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index f7bc22e74db2..494ede3678d6 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -562,7 +562,6 @@ int dlm_lowcomms_connect_node(int nodeid)
 	up_read(&con->sock_lock);
 	srcu_read_unlock(&connections_srcu, idx);
 
-	cond_resched();
 	return 0;
 }
 
@@ -1504,7 +1503,6 @@ static void process_recv_sockets(struct work_struct *work)
 		/* CF_RECV_PENDING cleared */
 		break;
 	case DLM_IO_RESCHED:
-		cond_resched();
 		queue_work(io_workqueue, &con->rwork);
 		/* CF_RECV_PENDING not cleared */
 		break;
@@ -1650,7 +1648,6 @@ static void process_send_sockets(struct work_struct *work)
 		break;
 	case DLM_IO_RESCHED:
 		/* CF_SEND_PENDING not cleared */
-		cond_resched();
 		queue_work(io_workqueue, &con->swork);
 		break;
 	default:
diff --git a/fs/dlm/recover.c b/fs/dlm/recover.c
index 53917c0aa3c0..6d9b074631ff 100644
--- a/fs/dlm/recover.c
+++ b/fs/dlm/recover.c
@@ -545,7 +545,6 @@ int dlm_recover_masters(struct dlm_ls *ls, uint64_t seq)
 		else
 			error = recover_master(r, &count, seq);
 		unlock_rsb(r);
-		cond_resched();
 		total++;
 
 		if (error) {
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index b9575957a7c2..3409677acfae 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -41,7 +41,6 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		iput(toput_inode);
 		toput_inode = inode;
 
-		cond_resched();
 		spin_lock(&sb->s_inode_list_lock);
 	}
 	spin_unlock(&sb->s_inode_list_lock);
diff --git a/fs/erofs/utils.c b/fs/erofs/utils.c
index cc6fb9e98899..f32ff29392d1 100644
--- a/fs/erofs/utils.c
+++ b/fs/erofs/utils.c
@@ -93,7 +93,6 @@ struct erofs_workgroup *erofs_insert_workgroup(struct super_block *sb,
 		} else if (!erofs_workgroup_get(pre)) {
 			/* try to legitimize the current in-tree one */
 			xa_unlock(&sbi->managed_pslots);
-			cond_resched();
 			goto repeat;
 		}
 		lockref_put_return(&grp->lockref);
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 036f610e044b..20ae6af8a9d6 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -697,8 +697,13 @@ static void z_erofs_cache_invalidate_folio(struct folio *folio,
 	DBG_BUGON(stop > folio_size(folio) || stop < length);
 
 	if (offset == 0 && stop == folio_size(folio))
+		/*
+		 * This is a seemingly tight loop, but if rescheduling is
+		 * needed, preemption can happen in
+		 * z_erofs_cache_release_folio() via its spin_unlock() call.
+		 */
 		while (!z_erofs_cache_release_folio(folio, GFP_NOFS))
-			cond_resched();
+			;
 }
 
 static const struct address_space_operations z_erofs_cache_aops = {
@@ -1527,7 +1532,6 @@ static struct page *pickup_page_for_submission(struct z_erofs_pcluster *pcl,
 	if (oldpage != cmpxchg(&pcl->compressed_bvecs[nr].page,
 			       oldpage, page)) {
 		erofs_pagepool_add(pagepool, page);
-		cond_resched();
 		goto repeat;
 	}
 out_tocache:
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1d9a71a0c4c1..45794a9da768 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -801,7 +801,6 @@ static void ep_clear_and_put(struct eventpoll *ep)
 		epi = rb_entry(rbp, struct epitem, rbn);
 
 		ep_unregister_pollwait(ep, epi);
-		cond_resched();
 	}
 
 	/*
@@ -816,7 +815,6 @@ static void ep_clear_and_put(struct eventpoll *ep)
 		next = rb_next(rbp);
 		epi = rb_entry(rbp, struct epitem, rbn);
 		ep_remove_safe(ep, epi);
-		cond_resched();
 	}
 
 	dispose = ep_refcount_dec_and_test(ep);
@@ -1039,7 +1037,6 @@ static struct epitem *ep_find_tfd(struct eventpoll *ep, int tfd, unsigned long t
 			else
 				toff--;
 		}
-		cond_resched();
 	}
 
 	return NULL;
diff --git a/fs/exec.c b/fs/exec.c
index 6518e33ea813..ca3b25054e3f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -451,7 +451,6 @@ static int count(struct user_arg_ptr argv, int max)
 
 			if (fatal_signal_pending(current))
 				return -ERESTARTNOHAND;
-			cond_resched();
 		}
 	}
 	return i;
@@ -469,7 +468,6 @@ static int count_strings_kernel(const char *const *argv)
 			return -E2BIG;
 		if (fatal_signal_pending(current))
 			return -ERESTARTNOHAND;
-		cond_resched();
 	}
 	return i;
 }
@@ -562,7 +560,6 @@ static int copy_strings(int argc, struct user_arg_ptr argv,
 				ret = -ERESTARTNOHAND;
 				goto out;
 			}
-			cond_resched();
 
 			offset = pos % PAGE_SIZE;
 			if (offset == 0)
@@ -661,7 +658,6 @@ static int copy_strings_kernel(int argc, const char *const *argv,
 			return ret;
 		if (fatal_signal_pending(current))
 			return -ERESTARTNOHAND;
-		cond_resched();
 	}
 	return 0;
 }
diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c
index 6fe3c941b565..1a7baca041cf 100644
--- a/fs/ext4/block_validity.c
+++ b/fs/ext4/block_validity.c
@@ -162,7 +162,6 @@ static int ext4_protect_reserved_inode(struct super_block *sb,
 		return PTR_ERR(inode);
 	num = (inode->i_size + sb->s_blocksize - 1) >> sb->s_blocksize_bits;
 	while (i < num) {
-		cond_resched();
 		map.m_lblk = i;
 		map.m_len = num - i;
 		n = ext4_map_blocks(NULL, inode, &map, 0);
@@ -224,7 +223,6 @@ int ext4_setup_system_zone(struct super_block *sb)
 	for (i=0; i < ngroups; i++) {
 		unsigned int meta_blks = ext4_num_base_meta_blocks(sb, i);
 
-		cond_resched();
 		if (meta_blks != 0) {
 			ret = add_system_zone(system_blks,
 					ext4_group_first_block_no(sb, i),
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 3985f8c33f95..cb7d2427be8b 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -174,7 +174,6 @@ static int ext4_readdir(struct file *file, struct dir_context *ctx)
 			err = -ERESTARTSYS;
 			goto errout;
 		}
-		cond_resched();
 		offset = ctx->pos & (sb->s_blocksize - 1);
 		map.m_lblk = ctx->pos >> EXT4_BLOCK_SIZE_BITS(sb);
 		map.m_len = 1;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 202c76996b62..79851e582c7d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3001,7 +3001,6 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 			}
 			/* Yield here to deal with large extent trees.
 			 * Should be a no-op if we did IO above. */
-			cond_resched();
 			if (WARN_ON(i + 1 > depth)) {
 				err = -EFSCORRUPTED;
 				break;
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index b65058d972f9..25d78953eec9 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1482,7 +1482,6 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
 		if (!gdp)
 			continue;
 		desc_count += ext4_free_inodes_count(sb, gdp);
-		cond_resched();
 	}
 	return desc_count;
 #endif
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4ce35f1c8b0a..1c3af3a8fe2e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2491,7 +2491,6 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
 			}
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 	mpd->scanned_until_end = 1;
 	if (handle)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 1e599305d85f..074b5cdea363 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2843,7 +2843,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		     ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) {
 			int ret = 0;
 
-			cond_resched();
 			if (new_cr != cr) {
 				cr = new_cr;
 				goto repeat;
@@ -3387,7 +3386,6 @@ static int ext4_mb_init_backend(struct super_block *sb)
 	sbi->s_buddy_cache->i_ino = EXT4_BAD_INO;
 	EXT4_I(sbi->s_buddy_cache)->i_disksize = 0;
 	for (i = 0; i < ngroups; i++) {
-		cond_resched();
 		desc = ext4_get_group_desc(sb, i, NULL);
 		if (desc == NULL) {
 			ext4_msg(sb, KERN_ERR, "can't read descriptor %u", i);
@@ -3746,7 +3744,6 @@ int ext4_mb_release(struct super_block *sb)
 
 	if (sbi->s_group_info) {
 		for (i = 0; i < ngroups; i++) {
-			cond_resched();
 			grinfo = ext4_get_group_info(sb, i);
 			if (!grinfo)
 				continue;
@@ -6034,7 +6031,6 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
 		ret = ext4_mb_discard_group_preallocations(sb, i, &busy);
 		freed += ret;
 		needed -= ret;
-		cond_resched();
 	}
 
 	if (needed > 0 && busy && ++retry < 3) {
@@ -6173,8 +6169,6 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 		while (ar->len &&
 			ext4_claim_free_clusters(sbi, ar->len, ar->flags)) {
 
-			/* let others to free the space */
-			cond_resched();
 			ar->len = ar->len >> 1;
 		}
 		if (!ar->len) {
@@ -6720,7 +6714,6 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
 		int is_metadata = flags & EXT4_FREE_BLOCKS_METADATA;
 
 		for (i = 0; i < count; i++) {
-			cond_resched();
 			if (is_metadata)
 				bh = sb_find_get_block(inode->i_sb, block + i);
 			ext4_forget(handle, is_metadata, inode, bh, block + i);
@@ -6959,8 +6952,11 @@ __releases(ext4_group_lock_ptr(sb, e4b->bd_group))
 			return count;
 
 		if (need_resched()) {
+			/*
+			 * Rescheduling can implicitly happen after the
+			 * unlock.
+			 */
 			ext4_unlock_group(sb, e4b->bd_group);
-			cond_resched();
 			ext4_lock_group(sb, e4b->bd_group);
 		}
 
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index bbda587f76b8..2ab27008c4dd 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1255,7 +1255,6 @@ int ext4_htree_fill_tree(struct file *dir_file, __u32 start_hash,
 			err = -ERESTARTSYS;
 			goto errout;
 		}
-		cond_resched();
 		block = dx_get_block(frame->at);
 		ret = htree_dirblock_to_tree(dir_file, dir, block, &hinfo,
 					     start_hash, start_minor_hash);
@@ -1341,7 +1340,6 @@ static int dx_make_map(struct inode *dir, struct buffer_head *bh,
 			map_tail->size = ext4_rec_len_from_disk(de->rec_len,
 								blocksize);
 			count++;
-			cond_resched();
 		}
 		de = ext4_next_entry(de, blocksize);
 	}
@@ -1658,7 +1656,6 @@ static struct buffer_head *__ext4_find_entry(struct inode *dir,
 		/*
 		 * We deal with the read-ahead logic here.
 		 */
-		cond_resched();
 		if (ra_ptr >= ra_max) {
 			/* Refill the readahead buffer */
 			ra_ptr = 0;
diff --git a/fs/ext4/orphan.c b/fs/ext4/orphan.c
index e5b47dda3317..fb04e8bccd3c 100644
--- a/fs/ext4/orphan.c
+++ b/fs/ext4/orphan.c
@@ -67,7 +67,6 @@ static int ext4_orphan_file_add(handle_t *handle, struct inode *inode)
 				atomic_inc(&oi->of_binfo[i].ob_free_entries);
 				return -ENOSPC;
 			}
-			cond_resched();
 		}
 		while (bdata[j]) {
 			if (++j >= inodes_per_ob) {
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index dbebd8b3127e..170c75323300 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3861,7 +3861,6 @@ static int ext4_lazyinit_thread(void *arg)
 		cur = jiffies;
 		if ((time_after_eq(cur, next_wakeup)) ||
 		    (MAX_JIFFY_OFFSET == next_wakeup)) {
-			cond_resched();
 			continue;
 		}
 
@@ -4226,7 +4225,6 @@ int ext4_calculate_overhead(struct super_block *sb)
 		overhead += blks;
 		if (blks)
 			memset(buf, 0, PAGE_SIZE);
-		cond_resched();
 	}
 
 	/*
diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index b0597a539fc5..20ea41b5814c 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -45,7 +45,6 @@ struct page *f2fs_grab_meta_page(struct f2fs_sb_info *sbi, pgoff_t index)
 repeat:
 	page = f2fs_grab_cache_page(mapping, index, false);
 	if (!page) {
-		cond_resched();
 		goto repeat;
 	}
 	f2fs_wait_on_page_writeback(page, META, true, true);
@@ -76,7 +75,6 @@ static struct page *__get_meta_page(struct f2fs_sb_info *sbi, pgoff_t index,
 repeat:
 	page = f2fs_grab_cache_page(mapping, index, false);
 	if (!page) {
-		cond_resched();
 		goto repeat;
 	}
 	if (PageUptodate(page))
@@ -463,7 +461,6 @@ long f2fs_sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type,
 				break;
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 stop:
 	if (nwritten)
@@ -1111,9 +1108,13 @@ int f2fs_sync_dirty_inodes(struct f2fs_sb_info *sbi, enum inode_type type,
 			F2FS_I(inode)->cp_task = NULL;
 
 		iput(inode);
-		/* We need to give cpu to another writers. */
+		/*
+		 * We need to give the CPU to other writers, but
+		 * cond_resched_stall() does not guarantee that. Perhaps we
+		 * should explicitly wait on an event or a timeout?
+		 */
 		if (ino == cur_ino)
-			cond_resched();
+			cond_resched_stall();
 		else
 			ino = cur_ino;
 	} else {
@@ -1122,7 +1123,6 @@ int f2fs_sync_dirty_inodes(struct f2fs_sb_info *sbi, enum inode_type type,
 		 * writebacking dentry pages in the freeing inode.
 		 */
 		f2fs_submit_merged_write(sbi, DATA);
-		cond_resched();
 	}
 	goto retry;
 }
@@ -1229,7 +1229,6 @@ static int block_operations(struct f2fs_sb_info *sbi)
 		f2fs_quota_sync(sbi->sb, -1);
 		if (locked)
 			up_read(&sbi->sb->s_umount);
-		cond_resched();
 		goto retry_flush_quotas;
 	}
 
@@ -1240,7 +1239,6 @@ static int block_operations(struct f2fs_sb_info *sbi)
 		err = f2fs_sync_dirty_inodes(sbi, DIR_INODE, true);
 		if (err)
 			return err;
-		cond_resched();
 		goto retry_flush_quotas;
 	}
 
@@ -1256,7 +1254,6 @@ static int block_operations(struct f2fs_sb_info *sbi)
 		err = f2fs_sync_inode_meta(sbi);
 		if (err)
 			return err;
-		cond_resched();
 		goto retry_flush_quotas;
 	}
 
@@ -1273,7 +1270,6 @@ static int block_operations(struct f2fs_sb_info *sbi)
 			f2fs_unlock_all(sbi);
 			return err;
 		}
-		cond_resched();
 		goto retry_flush_nodes;
 	}
 
diff --git a/fs/f2fs/compress.c b/fs/f2fs/compress.c
index d820801f473e..39a2a974e087 100644
--- a/fs/f2fs/compress.c
+++ b/fs/f2fs/compress.c
@@ -1941,7 +1941,6 @@ void f2fs_invalidate_compress_pages(struct f2fs_sb_info *sbi, nid_t ino)
 			folio_unlock(folio);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	} while (index < end);
 }
 
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 916e317ac925..dfde82cab326 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2105,7 +2105,6 @@ int f2fs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	}
 
 prep_next:
-	cond_resched();
 	if (fatal_signal_pending(current))
 		ret = -EINTR;
 	else
@@ -3250,7 +3249,6 @@ static int f2fs_write_cache_pages(struct address_space *mapping,
 				goto readd;
 		}
 		release_pages(pages, nr_pages);
-		cond_resched();
 	}
 #ifdef CONFIG_F2FS_FS_COMPRESSION
 	/* flush remained pages in compress cluster */
@@ -3981,7 +3979,6 @@ static int check_swap_activate(struct swap_info_struct *sis,
 	while (cur_lblock < last_lblock && cur_lblock < sis->max) {
 		struct f2fs_map_blocks map;
 retry:
-		cond_resched();
 
 		memset(&map, 0, sizeof(map));
 		map.m_lblk = cur_lblock;
diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
index 8aa29fe2e87b..fc15a05fa807 100644
--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -1090,7 +1090,6 @@ static int f2fs_readdir(struct file *file, struct dir_context *ctx)
 			err = -ERESTARTSYS;
 			goto out_free;
 		}
-		cond_resched();
 
 		/* readahead for multi pages of dir */
 		if (npages - n > 1 && !ra_has_index(ra, n))
diff --git a/fs/f2fs/extent_cache.c b/fs/f2fs/extent_cache.c
index 0e2d49140c07..b87946f33a5f 100644
--- a/fs/f2fs/extent_cache.c
+++ b/fs/f2fs/extent_cache.c
@@ -936,7 +936,6 @@ static unsigned int __shrink_extent_tree(struct f2fs_sb_info *sbi, int nr_shrink
 
 		if (node_cnt + tree_cnt >= nr_shrink)
 			goto unlock_out;
-		cond_resched();
 	}
 	mutex_unlock(&eti->extent_tree_lock);
 
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 6d688e42d89c..073e6fd1986d 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -2849,8 +2849,12 @@ static inline bool is_idle(struct f2fs_sb_info *sbi, int type)
 static inline void f2fs_radix_tree_insert(struct radix_tree_root *root,
 				unsigned long index, void *item)
 {
+	/*
+	 * Retry the insert in a tight loop. The scheduler will
+	 * preempt when necessary.
+	 */
 	while (radix_tree_insert(root, index, item))
-		cond_resched();
+		;
 }
 
 #define RAW_IS_INODE(p)	((p)->footer.nid == (p)->footer.ino)
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index ca5904129b16..0ac3dc5dafee 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -3922,7 +3922,6 @@ static int f2fs_sec_trim_file(struct file *filp, unsigned long arg)
 			ret = -EINTR;
 			goto out;
 		}
-		cond_resched();
 	}
 
 	if (len)
@@ -4110,7 +4109,6 @@ static int f2fs_ioc_decompress_file(struct file *filp)
 		count -= cluster_size;
 		page_idx += cluster_size;
 
-		cond_resched();
 		if (fatal_signal_pending(current)) {
 			ret = -EINTR;
 			break;
@@ -4188,7 +4186,6 @@ static int f2fs_ioc_compress_file(struct file *filp)
 		count -= cluster_size;
 		page_idx += cluster_size;
 
-		cond_resched();
 		if (fatal_signal_pending(current)) {
 			ret = -EINTR;
 			break;
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index ee2e1dd64f25..8187b6ad119a 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1579,7 +1579,6 @@ static struct page *last_fsync_dnode(struct f2fs_sb_info *sbi, nid_t ino)
 			unlock_page(page);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 	return last_page;
 }
@@ -1841,7 +1840,6 @@ int f2fs_fsync_node_pages(struct f2fs_sb_info *sbi, struct inode *inode,
 			}
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 
 		if (ret || marked)
 			break;
@@ -1944,7 +1942,6 @@ void f2fs_flush_inline_data(struct f2fs_sb_info *sbi)
 			unlock_page(page);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 }
 
@@ -2046,7 +2043,6 @@ int f2fs_sync_node_pages(struct f2fs_sb_info *sbi,
 				break;
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 
 		if (wbc->nr_to_write == 0) {
 			step = 2;
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index a8c8232852bb..09667bd8ecf7 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -2705,7 +2705,6 @@ static ssize_t f2fs_quota_write(struct super_block *sb, int type,
 		towrite -= tocopy;
 		off += tocopy;
 		data += tocopy;
-		cond_resched();
 	}
 
 	if (len == towrite)
diff --git a/fs/fat/fatent.c b/fs/fat/fatent.c
index 1db348f8f887..96d9f1632f2a 100644
--- a/fs/fat/fatent.c
+++ b/fs/fat/fatent.c
@@ -741,7 +741,6 @@ int fat_count_free_clusters(struct super_block *sb)
 			if (ops->ent_get(&fatent) == FAT_ENT_FREE)
 				free++;
 		} while (fat_ent_next(sbi, &fatent));
-		cond_resched();
 	}
 	sbi->free_clusters = free;
 	sbi->free_clus_valid = 1;
@@ -822,7 +821,6 @@ int fat_trim_fs(struct inode *inode, struct fstrim_range *range)
 		if (need_resched()) {
 			fatent_brelse(&fatent);
 			unlock_fat(sbi);
-			cond_resched();
 			lock_fat(sbi);
 		}
 	}
diff --git a/fs/file.c b/fs/file.c
index 3e4a4dfa38fc..8ae2cec580a9 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -428,10 +428,8 @@ static struct fdtable *close_files(struct files_struct * files)
 		while (set) {
 			if (set & 1) {
 				struct file * file = xchg(&fdt->fd[i], NULL);
-				if (file) {
+				if (file)
 					filp_close(file, files);
-					cond_resched();
-				}
 			}
 			i++;
 			set >>= 1;
@@ -708,11 +706,9 @@ static inline void __range_close(struct files_struct *files, unsigned int fd,
 		if (file) {
 			spin_unlock(&files->file_lock);
 			filp_close(file, files);
-			cond_resched();
 			spin_lock(&files->file_lock);
 		} else if (need_resched()) {
 			spin_unlock(&files->file_lock);
-			cond_resched();
 			spin_lock(&files->file_lock);
 		}
 	}
@@ -845,7 +841,6 @@ void do_close_on_exec(struct files_struct *files)
 			__put_unused_fd(files, fd);
 			spin_unlock(&files->file_lock);
 			filp_close(file, files);
-			cond_resched();
 			spin_lock(&files->file_lock);
 		}
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c1af01b2c42d..bf311aeb058b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1914,7 +1914,6 @@ static long writeback_sb_inodes(struct super_block *sb,
 			 * give up the CPU.
 			 */
 			blk_flush_plug(current->plug, false);
-			cond_resched();
 		}
 
 		/*
@@ -2621,8 +2620,6 @@ static void wait_sb_inodes(struct super_block *sb)
 		 */
 		filemap_fdatawait_keep_errors(mapping);
 
-		cond_resched();
-
 		iput(inode);
 
 		rcu_read_lock();
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index c26d48355cc2..4d5bc99b6301 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -357,7 +357,6 @@ static int gfs2_write_cache_jdata(struct address_space *mapping,
 		if (ret > 0)
 			ret = 0;
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 
 	if (!cycled && !done) {
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index ef7017fb6951..2eb057461023 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1592,7 +1592,6 @@ static int sweep_bh_for_rgrps(struct gfs2_inode *ip, struct gfs2_holder *rd_gh,
 			buf_in_tr = false;
 		}
 		gfs2_glock_dq_uninit(rd_gh);
-		cond_resched();
 		goto more_rgrps;
 	}
 out:
@@ -1962,7 +1961,6 @@ static int punch_hole(struct gfs2_inode *ip, u64 offset, u64 length)
 	if (current->journal_info) {
 		up_write(&ip->i_rw_mutex);
 		gfs2_trans_end(sdp);
-		cond_resched();
 	}
 	gfs2_quota_unhold(ip);
 out_metapath:
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 4a280be229a6..a1eca3d9857c 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -2073,7 +2073,7 @@ static void glock_hash_walk(glock_examiner examiner, const struct gfs2_sbd *sdp)
 		}
 
 		rhashtable_walk_stop(&iter);
-	} while (cond_resched(), gl == ERR_PTR(-EAGAIN));
+	} while (gl == ERR_PTR(-EAGAIN));
 
 	rhashtable_walk_exit(&iter);
 }
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index e5271ae87d1c..7567a29eeb21 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -143,7 +143,6 @@ __acquires(&sdp->sd_ail_lock)
 		ret = write_cache_pages(mapping, wbc, __gfs2_writepage, mapping);
 		if (need_resched()) {
 			blk_finish_plug(plug);
-			cond_resched();
 			blk_start_plug(plug);
 		}
 		spin_lock(&sdp->sd_ail_lock);
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 33ca04733e93..8ae07f0871b1 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1774,7 +1774,6 @@ static void gfs2_evict_inodes(struct super_block *sb)
 		iput(toput_inode);
 		toput_inode = inode;
 
-		cond_resched();
 		spin_lock(&sb->s_inode_list_lock);
 	}
 	spin_unlock(&sb->s_inode_list_lock);
diff --git a/fs/hpfs/buffer.c b/fs/hpfs/buffer.c
index d39246865c51..88459fea4548 100644
--- a/fs/hpfs/buffer.c
+++ b/fs/hpfs/buffer.c
@@ -77,8 +77,6 @@ void *hpfs_map_sector(struct super_block *s, unsigned secno, struct buffer_head
 
 	hpfs_prefetch_sectors(s, secno, ahead);
 
-	cond_resched();
-
 	*bhp = bh = sb_bread(s, hpfs_search_hotfix_map(s, secno));
 	if (bh != NULL)
 		return bh->b_data;
@@ -97,8 +95,6 @@ void *hpfs_get_sector(struct super_block *s, unsigned secno, struct buffer_head
 
 	hpfs_lock_assert(s);
 
-	cond_resched();
-
 	if ((*bhp = bh = sb_getblk(s, hpfs_search_hotfix_map(s, secno))) != NULL) {
 		if (!buffer_uptodate(bh)) wait_on_buffer(bh);
 		set_buffer_uptodate(bh);
@@ -118,8 +114,6 @@ void *hpfs_map_4sectors(struct super_block *s, unsigned secno, struct quad_buffe
 
 	hpfs_lock_assert(s);
 
-	cond_resched();
-
 	if (secno & 3) {
 		pr_err("%s(): unaligned read\n", __func__);
 		return NULL;
@@ -168,8 +162,6 @@ void *hpfs_map_4sectors(struct super_block *s, unsigned secno, struct quad_buffe
 void *hpfs_get_4sectors(struct super_block *s, unsigned secno,
                           struct quad_buffer_head *qbh)
 {
-	cond_resched();
-
 	hpfs_lock_assert(s);
 
 	if (secno & 3) {
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 316c4cebd3f3..21da053bdaaa 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -689,7 +689,6 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 
 	if (truncate_op)
@@ -867,8 +866,6 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		struct folio *folio;
 		unsigned long addr;
 
-		cond_resched();
-
 		/*
 		 * fallocate(2) manpage permits EINTR; we may have been
 		 * interrupted because we are using up too much memory.
diff --git a/fs/inode.c b/fs/inode.c
index 84bc3c76e5cc..f2898988bf40 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -695,7 +695,6 @@ static void dispose_list(struct list_head *head)
 		list_del_init(&inode->i_lru);
 
 		evict(inode);
-		cond_resched();
 	}
 }
 
@@ -737,7 +736,6 @@ void evict_inodes(struct super_block *sb)
 		 */
 		if (need_resched()) {
 			spin_unlock(&sb->s_inode_list_lock);
-			cond_resched();
 			dispose_list(&dispose);
 			goto again;
 		}
@@ -778,7 +776,6 @@ void invalidate_inodes(struct super_block *sb)
 		list_add(&inode->i_lru, &dispose);
 		if (need_resched()) {
 			spin_unlock(&sb->s_inode_list_lock);
-			cond_resched();
 			dispose_list(&dispose);
 			goto again;
 		}
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 2bc0aa23fde3..a76faf26b06e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -927,7 +927,6 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		if (unlikely(copied != status))
 			iov_iter_revert(i, copied - status);
 
-		cond_resched();
 		if (unlikely(status == 0)) {
 			/*
 			 * A short copy made iomap_write_end() reject the
@@ -1296,8 +1295,6 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
 		if (WARN_ON_ONCE(bytes == 0))
 			return -EIO;
 
-		cond_resched();
-
 		pos += bytes;
 		written += bytes;
 		length -= bytes;
@@ -1533,10 +1530,8 @@ iomap_finish_ioends(struct iomap_ioend *ioend, int error)
 	completions = iomap_finish_ioend(ioend, error);
 
 	while (!list_empty(&tmp)) {
-		if (completions > IOEND_BATCH_SIZE * 8) {
-			cond_resched();
+		if (completions > IOEND_BATCH_SIZE * 8)
 			completions = 0;
-		}
 		ioend = list_first_entry(&tmp, struct iomap_ioend, io_list);
 		list_del_init(&ioend->io_list);
 		completions += iomap_finish_ioend(ioend, error);
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 118699fff2f9..1f3c0813d0be 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -457,7 +457,6 @@ unsigned long jbd2_journal_shrink_checkpoint_list(journal_t *journal,
 	}
 
 	spin_unlock(&journal->j_list_lock);
-	cond_resched();
 
 	if (*nr_to_scan && next_tid)
 		goto again;
@@ -529,7 +528,6 @@ void jbd2_journal_destroy_checkpoint(journal_t *journal)
 		}
 		__jbd2_journal_clean_checkpoint_list(journal, true);
 		spin_unlock(&journal->j_list_lock);
-		cond_resched();
 	}
 }
 
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 8d6f934c3d95..db7052ee0c62 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -729,7 +729,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 				bh->b_end_io = journal_end_buffer_io_sync;
 				submit_bh(REQ_OP_WRITE | REQ_SYNC, bh);
 			}
-			cond_resched();
 
 			/* Force a new descriptor to be generated next
                            time round the loop. */
@@ -811,7 +810,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 						    b_assoc_buffers);
 
 		wait_on_buffer(bh);
-		cond_resched();
 
 		if (unlikely(!buffer_uptodate(bh)))
 			err = -EIO;
@@ -854,7 +852,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 
 		bh = list_entry(log_bufs.prev, struct buffer_head, b_assoc_buffers);
 		wait_on_buffer(bh);
-		cond_resched();
 
 		if (unlikely(!buffer_uptodate(bh)))
 			err = -EIO;
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index c269a7d29a46..fbc419d36cd0 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -509,8 +509,6 @@ static int do_one_pass(journal_t *journal,
 		struct buffer_head *	obh;
 		struct buffer_head *	nbh;
 
-		cond_resched();
-
 		/* If we already know where to stop the log traversal,
 		 * check right now that we haven't gone past the end of
 		 * the log. */
diff --git a/fs/jffs2/build.c b/fs/jffs2/build.c
index 6ae9d6fefb86..4f9539211306 100644
--- a/fs/jffs2/build.c
+++ b/fs/jffs2/build.c
@@ -121,10 +121,8 @@ static int jffs2_build_filesystem(struct jffs2_sb_info *c)
 	c->flags |= JFFS2_SB_FLAG_BUILDING;
 	/* Now scan the directory tree, increasing nlink according to every dirent found. */
 	for_each_inode(i, c, ic) {
-		if (ic->scan_dents) {
+		if (ic->scan_dents)
 			jffs2_build_inode_pass1(c, ic, &dir_hardlinks);
-			cond_resched();
-		}
 	}
 
 	dbg_fsbuild("pass 1 complete\n");
@@ -141,7 +139,6 @@ static int jffs2_build_filesystem(struct jffs2_sb_info *c)
 			continue;
 
 		jffs2_build_remove_unlinked_inode(c, ic, &dead_fds);
-		cond_resched();
 	}
 
 	dbg_fsbuild("pass 2a starting\n");
@@ -209,7 +206,6 @@ static int jffs2_build_filesystem(struct jffs2_sb_info *c)
 			jffs2_free_full_dirent(fd);
 		}
 		ic->scan_dents = NULL;
-		cond_resched();
 	}
 	ret = jffs2_build_xattr_subsystem(c);
 	if (ret)
diff --git a/fs/jffs2/erase.c b/fs/jffs2/erase.c
index acd32f05b519..a2706246a68e 100644
--- a/fs/jffs2/erase.c
+++ b/fs/jffs2/erase.c
@@ -143,8 +143,6 @@ int jffs2_erase_pending_blocks(struct jffs2_sb_info *c, int count)
 			BUG();
 		}
 
-		/* Be nice */
-		cond_resched();
 		mutex_lock(&c->erase_free_sem);
 		spin_lock(&c->erase_completion_lock);
 	}
@@ -387,7 +385,6 @@ static int jffs2_block_check_erase(struct jffs2_sb_info *c, struct jffs2_erasebl
 			}
 		}
 		ofs += readlen;
-		cond_resched();
 	}
 	ret = 0;
 fail:
diff --git a/fs/jffs2/gc.c b/fs/jffs2/gc.c
index 5c6602f3c189..3ba9054ac63c 100644
--- a/fs/jffs2/gc.c
+++ b/fs/jffs2/gc.c
@@ -923,8 +923,6 @@ static int jffs2_garbage_collect_deletion_dirent(struct jffs2_sb_info *c, struct
 
 		for (raw = f->inocache->nodes; raw != (void *)f->inocache; raw = raw->next_in_ino) {
 
-			cond_resched();
-
 			/* We only care about obsolete ones */
 			if (!(ref_obsolete(raw)))
 				continue;
diff --git a/fs/jffs2/nodelist.c b/fs/jffs2/nodelist.c
index b86c78d178c6..7a56a5fb1637 100644
--- a/fs/jffs2/nodelist.c
+++ b/fs/jffs2/nodelist.c
@@ -578,7 +578,6 @@ void jffs2_kill_fragtree(struct rb_root *root, struct jffs2_sb_info *c)
 		}
 
 		jffs2_free_node_frag(frag);
-		cond_resched();
 	}
 }
 
diff --git a/fs/jffs2/nodemgmt.c b/fs/jffs2/nodemgmt.c
index a7bbe879cfc3..5f9ab75540f4 100644
--- a/fs/jffs2/nodemgmt.c
+++ b/fs/jffs2/nodemgmt.c
@@ -185,8 +185,6 @@ int jffs2_reserve_space(struct jffs2_sb_info *c, uint32_t minsize,
 			} else if (ret)
 				return ret;
 
-			cond_resched();
-
 			if (signal_pending(current))
 				return -EINTR;
 
@@ -227,7 +225,14 @@ int jffs2_reserve_space_gc(struct jffs2_sb_info *c, uint32_t minsize,
 		spin_unlock(&c->erase_completion_lock);
 
 		if (ret == -EAGAIN)
-			cond_resched();
+			/*
+			 * The spin_unlock() above will implicitly reschedule
+			 * if a reschedule is needed.
+			 *
+			 * If we did not reschedule, take a breather here
+			 * before retrying.
+			 */
+			cpu_relax();
 		else
 			break;
 	}
diff --git a/fs/jffs2/readinode.c b/fs/jffs2/readinode.c
index 03b4f99614be..f9fc1f6451f8 100644
--- a/fs/jffs2/readinode.c
+++ b/fs/jffs2/readinode.c
@@ -1013,8 +1013,6 @@ static int jffs2_get_inode_nodes(struct jffs2_sb_info *c, struct jffs2_inode_inf
 		valid_ref = jffs2_first_valid_node(ref->next_in_ino);
 		spin_unlock(&c->erase_completion_lock);
 
-		cond_resched();
-
 		/*
 		 * At this point we don't know the type of the node we're going
 		 * to read, so we do not know the size of its header. In order
diff --git a/fs/jffs2/scan.c b/fs/jffs2/scan.c
index 29671e33a171..aaf6b33ba200 100644
--- a/fs/jffs2/scan.c
+++ b/fs/jffs2/scan.c
@@ -143,8 +143,6 @@ int jffs2_scan_medium(struct jffs2_sb_info *c)
 	for (i=0; i<c->nr_blocks; i++) {
 		struct jffs2_eraseblock *jeb = &c->blocks[i];
 
-		cond_resched();
-
 		/* reset summary info for next eraseblock scan */
 		jffs2_sum_reset_collected(s);
 
@@ -621,8 +619,6 @@ static int jffs2_scan_eraseblock (struct jffs2_sb_info *c, struct jffs2_eraseblo
 		if (err)
 			return err;
 
-		cond_resched();
-
 		if (ofs & 3) {
 			pr_warn("Eep. ofs 0x%08x not word-aligned!\n", ofs);
 			ofs = PAD(ofs);
diff --git a/fs/jffs2/summary.c b/fs/jffs2/summary.c
index 4fe64519870f..5a4a6438a966 100644
--- a/fs/jffs2/summary.c
+++ b/fs/jffs2/summary.c
@@ -397,8 +397,6 @@ static int jffs2_sum_process_sum_data(struct jffs2_sb_info *c, struct jffs2_eras
 	for (i=0; i<je32_to_cpu(summary->sum_num); i++) {
 		dbg_summary("processing summary index %d\n", i);
 
-		cond_resched();
-
 		/* Make sure there's a spare ref for dirty space */
 		err = jffs2_prealloc_raw_node_refs(c, jeb, 2);
 		if (err)
diff --git a/fs/jfs/jfs_txnmgr.c b/fs/jfs/jfs_txnmgr.c
index ce4b4760fcb1..d30011f3e935 100644
--- a/fs/jfs/jfs_txnmgr.c
+++ b/fs/jfs/jfs_txnmgr.c
@@ -2833,12 +2833,11 @@ void txQuiesce(struct super_block *sb)
 		mutex_lock(&jfs_ip->commit_mutex);
 		txCommit(tid, 1, &ip, 0);
 		txEnd(tid);
+		/*
+		 * The mutex_unlock() reschedules if needed.
+		 */
 		mutex_unlock(&jfs_ip->commit_mutex);
-		/*
-		 * Just to be safe.  I don't know how
-		 * long we can run without blocking
-		 */
-		cond_resched();
+
 		TXN_LOCK();
 	}
 
@@ -2912,11 +2911,6 @@ int jfs_sync(void *arg)
 				mutex_unlock(&jfs_ip->commit_mutex);
 
 				iput(ip);
-				/*
-				 * Just to be safe.  I don't know how
-				 * long we can run without blocking
-				 */
-				cond_resched();
 				TXN_LOCK();
 			} else {
 				/* We can't get the commit mutex.  It may
diff --git a/fs/libfs.c b/fs/libfs.c
index 37f2d34ee090..c74cecca8557 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -125,9 +125,8 @@ static struct dentry *scan_positives(struct dentry *cursor,
 		if (need_resched()) {
 			list_move(&cursor->d_child, p);
 			p = &cursor->d_child;
-			spin_unlock(&dentry->d_lock);
-			cond_resched();
-			spin_lock(&dentry->d_lock);
+
+			cond_resched_lock(&dentry->d_lock);
 		}
 	}
 	spin_unlock(&dentry->d_lock);
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 2a4b8b549e93..451d554d3f55 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -322,7 +322,6 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
 		spin_unlock(&cache->c_list_lock);
 		__mb_cache_entry_free(cache, entry);
 		shrunk++;
-		cond_resched();
 		spin_lock(&cache->c_list_lock);
 	}
 	spin_unlock(&cache->c_list_lock);
diff --git a/fs/namei.c b/fs/namei.c
index 94565bd7e73f..e911d7f15dad 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1781,7 +1781,6 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
 
 	if (!(nd->flags & LOOKUP_RCU)) {
 		touch_atime(&last->link);
-		cond_resched();
 	} else if (atime_needs_update(&last->link, inode)) {
 		if (!try_to_unlazy(nd))
 			return ERR_PTR(-ECHILD);
diff --git a/fs/netfs/io.c b/fs/netfs/io.c
index 7f753380e047..fe9487237b5d 100644
--- a/fs/netfs/io.c
+++ b/fs/netfs/io.c
@@ -641,7 +641,6 @@ int netfs_begin_read(struct netfs_io_request *rreq, bool sync)
 			netfs_rreq_assess(rreq, false);
 			if (!test_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags))
 				break;
-			cond_resched();
 		}
 
 		ret = rreq->error;
diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index cf7365581031..6b5b060b3658 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -650,7 +650,6 @@ static int nfs_server_return_marked_delegations(struct nfs_server *server,
 
 		err = nfs_end_delegation_return(inode, delegation, 0);
 		iput(inode);
-		cond_resched();
 		if (!err)
 			goto restart;
 		set_bit(NFS4CLNT_DELEGRETURN, &server->nfs_client->cl_state);
@@ -1186,7 +1185,6 @@ static int nfs_server_reap_unclaimed_delegations(struct nfs_server *server,
 			nfs_put_delegation(delegation);
 		}
 		iput(inode);
-		cond_resched();
 		goto restart;
 	}
 	rcu_read_unlock();
@@ -1318,7 +1316,6 @@ static int nfs_server_reap_expired_delegations(struct nfs_server *server,
 		put_cred(cred);
 		if (!nfs4_server_rebooted(server->nfs_client)) {
 			iput(inode);
-			cond_resched();
 			goto restart;
 		}
 		nfs_inode_mark_test_expired_delegation(server,inode);
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 84343aefbbd6..10db43e1833a 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -2665,14 +2665,12 @@ static int pnfs_layout_return_unused_byserver(struct nfs_server *server,
 			spin_unlock(&inode->i_lock);
 			rcu_read_unlock();
 			pnfs_put_layout_hdr(lo);
-			cond_resched();
 			goto restart;
 		}
 		spin_unlock(&inode->i_lock);
 		rcu_read_unlock();
 		pnfs_send_layoutreturn(lo, &stateid, &cred, iomode, false);
 		pnfs_put_layout_hdr(lo);
-		cond_resched();
 		goto restart;
 	}
 	rcu_read_unlock();
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 9d82d50ce0b1..eec3d641998b 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1053,7 +1053,6 @@ nfs_scan_commit_list(struct list_head *src, struct list_head *dst,
 		ret++;
 		if ((ret == max) && !cinfo->dreq)
 			break;
-		cond_resched();
 	}
 	return ret;
 }
@@ -1890,8 +1889,6 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
 		atomic_long_inc(&NFS_I(data->inode)->redirtied_pages);
 	next:
 		nfs_unlock_and_release_request(req);
-		/* Latency breaker */
-		cond_resched();
 	}
 	nfss = NFS_SERVER(data->inode);
 	if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
@@ -1958,7 +1955,6 @@ static int __nfs_commit_inode(struct inode *inode, int how,
 		}
 		if (nscan < INT_MAX)
 			break;
-		cond_resched();
 	}
 	nfs_commit_end(cinfo.mds);
 	if (ret || !may_wait)
diff --git a/fs/nilfs2/btree.c b/fs/nilfs2/btree.c
index 13592e82eaf6..4ed6d5d23ade 100644
--- a/fs/nilfs2/btree.c
+++ b/fs/nilfs2/btree.c
@@ -2173,7 +2173,6 @@ static void nilfs_btree_lookup_dirty_buffers(struct nilfs_bmap *btree,
 			} while ((bh = bh->b_this_page) != head);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 
 	for (level = NILFS_BTREE_LEVEL_NODE_MIN;
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 1a8bd5993476..a5780f54ac6d 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -1280,7 +1280,6 @@ int nilfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 			}
 			blkoff += n;
 		}
-		cond_resched();
 	} while (true);
 
 	/* If ret is 1 then we just hit the end of the extent array */
diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
index b4e54d079b7d..71c5b6792e5f 100644
--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -277,7 +277,6 @@ int nilfs_copy_dirty_pages(struct address_space *dmap,
 		folio_unlock(folio);
 	}
 	folio_batch_release(&fbatch);
-	cond_resched();
 
 	if (likely(!err))
 		goto repeat;
@@ -346,7 +345,6 @@ void nilfs_copy_back_pages(struct address_space *dmap,
 		folio_unlock(folio);
 	}
 	folio_batch_release(&fbatch);
-	cond_resched();
 
 	goto repeat;
 }
@@ -382,7 +380,6 @@ void nilfs_clear_dirty_pages(struct address_space *mapping, bool silent)
 			folio_unlock(folio);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 }
 
@@ -539,7 +536,6 @@ unsigned long nilfs_find_uncommitted_extent(struct inode *inode,
 	} while (++i < nr_folios);
 
 	folio_batch_release(&fbatch);
-	cond_resched();
 	goto repeat;
 
 out_locked:
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 7ec16879756e..45c65b450119 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -361,7 +361,6 @@ static void nilfs_transaction_lock(struct super_block *sb,
 		nilfs_segctor_do_immediate_flush(sci);
 
 		up_write(&nilfs->ns_segctor_sem);
-		cond_resched();
 	}
 	if (gcflag)
 		ti->ti_flags |= NILFS_TI_GC;
@@ -746,13 +745,11 @@ static size_t nilfs_lookup_dirty_data_buffers(struct inode *inode,
 			ndirties++;
 			if (unlikely(ndirties >= nlimit)) {
 				folio_batch_release(&fbatch);
-				cond_resched();
 				return ndirties;
 			}
 		} while (bh = bh->b_this_page, bh != head);
 	}
 	folio_batch_release(&fbatch);
-	cond_resched();
 	goto repeat;
 }
 
@@ -785,7 +782,6 @@ static void nilfs_lookup_dirty_node_buffers(struct inode *inode,
 			} while (bh != head);
 		}
 		folio_batch_release(&fbatch);
-		cond_resched();
 	}
 }
 
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 62fe0b679e58..64a66e1aeac4 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -805,7 +805,6 @@ static ssize_t fanotify_read(struct file *file, char __user *buf,
 		 * User can supply arbitrarily large buffer. Avoid softlockups
 		 * in case there are lots of available events.
 		 */
-		cond_resched();
 		event = get_one_event(group, count);
 		if (IS_ERR(event)) {
 			ret = PTR_ERR(event);
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 7974e91ffe13..a6aff29204f6 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -79,7 +79,6 @@ static void fsnotify_unmount_inodes(struct super_block *sb)
 
 		iput_inode = inode;
 
-		cond_resched();
 		spin_lock(&sb->s_inode_list_lock);
 	}
 	spin_unlock(&sb->s_inode_list_lock);
diff --git a/fs/ntfs/attrib.c b/fs/ntfs/attrib.c
index f79408f9127a..173f6fcfef54 100644
--- a/fs/ntfs/attrib.c
+++ b/fs/ntfs/attrib.c
@@ -2556,7 +2556,6 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		set_page_dirty(page);
 		put_page(page);
 		balance_dirty_pages_ratelimited(mapping);
-		cond_resched();
 		if (idx == end)
 			goto done;
 		idx++;
@@ -2597,7 +2596,6 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		unlock_page(page);
 		put_page(page);
 		balance_dirty_pages_ratelimited(mapping);
-		cond_resched();
 	}
 	/* If there is a last partial page, need to do it the slow way. */
 	if (end_ofs) {
@@ -2614,7 +2612,6 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		set_page_dirty(page);
 		put_page(page);
 		balance_dirty_pages_ratelimited(mapping);
-		cond_resched();
 	}
 done:
 	ntfs_debug("Done.");
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index cbc545999cfe..a03ad2d7faf7 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -259,7 +259,6 @@ static int ntfs_attr_extend_initialized(ntfs_inode *ni, const s64 new_init_size)
 		 * files.
 		 */
 		balance_dirty_pages_ratelimited(mapping);
-		cond_resched();
 	} while (++index < end_index);
 	read_lock_irqsave(&ni->size_lock, flags);
 	BUG_ON(ni->initialized_size != new_init_size);
@@ -1868,7 +1867,6 @@ static ssize_t ntfs_perform_write(struct file *file, struct iov_iter *i,
 			iov_iter_revert(i, copied);
 			break;
 		}
-		cond_resched();
 		if (unlikely(copied < bytes)) {
 			iov_iter_revert(i, copied);
 			if (copied)
diff --git a/fs/ntfs3/file.c b/fs/ntfs3/file.c
index 1f7a194983c5..cfb09f47a588 100644
--- a/fs/ntfs3/file.c
+++ b/fs/ntfs3/file.c
@@ -158,7 +158,6 @@ static int ntfs_extend_initialized_size(struct file *file,
 			break;
 
 		balance_dirty_pages_ratelimited(mapping);
-		cond_resched();
 	}
 
 	return 0;
@@ -241,7 +240,6 @@ static int ntfs_zero_range(struct inode *inode, u64 vbo, u64 vbo_to)
 
 		unlock_page(page);
 		put_page(page);
-		cond_resched();
 	}
 out:
 	mark_inode_dirty(inode);
@@ -1005,13 +1003,6 @@ static ssize_t ntfs_compress_write(struct kiocb *iocb, struct iov_iter *from)
 		if (err)
 			goto out;
 
-		/*
-		 * We can loop for a long time in here. Be nice and allow
-		 * us to schedule out to avoid softlocking if preempt
-		 * is disabled.
-		 */
-		cond_resched();
-
 		pos += copied;
 		written += copied;
 
diff --git a/fs/ntfs3/frecord.c b/fs/ntfs3/frecord.c
index dad976a68985..8fa4bb50b0b1 100644
--- a/fs/ntfs3/frecord.c
+++ b/fs/ntfs3/frecord.c
@@ -2265,8 +2265,6 @@ int ni_decompress_file(struct ntfs_inode *ni)
 
 		if (err)
 			goto out;
-
-		cond_resched();
 	}
 
 remove_wof:
diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index aef58f1395c8..2fccabc7aa51 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -7637,10 +7637,8 @@ int ocfs2_trim_mainbm(struct super_block *sb, struct fstrim_range *range)
 	 * main_bm related locks for avoiding the current IO starve, then go to
 	 * trim the next group
 	 */
-	if (ret >= 0 && group <= last_group) {
-		cond_resched();
+	if (ret >= 0 && group <= last_group)
 		goto next_group;
-	}
 out:
 	range->len = trimmed * sb->s_blocksize;
 	return ret;
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 960080753d3b..7bf6f46bd429 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -951,7 +951,12 @@ static void o2net_sendpage(struct o2net_sock_container *sc,
 		if (ret == (ssize_t)-EAGAIN) {
 			mlog(0, "sendpage of size %zu to " SC_NODEF_FMT
 			     " returned EAGAIN\n", size, SC_NODEF_ARGS(sc));
-			cond_resched();
+
+			/*
+			 * Take a breather before retrying. Though maybe this
+			 * should be a wait on an event or a timeout?
+			 */
+			cpu_relax();
 			continue;
 		}
 		mlog(ML_ERROR, "sendpage of size %zu to " SC_NODEF_FMT
@@ -1929,7 +1934,6 @@ static void o2net_accept_many(struct work_struct *work)
 		o2net_accept_one(sock, &more);
 		if (!more)
 			break;
-		cond_resched();
 	}
 }
 
diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
index eedf07ca23ca..271e0f7405e5 100644
--- a/fs/ocfs2/dlm/dlmthread.c
+++ b/fs/ocfs2/dlm/dlmthread.c
@@ -792,11 +792,10 @@ static int dlm_thread(void *data)
 		spin_unlock(&dlm->spinlock);
 		dlm_flush_asts(dlm);
 
-		/* yield and continue right away if there is more work to do */
-		if (!n) {
-			cond_resched();
+		/* An unlock above would have led to a yield if one was
+		 * needed. Continue right away if there is more to do */
+		if (!n)
 			continue;
-		}
 
 		wait_event_interruptible_timeout(dlm->dlm_thread_wq,
 						 !dlm_dirty_list_empty(dlm) ||
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index c45596c25c66..f977337a33db 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -940,6 +940,10 @@ static int ocfs2_zero_extend_range(struct inode *inode, u64 range_start,
 	BUG_ON(range_start >= range_end);
 
 	while (zero_pos < range_end) {
+		/*
+		 * If this is a very long extent, then we might be here
+		 * awhile. We should expect the scheduler to preempt us.
+		 */
 		next_pos = (zero_pos & PAGE_MASK) + PAGE_SIZE;
 		if (next_pos > range_end)
 			next_pos = range_end;
@@ -949,12 +953,6 @@ static int ocfs2_zero_extend_range(struct inode *inode, u64 range_start,
 			break;
 		}
 		zero_pos = next_pos;
-
-		/*
-		 * Very large extends have the potential to lock up
-		 * the cpu for extended periods of time.
-		 */
-		cond_resched();
 	}
 
 	return rc;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ffd54617c354..fec3dc6a887d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3532,7 +3532,6 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 		char name[10 + 1];
 		unsigned int len;
 
-		cond_resched();
 		if (!has_pid_permissions(fs_info, iter.task, HIDEPID_INVISIBLE))
 			continue;
 
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 6276b3938842..b014c44b96e9 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -272,7 +272,6 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
 				     name, len, instantiate, p,
 				     &data))
 			goto out;
-		cond_resched();
 		rcu_read_lock();
 	}
 	rcu_read_unlock();
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 23fc24d16b31..4625dea20bc6 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -491,7 +491,6 @@ static ssize_t read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
 
 		if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
 			page_offline_thaw();
-			cond_resched();
 			page_offline_freeze();
 		}
 
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 195b077c0fac..14fd181baf57 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -80,8 +80,6 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
 		pfn++;
 		out++;
 		count -= KPMSIZE;
-
-		cond_resched();
 	}
 
 	*ppos += (char __user *)out - buf;
@@ -258,8 +256,6 @@ static ssize_t kpageflags_read(struct file *file, char __user *buf,
 		pfn++;
 		out++;
 		count -= KPMSIZE;
-
-		cond_resched();
 	}
 
 	*ppos += (char __user *)out - buf;
@@ -313,8 +309,6 @@ static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
 		pfn++;
 		out++;
 		count -= KPMSIZE;
-
-		cond_resched();
 	}
 
 	*ppos += (char __user *)out - buf;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3dd5be96691b..49c2ebcb5fd9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -629,7 +629,6 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		smaps_pte_entry(pte, addr, walk);
 	pte_unmap_unlock(pte - 1, ptl);
 out:
-	cond_resched();
 	return 0;
 }
 
@@ -1210,7 +1209,6 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		ClearPageReferenced(page);
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	cond_resched();
 	return 0;
 }
 
@@ -1554,8 +1552,6 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	}
 	pte_unmap_unlock(orig_pte, ptl);
 
-	cond_resched();
-
 	return err;
 }
 
@@ -1605,8 +1601,6 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 			frame++;
 	}
 
-	cond_resched();
-
 	return err;
 }
 #else
@@ -1899,7 +1893,6 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	pte_unmap_unlock(orig_pte, ptl);
-	cond_resched();
 	return 0;
 }
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 31e897ad5e6a..994d69edf349 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -1068,7 +1068,6 @@ static int add_dquot_ref(struct super_block *sb, int type)
 		 * later.
 		 */
 		old_inode = inode;
-		cond_resched();
 		spin_lock(&sb->s_inode_list_lock);
 	}
 	spin_unlock(&sb->s_inode_list_lock);
diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index 015bfe4e4524..74b503a46884 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -814,7 +814,6 @@ static int write_ordered_buffers(spinlock_t * lock,
 			if (chunk.nr)
 				write_ordered_chunk(&chunk);
 			wait_on_buffer(bh);
-			cond_resched();
 			spin_lock(lock);
 			goto loop_next;
 		}
@@ -1671,7 +1670,6 @@ static int write_one_transaction(struct super_block *s,
 		}
 next:
 		cn = cn->next;
-		cond_resched();
 	}
 	return ret;
 }
diff --git a/fs/select.c b/fs/select.c
index 0ee55af1a55c..1d05de51c543 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -573,7 +573,6 @@ static int do_select(int n, fd_set_bits *fds, struct timespec64 *end_time)
 				*routp = res_out;
 			if (res_ex)
 				*rexp = res_ex;
-			cond_resched();
 		}
 		wait->_qproc = NULL;
 		if (retval || timed_out || signal_pending(current))
diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index 2108b3b40ce9..da3b31b02b45 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -2713,7 +2713,6 @@ static void cifs_extend_writeback(struct address_space *mapping,
 		}
 
 		folio_batch_release(&batch);
-		cond_resched();
 	} while (!stop);
 
 	*_len = len;
@@ -2951,7 +2950,6 @@ static int cifs_writepages_region(struct address_space *mapping,
 		}
 
 		folio_batch_release(&fbatch);		
-		cond_resched();
 	} while (wbc->nr_to_write > 0);
 
 	*_next = start;
diff --git a/fs/splice.c b/fs/splice.c
index d983d375ff11..0b43bedbf36f 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -604,7 +604,6 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
 
 	splice_from_pipe_begin(sd);
 	do {
-		cond_resched();
 		ret = splice_from_pipe_next(pipe, sd);
 		if (ret > 0)
 			ret = splice_from_pipe_feed(pipe, sd, actor);
diff --git a/fs/ubifs/budget.c b/fs/ubifs/budget.c
index d76eb7b39f56..b9100c713964 100644
--- a/fs/ubifs/budget.c
+++ b/fs/ubifs/budget.c
@@ -477,7 +477,6 @@ int ubifs_budget_space(struct ubifs_info *c, struct ubifs_budget_req *req)
 	}
 
 	err = make_free_space(c);
-	cond_resched();
 	if (err == -EAGAIN) {
 		dbg_budg("try again");
 		goto again;
diff --git a/fs/ubifs/commit.c b/fs/ubifs/commit.c
index c4fc1047fc07..2fd6aef59b7d 100644
--- a/fs/ubifs/commit.c
+++ b/fs/ubifs/commit.c
@@ -309,7 +309,6 @@ int ubifs_bg_thread(void *info)
 			ubifs_ro_mode(c, err);
 
 		run_bg_commit(c);
-		cond_resched();
 	}
 
 	ubifs_msg(c, "background thread \"%s\" stops", c->bgt_name);
diff --git a/fs/ubifs/debug.c b/fs/ubifs/debug.c
index eef9e527d9ff..add4b72fd52f 100644
--- a/fs/ubifs/debug.c
+++ b/fs/ubifs/debug.c
@@ -852,7 +852,6 @@ void ubifs_dump_leb(const struct ubifs_info *c, int lnum)
 	       sleb->nodes_cnt, sleb->endpt);
 
 	list_for_each_entry(snod, &sleb->nodes, list) {
-		cond_resched();
 		pr_err("Dumping node at LEB %d:%d len %d\n", lnum,
 		       snod->offs, snod->len);
 		ubifs_dump_node(c, snod->node, c->leb_size - snod->offs);
@@ -1622,8 +1621,6 @@ int dbg_walk_index(struct ubifs_info *c, dbg_leaf_callback leaf_cb,
 	while (1) {
 		int idx;
 
-		cond_resched();
-
 		if (znode_cb) {
 			err = znode_cb(c, znode, priv);
 			if (err) {
@@ -2329,7 +2326,6 @@ int dbg_check_data_nodes_order(struct ubifs_info *c, struct list_head *head)
 		ino_t inuma, inumb;
 		uint32_t blka, blkb;
 
-		cond_resched();
 		sa = container_of(cur, struct ubifs_scan_node, list);
 		sb = container_of(cur->next, struct ubifs_scan_node, list);
 
@@ -2396,7 +2392,6 @@ int dbg_check_nondata_nodes_order(struct ubifs_info *c, struct list_head *head)
 		ino_t inuma, inumb;
 		uint32_t hasha, hashb;
 
-		cond_resched();
 		sa = container_of(cur, struct ubifs_scan_node, list);
 		sb = container_of(cur->next, struct ubifs_scan_node, list);
 
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 2f48c58d47cd..7baa86efa471 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -683,7 +683,6 @@ static int ubifs_readdir(struct file *file, struct dir_context *ctx)
 		kfree(file->private_data);
 		ctx->pos = key_hash_flash(c, &dent->key);
 		file->private_data = dent;
-		cond_resched();
 	}
 
 out:
diff --git a/fs/ubifs/gc.c b/fs/ubifs/gc.c
index 3134d070fcc0..d85bcb64e9a8 100644
--- a/fs/ubifs/gc.c
+++ b/fs/ubifs/gc.c
@@ -109,7 +109,6 @@ static int data_nodes_cmp(void *priv, const struct list_head *a,
 	struct ubifs_info *c = priv;
 	struct ubifs_scan_node *sa, *sb;
 
-	cond_resched();
 	if (a == b)
 		return 0;
 
@@ -153,7 +152,6 @@ static int nondata_nodes_cmp(void *priv, const struct list_head *a,
 	struct ubifs_info *c = priv;
 	struct ubifs_scan_node *sa, *sb;
 
-	cond_resched();
 	if (a == b)
 		return 0;
 
@@ -305,7 +303,6 @@ static int move_node(struct ubifs_info *c, struct ubifs_scan_leb *sleb,
 {
 	int err, new_lnum = wbuf->lnum, new_offs = wbuf->offs + wbuf->used;
 
-	cond_resched();
 	err = ubifs_wbuf_write_nolock(wbuf, snod->node, snod->len);
 	if (err)
 		return err;
@@ -695,8 +692,6 @@ int ubifs_garbage_collect(struct ubifs_info *c, int anyway)
 		/* Maybe continue after find and break before find */
 		lp.lnum = -1;
 
-		cond_resched();
-
 		/* Give the commit an opportunity to run */
 		if (ubifs_gc_should_commit(c)) {
 			ret = -EAGAIN;
diff --git a/fs/ubifs/io.c b/fs/ubifs/io.c
index 01d8eb170382..4915ab97f7ce 100644
--- a/fs/ubifs/io.c
+++ b/fs/ubifs/io.c
@@ -683,8 +683,6 @@ int ubifs_bg_wbufs_sync(struct ubifs_info *c)
 	for (i = 0; i < c->jhead_cnt; i++) {
 		struct ubifs_wbuf *wbuf = &c->jheads[i].wbuf;
 
-		cond_resched();
-
 		/*
 		 * If the mutex is locked then wbuf is being changed, so
 		 * synchronization is not necessary.
diff --git a/fs/ubifs/lprops.c b/fs/ubifs/lprops.c
index 6d6cd85c2b4c..57e4d001125a 100644
--- a/fs/ubifs/lprops.c
+++ b/fs/ubifs/lprops.c
@@ -1113,8 +1113,6 @@ static int scan_check_cb(struct ubifs_info *c,
 	list_for_each_entry(snod, &sleb->nodes, list) {
 		int found, level = 0;
 
-		cond_resched();
-
 		if (is_idx == -1)
 			is_idx = (snod->type == UBIFS_IDX_NODE) ? 1 : 0;
 
diff --git a/fs/ubifs/lpt_commit.c b/fs/ubifs/lpt_commit.c
index c4d079328b92..0cadd08f6304 100644
--- a/fs/ubifs/lpt_commit.c
+++ b/fs/ubifs/lpt_commit.c
@@ -1483,7 +1483,6 @@ static int dbg_is_nnode_dirty(struct ubifs_info *c, int lnum, int offs)
 	for (; nnode; nnode = next_nnode(c, nnode, &hght)) {
 		struct ubifs_nbranch *branch;
 
-		cond_resched();
 		if (nnode->parent) {
 			branch = &nnode->parent->nbranch[nnode->iip];
 			if (branch->lnum != lnum || branch->offs != offs)
@@ -1517,7 +1516,6 @@ static int dbg_is_pnode_dirty(struct ubifs_info *c, int lnum, int offs)
 		struct ubifs_pnode *pnode;
 		struct ubifs_nbranch *branch;
 
-		cond_resched();
 		pnode = ubifs_pnode_lookup(c, i);
 		if (IS_ERR(pnode))
 			return PTR_ERR(pnode);
@@ -1673,7 +1671,6 @@ int dbg_check_ltab(struct ubifs_info *c)
 		pnode = ubifs_pnode_lookup(c, i);
 		if (IS_ERR(pnode))
 			return PTR_ERR(pnode);
-		cond_resched();
 	}
 
 	/* Check nodes */
diff --git a/fs/ubifs/orphan.c b/fs/ubifs/orphan.c
index 4909321d84cf..23572f418a8b 100644
--- a/fs/ubifs/orphan.c
+++ b/fs/ubifs/orphan.c
@@ -957,7 +957,6 @@ static int dbg_read_orphans(struct check_info *ci, struct ubifs_scan_leb *sleb)
 	int i, n, err;
 
 	list_for_each_entry(snod, &sleb->nodes, list) {
-		cond_resched();
 		if (snod->type != UBIFS_ORPH_NODE)
 			continue;
 		orph = snod->node;
diff --git a/fs/ubifs/recovery.c b/fs/ubifs/recovery.c
index f0d51dd21c9e..6b1bf684ec14 100644
--- a/fs/ubifs/recovery.c
+++ b/fs/ubifs/recovery.c
@@ -638,8 +638,6 @@ struct ubifs_scan_leb *ubifs_recover_leb(struct ubifs_info *c, int lnum,
 		dbg_scan("look at LEB %d:%d (%d bytes left)",
 			 lnum, offs, len);
 
-		cond_resched();
-
 		/*
 		 * Scan quietly until there is an error from which we cannot
 		 * recover
@@ -999,8 +997,6 @@ static int clean_an_unclean_leb(struct ubifs_info *c,
 	while (len >= 8) {
 		int ret;
 
-		cond_resched();
-
 		/* Scan quietly until there is an error */
 		ret = ubifs_scan_a_node(c, buf, len, lnum, offs, quiet);
 
diff --git a/fs/ubifs/replay.c b/fs/ubifs/replay.c
index 4211e4456b1e..9a361d8f998e 100644
--- a/fs/ubifs/replay.c
+++ b/fs/ubifs/replay.c
@@ -305,7 +305,6 @@ static int replay_entries_cmp(void *priv, const struct list_head *a,
 	struct ubifs_info *c = priv;
 	struct replay_entry *ra, *rb;
 
-	cond_resched();
 	if (a == b)
 		return 0;
 
@@ -332,8 +331,6 @@ static int apply_replay_list(struct ubifs_info *c)
 	list_sort(c, &c->replay_list, &replay_entries_cmp);
 
 	list_for_each_entry(r, &c->replay_list, list) {
-		cond_resched();
-
 		err = apply_replay_entry(c, r);
 		if (err)
 			return err;
@@ -722,8 +719,6 @@ static int replay_bud(struct ubifs_info *c, struct bud_entry *b)
 		u8 hash[UBIFS_HASH_ARR_SZ];
 		int deletion = 0;
 
-		cond_resched();
-
 		if (snod->sqnum >= SQNUM_WATERMARK) {
 			ubifs_err(c, "file system's life ended");
 			goto out_dump;
@@ -1060,8 +1055,6 @@ static int replay_log_leb(struct ubifs_info *c, int lnum, int offs, void *sbuf)
 	}
 
 	list_for_each_entry(snod, &sleb->nodes, list) {
-		cond_resched();
-
 		if (snod->sqnum >= SQNUM_WATERMARK) {
 			ubifs_err(c, "file system's life ended");
 			goto out_dump;
diff --git a/fs/ubifs/scan.c b/fs/ubifs/scan.c
index 84a9157dcc32..db3fc3297d1a 100644
--- a/fs/ubifs/scan.c
+++ b/fs/ubifs/scan.c
@@ -269,8 +269,6 @@ struct ubifs_scan_leb *ubifs_scan(const struct ubifs_info *c, int lnum,
 		dbg_scan("look at LEB %d:%d (%d bytes left)",
 			 lnum, offs, len);
 
-		cond_resched();
-
 		ret = ubifs_scan_a_node(c, buf, len, lnum, offs, quiet);
 		if (ret > 0) {
 			/* Padding bytes or a valid padding node */
diff --git a/fs/ubifs/shrinker.c b/fs/ubifs/shrinker.c
index d00a6f20ac7b..f381f844c321 100644
--- a/fs/ubifs/shrinker.c
+++ b/fs/ubifs/shrinker.c
@@ -125,7 +125,6 @@ static int shrink_tnc(struct ubifs_info *c, int nr, int age, int *contention)
 
 		zprev = znode;
 		znode = ubifs_tnc_levelorder_next(c, c->zroot.znode, znode);
-		cond_resched();
 	}
 
 	return total_freed;
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index b08fb28d16b5..0307d12d29d2 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -949,8 +949,6 @@ static int check_volume_empty(struct ubifs_info *c)
 			c->empty = 0;
 			break;
 		}
-
-		cond_resched();
 	}
 
 	return 0;
diff --git a/fs/ubifs/tnc_commit.c b/fs/ubifs/tnc_commit.c
index a55e04822d16..97218e7d380d 100644
--- a/fs/ubifs/tnc_commit.c
+++ b/fs/ubifs/tnc_commit.c
@@ -857,8 +857,6 @@ static int write_index(struct ubifs_info *c)
 	while (1) {
 		u8 hash[UBIFS_HASH_ARR_SZ];
 
-		cond_resched();
-
 		znode = cnext;
 		idx = c->cbuf + used;
 
diff --git a/fs/ubifs/tnc_misc.c b/fs/ubifs/tnc_misc.c
index 4d686e34e64d..b92d2ca00a0b 100644
--- a/fs/ubifs/tnc_misc.c
+++ b/fs/ubifs/tnc_misc.c
@@ -235,7 +235,6 @@ long ubifs_destroy_tnc_subtree(const struct ubifs_info *c,
 			    !ubifs_zn_dirty(zn->zbranch[n].znode))
 				clean_freed += 1;
 
-			cond_resched();
 			kfree(zn->zbranch[n].znode);
 		}
 
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 56eaae9dac1a..ad8500e831ba 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -914,7 +914,6 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 	mmap_write_lock(mm);
 	prev = NULL;
 	for_each_vma(vmi, vma) {
-		cond_resched();
 		BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
 		       !!(vma->vm_flags & __VM_UFFD_FLAGS));
 		if (vma->vm_userfaultfd_ctx.ctx != ctx) {
@@ -1277,7 +1276,6 @@ static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
 		seq = read_seqcount_begin(&ctx->refile_seq);
 		need_wakeup = waitqueue_active(&ctx->fault_pending_wqh) ||
 			waitqueue_active(&ctx->fault_wqh);
-		cond_resched();
 	} while (read_seqcount_retry(&ctx->refile_seq, seq));
 	if (need_wakeup)
 		__wake_userfault(ctx, range);
@@ -1392,8 +1390,6 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	basic_ioctls = false;
 	cur = vma;
 	do {
-		cond_resched();
-
 		BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
 		       !!(cur->vm_flags & __VM_UFFD_FLAGS));
 
@@ -1458,7 +1454,6 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 
 	ret = 0;
 	for_each_vma_range(vmi, vma, end) {
-		cond_resched();
 
 		BUG_ON(!vma_can_userfault(vma, vm_flags));
 		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
@@ -1603,8 +1598,6 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 	found = false;
 	cur = vma;
 	do {
-		cond_resched();
-
 		BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
 		       !!(cur->vm_flags & __VM_UFFD_FLAGS));
 
@@ -1629,8 +1622,6 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 
 	ret = 0;
 	for_each_vma_range(vmi, vma, end) {
-		cond_resched();
-
 		BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
 
 		/*
diff --git a/fs/verity/enable.c b/fs/verity/enable.c
index c284f46d1b53..a13623717dd6 100644
--- a/fs/verity/enable.c
+++ b/fs/verity/enable.c
@@ -152,7 +152,6 @@ static int build_merkle_tree(struct file *filp,
 			err = -EINTR;
 			goto out;
 		}
-		cond_resched();
 	}
 	/* Finish all nonempty pending tree blocks. */
 	for (level = 0; level < num_levels; level++) {
diff --git a/fs/verity/read_metadata.c b/fs/verity/read_metadata.c
index f58432772d9e..1b0102faae6c 100644
--- a/fs/verity/read_metadata.c
+++ b/fs/verity/read_metadata.c
@@ -71,7 +71,6 @@ static int fsverity_read_merkle_tree(struct inode *inode,
 			err = -EINTR;
 			break;
 		}
-		cond_resched();
 		offs_in_page = 0;
 	}
 	return retval ? retval : err;
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index cabdc0e16838..97022145e888 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -16,13 +16,6 @@ xchk_should_terminate(
 	struct xfs_scrub	*sc,
 	int			*error)
 {
-	/*
-	 * If preemption is disabled, we need to yield to the scheduler every
-	 * few seconds so that we don't run afoul of the soft lockup watchdog
-	 * or RCU stall detector.
-	 */
-	cond_resched();
-
 	if (fatal_signal_pending(current)) {
 		if (*error == 0)
 			*error = -EINTR;
diff --git a/fs/xfs/scrub/xfarray.c b/fs/xfs/scrub/xfarray.c
index f0f532c10a5a..59deed2fae80 100644
--- a/fs/xfs/scrub/xfarray.c
+++ b/fs/xfs/scrub/xfarray.c
@@ -498,13 +498,6 @@ xfarray_sort_terminated(
 	struct xfarray_sortinfo	*si,
 	int			*error)
 {
-	/*
-	 * If preemption is disabled, we need to yield to the scheduler every
-	 * few seconds so that we don't run afoul of the soft lockup watchdog
-	 * or RCU stall detector.
-	 */
-	cond_resched();
-
 	if ((si->flags & XFARRAY_SORT_KILLABLE) &&
 	    fatal_signal_pending(current)) {
 		if (*error == 0)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 465d7630bb21..cba03bff03ab 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -171,7 +171,6 @@ xfs_end_io(
 		list_del_init(&ioend->io_list);
 		iomap_ioend_try_merge(ioend, &tmp);
 		xfs_end_ioend(ioend);
-		cond_resched();
 	}
 }
 
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 3c210ac83713..d0ffbf581355 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1716,8 +1716,6 @@ xfs_icwalk_ag(
 		if (error == -EFSCORRUPTED)
 			break;
 
-		cond_resched();
-
 		if (icw && (icw->icw_flags & XFS_ICWALK_FLAG_SCAN_LIMIT)) {
 			icw->icw_scan_limit -= XFS_LOOKUP_BATCH;
 			if (icw->icw_scan_limit <= 0)
diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index b3275e8d47b6..908881df15ed 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -420,7 +420,6 @@ xfs_iwalk_ag(
 		struct xfs_inobt_rec_incore	*irec;
 		xfs_ino_t			rec_fsino;
 
-		cond_resched();
 		if (xfs_pwork_want_abort(&iwag->pwork))
 			goto out;
 
-- 
2.31.1
[RFC PATCH 75/86] treewide: virt: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.
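
As a rough illustration (hypothetical snippets -- the helpers and the
lock named here are made up for the example), the three sets look
something like:

	/* set-1: present only to keep the soft lockup watchdog quiet */
	for (i = 0; i < nr_items; i++) {
		process_item(i);
		cond_resched();
	}

	/* set-2: an open-coded cond_resched_lock() */
	spin_unlock(&lock);
	cond_resched();
	spin_lock(&lock);

	/* set-3: cond_resched() standing in for a real wait */
	while (do_one_attempt() == -EAGAIN)
		cond_resched();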

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.
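
(Roughly, and only as a sketch of the intent rather than the actual
implementation of the helper this series relies on,
cond_resched_stall() behaves like:

	static inline void cond_resched_stall(void)
	{
		if (need_resched())
			schedule();	/* yield if a reschedule is pending */
		else
			cpu_relax();	/* otherwise just relax the CPU */
	}

so a retry loop keeps making some progress without masquerading as a
proper wait.)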

All the cond_resched() calls here are from set-1. Remove them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/xen/balloon.c      | 2 --
 drivers/xen/gntdev.c       | 2 --
 drivers/xen/xen-scsiback.c | 9 +++++----
 virt/kvm/pfncache.c        | 2 --
 4 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 586a1673459e..a57e516b36f5 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -550,8 +550,6 @@ static int balloon_thread(void *unused)
 		update_schedule();
 
 		mutex_unlock(&balloon_mutex);
-
-		cond_resched();
 	}
 }
 
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 61faea1f0663..cbf74a2b6a06 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -974,8 +974,6 @@ static long gntdev_ioctl_grant_copy(struct gntdev_priv *priv, void __user *u)
 		ret = gntdev_grant_copy_seg(&batch, &seg, &copy.segments[i].status);
 		if (ret < 0)
 			goto out;
-
-		cond_resched();
 	}
 	if (batch.nr_ops)
 		ret = gntdev_copy(&batch);
diff --git a/drivers/xen/xen-scsiback.c b/drivers/xen/xen-scsiback.c
index 8b77e4c06e43..1ab88ba93166 100644
--- a/drivers/xen/xen-scsiback.c
+++ b/drivers/xen/xen-scsiback.c
@@ -814,9 +814,6 @@ static int scsiback_do_cmd_fn(struct vscsibk_info *info,
 			transport_generic_free_cmd(&pending_req->se_cmd, 0);
 			break;
 		}
-
-		/* Yield point for this unbounded loop. */
-		cond_resched();
 	}
 
 	gnttab_page_cache_shrink(&info->free_pages, scsiback_max_buffer_pages);
@@ -831,8 +828,12 @@ static irqreturn_t scsiback_irq_fn(int irq, void *dev_id)
 	int rc;
 	unsigned int eoi_flags = XEN_EOI_FLAG_SPURIOUS;
 
+	/*
+	 * Process cmds in a tight loop.  The scheduler can preempt when
+	 * it needs to.
+	 */
 	while ((rc = scsiback_do_cmd_fn(info, &eoi_flags)) > 0)
-		cond_resched();
+		;
 
 	/* In case of a ring error we keep the event channel masked. */
 	if (!rc)
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 2d6aba677830..cc757d5b4acc 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -178,8 +178,6 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 				gpc_unmap_khva(new_pfn, new_khva);
 
 			kvm_release_pfn_clean(new_pfn);
-
-			cond_resched();
 		}
 
 		/* We always request a writeable mapping */
-- 
2.31.1
[RFC PATCH 76/86] treewide: block: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All the uses here are in set-1 (some right after we give up the
lock, causing an explicit preemption check).

We can remove all of them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: cgroups@vger.kernel.org
Cc: linux-block@vger.kernel.org
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 block/blk-cgroup.c |  2 --
 block/blk-lib.c    | 11 -----------
 block/blk-mq.c     |  3 ---
 block/blk-zoned.c  |  6 ------
 4 files changed, 22 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4a42ea2972ad..145c378367ec 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -597,7 +597,6 @@ static void blkg_destroy_all(struct gendisk *disk)
 		if (!(--count)) {
 			count = BLKG_DESTROY_BATCH_SIZE;
 			spin_unlock_irq(&q->queue_lock);
-			cond_resched();
 			goto restart;
 		}
 	}
@@ -1234,7 +1233,6 @@ static void blkcg_destroy_blkgs(struct blkcg *blkcg)
 			 * need to rescheduling to avoid softlockup.
 			 */
 			spin_unlock_irq(&blkcg->lock);
-			cond_resched();
 			spin_lock_irq(&blkcg->lock);
 			continue;
 		}
diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..0bb118e9748b 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -69,14 +69,6 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		bio->bi_iter.bi_size = req_sects << 9;
 		sector += req_sects;
 		nr_sects -= req_sects;
-
-		/*
-		 * We can loop for a long time in here, if someone does
-		 * full device discards (like mkfs). Be nice and allow
-		 * us to schedule out to avoid softlocking if preempt
-		 * is disabled.
-		 */
-		cond_resched();
 	}
 
 	*biop = bio;
@@ -145,7 +137,6 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 			bio->bi_iter.bi_size = nr_sects << 9;
 			nr_sects = 0;
 		}
-		cond_resched();
 	}
 
 	*biop = bio;
@@ -189,7 +180,6 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
 			if (bi_size < sz)
 				break;
 		}
-		cond_resched();
 	}
 
 	*biop = bio;
@@ -336,7 +326,6 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 			bio_put(bio);
 			break;
 		}
-		cond_resched();
 	}
 	blk_finish_plug(&plug);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1fafd54dce3c..f45ee6a69700 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1372,7 +1372,6 @@ static void blk_rq_poll_completion(struct request *rq, struct completion *wait)
 {
 	do {
 		blk_hctx_poll(rq->q, rq->mq_hctx, NULL, 0);
-		cond_resched();
 	} while (!completion_done(wait));
 }
 
@@ -4310,7 +4309,6 @@ static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 	for (i = 0; i < set->nr_hw_queues; i++) {
 		if (!__blk_mq_alloc_map_and_rqs(set, i))
 			goto out_unwind;
-		cond_resched();
 	}
 
 	return 0;
@@ -4425,7 +4423,6 @@ static int blk_mq_realloc_tag_set_tags(struct blk_mq_tag_set *set,
 				__blk_mq_free_map_and_rqs(set, i);
 			return -ENOMEM;
 		}
-		cond_resched();
 	}
 
 done:
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 619ee41a51cc..8005f55e22e5 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -208,9 +208,6 @@ static int blkdev_zone_reset_all_emulated(struct block_device *bdev,
 				   gfp_mask);
 		bio->bi_iter.bi_sector = sector;
 		sector += zone_sectors;
-
-		/* This may take a while, so be nice to others */
-		cond_resched();
 	}
 
 	if (bio) {
@@ -293,9 +290,6 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op,
 		bio = blk_next_bio(bio, bdev, 0, op | REQ_SYNC, gfp_mask);
 		bio->bi_iter.bi_sector = sector;
 		sector += zone_sectors;
-
-		/* This may take a while, so be nice to others */
-		cond_resched();
 	}
 
 	ret = submit_bio_wait(bio);
-- 
2.31.1
[RFC PATCH 77/86] treewide: netfilter: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most of the uses here are in set-1 (some right after we give up a lock
or enable bottom-halves, causing an explicit preemption check).
We can remove all of them.

There's one case where we do "cond_resched(); cpu_relax()" while
spinning on a seqcount. Replace with cond_resched_stall().
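
For reference, the reworked wait (see the xt_replace_table() hunk
below) ends up looking like:

	if (seq & 1) {
		do {
			cond_resched_stall();
		} while (seq == raw_read_seqcount(s));
	}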

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Florian Westphal <fw@strlen.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@verge.net.au>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 net/netfilter/ipset/ip_set_core.c   | 1 -
 net/netfilter/ipvs/ip_vs_est.c      | 3 ---
 net/netfilter/nf_conncount.c        | 2 --
 net/netfilter/nf_conntrack_core.c   | 3 ---
 net/netfilter/nf_conntrack_ecache.c | 3 ---
 net/netfilter/nf_tables_api.c       | 2 --
 net/netfilter/nft_set_rbtree.c      | 2 --
 net/netfilter/x_tables.c            | 3 +--
 net/netfilter/xt_hashlimit.c        | 1 -
 9 files changed, 1 insertion(+), 19 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 35d2f9c9ada0..f584c5e756ae 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -1703,7 +1703,6 @@ call_ad(struct net *net, struct sock *ctnl, struct sk_buff *skb,
 		if (retried) {
 			__ip_set_get_netlink(set);
 			nfnl_unlock(NFNL_SUBSYS_IPSET);
-			cond_resched();
 			nfnl_lock(NFNL_SUBSYS_IPSET);
 			__ip_set_put_netlink(set);
 		}
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index c5970ba416ae..5543efeeb3f7 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -622,7 +622,6 @@ static void ip_vs_est_drain_temp_list(struct netns_ipvs *ipvs)
 			goto unlock;
 		}
 		mutex_unlock(&__ip_vs_mutex);
-		cond_resched();
 	}
 
 unlock:
@@ -681,7 +680,6 @@ static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max)
 
 		if (!ipvs->enable || kthread_should_stop())
 			goto stop;
-		cond_resched();
 
 		diff = ktime_to_ns(ktime_sub(t2, t1));
 		if (diff <= 1 * NSEC_PER_USEC) {
@@ -815,7 +813,6 @@ static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs)
 		 * and deleted (releasing kthread contexts)
 		 */
 		mutex_unlock(&__ip_vs_mutex);
-		cond_resched();
 		mutex_lock(&__ip_vs_mutex);
 
 		/* Current kt released ? */
diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 5d8ed6c90b7e..e7bc39ca204d 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -473,8 +473,6 @@ static void tree_gc_worker(struct work_struct *work)
 	rcu_read_unlock();
 	local_bh_enable();
 
-	cond_resched();
-
 	spin_lock_bh(&nf_conncount_locks[tree]);
 	if (gc_count < ARRAY_SIZE(gc_nodes))
 		goto next; /* do not bother */
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 9f6f2e643575..d2f38870bbab 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1563,7 +1563,6 @@ static void gc_worker(struct work_struct *work)
 		 * we will just continue with next hash slot.
 		 */
 		rcu_read_unlock();
-		cond_resched();
 		i++;
 
 		delta_time = nfct_time_stamp - end_time;
@@ -2393,7 +2392,6 @@ get_next_corpse(int (*iter)(struct nf_conn *i, void *data),
 		}
 		spin_unlock(lockp);
 		local_bh_enable();
-		cond_resched();
 	}
 
 	return NULL;
@@ -2418,7 +2416,6 @@ static void nf_ct_iterate_cleanup(int (*iter)(struct nf_conn *i, void *data),
 
 		nf_ct_delete(ct, iter_data->portid, iter_data->report);
 		nf_ct_put(ct);
-		cond_resched();
 	}
 	mutex_unlock(&nf_conntrack_mutex);
 }
diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c
index 69948e1d6974..b568e329bf22 100644
--- a/net/netfilter/nf_conntrack_ecache.c
+++ b/net/netfilter/nf_conntrack_ecache.c
@@ -84,7 +84,6 @@ static enum retry_state ecache_work_evict_list(struct nf_conntrack_net *cnet)
 
 		if (sent++ > 16) {
 			spin_unlock_bh(&cnet->ecache.dying_lock);
-			cond_resched();
 			goto next;
 		}
 	}
@@ -96,8 +95,6 @@ static enum retry_state ecache_work_evict_list(struct nf_conntrack_net *cnet)
 
 		hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode);
 		nf_ct_put(ct);
-
-		cond_resched();
 	}
 
 	return ret;
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 29c651804cb2..6ff5515d9b17 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3742,8 +3742,6 @@ static int nft_table_validate(struct net *net, const struct nft_table *table)
 		err = nft_chain_validate(&ctx, chain);
 		if (err < 0)
 			return err;
-
-		cond_resched();
 	}
 
 	return 0;
diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index e34662f4a71e..9bdf7c0e0831 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -495,8 +495,6 @@ static int nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 		if (fatal_signal_pending(current))
 			return -EINTR;
 
-		cond_resched();
-
 		write_lock_bh(&priv->lock);
 		write_seqcount_begin(&priv->count);
 		err = __nft_rbtree_insert(net, set, rbe, ext);
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 21624d68314f..ab53adf6393d 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1433,8 +1433,7 @@ xt_replace_table(struct xt_table *table,
 
 		if (seq & 1) {
 			do {
-				cond_resched();
-				cpu_relax();
+				cond_resched_stall();
 			} while (seq == raw_read_seqcount(s));
 		}
 	}
diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
index 0859b8f76764..47a11d49231a 100644
--- a/net/netfilter/xt_hashlimit.c
+++ b/net/netfilter/xt_hashlimit.c
@@ -372,7 +372,6 @@ static void htable_selective_cleanup(struct xt_hashlimit_htable *ht, bool select
 				dsthash_free(ht, dh);
 		}
 		spin_unlock_bh(&ht->lock);
-		cond_resched();
 	}
 }
 
-- 
2.31.1
[RFC PATCH 78/86] treewide: net: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All the uses here are in set-1 (some right after we give up a lock
or enable bottom-halves, causing an explicit preemption check).

We can remove all of them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: "David S. Miller" <davem@davemloft.net> 
Cc: Eric Dumazet <edumazet@google.com> 
Cc: Jakub Kicinski <kuba@kernel.org> 
Cc: Paolo Abeni <pabeni@redhat.com> 
Cc: David Ahern <dsahern@kernel.org> 
Cc: Pablo Neira Ayuso <pablo@netfilter.org> 
Cc: Jozsef Kadlecsik <kadlec@netfilter.org> 
Cc: Florian Westphal <fw@strlen.de> 
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> 
Cc: Jamal Hadi Salim <jhs@mojatatu.com> 
Cc: Cong Wang <xiyou.wangcong@gmail.com> 
Cc: Jiri Pirko <jiri@resnulli.us> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 net/core/dev.c                  | 4 ----
 net/core/neighbour.c            | 1 -
 net/core/net_namespace.c        | 1 -
 net/core/netclassid_cgroup.c    | 1 -
 net/core/rtnetlink.c            | 1 -
 net/core/sock.c                 | 2 --
 net/ipv4/inet_connection_sock.c | 3 ---
 net/ipv4/inet_diag.c            | 1 -
 net/ipv4/inet_hashtables.c      | 1 -
 net/ipv4/inet_timewait_sock.c   | 1 -
 net/ipv4/inetpeer.c             | 1 -
 net/ipv4/netfilter/arp_tables.c | 2 --
 net/ipv4/netfilter/ip_tables.c  | 3 ---
 net/ipv4/nexthop.c              | 1 -
 net/ipv4/tcp_ipv4.c             | 2 --
 net/ipv4/udp.c                  | 2 --
 net/netlink/af_netlink.c        | 1 -
 net/sched/sch_api.c             | 3 ---
 net/socket.c                    | 2 --
 19 files changed, 33 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 9f3f8930c691..467715278307 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6304,7 +6304,6 @@ void napi_busy_loop(unsigned int napi_id,
 			if (!IS_ENABLED(CONFIG_PREEMPT_RT))
 				preempt_enable();
 			rcu_read_unlock();
-			cond_resched();
 			if (loop_end(loop_end_arg, start_time))
 				return;
 			goto restart;
@@ -6709,8 +6708,6 @@ static int napi_threaded_poll(void *data)
 
 			if (!repoll)
 				break;
-
-			cond_resched();
 		}
 	}
 	return 0;
@@ -11478,7 +11475,6 @@ static void __net_exit default_device_exit_batch(struct list_head *net_list)
 	rtnl_lock();
 	list_for_each_entry(net, net_list, exit_list) {
 		default_device_exit_net(net);
-		cond_resched();
 	}
 
 	list_for_each_entry(net, net_list, exit_list) {
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index df81c1f0a570..86584a2ace2f 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1008,7 +1008,6 @@ static void neigh_periodic_work(struct work_struct *work)
 		 * grows while we are preempted.
 		 */
 		write_unlock_bh(&tbl->lock);
-		cond_resched();
 		write_lock_bh(&tbl->lock);
 		nht = rcu_dereference_protected(tbl->nht,
 						lockdep_is_held(&tbl->lock));
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index f4183c4c1ec8..5533e8268b30 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -168,7 +168,6 @@ static void ops_exit_list(const struct pernet_operations *ops,
 	if (ops->exit) {
 		list_for_each_entry(net, net_exit_list, exit_list) {
 			ops->exit(net);
-			cond_resched();
 		}
 	}
 	if (ops->exit_batch)
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index d6a70aeaa503..7162c3d30f1b 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -92,7 +92,6 @@ static void update_classid_task(struct task_struct *p, u32 classid)
 		task_lock(p);
 		fd = iterate_fd(p->files, fd, update_classid_sock, &ctx);
 		task_unlock(p);
-		cond_resched();
 	} while (fd);
 }
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 53c377d054f0..c4ff7b21f906 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -140,7 +140,6 @@ void __rtnl_unlock(void)
 		struct sk_buff *next = head->next;
 
 		kfree_skb(head);
-		cond_resched();
 		head = next;
 	}
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 16584e2dd648..c91f9fc687ba 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2982,8 +2982,6 @@ void __release_sock(struct sock *sk)
 			skb_mark_not_on_list(skb);
 			sk_backlog_rcv(sk, skb);
 
-			cond_resched();
-
 			skb = next;
 		} while (skb != NULL);
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 394a498c2823..49b90cf913a0 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -389,7 +389,6 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 		goto success;
 next_port:
 		spin_unlock_bh(&head->lock);
-		cond_resched();
 	}
 
 	offset--;
@@ -1420,8 +1419,6 @@ void inet_csk_listen_stop(struct sock *sk)
 		bh_unlock_sock(child);
 		local_bh_enable();
 		sock_put(child);
-
-		cond_resched();
 	}
 	if (queue->fastopenq.rskq_rst_head) {
 		/* Free all the reqs queued in rskq_rst_head. */
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index e13a84433413..45d3c9027355 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -1147,7 +1147,6 @@ void inet_diag_dump_icsk(struct inet_hashinfo *hashinfo, struct sk_buff *skb,
 		}
 		if (res < 0)
 			break;
-		cond_resched();
 		if (accum == SKARR_SZ) {
 			s_num = num + 1;
 			goto next_chunk;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 598c1b114d2c..47f86ce00704 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -1080,7 +1080,6 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		goto ok;
 next_port:
 		spin_unlock_bh(&head->lock);
-		cond_resched();
 	}
 
 	offset++;
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index dd37a5bf6881..519c77bc15ec 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -288,7 +288,6 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo, int family)
 	for (slot = 0; slot <= hashinfo->ehash_mask; slot++) {
 		struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
 restart_rcu:
-		cond_resched();
 		rcu_read_lock();
 restart:
 		sk_nulls_for_each_rcu(sk, node, &head->chain) {
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index e9fed83e9b3c..d32a70c27cbe 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -300,7 +300,6 @@ void inetpeer_invalidate_tree(struct inet_peer_base *base)
 		p = rb_next(p);
 		rb_erase(&peer->rb_node, &base->rb_root);
 		inet_putpeer(peer);
-		cond_resched();
 	}
 
 	base->total = 0;
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 2407066b0fec..3f8c9c4f3ce0 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -622,7 +622,6 @@ static void get_counters(const struct xt_table_info *t,
 
 			ADD_COUNTER(counters[i], bcnt, pcnt);
 			++i;
-			cond_resched();
 		}
 	}
 }
@@ -642,7 +641,6 @@ static void get_old_counters(const struct xt_table_info *t,
 			ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
 			++i;
 		}
-		cond_resched();
 	}
 }
 
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 7da1df4997d0..f8b7ae5106be 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -761,7 +761,6 @@ get_counters(const struct xt_table_info *t,
 
 			ADD_COUNTER(counters[i], bcnt, pcnt);
 			++i; /* macro does multi eval of i */
-			cond_resched();
 		}
 	}
 }
@@ -781,8 +780,6 @@ static void get_old_counters(const struct xt_table_info *t,
 			ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
 			++i; /* macro does multi eval of i */
 		}
-
-		cond_resched();
 	}
 }
 
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index bbff68b5b5d4..d0f009aea17e 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -2424,7 +2424,6 @@ static void flush_all_nexthops(struct net *net)
 	while ((node = rb_first(root))) {
 		nh = rb_entry(node, struct nexthop, rb_node);
 		remove_nexthop(net, nh, NULL);
-		cond_resched();
 	}
 }
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 4167e8a48b60..d2542780447c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2449,8 +2449,6 @@ static void *established_get_first(struct seq_file *seq)
 		struct hlist_nulls_node *node;
 		spinlock_t *lock = inet_ehash_lockp(hinfo, st->bucket);
 
-		cond_resched();
-
 		/* Lockless fast path for the common case of empty buckets */
 		if (empty_bucket(hinfo, st))
 			continue;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f39b9c844580..e01eca44559b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -281,7 +281,6 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
 				snum += rand;
 			} while (snum != first);
 			spin_unlock_bh(&hslot->lock);
-			cond_resched();
 		} while (++first != last);
 		goto fail;
 	} else {
@@ -1890,7 +1889,6 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags,
 	kfree_skb(skb);
 
 	/* starting over for a new packet, but check if we need to yield */
-	cond_resched();
 	msg->msg_flags &= ~MSG_TRUNC;
 	goto try_again;
 }
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index eb086b06d60d..4e2ed0c5cf6e 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -843,7 +843,6 @@ static int netlink_autobind(struct socket *sock)
 	bool ok;
 
 retry:
-	cond_resched();
 	rcu_read_lock();
 	ok = !__netlink_lookup(table, portid, net);
 	rcu_read_unlock();
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index e9eaf637220e..06ec50c52ea8 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -772,7 +772,6 @@ static u32 qdisc_alloc_handle(struct net_device *dev)
 			autohandle = TC_H_MAKE(0x80000000U, 0);
 		if (!qdisc_lookup(dev, autohandle))
 			return autohandle;
-		cond_resched();
 	} while	(--i > 0);
 
 	return 0;
@@ -923,7 +922,6 @@ static int tc_fill_qdisc(struct sk_buff *skb, struct Qdisc *q, u32 clid,
 	u32 block_index;
 	__u32 qlen;
 
-	cond_resched();
 	nlh = nlmsg_put(skb, portid, seq, event, sizeof(*tcm), flags);
 	if (!nlh)
 		goto out_nlmsg_trim;
@@ -1888,7 +1886,6 @@ static int tc_fill_tclass(struct sk_buff *skb, struct Qdisc *q,
 	struct gnet_dump d;
 	const struct Qdisc_class_ops *cl_ops = q->ops->cl_ops;
 
-	cond_resched();
 	nlh = nlmsg_put(skb, portid, seq, event, sizeof(*tcm), flags);
 	if (!nlh)
 		goto out_nlmsg_trim;
diff --git a/net/socket.c b/net/socket.c
index c4a6f5532955..d6499c7c7869 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2709,7 +2709,6 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 		++datagrams;
 		if (msg_data_left(&msg_sys))
 			break;
-		cond_resched();
 	}
 
 	fput_light(sock->file, fput_needed);
@@ -2944,7 +2943,6 @@ static int do_recvmmsg(int fd, struct mmsghdr __user *mmsg,
 		/* Out of band data, return right away */
 		if (msg_sys.msg_flags & MSG_OOB)
 			break;
-		cond_resched();
 	}
 
 	if (err == 0)
-- 
2.31.1
[RFC PATCH 79/86] treewide: net: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most of the uses here are in set-1 (some right after we give up a
lock or enable bottom-halves, causing an explicit preemption check).

We can remove all of them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Marek Lindner <mareklindner@neomailbox.ch> 
Cc: Simon Wunderlich <sw@simonwunderlich.de> 
Cc: Antonio Quartulli <a@unstable.cc> 
Cc: Sven Eckelmann <sven@narfation.org> 
Cc: "David S. Miller" <davem@davemloft.net> 
Cc: Eric Dumazet <edumazet@google.com> 
Cc: Jakub Kicinski <kuba@kernel.org> 
Cc: Paolo Abeni <pabeni@redhat.com> 
Cc: Roopa Prabhu <roopa@nvidia.com> 
Cc: Nikolay Aleksandrov <razor@blackwall.org> 
Cc: David Ahern <dsahern@kernel.org> 
Cc: Pablo Neira Ayuso <pablo@netfilter.org> 
Cc: Jozsef Kadlecsik <kadlec@netfilter.org> 
Cc: Florian Westphal <fw@strlen.de> 
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> 
Cc: Matthieu Baerts <matttbe@kernel.org> 
Cc: Mat Martineau <martineau@kernel.org> 
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> 
Cc: Xin Long <lucien.xin@gmail.com> 
Cc: Trond Myklebust <trond.myklebust@hammerspace.com> 
Cc: Anna Schumaker <anna@kernel.org> 
Cc: Jon Maloy <jmaloy@redhat.com> 
Cc: Ying Xue <ying.xue@windriver.com> 
Cc: Martin Schiller <ms@dev.tdt.de> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 net/batman-adv/tp_meter.c       |  2 --
 net/bpf/test_run.c              |  1 -
 net/bridge/br_netlink.c         |  1 -
 net/ipv6/fib6_rules.c           |  1 -
 net/ipv6/netfilter/ip6_tables.c |  2 --
 net/ipv6/udp.c                  |  2 --
 net/mptcp/mptcp_diag.c          |  2 --
 net/mptcp/pm_netlink.c          |  5 -----
 net/mptcp/protocol.c            |  1 -
 net/rds/ib_recv.c               |  2 --
 net/rds/tcp.c                   |  2 +-
 net/rds/threads.c               |  1 -
 net/rxrpc/call_object.c         |  2 +-
 net/sctp/socket.c               |  1 -
 net/sunrpc/cache.c              | 11 +++++++++--
 net/sunrpc/sched.c              |  2 +-
 net/sunrpc/svc_xprt.c           |  1 -
 net/sunrpc/xprtsock.c           |  2 --
 net/tipc/core.c                 |  2 +-
 net/tipc/topsrv.c               |  3 ---
 net/unix/af_unix.c              |  5 ++---
 net/x25/af_x25.c                |  1 -
 22 files changed, 15 insertions(+), 37 deletions(-)

diff --git a/net/batman-adv/tp_meter.c b/net/batman-adv/tp_meter.c
index 7f3dd3c393e0..a0b160088c33 100644
--- a/net/batman-adv/tp_meter.c
+++ b/net/batman-adv/tp_meter.c
@@ -877,8 +877,6 @@ static int batadv_tp_send(void *arg)
 		/* right-shift the TWND */
 		if (!err)
 			tp_vars->last_sent += payload_len;
-
-		cond_resched();
 	}
 
 out:
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 0841f8d82419..f4558fdfdf74 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -81,7 +81,6 @@ static bool bpf_test_timer_continue(struct bpf_test_timer *t, int iterations,
 		/* During iteration: we need to reschedule between runs. */
 		t->time_spent += ktime_get_ns() - t->time_start;
 		bpf_test_timer_leave(t);
-		cond_resched();
 		bpf_test_timer_enter(t);
 	}
 
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 10f0d33d8ccf..f326b034245f 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -780,7 +780,6 @@ int br_process_vlan_info(struct net_bridge *br,
 					       v - 1, rtm_cmd);
 				v_change_start = 0;
 			}
-			cond_resched();
 		}
 		/* v_change_start is set only if the last/whole range changed */
 		if (v_change_start)
diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index 7c2003833010..528e6a582c21 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -500,7 +500,6 @@ static void __net_exit fib6_rules_net_exit_batch(struct list_head *net_list)
 	rtnl_lock();
 	list_for_each_entry(net, net_list, exit_list) {
 		fib_rules_unregister(net->ipv6.fib6_rules_ops);
-		cond_resched();
 	}
 	rtnl_unlock();
 }
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index fd9f049d6d41..704f14c4146f 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -778,7 +778,6 @@ get_counters(const struct xt_table_info *t,
 
 			ADD_COUNTER(counters[i], bcnt, pcnt);
 			++i;
-			cond_resched();
 		}
 	}
 }
@@ -798,7 +797,6 @@ static void get_old_counters(const struct xt_table_info *t,
 			ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
 			++i;
 		}
-		cond_resched();
 	}
 }
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 86b5d509a468..032d4f7e6ed3 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -443,8 +443,6 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	}
 	kfree_skb(skb);
 
-	/* starting over for a new packet, but check if we need to yield */
-	cond_resched();
 	msg->msg_flags &= ~MSG_TRUNC;
 	goto try_again;
 }
diff --git a/net/mptcp/mptcp_diag.c b/net/mptcp/mptcp_diag.c
index 8df1bdb647e2..82bf16511476 100644
--- a/net/mptcp/mptcp_diag.c
+++ b/net/mptcp/mptcp_diag.c
@@ -141,7 +141,6 @@ static void mptcp_diag_dump_listeners(struct sk_buff *skb, struct netlink_callba
 		spin_unlock(&ilb->lock);
 		rcu_read_unlock();
 
-		cond_resched();
 		diag_ctx->l_num = 0;
 	}
 
@@ -190,7 +189,6 @@ static void mptcp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			diag_ctx->s_num--;
 			break;
 		}
-		cond_resched();
 	}
 
 	if ((r->idiag_states & TCPF_LISTEN) && r->id.idiag_dport == 0)
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 9661f3812682..b48d2636ce8d 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -1297,7 +1297,6 @@ static int mptcp_nl_add_subflow_or_signal_addr(struct net *net)
 
 next:
 		sock_put(sk);
-		cond_resched();
 	}
 
 	return 0;
@@ -1443,7 +1442,6 @@ static int mptcp_nl_remove_subflow_and_signal_addr(struct net *net,
 
 next:
 		sock_put(sk);
-		cond_resched();
 	}
 
 	return 0;
@@ -1478,7 +1476,6 @@ static int mptcp_nl_remove_id_zero_address(struct net *net,
 
 next:
 		sock_put(sk);
-		cond_resched();
 	}
 
 	return 0;
@@ -1594,7 +1591,6 @@ static void mptcp_nl_remove_addrs_list(struct net *net,
 		}
 
 		sock_put(sk);
-		cond_resched();
 	}
 }
 
@@ -1878,7 +1874,6 @@ static int mptcp_nl_set_flags(struct net *net,
 
 next:
 		sock_put(sk);
-		cond_resched();
 	}
 
 	return ret;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 886ab689a8ae..8c4a51903b23 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3383,7 +3383,6 @@ static void mptcp_release_cb(struct sock *sk)
 		if (flags & BIT(MPTCP_RETRANSMIT))
 			__mptcp_retrans(sk);
 
-		cond_resched();
 		spin_lock_bh(&sk->sk_lock.slock);
 	}
 
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index e53b7f266bd7..d2111e895a10 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -459,8 +459,6 @@ void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp)
 	    rds_ib_ring_empty(&ic->i_recv_ring))) {
 		queue_delayed_work(rds_wq, &conn->c_recv_w, 1);
 	}
-	if (can_wait)
-		cond_resched();
 }
 
 /*
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 2dba7505b414..9b4d07235904 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -530,7 +530,7 @@ static void rds_tcp_accept_worker(struct work_struct *work)
 					       rds_tcp_accept_w);
 
 	while (rds_tcp_accept_one(rtn->rds_tcp_listen_sock) == 0)
-		cond_resched();
+		cond_resched_stall();
 }
 
 void rds_tcp_accept_work(struct sock *sk)
diff --git a/net/rds/threads.c b/net/rds/threads.c
index 1f424cbfcbb4..2a75b48769e8 100644
--- a/net/rds/threads.c
+++ b/net/rds/threads.c
@@ -198,7 +198,6 @@ void rds_send_worker(struct work_struct *work)
 	if (rds_conn_path_state(cp) == RDS_CONN_UP) {
 		clear_bit(RDS_LL_SEND_FULL, &cp->cp_flags);
 		ret = rds_send_xmit(cp);
-		cond_resched();
 		rdsdebug("conn %p ret %d\n", cp->cp_conn, ret);
 		switch (ret) {
 		case -EAGAIN:
diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index 773eecd1e979..d2704a492a3c 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -755,7 +755,7 @@ void rxrpc_destroy_all_calls(struct rxrpc_net *rxnet)
 			       call->flags, call->events);
 
 			spin_unlock(&rxnet->call_lock);
-			cond_resched();
+			cpu_relax();
 			spin_lock(&rxnet->call_lock);
 		}
 
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 7f89e43154c0..448112919848 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -8364,7 +8364,6 @@ static int sctp_get_port_local(struct sock *sk, union sctp_addr *addr)
 			break;
 		next:
 			spin_unlock_bh(&head->lock);
-			cond_resched();
 		} while (--remaining > 0);
 
 		/* Exhausted local port range during search? */
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 95ff74706104..3bcacfbbf35f 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -521,10 +521,17 @@ static void do_cache_clean(struct work_struct *work)
  */
 void cache_flush(void)
 {
+	/*
+	 * We call cache_clean() in what is seemingly a tight loop. But,
+	 * the scheduler can always preempt us when we give up the spinlock
+	 * in cache_clean().
+	 */
+
 	while (cache_clean() != -1)
-		cond_resched();
+		;
+
 	while (cache_clean() != -1)
-		cond_resched();
+		;
 }
 EXPORT_SYMBOL_GPL(cache_flush);
 
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 6debf4fd42d4..5b7a3c8a271f 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -950,7 +950,7 @@ static void __rpc_execute(struct rpc_task *task)
 		 * Lockless check for whether task is sleeping or not.
 		 */
 		if (!RPC_IS_QUEUED(task)) {
-			cond_resched();
+			cond_resched_stall();
 			continue;
 		}
 
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 4cfe9640df48..d2486645d725 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -851,7 +851,6 @@ void svc_recv(struct svc_rqst *rqstp)
 		goto out;
 
 	try_to_freeze();
-	cond_resched();
 	if (kthread_should_stop())
 		goto out;
 
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index a15bf2ede89b..50c1f2556b3e 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -776,7 +776,6 @@ static void xs_stream_data_receive(struct sock_xprt *transport)
 		if (ret < 0)
 			break;
 		read += ret;
-		cond_resched();
 	}
 	if (ret == -ESHUTDOWN)
 		kernel_sock_shutdown(transport->sock, SHUT_RDWR);
@@ -1412,7 +1411,6 @@ static void xs_udp_data_receive(struct sock_xprt *transport)
 			break;
 		xs_udp_data_read_skb(&transport->xprt, sk, skb);
 		consume_skb(skb);
-		cond_resched();
 	}
 	xs_poll_check_readable(transport);
 out:
diff --git a/net/tipc/core.c b/net/tipc/core.c
index 434e70eabe08..ed4cd5faa387 100644
--- a/net/tipc/core.c
+++ b/net/tipc/core.c
@@ -119,7 +119,7 @@ static void __net_exit tipc_exit_net(struct net *net)
 	tipc_crypto_stop(&tipc_net(net)->crypto_tx);
 #endif
 	while (atomic_read(&tn->wq_count))
-		cond_resched();
+		cond_resched_stall();
 }
 
 static void __net_exit tipc_pernet_pre_exit(struct net *net)
diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c
index 8ee0c07d00e9..13cd3816fb52 100644
--- a/net/tipc/topsrv.c
+++ b/net/tipc/topsrv.c
@@ -277,7 +277,6 @@ static void tipc_conn_send_to_sock(struct tipc_conn *con)
 			ret = kernel_sendmsg(con->sock, &msg, &iov,
 					     1, sizeof(*evt));
 			if (ret == -EWOULDBLOCK || ret == 0) {
-				cond_resched();
 				return;
 			} else if (ret < 0) {
 				return tipc_conn_close(con);
@@ -288,7 +287,6 @@ static void tipc_conn_send_to_sock(struct tipc_conn *con)
 
 		/* Don't starve users filling buffers */
 		if (++count >= MAX_SEND_MSG_COUNT) {
-			cond_resched();
 			count = 0;
 		}
 		spin_lock_bh(&con->outqueue_lock);
@@ -426,7 +424,6 @@ static void tipc_conn_recv_work(struct work_struct *work)
 
 		/* Don't flood Rx machine */
 		if (++count >= MAX_RECV_MSG_COUNT) {
-			cond_resched();
 			count = 0;
 		}
 	}
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3e8a04a13668..bb1367f93db2 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1184,10 +1184,9 @@ static int unix_autobind(struct sock *sk)
 		unix_table_double_unlock(net, old_hash, new_hash);
 
 		/* __unix_find_socket_byname() may take long time if many names
-		 * are already in use.
+		 * are already in use. The unlock above would have allowed the
+		 * scheduler to preempt if preemption was needed.
 		 */
-		cond_resched();
-
 		if (ordernum == lastnum) {
 			/* Give up if all names seems to be in use. */
 			err = -ENOSPC;
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 0fb5143bec7a..2a6b05bcb53d 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -343,7 +343,6 @@ static unsigned int x25_new_lci(struct x25_neigh *nb)
 			lci = 0;
 			break;
 		}
-		cond_resched();
 	}
 
 	return lci;
-- 
2.31.1
Re: [RFC PATCH 79/86] treewide: net: remove cond_resched()
Posted by Eric Dumazet 2 years, 1 month ago
On Wed, Nov 8, 2023 at 12:09 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> There are broadly three sets of uses of cond_resched():
>
> 1.  Calls to cond_resched() out of the goodness of our heart,
>     otherwise known as avoiding lockup splats.
>
> 2.  Open coded variants of cond_resched_lock() which call
>     cond_resched().
>
> 3.  Retry or error handling loops, where cond_resched() is used as a
>     quick alternative to spinning in a tight-loop.
>
> When running under a full preemption model, the cond_resched() reduces
> to a NOP (not even a barrier) so removing it obviously cannot matter.
>
> But considering only voluntary preemption models (for say code that
> has been mostly tested under those), for set-1 and set-2 the
> scheduler can now preempt kernel tasks running beyond their time
> quanta anywhere they are preemptible() [1]. Which removes any need
> for these explicitly placed scheduling points.

What about RCU callbacks ? cond_resched() was helping a bit.

>
> The cond_resched() calls in set-3 are a little more difficult.
> To start with, given it's NOP character under full preemption, it
> never actually saved us from a tight loop.
> With voluntary preemption, it's not a NOP, but it might as well be --
> for most workloads the scheduler does not have an interminable supply
> of runnable tasks on the runqueue.
>
> So, cond_resched() is useful to not get softlockup splats, but not
> terribly good for error handling. Ideally, these should be replaced
> with some kind of timed or event wait.
> For now we use cond_resched_stall(), which tries to schedule if
> possible, and executes a cpu_relax() if not.
>
> Most of the uses here are in set-1 (some right after we give up a
> lock or enable bottom-halves, causing an explicit preemption check.)
>
> We can remove all of them.

A patch series of 86 is not reasonable.

596 files changed, 881 insertions(+), 2813 deletions(-)

If cond_resched() really becomes a nop (nice!),
make this change at the definition of cond_resched(),
and add nice debugging there.

Whoever needs to call a "real" cond_resched() could call a
cond_resched_for_real().
(Please change the name, this is only to make a point.)
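
Something like the sketch below, purely to illustrate the idea (the
config option name is made up):

    /* illustrative only */
    static inline int cond_resched(void)
    {
    #ifdef CONFIG_DEBUG_COND_RESCHED
            /* flag leftover callers so they can be removed later */
            WARN_ON_ONCE(1);
    #endif
            return 0;
    }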

Then let the removal happen whenever each maintainer decides, 6 months
later, without polluting lkml.

Imagine we have to revert this series in 1 month: how painful that
would be had we removed ~1400 cond_resched() calls all over the place,
with many conflicts.

Thanks
Re: [RFC PATCH 79/86] treewide: net: remove cond_resched()
Posted by Steven Rostedt 2 years, 1 month ago
On Wed, 8 Nov 2023 13:16:17 +0100
Eric Dumazet <edumazet@google.com> wrote:

> > Most of the uses here are in set-1 (some right after we give up a
> > lock or enable bottom-halves, causing an explicit preemption check.)
> >
> > We can remove all of them.  
> 
> A patch series of 86 is not reasonable.

Agreed. The removal of cond_resched() wasn't needed for the RFC, as there are
really no comments needed once we make cond_resched() obsolete.

I think Ankur just wanted to send all the work for the RFC to let people
know what he has done. I chalk that up as a Noobie mistake.

Ankur, next time you may want to break things up to get RFCs for each step
before going to the next one.

Currently, it looks like the first thing to do is to start with Thomas's
patch, and get the kinks out of NEED_RESCHED_LAZY, as Thomas suggested.

Perhaps work on separating PREEMPT_RCU from PREEMPT.

Then you may need to work on handling the #ifndef PREEMPTION parts of the
kernel.

And so on. Each being a separate patch series that will affect the way the
rest of the changes will be done.

I want this change too, so I'm willing to help you out on this. If you
hadn't started it, I would have ;-)

-- Steve
Re: [RFC PATCH 79/86] treewide: net: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
Steven Rostedt <rostedt@goodmis.org> writes:

> On Wed, 8 Nov 2023 13:16:17 +0100
> Eric Dumazet <edumazet@google.com> wrote:
>
>> > Most of the uses here are in set-1 (some right after we give up a
>> > lock or enable bottom-halves, causing an explicit preemption check.)
>> >
>> > We can remove all of them.
>>
>> A patch series of 86 is not reasonable.

/me nods.

> Agreed. The removal of cond_resched() wasn't needed for the RFC, as there's
> really no comments needed once we make cond_resched obsolete.
>
> I think Ankur just wanted to send all the work for the RFC to let people
> know what he has done. I chalk that up as a Noobie mistake.
>
> Ankur, next time you may want to break things up to get RFCs for each step
> before going to the next one.

Yeah agreed. It would have made sense to break this up into changes
touching the scheduler code first, get agreement. Rinse. Repeat.

> Currently, it looks like the first thing to do is to start with Thomas's
> patch, and get the kinks out of NEED_RESCHED_LAZY, as Thomas suggested.
>
> Perhaps work on separating PREEMPT_RCU from PREEMPT.

Agree to both of those.

> Then you may need to work on handling the #ifndef PREEMPTION parts of the
> kernel.

In other words, express the !PREEMPTION scheduler models/things that depend
on cond_resched() etc. in terms of Thomas' model?

> And so on. Each being a separate patch series that will affect the way the
> rest of the changes will be done.

Ack that.

> I want this change too, so I'm willing to help you out on this. If you
> didn't start it, I would have ;-)

Thanks and I really appreciate that.

--
ankur
[RFC PATCH 80/86] treewide: sound: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (say, for code that
has mostly been tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most uses here are from set-1 when we are executing in extended
loops. Remove them.

In addition, there are a few set-3 cases in the neighbourhood of
HW register access. Replace those instances with cond_resched_stall().
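
Eventually such register polls could become explicit timed waits, for
example something like the sketch below (illustrative only, with
made-up register and status-bit names):

    int err;
    u32 val;

    /* poll a status register, sleeping between reads */
    err = readl_poll_timeout(chip->base + STATUS_REG, val,
                             val & STATUS_READY, 10, 1000);
    if (err)
            return err;     /* -ETIMEDOUT */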

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 sound/arm/aaci.c                    | 2 +-
 sound/core/seq/seq_virmidi.c        | 2 --
 sound/hda/hdac_controller.c         | 1 -
 sound/isa/sb/emu8000_patch.c        | 5 -----
 sound/isa/sb/emu8000_pcm.c          | 2 +-
 sound/isa/wss/wss_lib.c             | 1 -
 sound/pci/echoaudio/echoaudio_dsp.c | 2 --
 sound/pci/ens1370.c                 | 1 -
 sound/pci/es1968.c                  | 2 +-
 sound/pci/lola/lola.c               | 1 -
 sound/pci/mixart/mixart_hwdep.c     | 2 +-
 sound/pci/pcxhr/pcxhr_core.c        | 5 -----
 sound/pci/vx222/vx222_ops.c         | 2 --
 sound/x86/intel_hdmi_audio.c        | 1 -
 14 files changed, 4 insertions(+), 25 deletions(-)

diff --git a/sound/arm/aaci.c b/sound/arm/aaci.c
index 0817ad21af74..d216f4859e61 100644
--- a/sound/arm/aaci.c
+++ b/sound/arm/aaci.c
@@ -145,7 +145,7 @@ static unsigned short aaci_ac97_read(struct snd_ac97 *ac97, unsigned short reg)
 	timeout = FRAME_PERIOD_US * 8;
 	do {
 		udelay(1);
-		cond_resched();
+		cond_resched_stall();
 		v = readl(aaci->base + AACI_SLFR) & (SLFR_1RXV|SLFR_2RXV);
 	} while ((v != (SLFR_1RXV|SLFR_2RXV)) && --timeout);
 
diff --git a/sound/core/seq/seq_virmidi.c b/sound/core/seq/seq_virmidi.c
index 1b9260108e48..99226da86d3c 100644
--- a/sound/core/seq/seq_virmidi.c
+++ b/sound/core/seq/seq_virmidi.c
@@ -154,8 +154,6 @@ static void snd_vmidi_output_work(struct work_struct *work)
 			if (ret < 0)
 				break;
 		}
-		/* rawmidi input might be huge, allow to have a break */
-		cond_resched();
 	}
 }
 
diff --git a/sound/hda/hdac_controller.c b/sound/hda/hdac_controller.c
index 7f3a000fab0c..9b6df2f541ca 100644
--- a/sound/hda/hdac_controller.c
+++ b/sound/hda/hdac_controller.c
@@ -284,7 +284,6 @@ int snd_hdac_bus_get_response(struct hdac_bus *bus, unsigned int addr,
 			msleep(2); /* temporary workaround */
 		} else {
 			udelay(10);
-			cond_resched();
 		}
 	}
 
diff --git a/sound/isa/sb/emu8000_patch.c b/sound/isa/sb/emu8000_patch.c
index 8c1e7f2bfc34..d808c461be35 100644
--- a/sound/isa/sb/emu8000_patch.c
+++ b/sound/isa/sb/emu8000_patch.c
@@ -218,11 +218,6 @@ snd_emu8000_sample_new(struct snd_emux *rec, struct snd_sf_sample *sp,
 		offset++;
 		write_word(emu, &dram_offset, s);
 
-		/* we may take too long time in this loop.
-		 * so give controls back to kernel if needed.
-		 */
-		cond_resched();
-
 		if (i == sp->v.loopend &&
 		    (sp->v.mode_flags & (SNDRV_SFNT_SAMPLE_BIDIR_LOOP|SNDRV_SFNT_SAMPLE_REVERSE_LOOP)))
 		{
diff --git a/sound/isa/sb/emu8000_pcm.c b/sound/isa/sb/emu8000_pcm.c
index 9234d4fe8ada..fd18c7cf1812 100644
--- a/sound/isa/sb/emu8000_pcm.c
+++ b/sound/isa/sb/emu8000_pcm.c
@@ -404,7 +404,7 @@ static int emu8k_pcm_trigger(struct snd_pcm_substream *subs, int cmd)
  */
 #define CHECK_SCHEDULER() \
 do { \
-	cond_resched();\
+	cond_resched_stall();\
 	if (signal_pending(current))\
 		return -EAGAIN;\
 } while (0)
diff --git a/sound/isa/wss/wss_lib.c b/sound/isa/wss/wss_lib.c
index 026061b55ee9..97c74e8c26ee 100644
--- a/sound/isa/wss/wss_lib.c
+++ b/sound/isa/wss/wss_lib.c
@@ -1159,7 +1159,6 @@ static int snd_ad1848_probe(struct snd_wss *chip)
 	while (wss_inb(chip, CS4231P(REGSEL)) & CS4231_INIT) {
 		if (time_after(jiffies, timeout))
 			return -ENODEV;
-		cond_resched();
 	}
 	spin_lock_irqsave(&chip->reg_lock, flags);
 
diff --git a/sound/pci/echoaudio/echoaudio_dsp.c b/sound/pci/echoaudio/echoaudio_dsp.c
index 2a40091d472c..085b229c83b5 100644
--- a/sound/pci/echoaudio/echoaudio_dsp.c
+++ b/sound/pci/echoaudio/echoaudio_dsp.c
@@ -100,7 +100,6 @@ static int write_dsp(struct echoaudio *chip, u32 data)
 			return 0;
 		}
 		udelay(1);
-		cond_resched();
 	}
 
 	chip->bad_board = true;		/* Set true until DSP re-loaded */
@@ -123,7 +122,6 @@ static int read_dsp(struct echoaudio *chip, u32 *data)
 			return 0;
 		}
 		udelay(1);
-		cond_resched();
 	}
 
 	chip->bad_board = true;		/* Set true until DSP re-loaded */
diff --git a/sound/pci/ens1370.c b/sound/pci/ens1370.c
index 89210b2c7342..4948ae411a94 100644
--- a/sound/pci/ens1370.c
+++ b/sound/pci/ens1370.c
@@ -501,7 +501,6 @@ static unsigned int snd_es1371_wait_src_ready(struct ensoniq * ensoniq)
 		r = inl(ES_REG(ensoniq, 1371_SMPRATE));
 		if ((r & ES_1371_SRC_RAM_BUSY) == 0)
 			return r;
-		cond_resched();
 	}
 	dev_err(ensoniq->card->dev, "wait src ready timeout 0x%lx [0x%x]\n",
 		   ES_REG(ensoniq, 1371_SMPRATE), r);
diff --git a/sound/pci/es1968.c b/sound/pci/es1968.c
index 4bc0f53c223b..1598880cfeea 100644
--- a/sound/pci/es1968.c
+++ b/sound/pci/es1968.c
@@ -612,7 +612,7 @@ static int snd_es1968_ac97_wait(struct es1968 *chip)
 	while (timeout-- > 0) {
 		if (!(inb(chip->io_port + ESM_AC97_INDEX) & 1))
 			return 0;
-		cond_resched();
+		cond_resched_stall();
 	}
 	dev_dbg(chip->card->dev, "ac97 timeout\n");
 	return 1; /* timeout */
diff --git a/sound/pci/lola/lola.c b/sound/pci/lola/lola.c
index 1aa30e90b86a..3c18b5543512 100644
--- a/sound/pci/lola/lola.c
+++ b/sound/pci/lola/lola.c
@@ -166,7 +166,6 @@ static int rirb_get_response(struct lola *chip, unsigned int *val,
 		if (time_after(jiffies, timeout))
 			break;
 		udelay(20);
-		cond_resched();
 	}
 	dev_warn(chip->card->dev, "RIRB response error\n");
 	if (!chip->polling_mode) {
diff --git a/sound/pci/mixart/mixart_hwdep.c b/sound/pci/mixart/mixart_hwdep.c
index 689c0f995a9c..1906cb861002 100644
--- a/sound/pci/mixart/mixart_hwdep.c
+++ b/sound/pci/mixart/mixart_hwdep.c
@@ -41,7 +41,7 @@ static int mixart_wait_nice_for_register_value(struct mixart_mgr *mgr,
 	do {	/* we may take too long time in this loop.
 		 * so give controls back to kernel if needed.
 		 */
-		cond_resched();
+		cond_resched_stall();
 
 		read = readl_be( MIXART_MEM( mgr, offset ));
 		if(is_egal) {
diff --git a/sound/pci/pcxhr/pcxhr_core.c b/sound/pci/pcxhr/pcxhr_core.c
index 23f253effb4f..221eb6570c5e 100644
--- a/sound/pci/pcxhr/pcxhr_core.c
+++ b/sound/pci/pcxhr/pcxhr_core.c
@@ -304,8 +304,6 @@ int pcxhr_load_xilinx_binary(struct pcxhr_mgr *mgr,
 			PCXHR_OUTPL(mgr, PCXHR_PLX_CHIPSC, chipsc);
 			mask >>= 1;
 		}
-		/* don't take too much time in this loop... */
-		cond_resched();
 	}
 	chipsc &= ~(PCXHR_CHIPSC_DATA_CLK | PCXHR_CHIPSC_DATA_IN);
 	PCXHR_OUTPL(mgr, PCXHR_PLX_CHIPSC, chipsc);
@@ -356,9 +354,6 @@ static int pcxhr_download_dsp(struct pcxhr_mgr *mgr, const struct firmware *dsp)
 		PCXHR_OUTPB(mgr, PCXHR_DSP_TXH, data[0]);
 		PCXHR_OUTPB(mgr, PCXHR_DSP_TXM, data[1]);
 		PCXHR_OUTPB(mgr, PCXHR_DSP_TXL, data[2]);
-
-		/* don't take too much time in this loop... */
-		cond_resched();
 	}
 	/* give some time to boot the DSP */
 	msleep(PCXHR_WAIT_DEFAULT);
diff --git a/sound/pci/vx222/vx222_ops.c b/sound/pci/vx222/vx222_ops.c
index 3e7e928b24f8..84a59566b036 100644
--- a/sound/pci/vx222/vx222_ops.c
+++ b/sound/pci/vx222/vx222_ops.c
@@ -376,8 +376,6 @@ static int vx2_load_xilinx_binary(struct vx_core *chip, const struct firmware *x
 	for (i = 0; i < xilinx->size; i++, image++) {
 		if (put_xilinx_data(chip, port, 8, *image) < 0)
 			return -EINVAL;
-		/* don't take too much time in this loop... */
-		cond_resched();
 	}
 	put_xilinx_data(chip, port, 4, 0xff); /* end signature */
 
diff --git a/sound/x86/intel_hdmi_audio.c b/sound/x86/intel_hdmi_audio.c
index ab95fb34a635..e734d2f5f711 100644
--- a/sound/x86/intel_hdmi_audio.c
+++ b/sound/x86/intel_hdmi_audio.c
@@ -1020,7 +1020,6 @@ static void wait_clear_underrun_bit(struct snd_intelhad *intelhaddata)
 		if (!(val & AUD_HDMI_STATUS_MASK_UNDERRUN))
 			return;
 		udelay(100);
-		cond_resched();
 		had_write_register(intelhaddata, AUD_HDMI_STATUS, val);
 	}
 	dev_err(intelhaddata->dev, "Unable to clear UNDERRUN bits\n");
-- 
2.31.1
[RFC PATCH 81/86] treewide: md: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (say, for code that
has mostly been tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most of the uses here are in set-1. Remove them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Coly Li <colyli@suse.de> 
Cc: Kent Overstreet <kent.overstreet@gmail.com> 
Cc: Alasdair Kergon <agk@redhat.com> 
Cc: Mike Snitzer <snitzer@kernel.org> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/md/bcache/btree.c     |  5 -----
 drivers/md/bcache/journal.c   |  2 --
 drivers/md/bcache/sysfs.c     |  1 -
 drivers/md/bcache/writeback.c |  2 --
 drivers/md/dm-bufio.c         | 14 --------------
 drivers/md/dm-cache-target.c  |  4 ----
 drivers/md/dm-crypt.c         |  3 ---
 drivers/md/dm-integrity.c     |  3 ---
 drivers/md/dm-kcopyd.c        |  2 --
 drivers/md/dm-snap.c          |  1 -
 drivers/md/dm-stats.c         |  8 --------
 drivers/md/dm-thin.c          |  2 --
 drivers/md/dm-writecache.c    | 11 -----------
 drivers/md/dm.c               |  4 ----
 drivers/md/md.c               |  1 -
 drivers/md/raid1.c            |  2 --
 drivers/md/raid10.c           |  3 ---
 drivers/md/raid5.c            |  2 --
 18 files changed, 70 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index fd121a61f17c..b9389d3c39d7 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1826,7 +1826,6 @@ static void bch_btree_gc(struct cache_set *c)
 	do {
 		ret = bcache_btree_root(gc_root, c, &op, &writes, &stats);
 		closure_sync(&writes);
-		cond_resched();
 
 		if (ret == -EAGAIN)
 			schedule_timeout_interruptible(msecs_to_jiffies
@@ -1981,7 +1980,6 @@ static int bch_btree_check_thread(void *arg)
 				goto out;
 			}
 			skip_nr--;
-			cond_resched();
 		}
 
 		if (p) {
@@ -2005,7 +2003,6 @@ static int bch_btree_check_thread(void *arg)
 		}
 		p = NULL;
 		prev_idx = cur_idx;
-		cond_resched();
 	}
 
 out:
@@ -2670,8 +2667,6 @@ void bch_refill_keybuf(struct cache_set *c, struct keybuf *buf,
 	struct bkey start = buf->last_scanned;
 	struct refill refill;
 
-	cond_resched();
-
 	bch_btree_op_init(&refill.op, -1);
 	refill.nr_found	= 0;
 	refill.buf	= buf;
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index c182c21de2e8..5e06a665d082 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -384,8 +384,6 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 
 			BUG_ON(!bch_keylist_empty(&keylist));
 			keys++;
-
-			cond_resched();
 		}
 
 		if (i->pin)
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 0e2c1880f60b..d7e248b54abd 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -1030,7 +1030,6 @@ KTYPE(bch_cache_set_internal);
 
 static int __bch_cache_cmp(const void *l, const void *r)
 {
-	cond_resched();
 	return *((uint16_t *)r) - *((uint16_t *)l);
 }
 
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 24c049067f61..7da09bba3067 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -863,8 +863,6 @@ static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
 					     KEY_START(k), KEY_SIZE(k));
 
 	op->count++;
-	if (!(op->count % INIT_KEYS_EACH_TIME))
-		cond_resched();
 
 	return MAP_CONTINUE;
 }
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index bc309e41d074..0b8f3341fa79 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -294,8 +294,6 @@ static struct lru_entry *lru_evict(struct lru *lru, le_predicate pred, void *con
 		}
 
 		h = h->next;
-
-		cond_resched();
 	}
 
 	return NULL;
@@ -762,7 +760,6 @@ static void __cache_iterate(struct dm_buffer_cache *bc, int list_mode,
 		case IT_COMPLETE:
 			return;
 		}
-		cond_resched();
 
 		le = to_le(le->list.next);
 	} while (le != first);
@@ -890,8 +887,6 @@ static void __remove_range(struct dm_buffer_cache *bc,
 	struct dm_buffer *b;
 
 	while (true) {
-		cond_resched();
-
 		b = __find_next(root, begin);
 		if (!b || (b->block >= end))
 			break;
@@ -1435,7 +1430,6 @@ static void __flush_write_list(struct list_head *write_list)
 			list_entry(write_list->next, struct dm_buffer, write_list);
 		list_del(&b->write_list);
 		submit_io(b, REQ_OP_WRITE, write_endio);
-		cond_resched();
 	}
 	blk_finish_plug(&plug);
 }
@@ -1953,8 +1947,6 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
 				submit_io(b, REQ_OP_READ, read_endio);
 			dm_bufio_release(b);
 
-			cond_resched();
-
 			if (!n_blocks)
 				goto flush_plug;
 			dm_bufio_lock(c);
@@ -2093,8 +2085,6 @@ int dm_bufio_write_dirty_buffers(struct dm_bufio_client *c)
 			cache_mark(&c->cache, b, LIST_CLEAN);
 
 		cache_put_and_wake(c, b);
-
-		cond_resched();
 	}
 	lru_iter_end(&it);
 
@@ -2350,7 +2340,6 @@ static void __scan(struct dm_bufio_client *c)
 
 			atomic_long_dec(&c->need_shrink);
 			freed++;
-			cond_resched();
 		}
 	}
 }
@@ -2659,8 +2648,6 @@ static unsigned long __evict_many(struct dm_bufio_client *c,
 
 		__make_buffer_clean(b);
 		__free_buffer_wake(b);
-
-		cond_resched();
 	}
 
 	return count;
@@ -2802,7 +2789,6 @@ static void evict_old(void)
 	while (dm_bufio_current_allocated > threshold) {
 		if (!__evict_a_few(64))
 			break;
-		cond_resched();
 	}
 	mutex_unlock(&dm_bufio_clients_lock);
 }
diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index 911f73f7ebba..df136b29471a 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -1829,7 +1829,6 @@ static void process_deferred_bios(struct work_struct *ws)
 
 		else
 			commit_needed = process_bio(cache, bio) || commit_needed;
-		cond_resched();
 	}
 
 	if (commit_needed)
@@ -1853,7 +1852,6 @@ static void requeue_deferred_bios(struct cache *cache)
 	while ((bio = bio_list_pop(&bios))) {
 		bio->bi_status = BLK_STS_DM_REQUEUE;
 		bio_endio(bio);
-		cond_resched();
 	}
 }
 
@@ -1894,8 +1892,6 @@ static void check_migrations(struct work_struct *ws)
 		r = mg_start(cache, op, NULL);
 		if (r)
 			break;
-
-		cond_resched();
 	}
 }
 
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 5315fd261c23..70a24ade34af 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1629,8 +1629,6 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
 			atomic_dec(&ctx->cc_pending);
 			ctx->cc_sector += sector_step;
 			tag_offset++;
-			if (!atomic)
-				cond_resched();
 			continue;
 		/*
 		 * There was a data integrity error.
@@ -1965,7 +1963,6 @@ static int dmcrypt_write(void *data)
 			io = crypt_io_from_node(rb_first(&write_tree));
 			rb_erase(&io->rb_node, &write_tree);
 			kcryptd_io_write(io);
-			cond_resched();
 		} while (!RB_EMPTY_ROOT(&write_tree));
 		blk_finish_plug(&plug);
 	}
diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c
index 97a8d5fc9ebb..63c88f23b585 100644
--- a/drivers/md/dm-integrity.c
+++ b/drivers/md/dm-integrity.c
@@ -2717,12 +2717,10 @@ static void integrity_recalc(struct work_struct *w)
 				       ic->sectors_per_block, BITMAP_OP_TEST_ALL_CLEAR)) {
 			logical_sector += ic->sectors_per_block;
 			n_sectors -= ic->sectors_per_block;
-			cond_resched();
 		}
 		while (block_bitmap_op(ic, ic->recalc_bitmap, logical_sector + n_sectors - ic->sectors_per_block,
 				       ic->sectors_per_block, BITMAP_OP_TEST_ALL_CLEAR)) {
 			n_sectors -= ic->sectors_per_block;
-			cond_resched();
 		}
 		get_area_and_offset(ic, logical_sector, &area, &offset);
 	}
@@ -2782,7 +2780,6 @@ static void integrity_recalc(struct work_struct *w)
 	}
 
 advance_and_next:
-	cond_resched();
 
 	spin_lock_irq(&ic->endio_wait.lock);
 	remove_range_unlocked(ic, &range);
diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index d01807c50f20..8a91e83188e7 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -512,8 +512,6 @@ static int run_complete_job(struct kcopyd_job *job)
 	if (atomic_dec_and_test(&kc->nr_jobs))
 		wake_up(&kc->destroyq);
 
-	cond_resched();
-
 	return 0;
 }
 
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index bf7a574499a3..cd8891c12cca 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1762,7 +1762,6 @@ static void copy_callback(int read_err, unsigned long write_err, void *context)
 			s->exception_complete_sequence++;
 			rb_erase(&pe->out_of_order_node, &s->out_of_order_tree);
 			complete_exception(pe);
-			cond_resched();
 		}
 	} else {
 		struct rb_node *parent = NULL;
diff --git a/drivers/md/dm-stats.c b/drivers/md/dm-stats.c
index db2d997a6c18..d6878cb7b0ef 100644
--- a/drivers/md/dm-stats.c
+++ b/drivers/md/dm-stats.c
@@ -230,7 +230,6 @@ void dm_stats_cleanup(struct dm_stats *stats)
 				       atomic_read(&shared->in_flight[READ]),
 				       atomic_read(&shared->in_flight[WRITE]));
 			}
-			cond_resched();
 		}
 		dm_stat_free(&s->rcu_head);
 	}
@@ -336,7 +335,6 @@ static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
 	for (ni = 0; ni < n_entries; ni++) {
 		atomic_set(&s->stat_shared[ni].in_flight[READ], 0);
 		atomic_set(&s->stat_shared[ni].in_flight[WRITE], 0);
-		cond_resched();
 	}
 
 	if (s->n_histogram_entries) {
@@ -350,7 +348,6 @@ static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
 		for (ni = 0; ni < n_entries; ni++) {
 			s->stat_shared[ni].tmp.histogram = hi;
 			hi += s->n_histogram_entries + 1;
-			cond_resched();
 		}
 	}
 
@@ -372,7 +369,6 @@ static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
 			for (ni = 0; ni < n_entries; ni++) {
 				p[ni].histogram = hi;
 				hi += s->n_histogram_entries + 1;
-				cond_resched();
 			}
 		}
 	}
@@ -512,7 +508,6 @@ static int dm_stats_list(struct dm_stats *stats, const char *program,
 			}
 			DMEMIT("\n");
 		}
-		cond_resched();
 	}
 	mutex_unlock(&stats->mutex);
 
@@ -794,7 +789,6 @@ static void __dm_stat_clear(struct dm_stat *s, size_t idx_start, size_t idx_end,
 				local_irq_enable();
 			}
 		}
-		cond_resched();
 	}
 }
 
@@ -910,8 +904,6 @@ static int dm_stats_print(struct dm_stats *stats, int id,
 
 		if (unlikely(sz + 1 >= maxlen))
 			goto buffer_overflow;
-
-		cond_resched();
 	}
 
 	if (clear)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 07c7f9795b10..52e4a7dc6923 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -2234,7 +2234,6 @@ static void process_thin_deferred_bios(struct thin_c *tc)
 			throttle_work_update(&pool->throttle);
 			dm_pool_issue_prefetches(pool->pmd);
 		}
-		cond_resched();
 	}
 	blk_finish_plug(&plug);
 }
@@ -2317,7 +2316,6 @@ static void process_thin_deferred_cells(struct thin_c *tc)
 			else
 				pool->process_cell(tc, cell);
 		}
-		cond_resched();
 	} while (!list_empty(&cells));
 }
 
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index 074cb785eafc..75ecc26915a1 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -321,8 +321,6 @@ static int persistent_memory_claim(struct dm_writecache *wc)
 			while (daa-- && i < p) {
 				pages[i++] = pfn_t_to_page(pfn);
 				pfn.val++;
-				if (!(i & 15))
-					cond_resched();
 			}
 		} while (i < p);
 		wc->memory_map = vmap(pages, p, VM_MAP, PAGE_KERNEL);
@@ -819,7 +817,6 @@ static void writecache_flush(struct dm_writecache *wc)
 		if (writecache_entry_is_committed(wc, e2))
 			break;
 		e = e2;
-		cond_resched();
 	}
 	writecache_commit_flushed(wc, true);
 
@@ -848,7 +845,6 @@ static void writecache_flush(struct dm_writecache *wc)
 		if (unlikely(e->lru.prev == &wc->lru))
 			break;
 		e = container_of(e->lru.prev, struct wc_entry, lru);
-		cond_resched();
 	}
 
 	if (need_flush_after_free)
@@ -970,7 +966,6 @@ static int writecache_alloc_entries(struct dm_writecache *wc)
 
 		e->index = b;
 		e->write_in_progress = false;
-		cond_resched();
 	}
 
 	return 0;
@@ -1058,7 +1053,6 @@ static void writecache_resume(struct dm_target *ti)
 			e->original_sector = le64_to_cpu(wme.original_sector);
 			e->seq_count = le64_to_cpu(wme.seq_count);
 		}
-		cond_resched();
 	}
 #endif
 	for (b = 0; b < wc->n_blocks; b++) {
@@ -1093,7 +1087,6 @@ static void writecache_resume(struct dm_target *ti)
 				}
 			}
 		}
-		cond_resched();
 	}
 
 	if (need_flush) {
@@ -1824,7 +1817,6 @@ static void __writeback_throttle(struct dm_writecache *wc, struct writeback_list
 			wc_unlock(wc);
 		}
 	}
-	cond_resched();
 }
 
 static void __writecache_writeback_pmem(struct dm_writecache *wc, struct writeback_list *wbl)
@@ -2024,7 +2016,6 @@ static void writecache_writeback(struct work_struct *work)
 				     read_original_sector(wc, e))) {
 				BUG_ON(!f->write_in_progress);
 				list_move(&e->lru, &skipped);
-				cond_resched();
 				continue;
 			}
 		}
@@ -2079,7 +2070,6 @@ static void writecache_writeback(struct work_struct *work)
 				break;
 			}
 		}
-		cond_resched();
 	}
 
 	if (!list_empty(&skipped)) {
@@ -2168,7 +2158,6 @@ static int init_memory(struct dm_writecache *wc)
 
 	for (b = 0; b < wc->n_blocks; b++) {
 		write_original_sector_seq_count(wc, &wc->entries[b], -1, -1);
-		cond_resched();
 	}
 
 	writecache_flush_all_metadata(wc);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 64a1f306c96c..ac0aff4de190 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -996,7 +996,6 @@ static void dm_wq_requeue_work(struct work_struct *work)
 		io->next = NULL;
 		__dm_io_complete(io, false);
 		io = next;
-		cond_resched();
 	}
 }
 
@@ -1379,12 +1378,10 @@ static noinline void __set_swap_bios_limit(struct mapped_device *md, int latch)
 {
 	mutex_lock(&md->swap_bios_lock);
 	while (latch < md->swap_bios) {
-		cond_resched();
 		down(&md->swap_bios_semaphore);
 		md->swap_bios--;
 	}
 	while (latch > md->swap_bios) {
-		cond_resched();
 		up(&md->swap_bios_semaphore);
 		md->swap_bios++;
 	}
@@ -2583,7 +2580,6 @@ static void dm_wq_work(struct work_struct *work)
 			break;
 
 		submit_bio_noacct(bio);
-		cond_resched();
 	}
 }
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index a104a025084d..88e8148be28f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9048,7 +9048,6 @@ void md_do_sync(struct md_thread *thread)
 		 * about not overloading the IO subsystem. (things like an
 		 * e2fsck being done on the RAID array should execute fast)
 		 */
-		cond_resched();
 
 		recovery_done = io_sectors - atomic_read(&mddev->recovery_active);
 		currspeed = ((unsigned long)(recovery_done - mddev->resync_mark_cnt))/2
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 2aabac773fe7..71bd8d8d1d1c 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -807,7 +807,6 @@ static void flush_bio_list(struct r1conf *conf, struct bio *bio)
 
 		raid1_submit_write(bio);
 		bio = next;
-		cond_resched();
 	}
 }
 
@@ -2613,7 +2612,6 @@ static void raid1d(struct md_thread *thread)
 		else
 			WARN_ON_ONCE(1);
 
-		cond_resched();
 		if (mddev->sb_flags & ~(1<<MD_SB_CHANGE_PENDING))
 			md_check_recovery(mddev);
 	}
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 023413120851..d41f856ebcf4 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -916,7 +916,6 @@ static void flush_pending_writes(struct r10conf *conf)
 
 			raid1_submit_write(bio);
 			bio = next;
-			cond_resched();
 		}
 		blk_finish_plug(&plug);
 	} else
@@ -1132,7 +1131,6 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 
 		raid1_submit_write(bio);
 		bio = next;
-		cond_resched();
 	}
 	kfree(plug);
 }
@@ -3167,7 +3165,6 @@ static void raid10d(struct md_thread *thread)
 		else
 			WARN_ON_ONCE(1);
 
-		cond_resched();
 		if (mddev->sb_flags & ~(1<<MD_SB_CHANGE_PENDING))
 			md_check_recovery(mddev);
 	}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 284cd71bcc68..47b995c97363 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6727,8 +6727,6 @@ static int handle_active_stripes(struct r5conf *conf, int group,
 		handle_stripe(batch[i]);
 	log_write_stripe_run(conf);
 
-	cond_resched();
-
 	spin_lock_irq(&conf->device_lock);
 	for (i = 0; i < batch_size; i++) {
 		hash = batch[i]->hash_lock_index;
-- 
2.31.1
[RFC PATCH 82/86] treewide: mtd: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (say, for code that
has mostly been tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most of the uses here are in set-1 (some right after we give up a lock
or enable bottom-halves, causing an explicit preemption check).

There are a few cases from set-3. Replace them with
cond_resched_stall(). Some of those places, however, have wait-times of
milliseconds, so maybe we should just use an msleep() there?
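
For instance (illustrative only, with a made-up ready-check helper),
instead of:

    do {
            if (chip_ready(chip))
                    break;
            cond_resched_stall();
    } while (time_before(jiffies, timeout));

such a millisecond-scale poll could simply sleep between reads:

    do {
            if (chip_ready(chip))
                    break;
            msleep(1);
    } while (time_before(jiffies, timeout));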

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Tudor Ambarus <tudor.ambarus@linaro.org>
Cc: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/mtd/chips/cfi_cmdset_0001.c        |  6 ------
 drivers/mtd/chips/cfi_cmdset_0002.c        |  1 -
 drivers/mtd/chips/cfi_util.c               |  2 +-
 drivers/mtd/devices/spear_smi.c            |  2 +-
 drivers/mtd/devices/sst25l.c               |  3 +--
 drivers/mtd/devices/st_spi_fsm.c           |  4 ----
 drivers/mtd/inftlcore.c                    |  5 -----
 drivers/mtd/lpddr/lpddr_cmds.c             |  6 +-----
 drivers/mtd/mtd_blkdevs.c                  |  1 -
 drivers/mtd/nand/onenand/onenand_base.c    | 18 +-----------------
 drivers/mtd/nand/onenand/onenand_samsung.c |  8 +++++++-
 drivers/mtd/nand/raw/diskonchip.c          |  4 ++--
 drivers/mtd/nand/raw/fsmc_nand.c           |  3 +--
 drivers/mtd/nand/raw/hisi504_nand.c        |  2 +-
 drivers/mtd/nand/raw/nand_base.c           |  3 +--
 drivers/mtd/nand/raw/nand_legacy.c         | 17 +++++++++++++++--
 drivers/mtd/spi-nor/core.c                 |  8 +++++++-
 drivers/mtd/tests/mtd_test.c               |  2 --
 drivers/mtd/tests/mtd_test.h               |  2 +-
 drivers/mtd/tests/pagetest.c               |  1 -
 drivers/mtd/tests/readtest.c               |  2 --
 drivers/mtd/tests/torturetest.c            |  1 -
 drivers/mtd/ubi/attach.c                   | 10 ----------
 drivers/mtd/ubi/build.c                    |  2 --
 drivers/mtd/ubi/cdev.c                     |  4 ----
 drivers/mtd/ubi/eba.c                      |  8 --------
 drivers/mtd/ubi/misc.c                     |  2 --
 drivers/mtd/ubi/vtbl.c                     |  6 ------
 drivers/mtd/ubi/wl.c                       | 13 -------------
 29 files changed, 40 insertions(+), 106 deletions(-)

diff --git a/drivers/mtd/chips/cfi_cmdset_0001.c b/drivers/mtd/chips/cfi_cmdset_0001.c
index 11b06fefaa0e..c6abed74e4df 100644
--- a/drivers/mtd/chips/cfi_cmdset_0001.c
+++ b/drivers/mtd/chips/cfi_cmdset_0001.c
@@ -1208,7 +1208,6 @@ static int __xipram xip_wait_for_operation(
 			local_irq_enable();
 			mutex_unlock(&chip->mutex);
 			xip_iprefetch();
-			cond_resched();
 
 			/*
 			 * We're back.  However someone else might have
@@ -1337,7 +1336,6 @@ static int inval_cache_and_wait_for_operation(
 			sleep_time = 1000000/HZ;
 		} else {
 			udelay(1);
-			cond_resched();
 			timeo--;
 		}
 		mutex_lock(&chip->mutex);
@@ -1913,10 +1911,6 @@ static int cfi_intelext_writev (struct mtd_info *mtd, const struct kvec *vecs,
 				return 0;
 		}
 
-		/* Be nice and reschedule with the chip in a usable state for other
-		   processes. */
-		cond_resched();
-
 	} while (len);
 
 	return 0;
diff --git a/drivers/mtd/chips/cfi_cmdset_0002.c b/drivers/mtd/chips/cfi_cmdset_0002.c
index df589d9b4d70..f6d8f8ccbe3f 100644
--- a/drivers/mtd/chips/cfi_cmdset_0002.c
+++ b/drivers/mtd/chips/cfi_cmdset_0002.c
@@ -1105,7 +1105,6 @@ static void __xipram xip_udelay(struct map_info *map, struct flchip *chip,
 			local_irq_enable();
 			mutex_unlock(&chip->mutex);
 			xip_iprefetch();
-			cond_resched();
 
 			/*
 			 * We're back.  However someone else might have
diff --git a/drivers/mtd/chips/cfi_util.c b/drivers/mtd/chips/cfi_util.c
index 140c69a67e82..c178dae31a59 100644
--- a/drivers/mtd/chips/cfi_util.c
+++ b/drivers/mtd/chips/cfi_util.c
@@ -28,7 +28,7 @@ void cfi_udelay(int us)
 		msleep(DIV_ROUND_UP(us, 1000));
 	} else {
 		udelay(us);
-		cond_resched();
+		cond_resched_stall();
 	}
 }
 EXPORT_SYMBOL(cfi_udelay);
diff --git a/drivers/mtd/devices/spear_smi.c b/drivers/mtd/devices/spear_smi.c
index 0a35e5236ae5..9b4d226633a9 100644
--- a/drivers/mtd/devices/spear_smi.c
+++ b/drivers/mtd/devices/spear_smi.c
@@ -278,7 +278,7 @@ static int spear_smi_wait_till_ready(struct spear_smi *dev, u32 bank,
 			return 0;
 		}
 
-		cond_resched();
+		cond_resched_stall();
 	} while (!time_after_eq(jiffies, finish));
 
 	dev_err(&dev->pdev->dev, "smi controller is busy, timeout\n");
diff --git a/drivers/mtd/devices/sst25l.c b/drivers/mtd/devices/sst25l.c
index 8813994ce9f4..ff16147d9bdd 100644
--- a/drivers/mtd/devices/sst25l.c
+++ b/drivers/mtd/devices/sst25l.c
@@ -132,8 +132,7 @@ static int sst25l_wait_till_ready(struct sst25l_flash *flash)
 			return err;
 		if (!(status & SST25L_STATUS_BUSY))
 			return 0;
-
-		cond_resched();
+		cond_resched_stall();
 	} while (!time_after_eq(jiffies, deadline));
 
 	return -ETIMEDOUT;
diff --git a/drivers/mtd/devices/st_spi_fsm.c b/drivers/mtd/devices/st_spi_fsm.c
index 95530cbbb1e0..a0f5874c1941 100644
--- a/drivers/mtd/devices/st_spi_fsm.c
+++ b/drivers/mtd/devices/st_spi_fsm.c
@@ -738,8 +738,6 @@ static void stfsm_wait_seq(struct stfsm *fsm)
 
 		if (stfsm_is_idle(fsm))
 			return;
-
-		cond_resched();
 	}
 
 	dev_err(fsm->dev, "timeout on sequence completion\n");
@@ -901,8 +899,6 @@ static uint8_t stfsm_wait_busy(struct stfsm *fsm)
 		if (!timeout)
 			/* Restart */
 			writel(seq->seq_cfg, fsm->base + SPI_FAST_SEQ_CFG);
-
-		cond_resched();
 	}
 
 	dev_err(fsm->dev, "timeout on wait_busy\n");
diff --git a/drivers/mtd/inftlcore.c b/drivers/mtd/inftlcore.c
index 9739387cff8c..c757b8a25748 100644
--- a/drivers/mtd/inftlcore.c
+++ b/drivers/mtd/inftlcore.c
@@ -732,11 +732,6 @@ static void INFTL_trydeletechain(struct INFTLrecord *inftl, unsigned thisVUC)
 
 		/* Now sort out whatever was pointing to it... */
 		*prevEUN = BLOCK_NIL;
-
-		/* Ideally we'd actually be responsive to new
-		   requests while we're doing this -- if there's
-		   free space why should others be made to wait? */
-		cond_resched();
 	}
 
 	inftl->VUtable[thisVUC] = BLOCK_NIL;
diff --git a/drivers/mtd/lpddr/lpddr_cmds.c b/drivers/mtd/lpddr/lpddr_cmds.c
index 3c3939bc2dad..ad8992d24082 100644
--- a/drivers/mtd/lpddr/lpddr_cmds.c
+++ b/drivers/mtd/lpddr/lpddr_cmds.c
@@ -161,7 +161,7 @@ static int wait_for_ready(struct map_info *map, struct flchip *chip,
 			sleep_time = 1000000/HZ;
 		} else {
 			udelay(1);
-			cond_resched();
+			cond_resched_stall();
 			timeo--;
 		}
 		mutex_lock(&chip->mutex);
@@ -677,10 +677,6 @@ static int lpddr_writev(struct mtd_info *mtd, const struct kvec *vecs,
 		(*retlen) += size;
 		len -= size;
 
-		/* Be nice and reschedule with the chip in a usable
-		 * state for other processes */
-		cond_resched();
-
 	} while (len);
 
 	return 0;
diff --git a/drivers/mtd/mtd_blkdevs.c b/drivers/mtd/mtd_blkdevs.c
index ff18636e0889..96bff5627a31 100644
--- a/drivers/mtd/mtd_blkdevs.c
+++ b/drivers/mtd/mtd_blkdevs.c
@@ -158,7 +158,6 @@ static void mtd_blktrans_work(struct mtd_blktrans_dev *dev)
 		}
 
 		background_done = 0;
-		cond_resched();
 		spin_lock_irq(&dev->queue_lock);
 	}
 }
diff --git a/drivers/mtd/nand/onenand/onenand_base.c b/drivers/mtd/nand/onenand/onenand_base.c
index f66385faf631..97d07e4cc150 100644
--- a/drivers/mtd/nand/onenand/onenand_base.c
+++ b/drivers/mtd/nand/onenand/onenand_base.c
@@ -567,7 +567,7 @@ static int onenand_wait(struct mtd_info *mtd, int state)
 			break;
 
 		if (state != FL_READING && state != FL_PREPARING_ERASE)
-			cond_resched();
+			cond_resched_stall();
 	}
 	/* To get correct interrupt status in timeout case */
 	interrupt = this->read_word(this->base + ONENAND_REG_INTERRUPT);
@@ -1143,8 +1143,6 @@ static int onenand_mlc_read_ops_nolock(struct mtd_info *mtd, loff_t from,
 	stats = mtd->ecc_stats;
 
 	while (read < len) {
-		cond_resched();
-
 		thislen = min_t(int, writesize, len - read);
 
 		column = from & (writesize - 1);
@@ -1307,7 +1305,6 @@ static int onenand_read_ops_nolock(struct mtd_info *mtd, loff_t from,
 		buf += thislen;
 		thislen = min_t(int, writesize, len - read);
 		column = 0;
-		cond_resched();
 		/* Now wait for load */
 		ret = this->wait(mtd, FL_READING);
 		onenand_update_bufferram(mtd, from, !ret);
@@ -1378,8 +1375,6 @@ static int onenand_read_oob_nolock(struct mtd_info *mtd, loff_t from,
 	readcmd = ONENAND_IS_4KB_PAGE(this) ? ONENAND_CMD_READ : ONENAND_CMD_READOOB;
 
 	while (read < len) {
-		cond_resched();
-
 		thislen = oobsize - column;
 		thislen = min_t(int, thislen, len);
 
@@ -1565,8 +1560,6 @@ int onenand_bbt_read_oob(struct mtd_info *mtd, loff_t from,
 	readcmd = ONENAND_IS_4KB_PAGE(this) ? ONENAND_CMD_READ : ONENAND_CMD_READOOB;
 
 	while (read < len) {
-		cond_resched();
-
 		thislen = mtd->oobsize - column;
 		thislen = min_t(int, thislen, len);
 
@@ -1838,8 +1831,6 @@ static int onenand_write_ops_nolock(struct mtd_info *mtd, loff_t to,
 			thislen = min_t(int, mtd->writesize - column, len - written);
 			thisooblen = min_t(int, oobsize - oobcolumn, ooblen - oobwritten);
 
-			cond_resched();
-
 			this->command(mtd, ONENAND_CMD_BUFFERRAM, to, thislen);
 
 			/* Partial page write */
@@ -2022,8 +2013,6 @@ static int onenand_write_oob_nolock(struct mtd_info *mtd, loff_t to,
 	while (written < len) {
 		int thislen = min_t(int, oobsize, len - written);
 
-		cond_resched();
-
 		this->command(mtd, ONENAND_CMD_BUFFERRAM, to, mtd->oobsize);
 
 		/* We send data to spare ram with oobsize
@@ -2232,7 +2221,6 @@ static int onenand_multiblock_erase(struct mtd_info *mtd,
 		}
 
 		/* last block of 64-eb series */
-		cond_resched();
 		this->command(mtd, ONENAND_CMD_ERASE, addr, block_size);
 		onenand_invalidate_bufferram(mtd, addr, block_size);
 
@@ -2288,8 +2276,6 @@ static int onenand_block_by_block_erase(struct mtd_info *mtd,
 
 	/* Loop through the blocks */
 	while (len) {
-		cond_resched();
-
 		/* Check if we have a bad block, we do not erase bad blocks */
 		if (onenand_block_isbad_nolock(mtd, addr, 0)) {
 			printk(KERN_WARNING "%s: attempt to erase a bad block "
@@ -2799,8 +2785,6 @@ static int onenand_otp_write_oob_nolock(struct mtd_info *mtd, loff_t to,
 	while (written < len) {
 		int thislen = min_t(int, oobsize, len - written);
 
-		cond_resched();
-
 		block = (int) (to >> this->erase_shift);
 		/*
 		 * Write 'DFS, FBA' of Flash
diff --git a/drivers/mtd/nand/onenand/onenand_samsung.c b/drivers/mtd/nand/onenand/onenand_samsung.c
index fd6890a03d55..2e0c8f50d77d 100644
--- a/drivers/mtd/nand/onenand/onenand_samsung.c
+++ b/drivers/mtd/nand/onenand/onenand_samsung.c
@@ -338,8 +338,14 @@ static int s3c_onenand_wait(struct mtd_info *mtd, int state)
 		if (stat & flags)
 			break;
 
+		/*
+		 * Use a cond_resched_stall() to avoid spinning in
+		 * a tight loop.
+		 * Though, given that the timeout is in milliseconds,
+		 * maybe this should timeout or event wait?
+		 */
 		if (state != FL_READING)
-			cond_resched();
+			cond_resched_stall();
 	}
 	/* To get correct interrupt status in timeout case */
 	stat = s3c_read_reg(INT_ERR_STAT_OFFSET);
diff --git a/drivers/mtd/nand/raw/diskonchip.c b/drivers/mtd/nand/raw/diskonchip.c
index 5d2ddb037a9a..930b4fdf75e0 100644
--- a/drivers/mtd/nand/raw/diskonchip.c
+++ b/drivers/mtd/nand/raw/diskonchip.c
@@ -248,7 +248,7 @@ static int _DoC_WaitReady(struct doc_priv *doc)
 				return -EIO;
 			}
 			udelay(1);
-			cond_resched();
+			cond_resched_stall();
 		}
 	} else {
 		while (!(ReadDOC(docptr, CDSNControl) & CDSN_CTRL_FR_B)) {
@@ -257,7 +257,7 @@ static int _DoC_WaitReady(struct doc_priv *doc)
 				return -EIO;
 			}
 			udelay(1);
-			cond_resched();
+			cond_resched_stall();
 		}
 	}
 
diff --git a/drivers/mtd/nand/raw/fsmc_nand.c b/drivers/mtd/nand/raw/fsmc_nand.c
index 811982da3557..20e88e98e517 100644
--- a/drivers/mtd/nand/raw/fsmc_nand.c
+++ b/drivers/mtd/nand/raw/fsmc_nand.c
@@ -398,8 +398,7 @@ static int fsmc_read_hwecc_ecc4(struct nand_chip *chip, const u8 *data,
 	do {
 		if (readl_relaxed(host->regs_va + STS) & FSMC_CODE_RDY)
 			break;
-
-		cond_resched();
+		cond_resched_stall();
 	} while (!time_after_eq(jiffies, deadline));
 
 	if (time_after_eq(jiffies, deadline)) {
diff --git a/drivers/mtd/nand/raw/hisi504_nand.c b/drivers/mtd/nand/raw/hisi504_nand.c
index fe291a2e5c77..bf669b1750f8 100644
--- a/drivers/mtd/nand/raw/hisi504_nand.c
+++ b/drivers/mtd/nand/raw/hisi504_nand.c
@@ -819,7 +819,7 @@ static int hisi_nfc_suspend(struct device *dev)
 		if (((hinfc_read(host, HINFC504_STATUS) & 0x1) == 0x0) &&
 		    (hinfc_read(host, HINFC504_DMA_CTRL) &
 		     HINFC504_DMA_CTRL_DMA_START)) {
-			cond_resched();
+			cond_resched_stall();
 			return 0;
 		}
 	}
diff --git a/drivers/mtd/nand/raw/nand_base.c b/drivers/mtd/nand/raw/nand_base.c
index 1fcac403cee6..656126b05f09 100644
--- a/drivers/mtd/nand/raw/nand_base.c
+++ b/drivers/mtd/nand/raw/nand_base.c
@@ -730,8 +730,7 @@ int nand_gpio_waitrdy(struct nand_chip *chip, struct gpio_desc *gpiod,
 	do {
 		if (gpiod_get_value_cansleep(gpiod))
 			return 0;
-
-		cond_resched();
+		cond_resched_stall();
 	} while	(time_before(jiffies, timeout_ms));
 
 	return gpiod_get_value_cansleep(gpiod) ? 0 : -ETIMEDOUT;
diff --git a/drivers/mtd/nand/raw/nand_legacy.c b/drivers/mtd/nand/raw/nand_legacy.c
index 743792edf98d..aaef537b46c3 100644
--- a/drivers/mtd/nand/raw/nand_legacy.c
+++ b/drivers/mtd/nand/raw/nand_legacy.c
@@ -203,7 +203,13 @@ void nand_wait_ready(struct nand_chip *chip)
 	do {
 		if (chip->legacy.dev_ready(chip))
 			return;
-		cond_resched();
+		/*
+		 * Use a cond_resched_stall() to avoid spinning in
+		 * a tight loop.
+		 * Though, given that the timeout is in milliseconds,
+		 * maybe this should timeout or event wait?
+		 */
+		cond_resched_stall();
 	} while (time_before(jiffies, timeo));
 
 	if (!chip->legacy.dev_ready(chip))
@@ -565,7 +571,14 @@ static int nand_wait(struct nand_chip *chip)
 				if (status & NAND_STATUS_READY)
 					break;
 			}
-			cond_resched();
+
+			/*
+			 * Use a cond_resched_stall() to avoid spinning in
+			 * a tight loop.
+			 * Though, given that the timeout is in milliseconds,
+			 * maybe this should timeout or event wait?
+			 */
+			cond_resched_stall();
 		} while (time_before(jiffies, timeo));
 	}
 
diff --git a/drivers/mtd/spi-nor/core.c b/drivers/mtd/spi-nor/core.c
index 1b0c6770c14e..e32e6eebb0e2 100644
--- a/drivers/mtd/spi-nor/core.c
+++ b/drivers/mtd/spi-nor/core.c
@@ -730,7 +730,13 @@ static int spi_nor_wait_till_ready_with_timeout(struct spi_nor *nor,
 		if (ret)
 			return 0;
 
-		cond_resched();
+		/*
+		 * Use a cond_resched_stall() to avoid spinning in
+		 * a tight loop.
+		 * Though, given that the timeout is in milliseconds,
+		 * maybe this should timeout or event wait?
+		 */
+		cond_resched_stall();
 	}
 
 	dev_dbg(nor->dev, "flash operation timed out\n");
diff --git a/drivers/mtd/tests/mtd_test.c b/drivers/mtd/tests/mtd_test.c
index c84250beffdc..5bb0c6ef7df9 100644
--- a/drivers/mtd/tests/mtd_test.c
+++ b/drivers/mtd/tests/mtd_test.c
@@ -51,7 +51,6 @@ int mtdtest_scan_for_bad_eraseblocks(struct mtd_info *mtd, unsigned char *bbt,
 		bbt[i] = is_block_bad(mtd, eb + i) ? 1 : 0;
 		if (bbt[i])
 			bad += 1;
-		cond_resched();
 	}
 	pr_info("scanned %d eraseblocks, %d are bad\n", i, bad);
 
@@ -70,7 +69,6 @@ int mtdtest_erase_good_eraseblocks(struct mtd_info *mtd, unsigned char *bbt,
 		err = mtdtest_erase_eraseblock(mtd, eb + i);
 		if (err)
 			return err;
-		cond_resched();
 	}
 
 	return 0;
diff --git a/drivers/mtd/tests/mtd_test.h b/drivers/mtd/tests/mtd_test.h
index 5a6e3bbe0474..4742f53c6f25 100644
--- a/drivers/mtd/tests/mtd_test.h
+++ b/drivers/mtd/tests/mtd_test.h
@@ -4,7 +4,7 @@
 
 static inline int mtdtest_relax(void)
 {
-	cond_resched();
+	cond_resched_stall();
 	if (signal_pending(current)) {
 		pr_info("aborting test due to pending signal!\n");
 		return -EINTR;
diff --git a/drivers/mtd/tests/pagetest.c b/drivers/mtd/tests/pagetest.c
index 8eb40b6e6dfa..79330c0ccd85 100644
--- a/drivers/mtd/tests/pagetest.c
+++ b/drivers/mtd/tests/pagetest.c
@@ -43,7 +43,6 @@ static int write_eraseblock(int ebnum)
 	loff_t addr = (loff_t)ebnum * mtd->erasesize;
 
 	prandom_bytes_state(&rnd_state, writebuf, mtd->erasesize);
-	cond_resched();
 	return mtdtest_write(mtd, addr, mtd->erasesize, writebuf);
 }
 
diff --git a/drivers/mtd/tests/readtest.c b/drivers/mtd/tests/readtest.c
index 99670ef91f2b..c862d9a6dc1d 100644
--- a/drivers/mtd/tests/readtest.c
+++ b/drivers/mtd/tests/readtest.c
@@ -91,7 +91,6 @@ static void dump_eraseblock(int ebnum)
 		for (j = 0; j < 32 && i < n; j++, i++)
 			p += sprintf(p, "%02x", (unsigned int)iobuf[i]);
 		printk(KERN_CRIT "%s\n", line);
-		cond_resched();
 	}
 	if (!mtd->oobsize)
 		return;
@@ -106,7 +105,6 @@ static void dump_eraseblock(int ebnum)
 				p += sprintf(p, "%02x",
 					     (unsigned int)iobuf1[i]);
 			printk(KERN_CRIT "%s\n", line);
-			cond_resched();
 		}
 }
 
diff --git a/drivers/mtd/tests/torturetest.c b/drivers/mtd/tests/torturetest.c
index 841689b4d86d..94cf4f6c6c4c 100644
--- a/drivers/mtd/tests/torturetest.c
+++ b/drivers/mtd/tests/torturetest.c
@@ -390,7 +390,6 @@ static void report_corrupt(unsigned char *read, unsigned char *written)
 	       " what was read from flash and what was expected\n");
 
 	for (i = 0; i < check_len; i += pgsize) {
-		cond_resched();
 		bytes = bits = 0;
 		first = countdiffs(written, read, i, pgsize, &bytes,
 				   &bits);
diff --git a/drivers/mtd/ubi/attach.c b/drivers/mtd/ubi/attach.c
index ae5abe492b52..0994d2d8edf0 100644
--- a/drivers/mtd/ubi/attach.c
+++ b/drivers/mtd/ubi/attach.c
@@ -1390,8 +1390,6 @@ static int scan_all(struct ubi_device *ubi, struct ubi_attach_info *ai,
 		goto out_ech;
 
 	for (pnum = start; pnum < ubi->peb_count; pnum++) {
-		cond_resched();
-
 		dbg_gen("process PEB %d", pnum);
 		err = scan_peb(ubi, ai, pnum, false);
 		if (err < 0)
@@ -1504,8 +1502,6 @@ static int scan_fast(struct ubi_device *ubi, struct ubi_attach_info **ai)
 		goto out_ech;
 
 	for (pnum = 0; pnum < UBI_FM_MAX_START; pnum++) {
-		cond_resched();
-
 		dbg_gen("process PEB %d", pnum);
 		err = scan_peb(ubi, scan_ai, pnum, true);
 		if (err < 0)
@@ -1674,8 +1670,6 @@ static int self_check_ai(struct ubi_device *ubi, struct ubi_attach_info *ai)
 	ubi_rb_for_each_entry(rb1, av, &ai->volumes, rb) {
 		int leb_count = 0;
 
-		cond_resched();
-
 		vols_found += 1;
 
 		if (ai->is_empty) {
@@ -1715,8 +1709,6 @@ static int self_check_ai(struct ubi_device *ubi, struct ubi_attach_info *ai)
 
 		last_aeb = NULL;
 		ubi_rb_for_each_entry(rb2, aeb, &av->root, u.rb) {
-			cond_resched();
-
 			last_aeb = aeb;
 			leb_count += 1;
 
@@ -1790,8 +1782,6 @@ static int self_check_ai(struct ubi_device *ubi, struct ubi_attach_info *ai)
 		ubi_rb_for_each_entry(rb2, aeb, &av->root, u.rb) {
 			int vol_type;
 
-			cond_resched();
-
 			last_aeb = aeb;
 
 			err = ubi_io_read_vid_hdr(ubi, aeb->pnum, vidb, 1);
diff --git a/drivers/mtd/ubi/build.c b/drivers/mtd/ubi/build.c
index 8ee51e49fced..52740f461259 100644
--- a/drivers/mtd/ubi/build.c
+++ b/drivers/mtd/ubi/build.c
@@ -1257,8 +1257,6 @@ static int __init ubi_init(void)
 		struct mtd_dev_param *p = &mtd_dev_param[i];
 		struct mtd_info *mtd;
 
-		cond_resched();
-
 		mtd = open_mtd_device(p->name);
 		if (IS_ERR(mtd)) {
 			err = PTR_ERR(mtd);
diff --git a/drivers/mtd/ubi/cdev.c b/drivers/mtd/ubi/cdev.c
index f43430b9c1e6..e60c0ad0eeb4 100644
--- a/drivers/mtd/ubi/cdev.c
+++ b/drivers/mtd/ubi/cdev.c
@@ -209,8 +209,6 @@ static ssize_t vol_cdev_read(struct file *file, __user char *buf, size_t count,
 	lnum = div_u64_rem(*offp, vol->usable_leb_size, &off);
 
 	do {
-		cond_resched();
-
 		if (off + len >= vol->usable_leb_size)
 			len = vol->usable_leb_size - off;
 
@@ -289,8 +287,6 @@ static ssize_t vol_cdev_direct_write(struct file *file, const char __user *buf,
 	len = count > tbuf_size ? tbuf_size : count;
 
 	while (count) {
-		cond_resched();
-
 		if (off + len >= vol->usable_leb_size)
 			len = vol->usable_leb_size - off;
 
diff --git a/drivers/mtd/ubi/eba.c b/drivers/mtd/ubi/eba.c
index 655ff41863e2..f1e097503826 100644
--- a/drivers/mtd/ubi/eba.c
+++ b/drivers/mtd/ubi/eba.c
@@ -1408,9 +1408,7 @@ int ubi_eba_copy_leb(struct ubi_device *ubi, int from, int to,
 		aldata_size = data_size =
 			ubi_calc_data_len(ubi, ubi->peb_buf, data_size);
 
-	cond_resched();
 	crc = crc32(UBI_CRC32_INIT, ubi->peb_buf, data_size);
-	cond_resched();
 
 	/*
 	 * It may turn out to be that the whole @from physical eraseblock
@@ -1432,8 +1430,6 @@ int ubi_eba_copy_leb(struct ubi_device *ubi, int from, int to,
 		goto out_unlock_buf;
 	}
 
-	cond_resched();
-
 	/* Read the VID header back and check if it was written correctly */
 	err = ubi_io_read_vid_hdr(ubi, to, vidb, 1);
 	if (err) {
@@ -1454,8 +1450,6 @@ int ubi_eba_copy_leb(struct ubi_device *ubi, int from, int to,
 				err = MOVE_TARGET_WR_ERR;
 			goto out_unlock_buf;
 		}
-
-		cond_resched();
 	}
 
 	ubi_assert(vol->eba_tbl->entries[lnum].pnum == from);
@@ -1640,8 +1634,6 @@ int ubi_eba_init(struct ubi_device *ubi, struct ubi_attach_info *ai)
 		if (!vol)
 			continue;
 
-		cond_resched();
-
 		tbl = ubi_eba_create_table(vol, vol->reserved_pebs);
 		if (IS_ERR(tbl)) {
 			err = PTR_ERR(tbl);
diff --git a/drivers/mtd/ubi/misc.c b/drivers/mtd/ubi/misc.c
index 1794d66b6eb7..8751337a8101 100644
--- a/drivers/mtd/ubi/misc.c
+++ b/drivers/mtd/ubi/misc.c
@@ -61,8 +61,6 @@ int ubi_check_volume(struct ubi_device *ubi, int vol_id)
 	for (i = 0; i < vol->used_ebs; i++) {
 		int size;
 
-		cond_resched();
-
 		if (i == vol->used_ebs - 1)
 			size = vol->last_eb_bytes;
 		else
diff --git a/drivers/mtd/ubi/vtbl.c b/drivers/mtd/ubi/vtbl.c
index f700f0e4f2ec..6e0d8b3109d5 100644
--- a/drivers/mtd/ubi/vtbl.c
+++ b/drivers/mtd/ubi/vtbl.c
@@ -163,8 +163,6 @@ static int vtbl_check(const struct ubi_device *ubi,
 	const char *name;
 
 	for (i = 0; i < ubi->vtbl_slots; i++) {
-		cond_resched();
-
 		reserved_pebs = be32_to_cpu(vtbl[i].reserved_pebs);
 		alignment = be32_to_cpu(vtbl[i].alignment);
 		data_pad = be32_to_cpu(vtbl[i].data_pad);
@@ -526,8 +524,6 @@ static int init_volumes(struct ubi_device *ubi,
 	struct ubi_volume *vol;
 
 	for (i = 0; i < ubi->vtbl_slots; i++) {
-		cond_resched();
-
 		if (be32_to_cpu(vtbl[i].reserved_pebs) == 0)
 			continue; /* Empty record */
 
@@ -736,8 +732,6 @@ static int check_attaching_info(const struct ubi_device *ubi,
 	}
 
 	for (i = 0; i < ubi->vtbl_slots + UBI_INT_VOL_COUNT; i++) {
-		cond_resched();
-
 		av = ubi_find_av(ai, i);
 		vol = ubi->volumes[i];
 		if (!vol) {
diff --git a/drivers/mtd/ubi/wl.c b/drivers/mtd/ubi/wl.c
index 26a214f016c1..5ff22ac93ba9 100644
--- a/drivers/mtd/ubi/wl.c
+++ b/drivers/mtd/ubi/wl.c
@@ -190,8 +190,6 @@ static int do_work(struct ubi_device *ubi)
 	int err;
 	struct ubi_work *wrk;
 
-	cond_resched();
-
 	/*
 	 * @ubi->work_sem is used to synchronize with the workers. Workers take
 	 * it in read mode, so many of them may be doing works at a time. But
@@ -519,7 +517,6 @@ static void serve_prot_queue(struct ubi_device *ubi)
 			 * too long.
 			 */
 			spin_unlock(&ubi->wl_lock);
-			cond_resched();
 			goto repeat;
 		}
 	}
@@ -1703,8 +1700,6 @@ int ubi_thread(void *u)
 			}
 		} else
 			failures = 0;
-
-		cond_resched();
 	}
 
 	dbg_wl("background thread \"%s\" is killed", ubi->bgt_name);
@@ -1805,8 +1800,6 @@ int ubi_wl_init(struct ubi_device *ubi, struct ubi_attach_info *ai)
 
 	ubi->free_count = 0;
 	list_for_each_entry_safe(aeb, tmp, &ai->erase, u.list) {
-		cond_resched();
-
 		err = erase_aeb(ubi, aeb, false);
 		if (err)
 			goto out_free;
@@ -1815,8 +1808,6 @@ int ubi_wl_init(struct ubi_device *ubi, struct ubi_attach_info *ai)
 	}
 
 	list_for_each_entry(aeb, &ai->free, u.list) {
-		cond_resched();
-
 		e = kmem_cache_alloc(ubi_wl_entry_slab, GFP_KERNEL);
 		if (!e) {
 			err = -ENOMEM;
@@ -1837,8 +1828,6 @@ int ubi_wl_init(struct ubi_device *ubi, struct ubi_attach_info *ai)
 
 	ubi_rb_for_each_entry(rb1, av, &ai->volumes, rb) {
 		ubi_rb_for_each_entry(rb2, aeb, &av->root, u.rb) {
-			cond_resched();
-
 			e = kmem_cache_alloc(ubi_wl_entry_slab, GFP_KERNEL);
 			if (!e) {
 				err = -ENOMEM;
@@ -1864,8 +1853,6 @@ int ubi_wl_init(struct ubi_device *ubi, struct ubi_attach_info *ai)
 	}
 
 	list_for_each_entry(aeb, &ai->fastmap, u.list) {
-		cond_resched();
-
 		e = ubi_find_fm_block(ubi, aeb->pnum);
 
 		if (e) {
-- 
2.31.1
Re: [RFC PATCH 82/86] treewide: mtd: remove cond_resched()
Posted by Miquel Raynal 2 years, 1 month ago
Hi Ankur,

ankur.a.arora@oracle.com wrote on Tue,  7 Nov 2023 15:08:18 -0800:

> There are broadly three sets of uses of cond_resched():
> 
> 1.  Calls to cond_resched() out of the goodness of our heart,
>     otherwise known as avoiding lockup splats.
> 
> 2.  Open coded variants of cond_resched_lock() which call
>     cond_resched().
> 
> 3.  Retry or error handling loops, where cond_resched() is used as a
>     quick alternative to spinning in a tight-loop.
> 
> When running under a full preemption model, the cond_resched() reduces
> to a NOP (not even a barrier) so removing it obviously cannot matter.
> 
> But considering only voluntary preemption models (for say code that
> has been mostly tested under those), for set-1 and set-2 the
> scheduler can now preempt kernel tasks running beyond their time
> quanta anywhere they are preemptible() [1]. Which removes any need
> for these explicitly placed scheduling points.
> 
> The cond_resched() calls in set-3 are a little more difficult.
> To start with, given it's NOP character under full preemption, it
> never actually saved us from a tight loop.
> With voluntary preemption, it's not a NOP, but it might as well be --
> for most workloads the scheduler does not have an interminable supply
> of runnable tasks on the runqueue.
> 
> So, cond_resched() is useful to not get softlockup splats, but not
> terribly good for error handling. Ideally, these should be replaced
> with some kind of timed or event wait.
> For now we use cond_resched_stall(), which tries to schedule if
> possible, and executes a cpu_relax() if not.
> 
> Most of the uses here are in set-1 (some right after we give up a lock
> or enable bottom-halves, causing an explicit preemption check.)
> 
> There are a few cases from set-3. Replace them with
> cond_resched_stall(). Some of those places, however, have wait-times
> milliseconds, so maybe we should just have an msleep() there?

Yeah, I believe this should work.

> 
> [1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/
> 
> Cc: Miquel Raynal <miquel.raynal@bootlin.com>
> Cc: Richard Weinberger <richard@nod.at>
> Cc: Vignesh Raghavendra <vigneshr@ti.com>
> Cc: Kyungmin Park <kyungmin.park@samsung.com>
> Cc: Tudor Ambarus <tudor.ambarus@linaro.org>
> Cc: Pratyush Yadav <pratyush@kernel.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---

[...]

> --- a/drivers/mtd/nand/raw/nand_legacy.c
> +++ b/drivers/mtd/nand/raw/nand_legacy.c
> @@ -203,7 +203,13 @@ void nand_wait_ready(struct nand_chip *chip)
>  	do {
>  		if (chip->legacy.dev_ready(chip))
>  			return;
> -		cond_resched();
> +		/*
> +		 * Use a cond_resched_stall() to avoid spinning in
> +		 * a tight loop.
> +		 * Though, given that the timeout is in milliseconds,
> +		 * maybe this should timeout or event wait?

Event waiting is precisely what we do here, with the hardware access
which are available in this case. So I believe this part of the comment
(in general) is not relevant. Now regarding the timeout I believe it is
closer to the second than the millisecond, so timeout-ing is not
relevant either in most cases (talking about mtd/ in general).

> +		 */
> +		cond_resched_stall();
>  	} while (time_before(jiffies, timeo));

Thanks,
Miquèl
Re: [RFC PATCH 82/86] treewide: mtd: remove cond_resched()
Posted by Matthew Wilcox 2 years, 1 month ago
On Wed, Nov 08, 2023 at 05:28:27PM +0100, Miquel Raynal wrote:
> > --- a/drivers/mtd/nand/raw/nand_legacy.c
> > +++ b/drivers/mtd/nand/raw/nand_legacy.c
> > @@ -203,7 +203,13 @@ void nand_wait_ready(struct nand_chip *chip)
> >  	do {
> >  		if (chip->legacy.dev_ready(chip))
> >  			return;
> > -		cond_resched();
> > +		/*
> > +		 * Use a cond_resched_stall() to avoid spinning in
> > +		 * a tight loop.
> > +		 * Though, given that the timeout is in milliseconds,
> > +		 * maybe this should timeout or event wait?
> 
> Event waiting is precisely what we do here, with the hardware access
> which are available in this case. So I believe this part of the comment
> (in general) is not relevant. Now regarding the timeout I believe it is
> closer to the second than the millisecond, so timeout-ing is not
> relevant either in most cases (talking about mtd/ in general).

I think you've misunderstood what Ankur wrote here.  What you're
currently doing is spinning in a very tight loop.  The comment is
suggesting you might want to msleep(1) or something to avoid burning CPU
cycles.  It'd be even better if the hardware could signal you somehow,
but I bet it can't.
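
Something along these lines (untested, purely to illustrate the
suggestion) is what that would look like:

	do {
		if (chip->legacy.dev_ready(chip))
			return;
		msleep(1);	/* sleep instead of spinning */
	} while (time_before(jiffies, timeo));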

> > +		 */
> > +		cond_resched_stall();
> >  	} while (time_before(jiffies, timeo));
> 
> Thanks,
> Miquèl
> 
Re: [RFC PATCH 82/86] treewide: mtd: remove cond_resched()
Posted by Steven Rostedt 2 years, 1 month ago
On Wed, 8 Nov 2023 16:32:36 +0000
Matthew Wilcox <willy@infradead.org> wrote:

> On Wed, Nov 08, 2023 at 05:28:27PM +0100, Miquel Raynal wrote:
> > > --- a/drivers/mtd/nand/raw/nand_legacy.c
> > > +++ b/drivers/mtd/nand/raw/nand_legacy.c
> > > @@ -203,7 +203,13 @@ void nand_wait_ready(struct nand_chip *chip)
> > >  	do {
> > >  		if (chip->legacy.dev_ready(chip))
> > >  			return;
> > > -		cond_resched();
> > > +		/*
> > > +		 * Use a cond_resched_stall() to avoid spinning in
> > > +		 * a tight loop.
> > > +		 * Though, given that the timeout is in milliseconds,
> > > +		 * maybe this should timeout or event wait?  
> > 
> > Event waiting is precisely what we do here, with the hardware access
> > which are available in this case. So I believe this part of the comment
> > (in general) is not relevant. Now regarding the timeout I believe it is
> > closer to the second than the millisecond, so timeout-ing is not
> > relevant either in most cases (talking about mtd/ in general).  
> 
> I think you've misunderstood what Ankur wrote here.  What you're
> currently doing is spinning in a very tight loop.  The comment is
> suggesting you might want to msleep(1) or something to avoid burning CPU
> cycles.  It'd be even better if the hardware could signal you somehow,
> but I bet it can't.
> 

Oh how I wish we could bring back the old PREEMPT_RT cpu_chill()...

#define cpu_chill()	msleep(1)

;-)

-- Steve


> > > +		 */
> > > +		cond_resched_stall();
> > >  	} while (time_before(jiffies, timeo));  
> > 
> > Thanks,
> > Miquèl
> >   

Re: [RFC PATCH 82/86] treewide: mtd: remove cond_resched()
Posted by Miquel Raynal 2 years, 1 month ago
Hello,

rostedt@goodmis.org wrote on Wed, 8 Nov 2023 12:21:16 -0500:

> On Wed, 8 Nov 2023 16:32:36 +0000
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > On Wed, Nov 08, 2023 at 05:28:27PM +0100, Miquel Raynal wrote:  
> > > > --- a/drivers/mtd/nand/raw/nand_legacy.c
> > > > +++ b/drivers/mtd/nand/raw/nand_legacy.c
> > > > @@ -203,7 +203,13 @@ void nand_wait_ready(struct nand_chip *chip)
> > > >  	do {
> > > >  		if (chip->legacy.dev_ready(chip))
> > > >  			return;
> > > > -		cond_resched();
> > > > +		/*
> > > > +		 * Use a cond_resched_stall() to avoid spinning in
> > > > +		 * a tight loop.
> > > > +		 * Though, given that the timeout is in milliseconds,
> > > > +		 * maybe this should timeout or event wait?    
> > > 
> > > Event waiting is precisely what we do here, with the hardware access
> > > which are available in this case. So I believe this part of the comment
> > > (in general) is not relevant. Now regarding the timeout I believe it is
> > > closer to the second than the millisecond, so timeout-ing is not
> > > relevant either in most cases (talking about mtd/ in general).    
> > 
> > I think you've misunderstood what Ankur wrote here.  What you're
> > currently doing is spinning in a very tight loop.  The comment is
> > suggesting you might want to msleep(1) or something to avoid burning CPU
> > cycles.  It'd be even better if the hardware could signal you somehow,
> > but I bet it can't.

Well, I think I'm aligned with the change and the first sentence in the
comment, but not with the second sentence which I find not relevant.

Maybe I don't understand what "maybe this should timeout" means and
Ankur actually meant "sleeping" there, but for me a timeout is when you
bail out with an error. If sleeping is advised, then why not use more
explicit wording? As for hardware events, they are not relevant in this
case, as you noticed, so I asked for that part of the sentence to be
dropped.

This is a legacy part of the core but is still part of the core. In
general I don't mind treewide changes to be slightly generic and I will
not be bothered too much with the device drivers changes, but the core
is more important to my eyes.

> Oh how I wish we could bring back the old PREEMPT_RT cpu_chill()...
> 
> #define cpu_chill()	msleep(1)

:')

Thanks,
Miquèl
[RFC PATCH 83/86] treewide: drm: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched() (a sketch of this pattern follows the list).

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.
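
A typical open-coded variant in set-2 looks roughly like this ('lock'
stands in for whichever spinlock the loop holds):

	if (spin_needbreak(lock) || need_resched()) {
		spin_unlock(lock);
		cond_resched();
		spin_lock(lock);
	}

which is what cond_resched_lock(lock) encapsulates.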

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (say, for code that
has mostly been tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.
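
For reference, a hypothetical sketch of those semantics (this is not
the actual definition from this series, just an illustration of the
intent):

	static inline int cond_resched_stall(void)
	{
		if (need_resched()) {
			schedule();	/* give up the CPU if asked to */
			return 1;
		}
		cpu_relax();		/* otherwise just relax the pipeline */
		return 0;
	}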

Most of the uses here are in set-1 (some right after we give up a lock
or enable bottom-halves, causing an explicit preemption check.)

There are a few cases from set-3. Replace them with
cond_resched_stall().

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Inki Dae <inki.dae@samsung.com> 
Cc: Jagan Teki <jagan@amarulasolutions.com> 
Cc: Marek Szyprowski <m.szyprowski@samsung.com> 
Cc: Andrzej Hajda <andrzej.hajda@intel.com> 
Cc: Neil Armstrong <neil.armstrong@linaro.org> 
Cc: Robert Foss <rfoss@kernel.org> 
Cc: David Airlie <airlied@gmail.com> 
Cc: Daniel Vetter <daniel@ffwll.ch> 
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> 
Cc: Maxime Ripard <mripard@kernel.org> 
Cc: Thomas Zimmermann <tzimmermann@suse.de> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/gpu/drm/bridge/samsung-dsim.c         |  2 +-
 drivers/gpu/drm/drm_buddy.c                   |  1 -
 drivers/gpu/drm/drm_gem.c                     |  1 -
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_object.c    |  1 -
 drivers/gpu/drm/i915/gem/i915_gem_shmem.c     |  2 --
 .../gpu/drm/i915/gem/selftests/huge_pages.c   |  6 ----
 .../drm/i915/gem/selftests/i915_gem_mman.c    |  5 ----
 drivers/gpu/drm/i915/gt/intel_breadcrumbs.c   |  2 +-
 drivers/gpu/drm/i915/gt/intel_gt.c            |  2 +-
 drivers/gpu/drm/i915/gt/intel_migrate.c       |  4 ---
 drivers/gpu/drm/i915/gt/selftest_execlists.c  |  4 ---
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |  2 --
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |  2 --
 drivers/gpu/drm/i915/gt/selftest_migrate.c    |  2 --
 drivers/gpu/drm/i915/gt/selftest_timeline.c   |  4 ---
 drivers/gpu/drm/i915/i915_active.c            |  2 +-
 drivers/gpu/drm/i915/i915_gem_evict.c         |  2 --
 drivers/gpu/drm/i915/i915_gpu_error.c         | 18 ++++--------
 drivers/gpu/drm/i915/intel_uncore.c           |  1 -
 drivers/gpu/drm/i915/selftests/i915_gem_gtt.c |  2 --
 drivers/gpu/drm/i915/selftests/i915_request.c |  2 --
 .../gpu/drm/i915/selftests/i915_selftest.c    |  3 --
 drivers/gpu/drm/i915/selftests/i915_vma.c     |  9 ------
 .../gpu/drm/i915/selftests/igt_flush_test.c   |  2 --
 .../drm/i915/selftests/intel_memory_region.c  |  4 ---
 drivers/gpu/drm/tests/drm_buddy_test.c        |  5 ----
 drivers/gpu/drm/tests/drm_mm_test.c           | 29 -------------------
 28 files changed, 11 insertions(+), 110 deletions(-)

diff --git a/drivers/gpu/drm/bridge/samsung-dsim.c b/drivers/gpu/drm/bridge/samsung-dsim.c
index cf777bdb25d2..ae537b9bf8df 100644
--- a/drivers/gpu/drm/bridge/samsung-dsim.c
+++ b/drivers/gpu/drm/bridge/samsung-dsim.c
@@ -1013,7 +1013,7 @@ static int samsung_dsim_wait_for_hdr_fifo(struct samsung_dsim *dsi)
 		if (reg & DSIM_SFR_HEADER_EMPTY)
 			return 0;
 
-		if (!cond_resched())
+		if (!cond_resched_stall())
 			usleep_range(950, 1050);
 	} while (--timeout);
 
diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index e6f5ba5f4baf..fe401d18bf4d 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -311,7 +311,6 @@ void drm_buddy_free_list(struct drm_buddy *mm, struct list_head *objects)
 
 	list_for_each_entry_safe(block, on, objects, link) {
 		drm_buddy_free_block(mm, block);
-		cond_resched();
 	}
 	INIT_LIST_HEAD(objects);
 }
diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 44a948b80ee1..881caa4b48a9 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -506,7 +506,6 @@ static void drm_gem_check_release_batch(struct folio_batch *fbatch)
 {
 	check_move_unevictable_folios(fbatch);
 	__folio_batch_release(fbatch);
-	cond_resched();
 }
 
 /**
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 5a687a3686bd..0b16689423b4 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -1812,7 +1812,7 @@ static noinline int eb_relocate_parse_slow(struct i915_execbuffer *eb)
 		err = eb_copy_relocations(eb);
 		have_copy = err == 0;
 	} else {
-		cond_resched();
+		cond_resched_stall();
 		err = 0;
 	}
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c
index ef9346ed6d0f..172eee1e8889 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
@@ -414,7 +414,6 @@ static void __i915_gem_free_objects(struct drm_i915_private *i915,
 
 		/* But keep the pointer alive for RCU-protected lookups */
 		call_rcu(&obj->rcu, __i915_gem_free_object_rcu);
-		cond_resched();
 	}
 }
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index 73a4a4eb29e0..38ea2fc206e0 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -26,7 +26,6 @@ static void check_release_folio_batch(struct folio_batch *fbatch)
 {
 	check_move_unevictable_folios(fbatch);
 	__folio_batch_release(fbatch);
-	cond_resched();
 }
 
 void shmem_sg_free_table(struct sg_table *st, struct address_space *mapping,
@@ -108,7 +107,6 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st,
 		gfp_t gfp = noreclaim;
 
 		do {
-			cond_resched();
 			folio = shmem_read_folio_gfp(mapping, i, gfp);
 			if (!IS_ERR(folio))
 				break;
diff --git a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
index 6b9f6cf50bf6..fae0fa993404 100644
--- a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
+++ b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
@@ -1447,8 +1447,6 @@ static int igt_ppgtt_smoke_huge(void *arg)
 
 		if (err)
 			break;
-
-		cond_resched();
 	}
 
 	return err;
@@ -1538,8 +1536,6 @@ static int igt_ppgtt_sanity_check(void *arg)
 				goto out;
 			}
 		}
-
-		cond_resched();
 	}
 
 out:
@@ -1738,8 +1734,6 @@ static int igt_ppgtt_mixed(void *arg)
 			break;
 
 		addr += obj->base.size;
-
-		cond_resched();
 	}
 
 	i915_gem_context_unlock_engines(ctx);
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
index 72957a36a36b..c994071532cf 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
@@ -221,7 +221,6 @@ static int check_partial_mappings(struct drm_i915_gem_object *obj,
 		u32 *cpu;
 
 		GEM_BUG_ON(view.partial.size > nreal);
-		cond_resched();
 
 		vma = i915_gem_object_ggtt_pin(obj, &view, 0, 0, PIN_MAPPABLE);
 		if (IS_ERR(vma)) {
@@ -1026,8 +1025,6 @@ static void igt_close_objects(struct drm_i915_private *i915,
 		i915_gem_object_put(obj);
 	}
 
-	cond_resched();
-
 	i915_gem_drain_freed_objects(i915);
 }
 
@@ -1041,8 +1038,6 @@ static void igt_make_evictable(struct list_head *objects)
 			i915_gem_object_unpin_pages(obj);
 		i915_gem_object_unlock(obj);
 	}
-
-	cond_resched();
 }
 
 static int igt_fill_mappable(struct intel_memory_region *mr,
diff --git a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
index ecc990ec1b95..e016f1203f7c 100644
--- a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
+++ b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
@@ -315,7 +315,7 @@ void __intel_breadcrumbs_park(struct intel_breadcrumbs *b)
 		local_irq_disable();
 		signal_irq_work(&b->irq_work);
 		local_irq_enable();
-		cond_resched();
+		cond_resched_stall();
 	}
 }
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index 449f0b7fc843..40cfdf4f5fff 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -664,7 +664,7 @@ int intel_gt_wait_for_idle(struct intel_gt *gt, long timeout)
 
 	while ((timeout = intel_gt_retire_requests_timeout(gt, timeout,
 							   &remaining_timeout)) > 0) {
-		cond_resched();
+		cond_resched_stall();
 		if (signal_pending(current))
 			return -EINTR;
 	}
diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
index 576e5ef0289b..cc3f62d5c28f 100644
--- a/drivers/gpu/drm/i915/gt/intel_migrate.c
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
@@ -906,8 +906,6 @@ intel_context_migrate_copy(struct intel_context *ce,
 			err = -EINVAL;
 			break;
 		}
-
-		cond_resched();
 	} while (1);
 
 out_ce:
@@ -1067,8 +1065,6 @@ intel_context_migrate_clear(struct intel_context *ce,
 		i915_request_add(rq);
 		if (err || !it.sg || !sg_dma_len(it.sg))
 			break;
-
-		cond_resched();
 	} while (1);
 
 out_ce:
diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
index 4202df5b8c12..52c8fa3e5cad 100644
--- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
+++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
@@ -60,8 +60,6 @@ static int wait_for_submit(struct intel_engine_cs *engine,
 
 		if (done)
 			return -ETIME;
-
-		cond_resched();
 	} while (1);
 }
 
@@ -72,7 +70,6 @@ static int wait_for_reset(struct intel_engine_cs *engine,
 	timeout += jiffies;
 
 	do {
-		cond_resched();
 		intel_engine_flush_submission(engine);
 
 		if (READ_ONCE(engine->execlists.pending[0]))
@@ -1373,7 +1370,6 @@ static int live_timeslice_queue(void *arg)
 
 		/* Wait until we ack the release_queue and start timeslicing */
 		do {
-			cond_resched();
 			intel_engine_flush_submission(engine);
 		} while (READ_ONCE(engine->execlists.pending[0]));
 
diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
index 0dd4d00ee894..e751ed2cf8b2 100644
--- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
+++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
@@ -939,8 +939,6 @@ static void active_engine(struct kthread_work *work)
 			pr_err("[%s] Request put failed: %d!\n", engine->name, err);
 			break;
 		}
-
-		cond_resched();
 	}
 
 	for (count = 0; count < ARRAY_SIZE(rq); count++) {
diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index 5f826b6dcf5d..83a42492f0d0 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -70,8 +70,6 @@ static int wait_for_submit(struct intel_engine_cs *engine,
 
 		if (done)
 			return -ETIME;
-
-		cond_resched();
 	} while (1);
 }
 
diff --git a/drivers/gpu/drm/i915/gt/selftest_migrate.c b/drivers/gpu/drm/i915/gt/selftest_migrate.c
index 3def5ca72dec..9dfa70699df9 100644
--- a/drivers/gpu/drm/i915/gt/selftest_migrate.c
+++ b/drivers/gpu/drm/i915/gt/selftest_migrate.c
@@ -210,8 +210,6 @@ static int intel_context_copy_ccs(struct intel_context *ce,
 		i915_request_add(rq);
 		if (err || !it.sg || !sg_dma_len(it.sg))
 			break;
-
-		cond_resched();
 	} while (1);
 
 out_ce:
diff --git a/drivers/gpu/drm/i915/gt/selftest_timeline.c b/drivers/gpu/drm/i915/gt/selftest_timeline.c
index fa36cf920bde..15b8fd41ad90 100644
--- a/drivers/gpu/drm/i915/gt/selftest_timeline.c
+++ b/drivers/gpu/drm/i915/gt/selftest_timeline.c
@@ -352,7 +352,6 @@ static int bench_sync(void *arg)
 		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
 
 	mock_timeline_fini(&tl);
-	cond_resched();
 
 	mock_timeline_init(&tl, 0);
 
@@ -382,7 +381,6 @@ static int bench_sync(void *arg)
 		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
 
 	mock_timeline_fini(&tl);
-	cond_resched();
 
 	mock_timeline_init(&tl, 0);
 
@@ -405,7 +403,6 @@ static int bench_sync(void *arg)
 	pr_info("%s: %lu repeated insert/lookups, %lluns/op\n",
 		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
 	mock_timeline_fini(&tl);
-	cond_resched();
 
 	/* Benchmark searching for a known context id and changing the seqno */
 	for (last_order = 1, order = 1; order < 32;
@@ -434,7 +431,6 @@ static int bench_sync(void *arg)
 			__func__, count, order,
 			(long long)div64_ul(ktime_to_ns(kt), count));
 		mock_timeline_fini(&tl);
-		cond_resched();
 	}
 
 	return 0;
diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
index 5ec293011d99..810251c33495 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -865,7 +865,7 @@ int i915_active_acquire_preallocate_barrier(struct i915_active *ref,
 
 	/* Wait until the previous preallocation is completed */
 	while (!llist_empty(&ref->preallocated_barriers))
-		cond_resched();
+		cond_resched_stall();
 
 	/*
 	 * Preallocate a node for each physical engine supporting the target
diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index c02ebd6900ae..1a600f42a3ad 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -267,8 +267,6 @@ i915_gem_evict_something(struct i915_address_space *vm,
 	if (ret)
 		return ret;
 
-	cond_resched();
-
 	flags |= PIN_NONBLOCK;
 	goto search_again;
 
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 4008bb09fdb5..410072145d4d 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -320,8 +320,6 @@ static int compress_page(struct i915_vma_compress *c,
 
 		if (zlib_deflate(zstream, Z_NO_FLUSH) != Z_OK)
 			return -EIO;
-
-		cond_resched();
 	} while (zstream->avail_in);
 
 	/* Fallback to uncompressed if we increase size? */
@@ -408,7 +406,6 @@ static int compress_page(struct i915_vma_compress *c,
 	if (!(wc && i915_memcpy_from_wc(ptr, src, PAGE_SIZE)))
 		memcpy(ptr, src, PAGE_SIZE);
 	list_add_tail(&virt_to_page(ptr)->lru, &dst->page_list);
-	cond_resched();
 
 	return 0;
 }
@@ -2325,13 +2322,6 @@ void intel_klog_error_capture(struct intel_gt *gt,
 						 l_count, line++, ptr2);
 					ptr[pos] = chr;
 					ptr2 = ptr + pos;
-
-					/*
-					 * If spewing large amounts of data via a serial console,
-					 * this can be a very slow process. So be friendly and try
-					 * not to cause 'softlockup on CPU' problems.
-					 */
-					cond_resched();
 				}
 
 				if (ptr2 < (ptr + count))
@@ -2352,8 +2342,12 @@ void intel_klog_error_capture(struct intel_gt *gt,
 				got--;
 			}
 
-			/* As above. */
-			cond_resched();
+			/*
+			 * If spewing large amounts of data via a serial console,
+			 * this can be a very slow process. So be friendly and try
+			 * not to cause 'softlockup on CPU' problems.
+			 */
+			cond_resched_stall();
 		}
 
 		if (got)
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index dfefad5a5fec..d2e74cfb1aac 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -487,7 +487,6 @@ intel_uncore_forcewake_reset(struct intel_uncore *uncore)
 		}
 
 		spin_unlock_irqrestore(&uncore->lock, irqflags);
-		cond_resched();
 	}
 
 	drm_WARN_ON(&uncore->i915->drm, active_domains);
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
index 5c397a2df70e..4b497e969a33 100644
--- a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
@@ -201,7 +201,6 @@ static int igt_ppgtt_alloc(void *arg)
 		}
 
 		ppgtt->vm.allocate_va_range(&ppgtt->vm, &stash, 0, size);
-		cond_resched();
 
 		ppgtt->vm.clear_range(&ppgtt->vm, 0, size);
 
@@ -224,7 +223,6 @@ static int igt_ppgtt_alloc(void *arg)
 
 		ppgtt->vm.allocate_va_range(&ppgtt->vm, &stash,
 					    last, size - last);
-		cond_resched();
 
 		i915_vm_free_pt_stash(&ppgtt->vm, &stash);
 	}
diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
index a9b79888c193..43bb54fc8c78 100644
--- a/drivers/gpu/drm/i915/selftests/i915_request.c
+++ b/drivers/gpu/drm/i915/selftests/i915_request.c
@@ -438,8 +438,6 @@ static void __igt_breadcrumbs_smoketest(struct kthread_work *work)
 
 		num_fences += count;
 		num_waits++;
-
-		cond_resched();
 	}
 
 	atomic_long_add(num_fences, &t->num_fences);
diff --git a/drivers/gpu/drm/i915/selftests/i915_selftest.c b/drivers/gpu/drm/i915/selftests/i915_selftest.c
index ee79e0809a6d..17e6bbc3c87e 100644
--- a/drivers/gpu/drm/i915/selftests/i915_selftest.c
+++ b/drivers/gpu/drm/i915/selftests/i915_selftest.c
@@ -179,7 +179,6 @@ static int __run_selftests(const char *name,
 		if (!st->enabled)
 			continue;
 
-		cond_resched();
 		if (signal_pending(current))
 			return -EINTR;
 
@@ -381,7 +380,6 @@ int __i915_subtests(const char *caller,
 	int err;
 
 	for (; count--; st++) {
-		cond_resched();
 		if (signal_pending(current))
 			return -EINTR;
 
@@ -414,7 +412,6 @@ bool __igt_timeout(unsigned long timeout, const char *fmt, ...)
 	va_list va;
 
 	if (!signal_pending(current)) {
-		cond_resched();
 		if (time_before(jiffies, timeout))
 			return false;
 	}
diff --git a/drivers/gpu/drm/i915/selftests/i915_vma.c b/drivers/gpu/drm/i915/selftests/i915_vma.c
index 71b52d5efef4..1bacdcd77c5b 100644
--- a/drivers/gpu/drm/i915/selftests/i915_vma.c
+++ b/drivers/gpu/drm/i915/selftests/i915_vma.c
@@ -197,8 +197,6 @@ static int igt_vma_create(void *arg)
 			list_del_init(&ctx->link);
 			mock_context_close(ctx);
 		}
-
-		cond_resched();
 	}
 
 end:
@@ -347,8 +345,6 @@ static int igt_vma_pin1(void *arg)
 				goto out;
 			}
 		}
-
-		cond_resched();
 	}
 
 	err = 0;
@@ -697,7 +693,6 @@ static int igt_vma_rotate_remap(void *arg)
 						pr_err("Unbinding returned %i\n", err);
 						goto out_object;
 					}
-					cond_resched();
 				}
 			}
 		}
@@ -858,8 +853,6 @@ static int igt_vma_partial(void *arg)
 					pr_err("Unbinding returned %i\n", err);
 					goto out_object;
 				}
-
-				cond_resched();
 			}
 		}
 
@@ -1085,8 +1078,6 @@ static int igt_vma_remapped_gtt(void *arg)
 				}
 			}
 			i915_vma_unpin_iomap(vma);
-
-			cond_resched();
 		}
 	}
 
diff --git a/drivers/gpu/drm/i915/selftests/igt_flush_test.c b/drivers/gpu/drm/i915/selftests/igt_flush_test.c
index 29110abb4fe0..fbc1b606df29 100644
--- a/drivers/gpu/drm/i915/selftests/igt_flush_test.c
+++ b/drivers/gpu/drm/i915/selftests/igt_flush_test.c
@@ -22,8 +22,6 @@ int igt_flush_test(struct drm_i915_private *i915)
 		if (intel_gt_is_wedged(gt))
 			ret = -EIO;
 
-		cond_resched();
-
 		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
 			pr_err("%pS timed out, cancelling all further testing.\n",
 			       __builtin_return_address(0));
diff --git a/drivers/gpu/drm/i915/selftests/intel_memory_region.c b/drivers/gpu/drm/i915/selftests/intel_memory_region.c
index d985d9bae2e8..3fce433284bd 100644
--- a/drivers/gpu/drm/i915/selftests/intel_memory_region.c
+++ b/drivers/gpu/drm/i915/selftests/intel_memory_region.c
@@ -46,8 +46,6 @@ static void close_objects(struct intel_memory_region *mem,
 		i915_gem_object_put(obj);
 	}
 
-	cond_resched();
-
 	i915_gem_drain_freed_objects(i915);
 }
 
@@ -1290,8 +1288,6 @@ static int _perf_memcpy(struct intel_memory_region *src_mr,
 			div64_u64(mul_u32_u32(4 * size,
 					      1000 * 1000 * 1000),
 				  t[1] + 2 * t[2] + t[3]) >> 20);
-
-		cond_resched();
 	}
 
 	i915_gem_object_unpin_map(dst);
diff --git a/drivers/gpu/drm/tests/drm_buddy_test.c b/drivers/gpu/drm/tests/drm_buddy_test.c
index 09ee6f6af896..7ee65bad4bb7 100644
--- a/drivers/gpu/drm/tests/drm_buddy_test.c
+++ b/drivers/gpu/drm/tests/drm_buddy_test.c
@@ -29,7 +29,6 @@ static bool __timeout(unsigned long timeout, const char *fmt, ...)
 	va_list va;
 
 	if (!signal_pending(current)) {
-		cond_resched();
 		if (time_before(jiffies, timeout))
 			return false;
 	}
@@ -485,8 +484,6 @@ static void drm_test_buddy_alloc_smoke(struct kunit *test)
 
 		if (err || timeout)
 			break;
-
-		cond_resched();
 	}
 
 	kfree(order);
@@ -681,8 +678,6 @@ static void drm_test_buddy_alloc_range(struct kunit *test)
 		rem -= size;
 		if (!rem)
 			break;
-
-		cond_resched();
 	}
 
 	drm_buddy_free_list(&mm, &blocks);
diff --git a/drivers/gpu/drm/tests/drm_mm_test.c b/drivers/gpu/drm/tests/drm_mm_test.c
index 05d5e7af6d25..7d11740ef599 100644
--- a/drivers/gpu/drm/tests/drm_mm_test.c
+++ b/drivers/gpu/drm/tests/drm_mm_test.c
@@ -474,8 +474,6 @@ static void drm_test_mm_reserve(struct kunit *test)
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_reserve(test, count, size - 1));
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_reserve(test, count, size));
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_reserve(test, count, size + 1));
-
-		cond_resched();
 	}
 }
 
@@ -645,8 +643,6 @@ static int __drm_test_mm_insert(struct kunit *test, unsigned int count, u64 size
 		drm_mm_for_each_node_safe(node, next, &mm)
 			drm_mm_remove_node(node);
 		DRM_MM_BUG_ON(!drm_mm_clean(&mm));
-
-		cond_resched();
 	}
 
 	ret = 0;
@@ -671,8 +667,6 @@ static void drm_test_mm_insert(struct kunit *test)
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_insert(test, count, size - 1, false));
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_insert(test, count, size, false));
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_insert(test, count, size + 1, false));
-
-		cond_resched();
 	}
 }
 
@@ -693,8 +687,6 @@ static void drm_test_mm_replace(struct kunit *test)
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_insert(test, count, size - 1, true));
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_insert(test, count, size, true));
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_insert(test, count, size + 1, true));
-
-		cond_resched();
 	}
 }
 
@@ -882,8 +874,6 @@ static int __drm_test_mm_insert_range(struct kunit *test, unsigned int count, u6
 		drm_mm_for_each_node_safe(node, next, &mm)
 			drm_mm_remove_node(node);
 		DRM_MM_BUG_ON(!drm_mm_clean(&mm));
-
-		cond_resched();
 	}
 
 	ret = 0;
@@ -942,8 +932,6 @@ static void drm_test_mm_insert_range(struct kunit *test)
 								    max / 2, max));
 		KUNIT_ASSERT_FALSE(test, __drm_test_mm_insert_range(test, count, size,
 								    max / 4 + 1, 3 * max / 4 - 1));
-
-		cond_resched();
 	}
 }
 
@@ -1086,8 +1074,6 @@ static void drm_test_mm_align(struct kunit *test)
 		drm_mm_for_each_node_safe(node, next, &mm)
 			drm_mm_remove_node(node);
 		DRM_MM_BUG_ON(!drm_mm_clean(&mm));
-
-		cond_resched();
 	}
 
 out:
@@ -1122,8 +1108,6 @@ static void drm_test_mm_align_pot(struct kunit *test, int max)
 			KUNIT_FAIL(test, "insert failed with alignment=%llx [%d]", align, bit);
 			goto out;
 		}
-
-		cond_resched();
 	}
 
 out:
@@ -1465,8 +1449,6 @@ static void drm_test_mm_evict(struct kunit *test)
 				goto out;
 			}
 		}
-
-		cond_resched();
 	}
 
 out:
@@ -1547,8 +1529,6 @@ static void drm_test_mm_evict_range(struct kunit *test)
 				goto out;
 			}
 		}
-
-		cond_resched();
 	}
 
 out:
@@ -1658,7 +1638,6 @@ static void drm_test_mm_topdown(struct kunit *test)
 		drm_mm_for_each_node_safe(node, next, &mm)
 			drm_mm_remove_node(node);
 		DRM_MM_BUG_ON(!drm_mm_clean(&mm));
-		cond_resched();
 	}
 
 out:
@@ -1750,7 +1729,6 @@ static void drm_test_mm_bottomup(struct kunit *test)
 		drm_mm_for_each_node_safe(node, next, &mm)
 			drm_mm_remove_node(node);
 		DRM_MM_BUG_ON(!drm_mm_clean(&mm));
-		cond_resched();
 	}
 
 out:
@@ -1968,8 +1946,6 @@ static void drm_test_mm_color(struct kunit *test)
 			drm_mm_remove_node(node);
 			kfree(node);
 		}
-
-		cond_resched();
 	}
 
 out:
@@ -2038,7 +2014,6 @@ static int evict_color(struct kunit *test, struct drm_mm *mm, u64 range_start,
 		}
 	}
 
-	cond_resched();
 	return 0;
 }
 
@@ -2110,8 +2085,6 @@ static void drm_test_mm_color_evict(struct kunit *test)
 				goto out;
 			}
 		}
-
-		cond_resched();
 	}
 
 out:
@@ -2196,8 +2169,6 @@ static void drm_test_mm_color_evict_range(struct kunit *test)
 				goto out;
 			}
 		}
-
-		cond_resched();
 	}
 
 out:
-- 
2.31.1
[RFC PATCH 84/86] treewide: net: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (say, for code that
has mostly been tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most of the uses here are in set-1 (some right after we give up a lock,
causing an explicit preemption check.)

There are some uses from set-3 where we busy wait: ex. the mlx4/mlx5
drivers, mtk_mdio_busy_wait() and similar. These are replaced with
cond_resched_stall(). Some of those places, however, have wait-times
in milliseconds, so maybe we should just use a timed wait?
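
As a hypothetical sketch (mdio_status_read() and MDIO_BUSY are made-up
names, just for illustration), such a loop could lean on the generic
helper from <linux/iopoll.h>:

	u32 val;
	int ret;

	/* sleep ~100us between reads, give up after ~1s */
	ret = read_poll_timeout(mdio_status_read, val, !(val & MDIO_BUSY),
				100, 1000 * USEC_PER_MSEC, false, eth);
	if (ret)
		dev_err(eth->dev, "mdio: MDIO timeout\n");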

Note: there are also a few other cases (ex. mhi_net_rx_refill_work() or
broadcom/b43::lo_measure_feedthrough()) where I've replaced
cond_resched() with cond_resched_stall() even though it doesn't seem
like the right thing.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: "David S. Miller" <davem@davemloft.net> 
Cc: Eric Dumazet <edumazet@google.com> 
Cc: Jakub Kicinski <kuba@kernel.org> 
Cc: Paolo Abeni <pabeni@redhat.com> 
Cc: Felix Fietkau <nbd@nbd.name> 
Cc: John Crispin <john@phrozen.org> 
Cc: Sean Wang <sean.wang@mediatek.com> 
Cc: Mark Lee <Mark-MC.Lee@mediatek.com> 
Cc: Lorenzo Bianconi <lorenzo@kernel.org> 
Cc: Matthias Brugger <matthias.bgg@gmail.com> 
Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> 
Cc: "Michael S. Tsirkin" <mst@redhat.com> 
Cc: Jason Wang <jasowang@redhat.com> 
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> 
Cc: Kalle Valo <kvalo@kernel.org> 
Cc: Larry Finger <Larry.Finger@lwfinger.net> 
Cc: Ryder Lee <ryder.lee@mediatek.com> 
Cc: Loic Poulain <loic.poulain@linaro.org> 
Cc: Sergey Ryazanov <ryazanov.s.a@gmail.com> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/net/dummy.c                                 |  1 -
 drivers/net/ethernet/broadcom/tg3.c                 |  2 +-
 drivers/net/ethernet/intel/e1000/e1000_hw.c         |  3 ---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c         |  2 +-
 drivers/net/ethernet/mellanox/mlx4/catas.c          |  2 +-
 drivers/net/ethernet/mellanox/mlx4/cmd.c            | 13 ++++++-------
 .../net/ethernet/mellanox/mlx4/resource_tracker.c   |  9 ++++++++-
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c       |  4 +---
 drivers/net/ethernet/mellanox/mlx5/core/fw.c        |  3 +--
 drivers/net/ethernet/mellanox/mlxsw/i2c.c           |  5 -----
 drivers/net/ethernet/mellanox/mlxsw/pci.c           |  2 --
 drivers/net/ethernet/pasemi/pasemi_mac.c            |  3 ---
 .../net/ethernet/qlogic/netxen/netxen_nic_init.c    |  2 --
 .../net/ethernet/qlogic/qlcnic/qlcnic_83xx_init.c   |  1 -
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_init.c    |  1 -
 .../net/ethernet/qlogic/qlcnic/qlcnic_minidump.c    |  2 --
 drivers/net/ethernet/sfc/falcon/falcon.c            |  6 ------
 drivers/net/ifb.c                                   |  1 -
 drivers/net/ipvlan/ipvlan_core.c                    |  1 -
 drivers/net/macvlan.c                               |  2 --
 drivers/net/mhi_net.c                               |  4 ++--
 drivers/net/netdevsim/fib.c                         |  1 -
 drivers/net/virtio_net.c                            |  2 --
 drivers/net/wireguard/ratelimiter.c                 |  2 --
 drivers/net/wireguard/receive.c                     |  3 ---
 drivers/net/wireguard/send.c                        |  4 ----
 drivers/net/wireless/broadcom/b43/lo.c              |  6 +++---
 drivers/net/wireless/broadcom/b43/pio.c             |  1 -
 drivers/net/wireless/broadcom/b43legacy/phy.c       |  5 -----
 .../wireless/broadcom/brcm80211/brcmfmac/cfg80211.c |  1 -
 drivers/net/wireless/cisco/airo.c                   |  2 --
 drivers/net/wireless/intel/iwlwifi/pcie/trans.c     |  2 --
 drivers/net/wireless/marvell/mwl8k.c                |  2 --
 drivers/net/wireless/mediatek/mt76/util.c           |  1 -
 drivers/net/wwan/mhi_wwan_mbim.c                    |  2 +-
 drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c          |  3 ---
 drivers/net/xen-netback/netback.c                   |  1 -
 drivers/net/xen-netback/rx.c                        |  2 --
 38 files changed, 25 insertions(+), 84 deletions(-)

diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
index c4b1b0aa438a..dfebf6387d8a 100644
--- a/drivers/net/dummy.c
+++ b/drivers/net/dummy.c
@@ -182,7 +182,6 @@ static int __init dummy_init_module(void)
 
 	for (i = 0; i < numdummies && !err; i++) {
 		err = dummy_init_one();
-		cond_resched();
 	}
 	if (err < 0)
 		__rtnl_link_unregister(&dummy_link_ops);
diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 14b311196b8f..ad511d721db3 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -12040,7 +12040,7 @@ static int tg3_get_eeprom(struct net_device *dev, struct ethtool_eeprom *eeprom,
 				ret = -EINTR;
 				goto eeprom_done;
 			}
-			cond_resched();
+			cond_resched_stall();
 		}
 	}
 	eeprom->len += i;
diff --git a/drivers/net/ethernet/intel/e1000/e1000_hw.c b/drivers/net/ethernet/intel/e1000/e1000_hw.c
index 4542e2bc28e8..22a419bdc6b7 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_hw.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_hw.c
@@ -3937,7 +3937,6 @@ static s32 e1000_do_read_eeprom(struct e1000_hw *hw, u16 offset, u16 words,
 			 */
 			data[i] = e1000_shift_in_ee_bits(hw, 16);
 			e1000_standby_eeprom(hw);
-			cond_resched();
 		}
 	}
 
@@ -4088,7 +4087,6 @@ static s32 e1000_write_eeprom_spi(struct e1000_hw *hw, u16 offset, u16 words,
 			return -E1000_ERR_EEPROM;
 
 		e1000_standby_eeprom(hw);
-		cond_resched();
 
 		/*  Send the WRITE ENABLE command (8 bit opcode )  */
 		e1000_shift_out_ee_bits(hw, EEPROM_WREN_OPCODE_SPI,
@@ -4198,7 +4196,6 @@ static s32 e1000_write_eeprom_microwire(struct e1000_hw *hw, u16 offset,
 
 		/* Recover from write */
 		e1000_standby_eeprom(hw);
-		cond_resched();
 
 		words_written++;
 	}
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 20afe79f380a..26a9f293ed32 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -309,7 +309,7 @@ static int mtk_mdio_busy_wait(struct mtk_eth *eth)
 			return 0;
 		if (time_after(jiffies, t_start + PHY_IAC_TIMEOUT))
 			break;
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	dev_err(eth->dev, "mdio: MDIO timeout\n");
diff --git a/drivers/net/ethernet/mellanox/mlx4/catas.c b/drivers/net/ethernet/mellanox/mlx4/catas.c
index 0d8a362c2673..f013eb3fa6f8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/catas.c
+++ b/drivers/net/ethernet/mellanox/mlx4/catas.c
@@ -148,7 +148,7 @@ static int mlx4_reset_slave(struct mlx4_dev *dev)
 			mlx4_warn(dev, "VF Reset succeed\n");
 			return 0;
 		}
-		cond_resched();
+		cond_resched_stall();
 	}
 	mlx4_err(dev, "Fail to send reset over the communication channel\n");
 	return -ETIMEDOUT;
diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index f5b1f8c7834f..259918642b50 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -312,7 +312,8 @@ static int mlx4_comm_cmd_poll(struct mlx4_dev *dev, u8 cmd, u16 param,
 
 	end = msecs_to_jiffies(timeout) + jiffies;
 	while (comm_pending(dev) && time_before(jiffies, end))
-		cond_resched();
+		cond_resched_stall();
+
 	ret_from_pending = comm_pending(dev);
 	if (ret_from_pending) {
 		/* check if the slave is trying to boot in the middle of
@@ -387,7 +388,7 @@ static int mlx4_comm_cmd_wait(struct mlx4_dev *dev, u8 vhcr_cmd,
 	if (!(dev->persist->state & MLX4_DEVICE_STATE_INTERNAL_ERROR)) {
 		end = msecs_to_jiffies(timeout) + jiffies;
 		while (comm_pending(dev) && time_before(jiffies, end))
-			cond_resched();
+			cond_resched_stall();
 	}
 	goto out;
 
@@ -470,7 +471,7 @@ static int mlx4_cmd_post(struct mlx4_dev *dev, u64 in_param, u64 out_param,
 			mlx4_err(dev, "%s:cmd_pending failed\n", __func__);
 			goto out;
 		}
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	/*
@@ -621,8 +622,7 @@ static int mlx4_cmd_poll(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
 			err = mlx4_internal_err_ret_value(dev, op, op_modifier);
 			goto out;
 		}
-
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	if (cmd_pending(dev)) {
@@ -2324,8 +2324,7 @@ static int sync_toggles(struct mlx4_dev *dev)
 			priv->cmd.comm_toggle = rd_toggle >> 31;
 			return 0;
 		}
-
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	/*
diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index 771b92019af1..c8127acea986 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -4649,7 +4649,14 @@ static int move_all_busy(struct mlx4_dev *dev, int slave,
 		if (time_after(jiffies, begin + 5 * HZ))
 			break;
 		if (busy)
-			cond_resched();
+			/*
+			 * Giving up the spinlock in _move_all_busy() will
+			 * reschedule if needed.
+			 * Add a cpu_relax() here to ensure that we give
+			 * others a chance to acquire the lock.
+			 */
+			cpu_relax();
+
 	} while (busy);
 
 	if (busy)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index c22b0ad0c870..3c5bfa8eda00 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -285,7 +285,7 @@ static void poll_timeout(struct mlx5_cmd_work_ent *ent)
 			ent->ret = 0;
 			return;
 		}
-		cond_resched();
+		cond_resched_stall();
 	} while (time_before(jiffies, poll_end));
 
 	ent->ret = -ETIMEDOUT;
@@ -1773,13 +1773,11 @@ void mlx5_cmd_flush(struct mlx5_core_dev *dev)
 	for (i = 0; i < cmd->vars.max_reg_cmds; i++) {
 		while (down_trylock(&cmd->vars.sem)) {
 			mlx5_cmd_trigger_completions(dev);
-			cond_resched();
 		}
 	}
 
 	while (down_trylock(&cmd->vars.pages_sem)) {
 		mlx5_cmd_trigger_completions(dev);
-		cond_resched();
 	}
 
 	/* Unlock cmdif */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fw.c b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
index 58f4c0d0fafa..a08ca20ceeda 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
@@ -373,8 +373,7 @@ int mlx5_cmd_fast_teardown_hca(struct mlx5_core_dev *dev)
 	do {
 		if (mlx5_get_nic_state(dev) == MLX5_NIC_IFC_DISABLED)
 			break;
-
-		cond_resched();
+		cond_resched_stall();
 	} while (!time_after(jiffies, end));
 
 	if (mlx5_get_nic_state(dev) != MLX5_NIC_IFC_DISABLED) {
diff --git a/drivers/net/ethernet/mellanox/mlxsw/i2c.c b/drivers/net/ethernet/mellanox/mlxsw/i2c.c
index d23f293e285c..1a11f8cd6bb9 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/i2c.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/i2c.c
@@ -180,7 +180,6 @@ static int mlxsw_i2c_wait_go_bit(struct i2c_client *client,
 				break;
 			}
 		}
-		cond_resched();
 	} while ((time_before(jiffies, end)) || (i++ < MLXSW_I2C_RETRY));
 
 	if (wait_done) {
@@ -361,8 +360,6 @@ mlxsw_i2c_write(struct device *dev, size_t in_mbox_size, u8 *in_mbox, int num,
 			err = i2c_transfer(client->adapter, &write_tran, 1);
 			if (err == 1)
 				break;
-
-			cond_resched();
 		} while ((time_before(jiffies, end)) ||
 			 (j++ < MLXSW_I2C_RETRY));
 
@@ -473,8 +470,6 @@ mlxsw_i2c_cmd(struct device *dev, u16 opcode, u32 in_mod, size_t in_mbox_size,
 					   ARRAY_SIZE(read_tran));
 			if (err == ARRAY_SIZE(read_tran))
 				break;
-
-			cond_resched();
 		} while ((time_before(jiffies, end)) ||
 			 (j++ < MLXSW_I2C_RETRY));
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index 51eea1f0529c..8124b27d0eaa 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -1455,7 +1455,6 @@ static int mlxsw_pci_sys_ready_wait(struct mlxsw_pci *mlxsw_pci,
 		val = mlxsw_pci_read32(mlxsw_pci, FW_READY);
 		if ((val & MLXSW_PCI_FW_READY_MASK) == MLXSW_PCI_FW_READY_MAGIC)
 			return 0;
-		cond_resched();
 	} while (time_before(jiffies, end));
 
 	*p_sys_status = val & MLXSW_PCI_FW_READY_MASK;
@@ -1824,7 +1823,6 @@ static int mlxsw_pci_cmd_exec(void *bus_priv, u16 opcode, u8 opcode_mod,
 				*p_status = ctrl >> MLXSW_PCI_CIR_CTRL_STATUS_SHIFT;
 				break;
 			}
-			cond_resched();
 		} while (time_before(jiffies, end));
 	} else {
 		wait_event_timeout(mlxsw_pci->cmd.wait, *p_wait_done, timeout);
diff --git a/drivers/net/ethernet/pasemi/pasemi_mac.c b/drivers/net/ethernet/pasemi/pasemi_mac.c
index ed7dd0a04235..3ec6ac758878 100644
--- a/drivers/net/ethernet/pasemi/pasemi_mac.c
+++ b/drivers/net/ethernet/pasemi/pasemi_mac.c
@@ -1225,7 +1225,6 @@ static void pasemi_mac_pause_txchan(struct pasemi_mac *mac)
 		sta = read_dma_reg(PAS_DMA_TXCHAN_TCMDSTA(txch));
 		if (!(sta & PAS_DMA_TXCHAN_TCMDSTA_ACT))
 			break;
-		cond_resched();
 	}
 
 	if (sta & PAS_DMA_TXCHAN_TCMDSTA_ACT)
@@ -1246,7 +1245,6 @@ static void pasemi_mac_pause_rxchan(struct pasemi_mac *mac)
 		sta = read_dma_reg(PAS_DMA_RXCHAN_CCMDSTA(rxch));
 		if (!(sta & PAS_DMA_RXCHAN_CCMDSTA_ACT))
 			break;
-		cond_resched();
 	}
 
 	if (sta & PAS_DMA_RXCHAN_CCMDSTA_ACT)
@@ -1265,7 +1263,6 @@ static void pasemi_mac_pause_rxint(struct pasemi_mac *mac)
 		sta = read_dma_reg(PAS_DMA_RXINT_RCMDSTA(mac->dma_if));
 		if (!(sta & PAS_DMA_RXINT_RCMDSTA_ACT))
 			break;
-		cond_resched();
 	}
 
 	if (sta & PAS_DMA_RXINT_RCMDSTA_ACT)
diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
index 35ec9aab3dc7..c26c43a7a83c 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
@@ -326,8 +326,6 @@ static int netxen_wait_rom_done(struct netxen_adapter *adapter)
 	long timeout = 0;
 	long done = 0;
 
-	cond_resched();
-
 	while (done == 0) {
 		done = NXRD32(adapter, NETXEN_ROMUSB_GLB_STATUS);
 		done &= 2;
diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_init.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_init.c
index c95d56e56c59..359db1fa500f 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_init.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_init.c
@@ -2023,7 +2023,6 @@ static void qlcnic_83xx_exec_template_cmd(struct qlcnic_adapter *p_dev,
 			break;
 		}
 		entry += p_hdr->size;
-		cond_resched();
 	}
 	p_dev->ahw->reset.seq_index = index;
 }
diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_init.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_init.c
index 09f20c794754..110b1ea921e5 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_init.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_init.c
@@ -295,7 +295,6 @@ static int qlcnic_wait_rom_done(struct qlcnic_adapter *adapter)
 	long done = 0;
 	int err = 0;
 
-	cond_resched();
 	while (done == 0) {
 		done = QLCRD32(adapter, QLCNIC_ROMUSB_GLB_STATUS, &err);
 		done &= 2;
diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_minidump.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_minidump.c
index 7ecb3dfe30bd..38b4f56fc464 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_minidump.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_minidump.c
@@ -702,7 +702,6 @@ static u32 qlcnic_read_memory_test_agent(struct qlcnic_adapter *adapter,
 		addr += 16;
 		reg_read -= 16;
 		ret += 16;
-		cond_resched();
 	}
 out:
 	mutex_unlock(&adapter->ahw->mem_lock);
@@ -1383,7 +1382,6 @@ int qlcnic_dump_fw(struct qlcnic_adapter *adapter)
 		buf_offset += entry->hdr.cap_size;
 		entry_offset += entry->hdr.offset;
 		buffer = fw_dump->data + buf_offset;
-		cond_resched();
 	}
 
 	fw_dump->clr = 1;
diff --git a/drivers/net/ethernet/sfc/falcon/falcon.c b/drivers/net/ethernet/sfc/falcon/falcon.c
index 7a1c9337081b..44cc6e1bef57 100644
--- a/drivers/net/ethernet/sfc/falcon/falcon.c
+++ b/drivers/net/ethernet/sfc/falcon/falcon.c
@@ -630,8 +630,6 @@ falcon_spi_read(struct ef4_nic *efx, const struct falcon_spi_device *spi,
 			break;
 		pos += block_len;
 
-		/* Avoid locking up the system */
-		cond_resched();
 		if (signal_pending(current)) {
 			rc = -EINTR;
 			break;
@@ -723,8 +721,6 @@ falcon_spi_write(struct ef4_nic *efx, const struct falcon_spi_device *spi,
 
 		pos += block_len;
 
-		/* Avoid locking up the system */
-		cond_resched();
 		if (signal_pending(current)) {
 			rc = -EINTR;
 			break;
@@ -839,8 +835,6 @@ falcon_spi_erase(struct falcon_mtd_partition *part, loff_t start, size_t len)
 		if (memcmp(empty, buffer, block_len))
 			return -EIO;
 
-		/* Avoid locking up the system */
-		cond_resched();
 		if (signal_pending(current))
 			return -EINTR;
 	}
diff --git a/drivers/net/ifb.c b/drivers/net/ifb.c
index 78253ad57b2e..ffd23d862967 100644
--- a/drivers/net/ifb.c
+++ b/drivers/net/ifb.c
@@ -434,7 +434,6 @@ static int __init ifb_init_module(void)
 
 	for (i = 0; i < numifbs && !err; i++) {
 		err = ifb_init_one(i);
-		cond_resched();
 	}
 	if (err)
 		__rtnl_link_unregister(&ifb_link_ops);
diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c
index c0c49f181367..91a4d1bda8a0 100644
--- a/drivers/net/ipvlan/ipvlan_core.c
+++ b/drivers/net/ipvlan/ipvlan_core.c
@@ -292,7 +292,6 @@ void ipvlan_process_multicast(struct work_struct *work)
 				kfree_skb(skb);
 		}
 		dev_put(dev);
-		cond_resched();
 	}
 }
 
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 02bd201bc7e5..120af3235f4d 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -341,8 +341,6 @@ static void macvlan_process_broadcast(struct work_struct *w)
 		if (src)
 			dev_put(src->dev);
 		consume_skb(skb);
-
-		cond_resched();
 	}
 }
 
diff --git a/drivers/net/mhi_net.c b/drivers/net/mhi_net.c
index ae169929a9d8..cbb59a94b083 100644
--- a/drivers/net/mhi_net.c
+++ b/drivers/net/mhi_net.c
@@ -291,9 +291,9 @@ static void mhi_net_rx_refill_work(struct work_struct *work)
 		}
 
 		/* Do not hog the CPU if rx buffers are consumed faster than
 		 * queued (unlikely).
 		 */
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	/* If we're still starved of rx buffers, reschedule later */
diff --git a/drivers/net/netdevsim/fib.c b/drivers/net/netdevsim/fib.c
index a1f91ff8ec56..7b7a37b247d1 100644
--- a/drivers/net/netdevsim/fib.c
+++ b/drivers/net/netdevsim/fib.c
@@ -1492,7 +1492,6 @@ static void nsim_fib_event_work(struct work_struct *work)
 		nsim_fib_event(fib_event);
 		list_del(&fib_event->list);
 		kfree(fib_event);
-		cond_resched();
 	}
 	mutex_unlock(&data->fib_lock);
 }
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d67f742fbd4c..d0d7cd077a85 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -4015,7 +4015,6 @@ static void free_unused_bufs(struct virtnet_info *vi)
 		struct virtqueue *vq = vi->sq[i].vq;
 		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL)
 			virtnet_sq_free_unused_buf(vq, buf);
-		cond_resched();
 	}
 
 	for (i = 0; i < vi->max_queue_pairs; i++) {
@@ -4023,7 +4022,6 @@ static void free_unused_bufs(struct virtnet_info *vi)
 
 		while ((buf = virtnet_rq_detach_unused_buf(rq)) != NULL)
 			virtnet_rq_free_unused_buf(rq->vq, buf);
-		cond_resched();
 	}
 }
 
diff --git a/drivers/net/wireguard/ratelimiter.c b/drivers/net/wireguard/ratelimiter.c
index dd55e5c26f46..c9c411ec377a 100644
--- a/drivers/net/wireguard/ratelimiter.c
+++ b/drivers/net/wireguard/ratelimiter.c
@@ -74,8 +74,6 @@ static void wg_ratelimiter_gc_entries(struct work_struct *work)
 		}
 #endif
 		spin_unlock(&table_lock);
-		if (likely(work))
-			cond_resched();
 	}
 	if (likely(work))
 		queue_delayed_work(system_power_efficient_wq, &gc_work, HZ);
diff --git a/drivers/net/wireguard/receive.c b/drivers/net/wireguard/receive.c
index 0b3f0c843550..8468b041e786 100644
--- a/drivers/net/wireguard/receive.c
+++ b/drivers/net/wireguard/receive.c
@@ -213,7 +213,6 @@ void wg_packet_handshake_receive_worker(struct work_struct *work)
 		wg_receive_handshake_packet(wg, skb);
 		dev_kfree_skb(skb);
 		atomic_dec(&wg->handshake_queue_len);
-		cond_resched();
 	}
 }
 
@@ -501,8 +500,6 @@ void wg_packet_decrypt_worker(struct work_struct *work)
 			likely(decrypt_packet(skb, PACKET_CB(skb)->keypair)) ?
 				PACKET_STATE_CRYPTED : PACKET_STATE_DEAD;
 		wg_queue_enqueue_per_peer_rx(skb, state);
-		if (need_resched())
-			cond_resched();
 	}
 }
 
diff --git a/drivers/net/wireguard/send.c b/drivers/net/wireguard/send.c
index 95c853b59e1d..aa122729d802 100644
--- a/drivers/net/wireguard/send.c
+++ b/drivers/net/wireguard/send.c
@@ -279,8 +279,6 @@ void wg_packet_tx_worker(struct work_struct *work)
 
 		wg_noise_keypair_put(keypair, false);
 		wg_peer_put(peer);
-		if (need_resched())
-			cond_resched();
 	}
 }
 
@@ -303,8 +301,6 @@ void wg_packet_encrypt_worker(struct work_struct *work)
 			}
 		}
 		wg_queue_enqueue_per_peer_tx(first, state);
-		if (need_resched())
-			cond_resched();
 	}
 }
 
diff --git a/drivers/net/wireless/broadcom/b43/lo.c b/drivers/net/wireless/broadcom/b43/lo.c
index 338b6545a1e7..0fc018a706f3 100644
--- a/drivers/net/wireless/broadcom/b43/lo.c
+++ b/drivers/net/wireless/broadcom/b43/lo.c
@@ -112,10 +112,10 @@ static u16 lo_measure_feedthrough(struct b43_wldev *dev,
 	udelay(21);
 	feedthrough = b43_phy_read(dev, B43_PHY_LO_LEAKAGE);
 
 	/* This is a good place to check if we need to relax a bit,
 	 * as this is the main function called regularly
 	 * in the LO calibration. */
-	cond_resched();
+	cond_resched_stall();
 
 	return feedthrough;
 }
diff --git a/drivers/net/wireless/broadcom/b43/pio.c b/drivers/net/wireless/broadcom/b43/pio.c
index 8c28a9250cd1..44f5920ab6ff 100644
--- a/drivers/net/wireless/broadcom/b43/pio.c
+++ b/drivers/net/wireless/broadcom/b43/pio.c
@@ -768,7 +768,6 @@ void b43_pio_rx(struct b43_pio_rxqueue *q)
 		stop = !pio_rx_frame(q);
 		if (stop)
 			break;
-		cond_resched();
 		if (WARN_ON_ONCE(++count > 10000))
 			break;
 	}
diff --git a/drivers/net/wireless/broadcom/b43legacy/phy.c b/drivers/net/wireless/broadcom/b43legacy/phy.c
index c1395e622759..d6d2cf2a38fe 100644
--- a/drivers/net/wireless/broadcom/b43legacy/phy.c
+++ b/drivers/net/wireless/broadcom/b43legacy/phy.c
@@ -1113,7 +1113,6 @@ static u16 b43legacy_phy_lo_b_r15_loop(struct b43legacy_wldev *dev)
 		ret += b43legacy_phy_read(dev, 0x002C);
 	}
 	local_irq_restore(flags);
-	cond_resched();
 
 	return ret;
 }
@@ -1242,7 +1241,6 @@ u16 b43legacy_phy_lo_g_deviation_subval(struct b43legacy_wldev *dev,
 	}
 	ret = b43legacy_phy_read(dev, 0x002D);
 	local_irq_restore(flags);
-	cond_resched();
 
 	return ret;
 }
@@ -1580,7 +1578,6 @@ void b43legacy_phy_lo_g_measure(struct b43legacy_wldev *dev)
 			b43legacy_radio_write16(dev, 0x43, i);
 			b43legacy_radio_write16(dev, 0x52, phy->txctl2);
 			udelay(10);
-			cond_resched();
 
 			b43legacy_phy_set_baseband_attenuation(dev, j * 2);
 
@@ -1631,7 +1628,6 @@ void b43legacy_phy_lo_g_measure(struct b43legacy_wldev *dev)
 					      phy->txctl2
 					      | (3/*txctl1*/ << 4));
 			udelay(10);
-			cond_resched();
 
 			b43legacy_phy_set_baseband_attenuation(dev, j * 2);
 
@@ -1654,7 +1650,6 @@ void b43legacy_phy_lo_g_measure(struct b43legacy_wldev *dev)
 		b43legacy_phy_write(dev, 0x0812, (r27 << 8) | 0xA2);
 		udelay(2);
 		b43legacy_phy_write(dev, 0x0812, (r27 << 8) | 0xA3);
-		cond_resched();
 	} else
 		b43legacy_phy_write(dev, 0x0015, r27 | 0xEFA0);
 	b43legacy_phy_lo_adjust(dev, is_initializing);
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
index 2a90bb24ba77..3cc5476c529d 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
@@ -3979,7 +3979,6 @@ static int brcmf_cfg80211_sched_scan_stop(struct wiphy *wiphy,
 static __always_inline void brcmf_delay(u32 ms)
 {
 	if (ms < 1000 / HZ) {
-		cond_resched();
 		mdelay(ms);
 	} else {
 		msleep(ms);
diff --git a/drivers/net/wireless/cisco/airo.c b/drivers/net/wireless/cisco/airo.c
index dbd13f7aa3e6..f15a55138dd9 100644
--- a/drivers/net/wireless/cisco/airo.c
+++ b/drivers/net/wireless/cisco/airo.c
@@ -3988,8 +3988,6 @@ static u16 issuecommand(struct airo_info *ai, Cmd *pCmd, Resp *pRsp,
 		if ((IN4500(ai, COMMAND)) == pCmd->cmd)
 			// PC4500 didn't notice command, try again
 			OUT4500(ai, COMMAND, pCmd->cmd);
-		if (may_sleep && (max_tries & 255) == 0)
-			cond_resched();
 	}
 
 	if (max_tries == -1) {
diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
index 198933f853c5..9ab63ff0b6aa 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
@@ -2309,8 +2309,6 @@ static int iwl_trans_pcie_read_mem(struct iwl_trans *trans, u32 addr,
 			}
 			iwl_trans_release_nic_access(trans);
 
-			if (resched)
-				cond_resched();
 		} else {
 			return -EBUSY;
 		}
diff --git a/drivers/net/wireless/marvell/mwl8k.c b/drivers/net/wireless/marvell/mwl8k.c
index 13bcb123d122..9b4341da3163 100644
--- a/drivers/net/wireless/marvell/mwl8k.c
+++ b/drivers/net/wireless/marvell/mwl8k.c
@@ -632,7 +632,6 @@ mwl8k_send_fw_load_cmd(struct mwl8k_priv *priv, void *data, int length)
 				break;
 			}
 		}
-		cond_resched();
 		udelay(1);
 	} while (--loops);
 
@@ -795,7 +794,6 @@ static int mwl8k_load_firmware(struct ieee80211_hw *hw)
 			break;
 		}
 
-		cond_resched();
 		udelay(1);
 	} while (--loops);
 
diff --git a/drivers/net/wireless/mediatek/mt76/util.c b/drivers/net/wireless/mediatek/mt76/util.c
index fc76c66ff1a5..54ffe67d1365 100644
--- a/drivers/net/wireless/mediatek/mt76/util.c
+++ b/drivers/net/wireless/mediatek/mt76/util.c
@@ -130,7 +130,6 @@ int __mt76_worker_fn(void *ptr)
 		set_bit(MT76_WORKER_RUNNING, &w->state);
 		set_current_state(TASK_RUNNING);
 		w->fn(w);
-		cond_resched();
 		clear_bit(MT76_WORKER_RUNNING, &w->state);
 	}
 
diff --git a/drivers/net/wwan/mhi_wwan_mbim.c b/drivers/net/wwan/mhi_wwan_mbim.c
index 3f72ae943b29..d8aaf476f25d 100644
--- a/drivers/net/wwan/mhi_wwan_mbim.c
+++ b/drivers/net/wwan/mhi_wwan_mbim.c
@@ -400,7 +400,7 @@ static void mhi_net_rx_refill_work(struct work_struct *work)
 		/* Do not hog the CPU if rx buffers are consumed faster than
 		 * queued (unlikely).
 		 */
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	/* If we're still starved of rx buffers, reschedule later */
diff --git a/drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c b/drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c
index 8dab025a088a..52420b1f3669 100644
--- a/drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c
+++ b/drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c
@@ -423,7 +423,6 @@ static void t7xx_do_tx_hw_push(struct dpmaif_ctrl *dpmaif_ctrl)
 		drb_send_cnt = t7xx_txq_burst_send_skb(txq);
 		if (drb_send_cnt <= 0) {
 			usleep_range(10, 20);
-			cond_resched();
 			continue;
 		}
 
@@ -437,8 +436,6 @@ static void t7xx_do_tx_hw_push(struct dpmaif_ctrl *dpmaif_ctrl)
 
 		t7xx_dpmaif_ul_update_hw_drb_cnt(&dpmaif_ctrl->hw_info, txq->index,
 						 drb_send_cnt * DPMAIF_UL_DRB_SIZE_WORD);
-
-		cond_resched();
 	} while (!t7xx_tx_lists_are_all_empty(dpmaif_ctrl) && !kthread_should_stop() &&
 		 (dpmaif_ctrl->state == DPMAIF_STATE_PWRON));
 }
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 88f760a7cbc3..a540e95ba58f 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -1571,7 +1571,6 @@ int xenvif_dealloc_kthread(void *data)
 			break;
 
 		xenvif_tx_dealloc_action(queue);
-		cond_resched();
 	}
 
 	/* Unmap anything remaining*/
diff --git a/drivers/net/xen-netback/rx.c b/drivers/net/xen-netback/rx.c
index 0ba754ebc5ba..bccefaec5312 100644
--- a/drivers/net/xen-netback/rx.c
+++ b/drivers/net/xen-netback/rx.c
@@ -669,8 +669,6 @@ int xenvif_kthread_guest_rx(void *data)
 		 * slots.
 		 */
 		xenvif_rx_queue_drop_expired(queue);
-
-		cond_resched();
 	}
 
 	/* Bin any remaining skbs */
-- 
2.31.1
[RFC PATCH 85/86] treewide: drivers: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight loop.
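
Roughly, the three sets look as follows (a made-up sketch -- none of
the identifiers below come from any particular driver):

    /* set-1: scheduling point added out of caution */
    for (i = 0; i < nr_pages; i++) {
            process_page(pages[i]);
            cond_resched();
    }

    /* set-2: open coded variant of cond_resched_lock(&lock) */
    if (need_resched()) {
            spin_unlock(&lock);
            cond_resched();
            spin_lock(&lock);
    }

    /* set-3: retry loop using cond_resched() instead of a timed wait */
    while (!(readl(reg) & READY) && time_before(jiffies, end))
            cond_resched();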

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for, say, code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful for avoiding softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.
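Roughly, that amounts to something like the below (only a sketch; the
actual helper is defined earlier in this series):

    static inline void cond_resched_stall(void)
    {
            if (!cond_resched())    /* could not reschedule... */
                    cpu_relax();    /* ...so at least relax the CPU */
    }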

The cond_resched() calls here are of all kinds. Those from set-1
or set-2 are quite straight-forward to handle.

There are quite a few from set-3 where, as noted above, we
use cond_resched() as if it were an amulet. Which I suppose
it is, in that it wards off softlockup or RCU splats.

Those are now cond_resched_stall(), but in most cases, given
that the timeouts are in milliseconds, they could easily be
timed waits.
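
For instance, where the poll is on an MMIO status register, something
like readl_poll_timeout() from linux/iopoll.h would do (a sketch;
base, STATUS_REG and READY are placeholders, not taken from any
driver touched here):

    u32 val;
    int ret;

    ret = readl_poll_timeout(base + STATUS_REG, val, val & READY,
                             100, 10 * USEC_PER_MSEC);
    if (ret)        /* -ETIMEDOUT */
            return ret;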

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Oded Gabbay <ogabbay@kernel.org> 
Cc: Miguel Ojeda <ojeda@kernel.org> 
Cc: Jens Axboe <axboe@kernel.dk> 
Cc: Minchan Kim <minchan@kernel.org> 
Cc: Sergey Senozhatsky <senozhatsky@chromium.org> 
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> 
Cc: "Theodore Ts'o" <tytso@mit.edu> 
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> 
Cc: Amit Shah <amit@kernel.org> 
Cc: Gonglei <arei.gonglei@huawei.com> 
Cc: "Michael S. Tsirkin" <mst@redhat.com> 
Cc: Jason Wang <jasowang@redhat.com> 
Cc: "David S. Miller" <davem@davemloft.net> 
Cc: Davidlohr Bueso <dave@stgolabs.net> 
Cc: Jonathan Cameron <jonathan.cameron@huawei.com> 
Cc: Dave Jiang <dave.jiang@intel.com> 
Cc: Alison Schofield <alison.schofield@intel.com> 
Cc: Vishal Verma <vishal.l.verma@intel.com> 
Cc: Ira Weiny <ira.weiny@intel.com> 
Cc: Dan Williams <dan.j.williams@intel.com> 
Cc: Sumit Semwal <sumit.semwal@linaro.org> 
Cc: "Christian König" <christian.koenig@amd.com> 
Cc: Andi Shyti <andi.shyti@kernel.org> 
Cc: Ray Jui <rjui@broadcom.com> 
Cc: Scott Branden <sbranden@broadcom.com> 
Cc: Chris Packham <chris.packham@alliedtelesis.co.nz> 
Cc: Shawn Guo <shawnguo@kernel.org> 
Cc: Sascha Hauer <s.hauer@pengutronix.de> 
Cc: Junxian Huang <huangjunxian6@hisilicon.com> 
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> 
Cc: Will Deacon <will@kernel.org> 
Cc: Joerg Roedel <joro@8bytes.org> 
Cc: Mauro Carvalho Chehab <mchehab@kernel.org> 
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> 
Cc: Hans de Goede <hdegoede@redhat.com> 
Cc: "Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com> 
Cc: Mark Gross <markgross@kernel.org> 
Cc: Finn Thain <fthain@linux-m68k.org> 
Cc: Michael Schmitz <schmitzmic@gmail.com> 
Cc: "James E.J. Bottomley" <jejb@linux.ibm.com> 
Cc: "Martin K. Petersen" <martin.petersen@oracle.com> 
Cc: Kashyap Desai <kashyap.desai@broadcom.com> 
Cc: Sumit Saxena <sumit.saxena@broadcom.com> 
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> 
Cc: Mark Brown <broonie@kernel.org> 
Cc: Neil Armstrong <neil.armstrong@linaro.org> 
Cc: Jens Wiklander <jens.wiklander@linaro.org> 
Cc: Alex Williamson <alex.williamson@redhat.com> 
Cc: Helge Deller <deller@gmx.de> 
Cc: David Hildenbrand <david@redhat.com> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/accel/ivpu/ivpu_drv.c                      |  2 --
 drivers/accel/ivpu/ivpu_gem.c                      |  1 -
 drivers/accel/ivpu/ivpu_pm.c                       |  8 ++++++--
 drivers/accel/qaic/qaic_data.c                     |  2 --
 drivers/auxdisplay/charlcd.c                       | 11 -----------
 drivers/base/power/domain.c                        |  1 -
 drivers/block/aoe/aoecmd.c                         |  3 +--
 drivers/block/brd.c                                |  1 -
 drivers/block/drbd/drbd_bitmap.c                   |  4 ----
 drivers/block/drbd/drbd_debugfs.c                  |  1 -
 drivers/block/loop.c                               |  3 ---
 drivers/block/xen-blkback/blkback.c                |  3 ---
 drivers/block/zram/zram_drv.c                      |  2 --
 drivers/bluetooth/virtio_bt.c                      |  1 -
 drivers/char/hw_random/arm_smccc_trng.c            |  1 -
 drivers/char/lp.c                                  |  2 --
 drivers/char/mem.c                                 |  4 ----
 drivers/char/mwave/3780i.c                         |  4 +---
 drivers/char/ppdev.c                               |  4 ----
 drivers/char/random.c                              |  2 --
 drivers/char/virtio_console.c                      |  1 -
 drivers/crypto/virtio/virtio_crypto_core.c         |  1 -
 drivers/cxl/pci.c                                  |  1 -
 drivers/dma-buf/selftest.c                         |  1 -
 drivers/dma-buf/st-dma-fence-chain.c               |  1 -
 drivers/fsi/fsi-sbefifo.c                          | 14 ++++++++++++--
 drivers/i2c/busses/i2c-bcm-iproc.c                 |  9 +++++++--
 drivers/i2c/busses/i2c-highlander.c                |  9 +++++++--
 drivers/i2c/busses/i2c-ibm_iic.c                   | 11 +++++++----
 drivers/i2c/busses/i2c-mpc.c                       |  2 +-
 drivers/i2c/busses/i2c-mxs.c                       |  9 ++++++++-
 drivers/i2c/busses/scx200_acb.c                    |  9 +++++++--
 drivers/infiniband/core/umem.c                     |  1 -
 drivers/infiniband/hw/hfi1/driver.c                |  1 -
 drivers/infiniband/hw/hfi1/firmware.c              |  2 +-
 drivers/infiniband/hw/hfi1/init.c                  |  1 -
 drivers/infiniband/hw/hfi1/ruc.c                   |  1 -
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c         |  5 ++++-
 drivers/infiniband/hw/qib/qib_init.c               |  1 -
 drivers/infiniband/sw/rxe/rxe_qp.c                 |  3 +--
 drivers/infiniband/sw/rxe/rxe_task.c               |  4 ++--
 drivers/input/evdev.c                              |  1 -
 drivers/input/keyboard/clps711x-keypad.c           |  2 +-
 drivers/input/misc/uinput.c                        |  1 -
 drivers/input/mousedev.c                           |  1 -
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c        |  2 --
 drivers/media/i2c/vpx3220.c                        |  3 ---
 drivers/media/pci/cobalt/cobalt-i2c.c              |  4 ++--
 drivers/misc/bcm-vk/bcm_vk_dev.c                   |  3 +--
 drivers/misc/bcm-vk/bcm_vk_msg.c                   |  3 +--
 drivers/misc/genwqe/card_base.c                    |  3 +--
 drivers/misc/genwqe/card_ddcb.c                    |  6 ------
 drivers/misc/genwqe/card_dev.c                     |  2 --
 drivers/misc/vmw_balloon.c                         |  4 ----
 drivers/mmc/host/mmc_spi.c                         |  3 ---
 drivers/nvdimm/btt.c                               |  2 --
 drivers/nvme/target/zns.c                          |  2 --
 drivers/parport/parport_ip32.c                     |  1 -
 drivers/parport/parport_pc.c                       |  4 ----
 drivers/pci/pci-sysfs.c                            |  1 -
 drivers/pci/proc.c                                 |  1 -
 .../x86/intel/speed_select_if/isst_if_mbox_pci.c   |  4 ++--
 drivers/s390/cio/css.c                             |  8 --------
 drivers/scsi/NCR5380.c                             |  2 --
 drivers/scsi/megaraid.c                            |  1 -
 drivers/scsi/qedi/qedi_main.c                      |  1 -
 drivers/scsi/qla2xxx/qla_nx.c                      |  2 --
 drivers/scsi/qla2xxx/qla_sup.c                     |  5 -----
 drivers/scsi/qla4xxx/ql4_nx.c                      |  1 -
 drivers/scsi/xen-scsifront.c                       |  2 +-
 drivers/spi/spi-lantiq-ssc.c                       |  3 +--
 drivers/spi/spi-meson-spifc.c                      |  2 +-
 drivers/spi/spi.c                                  |  2 +-
 drivers/staging/rtl8723bs/core/rtw_mlme_ext.c      |  2 +-
 drivers/staging/rtl8723bs/core/rtw_pwrctrl.c       |  2 --
 drivers/tee/optee/ffa_abi.c                        |  1 -
 drivers/tee/optee/smc_abi.c                        |  1 -
 drivers/tty/hvc/hvc_console.c                      |  6 ++----
 drivers/tty/tty_buffer.c                           |  3 ---
 drivers/tty/tty_io.c                               |  1 -
 drivers/usb/gadget/udc/max3420_udc.c               |  1 -
 drivers/usb/host/max3421-hcd.c                     |  2 +-
 drivers/usb/host/xen-hcd.c                         |  2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c                |  2 --
 drivers/vfio/vfio_iommu_type1.c                    |  7 -------
 drivers/vhost/vhost.c                              |  1 -
 drivers/video/console/vgacon.c                     |  4 ----
 drivers/virtio/virtio_mem.c                        |  8 --------
 88 files changed, 82 insertions(+), 190 deletions(-)

diff --git a/drivers/accel/ivpu/ivpu_drv.c b/drivers/accel/ivpu/ivpu_drv.c
index 7e9359611d69..479801a1d961 100644
--- a/drivers/accel/ivpu/ivpu_drv.c
+++ b/drivers/accel/ivpu/ivpu_drv.c
@@ -314,8 +314,6 @@ static int ivpu_wait_for_ready(struct ivpu_device *vdev)
 		ret = ivpu_ipc_receive(vdev, &cons, &ipc_hdr, NULL, 0);
 		if (ret != -ETIMEDOUT || time_after_eq(jiffies, timeout))
 			break;
-
-		cond_resched();
 	}
 
 	ivpu_ipc_consumer_del(vdev, &cons);
diff --git a/drivers/accel/ivpu/ivpu_gem.c b/drivers/accel/ivpu/ivpu_gem.c
index d09f13b35902..06e4c1eceae8 100644
--- a/drivers/accel/ivpu/ivpu_gem.c
+++ b/drivers/accel/ivpu/ivpu_gem.c
@@ -156,7 +156,6 @@ static int __must_check internal_alloc_pages_locked(struct ivpu_bo *bo)
 			ret = -ENOMEM;
 			goto err_free_pages;
 		}
-		cond_resched();
 	}
 
 	bo->pages = pages;
diff --git a/drivers/accel/ivpu/ivpu_pm.c b/drivers/accel/ivpu/ivpu_pm.c
index ffff2496e8e8..aa9cc4a1903c 100644
--- a/drivers/accel/ivpu/ivpu_pm.c
+++ b/drivers/accel/ivpu/ivpu_pm.c
@@ -105,7 +105,7 @@ static void ivpu_pm_recovery_work(struct work_struct *work)
 retry:
 	ret = pci_try_reset_function(to_pci_dev(vdev->drm.dev));
 	if (ret == -EAGAIN && !drm_dev_is_unplugged(&vdev->drm)) {
-		cond_resched();
+		cond_resched_stall();
 		goto retry;
 	}
 
@@ -146,7 +146,11 @@ int ivpu_pm_suspend_cb(struct device *dev)
 
 	timeout = jiffies + msecs_to_jiffies(vdev->timeout.tdr);
 	while (!ivpu_hw_is_idle(vdev)) {
-		cond_resched();
+
+		/* The timeout is in thousands of msecs. Maybe this should be a
+		 * timed wait instead?
+		 */
+		cond_resched_stall();
 		if (time_after_eq(jiffies, timeout)) {
 			ivpu_err(vdev, "Failed to enter idle on system suspend\n");
 			return -EBUSY;
diff --git a/drivers/accel/qaic/qaic_data.c b/drivers/accel/qaic/qaic_data.c
index f4b06792c6f1..d06fd9d765f2 100644
--- a/drivers/accel/qaic/qaic_data.c
+++ b/drivers/accel/qaic/qaic_data.c
@@ -1516,7 +1516,6 @@ void irq_polling_work(struct work_struct *work)
 			return;
 		}
 
-		cond_resched();
 		usleep_range(datapath_poll_interval_us, 2 * datapath_poll_interval_us);
 	}
 }
@@ -1547,7 +1546,6 @@ irqreturn_t dbc_irq_threaded_fn(int irq, void *data)
 
 	if (!event_count) {
 		event_count = NUM_EVENTS;
-		cond_resched();
 	}
 
 	/*
diff --git a/drivers/auxdisplay/charlcd.c b/drivers/auxdisplay/charlcd.c
index 6d309e4971b6..cb1213e292f4 100644
--- a/drivers/auxdisplay/charlcd.c
+++ b/drivers/auxdisplay/charlcd.c
@@ -470,14 +470,6 @@ static ssize_t charlcd_write(struct file *file, const char __user *buf,
 	char c;
 
 	for (; count-- > 0; (*ppos)++, tmp++) {
-		if (((count + 1) & 0x1f) == 0) {
-			/*
-			 * charlcd_write() is invoked as a VFS->write() callback
-			 * and as such it is always invoked from preemptible
-			 * context and may sleep.
-			 */
-			cond_resched();
-		}
 
 		if (get_user(c, tmp))
 			return -EFAULT;
@@ -539,9 +531,6 @@ static void charlcd_puts(struct charlcd *lcd, const char *s)
 	int count = strlen(s);
 
 	for (; count-- > 0; tmp++) {
-		if (((count + 1) & 0x1f) == 0)
-			cond_resched();
-
 		charlcd_write_char(lcd, *tmp);
 	}
 }
diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
index 5cb2023581d4..6b77bdfe1de9 100644
--- a/drivers/base/power/domain.c
+++ b/drivers/base/power/domain.c
@@ -2696,7 +2696,6 @@ static void genpd_dev_pm_detach(struct device *dev, bool power_off)
 			break;
 
 		mdelay(i);
-		cond_resched();
 	}
 
 	if (ret < 0) {
diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index d7317425be51..d212b0df661f 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -1235,8 +1235,7 @@ kthread(void *vp)
 		if (!more) {
 			schedule();
 			remove_wait_queue(k->waitq, &wait);
-		} else
-			cond_resched();
+		}
 	} while (!kthread_should_stop());
 	complete(&k->rendez);	/* tell spawner we're stopping */
 	return 0;
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 970bd6ff38c4..be1577cd4d4b 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -111,7 +111,6 @@ static void brd_free_pages(struct brd_device *brd)
 
 	xa_for_each(&brd->brd_pages, idx, page) {
 		__free_page(page);
-		cond_resched();
 	}
 
 	xa_destroy(&brd->brd_pages);
diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 85ca000a0564..f12de044c540 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -563,7 +563,6 @@ static unsigned long bm_count_bits(struct drbd_bitmap *b)
 		p_addr = __bm_map_pidx(b, idx);
 		bits += bitmap_weight(p_addr, BITS_PER_PAGE);
 		__bm_unmap(p_addr);
-		cond_resched();
 	}
 	/* last (or only) page */
 	last_word = ((b->bm_bits - 1) & BITS_PER_PAGE_MASK) >> LN2_BPL;
@@ -1118,7 +1117,6 @@ static int bm_rw(struct drbd_device *device, const unsigned int flags, unsigned
 			atomic_inc(&ctx->in_flight);
 			bm_page_io_async(ctx, i);
 			++count;
-			cond_resched();
 		}
 	} else if (flags & BM_AIO_WRITE_HINTED) {
 		/* ASSERT: BM_AIO_WRITE_ALL_PAGES is not set. */
@@ -1158,7 +1156,6 @@ static int bm_rw(struct drbd_device *device, const unsigned int flags, unsigned
 			atomic_inc(&ctx->in_flight);
 			bm_page_io_async(ctx, i);
 			++count;
-			cond_resched();
 		}
 	}
 
@@ -1545,7 +1542,6 @@ void _drbd_bm_set_bits(struct drbd_device *device, const unsigned long s, const
 	for (page_nr = first_page; page_nr < last_page; page_nr++) {
 		bm_set_full_words_within_one_page(device->bitmap, page_nr, first_word, last_word);
 		spin_unlock_irq(&b->bm_lock);
-		cond_resched();
 		first_word = 0;
 		spin_lock_irq(&b->bm_lock);
 	}
diff --git a/drivers/block/drbd/drbd_debugfs.c b/drivers/block/drbd/drbd_debugfs.c
index 12460b584bcb..48a85882dfc4 100644
--- a/drivers/block/drbd/drbd_debugfs.c
+++ b/drivers/block/drbd/drbd_debugfs.c
@@ -318,7 +318,6 @@ static void seq_print_resource_transfer_log_summary(struct seq_file *m,
 			struct drbd_request *req_next;
 			kref_get(&req->kref);
 			spin_unlock_irq(&resource->req_lock);
-			cond_resched();
 			spin_lock_irq(&resource->req_lock);
 			req_next = list_next_entry(req, tl_requests);
 			if (kref_put(&req->kref, drbd_req_destroy))
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 9f2d412fc560..0ea0d37b2f28 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -271,7 +271,6 @@ static int lo_write_simple(struct loop_device *lo, struct request *rq,
 		ret = lo_write_bvec(lo->lo_backing_file, &bvec, &pos);
 		if (ret < 0)
 			break;
-		cond_resched();
 	}
 
 	return ret;
@@ -300,7 +299,6 @@ static int lo_read_simple(struct loop_device *lo, struct request *rq,
 				zero_fill_bio(bio);
 			break;
 		}
-		cond_resched();
 	}
 
 	return 0;
@@ -1948,7 +1946,6 @@ static void loop_process_work(struct loop_worker *worker,
 		spin_unlock_irq(&lo->lo_work_lock);
 
 		loop_handle_cmd(cmd);
-		cond_resched();
 
 		spin_lock_irq(&lo->lo_work_lock);
 	}
diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index c362f4ad80ab..9bcef880df30 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -1259,9 +1259,6 @@ __do_block_io_op(struct xen_blkif_ring *ring, unsigned int *eoi_flags)
 				goto done;
 			break;
 		}
-
-		/* Yield point for this unbounded loop. */
-		cond_resched();
 	}
 done:
 	return more_to_do;
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 06673c6ca255..b1f9312e7905 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1819,8 +1819,6 @@ static ssize_t recompress_store(struct device *dev,
 			ret = err;
 			break;
 		}
-
-		cond_resched();
 	}
 
 	__free_page(page);
diff --git a/drivers/bluetooth/virtio_bt.c b/drivers/bluetooth/virtio_bt.c
index 2ac70b560c46..c570c45d1480 100644
--- a/drivers/bluetooth/virtio_bt.c
+++ b/drivers/bluetooth/virtio_bt.c
@@ -79,7 +79,6 @@ static int virtbt_close_vdev(struct virtio_bluetooth *vbt)
 
 		while ((skb = virtqueue_detach_unused_buf(vq)))
 			kfree_skb(skb);
-		cond_resched();
 	}
 
 	return 0;
diff --git a/drivers/char/hw_random/arm_smccc_trng.c b/drivers/char/hw_random/arm_smccc_trng.c
index 7e954341b09f..f60d101920e4 100644
--- a/drivers/char/hw_random/arm_smccc_trng.c
+++ b/drivers/char/hw_random/arm_smccc_trng.c
@@ -84,7 +84,6 @@ static int smccc_trng_read(struct hwrng *rng, void *data, size_t max, bool wait)
 			tries++;
 			if (tries >= SMCCC_TRNG_MAX_TRIES)
 				return copied;
-			cond_resched();
 			break;
 		default:
 			return -EIO;
diff --git a/drivers/char/lp.c b/drivers/char/lp.c
index 2f171d14b9b5..1d58105112b5 100644
--- a/drivers/char/lp.c
+++ b/drivers/char/lp.c
@@ -478,8 +478,6 @@ static ssize_t lp_read(struct file *file, char __user *buf,
 			retval = -ERESTARTSYS;
 			break;
 		}
-
-		cond_resched();
 	}
 	parport_negotiate(lp_table[minor].dev->port, IEEE1284_MODE_COMPAT);
  out:
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 1052b0f2d4cf..6f97ab7004d9 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -92,8 +92,6 @@ static inline int range_is_allowed(unsigned long pfn, unsigned long size)
 
 static inline bool should_stop_iteration(void)
 {
-	if (need_resched())
-		cond_resched();
 	return signal_pending(current);
 }
 
@@ -497,7 +495,6 @@ static ssize_t read_iter_zero(struct kiocb *iocb, struct iov_iter *iter)
 			continue;
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			return written ? written : -EAGAIN;
-		cond_resched();
 	}
 	return written;
 }
@@ -523,7 +520,6 @@ static ssize_t read_zero(struct file *file, char __user *buf,
 
 		if (signal_pending(current))
 			break;
-		cond_resched();
 	}
 
 	return cleared;
diff --git a/drivers/char/mwave/3780i.c b/drivers/char/mwave/3780i.c
index 4a8937f80570..927a1cca1168 100644
--- a/drivers/char/mwave/3780i.c
+++ b/drivers/char/mwave/3780i.c
@@ -51,7 +51,7 @@
 #include <linux/delay.h>
 #include <linux/ioport.h>
 #include <linux/bitops.h>
-#include <linux/sched.h>	/* cond_resched() */
+#include <linux/sched.h>
 
 #include <asm/io.h>
 #include <linux/uaccess.h>
@@ -64,9 +64,7 @@ static DEFINE_SPINLOCK(dsp_lock);
 
 static void PaceMsaAccess(unsigned short usDspBaseIO)
 {
-	cond_resched();
 	udelay(100);
-	cond_resched();
 }
 
 unsigned short dsp3780I_ReadMsaCfg(unsigned short usDspBaseIO,
diff --git a/drivers/char/ppdev.c b/drivers/char/ppdev.c
index 4c188e9e477c..7463228ba9bf 100644
--- a/drivers/char/ppdev.c
+++ b/drivers/char/ppdev.c
@@ -176,8 +176,6 @@ static ssize_t pp_read(struct file *file, char __user *buf, size_t count,
 			bytes_read = -ERESTARTSYS;
 			break;
 		}
-
-		cond_resched();
 	}
 
 	parport_set_timeout(pp->pdev, pp->default_inactivity);
@@ -256,8 +254,6 @@ static ssize_t pp_write(struct file *file, const char __user *buf,
 
 		if (signal_pending(current))
 			break;
-
-		cond_resched();
 	}
 
 	parport_set_timeout(pp->pdev, pp->default_inactivity);
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 3cb37760dfec..9e25f3a5c83d 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -457,7 +457,6 @@ static ssize_t get_random_bytes_user(struct iov_iter *iter)
 		if (ret % PAGE_SIZE == 0) {
 			if (signal_pending(current))
 				break;
-			cond_resched();
 		}
 	}
 
@@ -1417,7 +1416,6 @@ static ssize_t write_pool_user(struct iov_iter *iter)
 		if (ret % PAGE_SIZE == 0) {
 			if (signal_pending(current))
 				break;
-			cond_resched();
 		}
 	}
 
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 680d1ef2a217..1f8da0a71ce9 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -1936,7 +1936,6 @@ static void remove_vqs(struct ports_device *portdev)
 		flush_bufs(vq, true);
 		while ((buf = virtqueue_detach_unused_buf(vq)))
 			free_buf(buf, true);
-		cond_resched();
 	}
 	portdev->vdev->config->del_vqs(portdev->vdev);
 	kfree(portdev->in_vqs);
diff --git a/drivers/crypto/virtio/virtio_crypto_core.c b/drivers/crypto/virtio/virtio_crypto_core.c
index 43a0838d31ff..3842915ea743 100644
--- a/drivers/crypto/virtio/virtio_crypto_core.c
+++ b/drivers/crypto/virtio/virtio_crypto_core.c
@@ -490,7 +490,6 @@ static void virtcrypto_free_unused_reqs(struct virtio_crypto *vcrypto)
 			kfree(vc_req->req_data);
 			kfree(vc_req->sgs);
 		}
-		cond_resched();
 	}
 }
 
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 44a21ab7add5..2c7e670d9a91 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -634,7 +634,6 @@ static irqreturn_t cxl_event_thread(int irq, void *id)
 		if (!status)
 			break;
 		cxl_mem_get_event_records(mds, status);
-		cond_resched();
 	} while (status);
 
 	return IRQ_HANDLED;
diff --git a/drivers/dma-buf/selftest.c b/drivers/dma-buf/selftest.c
index c60b6944b4bd..ddf94da3d412 100644
--- a/drivers/dma-buf/selftest.c
+++ b/drivers/dma-buf/selftest.c
@@ -93,7 +93,6 @@ __subtests(const char *caller, const struct subtest *st, int count, void *data)
 	int err;
 
 	for (; count--; st++) {
-		cond_resched();
 		if (signal_pending(current))
 			return -EINTR;
 
diff --git a/drivers/dma-buf/st-dma-fence-chain.c b/drivers/dma-buf/st-dma-fence-chain.c
index c0979c8049b5..cde69fadb4f4 100644
--- a/drivers/dma-buf/st-dma-fence-chain.c
+++ b/drivers/dma-buf/st-dma-fence-chain.c
@@ -431,7 +431,6 @@ static int __find_race(void *arg)
 signal:
 		seqno = get_random_u32_below(data->fc.chain_length - 1);
 		dma_fence_signal(data->fc.fences[seqno]);
-		cond_resched();
 	}
 
 	if (atomic_dec_and_test(&data->children))
diff --git a/drivers/fsi/fsi-sbefifo.c b/drivers/fsi/fsi-sbefifo.c
index 0a98517f3959..0e58ebae0130 100644
--- a/drivers/fsi/fsi-sbefifo.c
+++ b/drivers/fsi/fsi-sbefifo.c
@@ -372,7 +372,13 @@ static int sbefifo_request_reset(struct sbefifo *sbefifo)
 			return 0;
 		}
 
-		cond_resched();
+		/*
+		 * Use cond_resched_stall() to avoid spinning in a
+		 * tight loop.
+		 * Though, given that the timeout is in milliseconds,
+		 * maybe this should be a timed or event wait?
+		 */
+		cond_resched_stall();
 	}
 	dev_err(dev, "FIFO reset timed out\n");
 
@@ -462,7 +468,11 @@ static int sbefifo_wait(struct sbefifo *sbefifo, bool up,
 
 	end_time = jiffies + timeout;
 	while (!time_after(jiffies, end_time)) {
-		cond_resched();
+		/*
+		 * As above, maybe this should be a timed or event wait?
+		 */
+		cond_resched_stall();
+
 		rc = sbefifo_regr(sbefifo, addr, &sts);
 		if (rc < 0) {
 			dev_err(dev, "FSI error %d reading status register\n", rc);
diff --git a/drivers/i2c/busses/i2c-bcm-iproc.c b/drivers/i2c/busses/i2c-bcm-iproc.c
index 51aab662050b..6efe6d18d859 100644
--- a/drivers/i2c/busses/i2c-bcm-iproc.c
+++ b/drivers/i2c/busses/i2c-bcm-iproc.c
@@ -788,8 +788,13 @@ static int bcm_iproc_i2c_xfer_wait(struct bcm_iproc_i2c_dev *iproc_i2c,
 				break;
 			}
 
-			cpu_relax();
-			cond_resched();
+			/*
+			 * Use cond_resched_stall() to avoid spinning in a
+			 * tight loop.
+			 * Though, given that the timeout is in milliseconds,
+			 * maybe this should be a timed or event wait?
+			 */
+			cond_resched_stall();
 		} while (!iproc_i2c->xfer_is_done);
 	}
 
diff --git a/drivers/i2c/busses/i2c-highlander.c b/drivers/i2c/busses/i2c-highlander.c
index 7922bc917c33..06eed7e1c4f3 100644
--- a/drivers/i2c/busses/i2c-highlander.c
+++ b/drivers/i2c/busses/i2c-highlander.c
@@ -187,8 +187,13 @@ static void highlander_i2c_poll(struct highlander_i2c_dev *dev)
 		if (time_after(jiffies, timeout))
 			break;
 
-		cpu_relax();
-		cond_resched();
+		/*
+		 * Use cond_resched_stall() to avoid spinning in a
+		 * tight loop.
+		 * Though, given that the timeout is in milliseconds,
+		 * maybe this should be a timed or event wait?
+		 */
+		cond_resched_stall();
 	}
 
 	dev_err(dev->dev, "polling timed out\n");
diff --git a/drivers/i2c/busses/i2c-ibm_iic.c b/drivers/i2c/busses/i2c-ibm_iic.c
index 408820319ec4..b486d8b9636b 100644
--- a/drivers/i2c/busses/i2c-ibm_iic.c
+++ b/drivers/i2c/busses/i2c-ibm_iic.c
@@ -207,9 +207,6 @@ static void iic_dev_reset(struct ibm_iic_private* dev)
 			udelay(10);
 			dc ^= DIRCNTL_SCC;
 			out_8(&iic->directcntl, dc);
-
-			/* be nice */
-			cond_resched();
 		}
 	}
 
@@ -231,7 +228,13 @@ static int iic_dc_wait(volatile struct iic_regs __iomem *iic, u8 mask)
 	while ((in_8(&iic->directcntl) & mask) != mask){
 		if (unlikely(time_after(jiffies, x)))
 			return -1;
-		cond_resched();
+		/*
+		 * Use cond_resched_stall() to avoid spinning in a
+		 * tight loop.
+		 * Though, given that the timeout is in milliseconds,
+		 * maybe this should be a timed or event wait?
+		 */
+		cond_resched_stall();
 	}
 	return 0;
 }
diff --git a/drivers/i2c/busses/i2c-mpc.c b/drivers/i2c/busses/i2c-mpc.c
index e4e4995ab224..82d24523c6a7 100644
--- a/drivers/i2c/busses/i2c-mpc.c
+++ b/drivers/i2c/busses/i2c-mpc.c
@@ -712,7 +712,7 @@ static int mpc_i2c_execute_msg(struct mpc_i2c *i2c)
 			}
 			return -EIO;
 		}
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	return i2c->rc;
diff --git a/drivers/i2c/busses/i2c-mxs.c b/drivers/i2c/busses/i2c-mxs.c
index 36def0a9c95c..d4d69cd7ef46 100644
--- a/drivers/i2c/busses/i2c-mxs.c
+++ b/drivers/i2c/busses/i2c-mxs.c
@@ -310,7 +310,14 @@ static int mxs_i2c_pio_wait_xfer_end(struct mxs_i2c_dev *i2c)
 			return -ENXIO;
 		if (time_after(jiffies, timeout))
 			return -ETIMEDOUT;
-		cond_resched();
+
+		/*
+		 * Use cond_resched_stall() to avoid spinning in a
+		 * tight loop.
+		 * Though, given that the timeout is in milliseconds,
+		 * maybe this should be a timed or event wait?
+		 */
+		cond_resched_stall();
 	}
 
 	return 0;
diff --git a/drivers/i2c/busses/scx200_acb.c b/drivers/i2c/busses/scx200_acb.c
index 83c1db610f54..5646130c003f 100644
--- a/drivers/i2c/busses/scx200_acb.c
+++ b/drivers/i2c/busses/scx200_acb.c
@@ -232,8 +232,13 @@ static void scx200_acb_poll(struct scx200_acb_iface *iface)
 		}
 		if (time_after(jiffies, timeout))
 			break;
-		cpu_relax();
-		cond_resched();
+		/*
+		 * Use cond_resched_stall() to avoid spinning in a
+		 * tight loop.
+		 * Though, given that the timeout is in milliseconds,
+		 * maybe this should be a timed or event wait?
+		 */
+		cond_resched_stall();
 	}
 
 	dev_err(&iface->adapter.dev, "timeout in state %s\n",
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index f9ab671c8eda..6b4d3d3193a2 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -215,7 +215,6 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
 		gup_flags |= FOLL_WRITE;
 
 	while (npages) {
-		cond_resched();
 		pinned = pin_user_pages_fast(cur_base,
 					  min_t(unsigned long, npages,
 						PAGE_SIZE /
diff --git a/drivers/infiniband/hw/hfi1/driver.c b/drivers/infiniband/hw/hfi1/driver.c
index f4492fa407e0..b390eb169a60 100644
--- a/drivers/infiniband/hw/hfi1/driver.c
+++ b/drivers/infiniband/hw/hfi1/driver.c
@@ -668,7 +668,6 @@ static noinline int max_packet_exceeded(struct hfi1_packet *packet, int thread)
 		if ((packet->numpkt & (MAX_PKT_RECV_THREAD - 1)) == 0)
 			/* allow defered processing */
 			process_rcv_qp_work(packet);
-		cond_resched();
 		return RCV_PKT_OK;
 	} else {
 		this_cpu_inc(*packet->rcd->dd->rcv_limit);
diff --git a/drivers/infiniband/hw/hfi1/firmware.c b/drivers/infiniband/hw/hfi1/firmware.c
index 0c0cef5b1e0e..717ccb0e69b4 100644
--- a/drivers/infiniband/hw/hfi1/firmware.c
+++ b/drivers/infiniband/hw/hfi1/firmware.c
@@ -560,7 +560,7 @@ static void __obtain_firmware(struct hfi1_devdata *dd)
 		 * something that holds for 30 seconds.  If we do that twice
 		 * in a row it triggers task blocked warning.
 		 */
-		cond_resched();
+		cond_resched_stall();
 		if (fw_8051_load)
 			dispose_one_firmware(&fw_8051);
 		if (fw_fabric_serdes_load)
diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c
index 6de37c5d7d27..3b5abcd72660 100644
--- a/drivers/infiniband/hw/hfi1/init.c
+++ b/drivers/infiniband/hw/hfi1/init.c
@@ -1958,7 +1958,6 @@ int hfi1_setup_eagerbufs(struct hfi1_ctxtdata *rcd)
 	for (idx = 0; idx < rcd->egrbufs.alloced; idx++) {
 		hfi1_put_tid(dd, rcd->eager_base + idx, PT_EAGER,
 			     rcd->egrbufs.rcvtids[idx].dma, order);
-		cond_resched();
 	}
 
 	return 0;
diff --git a/drivers/infiniband/hw/hfi1/ruc.c b/drivers/infiniband/hw/hfi1/ruc.c
index b0151b7293f5..35fa25211351 100644
--- a/drivers/infiniband/hw/hfi1/ruc.c
+++ b/drivers/infiniband/hw/hfi1/ruc.c
@@ -459,7 +459,6 @@ bool hfi1_schedule_send_yield(struct rvt_qp *qp, struct hfi1_pkt_state *ps,
 			return true;
 		}
 
-		cond_resched();
 		this_cpu_inc(*ps->ppd->dd->send_schedule);
 		ps->timeout = jiffies + ps->timeout_int;
 	}
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
index d82daff2d9bd..c76610422255 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
@@ -2985,7 +2985,10 @@ static int v2_wait_mbox_complete(struct hns_roce_dev *hr_dev, u32 timeout,
 			return -ETIMEDOUT;
 		}
 
-		cond_resched();
+		/* The timeout is in hundreds of msecs. Maybe this should be a
+		 * timed wait instead?
+		 */
+		cond_resched_stall();
 		ret = -EBUSY;
 	}
 
diff --git a/drivers/infiniband/hw/qib/qib_init.c b/drivers/infiniband/hw/qib/qib_init.c
index 33667becd52b..0d8e0abb5090 100644
--- a/drivers/infiniband/hw/qib/qib_init.c
+++ b/drivers/infiniband/hw/qib/qib_init.c
@@ -1674,7 +1674,6 @@ int qib_setup_eagerbufs(struct qib_ctxtdata *rcd)
 					  RCVHQ_RCV_TYPE_EAGER, pa);
 			pa += egrsize;
 		}
-		cond_resched(); /* don't hog the cpu */
 	}
 
 	return 0;
diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index 28e379c108bc..b0fb5a993bae 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -778,12 +778,11 @@ int rxe_qp_to_attr(struct rxe_qp *qp, struct ib_qp_attr *attr, int mask)
 	rxe_av_to_attr(&qp->alt_av, &attr->alt_ah_attr);
 
 	/* Applications that get this state typically spin on it.
-	 * Yield the processor
+	 * Giving up the spinlock will reschedule if needed.
 	 */
 	spin_lock_irqsave(&qp->state_lock, flags);
 	if (qp->attr.sq_draining) {
 		spin_unlock_irqrestore(&qp->state_lock, flags);
-		cond_resched();
 	} else {
 		spin_unlock_irqrestore(&qp->state_lock, flags);
 	}
diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
index 1501120d4f52..692f57fdfdc9 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.c
+++ b/drivers/infiniband/sw/rxe/rxe_task.c
@@ -227,7 +227,7 @@ void rxe_cleanup_task(struct rxe_task *task)
 	 * for the previously scheduled tasks to finish.
 	 */
 	while (!is_done(task))
-		cond_resched();
+		cond_resched_stall();
 
 	spin_lock_irqsave(&task->lock, flags);
 	task->state = TASK_STATE_INVALID;
@@ -289,7 +289,7 @@ void rxe_disable_task(struct rxe_task *task)
 	spin_unlock_irqrestore(&task->lock, flags);
 
 	while (!is_done(task))
-		cond_resched();
+		cond_resched_stall();
 
 	spin_lock_irqsave(&task->lock, flags);
 	task->state = TASK_STATE_DRAINED;
diff --git a/drivers/input/evdev.c b/drivers/input/evdev.c
index 95f90699d2b1..effbc991be41 100644
--- a/drivers/input/evdev.c
+++ b/drivers/input/evdev.c
@@ -529,7 +529,6 @@ static ssize_t evdev_write(struct file *file, const char __user *buffer,
 
 		input_inject_event(&evdev->handle,
 				   event.type, event.code, event.value);
-		cond_resched();
 	}
 
  out:
diff --git a/drivers/input/keyboard/clps711x-keypad.c b/drivers/input/keyboard/clps711x-keypad.c
index 4c1a3e611edd..e02f6d35ed51 100644
--- a/drivers/input/keyboard/clps711x-keypad.c
+++ b/drivers/input/keyboard/clps711x-keypad.c
@@ -52,7 +52,7 @@ static void clps711x_keypad_poll(struct input_dev *input)
 			/* Read twice for protection against fluctuations */
 			do {
 				state = gpiod_get_value_cansleep(data->desc);
-				cond_resched();
+				cond_resched_stall();
 				state1 = gpiod_get_value_cansleep(data->desc);
 			} while (state != state1);
 
diff --git a/drivers/input/misc/uinput.c b/drivers/input/misc/uinput.c
index d98212d55108..a6c95916ac7e 100644
--- a/drivers/input/misc/uinput.c
+++ b/drivers/input/misc/uinput.c
@@ -624,7 +624,6 @@ static ssize_t uinput_inject_events(struct uinput_device *udev,
 
 		input_event(udev->dev, ev.type, ev.code, ev.value);
 		bytes += input_event_size();
-		cond_resched();
 	}
 
 	return bytes;
diff --git a/drivers/input/mousedev.c b/drivers/input/mousedev.c
index 505c562a5daa..7ce9ffca6d12 100644
--- a/drivers/input/mousedev.c
+++ b/drivers/input/mousedev.c
@@ -704,7 +704,6 @@ static ssize_t mousedev_write(struct file *file, const char __user *buffer,
 		mousedev_generate_response(client, c);
 
 		spin_unlock_irq(&client->packet_lock);
-		cond_resched();
 	}
 
 	kill_fasync(&client->fasync, SIGIO, POLL_IN);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index bd0a596f9863..8f517a80a831 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1582,8 +1582,6 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
 			for (i = 0; i < ARRAY_SIZE(evt); ++i)
 				dev_info(smmu->dev, "\t0x%016llx\n",
 					 (unsigned long long)evt[i]);
-
-			cond_resched();
 		}
 
 		/*
diff --git a/drivers/media/i2c/vpx3220.c b/drivers/media/i2c/vpx3220.c
index 1eaae886f217..c673dba9a592 100644
--- a/drivers/media/i2c/vpx3220.c
+++ b/drivers/media/i2c/vpx3220.c
@@ -81,9 +81,6 @@ static int vpx3220_fp_status(struct v4l2_subdev *sd)
 			return 0;
 
 		udelay(10);
-
-		if (need_resched())
-			cond_resched();
 	}
 
 	return -1;
diff --git a/drivers/media/pci/cobalt/cobalt-i2c.c b/drivers/media/pci/cobalt/cobalt-i2c.c
index 10c9ee33f73e..2a11dd49559a 100644
--- a/drivers/media/pci/cobalt/cobalt-i2c.c
+++ b/drivers/media/pci/cobalt/cobalt-i2c.c
@@ -140,7 +140,7 @@ static int cobalt_tx_bytes(struct cobalt_i2c_regs __iomem *regs,
 		while (status & M00018_SR_BITMAP_TIP_MSK) {
 			if (time_after(jiffies, start_time + adap->timeout))
 				return -ETIMEDOUT;
-			cond_resched();
+			cond_resched_stall();
 			status = ioread8(&regs->cr_sr);
 		}
 
@@ -199,7 +199,7 @@ static int cobalt_rx_bytes(struct cobalt_i2c_regs __iomem *regs,
 		while (status & M00018_SR_BITMAP_TIP_MSK) {
 			if (time_after(jiffies, start_time + adap->timeout))
 				return -ETIMEDOUT;
-			cond_resched();
+			cond_resched_stall();
 			status = ioread8(&regs->cr_sr);
 		}
 
diff --git a/drivers/misc/bcm-vk/bcm_vk_dev.c b/drivers/misc/bcm-vk/bcm_vk_dev.c
index d4a96137728d..d262e4c5b4e3 100644
--- a/drivers/misc/bcm-vk/bcm_vk_dev.c
+++ b/drivers/misc/bcm-vk/bcm_vk_dev.c
@@ -364,8 +364,7 @@ static inline int bcm_vk_wait(struct bcm_vk *vk, enum pci_barno bar,
 		if (time_after(jiffies, timeout))
 			return -ETIMEDOUT;
 
-		cpu_relax();
-		cond_resched();
+		cond_resched_stall();
 	} while ((rd_val & mask) != value);
 
 	return 0;
diff --git a/drivers/misc/bcm-vk/bcm_vk_msg.c b/drivers/misc/bcm-vk/bcm_vk_msg.c
index e17d81231ea6..1b5a71382e76 100644
--- a/drivers/misc/bcm-vk/bcm_vk_msg.c
+++ b/drivers/misc/bcm-vk/bcm_vk_msg.c
@@ -1295,8 +1295,7 @@ int bcm_vk_release(struct inode *inode, struct file *p_file)
 			break;
 		}
 		dma_cnt = atomic_read(&ctx->dma_cnt);
-		cpu_relax();
-		cond_resched();
+		cond_resched_stall();
 	} while (dma_cnt);
 	dev_dbg(dev, "Draining for [fd-%d] pid %d - delay %d ms\n",
 		ctx->idx, pid, jiffies_to_msecs(jiffies - start_time));
diff --git a/drivers/misc/genwqe/card_base.c b/drivers/misc/genwqe/card_base.c
index 224a7e97cbea..03ed8a426d49 100644
--- a/drivers/misc/genwqe/card_base.c
+++ b/drivers/misc/genwqe/card_base.c
@@ -1004,7 +1004,6 @@ static int genwqe_health_thread(void *data)
 		}
 
 		cd->last_gfir = gfir;
-		cond_resched();
 	}
 
 	return 0;
@@ -1041,7 +1040,7 @@ static int genwqe_health_thread(void *data)
 
 	/* genwqe_bus_reset failed(). Now wait for genwqe_remove(). */
 	while (!kthread_should_stop())
-		cond_resched();
+		cond_resched_stall();
 
 	return -EIO;
 }
diff --git a/drivers/misc/genwqe/card_ddcb.c b/drivers/misc/genwqe/card_ddcb.c
index 500b1feaf1f6..793faf4bdc06 100644
--- a/drivers/misc/genwqe/card_ddcb.c
+++ b/drivers/misc/genwqe/card_ddcb.c
@@ -1207,12 +1207,6 @@ static int genwqe_card_thread(void *data)
 		}
 		if (should_stop)
 			break;
-
-		/*
-		 * Avoid soft lockups on heavy loads; we do not want
-		 * to disable our interrupts.
-		 */
-		cond_resched();
 	}
 	return 0;
 }
diff --git a/drivers/misc/genwqe/card_dev.c b/drivers/misc/genwqe/card_dev.c
index 55fc5b80e649..ec1112dc7d5a 100644
--- a/drivers/misc/genwqe/card_dev.c
+++ b/drivers/misc/genwqe/card_dev.c
@@ -1322,7 +1322,6 @@ static int genwqe_inform_and_stop_processes(struct genwqe_dev *cd)
 			     genwqe_open_files(cd); i++) {
 			dev_info(&pci_dev->dev, "  %d sec ...", i);
 
-			cond_resched();
 			msleep(1000);
 		}
 
@@ -1340,7 +1339,6 @@ static int genwqe_inform_and_stop_processes(struct genwqe_dev *cd)
 				     genwqe_open_files(cd); i++) {
 				dev_warn(&pci_dev->dev, "  %d sec ...", i);
 
-				cond_resched();
 				msleep(1000);
 			}
 		}
diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index 9ce9b9e0e9b6..7cf977e70935 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -1158,8 +1158,6 @@ static void vmballoon_inflate(struct vmballoon *b)
 			vmballoon_split_refused_pages(&ctl);
 			ctl.page_size--;
 		}
-
-		cond_resched();
 	}
 
 	/*
@@ -1282,8 +1280,6 @@ static unsigned long vmballoon_deflate(struct vmballoon *b, uint64_t n_frames,
 				break;
 			ctl.page_size++;
 		}
-
-		cond_resched();
 	}
 
 	return deflated_frames;
diff --git a/drivers/mmc/host/mmc_spi.c b/drivers/mmc/host/mmc_spi.c
index cc333ad67cac..e05d99437547 100644
--- a/drivers/mmc/host/mmc_spi.c
+++ b/drivers/mmc/host/mmc_spi.c
@@ -192,9 +192,6 @@ static int mmc_spi_skip(struct mmc_spi_host *host, unsigned long timeout,
 			if (cp[i] != byte)
 				return cp[i];
 		}
-
-		/* If we need long timeouts, we may release the CPU */
-		cond_resched();
 	} while (time_is_after_jiffies(start + timeout));
 	return -ETIMEDOUT;
 }
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index d5593b0dc700..5e97555db441 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -435,7 +435,6 @@ static int btt_map_init(struct arena_info *arena)
 
 		offset += size;
 		mapsize -= size;
-		cond_resched();
 	}
 
  free:
@@ -479,7 +478,6 @@ static int btt_log_init(struct arena_info *arena)
 
 		offset += size;
 		logsize -= size;
-		cond_resched();
 	}
 
 	for (i = 0; i < arena->nfree; i++) {
diff --git a/drivers/nvme/target/zns.c b/drivers/nvme/target/zns.c
index 5b5c1e481722..12eee9a87e42 100644
--- a/drivers/nvme/target/zns.c
+++ b/drivers/nvme/target/zns.c
@@ -432,8 +432,6 @@ static u16 nvmet_bdev_zone_mgmt_emulate_all(struct nvmet_req *req)
 				zsa_req_op(req->cmd->zms.zsa) | REQ_SYNC,
 				GFP_KERNEL);
 			bio->bi_iter.bi_sector = sector;
-			/* This may take a while, so be nice to others */
-			cond_resched();
 		}
 		sector += bdev_zone_sectors(bdev);
 	}
diff --git a/drivers/parport/parport_ip32.c b/drivers/parport/parport_ip32.c
index 0919ed99ba94..8c52008bbb7c 100644
--- a/drivers/parport/parport_ip32.c
+++ b/drivers/parport/parport_ip32.c
@@ -1238,7 +1238,6 @@ static size_t parport_ip32_epp_write_addr(struct parport *p, const void *buf,
 static unsigned int parport_ip32_fifo_wait_break(struct parport *p,
 						 unsigned long expire)
 {
-	cond_resched();
 	if (time_after(jiffies, expire)) {
 		pr_debug1(PPIP32 "%s: FIFO write timed out\n", p->name);
 		return 1;
diff --git a/drivers/parport/parport_pc.c b/drivers/parport/parport_pc.c
index 1f236aaf7867..a482b5b835ec 100644
--- a/drivers/parport/parport_pc.c
+++ b/drivers/parport/parport_pc.c
@@ -663,8 +663,6 @@ static size_t parport_pc_fifo_write_block_dma(struct parport *port,
 		}
 		/* Is serviceIntr set? */
 		if (!(inb(ECONTROL(port)) & (1<<2))) {
-			cond_resched();
-
 			goto false_alarm;
 		}
 
@@ -674,8 +672,6 @@ static size_t parport_pc_fifo_write_block_dma(struct parport *port,
 		count = get_dma_residue(port->dma);
 		release_dma_lock(dmaflag);
 
-		cond_resched(); /* Can't yield the port. */
-
 		/* Anyone else waiting for the port? */
 		if (port->waithead) {
 			printk(KERN_DEBUG "Somebody wants the port\n");
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index d9eede2dbc0e..e7bb03c3c148 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -719,7 +719,6 @@ static ssize_t pci_read_config(struct file *filp, struct kobject *kobj,
 		data[off - init_off + 3] = (val >> 24) & 0xff;
 		off += 4;
 		size -= 4;
-		cond_resched();
 	}
 
 	if (size >= 2) {
diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c
index f967709082d6..7d3cd2201e64 100644
--- a/drivers/pci/proc.c
+++ b/drivers/pci/proc.c
@@ -83,7 +83,6 @@ static ssize_t proc_bus_pci_read(struct file *file, char __user *buf,
 		buf += 4;
 		pos += 4;
 		cnt -= 4;
-		cond_resched();
 	}
 
 	if (cnt >= 2) {
diff --git a/drivers/platform/x86/intel/speed_select_if/isst_if_mbox_pci.c b/drivers/platform/x86/intel/speed_select_if/isst_if_mbox_pci.c
index df1fc6c719f3..c202ae0d0656 100644
--- a/drivers/platform/x86/intel/speed_select_if/isst_if_mbox_pci.c
+++ b/drivers/platform/x86/intel/speed_select_if/isst_if_mbox_pci.c
@@ -56,7 +56,7 @@ static int isst_if_mbox_cmd(struct pci_dev *pdev,
 			ret = -EBUSY;
 			tm_delta = ktime_us_delta(ktime_get(), tm);
 			if (tm_delta > OS_MAILBOX_TIMEOUT_AVG_US)
-				cond_resched();
+				cond_resched_stall();
 			continue;
 		}
 		ret = 0;
@@ -95,7 +95,7 @@ static int isst_if_mbox_cmd(struct pci_dev *pdev,
 			ret = -EBUSY;
 			tm_delta = ktime_us_delta(ktime_get(), tm);
 			if (tm_delta > OS_MAILBOX_TIMEOUT_AVG_US)
-				cond_resched();
+				cond_resched_stall();
 			continue;
 		}
 
diff --git a/drivers/s390/cio/css.c b/drivers/s390/cio/css.c
index 3ff46fc694f8..6122a4a057fa 100644
--- a/drivers/s390/cio/css.c
+++ b/drivers/s390/cio/css.c
@@ -659,11 +659,6 @@ static int slow_eval_known_fn(struct subchannel *sch, void *data)
 		rc = css_evaluate_known_subchannel(sch, 1);
 		if (rc == -EAGAIN)
 			css_schedule_eval(sch->schid);
-		/*
-		 * The loop might take long time for platforms with lots of
-		 * known devices. Allow scheduling here.
-		 */
-		cond_resched();
 	}
 	return 0;
 }
@@ -695,9 +690,6 @@ static int slow_eval_unknown_fn(struct subchannel_id schid, void *data)
 		default:
 			rc = 0;
 		}
-		/* Allow scheduling here since the containing loop might
-		 * take a while.  */
-		cond_resched();
 	}
 	return rc;
 }
diff --git a/drivers/scsi/NCR5380.c b/drivers/scsi/NCR5380.c
index cea3a79d538e..40e66afd77cf 100644
--- a/drivers/scsi/NCR5380.c
+++ b/drivers/scsi/NCR5380.c
@@ -738,8 +738,6 @@ static void NCR5380_main(struct work_struct *work)
 			maybe_release_dma_irq(instance);
 		}
 		spin_unlock_irq(&hostdata->lock);
-		if (!done)
-			cond_resched();
 	} while (!done);
 }
 
diff --git a/drivers/scsi/megaraid.c b/drivers/scsi/megaraid.c
index e92f1a73cc9b..675504f8149a 100644
--- a/drivers/scsi/megaraid.c
+++ b/drivers/scsi/megaraid.c
@@ -1696,7 +1696,6 @@ __mega_busywait_mbox (adapter_t *adapter)
 		if (!mbox->m_in.busy)
 			return 0;
 		udelay(100);
-		cond_resched();
 	}
 	return -1;		/* give up after 1 second */
 }
diff --git a/drivers/scsi/qedi/qedi_main.c b/drivers/scsi/qedi/qedi_main.c
index cd0180b1f5b9..9e2596199458 100644
--- a/drivers/scsi/qedi/qedi_main.c
+++ b/drivers/scsi/qedi/qedi_main.c
@@ -1943,7 +1943,6 @@ static int qedi_percpu_io_thread(void *arg)
 				if (!work->is_solicited)
 					kfree(work);
 			}
-			cond_resched();
 			spin_lock_irqsave(&p->p_work_lock, flags);
 		}
 		set_current_state(TASK_INTERRUPTIBLE);
diff --git a/drivers/scsi/qla2xxx/qla_nx.c b/drivers/scsi/qla2xxx/qla_nx.c
index 6dfb70edb9a6..e1a5c2dbe134 100644
--- a/drivers/scsi/qla2xxx/qla_nx.c
+++ b/drivers/scsi/qla2xxx/qla_nx.c
@@ -972,7 +972,6 @@ qla82xx_flash_wait_write_finish(struct qla_hw_data *ha)
 		if (ret < 0 || (val & 1) == 0)
 			return ret;
 		udelay(10);
-		cond_resched();
 	}
 	ql_log(ql_log_warn, vha, 0xb00d,
 	       "Timeout reached waiting for write finish.\n");
@@ -1037,7 +1036,6 @@ ql82xx_rom_lock_d(struct qla_hw_data *ha)
 
 	while ((qla82xx_rom_lock(ha) != 0) && (loops < 50000)) {
 		udelay(100);
-		cond_resched();
 		loops++;
 	}
 	if (loops >= 50000) {
diff --git a/drivers/scsi/qla2xxx/qla_sup.c b/drivers/scsi/qla2xxx/qla_sup.c
index c092a6b1ced4..40fc521ba89f 100644
--- a/drivers/scsi/qla2xxx/qla_sup.c
+++ b/drivers/scsi/qla2xxx/qla_sup.c
@@ -463,7 +463,6 @@ qla24xx_read_flash_dword(struct qla_hw_data *ha, uint32_t addr, uint32_t *data)
 			return QLA_SUCCESS;
 		}
 		udelay(10);
-		cond_resched();
 	}
 
 	ql_log(ql_log_warn, pci_get_drvdata(ha->pdev), 0x7090,
@@ -505,7 +504,6 @@ qla24xx_write_flash_dword(struct qla_hw_data *ha, uint32_t addr, uint32_t data)
 		if (!(rd_reg_dword(&reg->flash_addr) & FARX_DATA_FLAG))
 			return QLA_SUCCESS;
 		udelay(10);
-		cond_resched();
 	}
 
 	ql_log(ql_log_warn, pci_get_drvdata(ha->pdev), 0x7090,
@@ -2151,7 +2149,6 @@ qla2x00_poll_flash(struct qla_hw_data *ha, uint32_t addr, uint8_t poll_data,
 		}
 		udelay(10);
 		barrier();
-		cond_resched();
 	}
 	return status;
 }
@@ -2301,7 +2298,6 @@ qla2x00_read_flash_data(struct qla_hw_data *ha, uint8_t *tmp_buf,
 		if (saddr % 100)
 			udelay(10);
 		*tmp_buf = data;
-		cond_resched();
 	}
 }
 
@@ -2589,7 +2585,6 @@ qla2x00_write_optrom_data(struct scsi_qla_host *vha, void *buf,
 				rval = QLA_FUNCTION_FAILED;
 				break;
 			}
-			cond_resched();
 		}
 	} while (0);
 	qla2x00_flash_disable(ha);
diff --git a/drivers/scsi/qla4xxx/ql4_nx.c b/drivers/scsi/qla4xxx/ql4_nx.c
index 47adff9f0506..e40a525a2202 100644
--- a/drivers/scsi/qla4xxx/ql4_nx.c
+++ b/drivers/scsi/qla4xxx/ql4_nx.c
@@ -3643,7 +3643,6 @@ qla4_82xx_read_flash_data(struct scsi_qla_host *ha, uint32_t *dwptr,
 	int loops = 0;
 	while ((qla4_82xx_rom_lock(ha) != 0) && (loops < 50000)) {
 		udelay(100);
-		cond_resched();
 		loops++;
 	}
 	if (loops >= 50000) {
diff --git a/drivers/scsi/xen-scsifront.c b/drivers/scsi/xen-scsifront.c
index 9ec55ddc1204..6f8e0c69f832 100644
--- a/drivers/scsi/xen-scsifront.c
+++ b/drivers/scsi/xen-scsifront.c
@@ -442,7 +442,7 @@ static irqreturn_t scsifront_irq_fn(int irq, void *dev_id)
 
 	while (scsifront_cmd_done(info, &eoiflag))
 		/* Yield point for this unbounded loop. */
-		cond_resched();
+		cond_resched_stall();
 
 	xen_irq_lateeoi(irq, eoiflag);
 
diff --git a/drivers/spi/spi-lantiq-ssc.c b/drivers/spi/spi-lantiq-ssc.c
index 938e9e577e4f..151b381fc098 100644
--- a/drivers/spi/spi-lantiq-ssc.c
+++ b/drivers/spi/spi-lantiq-ssc.c
@@ -775,8 +775,7 @@ static void lantiq_ssc_bussy_work(struct work_struct *work)
 			spi_finalize_current_transfer(spi->host);
 			return;
 		}
-
-		cond_resched();
+		cond_resched_stall();
 	} while (!time_after_eq(jiffies, end));
 
 	if (spi->host->cur_msg)
diff --git a/drivers/spi/spi-meson-spifc.c b/drivers/spi/spi-meson-spifc.c
index 06626f406f68..ff3550ebb22b 100644
--- a/drivers/spi/spi-meson-spifc.c
+++ b/drivers/spi/spi-meson-spifc.c
@@ -100,7 +100,7 @@ static int meson_spifc_wait_ready(struct meson_spifc *spifc)
 		regmap_read(spifc->regmap, REG_SLAVE, &data);
 		if (data & SLAVE_TRST_DONE)
 			return 0;
-		cond_resched();
+		cond_resched_stall();
 	} while (!time_after(jiffies, deadline));
 
 	return -ETIMEDOUT;
diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
index 8d6304cb061e..3ddbfa9babdc 100644
--- a/drivers/spi/spi.c
+++ b/drivers/spi/spi.c
@@ -1808,7 +1808,7 @@ static void __spi_pump_messages(struct spi_controller *ctlr, bool in_kthread)
 
 	/* Prod the scheduler in case transfer_one() was busy waiting */
 	if (!ret)
-		cond_resched();
+		cond_resched_stall();
 	return;
 
 out_unlock:
diff --git a/drivers/staging/rtl8723bs/core/rtw_mlme_ext.c b/drivers/staging/rtl8723bs/core/rtw_mlme_ext.c
index 985683767a40..2a2ebdf12a45 100644
--- a/drivers/staging/rtl8723bs/core/rtw_mlme_ext.c
+++ b/drivers/staging/rtl8723bs/core/rtw_mlme_ext.c
@@ -3775,7 +3775,7 @@ unsigned int send_beacon(struct adapter *padapter)
 		issue_beacon(padapter, 100);
 		issue++;
 		do {
-			cond_resched();
+			cond_resched_stall();
 			rtw_hal_get_hwreg(padapter, HW_VAR_BCN_VALID, (u8 *)(&bxmitok));
 			poll++;
 		} while ((poll%10) != 0 && false == bxmitok && !padapter->bSurpriseRemoved && !padapter->bDriverStopped);
diff --git a/drivers/staging/rtl8723bs/core/rtw_pwrctrl.c b/drivers/staging/rtl8723bs/core/rtw_pwrctrl.c
index a392d5b4caf2..c263fbc71201 100644
--- a/drivers/staging/rtl8723bs/core/rtw_pwrctrl.c
+++ b/drivers/staging/rtl8723bs/core/rtw_pwrctrl.c
@@ -576,8 +576,6 @@ void LPS_Leave_check(struct adapter *padapter)
 	bReady = false;
 	start_time = jiffies;
 
-	cond_resched();
-
 	while (1) {
 		mutex_lock(&pwrpriv->lock);
 
diff --git a/drivers/tee/optee/ffa_abi.c b/drivers/tee/optee/ffa_abi.c
index 0828240f27e6..49f55c051d71 100644
--- a/drivers/tee/optee/ffa_abi.c
+++ b/drivers/tee/optee/ffa_abi.c
@@ -581,7 +581,6 @@ static int optee_ffa_yielding_call(struct tee_context *ctx,
 		 * filled in by ffa_mem_ops->sync_send_receive() returning
 		 * above.
 		 */
-		cond_resched();
 		optee_handle_ffa_rpc(ctx, optee, data->data1, rpc_arg);
 		cmd = OPTEE_FFA_YIELDING_CALL_RESUME;
 		data->data0 = cmd;
diff --git a/drivers/tee/optee/smc_abi.c b/drivers/tee/optee/smc_abi.c
index d5b28fd35d66..86e01454422c 100644
--- a/drivers/tee/optee/smc_abi.c
+++ b/drivers/tee/optee/smc_abi.c
@@ -943,7 +943,6 @@ static int optee_smc_do_call_with_arg(struct tee_context *ctx,
 			 */
 			optee_cq_wait_for_completion(&optee->call_queue, &w);
 		} else if (OPTEE_SMC_RETURN_IS_RPC(res.a0)) {
-			cond_resched();
 			param.a0 = res.a0;
 			param.a1 = res.a1;
 			param.a2 = res.a2;
diff --git a/drivers/tty/hvc/hvc_console.c b/drivers/tty/hvc/hvc_console.c
index 959fae54ca39..11bb4204b78d 100644
--- a/drivers/tty/hvc/hvc_console.c
+++ b/drivers/tty/hvc/hvc_console.c
@@ -538,7 +538,6 @@ static ssize_t hvc_write(struct tty_struct *tty, const u8 *buf, size_t count)
 		if (count) {
 			if (hp->n_outbuf > 0)
 				hvc_flush(hp);
-			cond_resched();
 		}
 	}
 
@@ -653,7 +652,7 @@ static int __hvc_poll(struct hvc_struct *hp, bool may_sleep)
 
 	if (may_sleep) {
 		spin_unlock_irqrestore(&hp->lock, flags);
-		cond_resched();
+
 		spin_lock_irqsave(&hp->lock, flags);
 	}
 
@@ -725,7 +724,7 @@ static int __hvc_poll(struct hvc_struct *hp, bool may_sleep)
 	if (may_sleep) {
 		/* Keep going until the flip is full */
 		spin_unlock_irqrestore(&hp->lock, flags);
-		cond_resched();
+
 		spin_lock_irqsave(&hp->lock, flags);
 		goto read_again;
 	} else if (read_total < HVC_ATOMIC_READ_MAX) {
@@ -802,7 +801,6 @@ static int khvcd(void *unused)
 			mutex_lock(&hvc_structs_mutex);
 			list_for_each_entry(hp, &hvc_structs, next) {
 				poll_mask |= __hvc_poll(hp, true);
-				cond_resched();
 			}
 			mutex_unlock(&hvc_structs_mutex);
 		} else
diff --git a/drivers/tty/tty_buffer.c b/drivers/tty/tty_buffer.c
index 5f6d0cf67571..c70d695ed69d 100644
--- a/drivers/tty/tty_buffer.c
+++ b/drivers/tty/tty_buffer.c
@@ -498,9 +498,6 @@ static void flush_to_ldisc(struct work_struct *work)
 			lookahead_bufs(port, head);
 		if (!rcvd)
 			break;
-
-		if (need_resched())
-			cond_resched();
 	}
 
 	mutex_unlock(&buf->lock);
diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 8a94e5a43c6d..0221ff17a4bf 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -1032,7 +1032,6 @@ static ssize_t iterate_tty_write(struct tty_ldisc *ld, struct tty_struct *tty,
 		ret = -ERESTARTSYS;
 		if (signal_pending(current))
 			break;
-		cond_resched();
 	}
 	if (written) {
 		tty_update_time(tty, true);
diff --git a/drivers/usb/gadget/udc/max3420_udc.c b/drivers/usb/gadget/udc/max3420_udc.c
index 2d57786d3db7..b9051c341b10 100644
--- a/drivers/usb/gadget/udc/max3420_udc.c
+++ b/drivers/usb/gadget/udc/max3420_udc.c
@@ -451,7 +451,6 @@ static void __max3420_start(struct max3420_udc *udc)
 		val = spi_rd8(udc, MAX3420_REG_USBIRQ);
 		if (val & OSCOKIRQ)
 			break;
-		cond_resched();
 	}
 
 	/* Enable PULL-UP only when Vbus detected */
diff --git a/drivers/usb/host/max3421-hcd.c b/drivers/usb/host/max3421-hcd.c
index d152d72de126..64f12f5113a2 100644
--- a/drivers/usb/host/max3421-hcd.c
+++ b/drivers/usb/host/max3421-hcd.c
@@ -1294,7 +1294,7 @@ max3421_reset_hcd(struct usb_hcd *hcd)
 				"timed out waiting for oscillator OK signal");
 			return 1;
 		}
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	/*
diff --git a/drivers/usb/host/xen-hcd.c b/drivers/usb/host/xen-hcd.c
index 46fdab940092..0b78f371c30a 100644
--- a/drivers/usb/host/xen-hcd.c
+++ b/drivers/usb/host/xen-hcd.c
@@ -1086,7 +1086,7 @@ static irqreturn_t xenhcd_int(int irq, void *dev_id)
 	while (xenhcd_urb_request_done(info, &eoiflag) |
 	       xenhcd_conn_notify(info, &eoiflag))
 		/* Yield point for this unbounded loop. */
-		cond_resched();
+		cond_resched_stall();
 
 	xen_irq_lateeoi(irq, eoiflag);
 	return IRQ_HANDLED;
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index a94ec6225d31..523c6685818d 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -457,8 +457,6 @@ static int tce_iommu_clear(struct tce_container *container,
 			}
 		}
 
-		cond_resched();
-
 		direction = DMA_NONE;
 		oldhpa = 0;
 		ret = iommu_tce_xchg_no_kill(container->mm, tbl, entry, &oldhpa,
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index eacd6ec04de5..afc9724051ce 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -962,8 +962,6 @@ static long vfio_sync_unpin(struct vfio_dma *dma, struct vfio_domain *domain,
 		kfree(entry);
 	}
 
-	cond_resched();
-
 	return unlocked;
 }
 
@@ -1029,7 +1027,6 @@ static size_t unmap_unpin_slow(struct vfio_domain *domain,
 						     unmapped >> PAGE_SHIFT,
 						     false);
 		*iova += unmapped;
-		cond_resched();
 	}
 	return unmapped;
 }
@@ -1062,7 +1059,6 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 
 	list_for_each_entry_continue(d, &iommu->domain_list, next) {
 		iommu_unmap(d->domain, dma->iova, dma->size);
-		cond_resched();
 	}
 
 	iommu_iotlb_gather_init(&iotlb_gather);
@@ -1439,8 +1435,6 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
 				GFP_KERNEL);
 		if (ret)
 			goto unwind;
-
-		cond_resched();
 	}
 
 	return 0;
@@ -1448,7 +1442,6 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
 unwind:
 	list_for_each_entry_continue_reverse(d, &iommu->domain_list, next) {
 		iommu_unmap(d->domain, iova, npage << PAGE_SHIFT);
-		cond_resched();
 	}
 
 	return ret;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0c181ad17e3..8939be49c47d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -410,7 +410,6 @@ static bool vhost_worker(void *data)
 			kcov_remote_start_common(worker->kcov_handle);
 			work->fn(work);
 			kcov_remote_stop();
-			cond_resched();
 		}
 	}
 
diff --git a/drivers/video/console/vgacon.c b/drivers/video/console/vgacon.c
index 7ad047bcae17..e17e7937e11d 100644
--- a/drivers/video/console/vgacon.c
+++ b/drivers/video/console/vgacon.c
@@ -870,12 +870,10 @@ static int vgacon_do_font_op(struct vgastate *state, char *arg, int set,
 		if (set)
 			for (i = 0; i < cmapsz; i++) {
 				vga_writeb(arg[i], charmap + i);
-				cond_resched();
 			}
 		else
 			for (i = 0; i < cmapsz; i++) {
 				arg[i] = vga_readb(charmap + i);
-				cond_resched();
 			}
 
 		/*
@@ -889,12 +887,10 @@ static int vgacon_do_font_op(struct vgastate *state, char *arg, int set,
 			if (set)
 				for (i = 0; i < cmapsz; i++) {
 					vga_writeb(arg[i], charmap + i);
-					cond_resched();
 				}
 			else
 				for (i = 0; i < cmapsz; i++) {
 					arg[i] = vga_readb(charmap + i);
-					cond_resched();
 				}
 		}
 	}
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index fa5226c198cc..c9c66aac49ca 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1754,7 +1754,6 @@ static int virtio_mem_sbm_plug_request(struct virtio_mem *vm, uint64_t diff)
 			rc = virtio_mem_sbm_plug_any_sb(vm, mb_id, &nb_sb);
 			if (rc || !nb_sb)
 				goto out_unlock;
-			cond_resched();
 		}
 	}
 
@@ -1772,7 +1771,6 @@ static int virtio_mem_sbm_plug_request(struct virtio_mem *vm, uint64_t diff)
 		rc = virtio_mem_sbm_plug_and_add_mb(vm, mb_id, &nb_sb);
 		if (rc || !nb_sb)
 			return rc;
-		cond_resched();
 	}
 
 	/* Try to prepare, plug and add new blocks */
@@ -1786,7 +1784,6 @@ static int virtio_mem_sbm_plug_request(struct virtio_mem *vm, uint64_t diff)
 		rc = virtio_mem_sbm_plug_and_add_mb(vm, mb_id, &nb_sb);
 		if (rc)
 			return rc;
-		cond_resched();
 	}
 
 	return 0;
@@ -1869,7 +1866,6 @@ static int virtio_mem_bbm_plug_request(struct virtio_mem *vm, uint64_t diff)
 			nb_bb--;
 		if (rc || !nb_bb)
 			return rc;
-		cond_resched();
 	}
 
 	/* Try to prepare, plug and add new big blocks */
@@ -1885,7 +1881,6 @@ static int virtio_mem_bbm_plug_request(struct virtio_mem *vm, uint64_t diff)
 			nb_bb--;
 		if (rc)
 			return rc;
-		cond_resched();
 	}
 
 	return 0;
@@ -2107,7 +2102,6 @@ static int virtio_mem_sbm_unplug_request(struct virtio_mem *vm, uint64_t diff)
 			if (rc || !nb_sb)
 				goto out_unlock;
 			mutex_unlock(&vm->hotplug_mutex);
-			cond_resched();
 			mutex_lock(&vm->hotplug_mutex);
 		}
 		if (!unplug_online && i == 1) {
@@ -2250,8 +2244,6 @@ static int virtio_mem_bbm_unplug_request(struct virtio_mem *vm, uint64_t diff)
 	 */
 	for (i = 0; i < 3; i++) {
 		virtio_mem_bbm_for_each_bb_rev(vm, bb_id, VIRTIO_MEM_BBM_BB_ADDED) {
-			cond_resched();
-
 			/*
 			 * As we're holding no locks, these checks are racy,
 			 * but we don't care.
-- 
2.31.1

Re: [RFC PATCH 85/86] treewide: drivers: remove cond_resched()
Posted by Dmitry Torokhov 2 years, 1 month ago
Hi Ankur,

On Tue, Nov 07, 2023 at 03:08:21PM -0800, Ankur Arora wrote:
> There are broadly three sets of uses of cond_resched():
> 
> 1.  Calls to cond_resched() out of the goodness of our heart,
>     otherwise known as avoiding lockup splats.

...

What about RCU stalls? The calls to cond_resched() in evdev.c and
mousedev.c were added specifically to allow RCU to run in cases when
userspace passes a large buffer and the kernel is not fully preemptable.

Thanks.

-- 
Dmitry
Re: [RFC PATCH 85/86] treewide: drivers: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
Dmitry Torokhov <dmitry.torokhov@gmail.com> writes:

> Hi Ankur,
>
> On Tue, Nov 07, 2023 at 03:08:21PM -0800, Ankur Arora wrote:
>> There are broadly three sets of uses of cond_resched():
>>
>> 1.  Calls to cond_resched() out of the goodness of our heart,
>>     otherwise known as avoiding lockup splats.
>
> ...
>
> What about RCU stalls? The calls to cond_resched() in evdev.c and
> mousedev.c were added specifically to allow RCU to run in cases when
> userspace passes a large buffer and the kernel is not fully preemptable.

Hi Dmitry,

The short answer is that even if the kernel isn't fully preemptible, it
will always have preempt-count enabled, which means that RCU will always
know when a read-side critical section ends.

Long version: cond_resched_rcu() is defined as:

 static inline void cond_resched_rcu(void)
 {
 #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
        rcu_read_unlock();
        cond_resched();
        rcu_read_lock();
 #endif
 }

So the relevant case is PREEMPT_RCU=n.

Now, currently PREEMPT_RCU=n also implies PREEMPT_COUNT=n, and so
rcu_read_lock()/_unlock() reduce to barriers. That's why we need the
explicit cond_resched() there.


The reason we can remove the cond_resched() after patches 43 and 47 is
because rcu_read_lock()/_unlock() will modify the preempt count and so
RCU will have visibility into when RCU read-side critical sections
finish.

That said, this series in this form isn't really going anywhere in the
short term, so none of this is imminent.

As for the calls to cond_resched(): if the kernel is fully preemptible,
they are a NOP, and then the code would be polling in a tight loop.

Would it make sense to do something like this instead?

     if (!cond_resched())
        msleep()/usleep()/cpu_relax();
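
For instance, for a deadline-based flash poll the pattern might look
roughly like the sketch below (FLASH_BUSY, reg and deadline are made-up
names, and the 50-100us backoff is only illustrative):

	while (time_before(jiffies, deadline)) {
		if (!(readl(reg) & FLASH_BUSY))
			return 0;
		/* If nothing was scheduled, back off instead of spinning. */
		if (!cond_resched())
			usleep_range(50, 100);
	}
	return -ETIMEDOUT;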


--
ankur
Re: [RFC PATCH 85/86] treewide: drivers: remove cond_resched()
Posted by Steven Rostedt 2 years, 1 month ago
On Thu, 9 Nov 2023 15:25:54 -0800
Dmitry Torokhov <dmitry.torokhov@gmail.com> wrote:

> Hi Ankur,
> 
> On Tue, Nov 07, 2023 at 03:08:21PM -0800, Ankur Arora wrote:
> > There are broadly three sets of uses of cond_resched():
> > 
> > 1.  Calls to cond_resched() out of the goodness of our heart,
> >     otherwise known as avoiding lockup splats.  
> 
> ...
> 
> What about RCU stalls? The calls to cond_resched() in evdev.c and
> mousedev.c were added specifically to allow RCU to run in cases when
> userspace passes a large buffer and the kernel is not fully preemptable.
> 

First, this patch is being sent out prematurely, as it depends on
acceptance of the previous patches.

When the previous patches are finished, we won't need cond_resched()
to protect against RCU stalls, because even "PREEMPT_NONE" will allow
preemption inside the kernel.

What the earlier patches do is introduce a concept of NEED_RESCHED_LAZY.
Then when the scheduler wants to resched the task, it will set that bit
instead of NEED_RESCHED (for the old PREEMPT_NONE version). For VOLUNTARY,
it sets the LAZY bit for SCHED_OTHER tasks but NEED_RESCHED for RT/DL
tasks. For PREEMPT, it will always set NEED_RESCHED.

NEED_RESCHED will always schedule, but NEED_RESCHED_LAZY only schedules
when going to user space.

Now, after one tick (depending on HZ that can be 1ms, 3.3ms, 4ms or 10ms),
if NEED_RESCHED_LAZY is still set, it will set NEED_RESCHED, forcing a
preemption at the next available moment (when the preempt count is zero).

This will be done even with the old PREEMPT_NONE configuration.
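
Concretely, the decision described above boils down to something like
the sketch below (p and curr are task_struct pointers; TIF_NEED_RESCHED_LAZY
and the exact decision points come from the earlier patches, so treat this
as a paraphrase of the policy rather than the real code):

	if (preempt_model_full())
		set_tsk_thread_flag(p, TIF_NEED_RESCHED);	/* preempt ASAP */
	else if (preempt_model_voluntary() && (rt_task(p) || dl_task(p)))
		set_tsk_thread_flag(p, TIF_NEED_RESCHED);	/* RT/DL stay eager */
	else
		set_tsk_thread_flag(p, TIF_NEED_RESCHED_LAZY);	/* defer to exit-to-user */

	/* On the scheduler tick, an unserviced lazy request gets upgraded: */
	if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY))
		set_tsk_thread_flag(curr, TIF_NEED_RESCHED);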

That way we will no longer play whack-a-mole, chasing down all the
long-running kernel paths by inserting cond_resched() in them.

-- Steve
Re: [RFC PATCH 85/86] treewide: drivers: remove cond_resched()
Posted by Chris Packham 2 years, 1 month ago
On 8/11/23 12:08, Ankur Arora wrote:
> There are broadly three sets of uses of cond_resched():
>
> 1.  Calls to cond_resched() out of the goodness of our heart,
>      otherwise known as avoiding lockup splats.
>
> 2.  Open coded variants of cond_resched_lock() which call
>      cond_resched().
>
> 3.  Retry or error handling loops, where cond_resched() is used as a
>      quick alternative to spinning in a tight-loop.
>
> When running under a full preemption model, the cond_resched() reduces
> to a NOP (not even a barrier) so removing it obviously cannot matter.
>
> But considering only voluntary preemption models (for say code that
> has been mostly tested under those), for set-1 and set-2 the
> scheduler can now preempt kernel tasks running beyond their time
> quanta anywhere they are preemptible() [1]. Which removes any need
> for these explicitly placed scheduling points.
>
> The cond_resched() calls in set-3 are a little more difficult.
> To start with, given it's NOP character under full preemption, it
> never actually saved us from a tight loop.
> With voluntary preemption, it's not a NOP, but it might as well be --
> for most workloads the scheduler does not have an interminable supply
> of runnable tasks on the runqueue.
>
> So, cond_resched() is useful to not get softlockup splats, but not
> terribly good for error handling. Ideally, these should be replaced
> with some kind of timed or event wait.
> For now we use cond_resched_stall(), which tries to schedule if
> possible, and executes a cpu_relax() if not.
>
> The cond_resched() calls here are all kinds. Those from set-1
> or set-2 are quite straight-forward to handle.
>
> There are quite a few from set-3, where, as noted above, we
> use cond_resched() as if it were an amulet. Which I suppose
> it is, in that it wards off softlockup or RCU splats.
>
> Those are now cond_resched_stall(), but in most cases, given
> that the timeouts are in milliseconds, they could be easily
> timed waits.

For i2c-mpc.c:

It looks like the code in question could probably be converted to
readb_poll_timeout(). If I find sufficient round-tuits I might look at
that. Regardless, in the context of the tree-wide change ...
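
Roughly, such a conversion might look like the sketch below, using
readb_poll_timeout() from linux/iopoll.h (the register offset, flag and
timeout values here are placeholders, not the actual i2c-mpc.c names):

	u8 status;
	int ret;

	/* Poll every 10us, give up after 1ms; readb_poll_timeout() sleeps
	 * between reads, so no explicit scheduling point is needed. */
	ret = readb_poll_timeout(i2c->base + REG_STATUS, status,
				 !(status & STATUS_BUSY), 10, 1000);
	if (ret)
		return ret;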

Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz>

Re: [RFC PATCH 85/86] treewide: drivers: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
Chris Packham <Chris.Packham@alliedtelesis.co.nz> writes:

> On 8/11/23 12:08, Ankur Arora wrote:
>> There are broadly three sets of uses of cond_resched():
>>
>> 1.  Calls to cond_resched() out of the goodness of our heart,
>>      otherwise known as avoiding lockup splats.
>>
>> 2.  Open coded variants of cond_resched_lock() which call
>>      cond_resched().
>>
>> 3.  Retry or error handling loops, where cond_resched() is used as a
>>      quick alternative to spinning in a tight-loop.
>>
>> When running under a full preemption model, the cond_resched() reduces
>> to a NOP (not even a barrier) so removing it obviously cannot matter.
>>
>> But considering only voluntary preemption models (for say code that
>> has been mostly tested under those), for set-1 and set-2 the
>> scheduler can now preempt kernel tasks running beyond their time
>> quanta anywhere they are preemptible() [1]. Which removes any need
>> for these explicitly placed scheduling points.
>>
>> The cond_resched() calls in set-3 are a little more difficult.
>> To start with, given it's NOP character under full preemption, it
>> never actually saved us from a tight loop.
>> With voluntary preemption, it's not a NOP, but it might as well be --
>> for most workloads the scheduler does not have an interminable supply
>> of runnable tasks on the runqueue.
>>
>> So, cond_resched() is useful to not get softlockup splats, but not
>> terribly good for error handling. Ideally, these should be replaced
>> with some kind of timed or event wait.
>> For now we use cond_resched_stall(), which tries to schedule if
>> possible, and executes a cpu_relax() if not.
>>
>> The cond_resched() calls here are all kinds. Those from set-1
>> or set-2 are quite straight-forward to handle.
>>
>> There are quite a few from set-3, where, as noted above, we
>> use cond_resched() as if it were an amulet. Which I suppose
>> it is, in that it wards off softlockup or RCU splats.
>>
>> Those are now cond_resched_stall(), but in most cases, given
>> that the timeouts are in milliseconds, they could be easily
>> timed waits.
>
> For i2c-mpc.c:
>
> It looks like the code in question could probably be converted to
> readb_poll_timeout(). If I find sufficient round-tuits I might look at
> that. Regardless, in the context of the tree-wide change ...
>
> Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz>

Thanks Chris. It'll be a while before this lands.
I'll see if I can send a patch with cond_resched_stall() or similar
separately.

Meanwhile please feel free to make the readb_poll_timeout() change.

--
ankur
[RFC PATCH 86/86] sched: remove cond_resched()
Posted by Ankur Arora 2 years, 1 month ago
Now that we don't have any users of cond_resched() in the tree,
we can finally remove it.

Cc: Ingo Molnar <mingo@redhat.com> 
Cc: Peter Zijlstra <peterz@infradead.org> 
Cc: Juri Lelli <juri.lelli@redhat.com> 
Cc: Vincent Guittot <vincent.guittot@linaro.org> 
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/sched.h | 16 ++++------------
 kernel/sched/core.c   | 13 -------------
 2 files changed, 4 insertions(+), 25 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index bae6eed534dd..bbb981c1a142 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2083,19 +2083,11 @@ static inline bool test_tsk_need_resched_any(struct task_struct *tsk)
 }
 
 /*
- * cond_resched() and cond_resched_lock(): latency reduction via
- * explicit rescheduling in places that are safe. The return
- * value indicates whether a reschedule was done in fact.
- * cond_resched_lock() will drop the spinlock before scheduling,
+ * cond_resched_lock(): latency reduction via explicit rescheduling
+ * in places that are safe. The return value indicates whether a
+ * reschedule was done in fact.  cond_resched_lock() will drop the
+ * spinlock before scheduling.
  */
-#ifdef CONFIG_PREEMPTION
-static inline int _cond_resched(void) { return 0; }
-#endif
-
-#define cond_resched() ({			\
-	__might_resched(__FILE__, __LINE__, 0);	\
-	_cond_resched();			\
-})
 
 extern int __cond_resched_lock(spinlock_t *lock);
 extern int __cond_resched_rwlock_read(rwlock_t *lock);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 691b50791e04..6940893e3930 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8580,19 +8580,6 @@ SYSCALL_DEFINE0(sched_yield)
 	return 0;
 }
 
-#ifndef CONFIG_PREEMPTION
-int __sched _cond_resched(void)
-{
-	if (should_resched(0)) {
-		preempt_schedule_common();
-		return 1;
-	}
-
-	return 0;
-}
-EXPORT_SYMBOL(_cond_resched);
-#endif
-
 /*
  * __cond_resched_lock() - if a reschedule is pending, drop the given lock
  * (implicitly calling schedule), and reacquire the lock.
-- 
2.31.1