The inner loop in poll_idle() polls over the thread_info flags,
waiting to see if the thread has TIF_NEED_RESCHED set. The loop
exits once the condition is met, or if the poll time limit has
been exceeded.

To minimize the number of instructions executed in each iteration,
the time check is done only intermittently (once every
POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
executes cpu_relax() which on certain platforms provides a hint to
the pipeline that the loop busy-waits, allowing the processor to
reduce power consumption.

This is close to what smp_cond_load_relaxed_timeout() provides. So,
restructure the loop and fold the loop condition and the timeout check
in smp_cond_load_relaxed_timeout().

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
1 file changed, 8 insertions(+), 21 deletions(-)
diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index 9b6d90a72601..dc7f4b424fec 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -8,35 +8,22 @@
#include <linux/sched/clock.h>
#include <linux/sched/idle.h>
-#define POLL_IDLE_RELAX_COUNT 200
-
static int __cpuidle poll_idle(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index)
{
- u64 time_start;
-
- time_start = local_clock_noinstr();
+ u64 time_end;
+ u32 flags = 0;
dev->poll_time_limit = false;
+ time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
+
raw_local_irq_enable();
if (!current_set_polling_and_test()) {
- unsigned int loop_count = 0;
- u64 limit;
-
- limit = cpuidle_poll_time(drv, dev);
-
- while (!need_resched()) {
- cpu_relax();
- if (loop_count++ < POLL_IDLE_RELAX_COUNT)
- continue;
-
- loop_count = 0;
- if (local_clock_noinstr() - time_start > limit) {
- dev->poll_time_limit = true;
- break;
- }
- }
+ flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
+ (VAL & _TIF_NEED_RESCHED),
+ (local_clock_noinstr() >= time_end));
+ dev->poll_time_limit = !(flags & _TIF_NEED_RESCHED);
}
raw_local_irq_disable();
--
2.43.5
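
For readers following the thread, the pre-patch loop described in the commit message can be modelled in userspace roughly as below. This is only a sketch under stated assumptions, not kernel code: poll_model(), fake_clock(), RELAX_COUNT and NEED_RESCHED_BIT are hypothetical stand-ins for poll_idle(), local_clock_noinstr(), POLL_IDLE_RELAX_COUNT and _TIF_NEED_RESCHED, and the clock is injectable so the behaviour can be checked deterministically.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RELAX_COUNT 200        /* stands in for POLL_IDLE_RELAX_COUNT */
#define NEED_RESCHED_BIT 0x1u  /* stands in for _TIF_NEED_RESCHED */

/* Hypothetical injectable clock; the kernel uses local_clock_noinstr(). */
typedef uint64_t (*clock_fn)(void);

/* Fake clock for demonstration: advances 10 ticks on every read. */
static uint64_t fake_ticks;
static uint64_t fake_clock(void)
{
	return fake_ticks += 10;
}

/*
 * Model of the pre-patch loop. Returns true if the time limit was hit
 * (dev->poll_time_limit in the patch), false if the flag was seen first.
 */
static bool poll_model(_Atomic uint32_t *flags, clock_fn now, uint64_t limit)
{
	uint64_t time_start;
	unsigned int loop_count = 0;

	assert(flags && now);
	time_start = now();

	while (!(atomic_load_explicit(flags, memory_order_relaxed) &
		 NEED_RESCHED_BIT)) {
		/* cpu_relax() would be executed here in the kernel. */
		if (loop_count++ < RELAX_COUNT)
			continue;

		/* Only every (RELAX_COUNT + 1)-th iteration reads the clock. */
		loop_count = 0;
		if (now() - time_start > limit)
			return true;
	}
	return false;
}
```

With the fake clock above and a limit of 25 ticks, the model spins through a few hundred flag loads but reads the clock only four times (once at the start, three times in the loop) before reporting the timeout.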
On Tue, Oct 28, 2025 at 6:32 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> The inner loop in poll_idle() polls over the thread_info flags,
> waiting to see if the thread has TIF_NEED_RESCHED set. The loop
> exits once the condition is met, or if the poll time limit has
> been exceeded.
>
> To minimize the number of instructions executed in each iteration,
> the time check is done only intermittently (once every
> POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
> executes cpu_relax() which on certain platforms provides a hint to
> the pipeline that the loop busy-waits, allowing the processor to
> reduce power consumption.
>
> This is close to what smp_cond_load_relaxed_timeout() provides. So,
> restructure the loop and fold the loop condition and the timeout check
> in smp_cond_load_relaxed_timeout().
Well, it is close, but is it close enough?
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
> 1 file changed, 8 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> index 9b6d90a72601..dc7f4b424fec 100644
> --- a/drivers/cpuidle/poll_state.c
> +++ b/drivers/cpuidle/poll_state.c
> @@ -8,35 +8,22 @@
> #include <linux/sched/clock.h>
> #include <linux/sched/idle.h>
>
> -#define POLL_IDLE_RELAX_COUNT 200
> -
> static int __cpuidle poll_idle(struct cpuidle_device *dev,
> struct cpuidle_driver *drv, int index)
> {
> - u64 time_start;
> -
> - time_start = local_clock_noinstr();
> + u64 time_end;
> + u32 flags = 0;
>
> dev->poll_time_limit = false;
>
> + time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
Is there any particular reason for doing this unconditionally? If
not, then it looks like an arbitrary unrelated change to me.
> +
> raw_local_irq_enable();
> if (!current_set_polling_and_test()) {
> - unsigned int loop_count = 0;
> - u64 limit;
> -
> - limit = cpuidle_poll_time(drv, dev);
> -
> - while (!need_resched()) {
> - cpu_relax();
> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> - continue;
> -
> - loop_count = 0;
> - if (local_clock_noinstr() - time_start > limit) {
> - dev->poll_time_limit = true;
> - break;
> - }
> - }
> + flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
> + (VAL & _TIF_NEED_RESCHED),
> + (local_clock_noinstr() >= time_end));
So my understanding of this is that it reduces duplication with some
other places doing similar things. Fair enough.
However, since there is "timeout" in the name, I'd expect it to take
the timeout as an argument.
> + dev->poll_time_limit = !(flags & _TIF_NEED_RESCHED);
> }
> raw_local_irq_disable();
>
> --
Rafael J. Wysocki <rafael@kernel.org> writes:
> On Tue, Oct 28, 2025 at 6:32 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> The inner loop in poll_idle() polls over the thread_info flags,
>> waiting to see if the thread has TIF_NEED_RESCHED set. The loop
>> exits once the condition is met, or if the poll time limit has
>> been exceeded.
>>
>> To minimize the number of instructions executed in each iteration,
>> the time check is done only intermittently (once every
>> POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
>> executes cpu_relax() which on certain platforms provides a hint to
>> the pipeline that the loop busy-waits, allowing the processor to
>> reduce power consumption.
>>
>> This is close to what smp_cond_load_relaxed_timeout() provides. So,
>> restructure the loop and fold the loop condition and the timeout check
>> in smp_cond_load_relaxed_timeout().
>
> Well, it is close, but is it close enough?
I guess that's the question.
>> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
>> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>> drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
>> 1 file changed, 8 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
>> index 9b6d90a72601..dc7f4b424fec 100644
>> --- a/drivers/cpuidle/poll_state.c
>> +++ b/drivers/cpuidle/poll_state.c
>> @@ -8,35 +8,22 @@
>> #include <linux/sched/clock.h>
>> #include <linux/sched/idle.h>
>>
>> -#define POLL_IDLE_RELAX_COUNT 200
>> -
>> static int __cpuidle poll_idle(struct cpuidle_device *dev,
>> struct cpuidle_driver *drv, int index)
>> {
>> - u64 time_start;
>> -
>> - time_start = local_clock_noinstr();
>> + u64 time_end;
>> + u32 flags = 0;
>>
>> dev->poll_time_limit = false;
>>
>> + time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
>
> Is there any particular reason for doing this unconditionally? If
> not, then it looks like an arbitrary unrelated change to me.
Agreed. Will fix.
>> +
>> raw_local_irq_enable();
>> if (!current_set_polling_and_test()) {
>> - unsigned int loop_count = 0;
>> - u64 limit;
>> -
>> - limit = cpuidle_poll_time(drv, dev);
>> -
>> - while (!need_resched()) {
>> - cpu_relax();
>> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
>> - continue;
>> -
>> - loop_count = 0;
>> - if (local_clock_noinstr() - time_start > limit) {
>> - dev->poll_time_limit = true;
>> - break;
>> - }
>> - }
>> + flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
>> + (VAL & _TIF_NEED_RESCHED),
>> + (local_clock_noinstr() >= time_end));
>
> So my understanding of this is that it reduces duplication with some
> other places doing similar things. Fair enough.
>
> However, since there is "timeout" in the name, I'd expect it to take
> the timeout as an argument.
The early versions did take a timeout, but that complicated the
implementation significantly. And the current users, poll_idle() and
rqspinlock, don't need a precise timeout.
smp_cond_load_relaxed_timed(), smp_cond_load_relaxed_timecheck()?
The problem with all the suffixes I can think of is that they make the
interface itself nonobvious.
Possibly something with the sense of bail out might work.
--
ankur
On Wed, Oct 29, 2025 at 5:42 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>
> Rafael J. Wysocki <rafael@kernel.org> writes:
>
> > On Tue, Oct 28, 2025 at 6:32 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >>
> >> The inner loop in poll_idle() polls over the thread_info flags,
> >> waiting to see if the thread has TIF_NEED_RESCHED set. The loop
> >> exits once the condition is met, or if the poll time limit has
> >> been exceeded.
> >>
> >> To minimize the number of instructions executed in each iteration,
> >> the time check is done only intermittently (once every
> >> POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
> >> executes cpu_relax() which on certain platforms provides a hint to
> >> the pipeline that the loop busy-waits, allowing the processor to
> >> reduce power consumption.
> >>
> >> This is close to what smp_cond_load_relaxed_timeout() provides. So,
> >> restructure the loop and fold the loop condition and the timeout check
> >> in smp_cond_load_relaxed_timeout().
> >
> > Well, it is close, but is it close enough?
>
> I guess that's the question.
>
> >> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> >> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> >> ---
> >> drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
> >> 1 file changed, 8 insertions(+), 21 deletions(-)
> >>
> >> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> >> index 9b6d90a72601..dc7f4b424fec 100644
> >> --- a/drivers/cpuidle/poll_state.c
> >> +++ b/drivers/cpuidle/poll_state.c
> >> @@ -8,35 +8,22 @@
> >> #include <linux/sched/clock.h>
> >> #include <linux/sched/idle.h>
> >>
> >> -#define POLL_IDLE_RELAX_COUNT 200
> >> -
> >> static int __cpuidle poll_idle(struct cpuidle_device *dev,
> >> struct cpuidle_driver *drv, int index)
> >> {
> >> - u64 time_start;
> >> -
> >> - time_start = local_clock_noinstr();
> >> + u64 time_end;
> >> + u32 flags = 0;
> >>
> >> dev->poll_time_limit = false;
> >>
> >> + time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
> >
> > Is there any particular reason for doing this unconditionally? If
> > not, then it looks like an arbitrary unrelated change to me.
>
> Agreed. Will fix.
>
> >> +
> >> raw_local_irq_enable();
> >> if (!current_set_polling_and_test()) {
> >> - unsigned int loop_count = 0;
> >> - u64 limit;
> >> -
> >> - limit = cpuidle_poll_time(drv, dev);
> >> -
> >> - while (!need_resched()) {
> >> - cpu_relax();
> >> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> >> - continue;
> >> -
> >> - loop_count = 0;
> >> - if (local_clock_noinstr() - time_start > limit) {
> >> - dev->poll_time_limit = true;
> >> - break;
> >> - }
> >> - }
> >> + flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
> >> + (VAL & _TIF_NEED_RESCHED),
> >> + (local_clock_noinstr() >= time_end));
> >
> > So my understanding of this is that it reduces duplication with some
> > other places doing similar things. Fair enough.
> >
> > However, since there is "timeout" in the name, I'd expect it to take
> > the timeout as an argument.
>
> The early versions did have a timeout but that complicated the
> implementation significantly. And the current users poll_idle(),
> rqspinlock don't need a precise timeout.
>
> smp_cond_load_relaxed_timed(), smp_cond_load_relaxed_timecheck()?
>
> The problem with all suffixes I can think of is that it makes the
> interface itself nonobvious.
>
> Possibly something with the sense of bail out might work.
It basically has two conditions, one of which is checked in every step
of the internal loop and the other one is checked every
SMP_TIMEOUT_POLL_COUNT steps of it. That isn't particularly
straightforward IMV.
Honestly, I prefer the existing code. It is much easier to follow and
I don't see why the new code would be better. Sorry.
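
The two-condition shape described above (one condition checked on every step, the other only every so many steps, with the caller deriving the timeout from the returned value) can be sketched in userspace as follows. This is a model under stated assumptions and not the kernel macro: cond_load_timeout_model(), struct deadline and the helper predicates are hypothetical names, and SPIN_COUNT is an assumed value standing in for SMP_TIMEOUT_POLL_COUNT.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SPIN_COUNT 200         /* assumed; stands in for SMP_TIMEOUT_POLL_COUNT */
#define NEED_RESCHED_BIT 0x1u  /* stands in for _TIF_NEED_RESCHED */

typedef bool (*cond_fn)(uint32_t val);  /* checked on every step */
typedef bool (*expired_fn)(void *ctx);  /* checked every SPIN_COUNT steps */

/* Returns the last value loaded; the caller decides what a timeout means. */
static uint32_t cond_load_timeout_model(_Atomic uint32_t *ptr, cond_fn cond,
					expired_fn expired, void *ctx)
{
	unsigned int spin = 0;
	uint32_t val;

	assert(ptr && cond && expired);

	for (;;) {
		val = atomic_load_explicit(ptr, memory_order_relaxed);
		if (cond(val))
			break;
		/* cpu_relax() would be executed here in the kernel. */
		if (++spin < SPIN_COUNT)
			continue;
		spin = 0;
		if (expired(ctx))
			break;
	}
	return val;
}

/* Demonstration helpers: a flag predicate and a fake deadline. */
static bool need_resched_set(uint32_t val)
{
	return val & NEED_RESCHED_BIT;
}

struct deadline {
	uint64_t now;   /* fake clock value */
	uint64_t step;  /* ticks added on each time check */
	uint64_t end;   /* deadline, like time_end in the patch */
};

static bool deadline_expired(void *ctx)
{
	struct deadline *d = ctx;

	d->now += d->step;
	return d->now >= d->end;
}
```

A caller would then mirror the last line of the patch and compute the timeout indication as !(val & NEED_RESCHED_BIT), which is where the two exit paths get folded back into one result.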
Rafael J. Wysocki <rafael@kernel.org> writes:
> On Wed, Oct 29, 2025 at 5:42 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>>
>> Rafael J. Wysocki <rafael@kernel.org> writes:
>>
>> > On Tue, Oct 28, 2025 at 6:32 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> >>
>> >> The inner loop in poll_idle() polls over the thread_info flags,
>> >> waiting to see if the thread has TIF_NEED_RESCHED set. The loop
>> >> exits once the condition is met, or if the poll time limit has
>> >> been exceeded.
>> >>
>> >> To minimize the number of instructions executed in each iteration,
>> >> the time check is done only intermittently (once every
>> >> POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
>> >> executes cpu_relax() which on certain platforms provides a hint to
>> >> the pipeline that the loop busy-waits, allowing the processor to
>> >> reduce power consumption.
>> >>
>> >> This is close to what smp_cond_load_relaxed_timeout() provides. So,
>> >> restructure the loop and fold the loop condition and the timeout check
>> >> in smp_cond_load_relaxed_timeout().
>> >
>> > Well, it is close, but is it close enough?
>>
>> I guess that's the question.
>>
>> >> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
>> >> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
>> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> >> ---
>> >> drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
>> >> 1 file changed, 8 insertions(+), 21 deletions(-)
>> >>
>> >> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
>> >> index 9b6d90a72601..dc7f4b424fec 100644
>> >> --- a/drivers/cpuidle/poll_state.c
>> >> +++ b/drivers/cpuidle/poll_state.c
>> >> @@ -8,35 +8,22 @@
>> >> #include <linux/sched/clock.h>
>> >> #include <linux/sched/idle.h>
>> >>
>> >> -#define POLL_IDLE_RELAX_COUNT 200
>> >> -
>> >> static int __cpuidle poll_idle(struct cpuidle_device *dev,
>> >> struct cpuidle_driver *drv, int index)
>> >> {
>> >> - u64 time_start;
>> >> -
>> >> - time_start = local_clock_noinstr();
>> >> + u64 time_end;
>> >> + u32 flags = 0;
>> >>
>> >> dev->poll_time_limit = false;
>> >>
>> >> + time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
>> >
>> > Is there any particular reason for doing this unconditionally? If
>> > not, then it looks like an arbitrary unrelated change to me.
>>
>> Agreed. Will fix.
>>
>> >> +
>> >> raw_local_irq_enable();
>> >> if (!current_set_polling_and_test()) {
>> >> - unsigned int loop_count = 0;
>> >> - u64 limit;
>> >> -
>> >> - limit = cpuidle_poll_time(drv, dev);
>> >> -
>> >> - while (!need_resched()) {
>> >> - cpu_relax();
>> >> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
>> >> - continue;
>> >> -
>> >> - loop_count = 0;
>> >> - if (local_clock_noinstr() - time_start > limit) {
>> >> - dev->poll_time_limit = true;
>> >> - break;
>> >> - }
>> >> - }
>> >> + flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
>> >> + (VAL & _TIF_NEED_RESCHED),
>> >> + (local_clock_noinstr() >= time_end));
>> >
>> > So my understanding of this is that it reduces duplication with some
>> > other places doing similar things. Fair enough.
>> >
>> > However, since there is "timeout" in the name, I'd expect it to take
>> > the timeout as an argument.
>>
>> The early versions did have a timeout but that complicated the
>> implementation significantly. And the current users poll_idle(),
>> rqspinlock don't need a precise timeout.
>>
>> smp_cond_load_relaxed_timed(), smp_cond_load_relaxed_timecheck()?
>>
>> The problem with all suffixes I can think of is that it makes the
>> interface itself nonobvious.
>>
>> Possibly something with the sense of bail out might work.
>
> It basically has two conditions, one of which is checked in every step
> of the internal loop and the other one is checked every
> SMP_TIMEOUT_POLL_COUNT steps of it. That isn't particularly
> straightforward IMV.
Right. And that's similar to what poll_idle() does.
> Honestly, I prefer the existing code. It is much easier to follow and
> I don't see why the new code would be better. Sorry.
I don't think there's any problem with the current code. However, I'd like
to add support for poll_idle() on arm64 (and maybe other platforms) where
instead of spinning in a cpu_relax() loop, you wait on a cacheline.
And that's what using something like smp_cond_load_relaxed_timeout()
would enable.
Something like the series here:
https://lore.kernel.org/lkml/87wmaljd81.fsf@oracle.com/
(Sorry, should have mentioned this in the commit message.)
--
ankur
On Wed, Oct 29, 2025 at 8:13 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>
> Rafael J. Wysocki <rafael@kernel.org> writes:
>
> > On Wed, Oct 29, 2025 at 5:42 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >>
> >>
> >> Rafael J. Wysocki <rafael@kernel.org> writes:
> >>
> >> > On Tue, Oct 28, 2025 at 6:32 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >> >>
> >> >> The inner loop in poll_idle() polls over the thread_info flags,
> >> >> waiting to see if the thread has TIF_NEED_RESCHED set. The loop
> >> >> exits once the condition is met, or if the poll time limit has
> >> >> been exceeded.
> >> >>
> >> >> To minimize the number of instructions executed in each iteration,
> >> >> the time check is done only intermittently (once every
> >> >> POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
> >> >> executes cpu_relax() which on certain platforms provides a hint to
> >> >> the pipeline that the loop busy-waits, allowing the processor to
> >> >> reduce power consumption.
> >> >>
> >> >> This is close to what smp_cond_load_relaxed_timeout() provides. So,
> >> >> restructure the loop and fold the loop condition and the timeout check
> >> >> in smp_cond_load_relaxed_timeout().
> >> >
> >> > Well, it is close, but is it close enough?
> >>
> >> I guess that's the question.
> >>
> >> >> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> >> >> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> >> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> >> >> ---
> >> >> drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
> >> >> 1 file changed, 8 insertions(+), 21 deletions(-)
> >> >>
> >> >> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> >> >> index 9b6d90a72601..dc7f4b424fec 100644
> >> >> --- a/drivers/cpuidle/poll_state.c
> >> >> +++ b/drivers/cpuidle/poll_state.c
> >> >> @@ -8,35 +8,22 @@
> >> >> #include <linux/sched/clock.h>
> >> >> #include <linux/sched/idle.h>
> >> >>
> >> >> -#define POLL_IDLE_RELAX_COUNT 200
> >> >> -
> >> >> static int __cpuidle poll_idle(struct cpuidle_device *dev,
> >> >> struct cpuidle_driver *drv, int index)
> >> >> {
> >> >> - u64 time_start;
> >> >> -
> >> >> - time_start = local_clock_noinstr();
> >> >> + u64 time_end;
> >> >> + u32 flags = 0;
> >> >>
> >> >> dev->poll_time_limit = false;
> >> >>
> >> >> + time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
> >> >
> >> > Is there any particular reason for doing this unconditionally? If
> >> > not, then it looks like an arbitrary unrelated change to me.
> >>
> >> Agreed. Will fix.
> >>
> >> >> +
> >> >> raw_local_irq_enable();
> >> >> if (!current_set_polling_and_test()) {
> >> >> - unsigned int loop_count = 0;
> >> >> - u64 limit;
> >> >> -
> >> >> - limit = cpuidle_poll_time(drv, dev);
> >> >> -
> >> >> - while (!need_resched()) {
> >> >> - cpu_relax();
> >> >> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> >> >> - continue;
> >> >> -
> >> >> - loop_count = 0;
> >> >> - if (local_clock_noinstr() - time_start > limit) {
> >> >> - dev->poll_time_limit = true;
> >> >> - break;
> >> >> - }
> >> >> - }
> >> >> + flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
> >> >> + (VAL & _TIF_NEED_RESCHED),
> >> >> + (local_clock_noinstr() >= time_end));
> >> >
> >> > So my understanding of this is that it reduces duplication with some
> >> > other places doing similar things. Fair enough.
> >> >
> >> > However, since there is "timeout" in the name, I'd expect it to take
> >> > the timeout as an argument.
> >>
> >> The early versions did have a timeout but that complicated the
> >> implementation significantly. And the current users poll_idle(),
> >> rqspinlock don't need a precise timeout.
> >>
> >> smp_cond_load_relaxed_timed(), smp_cond_load_relaxed_timecheck()?
> >>
> >> The problem with all suffixes I can think of is that it makes the
> >> interface itself nonobvious.
> >>
> >> Possibly something with the sense of bail out might work.
> >
> > It basically has two conditions, one of which is checked in every step
> > of the internal loop and the other one is checked every
> > SMP_TIMEOUT_POLL_COUNT steps of it. That isn't particularly
> > straightforward IMV.
>
> Right. And that's similar to what poll_idle().
My point is that the macro in its current form is not particularly
straightforward.
The code in poll_idle() does what it needs to do.
> > Honestly, I prefer the existing code. It is much easier to follow and
> > I don't see why the new code would be better. Sorry.
>
> I don't think there's any problem with the current code. However, I'd like
> to add support for poll_idle() on arm64 (and maybe other platforms) where
> instead of spinning in a cpu_relax() loop, you wait on a cacheline.
Well, there is MWAIT on x86, but it is not used here. It just takes
too much time to wake up from. There are "fast" variants of that too,
but they have been designed with user space in mind, so somewhat
cumbersome for kernel use.
> And that's what using something like smp_cond_load_relaxed_timeout()
> would enable.
>
> Something like the series here:
> https://lore.kernel.org/lkml/87wmaljd81.fsf@oracle.com/
>
> (Sorry, should have mentioned this in the commit message.)
I'm not sure how you can combine that with a proper timeout. The
timeout is needed because you want to break out of this when it starts
to take too much time, so you can go back to the idle loop and maybe
select a better idle state.
Rafael J. Wysocki <rafael@kernel.org> writes:
> On Wed, Oct 29, 2025 at 8:13 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>>
>> Rafael J. Wysocki <rafael@kernel.org> writes:
>>
>> > On Wed, Oct 29, 2025 at 5:42 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> >>
>> >>
>> >> Rafael J. Wysocki <rafael@kernel.org> writes:
>> >>
>> >> > On Tue, Oct 28, 2025 at 6:32 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> >> >>
>> >> >> The inner loop in poll_idle() polls over the thread_info flags,
>> >> >> waiting to see if the thread has TIF_NEED_RESCHED set. The loop
>> >> >> exits once the condition is met, or if the poll time limit has
>> >> >> been exceeded.
>> >> >>
>> >> >> To minimize the number of instructions executed in each iteration,
>> >> >> the time check is done only intermittently (once every
>> >> >> POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
>> >> >> executes cpu_relax() which on certain platforms provides a hint to
>> >> >> the pipeline that the loop busy-waits, allowing the processor to
>> >> >> reduce power consumption.
>> >> >>
>> >> >> This is close to what smp_cond_load_relaxed_timeout() provides. So,
>> >> >> restructure the loop and fold the loop condition and the timeout check
>> >> >> in smp_cond_load_relaxed_timeout().
>> >> >
>> >> > Well, it is close, but is it close enough?
>> >>
>> >> I guess that's the question.
>> >>
>> >> >> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
>> >> >> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
>> >> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> >> >> ---
>> >> >> drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
>> >> >> 1 file changed, 8 insertions(+), 21 deletions(-)
>> >> >>
>> >> >> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
>> >> >> index 9b6d90a72601..dc7f4b424fec 100644
>> >> >> --- a/drivers/cpuidle/poll_state.c
>> >> >> +++ b/drivers/cpuidle/poll_state.c
>> >> >> @@ -8,35 +8,22 @@
>> >> >> #include <linux/sched/clock.h>
>> >> >> #include <linux/sched/idle.h>
>> >> >>
>> >> >> -#define POLL_IDLE_RELAX_COUNT 200
>> >> >> -
>> >> >> static int __cpuidle poll_idle(struct cpuidle_device *dev,
>> >> >> struct cpuidle_driver *drv, int index)
>> >> >> {
>> >> >> - u64 time_start;
>> >> >> -
>> >> >> - time_start = local_clock_noinstr();
>> >> >> + u64 time_end;
>> >> >> + u32 flags = 0;
>> >> >>
>> >> >> dev->poll_time_limit = false;
>> >> >>
>> >> >> + time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
>> >> >
>> >> > Is there any particular reason for doing this unconditionally? If
>> >> > not, then it looks like an arbitrary unrelated change to me.
>> >>
>> >> Agreed. Will fix.
>> >>
>> >> >> +
>> >> >> raw_local_irq_enable();
>> >> >> if (!current_set_polling_and_test()) {
>> >> >> - unsigned int loop_count = 0;
>> >> >> - u64 limit;
>> >> >> -
>> >> >> - limit = cpuidle_poll_time(drv, dev);
>> >> >> -
>> >> >> - while (!need_resched()) {
>> >> >> - cpu_relax();
>> >> >> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
>> >> >> - continue;
>> >> >> -
>> >> >> - loop_count = 0;
>> >> >> - if (local_clock_noinstr() - time_start > limit) {
>> >> >> - dev->poll_time_limit = true;
>> >> >> - break;
>> >> >> - }
>> >> >> - }
>> >> >> + flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
>> >> >> + (VAL & _TIF_NEED_RESCHED),
>> >> >> + (local_clock_noinstr() >= time_end));
>> >> >
>> >> > So my understanding of this is that it reduces duplication with some
>> >> > other places doing similar things. Fair enough.
>> >> >
>> >> > However, since there is "timeout" in the name, I'd expect it to take
>> >> > the timeout as an argument.
>> >>
>> >> The early versions did have a timeout but that complicated the
>> >> implementation significantly. And the current users poll_idle(),
>> >> rqspinlock don't need a precise timeout.
>> >>
>> >> smp_cond_load_relaxed_timed(), smp_cond_load_relaxed_timecheck()?
>> >>
>> >> The problem with all suffixes I can think of is that it makes the
>> >> interface itself nonobvious.
>> >>
>> >> Possibly something with the sense of bail out might work.
>> >
>> > It basically has two conditions, one of which is checked in every step
>> > of the internal loop and the other one is checked every
>> > SMP_TIMEOUT_POLL_COUNT steps of it. That isn't particularly
>> > straightforward IMV.
>>
>> Right. And that's similar to what poll_idle().
>
> My point is that the macro in its current form is not particularly
> straightforward.
>
> The code in poll_idle() does what it needs to do.
>
>> > Honestly, I prefer the existing code. It is much easier to follow and
>> > I don't see why the new code would be better. Sorry.
>>
>> I don't think there's any problem with the current code. However, I'd like
>> to add support for poll_idle() on arm64 (and maybe other platforms) where
>> instead of spinning in a cpu_relax() loop, you wait on a cacheline.
>
> Well, there is MWAIT on x86, but it is not used here. It just takes
> too much time to wake up from. There are "fast" variants of that too,
> but they have been designed with user space in mind, so somewhat
> cumbersome for kernel use.
>
>> And that's what using something like smp_cond_load_relaxed_timeout()
>> would enable.
>>
>> Something like the series here:
>> https://lore.kernel.org/lkml/87wmaljd81.fsf@oracle.com/
>>
>> (Sorry, should have mentioned this in the commit message.)
>
> I'm not sure how you can combine that with a proper timeout.
Would taking the timeout as a separate argument work?
flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
(VAL & _TIF_NEED_RESCHED),
local_clock_noinstr(), time_end);
Or are you thinking of something along different lines from the
smp_cond_load kind of interface?
> The
> timeout is needed because you want to break out of this when it starts
> to take too much time, so you can go back to the idle loop and maybe
> select a better idle state.
Agreed. And that will happen with the version in the patch:
flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
(VAL & _TIF_NEED_RESCHED),
(local_clock_noinstr() >= time_end));
It is just that, with the waited mode on arm64, the timeout might be
delayed depending on the granularity of the event stream.
Thanks
--
ankur
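
To put a number on the delay mentioned above, here is a small arithmetic sketch. It assumes a simplified model in which a waiter that starts at time `start` is woken only at fixed multiples of an event period (or when the watched cacheline changes, which is not modelled here), so the deadline can be noticed only at a wake-up. observed_timeout() and its parameters are hypothetical, and the numbers in the usage note are illustrative, not measured values.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Earliest wake-up at or after time_end, given wake-ups at
 * start, start + period, start + 2 * period, ...
 * Arithmetic overflow is ignored in this sketch.
 */
static uint64_t observed_timeout(uint64_t start, uint64_t time_end,
				 uint64_t period)
{
	uint64_t k;

	assert(period > 0);

	if (time_end <= start)
		return start;  /* already expired on entry */

	/* Round the remaining time up to a whole number of periods. */
	k = (time_end - start + period - 1) / period;
	return start + k * period;
}
```

For example, with start = 0, time_end = 250 and period = 100, the timeout is noticed at 300. In this model the overshoot past time_end is always less than one period, which is the sense in which the timeout is not precise.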
On Wed, Oct 29, 2025 at 10:01 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>
> Rafael J. Wysocki <rafael@kernel.org> writes:
>
> > On Wed, Oct 29, 2025 at 8:13 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >>
> >>
> >> Rafael J. Wysocki <rafael@kernel.org> writes:
> >>
> >> > On Wed, Oct 29, 2025 at 5:42 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >> >>
> >> >>
> >> >> Rafael J. Wysocki <rafael@kernel.org> writes:
> >> >>
> >> >> > On Tue, Oct 28, 2025 at 6:32 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >> >> >>
> >> >> >> The inner loop in poll_idle() polls over the thread_info flags,
> >> >> >> waiting to see if the thread has TIF_NEED_RESCHED set. The loop
> >> >> >> exits once the condition is met, or if the poll time limit has
> >> >> >> been exceeded.
> >> >> >>
> >> >> >> To minimize the number of instructions executed in each iteration,
> >> >> >> the time check is done only intermittently (once every
> >> >> >> POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
> >> >> >> executes cpu_relax() which on certain platforms provides a hint to
> >> >> >> the pipeline that the loop busy-waits, allowing the processor to
> >> >> >> reduce power consumption.
> >> >> >>
> >> >> >> This is close to what smp_cond_load_relaxed_timeout() provides. So,
> >> >> >> restructure the loop and fold the loop condition and the timeout check
> >> >> >> in smp_cond_load_relaxed_timeout().
> >> >> >
> >> >> > Well, it is close, but is it close enough?
> >> >>
> >> >> I guess that's the question.
> >> >>
> >> >> >> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> >> >> >> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> >> >> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> >> >> >> ---
> >> >> >> drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
> >> >> >> 1 file changed, 8 insertions(+), 21 deletions(-)
> >> >> >>
> >> >> >> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> >> >> >> index 9b6d90a72601..dc7f4b424fec 100644
> >> >> >> --- a/drivers/cpuidle/poll_state.c
> >> >> >> +++ b/drivers/cpuidle/poll_state.c
> >> >> >> @@ -8,35 +8,22 @@
> >> >> >> #include <linux/sched/clock.h>
> >> >> >> #include <linux/sched/idle.h>
> >> >> >>
> >> >> >> -#define POLL_IDLE_RELAX_COUNT 200
> >> >> >> -
> >> >> >> static int __cpuidle poll_idle(struct cpuidle_device *dev,
> >> >> >> struct cpuidle_driver *drv, int index)
> >> >> >> {
> >> >> >> - u64 time_start;
> >> >> >> -
> >> >> >> - time_start = local_clock_noinstr();
> >> >> >> + u64 time_end;
> >> >> >> + u32 flags = 0;
> >> >> >>
> >> >> >> dev->poll_time_limit = false;
> >> >> >>
> >> >> >> + time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
> >> >> >
> >> >> > Is there any particular reason for doing this unconditionally? If
> >> >> > not, then it looks like an arbitrary unrelated change to me.
> >> >>
> >> >> Agreed. Will fix.
> >> >>
> >> >> >> +
> >> >> >> raw_local_irq_enable();
> >> >> >> if (!current_set_polling_and_test()) {
> >> >> >> - unsigned int loop_count = 0;
> >> >> >> - u64 limit;
> >> >> >> -
> >> >> >> - limit = cpuidle_poll_time(drv, dev);
> >> >> >> -
> >> >> >> - while (!need_resched()) {
> >> >> >> - cpu_relax();
> >> >> >> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> >> >> >> - continue;
> >> >> >> -
> >> >> >> - loop_count = 0;
> >> >> >> - if (local_clock_noinstr() - time_start > limit) {
> >> >> >> - dev->poll_time_limit = true;
> >> >> >> - break;
> >> >> >> - }
> >> >> >> - }
> >> >> >> + flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
> >> >> >> + (VAL & _TIF_NEED_RESCHED),
> >> >> >> + (local_clock_noinstr() >= time_end));
> >> >> >
> >> >> > So my understanding of this is that it reduces duplication with some
> >> >> > other places doing similar things. Fair enough.
> >> >> >
> >> >> > However, since there is "timeout" in the name, I'd expect it to take
> >> >> > the timeout as an argument.
> >> >>
> >> >> The early versions did have a timeout but that complicated the
> >> >> implementation significantly. And the current users, poll_idle() and
> >> >> rqspinlock, don't need a precise timeout.
> >> >>
> >> >> smp_cond_load_relaxed_timed(), smp_cond_load_relaxed_timecheck()?
> >> >>
> >> >> The problem with all suffixes I can think of is that it makes the
> >> >> interface itself nonobvious.
> >> >>
> >> >> Possibly something with the sense of bail out might work.
> >> >
> >> > It basically has two conditions, one of which is checked in every step
> >> > of the internal loop and the other one is checked every
> >> > SMP_TIMEOUT_POLL_COUNT steps of it. That isn't particularly
> >> > straightforward IMV.
> >>
> >> Right. And that's similar to what poll_idle() does.
> >
> > My point is that the macro in its current form is not particularly
> > straightforward.
> >
> > The code in poll_idle() does what it needs to do.
> >
> >> > Honestly, I prefer the existing code. It is much easier to follow and
> >> > I don't see why the new code would be better. Sorry.
> >>
> >> I don't think there's any problem with the current code. However, I'd like
> >> to add support for poll_idle() on arm64 (and maybe other platforms) where
> >> instead of spinning in a cpu_relax() loop, you wait on a cacheline.
> >
> > Well, there is MWAIT on x86, but it is not used here. It just takes
> > too much time to wake up from. There are "fast" variants of that too,
> > but they have been designed with user space in mind, so somewhat
> > cumbersome for kernel use.
> >
> >> And that's what using something like smp_cond_load_relaxed_timeout()
> >> would enable.
> >>
> >> Something like the series here:
> >> https://lore.kernel.org/lkml/87wmaljd81.fsf@oracle.com/
> >>
> >> (Sorry, should have mentioned this in the commit message.)
> >
> > I'm not sure how you can combine that with a proper timeout.
>
> Would taking the timeout as a separate argument work?
>
> flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
> (VAL & _TIF_NEED_RESCHED),
> local_clock_noinstr(), time_end);
>
> Or are you thinking of something along different lines from the
> smp_cond_load kind of interface?

I would like it to be something along the lines of

  arch_busy_wait_for_need_resched(time_limit);
  dev->poll_time_limit = !need_resched();

and I don't care much about how exactly this is done in the arch code,
so long as it does what it says.

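For illustration only, here is a minimal userspace sketch of what a generic spinning fallback behind such an interface might look like. It is not kernel code and not from the thread: every mock_* name is hypothetical, the flag bit is arbitrary, and the clock is a fake counter that advances by one on each read so that the behaviour is deterministic. The loop keeps the structure of the existing poll_idle() code (flag checked on every iteration, clock checked once every 200 iterations).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins only; none of these are the kernel definitions. */
#define MOCK_TIF_NEED_RESCHED	(1u << 0)	/* arbitrary bit chosen for the sketch */
#define MOCK_RELAX_COUNT	200u		/* same value as POLL_IDLE_RELAX_COUNT */

static volatile uint32_t mock_ti_flags;	/* stands in for current_thread_info()->flags */
static uint64_t mock_now;		/* fake clock: advances by one per read */

static uint64_t mock_clock(void)
{
	return mock_now++;
}

static bool mock_need_resched(void)
{
	return (mock_ti_flags & MOCK_TIF_NEED_RESCHED) != 0;
}

/*
 * Hypothetical generic fallback for arch_busy_wait_for_need_resched():
 * check the flag on every iteration and the clock only once every
 * MOCK_RELAX_COUNT iterations.
 */
static void mock_busy_wait_for_need_resched(uint64_t time_limit)
{
	uint64_t time_end = mock_clock() + time_limit;
	unsigned int loop_count = 0;

	while (!mock_need_resched()) {
		/* the kernel loop would call cpu_relax() here */
		if (++loop_count < MOCK_RELAX_COUNT)
			continue;

		loop_count = 0;
		if (mock_clock() >= time_end)
			break;
	}
}

/* Caller side, shaped like the two lines proposed above. */
static bool mock_poll_idle(uint64_t time_limit)
{
	mock_busy_wait_for_need_resched(time_limit);
	return !mock_need_resched();	/* would be stored in dev->poll_time_limit */
}

static void mock_poll_idle_demo(void)
{
	mock_ti_flags = 0;
	assert(mock_poll_idle(5));	/* flag never set: the limit is hit */

	mock_ti_flags = MOCK_TIF_NEED_RESCHED;
	assert(!mock_poll_idle(5));	/* flag already set: no timeout */
}
```

The point of the shape is that the caller only sees a time limit and the state of the flag afterwards; whether the wait spins, as here, or blocks on a cacheline is left to the implementation.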
> > The timeout is needed because you want to break out of this when it starts
> > to take too much time, so you can go back to the idle loop and maybe
> > select a better idle state.
>
> Agreed. And that will happen with the version in the patch:
>
> flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
> (VAL & _TIF_NEED_RESCHED),
> (local_clock_noinstr() >= time_end));
>
> Just that with waited mode on arm64 the timeout might be delayed depending
> on granularity of the event stream.

That's fine. cpuidle_poll_time() is not exact anyway.

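As a rough illustration of the delay being discussed, the sketch below assumes a simplified model that is not stated in the thread: the waiter wakes only on periodic events, at multiples of a fixed period, and checks the deadline at each wakeup. Under that assumption the exit time is the deadline rounded up to the next event, so the overshoot is always less than one period. The mock_* names and the numbers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed model (not from the thread): wakeups happen only at
 * period, 2 * period, 3 * period, ... and the deadline is checked
 * at each wakeup. period must be nonzero.
 */
static uint64_t mock_waited_exit_time(uint64_t deadline, uint64_t period)
{
	/* Round the deadline up to the next wakeup boundary. */
	uint64_t wakeups = (deadline + period - 1) / period;

	return wakeups * period;
}

/* How far past the deadline the waiter actually returns. */
static uint64_t mock_timeout_overshoot(uint64_t deadline, uint64_t period)
{
	return mock_waited_exit_time(deadline, period) - deadline;
}

static void mock_overshoot_demo(void)
{
	/* Deadline between two wakeups: exit at the later one. */
	assert(mock_waited_exit_time(250, 100) == 300);
	assert(mock_timeout_overshoot(250, 100) == 50);

	/* Deadline exactly on a wakeup: no overshoot. */
	assert(mock_timeout_overshoot(300, 100) == 0);
}
```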
Rafael J. Wysocki <rafael@kernel.org> writes:
> On Wed, Oct 29, 2025 at 10:01 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>>
>> Rafael J. Wysocki <rafael@kernel.org> writes:
>>
>> > On Wed, Oct 29, 2025 at 8:13 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> >>
>> >>
>> >> Rafael J. Wysocki <rafael@kernel.org> writes:
>> >>
>> >> > On Wed, Oct 29, 2025 at 5:42 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> >> >>
>> >> >>
>> >> >> Rafael J. Wysocki <rafael@kernel.org> writes:
>> >> >>
>> >> >> > On Tue, Oct 28, 2025 at 6:32 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> >> >> >>
>> >> >> >> The inner loop in poll_idle() polls over the thread_info flags,
>> >> >> >> waiting to see if the thread has TIF_NEED_RESCHED set. The loop
>> >> >> >> exits once the condition is met, or if the poll time limit has
>> >> >> >> been exceeded.
>> >> >> >>
>> >> >> >> To minimize the number of instructions executed in each iteration,
>> >> >> >> the time check is done only intermittently (once every
>> >> >> >> POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
>> >> >> >> executes cpu_relax() which on certain platforms provides a hint to
>> >> >> >> the pipeline that the loop busy-waits, allowing the processor to
>> >> >> >> reduce power consumption.
>> >> >> >>
>> >> >> >> This is close to what smp_cond_load_relaxed_timeout() provides. So,
>> >> >> >> restructure the loop and fold the loop condition and the timeout check
>> >> >> >> in smp_cond_load_relaxed_timeout().
>> >> >> >
>> >> >> > Well, it is close, but is it close enough?
>> >> >>
>> >> >> I guess that's the question.
>> >> >>
>> >> >> >> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
>> >> >> >> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
>> >> >> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> >> >> >> ---
>> >> >> >> drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
>> >> >> >> 1 file changed, 8 insertions(+), 21 deletions(-)
>> >> >> >>
>> >> >> >> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
>> >> >> >> index 9b6d90a72601..dc7f4b424fec 100644
>> >> >> >> --- a/drivers/cpuidle/poll_state.c
>> >> >> >> +++ b/drivers/cpuidle/poll_state.c
>> >> >> >> @@ -8,35 +8,22 @@
>> >> >> >> #include <linux/sched/clock.h>
>> >> >> >> #include <linux/sched/idle.h>
>> >> >> >>
>> >> >> >> -#define POLL_IDLE_RELAX_COUNT 200
>> >> >> >> -
>> >> >> >> static int __cpuidle poll_idle(struct cpuidle_device *dev,
>> >> >> >> struct cpuidle_driver *drv, int index)
>> >> >> >> {
>> >> >> >> - u64 time_start;
>> >> >> >> -
>> >> >> >> - time_start = local_clock_noinstr();
>> >> >> >> + u64 time_end;
>> >> >> >> + u32 flags = 0;
>> >> >> >>
>> >> >> >> dev->poll_time_limit = false;
>> >> >> >>
>> >> >> >> + time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
>> >> >> >
>> >> >> > Is there any particular reason for doing this unconditionally? If
>> >> >> > not, then it looks like an arbitrary unrelated change to me.
>> >> >>
>> >> >> Agreed. Will fix.
>> >> >>
>> >> >> >> +
>> >> >> >> raw_local_irq_enable();
>> >> >> >> if (!current_set_polling_and_test()) {
>> >> >> >> - unsigned int loop_count = 0;
>> >> >> >> - u64 limit;
>> >> >> >> -
>> >> >> >> - limit = cpuidle_poll_time(drv, dev);
>> >> >> >> -
>> >> >> >> - while (!need_resched()) {
>> >> >> >> - cpu_relax();
>> >> >> >> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
>> >> >> >> - continue;
>> >> >> >> -
>> >> >> >> - loop_count = 0;
>> >> >> >> - if (local_clock_noinstr() - time_start > limit) {
>> >> >> >> - dev->poll_time_limit = true;
>> >> >> >> - break;
>> >> >> >> - }
>> >> >> >> - }
>> >> >> >> + flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
>> >> >> >> + (VAL & _TIF_NEED_RESCHED),
>> >> >> >> + (local_clock_noinstr() >= time_end));
>> >> >> >
>> >> >> > So my understanding of this is that it reduces duplication with some
>> >> >> > other places doing similar things. Fair enough.
>> >> >> >
>> >> >> > However, since there is "timeout" in the name, I'd expect it to take
>> >> >> > the timeout as an argument.
>> >> >>
>> >> >> The early versions did have a timeout but that complicated the
>> >> >> implementation significantly. And the current users, poll_idle() and
>> >> >> rqspinlock, don't need a precise timeout.
>> >> >>
>> >> >> smp_cond_load_relaxed_timed(), smp_cond_load_relaxed_timecheck()?
>> >> >>
>> >> >> The problem with all suffixes I can think of is that it makes the
>> >> >> interface itself nonobvious.
>> >> >>
>> >> >> Possibly something with the sense of bail out might work.
>> >> >
>> >> > It basically has two conditions, one of which is checked in every step
>> >> > of the internal loop and the other one is checked every
>> >> > SMP_TIMEOUT_POLL_COUNT steps of it. That isn't particularly
>> >> > straightforward IMV.
>> >>
>> >> Right. And that's similar to what poll_idle() does.
>> >
>> > My point is that the macro in its current form is not particularly
>> > straightforward.
>> >
>> > The code in poll_idle() does what it needs to do.
>> >
>> >> > Honestly, I prefer the existing code. It is much easier to follow and
>> >> > I don't see why the new code would be better. Sorry.
>> >>
>> >> I don't think there's any problem with the current code. However, I'd like
>> >> to add support for poll_idle() on arm64 (and maybe other platforms) where
>> >> instead of spinning in a cpu_relax() loop, you wait on a cacheline.
>> >
>> > Well, there is MWAIT on x86, but it is not used here. It just takes
>> > too much time to wake up from. There are "fast" variants of that too,
>> > but they have been designed with user space in mind, so somewhat
>> > cumbersome for kernel use.
>> >
>> >> And that's what using something like smp_cond_load_relaxed_timeout()
>> >> would enable.
>> >>
>> >> Something like the series here:
>> >> https://lore.kernel.org/lkml/87wmaljd81.fsf@oracle.com/
>> >>
>> >> (Sorry, should have mentioned this in the commit message.)
>> >
>> > I'm not sure how you can combine that with a proper timeout.
>>
>> Would taking the timeout as a separate argument work?
>>
>> flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
>> (VAL & _TIF_NEED_RESCHED),
>> local_clock_noinstr(), time_end);
>>
>> Or are you thinking of something along different lines from the
>> smp_cond_load kind of interface?
>
> I would like it to be something along the lines of
>
> arch_busy_wait_for_need_resched(time_limit);
> dev->poll_time_limit = !need_resched();
>
> and I don't care much about how exactly this is done in the arch code,
> so long as it does what it says.

This looks great. I think it could just be:

  tif_need_resched_wait(time_limit);

And, given that this is tied in with scheduling contexts, this interface
should be able to use local_clock()/sched_clock().

--
ankur