Add a timed variant of smp_cond_load_relaxed().
This is useful because arm64 supports polling on a conditional variable
by directly waiting on the cacheline instead of spin waiting for the
condition to change.
However, an implementation such as this has a problem that it can block
forever -- unless there's an explicit timeout or another out-of-band
mechanism which allows it to come out of the wait state periodically.
smp_cond_load_relaxed_timeout() supports these semantics by specifying
a time-check expression and an associated time-limit.
However, note that for the generic spin-wait implementation we want to
minimize the numbers of instructions executed in each iteration. So,
limit how often we evaluate the time-check expression by doing it once
every smp_cond_time_check_count.
The inner loop in poll_idle() has a substantially similar structure
and constraints as smp_cond_load_relaxed_timeout(), so define
smp_cond_time_check_count to the same value used in poll_idle().
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
include/asm-generic/barrier.h | 42 +++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..77726ef807e4 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -273,6 +273,48 @@ do { \
})
#endif
+#ifndef smp_cond_time_check_count
+/*
+ * Limit how often smp_cond_load_relaxed_timeout() evaluates time_expr_ns.
+ * This helps reduce the number of instructions executed while spin-waiting.
+ */
+#define smp_cond_time_check_count 200
+#endif
+
+/**
+ * smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
+ * guarantees until a timeout expires.
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ * @time_expr_ns: evaluates to the current time
+ * @time_limit_ns: compared against time_expr_ns
+ *
+ * Equivalent to using READ_ONCE() on the condition variable.
+ *
+ * Due to C lacking lambda expressions we load the value of *ptr into a
+ * pre-named variable @VAL to be used in @cond.
+ */
+#ifndef smp_cond_load_relaxed_timeout
+#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr_ns, \
+ time_limit_ns) ({ \
+ typeof(ptr) __PTR = (ptr); \
+ __unqual_scalar_typeof(*ptr) VAL; \
+ unsigned int __count = 0; \
+ for (;;) { \
+ VAL = READ_ONCE(*__PTR); \
+ if (cond_expr) \
+ break; \
+ cpu_relax(); \
+ if (__count++ < smp_cond_time_check_count) \
+ continue; \
+ if ((time_expr_ns) >= time_limit_ns) \
+ break; \
+ __count = 0; \
+ } \
+ (typeof(*ptr))VAL; \
+})
+#endif
+
/*
* pmem_wmb() ensures that all stores for which the modification
* are written to persistent storage by preceding instructions have
--
2.43.5
On Thu, 7 Nov 2024, Ankur Arora wrote: > +#ifndef smp_cond_time_check_count > +/* > + * Limit how often smp_cond_load_relaxed_timeout() evaluates time_expr_ns. > + * This helps reduce the number of instructions executed while spin-waiting. > + */ > +#define smp_cond_time_check_count 200 > +#endif I dont like these loops that execute differently depending on the hardware. Can we use cycles and ns instead to have defined periods of time? Later patches establish the infrastructure to convert cycles to nanoseconds and microseconds. Use that? > +#ifndef smp_cond_load_relaxed_timeout > +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr_ns, \ > + time_limit_ns) ({ \ > + typeof(ptr) __PTR = (ptr); \ > + __unqual_scalar_typeof(*ptr) VAL; \ > + unsigned int __count = 0; \ > + for (;;) { \ > + VAL = READ_ONCE(*__PTR); \ > + if (cond_expr) \ > + break; \ > + cpu_relax(); \ > + if (__count++ < smp_cond_time_check_count) \ > + continue; \ > + if ((time_expr_ns) >= time_limit_ns) \ > + break; \ Calling the clock retrieval function repeatedly should be fine and is typically done in user space as well as in kernel space for functions that need to wait short time periods.
Christoph Lameter (Ampere) <cl@gentwo.org> writes: > On Thu, 7 Nov 2024, Ankur Arora wrote: > >> +#ifndef smp_cond_time_check_count >> +/* >> + * Limit how often smp_cond_load_relaxed_timeout() evaluates time_expr_ns. >> + * This helps reduce the number of instructions executed while spin-waiting. >> + */ >> +#define smp_cond_time_check_count 200 >> +#endif > > I dont like these loops that execute differently depending on the > hardware. Can we use cycles and ns instead to have defined periods of > time? Later patches establish the infrastructure to convert cycles to > nanoseconds and microseconds. Use that? > >> +#ifndef smp_cond_load_relaxed_timeout >> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr_ns, \ >> + time_limit_ns) ({ \ >> + typeof(ptr) __PTR = (ptr); \ >> + __unqual_scalar_typeof(*ptr) VAL; \ >> + unsigned int __count = 0; \ >> + for (;;) { \ >> + VAL = READ_ONCE(*__PTR); \ >> + if (cond_expr) \ >> + break; \ >> + cpu_relax(); \ >> + if (__count++ < smp_cond_time_check_count) \ >> + continue; \ >> + if ((time_expr_ns) >= time_limit_ns) \ >> + break; \ > > Calling the clock retrieval function repeatedly should be fine and is > typically done in user space as well as in kernel space for functions that > need to wait short time periods. The problem is that you might have multiple CPUs polling in idle for prolonged periods of time. And, so you want to minimize your power/thermal envelope. For instance see commit 4dc2375c1a4e "cpuidle: poll_state: Avoid invoking local_clock() too often" which originally added a similar rate limit to poll_idle() where they saw exactly that issue. -- ankur
On Thu, 7 Nov 2024, Ankur Arora wrote: > > Calling the clock retrieval function repeatedly should be fine and is > > typically done in user space as well as in kernel space for functions that > > need to wait short time periods. > > The problem is that you might have multiple CPUs polling in idle > for prolonged periods of time. And, so you want to minimize > your power/thermal envelope. On ARM that maps to YIELD which does not do anything for the power envelope AFAICT. It switches to the other hyperthread. > For instance see commit 4dc2375c1a4e "cpuidle: poll_state: Avoid > invoking local_clock() too often" which originally added a similar > rate limit to poll_idle() where they saw exactly that issue. Looping w/o calling local_clock may increase the wait period etc. For power saving most arches have special instructions like ARMS WFE/WFET. These are then causing more accurate wait times than the looping thing?
On Fri, Nov 08, 2024 at 11:41:08AM -0800, Christoph Lameter (Ampere) wrote: > On Thu, 7 Nov 2024, Ankur Arora wrote: > > > Calling the clock retrieval function repeatedly should be fine and is > > > typically done in user space as well as in kernel space for functions that > > > need to wait short time periods. > > > > The problem is that you might have multiple CPUs polling in idle > > for prolonged periods of time. And, so you want to minimize > > your power/thermal envelope. > > On ARM that maps to YIELD which does not do anything for the power > envelope AFAICT. It switches to the other hyperthread. The issue is not necessarily arm64 but poll_idle() on other architectures like x86 where, at the end of this series, they still call cpu_relax() in a loop and check local_clock() every 200 times or so iterations. So I wouldn't want to revert the improvement in 4dc2375c1a4e ("cpuidle: poll_state: Avoid invoking local_clock() too often"). I agree that the 200 iterations here it's pretty random and it was something made up for poll_idle() specifically and it could increase the wait period in other situations (or other architectures). OTOH, I'm not sure we want to make this API too complex if the only user for a while would be poll_idle(). We could add a comment that the timeout granularity can be pretty coarse and architecture dependent (200 cpu_relax() calls in one deployment, 100us on arm64 with WFE). -- Catalin
Catalin Marinas <catalin.marinas@arm.com> writes: > On Fri, Nov 08, 2024 at 11:41:08AM -0800, Christoph Lameter (Ampere) wrote: >> On Thu, 7 Nov 2024, Ankur Arora wrote: >> > > Calling the clock retrieval function repeatedly should be fine and is >> > > typically done in user space as well as in kernel space for functions that >> > > need to wait short time periods. >> > >> > The problem is that you might have multiple CPUs polling in idle >> > for prolonged periods of time. And, so you want to minimize >> > your power/thermal envelope. >> >> On ARM that maps to YIELD which does not do anything for the power >> envelope AFAICT. It switches to the other hyperthread. > > The issue is not necessarily arm64 but poll_idle() on other > architectures like x86 where, at the end of this series, they still call > cpu_relax() in a loop and check local_clock() every 200 times or so > iterations. So I wouldn't want to revert the improvement in 4dc2375c1a4e > ("cpuidle: poll_state: Avoid invoking local_clock() too often"). > > I agree that the 200 iterations here it's pretty random and it was > something made up for poll_idle() specifically and it could increase the > wait period in other situations (or other architectures). > > OTOH, I'm not sure we want to make this API too complex if the only > user for a while would be poll_idle(). We could add a comment that the > timeout granularity can be pretty coarse and architecture dependent (200 > cpu_relax() calls in one deployment, 100us on arm64 with WFE). Yeah, agreed. Not worth over engineering this interface at least not until there are other users. For now I'll just add a comment mentioning that the time-check is only coarse grained and architecture dependent. -- ankur
Christoph Lameter (Ampere) <cl@gentwo.org> writes: > On Thu, 7 Nov 2024, Ankur Arora wrote: > >> > Calling the clock retrieval function repeatedly should be fine and is >> > typically done in user space as well as in kernel space for functions that >> > need to wait short time periods. >> >> The problem is that you might have multiple CPUs polling in idle >> for prolonged periods of time. And, so you want to minimize >> your power/thermal envelope. > > On ARM that maps to YIELD which does not do anything for the power > envelope AFAICT. It switches to the other hyperthread. Agreed. For arm64 patch-5 adds a specialized version. For the fallback case when we don't have an event stream, the arm64 version does use the same cpu_relax() loop but that's not a production thing. >> For instance see commit 4dc2375c1a4e "cpuidle: poll_state: Avoid >> invoking local_clock() too often" which originally added a similar >> rate limit to poll_idle() where they saw exactly that issue. > > Looping w/o calling local_clock may increase the wait period etc. Yeah. I don't think that's a real problem for the poll_idle() case as the only thing waiting on the other side of the possibly delayed timer is a deeper idle state. But, for any other potential users the looping duration might be too long (the generated code for x86 will execute around 200 * 7 instructions before checking the timer, so a worst case delay of say around 1-2us.) I'll note that in the comment around smp_cond_time_check_count just to warn any future users. > For power saving most arches have special instructions like ARMS > WFE/WFET. These are then causing more accurate wait times than the looping > thing? Definitely true for WFET. The WFE can still overshoot because the eventstream has a period of 100us. -- ankur
On Fri, 8 Nov 2024, Ankur Arora wrote: > > For power saving most arches have special instructions like ARMS > > WFE/WFET. These are then causing more accurate wait times than the looping > > thing? > > Definitely true for WFET. The WFE can still overshoot because the > eventstream has a period of 100us. We can only use the event stream if we need to wait more than 100us. The rest of the wait period can be coverd by a busy loop. Thus we are accurate.
© 2016 - 2024 Red Hat, Inc.