The frequency at which the TICK happens is very important from the scheduler's perspective. There's a responsiveness trade-off, and for interactive systems the current default is set too low.
Having a slow TICK frequency can lead to the following shortcomings in
scheduler decisions:
1. Imprecise time slice
-----------------------
Preemption checks occur when a new task wakes up, on return from
interrupt or at TICK. If we have N tasks running on the same CPU then as
a worst case scenario these tasks will time slice every TICK regardless
of their actual slice size.
By default base_slice ends up being 3ms on many systems. But with TICK being 4ms by default, tasks will end up slicing every 4ms instead in busy scenarios. This also makes reducing base_slice to a lower value like 2ms or 1ms largely ineffective: it will allow newly waking tasks to preempt sooner, but the coarse TICK still prevents timely cycling of tasks in busy scenarios, which is an important and frequent scenario.
2. Delayed load_balance()
-------------------------
Scheduler task placement decision at wake up can easily become stale as
more tasks wake up. load_balance() is the correction point to ensure the
system is loaded optimally. And in the case of HMP systems tasks are
migrated to a bigger CPU to meet their compute demand.
Newidle balance can help alleviate the problem. But the worst case
scenario is for the TICK to trigger the load_balance().
3. Delayed stats update
-----------------------
And subsequently delayed cpufreq updates and misfit detection (the need
to move a task from little CPU to a big CPU in HMP systems).
When a task is busy then as a worst case scenario the util signal will
update every TICK. Since util signal is the main driver for our
preferred governor - schedutil - and what drives EAS to decide if
a task fits a CPU or needs to migrate to a bigger CPU, these delays can
be detrimental to system responsiveness.
------------------------------------------------------------------------
Note that the worst case scenario is an important and defining
characteristic for interactive systems. It's all about the P90 and P95.
Responsiveness IMHO is no longer a characteristic of a desktop system.
Modern hardware and workloads are interactive generally and need better
latencies. To my knowledge even servers run mixed workloads and serve
a lot of users interactively.
On Android, Desktop, and similar systems, 120Hz is a common screen configuration. This gives tasks an 8ms deadline to do their work. 4ms is half this time, which stresses the burden of making a very correct decision at wake up more than necessary. And it makes utilizing the system effectively to maintain the best perf/watt harder. As an example, [1] tries
effectively to maintain best perf/watt harder. As an example [1] tries
to fix our definition of DVFS headroom to be a function of TICK as it
defines our worst case scenario of updating stats. The larger TICK means
we have to be overly aggressive in going into higher frequencies if we
want to ensure perf is not impacted. But if the task didn't consume all
of its slice, we lost an opportunity to use a lower frequency and save
power. Lower TICK value allows us to be smarter about our resource
allocation to balance perf and power.
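The arithmetic behind the 120Hz example can be made concrete (a back-of-the-envelope sketch; the numbers are illustrative ratios, not measurements):

```python
def frame_budget_ms(refresh_hz):
    # Time available to produce one frame.
    return 1000.0 / refresh_hz

def worst_case_staleness_fraction(tick_hz, refresh_hz):
    # Worst-case stats/cpufreq staleness is one full tick; express it
    # as a fraction of the frame budget.
    tick_ms = 1000.0 / tick_hz
    return tick_ms / frame_budget_ms(refresh_hz)

# 120 Hz display: ~8.33 ms budget per frame.
print(round(worst_case_staleness_fraction(250, 120), 2))   # -> 0.48
print(round(worst_case_staleness_fraction(1000, 120), 2))  # -> 0.12
```

At HZ=250 nearly half the frame budget can pass before the scheduler even revisits its decision; at HZ=1000 it is about 12%.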
Generally, working with ever smaller deadlines is not unique to the UI pipeline. Everything is expected to finish work sooner and be more responsive.
I believe HZ_250 was made the default as a trade-off for battery-powered devices that might not be happy with frequent TICKs potentially draining the battery unnecessarily. But to my understanding the current state of NOHZ should be good enough to alleviate these concerns. And the recent addition of RCU_LAZY further helps with keeping the TICK quiet in idle scenarios.
As pointed out to me by Saravana though, the longer TICK did indirectly
help with timer coalescing which means it could hide issues with
drivers/tasks asking for frequent timers preventing entry to deeper idle
states (4ms is a high value to allow entry to deeper idle state for many
systems). But one can argue this is a problem with these drivers/tasks.
And if the coalescing behavior is desired we can make it intentional
rather than accidental.
The faster TICK might still result in higher power, but not due to TICK activities. The system is more responsive (as intended), and it is expected that residencies at higher frequencies will be higher, as tasks were previously accidentally stuck at lower frequencies. The series in [1] attempts to improve scheduler handling of responsiveness and give users/apps a way to better communicate their needs, including opting out of getting adequate response (rampup_multiplier being 0 in the mentioned series).
Since the default is what many unwary users will end up with, ensure it matches what modern systems and workloads expect, given that our NOHZ has come a long way in keeping TICKs tamed in idle scenarios.
[1] https://lore.kernel.org/lkml/20240820163512.1096301-6-qyousef@layalina.io/
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
kernel/Kconfig.hz | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
index 38ef6d06888e..c742c9298af3 100644
--- a/kernel/Kconfig.hz
+++ b/kernel/Kconfig.hz
@@ -5,7 +5,7 @@
choice
prompt "Timer frequency"
- default HZ_250
+ default HZ_1000
help
Allows the configuration of the timer frequency. It is customary
to have the timer interrupt run at 1000 Hz but 100 Hz may be more
--
2.34.1
The 250Hz was set as a middle ground.

There are still workloads which are sensitive to cache misses and to the time spent in ticks.

You can still lose ~1% performance with higher HZ. It might not sound like much, but if you are running thousands of servers a 1% loss can be a high cost.

There are many workloads out there where 100Hz is the better choice.
On Fri, 28 Feb 2025 11:33:04 +0100 Attila Fazekas <afazekas@redhat.com> wrote:
> The 250Hz was set as a middle ground.
>
> There are still workloads which sensitive to cache misses
> and to the time spent in ticks.
>
> You still can loose ~1% performance with higher Hz,
> It might not sound much, but if you are running thousands
> of servers 1% loss can be a high cost.
>
> There are many workloads out there where 100Hz is the better choice.

Are there any real issues with changing HZ to 1000, but adding an option for the timer tick interval (in ms)? So 'jiffies' would always count milliseconds.

That would make it easy to make the actual 'clock tick' be boot time selectable (or run-time if you get brave!). The 'timer wheel' code would really need to work on actual ticks, but I doubt anything else cares.

That would allow HZ to be the same for all architectures, even though m68k might really want a 50Hz interrupt.

That is much better than any plan to make HZ a variable - which will bloat code (and with divisions) and not be valid for static initialisers.

David
On Mon, 10 Feb 2025 00:19:15 +0000 Qais Yousef <qyousef@layalina.io> wrote:
> The frequency at which TICK happens is very important from scheduler
> perspective. There's a responsiveness trade-of that for interactive
> systems the current default is set too low.

The problem I see is that most people use a kernel from one of the distributions. So you need to persuade them to change their default. Changing the default 'default' won't necessarily have any effect.

OTOH if you decouple the timer interrupt rate from HZ (or jiffies) then it becomes possible to boot time (or even run-time) change the timer interrupt rate.

So it makes much more sense to fix 'jiffies' as a 1ms counter and then configure the timer interrupt to be a number of ms/jiffies.

This would be similar to all the simplifications that came about by making the high precision timestamps (etc) ns regardless of the actual resolution on any specific hardware.

David
On 02/16/25 19:05, David Laight wrote:
> On Mon, 10 Feb 2025 00:19:15 +0000
> Qais Yousef <qyousef@layalina.io> wrote:
>
> > The frequency at which TICK happens is very important from scheduler
> > perspective. There's a responsiveness trade-of that for interactive
> > systems the current default is set too low.
>
> The problem I see is that most people use a kernel from one of the distributions.
> So you need to persuade them to change their default.
> Change the default 'default' won't necessarily have any effect.

True to some extent. I think Debian [1] relies on the kernel's default, which is a big distro. But the worry goes beyond that. I think 1K HZ is the modern sensible default for all users. It shouldn't cause power issues given NOHZ and other improvements. And the logic about context switch overhead etc is no longer valid IMHO. The sched_ext saga shows that people care a lot more about spending more time to make the right decision given the complexity of today's systems. And on the systems I work on (mobile phones) I don't see an impact on throughput. So IMHO both sides of the argument are no longer valid, but we continue to get common discussions about latencies. This won't solve all problems, but it will hopefully send the right message to ensure most users switch to this too and address one of the root causes of this common 'complaint'.

> OTOH if you decouple the timer interrupt rate from HZ (or jiffies)
> then it becomes possible to boot time (or even run-time) change
> the timer interrupt rate.
>
> So it makes much more sense to fix 'jiffies' as a 1ms counter and then
> configure the timer interrupt to be a number of ms/jiffies.
>
> This would be similar all the simplifications that came about by making
> the high precision timestamps (etc) ns regardless of the actual
> resolution on any specific hardware.

Agreed. John is trying to do something similar but review comments showcased teething issues. I did spend a long time working on converting HZ to be a variable and switched a large number of users - but sadly lost all this work when my machine died and I forgot to push.

I think moving the default to follow what most folks should really be using is the right thing for now IMHO. It doesn't force anyone to make an alternate choice if they think they know better.

[1] https://salsa.debian.org/kernel-team/linux/-/blob/debian/latest/debian/config/config#L6389
On Sun, Feb 9, 2025 at 4:19 PM Qais Yousef <qyousef@layalina.io> wrote: > > The frequency at which TICK happens is very important from scheduler > perspective. There's a responsiveness trade-of that for interactive > systems the current default is set too low. > > Having a slow TICK frequency can lead to the following shortcomings in > scheduler decisions: > > 1. Imprecise time slice > ----------------------- > > Preemption checks occur when a new task wakes up, on return from > interrupt or at TICK. If we have N tasks running on the same CPU then as > a worst case scenario these tasks will time slice every TICK regardless > of their actual slice size. > > By default base_slice ends up being 3ms on many systems. But due to TICK > being 4ms by default, tasks will end up slicing every 4ms instead in > busy scenarios. It also makes the effectiveness of reducing the > base_slice to a lower value like 2ms or 1ms pointless. It will allow new > waking tasks to preempt sooner. But it will prevent timely cycling of > tasks in busy scenarios. Which is an important and frequent scenario. > > 2. Delayed load_balance() > ------------------------- > > Scheduler task placement decision at wake up can easily become stale as > more tasks wake up. load_balance() is the correction point to ensure the > system is loaded optimally. And in the case of HMP systems tasks are > migrated to a bigger CPU to meet their compute demand. > > Newidle balance can help alleviate the problem. But the worst case > scenario is for the TICK to trigger the load_balance(). > > 3. Delayed stats update > ----------------------- > > And subsequently delayed cpufreq updates and misfit detection (the need > to move a task from little CPU to a big CPU in HMP systems). > > When a task is busy then as a worst case scenario the util signal will > update every TICK. 
Since util signal is the main driver for our > preferred governor - schedutil - and what drives EAS to decide if > a task fits a CPU or needs to migrate to a bigger CPU, these delays can > be detrimental to system responsiveness. > > ------------------------------------------------------------------------ > > Note that the worst case scenario is an important and defining > characteristic for interactive systems. It's all about the P90 and P95. > Responsiveness IMHO is no longer a characteristic of a desktop system. > Modern hardware and workloads are interactive generally and need better > latencies. To my knowledge even servers run mixed workloads and serve > a lot of users interactively. > > On Android and Desktop systems etc 120Hz is a common screen > configuration. This gives tasks 8ms deadline to do their work. 4ms is > half this time which makes the burden on making very correct decision at > wake up stressed more than necessary. And it makes utilizing the system > effectively to maintain best perf/watt harder. As an example [1] tries > to fix our definition of DVFS headroom to be a function of TICK as it > defines our worst case scenario of updating stats. The larger TICK means > we have to be overly aggressive in going into higher frequencies if we > want to ensure perf is not impacted. But if the task didn't consume all > of its slice, we lost an opportunity to use a lower frequency and save > power. Lower TICK value allows us to be smarter about our resource > allocation to balance perf and power. > > Generally workloads working with ever smaller deadlines is not unique to > UI pipeline. Everything is expected to finish work sooner and be more > responsive. > > I believe HZ_250 was the default as a trade-off for battery power > devices that might not be happy with frequent TICKS potentially draining > the battery unnecessarily. But to my understanding the current state of > NOHZ should be good enough to alleviate these concerns. 
And recent > addition of RCU_LAZY further helps with keeping TICK quite in idle > scenarios. > > As pointed out to me by Saravana though, the longer TICK did indirectly > help with timer coalescing which means it could hide issues with > drivers/tasks asking for frequent timers preventing entry to deeper idle > states (4ms is a high value to allow entry to deeper idle state for many > systems). But one can argue this is a problem with these drivers/tasks. > And if the coalescing behavior is desired we can make it intentional > rather than accidental. > > The faster TICK might still result in higher power, but not due to TICK > activities. The system is more responsive (as intended) and it is > expected the residencies in higher freqs would be higher as they were > accidentally being stuck at lower freqs. The series in [1] attempts to > improve scheduler handling of responsiveness and give users/apps a way > to better provide their needs, including opting out of getting adequate > response (rampup_multiplier being 0 in the mentioned series). > > Since the default behavior might end up on many unwary users, ensure it > matches what modern systems and workloads expect given that our NOHZ has > moved a long way to keep TICKS tamed in idle scenarios. > > [1] https://lore.kernel.org/lkml/20240820163512.1096301-6-qyousef@layalina.io/ > > Signed-off-by: Qais Yousef <qyousef@layalina.io> > --- > kernel/Kconfig.hz | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz > index 38ef6d06888e..c742c9298af3 100644 > --- a/kernel/Kconfig.hz > +++ b/kernel/Kconfig.hz > @@ -5,7 +5,7 @@ > > choice > prompt "Timer frequency" > - default HZ_250 > + default HZ_1000 This is going to mess up power for tons of IOT and low power devices. I think we should leave the default alone and set the config in the device specific defconfig. Even on Android, for some use cases, this causes ~7% CPU power increase. 
This also causes more CPU wakeups, because jiffy based timers that are set for t + 1ms, t + 2ms, t + 3ms, t + 4ms would all get grouped into a single t + 4ms HZ wakeup, but with a 1000 HZ timer it'd cause 4 separate wakeups.

I'd like to Nack this.

-Saravana

> help
> Allows the configuration of the timer frequency. It is customary
> to have the timer interrupt run at 1000 Hz but 100 Hz may be more
> --
> 2.34.1
>
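The wakeup-grouping concern can be modeled with a small sketch (simplified: each jiffy-based deadline is rounded up to the next tick boundary, ignoring the timer wheel's real bucketing and slack):

```python
import math

def wakeups(deadlines_ms, tick_ms):
    # Deadlines that round up to the same tick share one wakeup.
    return len({math.ceil(t / tick_ms) * tick_ms for t in deadlines_ms})

timers = [1, 2, 3, 4]  # timers armed for t+1ms .. t+4ms
print(wakeups(timers, 4))  # -> 1  (all coalesce onto the t+4ms tick)
print(wakeups(timers, 1))  # -> 4  (each fires on its own tick)
```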
On 02/13/25 00:24, Saravana Kannan wrote:
> > kernel/Kconfig.hz | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
> > index 38ef6d06888e..c742c9298af3 100644
> > --- a/kernel/Kconfig.hz
> > +++ b/kernel/Kconfig.hz
> > @@ -5,7 +5,7 @@
> >
> > choice
> > prompt "Timer frequency"
> > - default HZ_250
> > + default HZ_1000
>
> This is going to mess up power for tons of IOT and low power devices.

Why are you singling them out? Why is it worse for them compared to other battery powered devices?

> I think we should leave the default alone and set the config in the
> device specific defconfig. Even on Android, for some use cases, this

I'll hold the mirror and tell you to keep the default for your systems in your defconfigs. There has been a lot of discussion about sched latency and this is a common cause, especially when combined with schedutil. There are a lot of accidental behaviors going on and they are being addressed. The default should be representative of what users of all classes are after. And responsiveness has been a prime problem for a while.

> causes ~7% CPU power increase. This also causes more CPU wakeups

Have you analyzed the cause? Is this caused by something not mentioned in the commit message? Accidental behaviors are not a reason not to move on to a better default. And managing system response time (particularly with schedutil) is an ongoing area of improvement.

UiBench gets 13% and 54% fewer missed frames at the cost of 6.67% higher power. There's a big performance impact because of the long TICK.

Phoronix (thankfully!) did a comparison too. The power impact wasn't noticeable, with some big gains in some benchmarks. The things that got slightly worse were regained by enabling PREEMPT_LAZY according to the comments.

https://www.phoronix.com/news/Linux-250Hz-1000Hz-Kernel-2025

> because jiffy based timers that are set for t + 1ms, t + 2ms, t+ 3ms,
> t + 4ms would all get grouped into a t + 4ms HZ wakeup, but with 1000
> HZ timer, it'd cause 4 separate wakeups.

This has been called out in the commit message. You can't rely on accidental behavior. If this is something that you think matters a lot, you can send a patch to do this coalescing and decouple it from TICK. The right thing to do is audit the drivers that are causing a high wake rate. You can't prevent improvements to the system because there are users that rely on wrong behavior.

> I'd like to Nack this.

Please use the value you like in your defconfig if you have concerns. The new default is a good indication of what people should be using by default. But no one is forcing anyone to stick to it. The new default is what the majority of people really want today, and this patch signals this clearly but doesn't take away any other option from them.

--
Qais Yousef
On Mon, Feb 10, 2025 at 12:19:15AM +0000, Qais Yousef wrote: > The frequency at which TICK happens is very important from scheduler > perspective. There's a responsiveness trade-of that for interactive > systems the current default is set too low. Another thing that screws up pretty badly at least with pre-EEVDF CFS is the extra lag that gets added to high nice value tasks due to the coarser tick causes low nice value tasks to get an even longer time slice. I caught this when tracing Android few years ago. ISTR, this was pretty bad almost to a point of defeating fairness. Not sure if that shows with EEVDF though. > > Having a slow TICK frequency can lead to the following shortcomings in > scheduler decisions: > > 1. Imprecise time slice > ----------------------- > > Preemption checks occur when a new task wakes up, on return from > interrupt or at TICK. If we have N tasks running on the same CPU then as > a worst case scenario these tasks will time slice every TICK regardless > of their actual slice size. > > By default base_slice ends up being 3ms on many systems. But due to TICK > being 4ms by default, tasks will end up slicing every 4ms instead in > busy scenarios. It also makes the effectiveness of reducing the > base_slice to a lower value like 2ms or 1ms pointless. It will allow new > waking tasks to preempt sooner. But it will prevent timely cycling of > tasks in busy scenarios. Which is an important and frequent scenario. > > 2. Delayed load_balance() > ------------------------- > > Scheduler task placement decision at wake up can easily become stale as > more tasks wake up. load_balance() is the correction point to ensure the > system is loaded optimally. And in the case of HMP systems tasks are > migrated to a bigger CPU to meet their compute demand. > > Newidle balance can help alleviate the problem. But the worst case > scenario is for the TICK to trigger the load_balance(). > > 3. 
Delayed stats update > ----------------------- > > And subsequently delayed cpufreq updates and misfit detection (the need > to move a task from little CPU to a big CPU in HMP systems). > > When a task is busy then as a worst case scenario the util signal will > update every TICK. Since util signal is the main driver for our > preferred governor - schedutil - and what drives EAS to decide if > a task fits a CPU or needs to migrate to a bigger CPU, these delays can > be detrimental to system responsiveness. > > ------------------------------------------------------------------------ > > Note that the worst case scenario is an important and defining > characteristic for interactive systems. It's all about the P90 and P95. > Responsiveness IMHO is no longer a characteristic of a desktop system. > Modern hardware and workloads are interactive generally and need better > latencies. To my knowledge even servers run mixed workloads and serve > a lot of users interactively. > > On Android and Desktop systems etc 120Hz is a common screen > configuration. This gives tasks 8ms deadline to do their work. 4ms is > half this time which makes the burden on making very correct decision at > wake up stressed more than necessary. And it makes utilizing the system > effectively to maintain best perf/watt harder. As an example [1] tries > to fix our definition of DVFS headroom to be a function of TICK as it > defines our worst case scenario of updating stats. The larger TICK means > we have to be overly aggressive in going into higher frequencies if we > want to ensure perf is not impacted. But if the task didn't consume all > of its slice, we lost an opportunity to use a lower frequency and save > power. Lower TICK value allows us to be smarter about our resource > allocation to balance perf and power. > > Generally workloads working with ever smaller deadlines is not unique to > UI pipeline. Everything is expected to finish work sooner and be more > responsive. 
> > I believe HZ_250 was the default as a trade-off for battery power > devices that might not be happy with frequent TICKS potentially draining > the battery unnecessarily. But to my understanding the current state of Actually, on x86, me and Steve did some debug on Chromebooks and we found that HZ_250 actually increased power versus higher HZ. This was because cpuidle governor changes C states on the tick, and by making it less frequent, the CPU could be in a shallow C state for longer. > NOHZ should be good enough to alleviate these concerns. And recent > addition of RCU_LAZY further helps with keeping TICK quite in idle > scenarios. > > As pointed out to me by Saravana though, the longer TICK did indirectly > help with timer coalescing which means it could hide issues with > drivers/tasks asking for frequent timers preventing entry to deeper idle > states (4ms is a high value to allow entry to deeper idle state for many > systems). But one can argue this is a problem with these drivers/tasks. > And if the coalescing behavior is desired we can make it intentional > rather than accidental. I am not sure how much coalescing of timer-wheel events matter. My impression is coalescing matters only for HRtimer since those can be more granular. > > The faster TICK might still result in higher power, but not due to TICK > activities. The system is more responsive (as intended) and it is > expected the residencies in higher freqs would be higher as they were > accidentally being stuck at lower freqs. The series in [1] attempts to > improve scheduler handling of responsiveness and give users/apps a way > to better provide their needs, including opting out of getting adequate > response (rampup_multiplier being 0 in the mentioned series). > > Since the default behavior might end up on many unwary users, ensure it > matches what modern systems and workloads expect given that our NOHZ has > moved a long way to keep TICKS tamed in idle scenarios. 
>
> [1] https://lore.kernel.org/lkml/20240820163512.1096301-6-qyousef@layalina.io/
>
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
> kernel/Kconfig.hz | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
> index 38ef6d06888e..c742c9298af3 100644
> --- a/kernel/Kconfig.hz
> +++ b/kernel/Kconfig.hz
> @@ -5,7 +5,7 @@
>
> choice
> prompt "Timer frequency"
> - default HZ_250
> + default HZ_1000

It's fine with me, but I wonder who else cares about the HZ_250 default. I certainly don't. And if someone really wants it for an odd reason, they can just adjust the config for themselves.

Acked-by: Joel Fernandes <joelagnelf@nvidia.com>

thanks,
- Joel

> help
> Allows the configuration of the timer frequency. It is customary
> to have the timer interrupt run at 1000 Hz but 100 Hz may be more
> --
> 2.34.1
>
On Wed, Feb 12, 2025 at 09:50:54AM -0500, Joel Fernandes wrote: > On Mon, Feb 10, 2025 at 12:19:15AM +0000, Qais Yousef wrote: ... > > I believe HZ_250 was the default as a trade-off for battery power > > devices that might not be happy with frequent TICKS potentially draining > > the battery unnecessarily. But to my understanding the current state of > > Actually, on x86, me and Steve did some debug on Chromebooks and we found > that HZ_250 actually increased power versus higher HZ. This was because > cpuidle governor changes C states on the tick, and by making it less > frequent, the CPU could be in a shallow C state for longer. FWIW, I found the same about power consumption when we decided to switch to CONFIG_HZ=1000 in the Ubuntu kernel: https://discourse.ubuntu.com/t/enable-low-latency-features-in-the-generic-ubuntu-kernel-for-24-04/42255 -Andrea
* Andrea Righi <arighi@nvidia.com> wrote: > On Wed, Feb 12, 2025 at 09:50:54AM -0500, Joel Fernandes wrote: > > On Mon, Feb 10, 2025 at 12:19:15AM +0000, Qais Yousef wrote: > ... > > > I believe HZ_250 was the default as a trade-off for battery power > > > devices that might not be happy with frequent TICKS potentially draining > > > the battery unnecessarily. But to my understanding the current state of > > > > Actually, on x86, me and Steve did some debug on Chromebooks and we found > > that HZ_250 actually increased power versus higher HZ. This was because > > cpuidle governor changes C states on the tick, and by making it less > > frequent, the CPU could be in a shallow C state for longer. > > FWIW, I found the same about power consumption when we decided to switch to > CONFIG_HZ=1000 in the Ubuntu kernel: > https://discourse.ubuntu.com/t/enable-low-latency-features-in-the-generic-ubuntu-kernel-for-24-04/42255 The "HZ=1000 reduces power consumption or keeps it the same" is actually a pretty good argument to change the default to HZ=1000. These experiments and numbers (if any) should be incorporated in the changelog prominently - as actual data and the Kconfig decisions made by major distros will, most of the time, be superior to meta analysis that seems to be the changelog right now. Thanks, Ingo
On Sun, Feb 23, 2025, Ingo Molnar wrote:
>
> * Andrea Righi <arighi@nvidia.com> wrote:
>
> > On Wed, Feb 12, 2025 at 09:50:54AM -0500, Joel Fernandes wrote:
> > > On Mon, Feb 10, 2025 at 12:19:15AM +0000, Qais Yousef wrote:
> > ...
> > > > I believe HZ_250 was the default as a trade-off for battery power
> > > > devices that might not be happy with frequent TICKS potentially draining
> > > > the battery unnecessarily. But to my understanding the current state of
> > >
> > > Actually, on x86, me and Steve did some debug on Chromebooks and we found
> > > that HZ_250 actually increased power versus higher HZ. This was because
> > > cpuidle governor changes C states on the tick, and by making it less
> > > frequent, the CPU could be in a shallow C state for longer.
> >
> > FWIW, I found the same about power consumption when we decided to switch to
> > CONFIG_HZ=1000 in the Ubuntu kernel:
> > https://discourse.ubuntu.com/t/enable-low-latency-features-in-the-generic-ubuntu-kernel-for-24-04/42255
>
> The "HZ=1000 reduces power consumption or keeps it the same" is
> actually a pretty good argument to change the default to HZ=1000.
>
> These experiments and numbers (if any) should be incorporated in the
> changelog prominently - as actual data and the Kconfig decisions made
> by major distros will, most of the time, be superior to meta analysis
> that seems to be the changelog right now.

Speaking of which, has anyone done analysis when running as a VM? I don't know about other architectures, but on x86 at least, the tick (or more specifically, (hr)timers) is the number one source of VM-Exits. Off the cuff, I wouldn't expect any meaningful difference in performance, but I also wouldn't be surprised if running in a VM behaves differently than running on bare metal. E.g. except for slice-of-hardware setups, MWAIT won't be exposed to the guest and thus the cpuidle governor (in the guest) effectively has a binary decision (to hlt, or not to hlt).
On 02/23/25 11:00, Ingo Molnar wrote: > > * Andrea Righi <arighi@nvidia.com> wrote: > > > On Wed, Feb 12, 2025 at 09:50:54AM -0500, Joel Fernandes wrote: > > > On Mon, Feb 10, 2025 at 12:19:15AM +0000, Qais Yousef wrote: > > ... > > > > I believe HZ_250 was the default as a trade-off for battery power > > > > devices that might not be happy with frequent TICKS potentially draining > > > > the battery unnecessarily. But to my understanding the current state of > > > > > > Actually, on x86, me and Steve did some debug on Chromebooks and we found > > > that HZ_250 actually increased power versus higher HZ. This was because > > > cpuidle governor changes C states on the tick, and by making it less > > > frequent, the CPU could be in a shallow C state for longer. > > > > FWIW, I found the same about power consumption when we decided to switch to > > CONFIG_HZ=1000 in the Ubuntu kernel: > > https://discourse.ubuntu.com/t/enable-low-latency-features-in-the-generic-ubuntu-kernel-for-24-04/42255 Thanks for sharing the data Andrea! > > The "HZ=1000 reduces power consumption or keeps it the same" is > actually a pretty good argument to change the default to HZ=1000. > > These experiments and numbers (if any) should be incorporated in the > changelog prominently - as actual data and the Kconfig decisions made > by major distros will, most of the time, be superior to meta analysis > that seems to be the changelog right now. I will update the commit message to incorporate data and the feedback received. Thanks! -- Qais Yousef
On Mon, 24 Feb 2025 at 00:21, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 02/23/25 11:00, Ingo Molnar wrote:
> >
> > * Andrea Righi <arighi@nvidia.com> wrote:
> >
> > > On Wed, Feb 12, 2025 at 09:50:54AM -0500, Joel Fernandes wrote:
> > > > On Mon, Feb 10, 2025 at 12:19:15AM +0000, Qais Yousef wrote:
> > > ...
> > > > > I believe HZ_250 was the default as a trade-off for battery power
> > > > > devices that might not be happy with frequent TICKS potentially draining
> > > > > the battery unnecessarily. But to my understanding the current state of
> > > >
> > > > Actually, on x86, me and Steve did some debug on Chromebooks and we found
> > > > that HZ_250 actually increased power versus higher HZ. This was because
> > > > cpuidle governor changes C states on the tick, and by making it less
> > > > frequent, the CPU could be in a shallow C state for longer.
> > >
> > > FWIW, I found the same about power consumption when we decided to switch to
> > > CONFIG_HZ=1000 in the Ubuntu kernel:
> > > https://discourse.ubuntu.com/t/enable-low-latency-features-in-the-generic-ubuntu-kernel-for-24-04/42255
>
> Thanks for sharing the data Andrea!

I don't have power figures to share, but I'm aligned with the fact that keeping HZ=250 (4ms) is not sustainable when it comes to handling task scheduling on devices with a 120 fps (~8ms) constraint.

FWIW,

Acked-by: Vincent Guittot <vincent.guittot@linaro.org>

> > The "HZ=1000 reduces power consumption or keeps it the same" is
> > actually a pretty good argument to change the default to HZ=1000.
> >
> > These experiments and numbers (if any) should be incorporated in the
> > changelog prominently - as actual data and the Kconfig decisions made
> > by major distros will, most of the time, be superior to meta analysis
> > that seems to be the changelog right now.
>
> I will update the commit message to incorporate data and the feedback received.
>
> Thanks!
>
> --
> Qais Yousef
On 02/12/25 09:50, Joel Fernandes wrote:
> On Mon, Feb 10, 2025 at 12:19:15AM +0000, Qais Yousef wrote:
> > The frequency at which TICK happens is very important from scheduler
> > perspective. There's a responsiveness trade-of that for interactive
> > systems the current default is set too low.
>
> Another thing that screws up pretty badly at least with pre-EEVDF CFS is the
> extra lag that gets added to high nice value tasks due to the coarser tick
> causes low nice value tasks to get an even longer time slice. I caught this
> when tracing Android few years ago. ISTR, this was pretty bad almost to a
> point of defeating fairness. Not sure if that shows with EEVDF though.

There was a bug that Vincent fixed with sched_period in extreme scenarios.
But generally starvation problems are common with a 4ms TICK when the CPU
is overloaded, as it could be a long time before a task gets a chance to
run again.

> > NOHZ should be good enough to alleviate these concerns. And recent
> > addition of RCU_LAZY further helps with keeping TICK quite in idle
> > scenarios.
> >
> > As pointed out to me by Saravana though, the longer TICK did indirectly
> > help with timer coalescing which means it could hide issues with
> > drivers/tasks asking for frequent timers preventing entry to deeper idle
> > states (4ms is a high value to allow entry to deeper idle state for many
> > systems). But one can argue this is a problem with these drivers/tasks.
> > And if the coalescing behavior is desired we can make it intentional
> > rather than accidental.
>
> I am not sure how much coalescing of timer-wheel events matter. My impression
> is coalescing matters only for HRtimer since those can be more granular.

Bad usage of English on my side. I just meant that they trigger accurately
now, whereas they used to be accidentally deferred to the next jiffy, which
has 4ms granularity.
> >  kernel/Kconfig.hz | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
> > index 38ef6d06888e..c742c9298af3 100644
> > --- a/kernel/Kconfig.hz
> > +++ b/kernel/Kconfig.hz
> > @@ -5,7 +5,7 @@
> >
> >  choice
> >  	prompt "Timer frequency"
> > -	default HZ_250
> > +	default HZ_1000
>
> Its fine with me, but I wonder who else cares about HZ_250 default. I
> certainly don't. And if someone really wants it for an odd reason, then can
> just adjust the config for themselves.

I think it is a common source of latency and performance issues, and the
arguments for throughput and power are now outdated IMHO. Modern hardware
and workloads are different, and it is time to modernize some default
values to better suit the wider audience.

> Acked-by: Joel Fernandes <joelagnelf@nvidia.com>

Thanks!

--
Qais Yousef