[PATCH] sched: Further restrict the preemption modes

Posted by Peter Zijlstra 1 month, 2 weeks ago

[ with 6.18 being an LTS release, it might be a good time for this ]

The introduction of PREEMPT_LAZY was for multiple reasons:

  - PREEMPT_RT suffered from over-scheduling, hurting performance compared to
    !PREEMPT_RT.

  - the introduction of (more) features that rely on preemption; like
    folio_zero_user() which can do large memset() without preemption checks.

    (Xen already had a horrible hack to deal with long running hypercalls)

  - the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo
    cult or in response to hard-to-replicate workloads.

By moving to a model that is fundamentally preemptible, these things become
manageable and we avoid needing to introduce more horrible hacks.

Since this is a requirement, limit PREEMPT_NONE to architectures that do not
support preemption at all. Further limit PREEMPT_VOLUNTARY to those
architectures that do not yet have PREEMPT_LAZY support (with the eventual goal
to make this the empty set and completely remove voluntary preemption and
cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.)

This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
x86) with only two preemption models: full and lazy (like PREEMPT_RT).

While Lazy has been the recommended setting for a while, not all distributions
have managed to make the switch yet. Force things along. Keep the patch minimal
in case of hard-to-address regressions that might pop up.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/Kconfig.preempt |    3 +++
 kernel/sched/core.c    |    2 +-
 kernel/sched/debug.c   |    2 +-
 3 files changed, 5 insertions(+), 2 deletions(-)

--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -16,11 +16,13 @@ config ARCH_HAS_PREEMPT_LAZY
 
 choice
 	prompt "Preemption Model"
+	default PREEMPT_LAZY if ARCH_HAS_PREEMPT_LAZY
 	default PREEMPT_NONE
 
 config PREEMPT_NONE
 	bool "No Forced Preemption (Server)"
 	depends on !PREEMPT_RT
+	depends on ARCH_NO_PREEMPT
 	select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
 	help
 	  This is the traditional Linux preemption model, geared towards
@@ -35,6 +37,7 @@ config PREEMPT_NONE
 
 config PREEMPT_VOLUNTARY
 	bool "Voluntary Kernel Preemption (Desktop)"
+	depends on !ARCH_HAS_PREEMPT_LAZY
 	depends on !ARCH_NO_PREEMPT
 	depends on !PREEMPT_RT
 	select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7553,7 +7553,7 @@ int preempt_dynamic_mode = preempt_dynam
 
 int sched_dynamic_mode(const char *str)
 {
-# ifndef CONFIG_PREEMPT_RT
+# if !(defined(CONFIG_PREEMPT_RT) || defined(CONFIG_ARCH_HAS_PREEMPT_LAZY))
 	if (!strcmp(str, "none"))
 		return preempt_dynamic_none;
 
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -243,7 +243,7 @@ static ssize_t sched_dynamic_write(struc
 
 static int sched_dynamic_show(struct seq_file *m, void *v)
 {
-	int i = IS_ENABLED(CONFIG_PREEMPT_RT) * 2;
+	int i = (IS_ENABLED(CONFIG_PREEMPT_RT) || IS_ENABLED(CONFIG_ARCH_HAS_PREEMPT_LAZY)) * 2;
 	int j;
 
 	/* Count entries in NULL terminated preempt_modes */
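
For reference, the "* 2" start index works because preempt_modes[] lists the
non-preemptible modes first; starting at 2 simply hides "none" and
"voluntary" from the listing. A minimal user-space sketch of that indexing
(assuming the in-tree table is ordered none, voluntary, full, lazy -- this
is not the kernel code itself):

  #include <stdbool.h>
  #include <stdio.h>

  /* Assumed to match the ordering of preempt_modes[] in kernel/sched/debug.c. */
  static const char * const preempt_modes[] = {
  	"none", "voluntary", "full", "lazy", NULL,
  };

  static void show_modes(bool restricted)
  {
  	/* (PREEMPT_RT || ARCH_HAS_PREEMPT_LAZY) * 2 skips the first two entries. */
  	for (int i = restricted * 2; preempt_modes[i]; i++)
  		printf("%s ", preempt_modes[i]);
  	printf("\n");
  }

  int main(void)
  {
  	show_modes(false);	/* prints: none voluntary full lazy */
  	show_modes(true);	/* prints: full lazy */
  	return 0;
  }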
Re: [PATCH] sched: Further restrict the preemption modes
Posted by Shrikanth Hegde 4 weeks, 1 day ago
Hi Peter.

On 12/19/25 3:45 PM, Peter Zijlstra wrote:
> 
> [ with 6.18 being an LTS release, it might be a good time for this ]
> 
> The introduction of PREEMPT_LAZY was for multiple reasons:
> 
>    - PREEMPT_RT suffered from over-scheduling, hurting performance compared to
>      !PREEMPT_RT.
> 
>    - the introduction of (more) features that rely on preemption; like
>      folio_zero_user() which can do large memset() without preemption checks.
> 
>      (Xen already had a horrible hack to deal with long running hypercalls)
> 
>    - the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo
>      cult or in response to poor to replicate workloads.
> 
> By moving to a model that is fundamentally preemptable these things become
> manageable and avoid needing to introduce more horrible hacks.
> 
> Since this is a requirement; limit PREEMPT_NONE to architectures that do not
> support preemption at all. Further limit PREEMPT_VOLUNTARY to those
> architectures that do not yet have PREEMPT_LAZY support (with the eventual goal
> to make this the empty set and completely remove voluntary preemption and
> cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.)
> 
> This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
> x86) with only two preemption models: full and lazy (like PREEMPT_RT).
> 
> While Lazy has been the recommended setting for a while, not all distributions
> have managed to make the switch yet. Force things along. Keep the patch minimal
> in case of hard to address regressions that might pop up.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>   kernel/Kconfig.preempt |    3 +++
>   kernel/sched/core.c    |    2 +-
>   kernel/sched/debug.c   |    2 +-
>   3 files changed, 5 insertions(+), 2 deletions(-)
> 
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -16,11 +16,13 @@ config ARCH_HAS_PREEMPT_LAZY
>   
>   choice
>   	prompt "Preemption Model"
> +	default PREEMPT_LAZY if ARCH_HAS_PREEMPT_LAZY
>   	default PREEMPT_NONE
>   
>   config PREEMPT_NONE
>   	bool "No Forced Preemption (Server)"
>   	depends on !PREEMPT_RT
> +	depends on ARCH_NO_PREEMPT
>   	select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
>   	help
>   	  This is the traditional Linux preemption model, geared towards
> @@ -35,6 +37,7 @@ config PREEMPT_NONE
>   
>   config PREEMPT_VOLUNTARY
>   	bool "Voluntary Kernel Preemption (Desktop)"
> +	depends on !ARCH_HAS_PREEMPT_LAZY
>   	depends on !ARCH_NO_PREEMPT
>   	depends on !PREEMPT_RT
>   	select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7553,7 +7553,7 @@ int preempt_dynamic_mode = preempt_dynam
>   
>   int sched_dynamic_mode(const char *str)
>   {
> -# ifndef CONFIG_PREEMPT_RT
> +# if !(defined(CONFIG_PREEMPT_RT) || defined(CONFIG_ARCH_HAS_PREEMPT_LAZY))
>   	if (!strcmp(str, "none"))
>   		return preempt_dynamic_none;
>   
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -243,7 +243,7 @@ static ssize_t sched_dynamic_write(struc
>   
>   static int sched_dynamic_show(struct seq_file *m, void *v)
>   {
> -	int i = IS_ENABLED(CONFIG_PREEMPT_RT) * 2;
> +	int i = (IS_ENABLED(CONFIG_PREEMPT_RT) || IS_ENABLED(CONFIG_ARCH_HAS_PREEMPT_LAZY)) * 2;
>   	int j;
>   
>   	/* Count entries in NULL terminated preempt_modes */

Maybe only change the default to LAZY, but keep the other options available
via dynamic update?

- When the kernel changes to lazy as the default, the scheduling pattern
can change and it may affect workloads. Having the ability to dynamically
switch to none/voluntary could help one figure out where it is regressing
(see the sketch below). We could document cases where a regression is
expected.

- With preempt=full/lazy we will likely never see softlockups. How are we
going to find overly long kernel paths (some may be by design, some may be
bugs) apart from observing workload regressions?
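
For example (assuming CONFIG_PREEMPT_DYNAMIC=y and debugfs mounted), this is
the kind of runtime switch I mean -- which the sched_dynamic_mode() change
would now reject on LAZY-capable architectures:

  # the current mode is shown in parentheses, e.g.:
  $ cat /sys/kernel/debug/sched/preempt
  none voluntary (full) lazy

  # temporarily fall back while chasing a scheduling-pattern regression
  $ echo voluntary > /sys/kernel/debug/sched/preempt

  # or pin it from the kernel command line
  preempt=voluntary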


Also, is the softlockup code of any use with preempt=full/lazy?
Re: [PATCH] sched: Further restrict the preemption modes
Posted by Steven Rostedt 1 month ago
On Fri, 19 Dec 2025 11:15:02 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -16,11 +16,13 @@ config ARCH_HAS_PREEMPT_LAZY
>  
>  choice
>  	prompt "Preemption Model"
> +	default PREEMPT_LAZY if ARCH_HAS_PREEMPT_LAZY
>  	default PREEMPT_NONE

I think you can just make this:

	default PREEMPT_LAZY

and remove the PREEMPT_NONE.

As PREEMPT_NONE now depends on ARCH_NO_PREEMPT and all the other options
depend on !ARCH_NO_PREEMPT, the default will be PREEMPT_LAZY when it is
available, and the choice will only fall back to PREEMPT_NONE when
PREEMPT_NONE is the only option available.
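
I.e. the head of the choice would then be just (a sketch on top of this
patch):

	choice
		prompt "Preemption Model"
		default PREEMPT_LAZY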

I added default PREEMPT_LAZY and did a:

   $ mkdir /tmp/build
   $ make O=/tmp/build ARCH=alpha defconfig

And the result is:

CONFIG_PREEMPT_NONE_BUILD=y
CONFIG_PREEMPT_NONE=y

-- Steve


>  
>  config PREEMPT_NONE
>  	bool "No Forced Preemption (Server)"
>  	depends on !PREEMPT_RT
> +	depends on ARCH_NO_PREEMPT
>  	select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
>  	help
>  	  This is the traditional Linux preemption model, geared towards
> @@ -35,6 +37,7 @@ config PREEMPT_NONE
>  
>  config PREEMPT_VOLUNTARY
>  	bool "Voluntary Kernel Preemption (Desktop)"
> +	depends on !ARCH_HAS_PREEMPT_LAZY
>  	depends on !ARCH_NO_PREEMPT
>  	depends on !PREEMPT_RT
>  	select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
Re: [PATCH] sched: Further restrict the preemption modes
Posted by Valentin Schneider 1 month ago
On 19/12/25 11:15, Peter Zijlstra wrote:
> [ with 6.18 being an LTS release, it might be a good time for this ]
>
> The introduction of PREEMPT_LAZY was for multiple reasons:
>
>   - PREEMPT_RT suffered from over-scheduling, hurting performance compared to
>     !PREEMPT_RT.
>
>   - the introduction of (more) features that rely on preemption; like
>     folio_zero_user() which can do large memset() without preemption checks.
>
>     (Xen already had a horrible hack to deal with long running hypercalls)
>
>   - the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo
>     cult or in response to poor to replicate workloads.
>
> By moving to a model that is fundamentally preemptable these things become
> manageable and avoid needing to introduce more horrible hacks.
>
> Since this is a requirement; limit PREEMPT_NONE to architectures that do not
> support preemption at all. Further limit PREEMPT_VOLUNTARY to those
> architectures that do not yet have PREEMPT_LAZY support (with the eventual goal
> to make this the empty set and completely remove voluntary preemption and
> cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.)
>
> This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
> x86) with only two preemption models: full and lazy (like PREEMPT_RT).
>
> While Lazy has been the recommended setting for a while, not all distributions
> have managed to make the switch yet. Force things along. Keep the patch minimal
> in case of hard to address regressions that might pop up.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Valentin Schneider <vschneid@redhat.com>
[tip: sched/core] sched: Further restrict the preemption modes
Posted by tip-bot2 for Peter Zijlstra 3 weeks, 5 days ago
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     7dadeaa6e851e7d67733f3e24fc53ee107781d0f
Gitweb:        https://git.kernel.org/tip/7dadeaa6e851e7d67733f3e24fc53ee107781d0f
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Thu, 18 Dec 2025 15:25:10 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 08 Jan 2026 12:43:57 +01:00

sched: Further restrict the preemption modes

The introduction of PREEMPT_LAZY was for multiple reasons:

  - PREEMPT_RT suffered from over-scheduling, hurting performance compared to
    !PREEMPT_RT.

  - the introduction of (more) features that rely on preemption; like
    folio_zero_user() which can do large memset() without preemption checks.

    (Xen already had a horrible hack to deal with long running hypercalls)

  - the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo
    cult or in response to hard-to-replicate workloads.

By moving to a model that is fundamentally preemptible, these things become
manageable and we avoid needing to introduce more horrible hacks.

Since this is a requirement, limit PREEMPT_NONE to architectures that do not
support preemption at all. Further limit PREEMPT_VOLUNTARY to those
architectures that do not yet have PREEMPT_LAZY support (with the eventual goal
to make this the empty set and completely remove voluntary preemption and
cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.)

This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
x86) with only two preemption models: full and lazy.

While Lazy has been the recommended setting for a while, not all distributions
have managed to make the switch yet. Force things along. Keep the patch minimal
in case of hard-to-address regressions that might pop up.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://patch.msgid.link/20251219101502.GB1132199@noisy.programming.kicks-ass.net
---
 kernel/Kconfig.preempt | 3 +++
 kernel/sched/core.c    | 2 +-
 kernel/sched/debug.c   | 2 +-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index da32680..88c594c 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -16,11 +16,13 @@ config ARCH_HAS_PREEMPT_LAZY
 
 choice
 	prompt "Preemption Model"
+	default PREEMPT_LAZY if ARCH_HAS_PREEMPT_LAZY
 	default PREEMPT_NONE
 
 config PREEMPT_NONE
 	bool "No Forced Preemption (Server)"
 	depends on !PREEMPT_RT
+	depends on ARCH_NO_PREEMPT
 	select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
 	help
 	  This is the traditional Linux preemption model, geared towards
@@ -35,6 +37,7 @@ config PREEMPT_NONE
 
 config PREEMPT_VOLUNTARY
 	bool "Voluntary Kernel Preemption (Desktop)"
+	depends on !ARCH_HAS_PREEMPT_LAZY
 	depends on !ARCH_NO_PREEMPT
 	depends on !PREEMPT_RT
 	select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b17d8e..fa72075 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7553,7 +7553,7 @@ int preempt_dynamic_mode = preempt_dynamic_undefined;
 
 int sched_dynamic_mode(const char *str)
 {
-# ifndef CONFIG_PREEMPT_RT
+# if !(defined(CONFIG_PREEMPT_RT) || defined(CONFIG_ARCH_HAS_PREEMPT_LAZY))
 	if (!strcmp(str, "none"))
 		return preempt_dynamic_none;
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 41caa22..5f9b771 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -243,7 +243,7 @@ static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
 
 static int sched_dynamic_show(struct seq_file *m, void *v)
 {
-	int i = IS_ENABLED(CONFIG_PREEMPT_RT) * 2;
+	int i = (IS_ENABLED(CONFIG_PREEMPT_RT) || IS_ENABLED(CONFIG_ARCH_HAS_PREEMPT_LAZY)) * 2;
 	int j;
 
 	/* Count entries in NULL terminated preempt_modes */