[PATCH 1/4] Introducing qpw_lock() and per-cpu queue & flush work

Marcelo Tosatti posted 4 patches 1 day, 3 hours ago
Posted by Marcelo Tosatti 1 day, 3 hours ago
Some places in the kernel implement a parallel programming strategy
consisting of local_lock()s for most of the work, with the rare remote
operations scheduled on the target cpu. This keeps cache bouncing low, since
cachelines tend to stay mostly local, and avoids the cost of locks in non-RT
kernels, even though the very few remote operations will be expensive due
to scheduling overhead.

On the other hand, for RT workloads this can represent a problem:
scheduling work on remote cpus that are executing low-latency tasks
is undesirable and can introduce unexpected deadline misses.

It's interesting, though, that local_lock()s in RT kernels become
spinlock()s. We can make use of those to avoid scheduling work on a remote
cpu, by directly updating another cpu's per-cpu structure while holding
its spinlock().

In order to do that, it's necessary to introduce a new set of functions
that make it possible to acquire another cpu's per-cpu "local" lock
(qpw_{un,}lock*()), together with the corresponding queue_percpu_work_on()
and flush_percpu_work() helpers to run the remote work.

Users of non-RT kernels with low-latency requirements can select
similar functionality by using the CONFIG_QPW compile-time option.

On CONFIG_QPW-disabled kernels, no changes are expected, as every
one of the introduced helpers works exactly the same as the current
implementation:
qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
queue_percpu_work_on()  ->  queue_work_on()
flush_percpu_work()     ->  flush_work()

For QPW-enabled kernels, though, qpw_{un,}lock*() will use the extra
cpu parameter to select the correct per-cpu structure to work on,
and acquire the spinlock for that cpu.
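
As an illustration, a direct remote update could look roughly like the
sketch below. All names here are hypothetical (not part of this patch),
and the exact lock expression follows whatever the converted local_lock()
call site already passes:

	/* Hypothetical per-cpu data, protected by a qpw_lock_t member
	 * (assumes <linux/qpw.h>). */
	struct my_pcp {
		qpw_lock_t lock;
		unsigned long count;
	};
	static DEFINE_PER_CPU(struct my_pcp, my_pcp);

	/* With QPW enabled, take target_cpu's spinlock and update that
	 * cpu's per-cpu data directly, without scheduling work on it. */
	static void my_remote_inc(int target_cpu)
	{
		qpw_lock(&my_pcp.lock, target_cpu);
		per_cpu_ptr(&my_pcp, target_cpu)->count++;
		qpw_unlock(&my_pcp.lock, target_cpu);
	}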

queue_percpu_work_on() will just call the requested function on the current
cpu, where it will operate on another cpu's per-cpu object. Since the
local_locks() become spinlock()s in QPW-enabled kernels, this is safe
to do.

flush_percpu_work() then becomes a no-op since no work is actually
scheduled on a remote cpu.
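
Roughly, a caller that wants to drain every cpu's data could then look like
the sketch below (again with hypothetical names; my_drain_fn() is the
converted work function sketched after the next paragraph):

	static DEFINE_PER_CPU(struct qpw_struct, my_qpw);

	static void my_drain_all(struct workqueue_struct *wq)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			struct qpw_struct *qpw = per_cpu_ptr(&my_qpw, cpu);

			INIT_QPW(qpw, my_drain_fn, cpu);
			/* QPW: runs my_drain_fn() right here on the caller;
			 * otherwise it is queued on @cpu, as with
			 * queue_work_on(). */
			queue_percpu_work_on(cpu, wq, qpw);
		}

		/* QPW: no-op, the work already ran above. */
		for_each_online_cpu(cpu)
			flush_percpu_work(per_cpu_ptr(&my_qpw, cpu));
	}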

Some minimal code rework is needed in order to make this mechanism work:
the calls to local_{un,}lock*() in the functions that are currently
scheduled on remote cpus need to be replaced by qpw_{un,}lock*(), so in
QPW-enabled kernels they can reference a different cpu. It's also
necessary to use a qpw_struct instead of a work_struct, but it just
contains a work_struct and, in CONFIG_QPW, the target cpu.
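
Continuing the hypothetical example, the converted work function referenced
above could look roughly like this (a sketch only; the "was:" comments show
the pre-conversion calls):

	static void my_drain_fn(struct work_struct *work)
	{
		/* Target cpu; not necessarily the local one under QPW. */
		int cpu = qpw_get_cpu(work);
		struct my_pcp *pcp = per_cpu_ptr(&my_pcp, cpu);	/* was: this_cpu_ptr() */

		qpw_lock(&my_pcp.lock, cpu);	/* was: local_lock(&my_pcp.lock) */
		pcp->count = 0;
		qpw_unlock(&my_pcp.lock, cpu);	/* was: local_unlock(&my_pcp.lock) */
	}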

This should have almost no impact on non-CONFIG_QPW kernels: a few
this_cpu_ptr() calls become per_cpu_ptr(..., smp_processor_id()).

On CONFIG_QPW kernels, this should avoid deadline misses by
removing scheduling noise.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |   10 +
 Documentation/locking/qpwlocks.rst              |   63 +++++++
 MAINTAINERS                                     |    6 
 include/linux/qpw.h                             |  190 ++++++++++++++++++++++++
 init/Kconfig                                    |   35 ++++
 kernel/Makefile                                 |    2 
 kernel/qpw.c                                    |   26 +++
 7 files changed, 332 insertions(+)
 create mode 100644 include/linux/qpw.h
 create mode 100644 kernel/qpw.c

Index: slab/Documentation/admin-guide/kernel-parameters.txt
===================================================================
--- slab.orig/Documentation/admin-guide/kernel-parameters.txt
+++ slab/Documentation/admin-guide/kernel-parameters.txt
@@ -2819,6 +2819,16 @@ Kernel parameters
 
 			The format of <cpu-list> is described above.
 
+	qpw=		[KNL,SMP] Select the behavior of the per-CPU resource
+			sharing and remote interference mechanism on a kernel
+			built with CONFIG_QPW.
+			Format: { "0" | "1" }
+			0 - local_lock() + queue_work_on(remote_cpu)
+			1 - spin_lock() for both local and remote operations
+
+			Selecting 1 may be interesting for systems that want to
+			avoid the interruptions & context switches caused by IPIs.
+
 	iucv=		[HW,NET]
 
 	ivrs_ioapic	[HW,X86-64]
Index: slab/MAINTAINERS
===================================================================
--- slab.orig/MAINTAINERS
+++ slab/MAINTAINERS
@@ -21291,6 +21291,12 @@ F:	Documentation/networking/device_drive
 F:	drivers/bus/fsl-mc/
 F:	include/uapi/linux/fsl_mc.h
 
+QPW
+M:	Leonardo Bras <leobras@redhat.com>
+S:	Supported
+F:	include/linux/qpw.h
+F:	kernel/qpw.c
+
 QT1010 MEDIA DRIVER
 L:	linux-media@vger.kernel.org
 S:	Orphan
Index: slab/include/linux/qpw.h
===================================================================
--- /dev/null
+++ slab/include/linux/qpw.h
@@ -0,0 +1,190 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_QPW_H
+#define _LINUX_QPW_H
+
+#include "linux/spinlock.h"
+#include "linux/local_lock.h"
+#include "linux/workqueue.h"
+
+#ifndef CONFIG_QPW
+
+typedef local_lock_t qpw_lock_t;
+typedef local_trylock_t qpw_trylock_t;
+
+struct qpw_struct {
+	struct work_struct work;
+};
+
+#define qpw_lock_init(lock)			\
+	local_lock_init(lock)
+
+#define qpw_trylock_init(lock)			\
+	local_trylock_init(lock)
+
+#define qpw_lock(lock, cpu)			\
+	local_lock(lock)
+
+#define qpw_lock_irqsave(lock, flags, cpu)	\
+	local_lock_irqsave(lock, flags)
+
+#define qpw_trylock(lock, cpu)			\
+	local_trylock(lock)
+
+#define qpw_trylock_irqsave(lock, flags, cpu)	\
+	local_trylock_irqsave(lock, flags)
+
+#define qpw_unlock(lock, cpu)			\
+	local_unlock(lock)
+
+#define qpw_unlock_irqrestore(lock, flags, cpu)	\
+	local_unlock_irqrestore(lock, flags)
+
+#define qpw_lockdep_assert_held(lock)		\
+	lockdep_assert_held(lock)
+
+#define queue_percpu_work_on(c, wq, qpw)	\
+	queue_work_on(c, wq, &(qpw)->work)
+
+#define flush_percpu_work(qpw)			\
+	flush_work(&(qpw)->work)
+
+#define qpw_get_cpu(qpw)	smp_processor_id()
+
+#define qpw_is_cpu_remote(cpu)		(false)
+
+#define INIT_QPW(qpw, func, c)			\
+	INIT_WORK(&(qpw)->work, (func))
+
+#else /* CONFIG_QPW */
+
+DECLARE_STATIC_KEY_MAYBE(CONFIG_QPW_DEFAULT, qpw_sl);
+
+typedef union {
+	spinlock_t sl;
+	local_lock_t ll;
+} qpw_lock_t;
+
+typedef union {
+	spinlock_t sl;
+	local_trylock_t ll;
+} qpw_trylock_t;
+
+struct qpw_struct {
+	struct work_struct work;
+	int cpu;
+};
+
+#define qpw_lock_init(lock)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_lock_init(lock.sl);					\
+		else									\
+			local_lock_init(lock.ll);					\
+	} while (0)
+
+#define qpw_trylock_init(lock)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_lock_init(lock.sl);					\
+		else									\
+			local_trylock_init(lock.ll);					\
+	} while (0)
+
+#define qpw_lock(lock, cpu)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_lock(per_cpu_ptr(lock.sl, cpu));				\
+		else									\
+			local_lock(lock.ll);						\
+	} while (0)
+
+#define qpw_lock_irqsave(lock, flags, cpu)						\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_lock_irqsave(per_cpu_ptr(lock.sl, cpu), flags);		\
+		else									\
+			local_lock_irqsave(lock.ll, flags);				\
+	} while (0)
+
+#define qpw_trylock(lock, cpu)								\
+	({										\
+		int t;									\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			t = spin_trylock(per_cpu_ptr(lock.sl, cpu));			\
+		else									\
+			t = local_trylock(lock.ll);					\
+		t;									\
+	})
+
+#define qpw_trylock_irqsave(lock, flags, cpu)						\
+	({										\
+		int t;									\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			t = spin_trylock_irqsave(per_cpu_ptr(lock.sl, cpu), flags);	\
+		else									\
+			t = local_trylock_irqsave(lock.ll, flags);			\
+		t;									\
+	})
+
+#define qpw_unlock(lock, cpu)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			spin_unlock(per_cpu_ptr(lock.sl, cpu));				\
+		} else {								\
+			local_unlock(lock.ll);						\
+		}									\
+	} while (0)
+
+#define qpw_unlock_irqrestore(lock, flags, cpu)						\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_unlock_irqrestore(per_cpu_ptr(lock.sl, cpu), flags);	\
+		else									\
+			local_unlock_irqrestore(lock.ll, flags);			\
+	} while (0)
+
+#define qpw_lockdep_assert_held(lock)							\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			lockdep_assert_held(this_cpu_ptr(lock.sl));			\
+		else									\
+			lockdep_assert_held(this_cpu_ptr(lock.ll));			\
+	} while (0)
+
+#define queue_percpu_work_on(c, wq, qpw)						\
+	do {										\
+		int __c = c;								\
+		struct qpw_struct *__qpw = (qpw);					\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			WARN_ON((__c) != __qpw->cpu);					\
+			__qpw->work.func(&__qpw->work);					\
+		} else {								\
+			queue_work_on(__c, wq, &(__qpw)->work);				\
+		}									\
+	} while (0)
+
+/*
+ * Does nothing if QPW is set to use spinlock, as the task is already done at the
+ * time queue_percpu_work_on() returns.
+ */
+#define flush_percpu_work(qpw)								\
+	do {										\
+		struct qpw_struct *__qpw = (qpw);					\
+		if (!static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {		\
+			flush_work(&__qpw->work);					\
+		}									\
+	} while (0)
+
+#define qpw_get_cpu(w)			container_of((w), struct qpw_struct, work)->cpu
+
+#define qpw_is_cpu_remote(cpu)		((cpu) != smp_processor_id())
+
+#define INIT_QPW(qpw, func, c)								\
+	do {										\
+		struct qpw_struct *__qpw = (qpw);					\
+		INIT_WORK(&__qpw->work, (func));					\
+		__qpw->cpu = (c);							\
+	} while (0)
+
+#endif /* CONFIG_QPW */
+#endif /* LINUX_QPW_H */
Index: slab/init/Kconfig
===================================================================
--- slab.orig/init/Kconfig
+++ slab/init/Kconfig
@@ -747,6 +747,41 @@ config CPU_ISOLATION
 
 	  Say Y if unsure.
 
+config QPW
+	bool "Queue per-CPU Work"
+	depends on SMP || COMPILE_TEST
+	default n
+	help
+	  Allow changing the behavior of per-CPU resource sharing: from the
+	  regular local_lock() + queue_work_on(remote_cpu) strategy to using
+	  per-CPU spinlocks for both local and remote operations.
+
+	  This is useful to give the user the option of reducing IPIs to CPUs,
+	  and thus reducing interruptions and context switches. On the other
+	  hand, it increases the generated code and will use atomic operations
+	  if spinlocks are selected.
+
+	  If set, will use the default behavior set in QPW_DEFAULT unless boot
+	  parameter qpw is passed with a different behavior.
+
+	  If unset, will use the local_lock() + queue_work_on() strategy,
+	  regardless of the boot parameter or QPW_DEFAULT.
+
+	  Say N if unsure.
+
+config QPW_DEFAULT
+	bool "Use per-CPU spinlocks by default"
+	depends on QPW
+	default n
+	help
+	  If set, will use per-CPU spinlocks as default behavior for per-CPU
+	  remote operations.
+
+	  If unset, will use local_lock() + queue_work_on(cpu) as default
+	  behavior for remote operations.
+
+	  Say N if unsure.
+
 source "kernel/rcu/Kconfig"
 
 config IKCONFIG
Index: slab/kernel/Makefile
===================================================================
--- slab.orig/kernel/Makefile
+++ slab/kernel/Makefile
@@ -140,6 +140,8 @@ obj-$(CONFIG_WATCH_QUEUE) += watch_queue
 obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
 obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
 
+obj-$(CONFIG_QPW) += qpw.o
+
 CFLAGS_kstack_erase.o += $(DISABLE_KSTACK_ERASE)
 CFLAGS_kstack_erase.o += $(call cc-option,-mgeneral-regs-only)
 obj-$(CONFIG_KSTACK_ERASE) += kstack_erase.o
Index: slab/kernel/qpw.c
===================================================================
--- /dev/null
+++ slab/kernel/qpw.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "linux/export.h"
+#include <linux/sched.h>
+#include <linux/qpw.h>
+#include <linux/string.h>
+
+DEFINE_STATIC_KEY_MAYBE(CONFIG_QPW_DEFAULT, qpw_sl);
+EXPORT_SYMBOL(qpw_sl);
+
+static int __init qpw_setup(char *str)
+{
+	int opt;
+
+	if (!get_option(&str, &opt)) {
+		pr_warn("QPW: invalid qpw parameter: %s, ignoring.\n", str);
+		return 0;
+	}
+
+	if (opt)
+		static_branch_enable(&qpw_sl);
+	else
+		static_branch_disable(&qpw_sl);
+
+	return 0;
+}
+__setup("qpw=", qpw_setup);
Index: slab/Documentation/locking/qpwlocks.rst
===================================================================
--- /dev/null
+++ slab/Documentation/locking/qpwlocks.rst
@@ -0,0 +1,63 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========
+QPW locks
+=========
+
+Some places in the kernel implement a parallel programming strategy
+consisting of local_lock()s for most of the work, with the rare remote
+operations scheduled on the target cpu. This keeps cache bouncing low, since
+cachelines tend to stay mostly local, and avoids the cost of locks in non-RT
+kernels, even though the very few remote operations will be expensive due
+to scheduling overhead.
+
+On the other hand, for RT workloads this can represent a problem:
+scheduling work on remote cpus that are executing low-latency tasks
+is undesirable and can introduce unexpected deadline misses.
+
+QPW locks are meant to convert sites that currently use local_locks (for
+cpu-local operations) and queue_work_on (for queueing work remotely, to be
+executed locally on the owner cpu of the lock).
+
+The lock is declared with the qpw_lock_t type.
+The lock is initialized with qpw_lock_init.
+The lock is locked with qpw_lock (takes the lock and a cpu as parameters).
+The lock is unlocked with qpw_unlock (takes the lock and a cpu as parameters).
+
+The qpw_lock_irqsave function additionally disables interrupts and saves the
+current interrupt state; it also takes a cpu as a parameter.
+
+For the trylock variant, there is the qpw_trylock_t type, initialized with
+qpw_trylock_init, plus the corresponding qpw_trylock and
+qpw_trylock_irqsave helpers.
+
+The work_struct should be replaced by a qpw_struct, which also carries the cpu
+(owner cpu of the lock) and is initialized by INIT_QPW.
+
+The queue work related functions (analogous to queue_work_on and flush_work) are:
+queue_percpu_work_on and flush_percpu_work.
+
+The behaviour of the QPW functions is as follows:
+
+* !CONFIG_PREEMPT_RT and !CONFIG_QPW (or CONFIG_QPW and qpw=off kernel
+boot parameter):
+        - qpw_lock:                     local_lock
+        - qpw_lock_irqsave:             local_lock_irqsave
+        - qpw_trylock:                  local_trylock
+        - qpw_trylock_irqsave:          local_trylock_irqsave
+        - qpw_unlock:                   local_unlock
+        - queue_percpu_work_on:         queue_work_on
+        - flush_percpu_work:            flush_work
+
+* CONFIG_PREEMPT_RT or CONFIG_QPW (and CONFIG_QPW_DEFAULT or qpw=on kernel
+boot parameter),
+        - qpw_lock:                     spin_lock
+        - qpw_lock_irqsave:             spin_lock_irqsave
+        - qpw_trylock:                  spin_trylock
+        - qpw_trylock_irqsave:          spin_trylock_irqsave
+        - qpw_unlock:                   spin_unlock
+        - queue_percpu_work_on:         executes work function on caller cpu
+        - flush_percpu_work:            empty
+
+qpw_get_cpu(work_struct), to be called from within the qpw work function,
+returns the target cpu.
Re: [PATCH 1/4] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Leonardo Bras 17 hours ago
On Fri, Feb 06, 2026 at 11:34:31AM -0300, Marcelo Tosatti wrote:
[...]
> +QPW
> +M:	Leonardo Bras <leobras@redhat.com>

Thanks for keeping that up :)
Could you please change this line to 

+M:	Leonardo Bras <leobras.c@gmail.com>

As I don't have access to Red Hat's mail anymore.
The signoffs on each commit should be fine to keep :)

> +S:	Supported
> +F:	include/linux/qpw.h
> +F:	kernel/qpw.c
> +

Should we also add the Documentation file?

+F:	Documentation/locking/qpwlocks.rst


>  QT1010 MEDIA DRIVER
>  L:	linux-media@vger.kernel.org
>  S:	Orphan
> Index: slab/include/linux/qpw.h
> ===================================================================
> --- /dev/null
> +++ slab/include/linux/qpw.h
> @@ -0,0 +1,190 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_QPW_H
> +#define _LINUX_QPW_H
> +
> +#include "linux/spinlock.h"
> +#include "linux/local_lock.h"
> +#include "linux/workqueue.h"
> +
> +#ifndef CONFIG_QPW
> +
> +typedef local_lock_t qpw_lock_t;
> +typedef local_trylock_t qpw_trylock_t;
> +
> +struct qpw_struct {
> +	struct work_struct work;
> +};
> +
> +#define qpw_lock_init(lock)			\
> +	local_lock_init(lock)
> +
> +#define qpw_trylock_init(lock)			\
> +	local_trylock_init(lock)
> +
> +#define qpw_lock(lock, cpu)			\
> +	local_lock(lock)
> +
> +#define qpw_lock_irqsave(lock, flags, cpu)	\
> +	local_lock_irqsave(lock, flags)
> +
> +#define qpw_trylock(lock, cpu)			\
> +	local_trylock(lock)
> +
> +#define qpw_trylock_irqsave(lock, flags, cpu)	\
> +	local_trylock_irqsave(lock, flags)
> +
> +#define qpw_unlock(lock, cpu)			\
> +	local_unlock(lock)
> +
> +#define qpw_unlock_irqrestore(lock, flags, cpu)	\
> +	local_unlock_irqrestore(lock, flags)
> +
> +#define qpw_lockdep_assert_held(lock)		\
> +	lockdep_assert_held(lock)
> +
> +#define queue_percpu_work_on(c, wq, qpw)	\
> +	queue_work_on(c, wq, &(qpw)->work)
> +
> +#define flush_percpu_work(qpw)			\
> +	flush_work(&(qpw)->work)
> +
> +#define qpw_get_cpu(qpw)	smp_processor_id()
> +
> +#define qpw_is_cpu_remote(cpu)		(false)
> +
> +#define INIT_QPW(qpw, func, c)			\
> +	INIT_WORK(&(qpw)->work, (func))
> +
> +#else /* CONFIG_QPW */
> +
> +DECLARE_STATIC_KEY_MAYBE(CONFIG_QPW_DEFAULT, qpw_sl);
> +
> +typedef union {
> +	spinlock_t sl;
> +	local_lock_t ll;
> +} qpw_lock_t;
> +
> +typedef union {
> +	spinlock_t sl;
> +	local_trylock_t ll;
> +} qpw_trylock_t;
> +
> +struct qpw_struct {
> +	struct work_struct work;
> +	int cpu;
> +};
> +
> +#define qpw_lock_init(lock)								\
> +	do {										\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
> +			spin_lock_init(lock.sl);					\
> +		else									\
> +			local_lock_init(lock.ll);					\
> +	} while (0)
> +
> +#define qpw_trylock_init(lock)								\
> +	do {										\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
> +			spin_lock_init(lock.sl);					\
> +		else									\
> +			local_trylock_init(lock.ll);					\
> +	} while (0)
> +
> +#define qpw_lock(lock, cpu)								\
> +	do {										\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
> +			spin_lock(per_cpu_ptr(lock.sl, cpu));				\
> +		else									\
> +			local_lock(lock.ll);						\
> +	} while (0)
> +
> +#define qpw_lock_irqsave(lock, flags, cpu)						\
> +	do {										\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
> +			spin_lock_irqsave(per_cpu_ptr(lock.sl, cpu), flags);		\
> +		else									\
> +			local_lock_irqsave(lock.ll, flags);				\
> +	} while (0)
> +
> +#define qpw_trylock(lock, cpu)								\
> +	({										\
> +		int t;									\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
> +			t = spin_trylock(per_cpu_ptr(lock.sl, cpu));			\
> +		else									\
> +			t = local_trylock(lock.ll);					\
> +		t;									\
> +	})
> +
> +#define qpw_trylock_irqsave(lock, flags, cpu)						\
> +	({										\
> +		int t;									\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
> +			t = spin_trylock_irqsave(per_cpu_ptr(lock.sl, cpu), flags);	\
> +		else									\
> +			t = local_trylock_irqsave(lock.ll, flags);			\
> +		t;									\
> +	})
> +
> +#define qpw_unlock(lock, cpu)								\
> +	do {										\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
> +			spin_unlock(per_cpu_ptr(lock.sl, cpu));				\
> +		} else {								\
> +			local_unlock(lock.ll);						\
> +		}									\
> +	} while (0)
> +
> +#define qpw_unlock_irqrestore(lock, flags, cpu)						\
> +	do {										\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
> +			spin_unlock_irqrestore(per_cpu_ptr(lock.sl, cpu), flags);	\
> +		else									\
> +			local_unlock_irqrestore(lock.ll, flags);			\
> +	} while (0)
> +
> +#define qpw_lockdep_assert_held(lock)							\
> +	do {										\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
> +			lockdep_assert_held(this_cpu_ptr(lock.sl));			\
> +		else									\
> +			lockdep_assert_held(this_cpu_ptr(lock.ll));			\
> +	} while (0)
> +
> +#define queue_percpu_work_on(c, wq, qpw)						\
> +	do {										\
> +		int __c = c;								\
> +		struct qpw_struct *__qpw = (qpw);					\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
> +			WARN_ON((__c) != __qpw->cpu);					\
> +			__qpw->work.func(&__qpw->work);					\
> +		} else {								\
> +			queue_work_on(__c, wq, &(__qpw)->work);				\
> +		}									\
> +	} while (0)
> +
> +/*
> + * Does nothing if QPW is set to use spinlock, as the task is already done at the
> + * time queue_percpu_work_on() returns.
> + */
> +#define flush_percpu_work(qpw)								\
> +	do {										\
> +		struct qpw_struct *__qpw = (qpw);					\
> +		if (!static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {		\
> +			flush_work(&__qpw->work);					\
> +		}									\
> +	} while (0)
> +
> +#define qpw_get_cpu(w)			container_of((w), struct qpw_struct, work)->cpu
> +
> +#define qpw_is_cpu_remote(cpu)		((cpu) != smp_processor_id())
> +
> +#define INIT_QPW(qpw, func, c)								\
> +	do {										\
> +		struct qpw_struct *__qpw = (qpw);					\
> +		INIT_WORK(&__qpw->work, (func));					\
> +		__qpw->cpu = (c);							\
> +	} while (0)
> +
> +#endif /* CONFIG_QPW */
> +#endif /* LINUX_QPW_H */
> Index: slab/init/Kconfig
> ===================================================================
> --- slab.orig/init/Kconfig
> +++ slab/init/Kconfig
> @@ -747,6 +747,41 @@ config CPU_ISOLATION
>  
>  	  Say Y if unsure.
>  
> +config QPW
> +	bool "Queue per-CPU Work"
> +	depends on SMP || COMPILE_TEST
> +	default n
> +	help
> +	  Allow changing the behavior on per-CPU resource sharing with cache,
> +	  from the regular local_locks() + queue_work_on(remote_cpu) to using
> +	  per-CPU spinlocks on both local and remote operations.
> +
> +	  This is useful to give user the option on reducing IPIs to CPUs, and
> +	  thus reduce interruptions and context switches. On the other hand, it
> +	  increases generated code and will use atomic operations if spinlocks
> +	  are selected.
> +
> +	  If set, will use the default behavior set in QPW_DEFAULT unless boot
> +	  parameter qpw is passed with a different behavior.
> +
> +	  If unset, will use the local_lock() + queue_work_on() strategy,
> +	  regardless of the boot parameter or QPW_DEFAULT.
> +
> +	  Say N if unsure.
> +
> +config QPW_DEFAULT
> +	bool "Use per-CPU spinlocks by default"
> +	depends on QPW
> +	default n
> +	help
> +	  If set, will use per-CPU spinlocks as default behavior for per-CPU
> +	  remote operations.
> +
> +	  If unset, will use local_lock() + queue_work_on(cpu) as default
> +	  behavior for remote operations.
> +
> +	  Say N if unsure
> +
>  source "kernel/rcu/Kconfig"
>  
>  config IKCONFIG
> Index: slab/kernel/Makefile
> ===================================================================
> --- slab.orig/kernel/Makefile
> +++ slab/kernel/Makefile
> @@ -140,6 +140,8 @@ obj-$(CONFIG_WATCH_QUEUE) += watch_queue
>  obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
>  obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
>  
> +obj-$(CONFIG_QPW) += qpw.o
> +
>  CFLAGS_kstack_erase.o += $(DISABLE_KSTACK_ERASE)
>  CFLAGS_kstack_erase.o += $(call cc-option,-mgeneral-regs-only)
>  obj-$(CONFIG_KSTACK_ERASE) += kstack_erase.o
> Index: slab/kernel/qpw.c
> ===================================================================
> --- /dev/null
> +++ slab/kernel/qpw.c
> @@ -0,0 +1,26 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/export.h"
> +#include <linux/sched.h>
> +#include <linux/qpw.h>
> +#include <linux/string.h>
> +
> +DEFINE_STATIC_KEY_MAYBE(CONFIG_QPW_DEFAULT, qpw_sl);
> +EXPORT_SYMBOL(qpw_sl);
> +
> +static int __init qpw_setup(char *str)
> +{
> +	int opt;
> +
> +	if (!get_option(&str, &opt)) {
> +		pr_warn("QPW: invalid qpw parameter: %s, ignoring.\n", str);
> +		return 0;
> +	}
> +
> +	if (opt)
> +		static_branch_enable(&qpw_sl);
> +	else
> +		static_branch_disable(&qpw_sl);
> +
> +	return 0;
> +}
> +__setup("qpw=", qpw_setup);
> Index: slab/Documentation/locking/qpwlocks.rst
> ===================================================================
> --- /dev/null
> +++ slab/Documentation/locking/qpwlocks.rst
> @@ -0,0 +1,63 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=========
> +QPW locks
> +=========
> +
> +Some places in the kernel implement a parallel programming strategy
> +consisting on local_locks() for most of the work, and some rare remote
> +operations are scheduled on target cpu. This keeps cache bouncing low since
> +cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> +kernels, even though the very few remote operations will be expensive due
> +to scheduling overhead.
> +
> +On the other hand, for RT workloads this can represent a problem:
> +scheduling work on remote cpu that are executing low latency tasks
> +is undesired and can introduce unexpected deadline misses.
> +
> +QPW locks help to convert sites that use local_locks (for cpu local operations)
> +and queue_work_on (for queueing work remotely, to be executed
> +locally on the owner cpu of the lock) to QPW locks.
> +
> +The lock is declared qpw_lock_t type.
> +The lock is initialized with qpw_lock_init.
> +The lock is locked with qpw_lock (takes a lock and cpu as a parameter).
> +The lock is unlocked with qpw_unlock (takes a lock and cpu as a parameter).
> +
> +The qpw_lock_irqsave function disables interrupts and saves current interrupt state,
> +cpu as a parameter.
> +
> +For trylock variant, there is the qpw_trylock_t type, initialized with
> +qpw_trylock_init. Then the corresponding qpw_trylock and
> +qpw_trylock_irqsave.
> +
> +work_struct should be replaced by qpw_struct, which contains a cpu parameter
> +(owner cpu of the lock), initialized by INIT_QPW.
> +
> +The queue work related functions (analogous to queue_work_on and flush_work) are:
> +queue_percpu_work_on and flush_percpu_work.
> +
> +The behaviour of the QPW functions is as follows:
> +
> +* !CONFIG_PREEMPT_RT and !CONFIG_QPW (or CONFIG_QPW and qpw=off kernel

I don't think PREEMPT_RT is needed here (maybe it was copied from the 
previous QPW version which was dependent on PREEMPT_RT?)

> +boot parameter):
> +        - qpw_lock:                     local_lock
> +        - qpw_lock_irqsave:             local_lock_irqsave
> +        - qpw_trylock:                  local_trylock
> +        - qpw_trylock_irqsave:          local_trylock_irqsave
> +        - qpw_unlock:                   local_unlock
> +        - queue_percpu_work_on:         queue_work_on
> +        - flush_percpu_work:            flush_work
> +
> +* CONFIG_PREEMPT_RT or CONFIG_QPW (and CONFIG_QPW_DEFAULT or qpw=on kernel

Same here

> +boot parameter),
[...]


Other than that, LGTM!

Reviewed-by: Leonardo Bras <leobras.c@gmail.com>

Thanks!
Leo
Re: [PATCH 1/4] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Marcelo Tosatti 1 day, 2 hours ago
On Fri, Feb 06, 2026 at 11:34:31AM -0300, Marcelo Tosatti wrote:
[...]

Forgot to mention: patchset is against Vlastimil's slab/next tree.