[PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work

Marcelo Tosatti posted 5 patches 1 month, 1 week ago
Some places in the kernel implement a parallel programming strategy
consisting of local_lock()s for most of the work, with the rare remote
operations scheduled on the target cpu. This keeps cache bouncing low,
since cachelines tend to stay local, and avoids the cost of locks on non-RT
kernels, even though the very few remote operations will be expensive due
to scheduling overhead.

On the other hand, for RT workloads this can represent a problem:
scheduling work on remote cpus that are executing low latency tasks
is undesired and can introduce unexpected deadline misses.

It's interesting, though, that local_lock()s in RT kernels become
spinlock()s. We can make use of those to avoid scheduling work on a remote
cpu, by directly updating another cpu's per_cpu structure while holding
its spinlock().

In order to do that, it's necessary to introduce a new set of functions to
make it possible to get another cpu's per-cpu "local" lock (qpw_{un,}lock*)
and also the corresponding queue_percpu_work_on() and flush_percpu_work()
helpers to run the remote work.

Users of non-RT kernels but with low latency requirements can select
similar functionality by using the CONFIG_QPW compile time option.

On CONFIG_QPW disabled kernels, no changes are expected, as every
one of the introduced helpers works exactly the same as the current
implementation:
qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
queue_percpu_work_on()  ->  queue_work_on()
flush_percpu_work()     ->  flush_work()

For QPW enabled kernels, though, qpw_{un,}lock*() will use the extra
cpu parameter to select the correct per-cpu structure to work on,
and acquire the spinlock for that cpu.

queue_percpu_work_on() will just call the requested function on the current
cpu, which will operate on another cpu's per-cpu object. Since the
local_lock()s become spinlock()s in QPW enabled kernels, we are
safe doing that.

flush_percpu_work() then becomes a no-op since no work is actually
scheduled on a remote cpu.

Some minimal code rework is needed in order to make this mechanism work:
the calls to local_{un,}lock*() in the functions that are currently
scheduled on remote cpus need to be replaced by qpw_{un,}lock*(), so in
QPW enabled kernels they can reference a different cpu. It's also
necessary to use a qpw_struct instead of a work_struct, but the former just
contains a work_struct and, with CONFIG_QPW, the target cpu.

This should have almost no impact on non-CONFIG_QPW kernels: a few
this_cpu_ptr() calls will become per_cpu_ptr(,smp_processor_id()).

On CONFIG_QPW kernels, this should avoid deadline misses by
removing scheduling noise.

Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |   10 
 Documentation/locking/qpwlocks.rst              |   70 ++++++
 MAINTAINERS                                     |    7 
 include/linux/qpw.h                             |  256 ++++++++++++++++++++++++
 init/Kconfig                                    |   35 +++
 kernel/Makefile                                 |    2 
 kernel/qpw.c                                    |   26 ++
 7 files changed, 406 insertions(+)
 create mode 100644 include/linux/qpw.h
 create mode 100644 kernel/qpw.c

Index: linux/Documentation/admin-guide/kernel-parameters.txt
===================================================================
--- linux.orig/Documentation/admin-guide/kernel-parameters.txt
+++ linux/Documentation/admin-guide/kernel-parameters.txt
@@ -2840,6 +2840,16 @@ Kernel parameters
 
 			The format of <cpu-list> is described above.
 
+	qpw=		[KNL,SMP] Select the behavior of the per-CPU resource
+			sharing and remote interference mechanism on a kernel
+			built with CONFIG_QPW.
+			Format: { "0" | "1" }
+			0 - local_lock() + queue_work_on(remote_cpu)
+			1 - spin_lock() for both local and remote operations
+
+			Selecting 1 may be interesting for systems that want
+			to avoid interruptions and context switches caused by IPIs.
+
 	iucv=		[HW,NET]
 
 	ivrs_ioapic	[HW,X86-64]
Index: linux/MAINTAINERS
===================================================================
--- linux.orig/MAINTAINERS
+++ linux/MAINTAINERS
@@ -21553,6 +21553,13 @@ F:	Documentation/networking/device_drive
 F:	drivers/bus/fsl-mc/
 F:	include/uapi/linux/fsl_mc.h
 
+QPW
+M:	Leonardo Bras <leobras.c@gmail.com>
+S:	Supported
+F:	Documentation/locking/qpwlocks.rst
+F:	include/linux/qpw.h
+F:	kernel/qpw.c
+
 QT1010 MEDIA DRIVER
 L:	linux-media@vger.kernel.org
 S:	Orphan
Index: linux/include/linux/qpw.h
===================================================================
--- /dev/null
+++ linux/include/linux/qpw.h
@@ -0,0 +1,256 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_QPW_H
+#define _LINUX_QPW_H
+
+#include <linux/spinlock.h>
+#include <linux/local_lock.h>
+#include <linux/workqueue.h>
+
+#ifndef CONFIG_QPW
+
+typedef local_lock_t qpw_lock_t;
+typedef local_trylock_t qpw_trylock_t;
+
+struct qpw_struct {
+	struct work_struct work;
+};
+
+#define qpw_lock_init(lock)				\
+	local_lock_init(lock)
+
+#define qpw_trylock_init(lock)				\
+	local_trylock_init(lock)
+
+#define qpw_lock(lock, cpu)				\
+	local_lock(lock)
+
+#define local_qpw_lock(lock)				\
+	local_lock(lock)
+
+#define qpw_lock_irqsave(lock, flags, cpu)		\
+	local_lock_irqsave(lock, flags)
+
+#define local_qpw_lock_irqsave(lock, flags)		\
+	local_lock_irqsave(lock, flags)
+
+#define qpw_trylock(lock, cpu)				\
+	local_trylock(lock)
+
+#define local_qpw_trylock(lock)				\
+	local_trylock(lock)
+
+#define qpw_trylock_irqsave(lock, flags, cpu)		\
+	local_trylock_irqsave(lock, flags)
+
+#define qpw_unlock(lock, cpu)				\
+	local_unlock(lock)
+
+#define local_qpw_unlock(lock)				\
+	local_unlock(lock)
+
+#define qpw_unlock_irqrestore(lock, flags, cpu)		\
+	local_unlock_irqrestore(lock, flags)
+
+#define local_qpw_unlock_irqrestore(lock, flags)	\
+	local_unlock_irqrestore(lock, flags)
+
+#define qpw_lockdep_assert_held(lock)			\
+	lockdep_assert_held(lock)
+
+#define queue_percpu_work_on(c, wq, qpw)		\
+	queue_work_on(c, wq, &(qpw)->work)
+
+#define flush_percpu_work(qpw)				\
+	flush_work(&(qpw)->work)
+
+#define qpw_get_cpu(qpw)	smp_processor_id()
+
+#define qpw_is_cpu_remote(cpu)		(false)
+
+#define INIT_QPW(qpw, func, c)				\
+	INIT_WORK(&(qpw)->work, (func))
+
+#else /* CONFIG_QPW */
+
+DECLARE_STATIC_KEY_MAYBE(CONFIG_QPW_DEFAULT, qpw_sl);
+
+typedef union {
+	spinlock_t sl;
+	local_lock_t ll;
+} qpw_lock_t;
+
+typedef union {
+	spinlock_t sl;
+	local_trylock_t ll;
+} qpw_trylock_t;
+
+struct qpw_struct {
+	struct work_struct work;
+	int cpu;
+};
+
+#define qpw_lock_init(lock)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_lock_init(lock.sl);					\
+		else									\
+			local_lock_init(lock.ll);					\
+	} while (0)
+
+#define qpw_trylock_init(lock)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_lock_init(lock.sl);					\
+		else									\
+			local_trylock_init(lock.ll);					\
+	} while (0)
+
+#define qpw_lock(lock, cpu)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_lock(per_cpu_ptr(lock.sl, cpu));				\
+		else									\
+			local_lock(lock.ll);						\
+	} while (0)
+
+#define local_qpw_lock(lock)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			migrate_disable();						\
+			spin_lock(this_cpu_ptr(lock.sl));				\
+		} else									\
+			local_lock(lock.ll);						\
+	} while (0)
+
+#define qpw_lock_irqsave(lock, flags, cpu)						\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_lock_irqsave(per_cpu_ptr(lock.sl, cpu), flags);		\
+		else									\
+			local_lock_irqsave(lock.ll, flags);				\
+	} while (0)
+
+#define local_qpw_lock_irqsave(lock, flags)						\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			migrate_disable();						\
+			spin_lock_irqsave(this_cpu_ptr(lock.sl), flags);		\
+		} else									\
+			local_lock_irqsave(lock.ll, flags);				\
+	} while (0)
+
+
+#define qpw_trylock(lock, cpu)                                                          \
+	({                                                                              \
+		int t;                                                                  \
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))                   \
+			t = spin_trylock(per_cpu_ptr(lock.sl, cpu));                    \
+		else                                                                    \
+			t = local_trylock(lock.ll);                                     \
+		t;                                                                      \
+	})
+
+#define local_qpw_trylock(lock)								\
+	({										\
+		int t;									\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			migrate_disable();						\
+			t = spin_trylock(this_cpu_ptr(lock.sl));			\
+			if (!t)								\
+				migrate_enable();					\
+		} else									\
+			t = local_trylock(lock.ll);					\
+		t;									\
+	})
+
+#define qpw_trylock_irqsave(lock, flags, cpu)						\
+	({										\
+		int t;									\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			t = spin_trylock_irqsave(per_cpu_ptr(lock.sl, cpu), flags);	\
+		else									\
+			t = local_trylock_irqsave(lock.ll, flags);			\
+		t;									\
+	})
+
+#define qpw_unlock(lock, cpu)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			spin_unlock(per_cpu_ptr(lock.sl, cpu));				\
+		} else {								\
+			local_unlock(lock.ll);						\
+		}									\
+	} while (0)
+
+#define local_qpw_unlock(lock)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			spin_unlock(this_cpu_ptr(lock.sl));				\
+			migrate_enable();						\
+		} else {								\
+			local_unlock(lock.ll);						\
+		}									\
+	} while (0)
+
+#define qpw_unlock_irqrestore(lock, flags, cpu)						\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_unlock_irqrestore(per_cpu_ptr(lock.sl, cpu), flags);	\
+		else									\
+			local_unlock_irqrestore(lock.ll, flags);			\
+	} while (0)
+
+#define local_qpw_unlock_irqrestore(lock, flags)					\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			spin_unlock_irqrestore(this_cpu_ptr(lock.sl), flags);		\
+			migrate_enable();						\
+		} else									\
+			local_unlock_irqrestore(lock.ll, flags);			\
+	} while (0)
+
+#define qpw_lockdep_assert_held(lock)							\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			lockdep_assert_held(this_cpu_ptr(lock.sl));			\
+		else									\
+			lockdep_assert_held(this_cpu_ptr(lock.ll));			\
+	} while (0)
+
+#define queue_percpu_work_on(c, wq, qpw)						\
+	do {										\
+		int __c = c;								\
+		struct qpw_struct *__qpw = (qpw);					\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			WARN_ON((__c) != __qpw->cpu);					\
+			__qpw->work.func(&__qpw->work);					\
+		} else {								\
+			queue_work_on(__c, wq, &(__qpw)->work);				\
+		}									\
+	} while (0)
+
+/*
+ * Does nothing if QPW is set to use spinlock, as the task is already done at the
+ * time queue_percpu_work_on() returns.
+ */
+#define flush_percpu_work(qpw)								\
+	do {										\
+		struct qpw_struct *__qpw = (qpw);					\
+		if (!static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {		\
+			flush_work(&__qpw->work);					\
+		}									\
+	} while (0)
+
+#define qpw_get_cpu(w)			container_of((w), struct qpw_struct, work)->cpu
+
+#define qpw_is_cpu_remote(cpu)		((cpu) != smp_processor_id())
+
+#define INIT_QPW(qpw, func, c)								\
+	do {										\
+		struct qpw_struct *__qpw = (qpw);					\
+		INIT_WORK(&__qpw->work, (func));					\
+		__qpw->cpu = (c);							\
+	} while (0)
+
+#endif /* CONFIG_QPW */
+#endif /* _LINUX_QPW_H */
Index: linux/init/Kconfig
===================================================================
--- linux.orig/init/Kconfig
+++ linux/init/Kconfig
@@ -762,6 +762,41 @@ config CPU_ISOLATION
 
 	  Say Y if unsure.
 
+config QPW
+	bool "Queue per-CPU Work"
+	depends on SMP || COMPILE_TEST
+	default n
+	help
+	  Allow changing the behavior of per-CPU resource sharing from the
+	  regular local_lock() + queue_work_on(remote_cpu) to using per-CPU
+	  spinlocks for both local and remote operations.
+
+	  This gives the user the option of reducing IPIs to CPUs, and thus
+	  reducing interruptions and context switches. On the other hand, it
+	  increases generated code size and will use atomic operations if
+	  spinlocks are selected.
+
+	  If set, will use the default behavior set in QPW_DEFAULT, unless the
+	  qpw boot parameter is passed with a different behavior.
+
+	  If unset, will use the local_lock() + queue_work_on() strategy,
+	  regardless of the boot parameter or QPW_DEFAULT.
+
+	  Say N if unsure.
+
+config QPW_DEFAULT
+	bool "Use per-CPU spinlocks by default"
+	depends on QPW
+	default n
+	help
+	  If set, will use per-CPU spinlocks as default behavior for per-CPU
+	  remote operations.
+
+	  If unset, will use local_lock() + queue_work_on(cpu) as default
+	  behavior for remote operations.
+
+	  Say N if unsure.
+
 source "kernel/rcu/Kconfig"
 
 config IKCONFIG
Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile
+++ linux/kernel/Makefile
@@ -142,6 +142,8 @@ obj-$(CONFIG_WATCH_QUEUE) += watch_queue
 obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
 obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
 
+obj-$(CONFIG_QPW) += qpw.o
+
 CFLAGS_kstack_erase.o += $(DISABLE_KSTACK_ERASE)
 CFLAGS_kstack_erase.o += $(call cc-option,-mgeneral-regs-only)
 obj-$(CONFIG_KSTACK_ERASE) += kstack_erase.o
Index: linux/kernel/qpw.c
===================================================================
--- /dev/null
+++ linux/kernel/qpw.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/export.h>
+#include <linux/sched.h>
+#include <linux/qpw.h>
+#include <linux/string.h>
+
+DEFINE_STATIC_KEY_MAYBE(CONFIG_QPW_DEFAULT, qpw_sl);
+EXPORT_SYMBOL(qpw_sl);
+
+static int __init qpw_setup(char *str)
+{
+	int opt;
+
+	if (!get_option(&str, &opt)) {
+		pr_warn("QPW: invalid qpw parameter: %s, ignoring.\n", str);
+		return 0;
+	}
+
+	if (opt)
+		static_branch_enable(&qpw_sl);
+	else
+		static_branch_disable(&qpw_sl);
+
+	return 1;
+}
+__setup("qpw=", qpw_setup);
Index: linux/Documentation/locking/qpwlocks.rst
===================================================================
--- /dev/null
+++ linux/Documentation/locking/qpwlocks.rst
@@ -0,0 +1,70 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========
+QPW locks
+=========
+
+Some places in the kernel implement a parallel programming strategy
+consisting of local_lock()s for most of the work, with the rare remote
+operations scheduled on the target cpu. This keeps cache bouncing low,
+since cachelines tend to stay local, and avoids the cost of locks on non-RT
+kernels, even though the very few remote operations will be expensive due
+to scheduling overhead.
+
+On the other hand, for RT workloads this can represent a problem:
+scheduling work on remote cpus that are executing low latency tasks
+is undesired and can introduce unexpected deadline misses.
+
+QPW locks help convert sites that use local_lock() (for cpu-local
+operations) and queue_work_on() (for queueing work remotely, to be
+executed locally on the cpu that owns the lock).
+
+The lock is declared with the qpw_lock_t type and initialized with
+qpw_lock_init(). It is locked with qpw_lock() and unlocked with
+qpw_unlock(); both take the lock and a cpu as parameters.
+
+qpw_lock_irqsave() additionally disables interrupts and saves the current
+interrupt state; it takes the lock, a flags argument and a cpu as
+parameters.
+
+For the trylock variants, there is the qpw_trylock_t type, initialized
+with qpw_trylock_init(), and the corresponding qpw_trylock() and
+qpw_trylock_irqsave().
+
+work_struct should be replaced by qpw_struct, which contains a cpu field
+(the owner cpu of the lock) and is initialized with INIT_QPW().
+
+The queue-work helpers (analogous to queue_work_on() and flush_work())
+are queue_percpu_work_on() and flush_percpu_work().
+
+The behaviour of the QPW functions is as follows:
+
+* !CONFIG_QPW (or CONFIG_QPW and the qpw=0 kernel boot parameter):
+        - qpw_lock:                     local_lock
+        - qpw_lock_irqsave:             local_lock_irqsave
+        - qpw_trylock:                  local_trylock
+        - qpw_trylock_irqsave:          local_trylock_irqsave
+        - qpw_unlock:                   local_unlock
+        - queue_percpu_work_on:         queue_work_on
+        - flush_percpu_work:            flush_work
+
+* CONFIG_QPW (and CONFIG_QPW_DEFAULT=y or the qpw=1 kernel boot parameter):
+        - qpw_lock:                     spin_lock
+        - qpw_lock_irqsave:             spin_lock_irqsave
+        - qpw_trylock:                  spin_trylock
+        - qpw_trylock_irqsave:          spin_trylock_irqsave
+        - qpw_unlock:                   spin_unlock
+        - queue_percpu_work_on:         executes work function on caller cpu
+        - flush_percpu_work:            empty
+
+qpw_get_cpu(work_struct), to be called from within a qpw work function,
+returns the target cpu.
+
+In addition to the locking functions above, there are the local locking
+functions (local_qpw_lock, local_qpw_trylock and local_qpw_unlock).
+They disable preemption or migration and don't take a cpu parameter.
+These must only be used to access per-CPU data from the CPU that owns
+that data, never remotely.
+
Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Frederic Weisbecker 3 weeks, 5 days ago
On Mon, Mar 02, 2026 at 12:49:47PM -0300, Marcelo Tosatti wrote:
> Some places in the kernel implement a parallel programming strategy
> consisting on local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.
> 
> On the other hand, for RT workloads this can represent a problem:
> scheduling work on remote cpu that are executing low latency tasks
> is undesired and can introduce unexpected deadline misses.
> 
> It's interesting, though, that local_lock()s in RT kernels become
> spinlock(). We can make use of those to avoid scheduling work on a remote
> cpu by directly updating another cpu's per_cpu structure, while holding
> it's spinlock().
> 
> In order to do that, it's necessary to introduce a new set of functions to
> make it possible to get another cpu's per-cpu "local" lock (qpw_{un,}lock*)
> and also the corresponding queue_percpu_work_on() and flush_percpu_work()
> helpers to run the remote work.
> 
> Users of non-RT kernels but with low latency requirements can select
> similar functionality by using the CONFIG_QPW compile time option.
> 
> On CONFIG_QPW disabled kernels, no changes are expected, as every
> one of the introduced helpers work the exactly same as the current
> implementation:
> qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)

I find this part of the semantic a bit weird. If we eventually queue
the work, why do we care about doing a local_lock() locally ?

> queue_percpu_work_on()  ->  queue_work_on()
> flush_percpu_work()     ->  flush_work()
> 
> @@ -2840,6 +2840,16 @@ Kernel parameters
>  
>  			The format of <cpu-list> is described above.
>  
> +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
> +			and remote interference mechanism on a kernel built with
> +			CONFIG_QPW.
> +			Format: { "0" | "1" }
> +			0 - local_lock() + queue_work_on(remote_cpu)
> +			1 - spin_lock() for both local and remote operations
> +
> +			Selecting 1 may be interesting for systems that want
> +			to avoid interruption & context switches from IPIs.

Like Vlastimil suggested, it would be better to just have it off by default
and turn it on only if nohz_full= is passed. Then we can consider introducing
the parameter later if the need arise.

> +#define qpw_lock_init(lock)				\
> +	local_lock_init(lock)
> +
> +#define qpw_trylock_init(lock)				\
> +	local_trylock_init(lock)
> +
> +#define qpw_lock(lock, cpu)				\
> +	local_lock(lock)
> +
> +#define local_qpw_lock(lock)				\
> +	local_lock(lock)

It would be easier to grep if all the APIs start with qpw_* prefix.

qpw_local_lock() ?

> +
> +#define qpw_lock_irqsave(lock, flags, cpu)		\
> +	local_lock_irqsave(lock, flags)
> +
> +#define local_qpw_lock_irqsave(lock, flags)		\
> +	local_lock_irqsave(lock, flags)

ditto?

> +
> +#define qpw_trylock(lock, cpu)				\
> +	local_trylock(lock)
> +
> +#define local_qpw_trylock(lock)				\
> +	local_trylock(lock)

...

> +
> +#define qpw_trylock_irqsave(lock, flags, cpu)		\
> +	local_trylock_irqsave(lock, flags)
> +
> +#define qpw_unlock(lock, cpu)				\
> +	local_unlock(lock)
> +
> +#define local_qpw_unlock(lock)				\
> +	local_unlock(lock)

...

> +
> +#define qpw_unlock_irqrestore(lock, flags, cpu)		\
> +	local_unlock_irqrestore(lock, flags)
> +
> +#define local_qpw_unlock_irqrestore(lock, flags)	\
> +	local_unlock_irqrestore(lock, flags)

...

> +
> +#define qpw_lockdep_assert_held(lock)			\
> +	lockdep_assert_held(lock)
> +
> +#define queue_percpu_work_on(c, wq, qpw)		\
> +	queue_work_on(c, wq, &(qpw)->work)

qpw_queue_work_on() ?

Perhaps even better would be qpw_queue_work_for(), leaving some room for
mystery about where/how the work will be executed :-)

> +
> +#define flush_percpu_work(qpw)				\
> +	flush_work(&(qpw)->work)

qpw_flush_work() ?

> +
> +#define qpw_get_cpu(qpw)	smp_processor_id()
> +
> +#define qpw_is_cpu_remote(cpu)		(false)
> +
> +#define INIT_QPW(qpw, func, c)				\
> +	INIT_WORK(&(qpw)->work, (func))
> +
> @@ -762,6 +762,41 @@ config CPU_ISOLATION
>  
>  	  Say Y if unsure.
>  
> +config QPW
> +	bool "Queue per-CPU Work"
> +	depends on SMP || COMPILE_TEST
> +	default n
> +	help
> +	  Allow changing the behavior on per-CPU resource sharing with cache,
> +	  from the regular local_locks() + queue_work_on(remote_cpu) to using
> +	  per-CPU spinlocks on both local and remote operations.
> +
> +	  This is useful to give user the option on reducing IPIs to CPUs, and
> +	  thus reduce interruptions and context switches. On the other hand, it
> +	  increases generated code and will use atomic operations if spinlocks
> +	  are selected.
> +
> +	  If set, will use the default behavior set in QPW_DEFAULT unless boot
> +	  parameter qpw is passed with a different behavior.
> +
> +	  If unset, will use the local_lock() + queue_work_on() strategy,
> +	  regardless of the boot parameter or QPW_DEFAULT.
> +
> +	  Say N if unsure.

Perhaps that too should just be selected automatically by CONFIG_NO_HZ_FULL and if
the need arise in the future, make it visible to the user?

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Leonardo Bras 3 weeks, 3 days ago
On Fri, Mar 13, 2026 at 10:55:47PM +0100, Frederic Weisbecker wrote:
> On Mon, Mar 02, 2026 at 12:49:47PM -0300, Marcelo Tosatti wrote:
> > Some places in the kernel implement a parallel programming strategy
> > consisting on local_locks() for most of the work, and some rare remote
> > operations are scheduled on target cpu. This keeps cache bouncing low since
> > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > kernels, even though the very few remote operations will be expensive due
> > to scheduling overhead.
> > 
> > On the other hand, for RT workloads this can represent a problem:
> > scheduling work on remote cpu that are executing low latency tasks
> > is undesired and can introduce unexpected deadline misses.
> > 
> > It's interesting, though, that local_lock()s in RT kernels become
> > spinlock(). We can make use of those to avoid scheduling work on a remote
> > cpu by directly updating another cpu's per_cpu structure, while holding
> > it's spinlock().
> > 
> > In order to do that, it's necessary to introduce a new set of functions to
> > make it possible to get another cpu's per-cpu "local" lock (qpw_{un,}lock*)
> > and also the corresponding queue_percpu_work_on() and flush_percpu_work()
> > helpers to run the remote work.
> > 
> > Users of non-RT kernels but with low latency requirements can select
> > similar functionality by using the CONFIG_QPW compile time option.
> > 
> > On CONFIG_QPW disabled kernels, no changes are expected, as every
> > one of the introduced helpers work the exactly same as the current
> > implementation:
> > qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
> 
> I find this part of the semantic a bit weird. If we eventually queue
> the work, why do we care about doing a local_lock() locally ?

(Sorry, not sure if I was able to understand the question.)

Local locks make sure a per-cpu procedure happens on the same CPU from 
start to end: using migrate_disable() and per-cpu spinlocks on RT, and 
preempt_disable() on non-RT.

Most of the cases happen to have the work done in the local cpu, and just 
a few procedures happen to be queued remotely, such as remote cache 
draining. 

Even with the new 'local_qpw_lock()', which is faster for cases we know 
are local, on qpw=0 we have to make qpw_lock() a local_lock as well, 
since the cpu receiving the scheduled work needs to run it all without 
moving to a different cpu.

> 
> > queue_percpu_work_on()  ->  queue_work_on()
> > flush_percpu_work()     ->  flush_work()

btw Marcelo, I think we need to do add the local_qpw_lock here as well, or 
change the first line to '{local_,}qpw_{un,}lock*()'

> > 
> > @@ -2840,6 +2840,16 @@ Kernel parameters
> >  
> >  			The format of <cpu-list> is described above.
> >  
> > +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
> > +			and remote interference mechanism on a kernel built with
> > +			CONFIG_QPW.
> > +			Format: { "0" | "1" }
> > +			0 - local_lock() + queue_work_on(remote_cpu)
> > +			1 - spin_lock() for both local and remote operations
> > +
> > +			Selecting 1 may be interesting for systems that want
> > +			to avoid interruption & context switches from IPIs.
> 
> Like Vlastimil suggested, it would be better to just have it off by default
> and turn it on only if nohz_full= is passed. Then we can consider introducing
> the parameter later if the need arise.

I agree with having it enabled with isolcpus/nohz_full, but I would 
recommend having this option anyway, as the user could disable qpw if 
wanted, or enable outside isolcpu scenarios for any reason.

> 
> > +#define qpw_lock_init(lock)				\
> > +	local_lock_init(lock)
> > +
> > +#define qpw_trylock_init(lock)				\
> > +	local_trylock_init(lock)
> > +
> > +#define qpw_lock(lock, cpu)				\
> > +	local_lock(lock)
> > +
> > +#define local_qpw_lock(lock)				\
> > +	local_lock(lock)
> 
> It would be easier to grep if all the APIs start with qpw_* prefix.
> 
> qpw_local_lock() ?

Sure, not against the change.
And sure, would need to change all versions starting with local_ .

> 
> > +
> > +#define qpw_lock_irqsave(lock, flags, cpu)		\
> > +	local_lock_irqsave(lock, flags)
> > +
> > +#define local_qpw_lock_irqsave(lock, flags)		\
> > +	local_lock_irqsave(lock, flags)
> 
> ditto?
> 
> > +
> > +#define qpw_trylock(lock, cpu)				\
> > +	local_trylock(lock)
> > +
> > +#define local_qpw_trylock(lock)				\
> > +	local_trylock(lock)
> 
> ...
> 
> > +
> > +#define qpw_trylock_irqsave(lock, flags, cpu)		\
> > +	local_trylock_irqsave(lock, flags)
> > +
> > +#define qpw_unlock(lock, cpu)				\
> > +	local_unlock(lock)
> > +
> > +#define local_qpw_unlock(lock)				\
> > +	local_unlock(lock)
> 
> ...
> 
> > +
> > +#define qpw_unlock_irqrestore(lock, flags, cpu)		\
> > +	local_unlock_irqrestore(lock, flags)
> > +
> > +#define local_qpw_unlock_irqrestore(lock, flags)	\
> > +	local_unlock_irqrestore(lock, flags)
> 
> ...
> 
> > +
> > +#define qpw_lockdep_assert_held(lock)			\
> > +	lockdep_assert_held(lock)
> > +
> > +#define queue_percpu_work_on(c, wq, qpw)		\
> > +	queue_work_on(c, wq, &(qpw)->work)
> 
> qpw_queue_work_on() ?
> 
> Perhaps even better would be qpw_queue_work_for(), leaving some room for
> mystery about where/how the work will be executed :-)
> 

QPW comes from Queue PerCPU Work.
Having it called qpw_queue_work_{on,for}() would be repetitive,
but having qpw_on() or qpw_for() would be misleading :)

That's why I went with queue_percpu_work_on() based on how we have the 
original function (queue_work_on) being called.

> > +
> > +#define flush_percpu_work(qpw)				\
> > +	flush_work(&(qpw)->work)
> 
> qpw_flush_work() ?

Same as above,
qpw_flush() ?

> 
> > +
> > +#define qpw_get_cpu(qpw)	smp_processor_id()
> > +
> > +#define qpw_is_cpu_remote(cpu)		(false)
> > +
> > +#define INIT_QPW(qpw, func, c)				\
> > +	INIT_WORK(&(qpw)->work, (func))
> > +
> > @@ -762,6 +762,41 @@ config CPU_ISOLATION
> >  
> >  	  Say Y if unsure.
> >  
> > +config QPW
> > +	bool "Queue per-CPU Work"
> > +	depends on SMP || COMPILE_TEST
> > +	default n
> > +	help
> > +	  Allow changing the behavior on per-CPU resource sharing with cache,
> > +	  from the regular local_locks() + queue_work_on(remote_cpu) to using
> > +	  per-CPU spinlocks on both local and remote operations.
> > +
> > +	  This is useful to give user the option on reducing IPIs to CPUs, and
> > +	  thus reduce interruptions and context switches. On the other hand, it
> > +	  increases generated code and will use atomic operations if spinlocks
> > +	  are selected.
> > +
> > +	  If set, will use the default behavior set in QPW_DEFAULT unless boot
> > +	  parameter qpw is passed with a different behavior.
> > +
> > +	  If unset, will use the local_lock() + queue_work_on() strategy,
> > +	  regardless of the boot parameter or QPW_DEFAULT.
> > +
> > +	  Say N if unsure.
> 
> Perhaps that too should just be selected automatically by CONFIG_NO_HZ_FULL and if
> the need arise in the future, make it visible to the user?
> 

I think it would be good to have this, and let whoever is building the 
kernel have the chance to disable QPW if it doesn't work well for their 
machines or workload, without having to add a new boot parameter to keep 
their stuff working as always after a kernel update.

But that is open to discussion :)

Thanks!
Leo
Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Frederic Weisbecker 3 weeks, 1 day ago
On Sun, Mar 15, 2026 at 03:10:27PM -0300, Leonardo Bras wrote:
> On Fri, Mar 13, 2026 at 10:55:47PM +0100, Frederic Weisbecker wrote:
> > I find this part of the semantic a bit weird. If we eventually queue
> > the work, why do we care about doing a local_lock() locally ?
> 
> (Sorry, not sure if I was able to understand the question.)
> 
> Local locks make sure a per-cpu procedure happens on the same CPU from 
> start to end. Using migrate_disable & using per-cpu spinlocks on RT and 
> doing preempt_disable in non_RT.
> 
> Most of the cases happen to have the work done in the local cpu, and just 
> a few procedures happen to be queued remotely, such as remote cache 
> draining. 
> 
> Even with the new 'local_qpw_lock()' which is faster for cases we are sure 
> to have local usages, on qpw=0 we have to make qpw_lock() a local_lock as 
> well, as the cpu receiving the scheduled work needs to make sure to run it 
> all without moving to a different cpu.

But queue_work_on() already makes sure the work doesn't move to a different CPU
(provided hotplug is correctly handled for the work).

Looks like we are both confused, so let's take a practical example. Suppose
CPU 0 queues a work to CPU 1 which sets a per-cpu variable named A to the value
"1". We want to guarantee that further reads of that per-cpu value by CPU 1
see the new value. With qpw=1, it looks like this:

CPU 0                                               CPU 1
-----                                               -----

qpw_lock(CPU 1)
   spin_lock(&QPW_CPU1)
qpw_queue_for(write_A, 1)
    write_A()
       A1 = per_cpu_ptr(&A, 1)
       *A1 = 1
qpw_unlock(CPU 1)
    spin_unlock(&QPW_CPU1)
                                                   read_A()
                                                       qpw_lock(CPU 1)
                                                           spin_lock(&QPW_CPU1)
                                                       r0 = __this_cpu_read(&A)
                                                       qpw_unlock(CPU 1)
                                                           spin_unlock(&QPW_CPU1)
                                                   

CPU 0 took the spinlock while writing to A, so CPU 1 is guaranteed to further
observe the new value, because it takes the same spinlock (r0 == 1).

Now look at the qpw=0 case:
                                  
CPU 0                                               CPU 1
-----                                               -----

qpw_lock(CPU 1)
   local_lock(&QPW_CPU0)
qpw_queue_for(write_A, 1)
    queue_work_on(write_A, CPU 1)
qpw_unlock(CPU 1)
    local_unlock(&QPW_CPU0)
                                                   // workqueue
                                                   write_A()
                                                       qpw_lock(CPU 1)
                                                           local_lock(&QPW_CPU1)
                                                       A1 = per_cpu_ptr(&A, 1)
                                                       *A1 = 1
                                                       qpw_unlock(CPU 1)
                                                           local_unlock(&QPW_CPU1)

                                                   read_A()
                                                       qpw_lock(CPU 1)
                                                           local_lock(&QPW_CPU1)
                                                       r0 = __this_cpu_read(&A)
                                                       qpw_unlock(CPU 1)
                                                           local_unlock(&QPW_CPU1)

Here CPU 0 queues the work on CPU 1 which writes and reads the new value
(r0 == 1). local_lock() / preempt_disable() makes sure the CPU doesn't change.

But what is the point in doing local_lock(&QPW_CPU0) on CPU 0 ?


> > > 
> > > @@ -2840,6 +2840,16 @@ Kernel parameters
> > >  
> > >  			The format of <cpu-list> is described above.
> > >  
> > > +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
> > > +			and remote interference mechanism on a kernel built with
> > > +			CONFIG_QPW.
> > > +			Format: { "0" | "1" }
> > > +			0 - local_lock() + queue_work_on(remote_cpu)
> > > +			1 - spin_lock() for both local and remote operations
> > > +
> > > +			Selecting 1 may be interesting for systems that want
> > > +			to avoid interruption & context switches from IPIs.
> > 
> > Like Vlastimil suggested, it would be better to just have it off by default
> > and turn it on only if nohz_full= is passed. Then we can consider introducing
> > the parameter later if the need arise.
> 
> I agree with having it enabled with isolcpus/nohz_full, but I would 
> recommend having this option anyway, as the user could disable qpw if 
> wanted, or enable outside isolcpu scenarios for any reason.

Do you know any such users? Or suspect a potential usecase? If not we can still
add that option later. It's probably better than sticking with a useless
parameter that we'll have to maintain forever.

> > > +#define qpw_lockdep_assert_held(lock)			\
> > > +	lockdep_assert_held(lock)
> > > +
> > > +#define queue_percpu_work_on(c, wq, qpw)		\
> > > +	queue_work_on(c, wq, &(qpw)->work)
> > 
> > qpw_queue_work_on() ?
> > 
> > Perhaps even better would be qpw_queue_work_for(), leaving some room for
> > mystery about where/how the work will be executed :-)
> > 
> 
> QPW comes from Queue PerCPU Work
> Having it called qpw_queue_work_{on,for}() would be repetitve

Well, qpw_ just becomes the name of the subsystem and its prefix for APIs.
For example qpw_lock() doesn't mean that we queue and lock, it only means we lock.

> But having qpw_on() or qpw_for() would be misleading :) 
> 
> That's why I went with queue_percpu_work_on() based on how we have the 
> original function (queue_work_on) being called.

That's much more misleading, as it doesn't refer to qpw at all and it only
suggests that it's queueing on a per-cpu workqueue.

> > Perhaps that too should just be selected automatically by CONFIG_NO_HZ_FULL and if
> > the need arise in the future, make it visible to the user?
> > 
> 
> I think it would be good to have this, and let whoever is building have the 
> chance to disable QPW if it doesn't work well for their machines or 
> workload, without having to add a new boot parameter to continue have 
> their stuff working as always after a kernel update.
> 
> But that is open to discussion :)

Ok I guess we can stick with the Kconfig at least in the beginning.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Marcelo Tosatti 2 weeks, 2 days ago
On Tue, Mar 17, 2026 at 02:33:50PM +0100, Frederic Weisbecker wrote:
> Le Sun, Mar 15, 2026 at 03:10:27PM -0300, Leonardo Bras a écrit :
> > On Fri, Mar 13, 2026 at 10:55:47PM +0100, Frederic Weisbecker wrote:
> > > I find this part of the semantic a bit weird. If we eventually queue
> > > the work, why do we care about doing a local_lock() locally ?
> > 
> > (Sorry, not sure if I was able to understand the question.)
> > 
> > Local locks make sure a per-cpu procedure happens on the same CPU from 
> > start to end. Using migrate_disable & using per-cpu spinlocks on RT and 
> > doing preempt_disable in non_RT.
> > 
> > Most of the cases happen to have the work done in the local cpu, and just 
> > a few procedures happen to be queued remotely, such as remote cache 
> > draining. 
> > 
> > Even with the new 'local_qpw_lock()' which is faster for cases we are sure 
> > to have local usages, on qpw=0 we have to make qpw_lock() a local_lock as 
> > well, as the cpu receiving the scheduled work needs to make sure to run it 
> > all without moving to a different cpu.
> 
> But queue_work_on() already makes sure the work doesn't move to a different CPU
> (provided hotplug is correctly handled for the work).

commit b01b2141999936ac3e4746b7f76c0f204ae4b445
Author: Ingo Molnar <mingo@kernel.org>
Date:   Wed May 27 22:11:15 2020 +0200

    mm/swap: Use local_lock for protection

    The various struct pagevec per CPU variables are protected by disabling
    either preemption or interrupts across the critical sections. Inside
    these sections spinlocks have to be acquired.

    These spinlocks are regular spinlock_t types which are converted to
    "sleeping" spinlocks on PREEMPT_RT enabled kernels. Obviously sleeping
    locks cannot be acquired in preemption or interrupt disabled sections.

    local locks provide a trivial way to substitute preempt and interrupt
    disable instances. On a non PREEMPT_RT enabled kernel local_lock() maps
    to preempt_disable() and local_lock_irq() to local_irq_disable().

    Create lru_rotate_pvecs containing the pagevec and the locallock.
    Create lru_pvecs containing the remaining pagevecs and the locallock.
    Add lru_add_drain_cpu_zone() which is used from compact_zone() to avoid
    exporting the pvec structure.

    Change the relevant call sites to acquire these locks instead of using
    preempt_disable() / get_cpu() / get_cpu_var() and local_irq_disable() /
    local_irq_save().

    There is neither a functional change nor a change in the generated
    binary code for non PREEMPT_RT enabled non-debug kernels.

    When lockdep is enabled local locks have lockdep maps embedded. These
    allow lockdep to validate the protections, i.e. inappropriate usage of a
    preemption only protected sections would result in a lockdep warning
    while the same problem would not be noticed with a plain
    preempt_disable() based protection.

    local locks also improve readability as they provide a named scope for
    the protections while preempt/interrupt disable are opaque scopeless.

    Finally local locks allow PREEMPT_RT to substitute them with real
    locking primitives to ensure the correctness of operation in a fully
    preemptible kernel.

    [ bigeasy: Adopted to use local_lock ]

> Looks like we are both confused, so let's take a practical example. Suppose
> CPU 0 queues a work to CPU 1 which sets a per-cpu variable named A to the value
> "1". We want to guarantee that further reads of that per-cpu value by CPU 1
> see the new value. With qpw=1, it looks like this:
> 
> CPU 0                                               CPU 1
> -----                                               -----
> 
> qpw_lock(CPU 1)
>    spin_lock(&QPW_CPU1)
> qpw_queue_for(write_A, 1)
>     write_A()
>        A1 = per_cpu_ptr(&A, 1)
>        *A1 = 1
> qpw_unlock(CPU 1)
>     spin_unlock(&QPW_CPU1)
>                                                    read_A()
>                                                        qpw_lock(CPU 1)
>                                                            spin_lock(&QPW_CPU1)
>                                                        r0 = __this_cpu_read(&A)
>                                                        qpw_unlock(CPU 1)
>                                                            spin_unlock(&QPW_CPU1)
>                                                    
> 
> CPU 0 took the spinlock while writing to A, so CPU 1 is guaranteed to further
> observe the new value because it takes the same spinlock (r0 == 1)
> 
> Now look at the qpw=0 case:
>                                   
> CPU 0                                               CPU 1
> -----                                               -----
> 
> qpw_lock(CPU 1)
>    local_lock(&QPW_CPU0)
> qpw_queue_for(write_A, 1)
>     queue_work_on(write_A, CPU 1)
> qpw_unlock(CPU 1)
>     local_unlock(&QPW_CPU0)
>                                                    // workqueue
>                                                    write_A()
>                                                        qpw_lock(CPU 1)
>                                                            local_lock(&QPW_CPU1)
>                                                        A1 = per_cpu_ptr(&A, 1)
>                                                        *A1 = 1
>                                                        qpw_unlock(CPU 1)
>                                                            local_unlock(&QPW_CPU1)
> 
>                                                    read_A()
>                                                        qpw_lock(CPU 1)
>                                                            local_lock(&QPW_CPU1)
>                                                        r0 = __this_cpu_read(&A)
>                                                        qpw_unlock(CPU 1)
>                                                            local_unlock(&QPW_CPU1)
> 
> Here CPU 0 queues the work on CPU 1 which writes and reads the new value
> (r0 == 1). local_lock() / preempt_disable() makes sure the CPU doesn't change.
> 
> But what is the point in doing local_lock(&QPW_CPU0) on CPU 0 ?

To protect certain structures that are protected by preempt_disable
(non-RT) and migrate_disable (RT).

> > > > 
> > > > @@ -2840,6 +2840,16 @@ Kernel parameters
> > > >  
> > > >  			The format of <cpu-list> is described above.
> > > >  
> > > > +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
> > > > +			and remote interference mechanism on a kernel built with
> > > > +			CONFIG_QPW.
> > > > +			Format: { "0" | "1" }
> > > > +			0 - local_lock() + queue_work_on(remote_cpu)
> > > > +			1 - spin_lock() for both local and remote operations
> > > > +
> > > > +			Selecting 1 may be interesting for systems that want
> > > > +			to avoid interruption & context switches from IPIs.
> > > 
> > > Like Vlastimil suggested, it would be better to just have it off by default
> > > and turn it on only if nohz_full= is passed. Then we can consider introducing
> > > the parameter later if the need arise.
> > 
> > I agree with having it enabled with isolcpus/nohz_full, but I would 
> > recommend having this option anyway, as the user could disable qpw if 
> > wanted, or enable outside isolcpu scenarios for any reason.
> 
> Do you know any such users? Or suspect a potential usecase? If not we can still
> add that option later. It's probably better than sticking with a useless
> parameter that we'll have to maintain forever.

Someone that does not boot with isolcpus= but uses cgroups for CPU
isolation?

> > > > +#define qpw_lockdep_assert_held(lock)			\
> > > > +	lockdep_assert_held(lock)
> > > > +
> > > > +#define queue_percpu_work_on(c, wq, qpw)		\
> > > > +	queue_work_on(c, wq, &(qpw)->work)
> > > 
> > > qpw_queue_work_on() ?
> > > 
> > > Perhaps even better would be qpw_queue_work_for(), leaving some room for
> > > mystery about where/how the work will be executed :-)
> > > 
> > 
> > QPW comes from Queue PerCPU Work
> > Having it called qpw_queue_work_{on,for}() would be repetitve
> 
> Well, qpw_ just becomes the name of the subsystem and its prefix for APIs.
> For example qpw_lock() doesn't mean that we queue and lock, it only means we lock.
> 
> > But having qpw_on() or qpw_for() would be misleading :) 
> > 
> > That's why I went with queue_percpu_work_on() based on how we have the 
> > original function (queue_work_on) being called.
> 
> That's much more misleading at it doesn't refer to qpw at all and it only
> suggest that it's a queueing a per-cpu workqueue.
> 
> > > Perhaps that too should just be selected automatically by CONFIG_NO_HZ_FULL and if
> > > the need arise in the future, make it visible to the user?
> > > 
> > 
> > I think it would be good to have this, and let whoever is building have the 
> > chance to disable QPW if it doesn't work well for their machines or 
> > workload, without having to add a new boot parameter to continue have 
> > their stuff working as always after a kernel update.
> > 
> > But that is open to discussion :)
> 
> Ok I guess we can stick with the Kconfig at least in the beginning.
> 
> Thanks.
> 
> -- 
> Frederic Weisbecker
> SUSE Labs
> 
> 
Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Leonardo Bras 2 weeks, 3 days ago
On Tue, Mar 17, 2026 at 02:33:50PM +0100, Frederic Weisbecker wrote:
> Le Sun, Mar 15, 2026 at 03:10:27PM -0300, Leonardo Bras a écrit :
> > On Fri, Mar 13, 2026 at 10:55:47PM +0100, Frederic Weisbecker wrote:
> > > I find this part of the semantic a bit weird. If we eventually queue
> > > the work, why do we care about doing a local_lock() locally ?
> > 
> > (Sorry, not sure if I was able to understand the question.)
> > 
> > Local locks make sure a per-cpu procedure happens on the same CPU from 
> > start to end. Using migrate_disable & using per-cpu spinlocks on RT and 
> > doing preempt_disable in non_RT.
> > 
> > Most of the cases happen to have the work done in the local cpu, and just 
> > a few procedures happen to be queued remotely, such as remote cache 
> > draining. 
> > 
> > Even with the new 'local_qpw_lock()' which is faster for cases we are sure 
> > to have local usages, on qpw=0 we have to make qpw_lock() a local_lock as 
> > well, as the cpu receiving the scheduled work needs to make sure to run it 
> > all without moving to a different cpu.
> 
> But queue_work_on() already makes sure the work doesn't move to a different CPU
> (provided hotplug is correctly handled for the work).
> 
> Looks like we are both confused, so let's take a practical example. Suppose
> CPU 0 queues a work to CPU 1 which sets a per-cpu variable named A to the value
> "1". We want to guarantee that further reads of that per-cpu value by CPU 1
> see the new value. With qpw=1, it looks like this:
> 
> CPU 0                                               CPU 1
> -----                                               -----
> 
> qpw_lock(CPU 1)
>    spin_lock(&QPW_CPU1)
> qpw_queue_for(write_A, 1)
>     write_A()
>        A1 = per_cpu_ptr(&A, 1)
>        *A1 = 1
> qpw_unlock(CPU 1)
>     spin_unlock(&QPW_CPU1)
>                                                    read_A()
>                                                        qpw_lock(CPU 1)
>                                                            spin_lock(&QPW_CPU1)
>                                                        r0 = __this_cpu_read(&A)
>                                                        qpw_unlock(CPU 1)
>                                                            spin_unlock(&QPW_CPU1)
>                                                    
> 
> CPU 0 took the spinlock while writing to A, so CPU 1 is guaranteed to further
> observe the new value because it takes the same spinlock (r0 == 1)
> 

Here, if we are in CPU0 we should never take the qpw_lock(CPU1) unless we 
are inside queue_percpu_work_on().

Maybe I am not getting your use case :/

Also, I don't see a case where we would need to call 
queue_percpu_work_on() inside a qpw_lock(). This could be dangerous, as 
the same could happen on another cpu and cause a deadlock:

CPU 0 				CPU 1
qpw_lock(0)			qpw_lock(1)
...				...
queue_percpu_work_on()		queue_percpu_work_on()
	qpw_lock(1)			qpw_lock(0)


> Now look at the qpw=0 case:
>                                   
> CPU 0                                               CPU 1
> -----                                               -----
> 
> qpw_lock(CPU 1)
>    local_lock(&QPW_CPU0)
> qpw_queue_for(write_A, 1)
>     queue_work_on(write_A, CPU 1)
> qpw_unlock(CPU 1)
>     local_unlock(&QPW_CPU0)
>                                                    // workqueue
>                                                    write_A()
>                                                        qpw_lock(CPU 1)
>                                                            local_lock(&QPW_CPU1)
>                                                        A1 = per_cpu_ptr(&A, 1)
>                                                        *A1 = 1
>                                                        qpw_unlock(CPU 1)
>                                                            local_unlock(&QPW_CPU1)
> 
>                                                    read_A()
>                                                        qpw_lock(CPU 1)
>                                                            local_lock(&QPW_CPU1)
>                                                        r0 = __this_cpu_read(&A)
>                                                        qpw_unlock(CPU 1)
>                                                            local_unlock(&QPW_CPU1)
> 
> Here CPU 0 queues the work on CPU 1 which writes and reads the new value
> (r0 == 1). local_lock() / preempt_disable() makes sure the CPU doesn't change.
> 
> But what is the point in doing local_lock(&QPW_CPU0) on CPU 0 ?

I can't see a case where one would need to hold the qpw_lock while 
calling queue_percpu_work_on(). Holding the qpw_lock() (as with 
local_lock()) should be done only when one is working on data particular 
to that cpu's structures. Queuing work on another CPU while touching this 
cpu's data is unexpected to me. 



> 
> 
> > > > 
> > > > @@ -2840,6 +2840,16 @@ Kernel parameters
> > > >  
> > > >  			The format of <cpu-list> is described above.
> > > >  
> > > > +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
> > > > +			and remote interference mechanism on a kernel built with
> > > > +			CONFIG_QPW.
> > > > +			Format: { "0" | "1" }
> > > > +			0 - local_lock() + queue_work_on(remote_cpu)
> > > > +			1 - spin_lock() for both local and remote operations
> > > > +
> > > > +			Selecting 1 may be interesting for systems that want
> > > > +			to avoid interruption & context switches from IPIs.
> > > 
> > > Like Vlastimil suggested, it would be better to just have it off by default
> > > and turn it on only if nohz_full= is passed. Then we can consider introducing
> > > the parameter later if the need arise.
> > 
> > I agree with having it enabled with isolcpus/nohz_full, but I would 
> > recommend having this option anyway, as the user could disable qpw if 
> > wanted, or enable outside isolcpu scenarios for any reason.
> 
> Do you know any such users? Or suspect a potential usecase? If not we can still
> add that option later. It's probably better than sticking with a useless
> parameter that we'll have to maintain forever.

Off the top of my head, I can only think of an HPC scenario where the user 
wants to make use of the regular/RT scheduler for many small workloads, but 
doesn't like the impact of IPIs in those cases. Systems that push memory to 
its limit will also benefit, for example, if the cache gets drained 
remotely very often.

None of those will necessarily need or benefit from isolcpus, and they may 
want to just use the kernel scheduler policies. 

> 
> > > > +#define qpw_lockdep_assert_held(lock)			\
> > > > +	lockdep_assert_held(lock)
> > > > +
> > > > +#define queue_percpu_work_on(c, wq, qpw)		\
> > > > +	queue_work_on(c, wq, &(qpw)->work)
> > > 
> > > qpw_queue_work_on() ?
> > > 
> > > Perhaps even better would be qpw_queue_work_for(), leaving some room for
> > > mystery about where/how the work will be executed :-)
> > > 
> > 
> > QPW comes from Queue PerCPU Work
> > Having it called qpw_queue_work_{on,for}() would be repetitve
> 
> Well, qpw_ just becomes the name of the subsystem and its prefix for APIs.
> For example qpw_lock() doesn't mean that we queue and lock, it only means we lock.
> 

Locks for queue'ing per-cpu work. :D

> > But having qpw_on() or qpw_for() would be misleading :) 
> > 
> > That's why I went with queue_percpu_work_on() based on how we have the 
> > original function (queue_work_on) being called.
> 
> That's much more misleading at it doesn't refer to qpw at all and it only
> suggest that it's a queueing a per-cpu workqueue.
> 

Humm, maybe qpw_queue_for/on()?

Or maybe change the name of the API to pw:
pw_lock()/unlock
pw_queue();
pw_flush()

and so on?

That way it stays true to what it means :)


> > > Perhaps that too should just be selected automatically by CONFIG_NO_HZ_FULL and if
> > > the need arise in the future, make it visible to the user?
> > > 
> > 
> > I think it would be good to have this, and let whoever is building have the 
> > chance to disable QPW if it doesn't work well for their machines or 
> > workload, without having to add a new boot parameter to continue have 
> > their stuff working as always after a kernel update.
> > 
> > But that is open to discussion :)
> 
> Ok I guess we can stick with the Kconfig at least in the beginning.
> 
> Thanks.
> 
> -- 
> Frederic Weisbecker
> SUSE Labs


Thanks!
Leo
Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Frederic Weisbecker 2 weeks, 2 days ago
Le Sun, Mar 22, 2026 at 10:38:56PM -0300, Leonardo Bras a écrit :
> On Tue, Mar 17, 2026 at 02:33:50PM +0100, Frederic Weisbecker wrote:
> > Le Sun, Mar 15, 2026 at 03:10:27PM -0300, Leonardo Bras a écrit :
> > > On Fri, Mar 13, 2026 at 10:55:47PM +0100, Frederic Weisbecker wrote:
> > > > I find this part of the semantic a bit weird. If we eventually queue
> > > > the work, why do we care about doing a local_lock() locally ?
> > > 
> > > (Sorry, not sure if I was able to understand the question.)
> > > 
> > > Local locks make sure a per-cpu procedure happens on the same CPU from 
> > > start to end. Using migrate_disable & using per-cpu spinlocks on RT and 
> > > doing preempt_disable in non_RT.
> > > 
> > > Most of the cases happen to have the work done in the local cpu, and just 
> > > a few procedures happen to be queued remotely, such as remote cache 
> > > draining. 
> > > 
> > > Even with the new 'local_qpw_lock()' which is faster for cases we are sure 
> > > to have local usages, on qpw=0 we have to make qpw_lock() a local_lock as 
> > > well, as the cpu receiving the scheduled work needs to make sure to run it 
> > > all without moving to a different cpu.
> > 
> > But queue_work_on() already makes sure the work doesn't move to a different CPU
> > (provided hotplug is correctly handled for the work).
> > 
> > Looks like we are both confused, so let's take a practical example. Suppose
> > CPU 0 queues a work to CPU 1 which sets a per-cpu variable named A to the value
> > "1". We want to guarantee that further reads of that per-cpu value by CPU 1
> > see the new value. With qpw=1, it looks like this:
> > 
> > CPU 0                                               CPU 1
> > -----                                               -----
> > 
> > qpw_lock(CPU 1)
> >    spin_lock(&QPW_CPU1)
> > qpw_queue_for(write_A, 1)
> >     write_A()
> >        A1 = per_cpu_ptr(&A, 1)
> >        *A1 = 1
> > qpw_unlock(CPU 1)
> >     spin_unlock(&QPW_CPU1)
> >                                                    read_A()
> >                                                        qpw_lock(CPU 1)
> >                                                            spin_lock(&QPW_CPU1)
> >                                                        r0 = __this_cpu_read(&A)
> >                                                        qpw_unlock(CPU 1)
> >                                                            spin_unlock(&QPW_CPU1)
> >                                                    
> > 
> > CPU 0 took the spinlock while writing to A, so CPU 1 is guaranteed to further
> > observe the new value because it takes the same spinlock (r0 == 1)
> > 
> 
> Here, if we are in CPU0 we should never take the qpw_lock(CPU1) unless we 
> are inside queue_percpu_work_on().
> 
> Maybe I am not getting your use case :/
> 
> Also, I don't see a case where we would need to call 
> queue_percpu_work_on() inside a qpw_lock(). This could be dangerous as it 
> could be the case in another cpu and cause a deadlock:
> 
> CPU 0 				CPU 1
> qpw_lock(0)			qpw_lock(1)
> ...				...
> queue_percpu_work_on()		queue_percpu_work_on()
> 	qpw_lock(1)			qpw_lock(0)

Ok I just checked the practical usecase in the patchset and it was me not
getting your usecase. The qpw lock is used inside the work itself. And now
that makes sense.

> 
> 
> > Now look at the qpw=0 case:
> >                                   
> > CPU 0                                               CPU 1
> > -----                                               -----
> > 
> > qpw_lock(CPU 1)
> >    local_lock(&QPW_CPU0)
> > qpw_queue_for(write_A, 1)
> >     queue_work_on(write_A, CPU 1)
> > qpw_unlock(CPU 1)
> >     local_unlock(&QPW_CPU0)
> >                                                    // workqueue
> >                                                    write_A()
> >                                                        qpw_lock(CPU 1)
> >                                                            local_lock(&QPW_CPU1)
> >                                                        A1 = per_cpu_ptr(&A, 1)
> >                                                        *A1 = 1
> >                                                        qpw_unlock(CPU 1)
> >                                                            local_unlock(&QPW_CPU1)
> > 
> >                                                    read_A()
> >                                                        qpw_lock(CPU 1)
> >                                                            local_lock(&QPW_CPU1)
> >                                                        r0 = __this_cpu_read(&A)
> >                                                        qpw_unlock(CPU 1)
> >                                                            local_unlock(&QPW_CPU1)
> > 
> > Here CPU 0 queues the work on CPU 1 which writes and reads the new value
> > (r0 == 1). local_lock() / preempt_disable() makes sure the CPU doesn't change.
> > 
> > But what is the point in doing local_lock(&QPW_CPU0) on CPU 0 ?
> 
> I can't see the case where one would need to hold the qpw_lock while 
> calling queue_percpu_work_on(). Holding the qpw_lock() (as is the case of
> local_lock()) should be done only when one is working on data particular to 
> that cpu structures. Queuing work on other CPU while touching this cpu data 
> is unexpected to me.

Yep!

> > > > Like Vlastimil suggested, it would be better to just have it off by default
> > > > and turn it on only if nohz_full= is passed. Then we can consider introducing
> > > > the parameter later if the need arise.
> > > 
> > > I agree with having it enabled with isolcpus/nohz_full, but I would 
> > > recommend having this option anyway, as the user could disable qpw if 
> > > wanted, or enable outside isolcpu scenarios for any reason.
> > 
> > Do you know any such users? Or suspect a potential usecase? If not we can still
> > add that option later. It's probably better than sticking with a useless
> > parameter that we'll have to maintain forever.
> 
> Out of my head, I can think only on HPC scenario where user wants to make 
> use of the regular/RT scheduler for many small workloads, but doesn't like 
> the impact of IPI on those cases.

There are many more IPIs to care about then. I suspect the issue would be more
about the workqueue itself.

> Such systems that explore memory at it's 
> limit will also benefit from those, for example, if cache gets drained 
> remotely very often.
> 
> None of those necessarily will need to or benefit from isolcpus, and may 
> want to just use the kernel scheduler policies.

This sounds like "just in case" usecases that could be dealt with later if
needed. But like Marcelo said, those who want to rely on cpuset isolated
partitions would need to enable that on boot.

> > > QPW comes from Queue PerCPU Work
> > > Having it called qpw_queue_work_{on,for}() would be repetitve
> > 
> > Well, qpw_ just becomes the name of the subsystem and its prefix for APIs.
> > For example qpw_lock() doesn't mean that we queue and lock, it only means we lock.
> > 
> 
> Locks for queue'ing per-cpu work. :D

Right!

> 
> > > But having qpw_on() or qpw_for() would be misleading :) 
> > > 
> > > That's why I went with queue_percpu_work_on() based on how we have the 
> > > original function (queue_work_on) being called.
> > 
> > That's much more misleading at it doesn't refer to qpw at all and it only
> > suggest that it's a queueing a per-cpu workqueue.
> > 
> 
> Humm, maybe qpw_queue_for/on()?
> 
> Or maybe change the name of the API for pw:
> pw_lock()/unlock
> pw_queue();
> pw_flush()
> 
> and so on?
> 
> That way it stays true to what means :)

Would be better to keep the same prefix for all APIs :-)

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Leonardo Bras 2 weeks, 1 day ago
On Tue, Mar 24, 2026 at 12:54:17PM +0100, Frederic Weisbecker wrote:
> Le Sun, Mar 22, 2026 at 10:38:56PM -0300, Leonardo Bras a écrit :
> > On Tue, Mar 17, 2026 at 02:33:50PM +0100, Frederic Weisbecker wrote:
> > > Le Sun, Mar 15, 2026 at 03:10:27PM -0300, Leonardo Bras a écrit :
> > > > On Fri, Mar 13, 2026 at 10:55:47PM +0100, Frederic Weisbecker wrote:
> > > > > I find this part of the semantic a bit weird. If we eventually queue
> > > > > the work, why do we care about doing a local_lock() locally ?
> > > > 
> > > > (Sorry, not sure if I was able to understand the question.)
> > > > 
> > > > Local locks make sure a per-cpu procedure happens on the same CPU from 
> > > > start to end. Using migrate_disable & using per-cpu spinlocks on RT and 
> > > > doing preempt_disable in non_RT.
> > > > 
> > > > Most of the cases happen to have the work done in the local cpu, and just 
> > > > a few procedures happen to be queued remotely, such as remote cache 
> > > > draining. 
> > > > 
> > > > Even with the new 'local_qpw_lock()' which is faster for cases we are sure 
> > > > to have local usages, on qpw=0 we have to make qpw_lock() a local_lock as 
> > > > well, as the cpu receiving the scheduled work needs to make sure to run it 
> > > > all without moving to a different cpu.
> > > 
> > > But queue_work_on() already makes sure the work doesn't move to a different CPU
> > > (provided hotplug is correctly handled for the work).
> > > 
> > > Looks like we are both confused, so let's take a practical example. Suppose
> > > CPU 0 queues a work to CPU 1 which sets a per-cpu variable named A to the value
> > > "1". We want to guarantee that further reads of that per-cpu value by CPU 1
> > > see the new value. With qpw=1, it looks like this:
> > > 
> > > CPU 0                                               CPU 1
> > > -----                                               -----
> > > 
> > > qpw_lock(CPU 1)
> > >    spin_lock(&QPW_CPU1)
> > > qpw_queue_for(write_A, 1)
> > >     write_A()
> > >        A1 = per_cpu_ptr(&A, 1)
> > >        *A1 = 1
> > > qpw_unlock(CPU 1)
> > >     spin_unlock(&QPW_CPU1)
> > >                                                    read_A()
> > >                                                        qpw_lock(CPU 1)
> > >                                                            spin_lock(&QPW_CPU1)
> > >                                                        r0 = __this_cpu_read(&A)
> > >                                                        qpw_unlock(CPU 1)
> > >                                                            spin_unlock(&QPW_CPU1)
> > >                                                    
> > > 
> > > CPU 0 took the spinlock while writing to A, so CPU 1 is guaranteed to further
> > > observe the new value because it takes the same spinlock (r0 == 1)
> > > 
> > 
> > Here, if we are in CPU0 we should never take the qpw_lock(CPU1) unless we 
> > are inside queue_percpu_work_on().
> > 
> > Maybe I am not getting your use case :/
> > 
> > Also, I don't see a case where we would need to call 
> > queue_percpu_work_on() inside a qpw_lock(). This could be dangerous as it 
> > could be the case in another cpu and cause a deadlock:
> > 
> > CPU 0 				CPU 1
> > qpw_lock(0)			qpw_lock(1)
> > ...				...
> > queue_percpu_work_on()		queue_percpu_work_on()
> > 	qpw_lock(1)			qpw_lock(0)
> 
> Ok I just checked the practical usecase in the patchset and it was me not
> getting your usecase. The qpw lock is used inside the work itself. And now
> that makes sense.
> 
> > 
> > 
> > > Now look at the qpw=0 case:
> > >                                   
> > > CPU 0                                               CPU 1
> > > -----                                               -----
> > > 
> > > qpw_lock(CPU 1)
> > >    local_lock(&QPW_CPU0)
> > > qpw_queue_for(write_A, 1)
> > >     queue_work_on(write_A, CPU 1)
> > > qpw_unlock(CPU 1)
> > >     local_unlock(&QPW_CPU0)
> > >                                                    // workqueue
> > >                                                    write_A()
> > >                                                        qpw_lock(CPU 1)
> > >                                                            local_lock(&QPW_CPU1)
> > >                                                        A1 = per_cpu_ptr(&A, 1)
> > >                                                        *A1 = 1
> > >                                                        qpw_unlock(CPU 1)
> > >                                                            local_unlock(&QPW_CPU1)
> > > 
> > >                                                    read_A()
> > >                                                        qpw_lock(CPU 1)
> > >                                                            local_lock(&QPW_CPU1)
> > >                                                        r0 = __this_cpu_read(&A)
> > >                                                        qpw_unlock(CPU 1)
> > >                                                            local_unlock(&QPW_CPU1)
> > > 
> > > Here CPU 0 queues the work on CPU 1 which writes and reads the new value
> > > (r0 == 1). local_lock() / preempt_disable() makes sure the CPU doesn't change.
> > > 
> > > But what is the point in doing local_lock(&QPW_CPU0) on CPU 0 ?
> > 
> > I can't see the case where one would need to hold the qpw_lock while 
> > calling queue_percpu_work_on(). Holding the qpw_lock() (as is the case of
> > local_lock()) should be done only when one is working on data particular to 
> > that cpu structures. Queuing work on other CPU while touching this cpu data 
> > is unexpected to me.
> 
> Yep!
> 
> > > > > Like Vlastimil suggested, it would be better to just have it off by default
> > > > > and turn it on only if nohz_full= is passed. Then we can consider introducing
> > > > > the parameter later if the need arise.
> > > > 
> > > > I agree with having it enabled with isolcpus/nohz_full, but I would 
> > > > recommend having this option anyway, as the user could disable qpw if 
> > > > wanted, or enable outside isolcpu scenarios for any reason.
> > > 
> > > Do you know any such users? Or suspect a potential usecase? If not we can still
> > > add that option later. It's probably better than sticking with a useless
> > > parameter that we'll have to maintain forever.
> > 
> > Out of my head, I can think only on HPC scenario where user wants to make 
> > use of the regular/RT scheduler for many small workloads, but doesn't like 
> > the impact of IPI on those cases.
> 
> There are many more IPIs to care about then. I suspect the issue would be more
> about the workqueue itself.

There are some mechanisms for workqueues to be offloaded to other CPUs if
those are isolated; we could easily mimic that if wanted (or use isolcpus).

It's more about the locking strategies: some code uses local_lock() +
queue_work_on() and that is really effective in a lot of scenarios, but it
relies on IPIs, which can be terrible in other scenarios.

QPW is about letting the user decide which locking strategy to use based on
their workloads :)
 
> > Such systems that explore memory at it's 
> > limit will also benefit from those, for example, if cache gets drained 
> > remotely very often.
> > 
> > None of those necessarily will need to or benefit from isolcpus, and may 
> > want to just use the kernel scheduler policies.
> 
> This sounds like "just in case" usecases that could be dealt with later if
> needed. But like Marcelo said, those who want to rely on cpuset isolated
> partitions would need to enable that on boot.
> 

Agreed, he could exemplify that much better :)

> > > > QPW comes from Queue PerCPU Work
> > > > Having it called qpw_queue_work_{on,for}() would be repetitve
> > > 
> > > Well, qpw_ just becomes the name of the subsystem and its prefix for APIs.
> > > For example qpw_lock() doesn't mean that we queue and lock, it only means we lock.
> > > 
> > 
> > Locks for queue'ing per-cpu work. :D
> 
> Right!
> 
> > 
> > > > But having qpw_on() or qpw_for() would be misleading :) 
> > > > 
> > > > That's why I went with queue_percpu_work_on() based on how we have the 
> > > > original function (queue_work_on) being called.
> > > 
> > > That's much more misleading at it doesn't refer to qpw at all and it only
> > > suggest that it's a queueing a per-cpu workqueue.
> > > 
> > 
> > Humm, maybe qpw_queue_for/on()?
> > 
> > Or maybe change the name of the API for pw:
> > pw_lock()/unlock
> > pw_queue();
> > pw_flush()
> > 
> > and so on?
> > 
> > That way it stays true to what means :)
> 
> Would better to keep the same prefix for all APIs :-)
> 

Naming was always hard with this mechanism :D

Will try to come with something meaningful and consistent across this and 
other APIs.

Thanks!
Leo
Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Vlastimil Babka (SUSE) 4 weeks, 1 day ago
On 3/2/26 16:49, Marcelo Tosatti wrote:
> Index: linux/Documentation/admin-guide/kernel-parameters.txt
> ===================================================================
> --- linux.orig/Documentation/admin-guide/kernel-parameters.txt
> +++ linux/Documentation/admin-guide/kernel-parameters.txt
> @@ -2840,6 +2840,16 @@ Kernel parameters
>  
>  			The format of <cpu-list> is described above.
>  
> +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
> +			and remote interference mechanism on a kernel built with
> +			CONFIG_QPW.
> +			Format: { "0" | "1" }
> +			0 - local_lock() + queue_work_on(remote_cpu)
> +			1 - spin_lock() for both local and remote operations
> +
> +			Selecting 1 may be interesting for systems that want
> +			to avoid interruption & context switches from IPIs.
Requiring a new boot option is always a nuisance. The cpu isolation is
AFAIK difficult enough to set up already. Could the default be that qpw will
auto-enable if there are isolated cpus configured? The option could still be
useful for overriding that automatic decision to both 0 and 1 for testing
etc, but not required for the expected usecase?
Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Leonardo Bras 3 weeks, 3 days ago
On Wed, Mar 11, 2026 at 08:58:05AM +0100, Vlastimil Babka (SUSE) wrote:
> On 3/2/26 16:49, Marcelo Tosatti wrote:
> > Index: linux/Documentation/admin-guide/kernel-parameters.txt
> > ===================================================================
> > --- linux.orig/Documentation/admin-guide/kernel-parameters.txt
> > +++ linux/Documentation/admin-guide/kernel-parameters.txt
> > @@ -2840,6 +2840,16 @@ Kernel parameters
> >  
> >  			The format of <cpu-list> is described above.
> >  
> > +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
> > +			and remote interference mechanism on a kernel built with
> > +			CONFIG_QPW.
> > +			Format: { "0" | "1" }
> > +			0 - local_lock() + queue_work_on(remote_cpu)
> > +			1 - spin_lock() for both local and remote operations
> > +
> > +			Selecting 1 may be interesting for systems that want
> > +			to avoid interruption & context switches from IPIs.
> Requiring a new boot option is always a nuissance. The cpu isolation is
> AFAIK difficult enough to setup already. Could the default be that qpw will
> auto-enable if there are isolated cpus configured? The option could still be
> useful for overriding that automatic decision to both 0 and 1 for testing
> etc, but not requried for the expected usecase?


I think it's okay. Something like this?
(should work for nohz_full and isolcpus)

######
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 81bc8b329ef17..6c9052c28e3e4 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -170,20 +170,23 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
                for_each_set_bit(type, &iter_flags, HK_TYPE_MAX)
                        housekeeping_setup_type(type, housekeeping_staging);
        }
 
        if ((flags & HK_FLAG_KERNEL_NOISE) && !(housekeeping.flags & HK_FLAG_KERNEL_NOISE))
                tick_nohz_full_setup(non_housekeeping_mask);
 
        housekeeping.flags |= flags;
        err = 1;
 
+       if (IS_ENABLED(CONFIG_QPW_DEFAULT))
+               qpw_setup("1");
+
 free_housekeeping_staging:
        free_bootmem_cpumask_var(housekeeping_staging);
 free_non_housekeeping_mask:
        free_bootmem_cpumask_var(non_housekeeping_mask);
 
        return err;
 }
######

We would only have to be sure that this runs before the cmdline parsing of
qpw=, so the user could disable qpw if wanted.

Would that work?

Thanks!
Leo
Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Vlastimil Babka (SUSE) 3 weeks, 3 days ago
On 3/15/26 18:37, Leonardo Bras wrote:
> On Wed, Mar 11, 2026 at 08:58:05AM +0100, Vlastimil Babka (SUSE) wrote:
>> On 3/2/26 16:49, Marcelo Tosatti wrote:
>> > Index: linux/Documentation/admin-guide/kernel-parameters.txt
>> > ===================================================================
>> > --- linux.orig/Documentation/admin-guide/kernel-parameters.txt
>> > +++ linux/Documentation/admin-guide/kernel-parameters.txt
>> > @@ -2840,6 +2840,16 @@ Kernel parameters
>> >  
>> >  			The format of <cpu-list> is described above.
>> >  
>> > +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
>> > +			and remote interference mechanism on a kernel built with
>> > +			CONFIG_QPW.
>> > +			Format: { "0" | "1" }
>> > +			0 - local_lock() + queue_work_on(remote_cpu)
>> > +			1 - spin_lock() for both local and remote operations
>> > +
>> > +			Selecting 1 may be interesting for systems that want
>> > +			to avoid interruption & context switches from IPIs.
>> Requiring a new boot option is always a nuissance. The cpu isolation is
>> AFAIK difficult enough to setup already. Could the default be that qpw will
>> auto-enable if there are isolated cpus configured? The option could still be
>> useful for overriding that automatic decision to both 0 and 1 for testing
>> etc, but not requried for the expected usecase?
> 
> 
> I think it's okay, as something like this?
> (should work for nohz_full and isolcpus)
> 
> ######
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 81bc8b329ef17..6c9052c28e3e4 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -170,20 +170,23 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
>                 for_each_set_bit(type, &iter_flags, HK_TYPE_MAX)
>                         housekeeping_setup_type(type, housekeeping_staging);
>         }
>  
>         if ((flags & HK_FLAG_KERNEL_NOISE) && !(housekeeping.flags & HK_FLAG_KERNEL_NOISE))
>                 tick_nohz_full_setup(non_housekeeping_mask);
>  
>         housekeeping.flags |= flags;
>         err = 1;
>  
> +       if (IS_ENABLED(CONFIG_QPW_DEFAULT))
> +               qpw_setup("1");
> +
>  free_housekeeping_staging:
>         free_bootmem_cpumask_var(housekeeping_staging);
>  free_non_housekeeping_mask:
>         free_bootmem_cpumask_var(non_housekeeping_mask);
>  
>         return err;
>  }
> ######
> 
> We would only have to be sure that this runs before cmdline parses qpw=?, 

I'm not sure it's possible to achieve this ordering with __setup calls,
unless one of them is early, and then it might be too early to do the
necessary action.

> so user could disable qpw if wanted.
> 
> Would that work?

The pattern I'm familiar with is collecting all related params via
early_param() setting some variables, and then an init call (not tied to any
of the params) looks at those variables and does whatever is necessary.

> Thanks!
> Leo
> 
> 
>
Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Leonardo Bras 2 weeks, 3 days ago
On Mon, Mar 16, 2026 at 11:55:46AM +0100, Vlastimil Babka (SUSE) wrote:
> On 3/15/26 18:37, Leonardo Bras wrote:
> > On Wed, Mar 11, 2026 at 08:58:05AM +0100, Vlastimil Babka (SUSE) wrote:
> >> On 3/2/26 16:49, Marcelo Tosatti wrote:
> >> > Index: linux/Documentation/admin-guide/kernel-parameters.txt
> >> > ===================================================================
> >> > --- linux.orig/Documentation/admin-guide/kernel-parameters.txt
> >> > +++ linux/Documentation/admin-guide/kernel-parameters.txt
> >> > @@ -2840,6 +2840,16 @@ Kernel parameters
> >> >  
> >> >  			The format of <cpu-list> is described above.
> >> >  
> >> > +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
> >> > +			and remote interference mechanism on a kernel built with
> >> > +			CONFIG_QPW.
> >> > +			Format: { "0" | "1" }
> >> > +			0 - local_lock() + queue_work_on(remote_cpu)
> >> > +			1 - spin_lock() for both local and remote operations
> >> > +
> >> > +			Selecting 1 may be interesting for systems that want
> >> > +			to avoid interruption & context switches from IPIs.
> >> Requiring a new boot option is always a nuissance. The cpu isolation is
> >> AFAIK difficult enough to setup already. Could the default be that qpw will
> >> auto-enable if there are isolated cpus configured? The option could still be
> >> useful for overriding that automatic decision to both 0 and 1 for testing
> >> etc, but not requried for the expected usecase?
> > 
> > 
> > I think it's okay, as something like this?
> > (should work for nohz_full and isolcpus)
> > 
> > ######
> > diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> > index 81bc8b329ef17..6c9052c28e3e4 100644
> > --- a/kernel/sched/isolation.c
> > +++ b/kernel/sched/isolation.c
> > @@ -170,20 +170,23 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
> >                 for_each_set_bit(type, &iter_flags, HK_TYPE_MAX)
> >                         housekeeping_setup_type(type, housekeeping_staging);
> >         }
> >  
> >         if ((flags & HK_FLAG_KERNEL_NOISE) && !(housekeeping.flags & HK_FLAG_KERNEL_NOISE))
> >                 tick_nohz_full_setup(non_housekeeping_mask);
> >  
> >         housekeeping.flags |= flags;
> >         err = 1;
> >  
> > +       if (IS_ENABLED(CONFIG_QPW_DEFAULT))
> > +               qpw_setup("1");
> > +
> >  free_housekeeping_staging:
> >         free_bootmem_cpumask_var(housekeeping_staging);
> >  free_non_housekeeping_mask:
> >         free_bootmem_cpumask_var(non_housekeeping_mask);
> >  
> >         return err;
> >  }
> > ######
> > 
> > We would only have to be sure that this runs before cmdline parses qpw=?, 
> 
> I'm not sure it's possible to achieve this ordering with __setup calls,
> unless one of them is early, and then it might be too early to do the
> necessary action.
> 
> > so user could disable qpw if wanted.
> > 
> > Would that work?
> 
> The pattern I'm familiar with is collecting all related params via
> early_param() setting some variables, and then an init call (not tied to any
> of the param) looks at those variables and does whatever is necessary.
> 
> > Thanks!
> > Leo
> > 
> > 
> > 
> 

Makes sense, will take a look at that approach.

Thanks!
Leo
Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Vlastimil Babka (SUSE) 1 month, 1 week ago
On 3/2/26 16:49, Marcelo Tosatti wrote:
> +#define local_qpw_lock(lock)								\
> +	do {										\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
> +			migrate_disable();						\

Have you considered using migrate_disable() on PREEMPT_RT and
preempt_disable() on !PREEMPT_RT since it's cheaper? It's what the pcp
locking in mm/page_alloc.c does, for that reason. It should reduce the
overhead with qpw=1 on !PREEMPT_RT.

> +			spin_lock(this_cpu_ptr(lock.sl));				\
> +		} else									\
> +			local_lock(lock.ll);						\
> +	} while (0)
> +
Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Marcelo Tosatti 1 month ago
On Tue, Mar 03, 2026 at 01:03:36PM +0100, Vlastimil Babka (SUSE) wrote:
> On 3/2/26 16:49, Marcelo Tosatti wrote:
> > +#define local_qpw_lock(lock)								\
> > +	do {										\
> > +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
> > +			migrate_disable();						\
> 
> Have you considered using migrate_disable() on PREEMPT_RT and
> preempt_disable() on !PREEMPT_RT since it's cheaper? It's what the pcp
> locking in mm/page_alloc.c does, for that reason. It should reduce the
> overhead with qpw=1 on !PREEMPT_RT.

migrate_disable:
Patched kernel, CONFIG_QPW=y, qpw=1:    192 cycles

preempt_disable:
[   65.497223] kmalloc_bench: Avg cycles per kmalloc: 184 cycles

I tried it before, but it was crashing for some reason which I didn't
look into (perhaps PREEMPT_RT was enabled).

Will change this for the next iteration, thanks.
Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Leonardo Bras 1 month ago
On Tue, Mar 03, 2026 at 01:02:13PM -0300, Marcelo Tosatti wrote:
> On Tue, Mar 03, 2026 at 01:03:36PM +0100, Vlastimil Babka (SUSE) wrote:
> > On 3/2/26 16:49, Marcelo Tosatti wrote:
> > > +#define local_qpw_lock(lock)								\
> > > +	do {										\
> > > +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
> > > +			migrate_disable();						\
> > 
> > Have you considered using migrate_disable() on PREEMPT_RT and
> > preempt_disable() on !PREEMPT_RT since it's cheaper? It's what the pcp
> > locking in mm/page_alloc.c does, for that reason. It should reduce the
> > overhead with qpw=1 on !PREEMPT_RT.
> 
> migrate_disable:
> Patched kernel, CONFIG_QPW=y, qpw=1:    192 cycles
> 
> preempt_disable:
> [   65.497223] kmalloc_bench: Avg cycles per kmalloc: 184 cycles
> 
> I tried it before, but it was crashing for some reason which i didnt
> look into (perhaps PREEMPT_RT was enabled).
> 
> Will change this for the next iteration, thanks.
> 

Hi all,

That made me remember that rt spinlocks already use migrate_disable() and
non-rt spinlocks already have preempt_disable().

Maybe it's actually worth adding a local_spin_lock() in spinlock{,_rt}.c
which would get the per-cpu variable inside the preempt/migrate_disable
area, and making use of it in qpw code. That way we avoid nesting
migrate_disable() or preempt_disable(), further reducing the impact.

The alternative is to not have migrate/preempt disable here and actually
trust the ones inside the locking primitives. There is a chance of
contention, but I don't remember being able to detect it.

Thanks!
Leo
Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Vlastimil Babka (SUSE) 1 month ago
On 3/8/26 19:00, Leonardo Bras wrote:
> On Tue, Mar 03, 2026 at 01:02:13PM -0300, Marcelo Tosatti wrote:
>> On Tue, Mar 03, 2026 at 01:03:36PM +0100, Vlastimil Babka (SUSE) wrote:
>> > On 3/2/26 16:49, Marcelo Tosatti wrote:
>> > > +#define local_qpw_lock(lock)								\
>> > > +	do {										\
>> > > +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
>> > > +			migrate_disable();						\
>> > 
>> > Have you considered using migrate_disable() on PREEMPT_RT and
>> > preempt_disable() on !PREEMPT_RT since it's cheaper? It's what the pcp
>> > locking in mm/page_alloc.c does, for that reason. It should reduce the
>> > overhead with qpw=1 on !PREEMPT_RT.
>> 
>> migrate_disable:
>> Patched kernel, CONFIG_QPW=y, qpw=1:    192 cycles
>> 
>> preempt_disable:
>> [   65.497223] kmalloc_bench: Avg cycles per kmalloc: 184 cycles
>> 
>> I tried it before, but it was crashing for some reason which i didnt
>> look into (perhaps PREEMPT_RT was enabled).
>> 
>> Will change this for the next iteration, thanks.
>> 
> 
> Hi all,
> 
> That made me remember that rt spinlock already uses migrate_disable and 
> non-rt spinlocks already have preempt_disable()
> 
> Maybe it's actually worth adding a local_spin_lock() in spinlock{,_rt}.c 
> whichy would get the per-cpu variable inside the preempt/migrate_disable 
> area, and making use of it in qpw code. That way we avoid nesting 
> migtrate_disable or preempt_disable, and further reducing impact.

That would be nice indeed. But since the nested disable/enable cost should
be low, and the spinlock code rather complicated, it might be tough to sell.
It would be also great to have those trylocks inline on all arches.

> The alternative is to not have migrate/preempt disable here and actually 
> trust the ones inside the locking primitives. Is there a chance of 
> contention, but I don't remember being able to detect it.

So then we could pick the lock on one cpu but then get migrated and actually
lock it on another cpu. Is contention the only possible downside of this, or
could it lead to subtle bugs depending on the particular user? The paths
that don't flush stuff on remote cpus but expect working with the local
cpu's structure in a fastpath might get broken. I'd be wary of this.

> Thanks!
> Leo
Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
Posted by Leonardo Bras 4 weeks, 1 day ago
On Mon, Mar 09, 2026 at 11:14:23AM +0100, Vlastimil Babka (SUSE) wrote:
> On 3/8/26 19:00, Leonardo Bras wrote:
> > On Tue, Mar 03, 2026 at 01:02:13PM -0300, Marcelo Tosatti wrote:
> >> On Tue, Mar 03, 2026 at 01:03:36PM +0100, Vlastimil Babka (SUSE) wrote:
> >> > On 3/2/26 16:49, Marcelo Tosatti wrote:
> >> > > +#define local_qpw_lock(lock)								\
> >> > > +	do {										\
> >> > > +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
> >> > > +			migrate_disable();						\
> >> > 
> >> > Have you considered using migrate_disable() on PREEMPT_RT and
> >> > preempt_disable() on !PREEMPT_RT since it's cheaper? It's what the pcp
> >> > locking in mm/page_alloc.c does, for that reason. It should reduce the
> >> > overhead with qpw=1 on !PREEMPT_RT.
> >> 
> >> migrate_disable:
> >> Patched kernel, CONFIG_QPW=y, qpw=1:    192 cycles
> >> 
> >> preempt_disable:
> >> [   65.497223] kmalloc_bench: Avg cycles per kmalloc: 184 cycles
> >> 
> >> I tried it before, but it was crashing for some reason which i didnt
> >> look into (perhaps PREEMPT_RT was enabled).
> >> 
> >> Will change this for the next iteration, thanks.
> >> 
> > 
> > Hi all,
> > 
> > That made me remember that rt spinlock already uses migrate_disable and 
> > non-rt spinlocks already have preempt_disable()
> > 
> > Maybe it's actually worth adding a local_spin_lock() in spinlock{,_rt}.c 
> > whichy would get the per-cpu variable inside the preempt/migrate_disable 
> > area, and making use of it in qpw code. That way we avoid nesting 
> > migtrate_disable or preempt_disable, and further reducing impact.
> 
> That would be nice indeed. But since the nested disable/enable cost should
> be low, and the spinlock code rather complicated, it might be tough to sell.
> It would be also great to have those trylocks inline on all arches.

Fair enough.
I will take a look in spinlock code later, maybe we can have one in qpw 
code that can be used internally without impacting other users.

> 
> > The alternative is to not have migrate/preempt disable here and actually 
> > trust the ones inside the locking primitives. Is there a chance of 
> > contention, but I don't remember being able to detect it.
> 
> So then we could pick the lock on one cpu but then get migrated and actually
> lock it on another cpu. Is contention the only possible downside of this, or
> could it lead to subtle bugs depending on the particular user? The paths
> that don't flush stuff on remote cpus but expect working with the local
> cpu's structure in a fastpath might get broken. I'd be wary of this.

Yeah, that's right. Contention could be really bad for realtime, as rare as
it may be.

And you are right about potential bugs: for user functions that operate on
local per-cpu data (this_cpu_read/write) it would be expensive to use
per_cpu_read/write(), so IIRC Marcelo did not convert those in functions
that always run on the local cpu. If the cpu migrates before getting the
lock, we will safely operate remotely on that cpu's data, but any
this_cpu_*() in the function will operate on the local cpu instead of the
remote cpu.

So you and Marcelo are correct: we can't have migration/preemption happening
during the routine, which means we need to disable them before we pick the cpu.

Thanks!
Leo