[PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)

Marcelo Tosatti posted 5 patches 1 month, 1 week ago
Posted by Marcelo Tosatti 1 month, 1 week ago
The problem:
Some places in the kernel implement a parallel programming strategy
consisting of local_locks() for most of the work, while the few remote
operations are scheduled on the target CPU. This keeps cache bouncing low,
since the cacheline tends to stay local, and avoids the cost of locks on
non-RT kernels, even though the rare remote operations are expensive due
to scheduling overhead.

On the other hand, for RT workloads this can represent a problem: having
an important workload scheduled out to deal with remote requests is
sure to introduce unexpected deadline misses.

The idea:
Currently, with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
In this case, instead of scheduling work on a remote cpu, it should
be safe to grab that remote cpu's per-cpu spinlock and run the required
work locally. The major cost, un/locking in every local function,
is already paid on PREEMPT_RT.

Also, there is no need to worry about extra cache bouncing:
The cacheline invalidation already happens due to schedule_work_on().

This avoids schedule_work_on(), and thus avoids scheduling out an
RT workload.

Proposed solution:
A new interface called Queue PerCPU Work (QPW), which should replace
workqueue usage in the above-mentioned use case.

If CONFIG_QPW=n, this interface just wraps the current
local_locks + workqueue behavior, so no runtime change is expected.

If CONFIG_QPW=y and the qpw=1 kernel boot option is set,
queue_percpu_work_on(cpu, ...) will lock that cpu's per-cpu structure
and perform the work on it locally. This is possible because, in
functions that may perform work on remote per-cpu structures, the
local_lock (which on PREEMPT_RT is already a per-cpu spinlock) is
replaced by a qpw_spinlock(), which can take the per-cpu spinlock
of the cpu passed as parameter.

v1->v2:
- Introduce local_qpw_lock and unlock functions, and move preempt_disable/
  preempt_enable into them (Leonardo Bras). This reduces the performance
  overhead of the patch.
- Documentation and changelog typo fixes (Leonardo Bras).
- Fix places where preempt_disable/preempt_enable was not being
  correctly performed.
- Add performance measurements.

RFC->v1:

- Introduce CONFIG_QPW and qpw= kernel boot option to enable 
  remote spinlocking and execution even on !CONFIG_PREEMPT_RT
  kernels (Leonardo Bras).
- Move buffer_head draining to separate workqueue (Marcelo Tosatti).
- Convert mlock per-CPU page lists to QPW (Marcelo Tosatti).
- Drop memcontrol conversion (as isolated CPUs are no longer targets
  of queue_work_on).
- Rebase SLUB against Vlastimil's slab/next.
- Add basic document for QPW (Waiman Long).

The performance numbers, as measured by the following test program,
are as follows:

Unpatched kernel:			166 cycles
Patched kernel, CONFIG_QPW=n:		166 cycles
Patched kernel, CONFIG_QPW=y, qpw=0:	168 cycles
Patched kernel, CONFIG_QPW=y, qpw=1:	192 cycles

kmalloc_bench.c:
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/timex.h>
#include <linux/preempt.h>
#include <linux/irqflags.h>
#include <linux/vmalloc.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Gemini AI");
MODULE_DESCRIPTION("A simple kmalloc performance benchmark");

static int size = 64;		/* default allocation size in bytes */
module_param(size, int, 0644);

static int iterations = 9000000;	/* default number of iterations */
module_param(iterations, int, 0644);

static int __init kmalloc_bench_init(void) {
    void **ptrs;
    cycles_t start, end;
    u64 total_cycles;
    int i;
    pr_info("kmalloc_bench: Starting test (size=%d, iterations=%d)\n", size, iterations);

    /* Allocate an array to store pointers, to avoid immediate kfree-reuse optimization */
    ptrs = vmalloc(sizeof(void *) * iterations);
    if (!ptrs) {
        pr_err("kmalloc_bench: Failed to allocate pointer array\n");
        return -ENOMEM;
    }

    preempt_disable();
    start = get_cycles();

    for (i = 0; i < iterations; i++) {
        ptrs[i] = kmalloc(size, GFP_ATOMIC);
    }

    end = get_cycles();

    total_cycles = end - start;
    preempt_enable();

    pr_info("kmalloc_bench: Total cycles for %d allocs: %llu\n", iterations, total_cycles);
    pr_info("kmalloc_bench: Avg cycles per kmalloc: %llu\n", total_cycles / iterations);

    /* Cleanup */
    for (i = 0; i < iterations; i++) {
        kfree(ptrs[i]);
    }
    vfree(ptrs);

    return 0;
}

static void __exit kmalloc_bench_exit(void) {
    pr_info("kmalloc_bench: Module unloaded\n");
}

module_init(kmalloc_bench_init);
module_exit(kmalloc_bench_exit);

The following testcase triggers lru_add_drain_all on an isolated CPU
(the thread does a sys_write to a file before entering its realtime
loop).

/* 
 * Simulates a low latency loop program that is interrupted
 * due to lru_add_drain_all. To trigger lru_add_drain_all, run:
 *
 * blockdev --flushbufs /dev/sdX
 *
 */ 
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdarg.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

int cpu;

static void *run(void *arg)
{
	pthread_t current_thread;
	cpu_set_t cpuset;
	int ret;
	volatile int nrloops = 0;	/* initialized; volatile keeps the busy loop alive */
	struct sched_param sched_p;
	pid_t pid;
	int fd;
	char buf[] = "xxxxxxxxxxx";

	CPU_ZERO(&cpuset);
	CPU_SET(cpu, &cpuset);

	current_thread = pthread_self();    
	ret = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
	if (ret) {
		/* pthread functions return the error number; errno is not set */
		fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(ret));
		exit(1);
	}

	memset(&sched_p, 0, sizeof(struct sched_param));
	sched_p.sched_priority = 1;
	pid = gettid();
	ret = sched_setscheduler(pid, SCHED_FIFO, &sched_p);
	if (ret) {
		perror("sched_setscheduler");
		exit(1);
	}

	fd = open("/tmp/tmpfile", O_RDWR|O_CREAT|O_TRUNC, 0644);
	if (fd == -1) {
		perror("open");
		exit(1);
	}

	ret = write(fd, buf, sizeof(buf));
	if (ret == -1) {
		perror("write");
		exit(1);
	}

	do { 
		nrloops = nrloops+2;
		nrloops--;
	} while (1);
}

int main(int argc, char *argv[])
{
	int ret;
	pthread_t thread;
	char *endptr, *str;
	struct sched_param sched_p;
	pid_t pid;

	if (argc != 2) {
		printf("usage: %s cpu-nr\n", argv[0]);
		printf("where cpu-nr is the CPU to pin the thread to\n");
		exit(1);
	}
	str = argv[1];
	cpu = strtol(str, &endptr, 10);
	if (endptr == str || cpu < 0) {
		printf("invalid cpu number: %s\n", str);
		exit(1);
	}
	printf("cpunr=%d\n", cpu);

	memset(&sched_p, 0, sizeof(struct sched_param));
	sched_p.sched_priority = 1;
	pid = getpid();
	ret = sched_setscheduler(pid, SCHED_FIFO, &sched_p);
	if (ret) {
		perror("sched_setscheduler");
		exit(1);
	}

	ret = pthread_create(&thread, NULL, run, NULL);
	if (ret) {
		fprintf(stderr, "pthread_create: %s\n", strerror(ret));
		exit(1);
	}

	pthread_join(thread, NULL);
	return 0;
}
Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
Posted by Frederic Weisbecker 1 month ago
Le Mon, Mar 02, 2026 at 12:49:45PM -0300, Marcelo Tosatti a écrit :
> The problem:
> Some places in the kernel implement a parallel programming strategy
> consisting on local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.
> 
> On the other hand, for RT workloads this can represent a problem: getting
> an important workload scheduled out to deal with remote requests is
> sure to introduce unexpected deadline misses.
> 
> The idea:
> Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> In this case, instead of scheduling work on a remote cpu, it should
> be safe to grab that remote cpu's per-cpu spinlock and run the required
> work locally. That major cost, which is un/locking in every local function,
> already happens in PREEMPT_RT.
> 
> Also, there is no need to worry about extra cache bouncing:
> The cacheline invalidation already happens due to schedule_work_on().
> 
> This will avoid schedule_work_on(), and thus avoid scheduling-out an
> RT workload.
> 
> Proposed solution:
> A new interface called Queue PerCPU Work (QPW), which should replace
> Work Queue in the above mentioned use case.
> 
> If CONFIG_QPW=n this interfaces just wraps the current
> local_locks + WorkQueue behavior, so no expected change in runtime.
> 
> If CONFIG_QPW=y, and qpw kernel boot option =1, 
> queue_percpu_work_on(cpu,...) will lock that cpu's per-cpu structure
> and perform work on it locally. This is possible because on 
> functions that can be used for performing remote work on remote 
> per-cpu structures, the local_lock (which is already
> a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> is able to get the per_cpu spinlock() for the cpu passed as parameter.

So let me summarize the possible design solutions, on top of our discussions,
so we can compare:

1) Never queue remotely but always queue locally and execute on userspace
   return via task work.

   Pros:
         - Simple and easy to maintain.

   Cons:
         - Needs case-by-case handling.

	 - Might be suitable for pure userspace applications but not for
           some HPC use cases. In an ideal world MPI would be fully implemented
           in userspace, but that doesn't appear to be the case.

2) Queue the workqueue locally right away, or do it remotely (if
   really necessary) when the isolated CPU is in userspace; otherwise queue
   it for execution on return to kernel. The work will be handled by preemption
   to a worker or by a workqueue flush on return to userspace.

   Pros:
        - The local queue handling is simple.

   Cons:
        - The remote queue must synchronize with return to userspace, and
          eventually postpone to return to kernel if the target is in userspace.
          It may also need to differentiate IRQs from syscalls.

        - Therefore it still involves some case-by-case handling eventually.
   
        - Flushing the global workqueues to avoid deadlocks is inadvisable, as
          shown in the comment above flush_scheduled_work(). It even triggers a
          warning, and significant effort has been put into converting all the
          existing users. It's not impossible to sell in our case, because we
          shouldn't hold a lock upon return to userspace, but that would
          reintroduce a dangerous API.

        - Queueing / flushing the workqueue involves a context switch, which
          induces more noise (e.g. tick restart).
	  
        - As above, probably not suitable for HPC.

3) QPW: Handle the work remotely

   Pros:
        - Works in all cases, without any surprise.

   Cons:
        - Introduces a new locking scheme to maintain and debug.

        - Needs case by case handling.

Thoughts?

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
Posted by Marcelo Tosatti 4 weeks, 1 day ago
Hi Frederic,

On Thu, Mar 05, 2026 at 05:55:12PM +0100, Frederic Weisbecker wrote:
> Le Mon, Mar 02, 2026 at 12:49:45PM -0300, Marcelo Tosatti a écrit :
> > The problem:
> > Some places in the kernel implement a parallel programming strategy
> > consisting on local_locks() for most of the work, and some rare remote
> > operations are scheduled on target cpu. This keeps cache bouncing low since
> > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > kernels, even though the very few remote operations will be expensive due
> > to scheduling overhead.
> > 
> > On the other hand, for RT workloads this can represent a problem: getting
> > an important workload scheduled out to deal with remote requests is
> > sure to introduce unexpected deadline misses.
> > 
> > The idea:
> > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > In this case, instead of scheduling work on a remote cpu, it should
> > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > work locally. That major cost, which is un/locking in every local function,
> > already happens in PREEMPT_RT.
> > 
> > Also, there is no need to worry about extra cache bouncing:
> > The cacheline invalidation already happens due to schedule_work_on().
> > 
> > This will avoid schedule_work_on(), and thus avoid scheduling-out an
> > RT workload.
> > 
> > Proposed solution:
> > A new interface called Queue PerCPU Work (QPW), which should replace
> > Work Queue in the above mentioned use case.
> > 
> > If CONFIG_QPW=n this interfaces just wraps the current
> > local_locks + WorkQueue behavior, so no expected change in runtime.
> > 
> > If CONFIG_QPW=y, and qpw kernel boot option =1, 
> > queue_percpu_work_on(cpu,...) will lock that cpu's per-cpu structure
> > and perform work on it locally. This is possible because on 
> > functions that can be used for performing remote work on remote 
> > per-cpu structures, the local_lock (which is already
> > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> > is able to get the per_cpu spinlock() for the cpu passed as parameter.
> 
> So let me summarize what are the possible design solutions, on top of our discussions,
> so we can compare:
> 
> 1) Never queue remotely but always queue locally and execute on userspace
>    return via task work.

How can you "queue locally" if the request is visible on a remote CPU?

That is, the event which triggers the manipulation of the data
structures (which must be performed by the CPU that owns those
structures) occurs on a remote CPU.

This is confusing.

Can you also please give a practical example of such a case?

>    Pros:
>          - Simple and easy to maintain.
> 
>    Cons:
>          - Need a case by case handling.
> 
> 	 - Might be suitable for full userspace applications but not for
>            some HPC usecases. In the best world MPI is fully implemented in
>            userspace but that doesn't appear to be the case.
> 
> 2) Queue locally the workqueue right away

Again, the event which triggers the manipulation of the data structures
by the owner CPU happens on a remote CPU.
So how can you queue it locally?

>    or do it remotely (if it's
>    really necessary) if the isolated CPU is in userspace, otherwise queue
>    it for execution on return to kernel. The work will be handled by preemption
>    to a worker or by a workqueue flush on return to userspace.
> 
>    Pros:
>         - The local queue handling is simple.
> 
>    Cons:
>         - The remote queue must synchronize with return to userspace and
> 	  eventually postpone to return to kernel if the target is in userspace.
> 	  Also it may need to differentiate IRQs and syscalls.
> 
>         - Therefore still involve some case by case handling eventually.
>    
>         - Flushing the global workqueues to avoid deadlocks is unadvised as shown
>           in the comment above flush_scheduled_work(). It even triggers a
>           warning. Significant efforts have been put to convert all the existing
> 	  users. It's not impossible to sell in our case because we shouldn't
> 	  hold a lock upon return to userspace. But that will restore a new
> 	  dangerous API.
> 
>         - Queueing the workqueue / flushing involves a context switch which
>           induce more noise (eg: tick restart)
> 	  
>         - As above, probably not suitable for HPC.
> 
> 3) QPW: Handle the work remotely
> 
>    Pros:
>         - Works on all cases, without any surprise.
> 
>    Cons:
>         - Introduce new locking scheme to maintain and debug.
> 
>         - Needs case by case handling.
> 
> Thoughts?

Can you please be more verbose, mindful of lesser cognitive powers? :-)

Note: I also dislike the added layers (and multiple cases) QPW adds.

But there is precedent with local locks...

The code would be less complex if plain spinlocks were used:

01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock

But people seem to reject that on the basis of performance
degradation.
Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
Posted by Vlastimil Babka 4 weeks, 1 day ago
On 3/10/26 18:12, Marcelo Tosatti wrote:
> Hi Frederic,
> 
> On Thu, Mar 05, 2026 at 05:55:12PM +0100, Frederic Weisbecker wrote:
> 
> Can you please be more verbose, mindful of lesser cognitive powers ? :-) 
> 
> Note: i also dislike the added layers (and multiple cases) QPW adds.
> 
> But there is precedence with local locks...
> 
> Code would be less complex in case spinlocks were added:
> 
> 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
> 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock

Note that per the bf75f200569dd05ac2112797f44548beb6b4be26 changelog, it seems
this was all done for the same reasons as QPW. It's nice we got the
trylock-without-irqsave approach as a followup, but the cost of (especially
non-inlined) spin_trylock is not great, given that we could now do the
trylock-without-irqsave cheaply with local_trylock.

So that suggests to me it could be worth trying to convert pcplists to QPW,
if it's agreed upon as the best way forward and is merged.

> But people seem to reject that in the basis of performance
> degradation.
>
Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
Posted by Hillf Danton 4 weeks, 1 day ago
On Tue, 10 Mar 2026 14:12:03 -0300 Marcelo Tosatti wrote:
> Can you please be more verbose, mindful of lesser cognitive powers ? :-) 
> 
> Note: i also dislike the added layers (and multiple cases) QPW adds.
> 
> But there is precedence with local locks...
> 
> Code would be less complex in case spinlocks were added:
> 
> 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
> 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock
> 
> But people seem to reject that in the basis of performance degradation.
>
Given the pcp_spin_lock() cut in 0f21b911011f ("mm/page_alloc: simplify and cleanup
pcp locking"), the spin lock works because of trylock and fallback, so it is a special
case rather than generic boilerplate to follow.
Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
Posted by Frederic Weisbecker 4 weeks, 1 day ago
Le Tue, Mar 10, 2026 at 02:12:03PM -0300, Marcelo Tosatti a écrit :
> Hi Frederic,
> 
> On Thu, Mar 05, 2026 at 05:55:12PM +0100, Frederic Weisbecker wrote:
> > Le Mon, Mar 02, 2026 at 12:49:45PM -0300, Marcelo Tosatti a écrit :
> > > The problem:
> > > Some places in the kernel implement a parallel programming strategy
> > > consisting on local_locks() for most of the work, and some rare remote
> > > operations are scheduled on target cpu. This keeps cache bouncing low since
> > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > > kernels, even though the very few remote operations will be expensive due
> > > to scheduling overhead.
> > > 
> > > On the other hand, for RT workloads this can represent a problem: getting
> > > an important workload scheduled out to deal with remote requests is
> > > sure to introduce unexpected deadline misses.
> > > 
> > > The idea:
> > > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > > In this case, instead of scheduling work on a remote cpu, it should
> > > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > > work locally. That major cost, which is un/locking in every local function,
> > > already happens in PREEMPT_RT.
> > > 
> > > Also, there is no need to worry about extra cache bouncing:
> > > The cacheline invalidation already happens due to schedule_work_on().
> > > 
> > > This will avoid schedule_work_on(), and thus avoid scheduling-out an
> > > RT workload.
> > > 
> > > Proposed solution:
> > > A new interface called Queue PerCPU Work (QPW), which should replace
> > > Work Queue in the above mentioned use case.
> > > 
> > > If CONFIG_QPW=n this interfaces just wraps the current
> > > local_locks + WorkQueue behavior, so no expected change in runtime.
> > > 
> > > If CONFIG_QPW=y, and qpw kernel boot option =1, 
> > > queue_percpu_work_on(cpu,...) will lock that cpu's per-cpu structure
> > > and perform work on it locally. This is possible because on 
> > > functions that can be used for performing remote work on remote 
> > > per-cpu structures, the local_lock (which is already
> > > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> > > is able to get the per_cpu spinlock() for the cpu passed as parameter.
> > 
> > So let me summarize what are the possible design solutions, on top of our discussions,
> > so we can compare:
> > 
> > 1) Never queue remotely but always queue locally and execute on userspace
> >    return via task work.
> 
> How can you "queue locally" if the request is visible on a remote CPU?
> 
> That is, the event which triggers the manipulation of data structures 
> which need to be performed by the owner CPU (owner of the data
> structures) is triggered on a remote CPU.
> 
> This is confusing.
> 
> Can you also please give a practical example of such case ?

Right, so in the case of LRU batching, it consists of always queueing
locally as soon as there is something to do. Then no remote queueing
is necessary. Like here:

https://lwn.net/ml/all/20250703140717.25703-7-frederic@kernel.org/

> 
> >    Pros:
> >          - Simple and easy to maintain.
> > 
> >    Cons:
> >          - Need a case by case handling.
> > 
> > 	 - Might be suitable for full userspace applications but not for
> >            some HPC usecases. In the best world MPI is fully implemented in
> >            userspace but that doesn't appear to be the case.
> > 
> > 2) Queue locally the workqueue right away
> 
> Again, the event which triggers the manipulation of data structures
> by the owner CPU happens on a remote CPU. 
> So how can you queue it locally ?

So that would be the same as above, but instead of using task_work(), we
would force-queue a workqueue locally. It's more aggressive.

> 
> >    or do it remotely (if it's
> >    really necessary) if the isolated CPU is in userspace, otherwise queue
> >    it for execution on return to kernel. The work will be handled by preemption
> >    to a worker or by a workqueue flush on return to userspace.
> > 
> >    Pros:
> >         - The local queue handling is simple.
> > 
> >    Cons:
> >         - The remote queue must synchronize with return to userspace and
> > 	  eventually postpone to return to kernel if the target is in userspace.
> > 	  Also it may need to differentiate IRQs and syscalls.
> > 
> >         - Therefore still involve some case by case handling eventually.
> >    
> >         - Flushing the global workqueues to avoid deadlocks is unadvised as shown
> >           in the comment above flush_scheduled_work(). It even triggers a
> >           warning. Significant efforts have been put to convert all the existing
> > 	  users. It's not impossible to sell in our case because we shouldn't
> > 	  hold a lock upon return to userspace. But that will restore a new
> > 	  dangerous API.
> > 
> >         - Queueing the workqueue / flushing involves a context switch which
> >           induce more noise (eg: tick restart)
> > 	  
> >         - As above, probably not suitable for HPC.
> > 
> > 3) QPW: Handle the work remotely
> > 
> >    Pros:
> >         - Works on all cases, without any surprise.
> > 
> >    Cons:
> >         - Introduce new locking scheme to maintain and debug.
> > 
> >         - Needs case by case handling.
> > 
> > Thoughts?
> 
> Can you please be more verbose, mindful of lesser cognitive powers ? :-)

Arguably verbosity is not my most developed skill :o)

> 
> Note: i also dislike the added layers (and multiple cases) QPW adds.
> 
> But there is precedence with local locks...
> 
> Code would be less complex in case spinlocks were added:
> 
> 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
> 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock
> 
> But people seem to reject that in the basis of performance
> degradation.

And that makes sense. Anyway, we have lockdep to help.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
Posted by Marcelo Tosatti 1 month ago
On Thu, Mar 05, 2026 at 05:55:12PM +0100, Frederic Weisbecker wrote:
> Le Mon, Mar 02, 2026 at 12:49:45PM -0300, Marcelo Tosatti a écrit :
> > The problem:
> > Some places in the kernel implement a parallel programming strategy
> > consisting on local_locks() for most of the work, and some rare remote
> > operations are scheduled on target cpu. This keeps cache bouncing low since
> > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > kernels, even though the very few remote operations will be expensive due
> > to scheduling overhead.
> > 
> > On the other hand, for RT workloads this can represent a problem: getting
> > an important workload scheduled out to deal with remote requests is
> > sure to introduce unexpected deadline misses.
> > 
> > The idea:
> > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > In this case, instead of scheduling work on a remote cpu, it should
> > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > work locally. That major cost, which is un/locking in every local function,
> > already happens in PREEMPT_RT.
> > 
> > Also, there is no need to worry about extra cache bouncing:
> > The cacheline invalidation already happens due to schedule_work_on().
> > 
> > This will avoid schedule_work_on(), and thus avoid scheduling-out an
> > RT workload.
> > 
> > Proposed solution:
> > A new interface called Queue PerCPU Work (QPW), which should replace
> > Work Queue in the above mentioned use case.
> > 
> > If CONFIG_QPW=n this interfaces just wraps the current
> > local_locks + WorkQueue behavior, so no expected change in runtime.
> > 
> > If CONFIG_QPW=y, and qpw kernel boot option =1, 
> > queue_percpu_work_on(cpu,...) will lock that cpu's per-cpu structure
> > and perform work on it locally. This is possible because on 
> > functions that can be used for performing remote work on remote 
> > per-cpu structures, the local_lock (which is already
> > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> > is able to get the per_cpu spinlock() for the cpu passed as parameter.
> 
> So let me summarize what are the possible design solutions, on top of our discussions,
> so we can compare:

I find this summary difficult to comprehend. The way I see it is:

A certain class of data structures (the per-CPU caches) can be manipulated
only by each individual CPU, since they lack proper locks that would allow
the data to be manipulated by remote CPUs.

There are certain operations which require such data to be manipulated,
so work is queued to execute on the owner CPUs.

> 
> 1) Never queue remotely but always queue locally and execute on userspace

When you say "queue locally", do you mean to queue the data structure
manipulation to happen on the owner CPU's return to userspace?

What if it does not return to userspace? (Or takes a long time to
return to userspace?)

>    return via task work.
> 
>    Pros:
>          - Simple and easy to maintain.
> 
>    Cons:
>          - Need a case by case handling.
> 
> 	 - Might be suitable for full userspace applications but not for
>            some HPC usecases. In the best world MPI is fully implemented in
>            userspace but that doesn't appear to be the case.
> 
> 2) Queue locally the workqueue right away or do it remotely (if it's
>    really necessary) if the isolated CPU is in userspace, otherwise queue
>    it for execution on return to kernel. The work will be handled by preemption
>    to a worker or by a workqueue flush on return to userspace.
> 
>    Pros:
>         - The local queue handling is simple.
> 
>    Cons:
>         - The remote queue must synchronize with return to userspace and
> 	  eventually postpone to return to kernel if the target is in userspace.
> 	  Also it may need to differentiate IRQs and syscalls.
> 
>         - Therefore still involve some case by case handling eventually.
>    
>         - Flushing the global workqueues to avoid deadlocks is unadvised as shown
>           in the comment above flush_scheduled_work(). It even triggers a
>           warning. Significant efforts have been put to convert all the existing
> 	  users. It's not impossible to sell in our case because we shouldn't
> 	  hold a lock upon return to userspace. But that will restore a new
> 	  dangerous API.
> 
>         - Queueing the workqueue / flushing involves a context switch which
>           induce more noise (eg: tick restart)
> 	  
>         - As above, probably not suitable for HPC.
> 
> 3) QPW: Handle the work remotely
> 
>    Pros:
>         - Works on all cases, without any surprise.
> 
>    Cons:
>         - Introduce new locking scheme to maintain and debug.
> 
>         - Needs case by case handling.
> 
> Thoughts?
> 
> -- 
> Frederic Weisbecker
> SUSE Labs

It's hard for me to parse your concise summary (perhaps it could be more
verbose).

Anyway, one thought is to use some sort of SRCU-type protection on the
per-CPU caches.
But that adds cost as well (compared to non-SRCU), which then seems
similar to the cost of adding per-CPU spinlocks.
Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
Posted by Frederic Weisbecker 4 weeks, 1 day ago
Le Thu, Mar 05, 2026 at 10:47:00PM -0300, Marcelo Tosatti a écrit :
> On Thu, Mar 05, 2026 at 05:55:12PM +0100, Frederic Weisbecker wrote:
> > So let me summarize what are the possible design solutions, on top of our discussions,
> > so we can compare:
> 
> I find this summary difficult to comprehend. The way i see it is:
> 
> A certain class of data structures can be manipulated only by each individual CPU (the
> per-CPU caches), since they lack proper locks for such data to be
> manipulated by remote CPUs.
> 
> There are certain operations which require such data to be manipulated,
> therefore work is queued to execute on the owner CPUs.

Right.

 
> > 
> > 1) Never queue remotely but always queue locally and execute on userspace
> 
> When you say "queue locally", do you mean to queue the data structure 
> manipulation to happen on return to userspace of the owner CPU ?

Yes.

> 
> What if it does not return to userspace ? (or takes a long time to return 
> to userspace?).

Indeed, it's a bet that syscalls eventually return "soon enough" for
correctness to be maintained and that the CPU is not stuck in some kthread.
But on isolation workloads, those assumptions are usually true.

> 
> >    return via task work.
> > 
> >    Pros:
> >          - Simple and easy to maintain.
> > 
> >    Cons:
> >          - Need a case by case handling.
> > 
> > 	 - Might be suitable for full userspace applications but not for
> >            some HPC usecases. In the best world MPI is fully implemented in
> >            userspace but that doesn't appear to be the case.
> > 
> > 2) Queue locally the workqueue right away or do it remotely (if it's
> >    really necessary) if the isolated CPU is in userspace, otherwise queue
> >    it for execution on return to kernel. The work will be handled by preemption
> >    to a worker or by a workqueue flush on return to userspace.
> > 
> >    Pros:
> >         - The local queue handling is simple.
> > 
> >    Cons:
> >         - The remote queue must synchronize with return to userspace and
> > 	  eventually postpone to return to kernel if the target is in userspace.
> > 	  Also it may need to differentiate IRQs and syscalls.
> > 
> >         - Therefore it still involves some case-by-case handling eventually.
> >    
> >         - Flushing the global workqueues to avoid deadlocks is unadvised as shown
> >           in the comment above flush_scheduled_work(). It even triggers a
> >           warning. Significant efforts have been put to convert all the existing
> > 	  users. It's not impossible to sell in our case because we shouldn't
> > 	  hold a lock upon return to userspace. But that will restore a new
> > 	  dangerous API.
> > 
> >         - Queueing the workqueue / flushing involves a context switch which
> >           induces more noise (eg: tick restart)
> > 	  
> >         - As above, probably not suitable for HPC.
> > 
> > 3) QPW: Handle the work remotely
> > 
> >    Pros:
> >         - Works on all cases, without any surprise.
> > 
> >    Cons:
> >         - Introduce new locking scheme to maintain and debug.
> > 
> >         - Needs case by case handling.
> > 
> > Thoughts?
> > 
> > -- 
> > Frederic Weisbecker
> > SUSE Labs
> 
> It's hard for me to parse your concise summary (perhaps it could be
> more verbose).
> 
> Anyway, one thought is to use some sort of SRCU type protection on the 
> per-CPU caches.
> But that adds cost as well (compared to non-SRCU), which then seems to
> have cost similar to adding per-CPU spinlocks.

Well, there is SRCU-fast now. Though do we care enough about optimizing
housekeeping performance on isolated workloads to complicate things with
a weaker and trickier synchronization mechanism? Probably not. If we
choose to pick up your solution, I'm fine with spinlocks.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
Posted by Vlastimil Babka (SUSE) 1 month ago
On 3/2/26 16:49, Marcelo Tosatti wrote:
> Proposed solution:
> A new interface called Queue PerCPU Work (QPW), which should replace
> Work Queue in the above mentioned use case.
> 
> If CONFIG_QPW=n this interfaces just wraps the current
> local_locks + WorkQueue behavior, so no expected change in runtime.
> 
> If CONFIG_QPW=y, and qpw kernel boot option =1, 
> queue_percpu_work_on(cpu,...) will lock that cpu's per-cpu structure
> and perform work on it locally. This is possible because on 
> functions that can be used for performing remote work on remote 
> per-cpu structures, the local_lock (which is already
> a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> is able to get the per_cpu spinlock() for the cpu passed as parameter.

A process thing: several patches have Leo's S-o-b: but not a From: line.
You probably need his From: and your Co-developed-by:, or some other
variant; see Documentation/process/submitting-patches.rst, section "When
to use Acked-by:, Cc:, and Co-developed-by:".
Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
Posted by Frederic Weisbecker 1 month, 1 week ago
On Mon, Mar 02, 2026 at 12:49:45PM -0300, Marcelo Tosatti wrote:
> The problem:
> Some places in the kernel implement a parallel programming strategy
> consisting on local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.
> 
> On the other hand, for RT workloads this can represent a problem: getting
> an important workload scheduled out to deal with remote requests is
> sure to introduce unexpected deadline misses.
> 
> The idea:
> Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> In this case, instead of scheduling work on a remote cpu, it should
> be safe to grab that remote cpu's per-cpu spinlock and run the required
> work locally. That major cost, which is un/locking in every local function,
> already happens in PREEMPT_RT.
> 
> Also, there is no need to worry about extra cache bouncing:
> The cacheline invalidation already happens due to schedule_work_on().
> 
> This will avoid schedule_work_on(), and thus avoid scheduling-out an
> RT workload.
> 
> Proposed solution:
> A new interface called Queue PerCPU Work (QPW), which should replace
> Work Queue in the above mentioned use case.
> 
> If CONFIG_QPW=n this interfaces just wraps the current
> local_locks + WorkQueue behavior, so no expected change in runtime.
> 
> If CONFIG_QPW=y, and qpw kernel boot option =1, 
> queue_percpu_work_on(cpu,...) will lock that cpu's per-cpu structure
> and perform work on it locally. This is possible because on 
> functions that can be used for performing remote work on remote 
> per-cpu structures, the local_lock (which is already
> a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> is able to get the per_cpu spinlock() for the cpu passed as parameter.

OK, I'm slowly coming to consider this a more comfortable solution than
the flush before userspace. Even though it is perhaps a bit more
complicated, remote handling of housekeeping work is more surprise-free
across all the possible nohz_full use cases that we are having a hard
time envisioning.

Reviewing this in more detail now.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
Posted by Leonardo Bras 1 month ago
On Tue, Mar 03, 2026 at 12:15:53PM +0100, Frederic Weisbecker wrote:
> On Mon, Mar 02, 2026 at 12:49:45PM -0300, Marcelo Tosatti wrote:
> > [cover letter snipped; quoted in full upthread]
> 
> Ok I'm slowly considering this as a more comfortable solution than the
> flush before userspace. Despite it being perhaps a bit more complicated,
> remote handling of housekeeping work is more surprise-free against all
> the possible nohz_full usecases that we are having a hard time to envision.
> 
> Reviewing this in more detail now.

Awesome! Thanks!
Leo

> 
> Thanks.
> 
> -- 
> Frederic Weisbecker
> SUSE Labs