Introduce Per-CPU Work helpers (was QPW)

[PATCH v4 0/4] Introduce Per-CPU Work helpers (was QPW)

Posted by Leonardo Bras 6 days, 2 hours ago

The problem:
Some places in the kernel implement a parallel programming strategy
consisting on local_locks() for most of the work, and some rare remote
operations are scheduled on target cpu. This keeps cache bouncing low since
cacheline tends to be mostly local, and avoids the cost of locks in non-RT
kernels, even though the very few remote operations will be expensive due
to scheduling overhead.

On the other hand, for RT workloads this can represent a problem: getting
an important workload scheduled out to deal with remote requests is
sure to introduce unexpected deadline misses.

The idea:
Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
In this case, instead of scheduling work on a remote cpu, it should
be safe to grab that remote cpu's per-cpu spinlock and run the required
work locally. That major cost, which is un/locking in every local function,
already happens in PREEMPT_RT.

Also, there is no need to worry about extra cache bouncing:
The cacheline invalidation already happens due to schedule_work_on().

This will avoid schedule_work_on(), and thus avoid scheduling-out an
RT workload.

Proposed solution:
A new interface called PerCPU Work (PW), which should replace
Work Queue in the above mentioned use case.

If CONFIG_PWLOCKS=n this interfaces just wraps the current
local_locks + WorkQueue behavior, so no expected change in runtime.

If CONFIG_PWLOCKS=y, and kernel boot option pwlocks=1,
pw_queue_on(cpu,...) will lock that cpu's per-cpu structure
and perform work on it locally. 

v3->v4:
- Mechanism name changed from QPW to PW/PWLOCKS. Helper funcions / API,
  file names and config options renamed accordingly.
- All members of the Per-CPU Work API now start with the same prefix 
  (Frederic Weisbecker)
- Improved style a bit, reviewed documentation

v2->v3:
- Use preempt_disable/preempt_enable on !CONFIG_PREEMPT_RT (Vlastimil Babka).
- Improve documentation to include local_qpw_lock on operations table
  (Leonardo Bras).
- Enable qpw=1 automatically if CPU isolation is enabled (Vlastimil Babka).

v1->v2:
- Introduce local_qpw_lock and unlock functions, move preempt_disable/
  preempt_enable to it (Leonardo Bras). This reduces performance
  overhead of the patch.
- Documentation and changelog typo fixes (Leonardo Bras).
- Fix places where preempt_disable/preempt_enable was not being
  correctly performed.
- Add performance measurements.

RFC->v1:

- Introduce CONFIG_QPW and qpw= kernel boot option to enable
  remote spinlocking and execution even on !CONFIG_PREEMPT_RT
  kernels (Leonardo Bras).
- Move buffer_head draining to separate workqueue (Marcelo Tosatti).
- Convert mlock per-CPU page lists to QPW (Marcelo Tosatti).
- Drop memcontrol convertion (as isolated CPUs are not targets
  of queue_work_on anymore).
- Rebase SLUB against Vlastimil's slab/next.
- Add basic document for QPW (Waiman Long).

The performance numbers, as measured by the following test program,
are as follows (v3, mechanics not changed since then):

CONFIG_PREEMPT_DYNAMIC=y
Unpatched kernel:                       60 cycles
Patched kernel, CONFIG_QPW=n:           62 cycles
Patched kernel, CONFIG_QPW=y, qpw=0:    62 cycles
Patched kernel, CONFIG_QPW=y, qpw=1:    75 cycles

CONFIG_PREEMPT_RT:
Unpatched kernel:                       95 cycles
Patched kernel, CONFIG_QPW=y, qpw=0:    99 cycles
Patched kernel, CONFIG_QPW=y, qpw=1:    97 cycles

kmalloc_bench.c:
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/timex.h>
#include <linux/preempt.h>
#include <linux/irqflags.h>
#include <linux/vmalloc.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Gemini AI");
MODULE_DESCRIPTION("A simple kmalloc performance benchmark");

static int size = 64; // Default allocation size in bytes
module_param(size, int, 0644);

static int iterations = 9000000; // Default number of iterations
module_param(iterations, int, 0644);

static int __init kmalloc_bench_init(void) {
    void **ptrs;
    cycles_t start, end;
    uint64_t total_cycles;
    int i;
    pr_info("kmalloc_bench: Starting test (size=%d, iterations=%d)\n", size, iterations);

    // Allocate an array to store pointers to avoid immediate kfree-reuse optimization
    ptrs = vmalloc(sizeof(void *) * iterations);
    if (!ptrs) {
        pr_err("kmalloc_bench: Failed to allocate pointer array\n");
        return -ENOMEM;
    }

    preempt_disable();
    start = get_cycles();

    for (i = 0; i < iterations; i++) {
        ptrs[i] = kmalloc(size, GFP_ATOMIC);
    }

    end = get_cycles();

    total_cycles = end - start;
    preempt_enable();

    pr_info("kmalloc_bench: Total cycles for %d allocs: %llu\n", iterations, total_cycles);
    pr_info("kmalloc_bench: Avg cycles per kmalloc: %llu\n", total_cycles / iterations);

    // Cleanup
    for (i = 0; i < iterations; i++) {
        kfree(ptrs[i]);
    }
    vfree(ptrs);

    return 0;
}

static void __exit kmalloc_bench_exit(void) {
    pr_info("kmalloc_bench: Module unloaded\n");
}

module_init(kmalloc_bench_init);
module_exit(kmalloc_bench_exit);

The following testcase triggers lru_add_drain_all on an isolated CPU
(that does sys_write to a file before entering its realtime
loop).

/*
 * Simulates a low latency loop program that is interrupted
 * due to lru_add_drain_all. To trigger lru_add_drain_all, run:
 *
 * blockdev --flushbufs /dev/sdX
 *
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <stdarg.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

int cpu;

static void *run(void *arg)
{
        pthread_t current_thread;
        cpu_set_t cpuset;
        int ret, nrloops;
        struct sched_param sched_p;
        pid_t pid;
        int fd;
        char buf[] = "xxxxxxxxxxx";

        CPU_ZERO(&cpuset);
        CPU_SET(cpu, &cpuset);

        current_thread = pthread_self();   
        ret = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
        if (ret) {
                perror("pthread_setaffinity_np failed\n");
                exit(0);
        }

        memset(&sched_p, 0, sizeof(struct sched_param));
        sched_p.sched_priority = 1;
        pid = gettid();
        ret = sched_setscheduler(pid, SCHED_FIFO, &sched_p);
        if (ret) {
                perror("sched_setscheduler");
                exit(0);
        }

        fd = open("/tmp/tmpfile", O_RDWR|O_CREAT|O_TRUNC);
        if (fd == -1) {
                perror("open");
                exit(0);
        }

        ret = write(fd, buf, sizeof(buf));
        if (ret == -1) {
                perror("write");
                exit(0);
        }

        do {
                nrloops = nrloops+2;
                nrloops--;
        } while (1);
}

int main(int argc, char *argv[])
{
        int fd, ret;
        pthread_t thread;
        long val;
        char *endptr, *str;
        struct sched_param sched_p;
        pid_t pid;

        if (argc != 2) {
                printf("usage: %s cpu-nr\n", argv[0]);
                printf("where CPU number is the CPU to pin thread to\n");
                exit(0);
        }
        str = argv[1];
        cpu = strtol(str, &endptr, 10);
        if (cpu < 0) {
                printf("strtol returns %d\n", cpu);
                exit(0);
        }
        printf("cpunr=%d\n", cpu);

        memset(&sched_p, 0, sizeof(struct sched_param));
        sched_p.sched_priority = 1;
        pid = getpid();
        ret = sched_setscheduler(pid, SCHED_FIFO, &sched_p);
        if (ret) {
                perror("sched_setscheduler");
                exit(0);
        }

        pthread_create(&thread, NULL, run, NULL);

        sleep(5000);

        pthread_join(thread, NULL);
}

Leonardo Bras (3):
  Introducing pw_lock() and per-cpu queue & flush work
  swap: apply new pw_queue_on() interface
  slub: apply new pw_queue_on() interface

Marcelo Tosatti (1):
  mm/swap: move bh draining into a separate workqueue

 MAINTAINERS                                   |   7 +
 .../admin-guide/kernel-parameters.txt         |  10 +
 Documentation/locking/pwlocks.rst             |  76 +++++
 init/Kconfig                                  |  35 +++
 kernel/Makefile                               |   2 +
 include/linux/pwlocks.h                       | 265 ++++++++++++++++++
 mm/internal.h                                 |   4 +-
 kernel/pwlocks.c                              |  47 ++++
 mm/mlock.c                                    |  51 +++-
 mm/page_alloc.c                               |   2 +-
 mm/slub.c                                     | 142 +++++-----
 mm/swap.c                                     | 109 ++++---
 12 files changed, 624 insertions(+), 126 deletions(-)
 create mode 100644 Documentation/locking/pwlocks.rst
 create mode 100644 include/linux/pwlocks.h
 create mode 100644 kernel/pwlocks.c


base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
-- 
2.54.0

Re: [PATCH v4 0/4] Introduce Per-CPU Work helpers (was QPW)

Posted by Sebastian Andrzej Siewior 4 days, 14 hours ago

On 2026-05-18 22:27:46 [-0300], Leonardo Bras wrote:
> The problem:
> Some places in the kernel implement a parallel programming strategy
> consisting on local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.
> 
> On the other hand, for RT workloads this can represent a problem: getting
> an important workload scheduled out to deal with remote requests is
> sure to introduce unexpected deadline misses.
> 
> The idea:
> Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
It does not become a _spin_lock because it does not spin. It sleeps.

> In this case, instead of scheduling work on a remote cpu, it should
> be safe to grab that remote cpu's per-cpu spinlock and run the required
> work locally. That major cost, which is un/locking in every local function,
> already happens in PREEMPT_RT.

We did have this before but only in the RT tree. It was a bit messy from
the naming because it started with local_ but then it was a remote CPU.
The main issue was the different code path which led to a few deadlocks
back then.
By the time local_lock_t went upstream, the cross-CPU locking was
removed. As far as I remember, the cross-CPU user which did schedule
work on a remote CPU and annoyed NOHZ folks were replaced.

> Also, there is no need to worry about extra cache bouncing:
> The cacheline invalidation already happens due to schedule_work_on().
> 
> This will avoid schedule_work_on(), and thus avoid scheduling-out an
> RT workload.
> 
> Proposed solution:
> A new interface called PerCPU Work (PW), which should replace
> Work Queue in the above mentioned use case.
> 
> If CONFIG_PWLOCKS=n this interfaces just wraps the current
> local_locks + WorkQueue behavior, so no expected change in runtime.
> 
> If CONFIG_PWLOCKS=y, and kernel boot option pwlocks=1,
> pw_queue_on(cpu,...) will lock that cpu's per-cpu structure
> and perform work on it locally. 
> 
Sebastian

[syzbot ci] Re: Introduce Per-CPU Work helpers (was QPW)

Posted by syzbot ci 5 days, 20 hours ago

syzbot ci has tested the following series

[v4] Introduce Per-CPU Work helpers (was QPW)
https://lore.kernel.org/all/20260519012754.240804-1-leobras.c@gmail.com
* [PATCH v4 1/4] Introducing pw_lock() and per-cpu queue & flush work
* [PATCH v4 2/4] mm/swap: move bh draining into a separate workqueue
* [PATCH v4 3/4] swap: apply new pw_queue_on() interface
* [PATCH v4 4/4] slub: apply new pw_queue_on() interface

and found the following issue:
WARNING in __pcs_replace_empty_main

Full report is available here:
https://ci.syzbot.org/series/804f81bd-77b4-490e-bd57-6345ad2aa923

***

WARNING in __pcs_replace_empty_main

tree:      drm-next
URL:       https://gitlab.freedesktop.org/drm/kernel.git
base:      5200f5f493f79f14bbdc349e402a40dfb32f23c8
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/3ea80958-13bd-49da-9c64-6deb788113f8/config

clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
Zone ranges:
  DMA      [mem 0x0000000000001000-0x0000000000ffffff]
  DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
  Normal   [mem 0x0000000100000000-0x000000023fffffff]
  Device   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000000001000-0x000000000009efff]
  node   0: [mem 0x0000000000100000-0x000000007ffdefff]
  node   0: [mem 0x0000000100000000-0x0000000160000fff]
  node   1: [mem 0x0000000160001000-0x000000023fffffff]
Initmem setup node 0 [mem 0x0000000000001000-0x0000000160000fff]
Initmem setup node 1 [mem 0x0000000160001000-0x000000023fffffff]
On node 0, zone DMA: 1 pages in unavailable ranges
On node 0, zone DMA: 97 pages in unavailable ranges
On node 0, zone Normal: 33 pages in unavailable ranges
setup_percpu: NR_CPUS:8 nr_cpumask_bits:2 nr_cpu_ids:2 nr_node_ids:2
percpu: Embedded 71 pages/cpu s250632 r8192 d31992 u2097152
kvm-guest: PV spinlocks disabled, no host support
Kernel command line: earlyprintk=serial net.ifnames=0 sysctl.kernel.hung_task_all_cpu_backtrace=1 ima_policy=tcb nf-conntrack-ftp.ports=20000 nf-conntrack-tftp.ports=20000 nf-conntrack-sip.ports=20000 nf-conntrack-irc.ports=20000 nf-conntrack-sane.ports=20000 binder.debug_mask=0 rcupdate.rcu_expedited=1 rcupdate.rcu_cpu_stall_cputime=1 no_hash_pointers page_owner=on sysctl.vm.nr_hugepages=4 sysctl.vm.nr_overcommit_hugepages=4 secretmem.enable=1 sysctl.max_rcu_stall_to_panic=1 msr.allow_writes=off coredump_filter=0xffff root=/dev/sda console=ttyS0 vsyscall=native numa=fake=2 kvm-intel.nested=1 spec_store_bypass_disable=prctl nopcid vivid.n_devs=64 vivid.multiplanar=1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2 netrom.nr_ndevs=32 rose.rose_ndevs=32 smp.csd_lock_timeout=100000 watchdog_thresh=55 workqueue.watchdog_thresh=140 sysctl.net.core.netdev_unregister_timeout_secs=140 dummy_hcd.num=32 max_loop=32 nbds_max=32 \
Kernel command line: comedi.comedi_num_legacy_minors=4 panic_on_warn=1 root=/dev/sda console=ttyS0 root=/dev/sda1
Unknown kernel command line parameters "nbds_max=32", will be passed to user space.
printk: log buffer data + meta data: 262144 + 917504 = 1179648 bytes
software IO TLB: area num 2.
Fallback order for Node 0: 0 1 
Fallback order for Node 1: 1 0 
Built 2 zonelists, mobility grouping on.  Total pages: 1834877
Policy zone: Normal
mem auto-init: stack:all(zero), heap alloc:on, heap free:off
stackdepot: allocating hash table via alloc_large_system_hash
stackdepot hash table entries: 1048576 (order: 12, 16777216 bytes, linear)
stackdepot: allocating space for 8192 stack pools via memblock
**********************************************************
**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
**                                                      **
** This system shows unhashed kernel memory addresses   **
** via the console, logs, and other interfaces. This    **
** might reduce the security of your system.            **
**                                                      **
** If you see this message and you are not debugging    **
** the kernel, report this immediately to your system   **
** administrator!                                       **
**                                                      **
** Use hash_pointers=always to force this mode off      **
**                                                      **
**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
**********************************************************
------------[ cut here ]------------
debug_locks && !(lock_is_held(&(&s->cpu_sheaves->lock)->dep_map) != 0)
WARNING: mm/slub.c:4601 at __pcs_replace_empty_main+0x51b/0x6e0, CPU#0: swapper/0
Modules linked in:
CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted syzkaller #0 PREEMPT(undef) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__pcs_replace_empty_main+0x51b/0x6e0
Code: 48 85 f6 74 15 4c 89 ff 48 89 c6 e8 af 5e ff ff 4d 89 74 24 38 e9 36 fc ff ff 49 89 44 24 40 4d 89 74 24 38 e9 27 fc ff ff 90 <0f> 0b 90 83 7b 2c 00 0f 85 23 fb ff ff 48 8b 1b e8 20 cd 82 09 41
RSP: 0000:ffffffff8e607d58 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffffff91bb8398 RCX: 0000000000000002
RDX: 0000000000000cc0 RSI: ffffffff8e21ec94 RDI: ffffffff8c28b160
RBP: 0000000000000cc0 R08: 0000000000005e00 R09: 00000000477ac845
R10: 0000000047d13f7f R11: 000000002fa01ecd R12: ffff88812103f308
R13: 0000000000000000 R14: ffffffff91bb8398 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88818dc8a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88823ffff000 CR3: 000000000e74a000 CR4: 00000000000000b0
Call Trace:
 <TASK>
 kmem_cache_alloc_node_noprof+0x441/0x690
 do_kmem_cache_create+0x172/0x620
 create_boot_cache+0xbf/0x120
 kmem_cache_init+0x11a/0x1e0
 mm_core_init+0x7e/0xb0
 start_kernel+0x15a/0x3e0
 x86_64_start_reservations+0x24/0x30
 x86_64_start_kernel+0x143/0x1c0
 common_startup_64+0x13e/0x147
 </TASK>


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.