From nobody Sat Feb 7 13:41:40 2026
Message-ID: <20260206143741.525190180@redhat.com>
User-Agent: quilt/0.66
Date: Fri, 06 Feb 2026 11:34:31 -0300
From: Marcelo Tosatti
To: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
 linux-mm@kvack.org
Cc: Johannes Weiner , Michal Hocko , Roman Gushchin , Shakeel Butt ,
 Muchun Song , Andrew Morton , Christoph Lameter , Pekka Enberg ,
 David Rientjes , Joonsoo Kim , Vlastimil Babka ,
 Hyeonggon Yoo <42.hyeyoo@gmail.com>, Leonardo Bras , Thomas Gleixner ,
 Waiman Long , Boqun Feng , Marcelo Tosatti
Subject: [PATCH 1/4] Introducing qpw_lock() and per-cpu queue & flush work
References: <20260206143430.021026873@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Some places in the kernel implement a parallel programming strategy
consisting of local_lock()s for most of the work, with the few remote
operations scheduled on the target cpu. This keeps cache bouncing low,
since the cacheline tends to stay local, and avoids the cost of locks on
non-RT kernels, even though the rare remote operations are expensive due
to scheduling overhead.

On the other hand, for RT workloads this can represent a problem:
scheduling work on a remote cpu that is executing low latency tasks is
undesired and can introduce unexpected deadline misses.

It is interesting, though, that local_lock()s on RT kernels become
spinlock()s. We can make use of those to avoid scheduling work on a
remote cpu, by directly updating another cpu's per_cpu structure while
holding its spinlock().

In order to do that, introduce a new set of functions that make it
possible to acquire another cpu's per-cpu "local" lock (qpw_{un,}lock*()),
together with the corresponding queue_percpu_work_on() and
flush_percpu_work() helpers to run the remote work.

Users of non-RT kernels with low latency requirements can select the same
functionality with the CONFIG_QPW compile time option. On kernels with
CONFIG_QPW disabled, no changes are expected, as every one of the
introduced helpers works exactly like the current implementation:

qpw_{un,}lock*()        -> local_{un,}lock*() (ignores the cpu parameter)
queue_percpu_work_on()  -> queue_work_on()
flush_percpu_work()     -> flush_work()

On QPW enabled kernels, though, qpw_{un,}lock*() use the extra cpu
parameter to select the correct per-cpu structure to work on, and acquire
the spinlock for that cpu. queue_percpu_work_on() just calls the requested
function on the current cpu, which then operates on another cpu's per-cpu
object. Since the local_lock()s become spinlock()s on QPW enabled kernels,
this is safe. flush_percpu_work() then becomes a no-op, since no work is
actually scheduled on a remote cpu.

Some minimal code rework is needed to make this mechanism work: the
local_{un,}lock*() calls in the functions that are currently scheduled on
remote cpus need to be replaced by qpw_{un,}lock*(), so that on QPW
enabled kernels they can reference a different cpu. It is also necessary
to use a qpw_struct instead of a work_struct, but it just contains a
work_struct and, with CONFIG_QPW, the target cpu.

This should have almost no impact on non-CONFIG_QPW kernels: a few
this_cpu_ptr() calls become per_cpu_ptr(, smp_processor_id()). On
CONFIG_QPW kernels, this should avoid deadline misses by removing
scheduling noise.
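As an illustration of the intended conversion (not part of this patch), a
minimal sketch of a hypothetical per-cpu cache moving from local_lock() +
queue_work_on() to the qpw API; my_pcp, my_drain_cpu(), my_drain_all() and
the count field are made-up names for this example, and system_wq stands
in for whatever workqueue the real user would pick:

#include <linux/init.h>
#include <linux/percpu.h>
#include <linux/qpw.h>
#include <linux/workqueue.h>

/* Hypothetical per-cpu cache: the local_lock_t becomes a qpw_lock_t and
 * the work_struct used for remote draining becomes a qpw_struct. */
struct my_pcp {
	qpw_lock_t lock;
	struct qpw_struct qpw;
	int count;
};
static DEFINE_PER_CPU(struct my_pcp, my_pcp);

static int __init my_pcp_init(void)
{
	int cpu;

	/* Runtime init replaces a static INIT_LOCAL_LOCK() initializer. */
	for_each_possible_cpu(cpu)
		qpw_lock_init(&per_cpu(my_pcp, cpu).lock);
	return 0;
}
early_initcall(my_pcp_init);

/* With CONFIG_QPW disabled (or qpw=0) this runs on the target cpu via the
 * workqueue; with qpw=1 it runs directly on the requesting cpu, and the
 * per-cpu spinlock makes the remote access to my_pcp safe. */
static void my_drain_cpu(struct work_struct *w)
{
	int cpu = qpw_get_cpu(w);
	struct my_pcp *p = per_cpu_ptr(&my_pcp, cpu);

	qpw_lock(&my_pcp.lock, cpu);
	p->count = 0;
	qpw_unlock(&my_pcp.lock, cpu);
}

static void my_drain_all(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct my_pcp *p = per_cpu_ptr(&my_pcp, cpu);

		INIT_QPW(&p->qpw, my_drain_cpu, cpu);
		queue_percpu_work_on(cpu, system_wq, &p->qpw);
	}

	/* No-op when QPW already ran the work synchronously above. */
	for_each_online_cpu(cpu)
		flush_percpu_work(&per_cpu(my_pcp, cpu).qpw);
}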
Signed-off-by: Leonardo Bras Signed-off-by: Marcelo Tosatti Reviewed-by: Leonardo Bras --- Documentation/admin-guide/kernel-parameters.txt | 10 + Documentation/locking/qpwlocks.rst | 63 +++++++ MAINTAINERS | 6=20 include/linux/qpw.h | 190 +++++++++++++++++++= +++++ init/Kconfig | 35 ++++ kernel/Makefile | 2=20 kernel/qpw.c | 26 +++ 7 files changed, 332 insertions(+) create mode 100644 include/linux/qpw.h create mode 100644 kernel/qpw.c Index: slab/Documentation/admin-guide/kernel-parameters.txt =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- slab.orig/Documentation/admin-guide/kernel-parameters.txt +++ slab/Documentation/admin-guide/kernel-parameters.txt @@ -2819,6 +2819,16 @@ Kernel parameters =20 The format of is described above. =20 + qpw=3D [KNL,SMP] Select a behavior on per-CPU resource sharing + and remote interference mechanism on a kernel built with + CONFIG_QPW. + Format: { "0" | "1" } + 0 - local_lock() + queue_work_on(remote_cpu) + 1 - spin_lock() for both local and remote operations + + Selecting 1 may be interesting for systems that want + to avoid interruption & context switches from IPIs. + iucv=3D [HW,NET] =20 ivrs_ioapic [HW,X86-64] Index: slab/MAINTAINERS =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- slab.orig/MAINTAINERS +++ slab/MAINTAINERS @@ -21291,6 +21291,12 @@ F: Documentation/networking/device_drive F: drivers/bus/fsl-mc/ F: include/uapi/linux/fsl_mc.h =20 +QPW +M: Leonardo Bras +S: Supported +F: include/linux/qpw.h +F: kernel/qpw.c + QT1010 MEDIA DRIVER L: linux-media@vger.kernel.org S: Orphan Index: slab/include/linux/qpw.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- /dev/null +++ slab/include/linux/qpw.h @@ -0,0 +1,190 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_QPW_H +#define _LINUX_QPW_H + +#include "linux/spinlock.h" +#include "linux/local_lock.h" +#include "linux/workqueue.h" + +#ifndef CONFIG_QPW + +typedef local_lock_t qpw_lock_t; +typedef local_trylock_t qpw_trylock_t; + +struct qpw_struct { + struct work_struct work; +}; + +#define qpw_lock_init(lock) \ + local_lock_init(lock) + +#define qpw_trylock_init(lock) \ + local_trylock_init(lock) + +#define qpw_lock(lock, cpu) \ + local_lock(lock) + +#define qpw_lock_irqsave(lock, flags, cpu) \ + local_lock_irqsave(lock, flags) + +#define qpw_trylock(lock, cpu) \ + local_trylock(lock) + +#define qpw_trylock_irqsave(lock, flags, cpu) \ + local_trylock_irqsave(lock, flags) + +#define qpw_unlock(lock, cpu) \ + local_unlock(lock) + +#define qpw_unlock_irqrestore(lock, flags, cpu) \ + local_unlock_irqrestore(lock, flags) + +#define qpw_lockdep_assert_held(lock) \ + lockdep_assert_held(lock) + +#define queue_percpu_work_on(c, wq, qpw) \ + queue_work_on(c, wq, &(qpw)->work) + +#define flush_percpu_work(qpw) \ + flush_work(&(qpw)->work) + +#define qpw_get_cpu(qpw) smp_processor_id() + +#define qpw_is_cpu_remote(cpu) (false) + +#define INIT_QPW(qpw, func, c) \ + INIT_WORK(&(qpw)->work, (func)) + +#else /* CONFIG_QPW */ + +DECLARE_STATIC_KEY_MAYBE(CONFIG_QPW_DEFAULT, qpw_sl); + +typedef union { + spinlock_t sl; + 
local_lock_t ll; +} qpw_lock_t; + +typedef union { + spinlock_t sl; + local_trylock_t ll; +} qpw_trylock_t; + +struct qpw_struct { + struct work_struct work; + int cpu; +}; + +#define qpw_lock_init(lock) \ + do { \ + if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) \ + spin_lock_init(lock.sl); \ + else \ + local_lock_init(lock.ll); \ + } while (0) + +#define qpw_trylock_init(lock) \ + do { \ + if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) \ + spin_lock_init(lock.sl); \ + else \ + local_trylock_init(lock.ll); \ + } while (0) + +#define qpw_lock(lock, cpu) \ + do { \ + if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) \ + spin_lock(per_cpu_ptr(lock.sl, cpu)); \ + else \ + local_lock(lock.ll); \ + } while (0) + +#define qpw_lock_irqsave(lock, flags, cpu) \ + do { \ + if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) \ + spin_lock_irqsave(per_cpu_ptr(lock.sl, cpu), flags); \ + else \ + local_lock_irqsave(lock.ll, flags); \ + } while (0) + +#define qpw_trylock(lock, cpu) \ + ({ \ + int t; \ + if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) \ + t =3D spin_trylock(per_cpu_ptr(lock.sl, cpu)); \ + else \ + t =3D local_trylock(lock.ll); \ + t; \ + }) + +#define qpw_trylock_irqsave(lock, flags, cpu) \ + ({ \ + int t; \ + if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) \ + t =3D spin_trylock_irqsave(per_cpu_ptr(lock.sl, cpu), flags); \ + else \ + t =3D local_trylock_irqsave(lock.ll, flags); \ + t; \ + }) + +#define qpw_unlock(lock, cpu) \ + do { \ + if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) { \ + spin_unlock(per_cpu_ptr(lock.sl, cpu)); \ + } else { \ + local_unlock(lock.ll); \ + } \ + } while (0) + +#define qpw_unlock_irqrestore(lock, flags, cpu) \ + do { \ + if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) \ + spin_unlock_irqrestore(per_cpu_ptr(lock.sl, cpu), flags); \ + else \ + local_unlock_irqrestore(lock.ll, flags); \ + } while (0) + +#define qpw_lockdep_assert_held(lock) \ + do { \ + if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) \ + lockdep_assert_held(this_cpu_ptr(lock.sl)); \ + else \ + lockdep_assert_held(this_cpu_ptr(lock.ll)); \ + } while (0) + +#define queue_percpu_work_on(c, wq, qpw) \ + do { \ + int __c =3D c; \ + struct qpw_struct *__qpw =3D (qpw); \ + if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) { \ + WARN_ON((__c) !=3D __qpw->cpu); \ + __qpw->work.func(&__qpw->work); \ + } else { \ + queue_work_on(__c, wq, &(__qpw)->work); \ + } \ + } while (0) + +/* + * Does nothing if QPW is set to use spinlock, as the task is already done= at the + * time queue_percpu_work_on() returns. + */ +#define flush_percpu_work(qpw) \ + do { \ + struct qpw_struct *__qpw =3D (qpw); \ + if (!static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) { \ + flush_work(&__qpw->work); \ + } \ + } while (0) + +#define qpw_get_cpu(w) container_of((w), struct qpw_struct, work)->cpu + +#define qpw_is_cpu_remote(cpu) ((cpu) !=3D smp_processor_id()) + +#define INIT_QPW(qpw, func, c) \ + do { \ + struct qpw_struct *__qpw =3D (qpw); \ + INIT_WORK(&__qpw->work, (func)); \ + __qpw->cpu =3D (c); \ + } while (0) + +#endif /* CONFIG_QPW */ +#endif /* LINUX_QPW_H */ Index: slab/init/Kconfig =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- slab.orig/init/Kconfig +++ slab/init/Kconfig @@ -747,6 +747,41 @@ config CPU_ISOLATION =20 Say Y if unsure. 
=20 +config QPW + bool "Queue per-CPU Work" + depends on SMP || COMPILE_TEST + default n + help + Allow changing the behavior on per-CPU resource sharing with cache, + from the regular local_locks() + queue_work_on(remote_cpu) to using + per-CPU spinlocks on both local and remote operations. + + This is useful to give user the option on reducing IPIs to CPUs, and + thus reduce interruptions and context switches. On the other hand, it + increases generated code and will use atomic operations if spinlocks + are selected. + + If set, will use the default behavior set in QPW_DEFAULT unless boot + parameter qpw is passed with a different behavior. + + If unset, will use the local_lock() + queue_work_on() strategy, + regardless of the boot parameter or QPW_DEFAULT. + + Say N if unsure. + +config QPW_DEFAULT + bool "Use per-CPU spinlocks by default" + depends on QPW + default n + help + If set, will use per-CPU spinlocks as default behavior for per-CPU + remote operations. + + If unset, will use local_lock() + queue_work_on(cpu) as default + behavior for remote operations. + + Say N if unsure + source "kernel/rcu/Kconfig" =20 config IKCONFIG Index: slab/kernel/Makefile =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- slab.orig/kernel/Makefile +++ slab/kernel/Makefile @@ -140,6 +140,8 @@ obj-$(CONFIG_WATCH_QUEUE) +=3D watch_queue obj-$(CONFIG_RESOURCE_KUNIT_TEST) +=3D resource_kunit.o obj-$(CONFIG_SYSCTL_KUNIT_TEST) +=3D sysctl-test.o =20 +obj-$(CONFIG_QPW) +=3D qpw.o + CFLAGS_kstack_erase.o +=3D $(DISABLE_KSTACK_ERASE) CFLAGS_kstack_erase.o +=3D $(call cc-option,-mgeneral-regs-only) obj-$(CONFIG_KSTACK_ERASE) +=3D kstack_erase.o Index: slab/kernel/qpw.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- /dev/null +++ slab/kernel/qpw.c @@ -0,0 +1,26 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "linux/export.h" +#include +#include +#include + +DEFINE_STATIC_KEY_MAYBE(CONFIG_QPW_DEFAULT, qpw_sl); +EXPORT_SYMBOL(qpw_sl); + +static int __init qpw_setup(char *str) +{ + int opt; + + if (!get_option(&str, &opt)) { + pr_warn("QPW: invalid qpw parameter: %s, ignoring.\n", str); + return 0; + } + + if (opt) + static_branch_enable(&qpw_sl); + else + static_branch_disable(&qpw_sl); + + return 0; +} +__setup("qpw=3D", qpw_setup); Index: slab/Documentation/locking/qpwlocks.rst =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- /dev/null +++ slab/Documentation/locking/qpwlocks.rst @@ -0,0 +1,63 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=3D=3D=3D=3D=3D=3D=3D=3D=3D +QPW locks +=3D=3D=3D=3D=3D=3D=3D=3D=3D + +Some places in the kernel implement a parallel programming strategy +consisting on local_locks() for most of the work, and some rare remote +operations are scheduled on target cpu. This keeps cache bouncing low since +cacheline tends to be mostly local, and avoids the cost of locks in non-RT +kernels, even though the very few remote operations will be expensive due +to scheduling overhead. 
+
+On the other hand, for RT workloads this can represent a problem:
+scheduling work on a remote cpu that is executing low latency tasks
+is undesired and can introduce unexpected deadline misses.
+
+QPW locks help to convert sites that use local_locks (for cpu-local
+operations) and queue_work_on (for queueing work remotely, to be executed
+locally on the owner cpu of the lock) to QPW locks.
+
+The lock is declared with the qpw_lock_t type.
+The lock is initialized with qpw_lock_init.
+The lock is locked with qpw_lock (takes the lock and a cpu as parameters).
+The lock is unlocked with qpw_unlock (takes the lock and a cpu as parameters).
+
+The qpw_lock_irqsave function disables interrupts and saves the current
+interrupt state; it also takes the lock and a cpu as parameters.
+
+For the trylock variant, there is the qpw_trylock_t type, initialized with
+qpw_trylock_init, and the corresponding qpw_trylock and
+qpw_trylock_irqsave.
+
+work_struct should be replaced by qpw_struct, which contains a cpu field
+(the owner cpu of the lock), initialized by INIT_QPW.
+
+The queue work related functions (analogous to queue_work_on and
+flush_work) are queue_percpu_work_on and flush_percpu_work.
+
+The behaviour of the QPW functions is as follows:
+
+* !CONFIG_PREEMPT_RT and !CONFIG_QPW (or CONFIG_QPW and the qpw=0 kernel
+  boot parameter):
+
+  - qpw_lock: local_lock
+  - qpw_lock_irqsave: local_lock_irqsave
+  - qpw_trylock: local_trylock
+  - qpw_trylock_irqsave: local_trylock_irqsave
+  - qpw_unlock: local_unlock
+  - queue_percpu_work_on: queue_work_on
+  - flush_percpu_work: flush_work
+
+* CONFIG_PREEMPT_RT or CONFIG_QPW (and CONFIG_QPW_DEFAULT or the qpw=1
+  kernel boot parameter):
+
+  - qpw_lock: spin_lock
+  - qpw_lock_irqsave: spin_lock_irqsave
+  - qpw_trylock: spin_trylock
+  - qpw_trylock_irqsave: spin_trylock_irqsave
+  - qpw_unlock: spin_unlock
+  - queue_percpu_work_on: executes the work function on the caller cpu
+  - flush_percpu_work: empty
+
+qpw_get_cpu(work_struct), to be called from within the qpw work function,
+returns the target cpu.
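To illustrate the table above (this snippet is not part of the patch or of
the generated documentation file), a sketch of a work function using the
trylock and irqsave variants; my_pcp, my_flush_cpu() and the stats field
are hypothetical names, and the qpw_trylock_t is assumed to have been
initialized with qpw_trylock_init() for each possible cpu:

#include <linux/percpu.h>
#include <linux/printk.h>
#include <linux/qpw.h>

/* Hypothetical per-cpu state protected by a qpw_trylock_t. */
struct my_pcp {
	qpw_trylock_t lock;
	struct qpw_struct qpw;	/* queued with queue_percpu_work_on() */
	unsigned long stats;
};
static DEFINE_PER_CPU(struct my_pcp, my_pcp);

/* With qpw=0 this runs on the target cpu through the workqueue; with
 * qpw=1 it runs on the requesting cpu and the per-cpu spinlock makes
 * the remote access safe. */
static void my_flush_cpu(struct work_struct *w)
{
	int cpu = qpw_get_cpu(w);
	struct my_pcp *p = per_cpu_ptr(&my_pcp, cpu);
	unsigned long flags;

	if (!qpw_trylock_irqsave(&my_pcp.lock, flags, cpu))
		return;		/* contended: a later flush will catch it */

	if (qpw_is_cpu_remote(cpu))
		pr_debug("qpw: flushing cpu %d remotely\n", cpu);

	p->stats = 0;
	qpw_unlock_irqrestore(&my_pcp.lock, flags, cpu);
}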
From nobody Sat Feb 7 13:41:40 2026
Message-ID: <20260206143741.557251404@redhat.com>
User-Agent: quilt/0.66
Date: Fri, 06 Feb 2026 11:34:32 -0300
From: Marcelo Tosatti
To: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org Cc: Johannes Weiner , Michal Hocko , Roman Gushchin , Shakeel Butt , Muchun Song , Andrew Morton , Christoph Lameter , Pekka Enberg , David Rientjes , Joonsoo Kim , Vlastimil Babka , Hyeonggon Yoo <42.hyeyoo@gmail.com>, Leonardo Bras , Thomas Gleixner , Waiman Long , Boqun Feng , Marcelo Tosatti Subject: [PATCH 2/4] mm/swap: move bh draining into a separate workqueue References: <20260206143430.021026873@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.0 on 10.30.177.17 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Separate the bh draining into a separate workqueue (from the mm lru draining), so that its possible to switch the mm lru draining to QPW. To switch bh draining to QPW, it would be necessary to add a spinlock to addition of bhs to percpu cache, and that is a very hot path. Signed-off-by: Marcelo Tosatti --- mm/swap.c | 52 +++++++++++++++++++++++++++++++++++++--------------- 1 file changed, 37 insertions(+), 15 deletions(-) Index: slab/mm/swap.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- slab.orig/mm/swap.c +++ slab/mm/swap.c @@ -745,12 +745,11 @@ void lru_add_drain(void) * the same cpu. It shouldn't be a problem in !SMP case since * the core is only one and the locks will disable preemption. */ -static void lru_add_and_bh_lrus_drain(void) +static void lru_add_mm_drain(void) { local_lock(&cpu_fbatches.lock); lru_add_drain_cpu(smp_processor_id()); local_unlock(&cpu_fbatches.lock); - invalidate_bh_lrus_cpu(); mlock_drain_local(); } =20 @@ -769,10 +768,17 @@ static DEFINE_PER_CPU(struct work_struct =20 static void lru_add_drain_per_cpu(struct work_struct *dummy) { - lru_add_and_bh_lrus_drain(); + lru_add_mm_drain(); } =20 -static bool cpu_needs_drain(unsigned int cpu) +static DEFINE_PER_CPU(struct work_struct, bh_add_drain_work); + +static void bh_add_drain_per_cpu(struct work_struct *dummy) +{ + invalidate_bh_lrus_cpu(); +} + +static bool cpu_needs_mm_drain(unsigned int cpu) { struct cpu_fbatches *fbatches =3D &per_cpu(cpu_fbatches, cpu); =20 @@ -783,8 +789,12 @@ static bool cpu_needs_drain(unsigned int folio_batch_count(&fbatches->lru_deactivate) || folio_batch_count(&fbatches->lru_lazyfree) || folio_batch_count(&fbatches->lru_activate) || - need_mlock_drain(cpu) || - has_bh_in_lru(cpu, NULL); + need_mlock_drain(cpu); +} + +static bool cpu_needs_bh_drain(unsigned int cpu) +{ + return has_bh_in_lru(cpu, NULL); } =20 /* @@ -807,7 +817,7 @@ static inline void __lru_add_drain_all(b * each CPU. 
*/ static unsigned int lru_drain_gen; - static struct cpumask has_work; + static struct cpumask has_mm_work, has_bh_work; static DEFINE_MUTEX(lock); unsigned cpu, this_gen; =20 @@ -870,20 +880,31 @@ static inline void __lru_add_drain_all(b WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1); smp_mb(); =20 - cpumask_clear(&has_work); + cpumask_clear(&has_mm_work); + cpumask_clear(&has_bh_work); for_each_online_cpu(cpu) { - struct work_struct *work =3D &per_cpu(lru_add_drain_work, cpu); + struct work_struct *mm_work =3D &per_cpu(lru_add_drain_work, cpu); + struct work_struct *bh_work =3D &per_cpu(bh_add_drain_work, cpu); + + if (cpu_needs_mm_drain(cpu)) { + INIT_WORK(mm_work, lru_add_drain_per_cpu); + queue_work_on(cpu, mm_percpu_wq, mm_work); + __cpumask_set_cpu(cpu, &has_mm_work); + } =20 - if (cpu_needs_drain(cpu)) { - INIT_WORK(work, lru_add_drain_per_cpu); - queue_work_on(cpu, mm_percpu_wq, work); - __cpumask_set_cpu(cpu, &has_work); + if (cpu_needs_bh_drain(cpu)) { + INIT_WORK(bh_work, bh_add_drain_per_cpu); + queue_work_on(cpu, mm_percpu_wq, bh_work); + __cpumask_set_cpu(cpu, &has_bh_work); } } =20 - for_each_cpu(cpu, &has_work) + for_each_cpu(cpu, &has_mm_work) flush_work(&per_cpu(lru_add_drain_work, cpu)); =20 + for_each_cpu(cpu, &has_bh_work) + flush_work(&per_cpu(bh_add_drain_work, cpu)); + done: mutex_unlock(&lock); } @@ -929,7 +950,8 @@ void lru_cache_disable(void) #ifdef CONFIG_SMP __lru_add_drain_all(true); #else - lru_add_and_bh_lrus_drain(); + lru_add_mm_drain(); + invalidate_bh_lrus_cpu(); #endif } From nobody Sat Feb 7 13:41:40 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AB5164218BB for ; Fri, 6 Feb 2026 14:40:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770388818; cv=none; b=X/uIR8dLuMSq6CkisYjsj6tS2HmvGrUPu9AruDGDzyUVu/QSmQjbw1TCpZPQu6JMU4+8TfjM2/d6AVWkpSrR4wBtpLyyQVYgc3MFZca2gRskuP5zJi/33e4KWLznVDXtiQIfWapoG4INciP4gerTj9JcLyg8sxbCmNTCvb+heUs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770388818; c=relaxed/simple; bh=zyitgazVbNyRejUntdJMXClL49rmA/pJjcIiOHvRTe0=; h=Message-ID:Date:From:To:Cc:Subject:References:MIME-Version: Content-Type; b=ShRTNLCPRA1MmdAu7h6JlNJU1aRW3p2GLQWLPasJ7c4uziaFEnHOGPamlizblB/XU3EBv606Qa6vy1EkvNqSt5L1lYgeZeQhn89xeocTXWjNPj+TsaP0a8ncVtFbLwI8mroIDIxIzxrbIL1xIPpm7VGTA1chlcJbJPik1L6GiiA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=GkmU7uh9; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="GkmU7uh9" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1770388816; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; 
Message-ID: <20260206143741.589656953@redhat.com>
User-Agent: quilt/0.66
Date: Fri, 06 Feb 2026 11:34:33 -0300
From: Marcelo Tosatti
To: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org
Cc: Johannes Weiner , Michal Hocko , Roman Gushchin , Shakeel Butt ,
 Muchun Song , Andrew Morton , Christoph Lameter , Pekka Enberg ,
 David Rientjes , Joonsoo Kim , Vlastimil Babka ,
 Hyeonggon Yoo <42.hyeyoo@gmail.com>, Leonardo Bras , Thomas Gleixner ,
 Waiman Long , Boqun Feng , Marcelo Tosatti
Subject: [PATCH 3/4] swap: apply new queue_percpu_work_on() interface
References: <20260206143430.021026873@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Make use of the new qpw_{un,}lock*() and queue_percpu_work_on() interface
to improve performance & latency on PREEMPT_RT kernels.

For functions that may be scheduled on a different cpu, replace
local_{un,}lock*() by qpw_{un,}lock*(), and replace schedule_work_on() by
queue_percpu_work_on(). Likewise, replace flush_work() by
flush_percpu_work().

The change requires allocating qpw_structs instead of work_structs, and
changing the parameters of a few functions to include the cpu parameter.

This should bring no relevant performance impact on non-RT kernels: for
functions that may be scheduled on a different cpu, the local_*lock's
this_cpu_ptr() becomes a per_cpu_ptr(smp_processor_id()).
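For reference, the local-path conversion pattern this patch applies,
condensed from the mm/mlock.c hunks below; the _before/_after suffixes are
only labels for this comparison, not names used in the tree:

/* Before: the drain can only run on the local cpu. */
void mlock_drain_local_before(void)
{
	struct folio_batch *fbatch;

	local_lock(&mlock_fbatch.lock);
	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
	if (folio_batch_count(fbatch))
		mlock_folio_batch(fbatch);
	local_unlock(&mlock_fbatch.lock);
}

/* After: the same body takes an explicit cpu, so a QPW kernel can run it
 * for a remote cpu while holding that cpu's spinlock ... */
void mlock_drain_cpu_after(int cpu)
{
	struct folio_batch *fbatch;

	qpw_lock(&mlock_fbatch.lock, cpu);
	fbatch = per_cpu_ptr(&mlock_fbatch.fbatch, cpu);
	if (folio_batch_count(fbatch))
		mlock_folio_batch(fbatch);
	qpw_unlock(&mlock_fbatch.lock, cpu);
}

/* ... while the local caller just pins itself and passes its own cpu. */
void mlock_drain_local_after(void)
{
	migrate_disable();
	mlock_drain_cpu_after(smp_processor_id());
	migrate_enable();
}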
Signed-off-by: Leonardo Bras Signed-off-by: Marcelo Tosatti --- mm/internal.h | 4 +- mm/mlock.c | 71 ++++++++++++++++++++++++++++++++------------ mm/page_alloc.c | 2 - mm/swap.c | 90 +++++++++++++++++++++++++++++++--------------------= ----- 4 files changed, 108 insertions(+), 59 deletions(-) Index: slab/mm/mlock.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- slab.orig/mm/mlock.c +++ slab/mm/mlock.c @@ -25,17 +25,16 @@ #include #include #include +#include =20 #include "internal.h" =20 struct mlock_fbatch { - local_lock_t lock; + qpw_lock_t lock; struct folio_batch fbatch; }; =20 -static DEFINE_PER_CPU(struct mlock_fbatch, mlock_fbatch) =3D { - .lock =3D INIT_LOCAL_LOCK(lock), -}; +static DEFINE_PER_CPU(struct mlock_fbatch, mlock_fbatch); =20 bool can_do_mlock(void) { @@ -209,18 +208,25 @@ static void mlock_folio_batch(struct fol folios_put(fbatch); } =20 -void mlock_drain_local(void) +void mlock_drain_cpu(int cpu) { struct folio_batch *fbatch; =20 - local_lock(&mlock_fbatch.lock); - fbatch =3D this_cpu_ptr(&mlock_fbatch.fbatch); + qpw_lock(&mlock_fbatch.lock, cpu); + fbatch =3D per_cpu_ptr(&mlock_fbatch.fbatch, cpu); if (folio_batch_count(fbatch)) mlock_folio_batch(fbatch); - local_unlock(&mlock_fbatch.lock); + qpw_unlock(&mlock_fbatch.lock, cpu); } =20 -void mlock_drain_remote(int cpu) +void mlock_drain_local(void) +{ + migrate_disable(); + mlock_drain_cpu(smp_processor_id()); + migrate_enable(); +} + +void mlock_drain_offline(int cpu) { struct folio_batch *fbatch; =20 @@ -242,9 +248,12 @@ bool need_mlock_drain(int cpu) void mlock_folio(struct folio *folio) { struct folio_batch *fbatch; + int cpu; =20 - local_lock(&mlock_fbatch.lock); - fbatch =3D this_cpu_ptr(&mlock_fbatch.fbatch); + migrate_disable(); + cpu =3D smp_processor_id(); + qpw_lock(&mlock_fbatch.lock, cpu); + fbatch =3D per_cpu_ptr(&mlock_fbatch.fbatch, cpu); =20 if (!folio_test_set_mlocked(folio)) { int nr_pages =3D folio_nr_pages(folio); @@ -257,7 +266,8 @@ void mlock_folio(struct folio *folio) if (!folio_batch_add(fbatch, mlock_lru(folio)) || !folio_may_be_lru_cached(folio) || lru_cache_disabled()) mlock_folio_batch(fbatch); - local_unlock(&mlock_fbatch.lock); + qpw_unlock(&mlock_fbatch.lock, cpu); + migrate_enable(); } =20 /** @@ -268,9 +278,13 @@ void mlock_new_folio(struct folio *folio { struct folio_batch *fbatch; int nr_pages =3D folio_nr_pages(folio); + int cpu; + + migrate_disable(); + cpu =3D smp_processor_id(); + qpw_lock(&mlock_fbatch.lock, cpu); =20 - local_lock(&mlock_fbatch.lock); - fbatch =3D this_cpu_ptr(&mlock_fbatch.fbatch); + fbatch =3D per_cpu_ptr(&mlock_fbatch.fbatch, cpu); folio_set_mlocked(folio); =20 zone_stat_mod_folio(folio, NR_MLOCK, nr_pages); @@ -280,7 +294,8 @@ void mlock_new_folio(struct folio *folio if (!folio_batch_add(fbatch, mlock_new(folio)) || !folio_may_be_lru_cached(folio) || lru_cache_disabled()) mlock_folio_batch(fbatch); - local_unlock(&mlock_fbatch.lock); + migrate_enable(); + qpw_unlock(&mlock_fbatch.lock, cpu); } =20 /** @@ -290,9 +305,13 @@ void mlock_new_folio(struct folio *folio void munlock_folio(struct folio *folio) { struct folio_batch *fbatch; + int cpu; =20 - local_lock(&mlock_fbatch.lock); - fbatch =3D this_cpu_ptr(&mlock_fbatch.fbatch); + migrate_disable(); + cpu =3D smp_processor_id(); + qpw_lock(&mlock_fbatch.lock, cpu); + + fbatch =3D per_cpu_ptr(&mlock_fbatch.fbatch, cpu); /* * 
folio_test_clear_mlocked(folio) must be left to __munlock_folio(), * which will check whether the folio is multiply mlocked. @@ -301,7 +320,8 @@ void munlock_folio(struct folio *folio) if (!folio_batch_add(fbatch, folio) || !folio_may_be_lru_cached(folio) || lru_cache_disabled()) mlock_folio_batch(fbatch); - local_unlock(&mlock_fbatch.lock); + qpw_unlock(&mlock_fbatch.lock, cpu); + migrate_enable(); } =20 static inline unsigned int folio_mlock_step(struct folio *folio, @@ -823,3 +843,18 @@ void user_shm_unlock(size_t size, struct spin_unlock(&shmlock_user_lock); put_ucounts(ucounts); } + +int __init mlock_init(void) +{ + unsigned int cpu; + + for_each_possible_cpu(cpu) { + struct mlock_fbatch *fbatch =3D &per_cpu(mlock_fbatch, cpu); + + qpw_lock_init(&fbatch->lock); + } + + return 0; +} + +module_init(mlock_init); Index: slab/mm/swap.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- slab.orig/mm/swap.c +++ slab/mm/swap.c @@ -35,7 +35,7 @@ #include #include #include -#include +#include #include =20 #include "internal.h" @@ -52,7 +52,7 @@ struct cpu_fbatches { * The following folio batches are grouped together because they are prot= ected * by disabling preemption (and interrupts remain enabled). */ - local_lock_t lock; + qpw_lock_t lock; struct folio_batch lru_add; struct folio_batch lru_deactivate_file; struct folio_batch lru_deactivate; @@ -61,14 +61,11 @@ struct cpu_fbatches { struct folio_batch lru_activate; #endif /* Protecting the following batches which require disabling interrupts */ - local_lock_t lock_irq; + qpw_lock_t lock_irq; struct folio_batch lru_move_tail; }; =20 -static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches) =3D { - .lock =3D INIT_LOCAL_LOCK(lock), - .lock_irq =3D INIT_LOCAL_LOCK(lock_irq), -}; +static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches); =20 static void __page_cache_release(struct folio *folio, struct lruvec **lruv= ecp, unsigned long *flagsp) @@ -183,22 +180,24 @@ static void __folio_batch_add_and_move(s struct folio *folio, move_fn_t move_fn, bool disable_irq) { unsigned long flags; + int cpu; =20 folio_get(folio); =20 + cpu =3D smp_processor_id(); if (disable_irq) - local_lock_irqsave(&cpu_fbatches.lock_irq, flags); + qpw_lock_irqsave(&cpu_fbatches.lock_irq, flags, cpu); else - local_lock(&cpu_fbatches.lock); + qpw_lock(&cpu_fbatches.lock, cpu); =20 - if (!folio_batch_add(this_cpu_ptr(fbatch), folio) || + if (!folio_batch_add(per_cpu_ptr(fbatch, cpu), folio) || !folio_may_be_lru_cached(folio) || lru_cache_disabled()) - folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn); + folio_batch_move_lru(per_cpu_ptr(fbatch, cpu), move_fn); =20 if (disable_irq) - local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags); + qpw_unlock_irqrestore(&cpu_fbatches.lock_irq, flags, cpu); else - local_unlock(&cpu_fbatches.lock); + qpw_unlock(&cpu_fbatches.lock, cpu); } =20 #define folio_batch_add_and_move(folio, op) \ @@ -358,9 +357,10 @@ static void __lru_cache_activate_folio(s { struct folio_batch *fbatch; int i; + int cpu =3D smp_processor_id(); =20 - local_lock(&cpu_fbatches.lock); - fbatch =3D this_cpu_ptr(&cpu_fbatches.lru_add); + qpw_lock(&cpu_fbatches.lock, cpu); + fbatch =3D per_cpu_ptr(&cpu_fbatches.lru_add, cpu); =20 /* * Search backwards on the optimistic assumption that the folio being @@ -381,7 +381,7 @@ static void __lru_cache_activate_folio(s } } =20 - local_unlock(&cpu_fbatches.lock); + 
qpw_unlock(&cpu_fbatches.lock, cpu); } =20 #ifdef CONFIG_LRU_GEN @@ -653,9 +653,9 @@ void lru_add_drain_cpu(int cpu) unsigned long flags; =20 /* No harm done if a racing interrupt already did this */ - local_lock_irqsave(&cpu_fbatches.lock_irq, flags); + qpw_lock_irqsave(&cpu_fbatches.lock_irq, flags, cpu); folio_batch_move_lru(fbatch, lru_move_tail); - local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags); + qpw_unlock_irqrestore(&cpu_fbatches.lock_irq, flags, cpu); } =20 fbatch =3D &fbatches->lru_deactivate_file; @@ -733,10 +733,12 @@ void folio_mark_lazyfree(struct folio *f =20 void lru_add_drain(void) { - local_lock(&cpu_fbatches.lock); - lru_add_drain_cpu(smp_processor_id()); - local_unlock(&cpu_fbatches.lock); - mlock_drain_local(); + int cpu =3D smp_processor_id(); + + qpw_lock(&cpu_fbatches.lock, cpu); + lru_add_drain_cpu(cpu); + qpw_unlock(&cpu_fbatches.lock, cpu); + mlock_drain_cpu(cpu); } =20 /* @@ -745,30 +747,32 @@ void lru_add_drain(void) * the same cpu. It shouldn't be a problem in !SMP case since * the core is only one and the locks will disable preemption. */ -static void lru_add_mm_drain(void) +static void lru_add_mm_drain(int cpu) { - local_lock(&cpu_fbatches.lock); - lru_add_drain_cpu(smp_processor_id()); - local_unlock(&cpu_fbatches.lock); - mlock_drain_local(); + qpw_lock(&cpu_fbatches.lock, cpu); + lru_add_drain_cpu(cpu); + qpw_unlock(&cpu_fbatches.lock, cpu); + mlock_drain_cpu(cpu); } =20 void lru_add_drain_cpu_zone(struct zone *zone) { - local_lock(&cpu_fbatches.lock); - lru_add_drain_cpu(smp_processor_id()); + int cpu =3D smp_processor_id(); + + qpw_lock(&cpu_fbatches.lock, cpu); + lru_add_drain_cpu(cpu); drain_local_pages(zone); - local_unlock(&cpu_fbatches.lock); - mlock_drain_local(); + qpw_unlock(&cpu_fbatches.lock, cpu); + mlock_drain_cpu(cpu); } =20 #ifdef CONFIG_SMP =20 -static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work); +static DEFINE_PER_CPU(struct qpw_struct, lru_add_drain_qpw); =20 -static void lru_add_drain_per_cpu(struct work_struct *dummy) +static void lru_add_drain_per_cpu(struct work_struct *w) { - lru_add_mm_drain(); + lru_add_mm_drain(qpw_get_cpu(w)); } =20 static DEFINE_PER_CPU(struct work_struct, bh_add_drain_work); @@ -883,12 +887,12 @@ static inline void __lru_add_drain_all(b cpumask_clear(&has_mm_work); cpumask_clear(&has_bh_work); for_each_online_cpu(cpu) { - struct work_struct *mm_work =3D &per_cpu(lru_add_drain_work, cpu); + struct qpw_struct *mm_qpw =3D &per_cpu(lru_add_drain_qpw, cpu); struct work_struct *bh_work =3D &per_cpu(bh_add_drain_work, cpu); =20 if (cpu_needs_mm_drain(cpu)) { - INIT_WORK(mm_work, lru_add_drain_per_cpu); - queue_work_on(cpu, mm_percpu_wq, mm_work); + INIT_QPW(mm_qpw, lru_add_drain_per_cpu, cpu); + queue_percpu_work_on(cpu, mm_percpu_wq, mm_qpw); __cpumask_set_cpu(cpu, &has_mm_work); } =20 @@ -900,7 +904,7 @@ static inline void __lru_add_drain_all(b } =20 for_each_cpu(cpu, &has_mm_work) - flush_work(&per_cpu(lru_add_drain_work, cpu)); + flush_percpu_work(&per_cpu(lru_add_drain_qpw, cpu)); =20 for_each_cpu(cpu, &has_bh_work) flush_work(&per_cpu(bh_add_drain_work, cpu)); @@ -950,7 +954,7 @@ void lru_cache_disable(void) #ifdef CONFIG_SMP __lru_add_drain_all(true); #else - lru_add_mm_drain(); + lru_add_mm_drain(smp_processor_id()); invalidate_bh_lrus_cpu(); #endif } @@ -1124,6 +1128,7 @@ static const struct ctl_table swap_sysct void __init swap_setup(void) { unsigned long megs =3D PAGES_TO_MB(totalram_pages()); + unsigned int cpu; =20 /* Use a smaller cluster for small-memory machines */ if (megs < 16) 
@@ -1136,4 +1141,11 @@ void __init swap_setup(void) */ =20 register_sysctl_init("vm", swap_sysctl_table); + + for_each_possible_cpu(cpu) { + struct cpu_fbatches *fbatches =3D &per_cpu(cpu_fbatches, cpu); + + qpw_lock_init(&fbatches->lock); + qpw_lock_init(&fbatches->lock_irq); + } } Index: slab/mm/internal.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- slab.orig/mm/internal.h +++ slab/mm/internal.h @@ -1061,10 +1061,12 @@ static inline void munlock_vma_folio(str munlock_folio(folio); } =20 +int __init mlock_init(void); void mlock_new_folio(struct folio *folio); bool need_mlock_drain(int cpu); void mlock_drain_local(void); -void mlock_drain_remote(int cpu); +void mlock_drain_cpu(int cpu); +void mlock_drain_offline(int cpu); =20 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma); =20 Index: slab/mm/page_alloc.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- slab.orig/mm/page_alloc.c +++ slab/mm/page_alloc.c @@ -6251,7 +6251,7 @@ static int page_alloc_cpu_dead(unsigned struct zone *zone; =20 lru_add_drain_cpu(cpu); - mlock_drain_remote(cpu); + mlock_drain_offline(cpu); drain_pages(cpu); =20 /* From nobody Sat Feb 7 13:41:40 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 84C69423151 for ; Fri, 6 Feb 2026 14:40:18 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770388818; cv=none; b=SdM2hnnE9Su9DT8wllzknxeF4w5VPT/se9qGH5HDvtlbefMOL+Mic2RPzTs5zeMmjs52KvSqLsKJayV7dUor6lSiXib8mczOnepsYtQzNVGWc4WYdrBr0a0o25bzW/LBfus2OIIARDk7TVala+RS7YbP4aVcGKKJEQJLvFUcSPE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770388818; c=relaxed/simple; bh=m6kEmEIfXFU1Hl0FIsgn6Fy2bN3huAhWxrtV4Z4+Dlg=; h=Message-ID:Date:From:To:Cc:Subject:References:MIME-Version: Content-Type; b=U6fO2kkL9bCJdWpwNAhd2XUwXnsJIegQ+PUyxFfSe1DFrv3iFVS95e0n3C9fBT9RQrk1XMN4llYbojLoLz8MpNkPcXWcuuLtmy3Wt919Ujrc6jt7EUKwXltwxd9rU0WNFcppf41YVq++GtUS2P5u4vLIwnlbJm58oft0m6IiRqU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=LrAWY9MS; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="LrAWY9MS" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1770388817; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=CIBbWTksw7DvJ78/DV3io3akwWeED1pL6ZqB6VcV8to=; 
Message-ID: <20260206143741.621816322@redhat.com>
User-Agent: quilt/0.66
Date: Fri, 06 Feb 2026 11:34:34 -0300
From: Marcelo Tosatti
To: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org
Cc: Johannes Weiner , Michal Hocko , Roman Gushchin , Shakeel Butt ,
 Muchun Song , Andrew Morton , Christoph Lameter , Pekka Enberg ,
 David Rientjes , Joonsoo Kim , Vlastimil Babka ,
 Hyeonggon Yoo <42.hyeyoo@gmail.com>, Leonardo Bras , Thomas Gleixner ,
 Waiman Long , Boqun Feng , Marcelo Tosatti
Subject: [PATCH 4/4] slub: apply new queue_percpu_work_on() interface
References: <20260206143430.021026873@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Make use of the new qpw_{un,}lock*() and queue_percpu_work_on() interface
to improve performance & latency on PREEMPT_RT kernels.

For functions that may be scheduled on a different cpu, replace
local_{un,}lock*() by qpw_{un,}lock*(), and replace schedule_work_on() by
queue_percpu_work_on(). Likewise, replace flush_work() by
flush_percpu_work().

This change requires allocating qpw_structs instead of work_structs, and
changing the parameters of a few functions to include the cpu parameter.

This should bring no relevant performance impact on non-RT kernels: for
functions that may be scheduled on a different cpu, the local_*lock's
this_cpu_ptr() becomes a per_cpu_ptr(smp_processor_id()).
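For reference, the shape of the converted sheaves free fastpath, condensed
from the free_to_pcs() hunk below; the _sketch suffix is only a label for
this summary, and the slow path for a full main sheaf is omitted:

/* Condensed from free_to_pcs(): pin the task, record the cpu, and take
 * the per-cpu sheaf lock through the qpw API, so the flush path can take
 * the same lock remotely instead of queueing work on this cpu. */
static bool free_to_pcs_sketch(struct kmem_cache *s, void *object)
{
	struct slub_percpu_sheaves *pcs;
	int cpu;

	migrate_disable();
	cpu = smp_processor_id();
	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) {
		migrate_enable();
		return false;
	}

	pcs = this_cpu_ptr(s->cpu_sheaves);
	/* Slow path for a full main sheaf omitted here. */
	pcs->main->objects[pcs->main->size++] = object;

	qpw_unlock(&s->cpu_sheaves->lock, cpu);
	migrate_enable();
	return true;
}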
Signed-off-by: Leonardo Bras Signed-off-by: Marcelo Tosatti --- mm/slub.c | 218 ++++++++++++++++++++++++++++++++++++++++-----------------= ----- 1 file changed, 142 insertions(+), 76 deletions(-) Index: slab/mm/slub.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- slab.orig/mm/slub.c +++ slab/mm/slub.c @@ -49,6 +49,7 @@ #include #include #include +#include #include =20 #include "internal.h" @@ -128,7 +129,7 @@ * For debug caches, all allocations are forced to go through a list_lock * protected region to serialize against concurrent validation. * - * cpu_sheaves->lock (local_trylock) + * cpu_sheaves->lock (qpw_trylock) * * This lock protects fastpath operations on the percpu sheaves. On !RT = it * only disables preemption and does no atomic operations. As long as th= e main @@ -156,7 +157,7 @@ * Interrupts are disabled as part of list_lock or barn lock operations,= or * around the slab_lock operation, in order to make the slab allocator s= afe * to use in the context of an irq. - * Preemption is disabled as part of local_trylock operations. + * Preemption is disabled as part of qpw_trylock operations. * kmalloc_nolock() and kfree_nolock() are safe in NMI context but see * their limitations. * @@ -417,7 +418,7 @@ struct slab_sheaf { }; =20 struct slub_percpu_sheaves { - local_trylock_t lock; + qpw_trylock_t lock; struct slab_sheaf *main; /* never NULL when unlocked */ struct slab_sheaf *spare; /* empty or full, may be NULL */ struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */ @@ -479,7 +480,7 @@ static nodemask_t slab_nodes; static struct workqueue_struct *flushwq; =20 struct slub_flush_work { - struct work_struct work; + struct qpw_struct qpw; struct kmem_cache *s; bool skip; }; @@ -2826,7 +2827,7 @@ static void __kmem_cache_free_bulk(struc * * returns true if at least partially flushed */ -static bool sheaf_flush_main(struct kmem_cache *s) +static bool sheaf_flush_main(struct kmem_cache *s, int cpu) { struct slub_percpu_sheaves *pcs; unsigned int batch, remaining; @@ -2835,10 +2836,10 @@ static bool sheaf_flush_main(struct kmem bool ret =3D false; =20 next_batch: - if (!local_trylock(&s->cpu_sheaves->lock)) + if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) return ret; =20 - pcs =3D this_cpu_ptr(s->cpu_sheaves); + pcs =3D per_cpu_ptr(s->cpu_sheaves, cpu); sheaf =3D pcs->main; =20 batch =3D min(PCS_BATCH_MAX, sheaf->size); @@ -2848,7 +2849,7 @@ next_batch: =20 remaining =3D sheaf->size; =20 - local_unlock(&s->cpu_sheaves->lock); + qpw_unlock(&s->cpu_sheaves->lock, cpu); =20 __kmem_cache_free_bulk(s, batch, &objects[0]); =20 @@ -2932,13 +2933,13 @@ static void rcu_free_sheaf_nobarn(struct * flushing operations are rare so let's keep it simple and flush to slabs * directly, skipping the barn */ -static void pcs_flush_all(struct kmem_cache *s) +static void pcs_flush_all(struct kmem_cache *s, int cpu) { struct slub_percpu_sheaves *pcs; struct slab_sheaf *spare, *rcu_free; =20 - local_lock(&s->cpu_sheaves->lock); - pcs =3D this_cpu_ptr(s->cpu_sheaves); + qpw_lock(&s->cpu_sheaves->lock, cpu); + pcs =3D per_cpu_ptr(s->cpu_sheaves, cpu); =20 spare =3D pcs->spare; pcs->spare =3D NULL; @@ -2946,7 +2947,7 @@ static void pcs_flush_all(struct kmem_ca rcu_free =3D pcs->rcu_free; pcs->rcu_free =3D NULL; =20 - local_unlock(&s->cpu_sheaves->lock); + qpw_unlock(&s->cpu_sheaves->lock, cpu); =20 if (spare) { sheaf_flush_unused(s, spare); @@ 
-2956,7 +2957,7 @@ static void pcs_flush_all(struct kmem_ca
 	if (rcu_free)
 		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);

-	sheaf_flush_main(s);
+	sheaf_flush_main(s, cpu);
 }

 static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
@@ -3881,13 +3882,13 @@ static void flush_cpu_sheaves(struct wor
 {
 	struct kmem_cache *s;
 	struct slub_flush_work *sfw;
+	int cpu = qpw_get_cpu(w);

-	sfw = container_of(w, struct slub_flush_work, work);
-
+	sfw = &per_cpu(slub_flush, cpu);
 	s = sfw->s;

 	if (cache_has_sheaves(s))
-		pcs_flush_all(s);
+		pcs_flush_all(s, cpu);
 }

 static void flush_all_cpus_locked(struct kmem_cache *s)
@@ -3904,17 +3905,17 @@ static void flush_all_cpus_locked(struct
 			sfw->skip = true;
 			continue;
 		}
-		INIT_WORK(&sfw->work, flush_cpu_sheaves);
+		INIT_QPW(&sfw->qpw, flush_cpu_sheaves, cpu);
 		sfw->skip = false;
 		sfw->s = s;
-		queue_work_on(cpu, flushwq, &sfw->work);
+		queue_percpu_work_on(cpu, flushwq, &sfw->qpw);
 	}

 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
 		if (sfw->skip)
 			continue;
-		flush_work(&sfw->work);
+		flush_percpu_work(&sfw->qpw);
 	}

 	mutex_unlock(&flush_lock);
@@ -3933,17 +3934,18 @@ static void flush_rcu_sheaf(struct work_
 	struct slab_sheaf *rcu_free;
 	struct slub_flush_work *sfw;
 	struct kmem_cache *s;
+	int cpu = qpw_get_cpu(w);

-	sfw = container_of(w, struct slub_flush_work, work);
+	sfw = &per_cpu(slub_flush, cpu);
 	s = sfw->s;

-	local_lock(&s->cpu_sheaves->lock);
-	pcs = this_cpu_ptr(s->cpu_sheaves);
+	qpw_lock(&s->cpu_sheaves->lock, cpu);
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);

 	rcu_free = pcs->rcu_free;
 	pcs->rcu_free = NULL;

-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);

 	if (rcu_free)
 		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
@@ -3968,14 +3970,14 @@ void flush_rcu_sheaves_on_cache(struct k
 		 * sure the __kfree_rcu_sheaf() finished its call_rcu()
 		 */

-		INIT_WORK(&sfw->work, flush_rcu_sheaf);
+		INIT_QPW(&sfw->qpw, flush_rcu_sheaf, cpu);
 		sfw->s = s;
-		queue_work_on(cpu, flushwq, &sfw->work);
+		queue_percpu_work_on(cpu, flushwq, &sfw->qpw);
 	}

 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
-		flush_work(&sfw->work);
+		flush_percpu_work(&sfw->qpw);
 	}

 	mutex_unlock(&flush_lock);
@@ -4472,22 +4474,24 @@ bool slab_post_alloc_hook(struct kmem_ca
  *
  * Must be called with the cpu_sheaves local lock locked. If successful, returns
  * the pcs pointer and the local lock locked (possibly on a different cpu than
- * initially called). If not successful, returns NULL and the local lock
- * unlocked.
+ * initially called), and migration disabled. If not successful, returns NULL
+ * and the local lock unlocked, with migration enabled.
  */
 static struct slub_percpu_sheaves *
-__pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs, gfp_t gfp)
+__pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs, gfp_t gfp,
+		int *cpu)
 {
 	struct slab_sheaf *empty = NULL;
 	struct slab_sheaf *full;
 	struct node_barn *barn;
 	bool can_alloc;

-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+	qpw_lockdep_assert_held(&s->cpu_sheaves->lock);

 	/* Bootstrap or debug cache, back off */
 	if (unlikely(!cache_has_sheaves(s))) {
-		local_unlock(&s->cpu_sheaves->lock);
+		qpw_unlock(&s->cpu_sheaves->lock, *cpu);
+		migrate_enable();
 		return NULL;
 	}

@@ -4498,7 +4502,8 @@ __pcs_replace_empty_main(struct kmem_cac

 	barn = get_barn(s);
 	if (!barn) {
-		local_unlock(&s->cpu_sheaves->lock);
+		qpw_unlock(&s->cpu_sheaves->lock, *cpu);
+		migrate_enable();
 		return NULL;
 	}

@@ -4524,7 +4529,8 @@ __pcs_replace_empty_main(struct kmem_cac
 		}
 	}

-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, *cpu);
+	migrate_enable();

 	if (!can_alloc)
 		return NULL;
@@ -4550,7 +4556,9 @@ __pcs_replace_empty_main(struct kmem_cac
 	 * we can reach here only when gfpflags_allow_blocking
 	 * so this must not be an irq
 	 */
-	local_lock(&s->cpu_sheaves->lock);
+	migrate_disable();
+	*cpu = smp_processor_id();
+	qpw_lock(&s->cpu_sheaves->lock, *cpu);
 	pcs = this_cpu_ptr(s->cpu_sheaves);

 	/*
@@ -4593,6 +4601,7 @@ void *alloc_from_pcs(struct kmem_cache *
 	struct slub_percpu_sheaves *pcs;
 	bool node_requested;
 	void *object;
+	int cpu;

 #ifdef CONFIG_NUMA
 	if (static_branch_unlikely(&strict_numa) &&
@@ -4627,13 +4636,17 @@ void *alloc_from_pcs(struct kmem_cache *
 		return NULL;
 	}

-	if (!local_trylock(&s->cpu_sheaves->lock))
+	migrate_disable();
+	cpu = smp_processor_id();
+	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) {
+		migrate_enable();
 		return NULL;
+	}

 	pcs = this_cpu_ptr(s->cpu_sheaves);

 	if (unlikely(pcs->main->size == 0)) {
-		pcs = __pcs_replace_empty_main(s, pcs, gfp);
+		pcs = __pcs_replace_empty_main(s, pcs, gfp, &cpu);
 		if (unlikely(!pcs))
 			return NULL;
 	}
@@ -4647,7 +4660,8 @@ void *alloc_from_pcs(struct kmem_cache *
 	 * the current allocation or previous freeing process.
 	 */
 	if (page_to_nid(virt_to_page(object)) != node) {
-		local_unlock(&s->cpu_sheaves->lock);
+		qpw_unlock(&s->cpu_sheaves->lock, cpu);
+		migrate_enable();
 		stat(s, ALLOC_NODE_MISMATCH);
 		return NULL;
 	}
@@ -4655,7 +4669,8 @@ void *alloc_from_pcs(struct kmem_cache *

 	pcs->main->size--;

-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
+	migrate_enable();

 	stat(s, ALLOC_FASTPATH);

@@ -4670,10 +4685,15 @@ unsigned int alloc_from_pcs_bulk(struct
 	struct slab_sheaf *main;
 	unsigned int allocated = 0;
 	unsigned int batch;
+	int cpu;

 next_batch:
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	migrate_disable();
+	cpu = smp_processor_id();
+	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) {
+		migrate_enable();
 		return allocated;
+	}

 	pcs = this_cpu_ptr(s->cpu_sheaves);

@@ -4683,7 +4703,8 @@ next_batch:
 		struct node_barn *barn;

 		if (unlikely(!cache_has_sheaves(s))) {
-			local_unlock(&s->cpu_sheaves->lock);
+			qpw_unlock(&s->cpu_sheaves->lock, cpu);
+			migrate_enable();
 			return allocated;
 		}

@@ -4694,7 +4715,8 @@ next_batch:

 		barn = get_barn(s);
 		if (!barn) {
-			local_unlock(&s->cpu_sheaves->lock);
+			qpw_unlock(&s->cpu_sheaves->lock, cpu);
+			migrate_enable();
 			return allocated;
 		}

@@ -4709,7 +4731,8 @@ next_batch:

 		stat(s, BARN_GET_FAIL);

-		local_unlock(&s->cpu_sheaves->lock);
+		qpw_unlock(&s->cpu_sheaves->lock, cpu);
+		migrate_enable();

 		/*
 		 * Once full sheaves in barn are depleted, let the bulk
@@ -4727,7 +4750,8 @@ do_alloc:
 	main->size -= batch;
 	memcpy(p, main->objects + main->size, batch * sizeof(void *));

-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
+	migrate_enable();

 	stat_add(s, ALLOC_FASTPATH, batch);

@@ -4877,6 +4901,7 @@ kmem_cache_prefill_sheaf(struct kmem_cac
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *sheaf = NULL;
 	struct node_barn *barn;
+	int cpu;

 	if (unlikely(!size))
 		return NULL;
@@ -4906,7 +4931,9 @@ kmem_cache_prefill_sheaf(struct kmem_cac
 		return sheaf;
 	}

-	local_lock(&s->cpu_sheaves->lock);
+	migrate_disable();
+	cpu = smp_processor_id();
+	qpw_lock(&s->cpu_sheaves->lock, cpu);
 	pcs = this_cpu_ptr(s->cpu_sheaves);

 	if (pcs->spare) {
@@ -4925,7 +4952,8 @@ kmem_cache_prefill_sheaf(struct kmem_cac
 		stat(s, BARN_GET_FAIL);
 	}

-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
+	migrate_enable();


 	if (!sheaf)
@@ -4961,6 +4989,7 @@ void kmem_cache_return_sheaf(struct kmem
 {
 	struct slub_percpu_sheaves *pcs;
 	struct node_barn *barn;
+	int cpu;

 	if (unlikely((sheaf->capacity != s->sheaf_capacity) ||
 		     sheaf->pfmemalloc)) {
@@ -4969,7 +4998,9 @@ void kmem_cache_return_sheaf(struct kmem
 		return;
 	}

-	local_lock(&s->cpu_sheaves->lock);
+	migrate_disable();
+	cpu = smp_processor_id();
+	qpw_lock(&s->cpu_sheaves->lock, cpu);
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 	barn = get_barn(s);

@@ -4979,7 +5010,8 @@ void kmem_cache_return_sheaf(struct kmem
 		stat(s, SHEAF_RETURN_FAST);
 	}

-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
+	migrate_enable();

 	if (!sheaf)
 		return;
@@ -5507,9 +5539,9 @@ slab_empty:
  */
 static void __pcs_install_empty_sheaf(struct kmem_cache *s,
 		struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty,
-		struct node_barn *barn)
+		struct node_barn *barn, int cpu)
 {
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+	qpw_lockdep_assert_held(&s->cpu_sheaves->lock);

 	/* This is what we expect to find if nobody interrupted us. */
 	if (likely(!pcs->spare)) {
@@ -5546,31 +5578,34 @@ static void __pcs_install_empty_sheaf(st
 /*
  * Replace the full main sheaf with a (at least partially) empty sheaf.
  *
- * Must be called with the cpu_sheaves local lock locked. If successful, returns
- * the pcs pointer and the local lock locked (possibly on a different cpu than
- * initially called). If not successful, returns NULL and the local lock
- * unlocked.
+ * Must be called with the cpu_sheaves local lock locked, and migration counter
+ * increased. If successful, returns the pcs pointer and the local lock locked
+ * (possibly on a different cpu than initially called), with migration counter
+ * increased. If not successful, returns NULL and the local lock unlocked,
+ * and migration counter decreased.
  */
 static struct slub_percpu_sheaves *
 __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
-		bool allow_spin)
+		bool allow_spin, int *cpu)
 {
 	struct slab_sheaf *empty;
 	struct node_barn *barn;
 	bool put_fail;

 restart:
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+	qpw_lockdep_assert_held(&s->cpu_sheaves->lock);

 	/* Bootstrap or debug cache, back off */
 	if (unlikely(!cache_has_sheaves(s))) {
-		local_unlock(&s->cpu_sheaves->lock);
+		qpw_unlock(&s->cpu_sheaves->lock, *cpu);
+		migrate_enable();
 		return NULL;
 	}

 	barn = get_barn(s);
 	if (!barn) {
-		local_unlock(&s->cpu_sheaves->lock);
+		qpw_unlock(&s->cpu_sheaves->lock, *cpu);
+		migrate_enable();
 		return NULL;
 	}

@@ -5607,7 +5642,8 @@ restart:
 		stat(s, BARN_PUT_FAIL);

 		pcs->spare = NULL;
-		local_unlock(&s->cpu_sheaves->lock);
+		qpw_unlock(&s->cpu_sheaves->lock, *cpu);
+		migrate_enable();

 		sheaf_flush_unused(s, to_flush);
 		empty = to_flush;
@@ -5623,7 +5659,8 @@ restart:
 	put_fail = true;

 alloc_empty:
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, *cpu);
+	migrate_enable();

 	/*
 	 * alloc_empty_sheaf() doesn't support !allow_spin and it's
@@ -5640,11 +5677,17 @@ alloc_empty:
 	if (put_fail)
 		stat(s, BARN_PUT_FAIL);

-	if (!sheaf_flush_main(s))
+	migrate_disable();
+	*cpu = smp_processor_id();
+	if (!sheaf_flush_main(s, *cpu)) {
+		migrate_enable();
 		return NULL;
+	}

-	if (!local_trylock(&s->cpu_sheaves->lock))
+	if (!qpw_trylock(&s->cpu_sheaves->lock, *cpu)) {
+		migrate_enable();
 		return NULL;
+	}

 	pcs = this_cpu_ptr(s->cpu_sheaves);

@@ -5659,13 +5702,14 @@ alloc_empty:
 	return pcs;

 got_empty:
-	if (!local_trylock(&s->cpu_sheaves->lock)) {
+	if (!qpw_trylock(&s->cpu_sheaves->lock, *cpu)) {
+		migrate_enable();
 		barn_put_empty_sheaf(barn, empty);
 		return NULL;
 	}

 	pcs = this_cpu_ptr(s->cpu_sheaves);
-	__pcs_install_empty_sheaf(s, pcs, empty, barn);
+	__pcs_install_empty_sheaf(s, pcs, empty, barn, *cpu);

 	return pcs;
 }
@@ -5678,22 +5722,28 @@ static __fastpath_inline
 bool free_to_pcs(struct kmem_cache *s, void *object, bool allow_spin)
 {
 	struct slub_percpu_sheaves *pcs;
+	int cpu;

-	if (!local_trylock(&s->cpu_sheaves->lock))
+	migrate_disable();
+	cpu = smp_processor_id();
+	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) {
+		migrate_enable();
 		return false;
+	}

 	pcs = this_cpu_ptr(s->cpu_sheaves);

 	if (unlikely(pcs->main->size == s->sheaf_capacity)) {

-		pcs = __pcs_replace_full_main(s, pcs, allow_spin);
+		pcs = __pcs_replace_full_main(s, pcs, allow_spin, &cpu);
 		if (unlikely(!pcs))
 			return false;
 	}

 	pcs->main->objects[pcs->main->size++] = object;

-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
+	migrate_enable();

 	stat(s, FREE_FASTPATH);

@@ -5777,14 +5827,19 @@ bool __kfree_rcu_sheaf(struct kmem_cache
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *rcu_sheaf;
+	int cpu;

 	if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)))
 		return false;

 	lock_map_acquire_try(&kfree_rcu_sheaf_map);

-	if (!local_trylock(&s->cpu_sheaves->lock))
+	migrate_disable();
+	cpu = smp_processor_id();
+	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) {
+		migrate_enable();
 		goto fail;
+	}

 	pcs = this_cpu_ptr(s->cpu_sheaves);

@@ -5795,7 +5850,8 @@ bool __kfree_rcu_sheaf(struct kmem_cache

 	/* Bootstrap or debug cache, fall back */
 	if (unlikely(!cache_has_sheaves(s))) {
-		local_unlock(&s->cpu_sheaves->lock);
+		qpw_unlock(&s->cpu_sheaves->lock, cpu);
+		migrate_enable();
 		goto fail;
 	}

@@ -5807,7 +5863,8 @@ bool __kfree_rcu_sheaf(struct kmem_cache

 	barn = get_barn(s);
 	if (!barn) {
-		local_unlock(&s->cpu_sheaves->lock);
+		qpw_unlock(&s->cpu_sheaves->lock, cpu);
+		migrate_enable();
 		goto fail;
 	}

@@ -5818,15 +5875,18 @@ bool __kfree_rcu_sheaf(struct kmem_cache
 		goto do_free;
 	}

-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
+	migrate_enable();

 	empty = alloc_empty_sheaf(s, GFP_NOWAIT);

 	if (!empty)
 		goto fail;

-	if (!local_trylock(&s->cpu_sheaves->lock)) {
+	migrate_disable();
+	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) {
 		barn_put_empty_sheaf(barn, empty);
+		migrate_enable();
 		goto fail;
 	}

@@ -5862,7 +5922,8 @@ do_free:
 	if (rcu_sheaf)
 		call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);

-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
+	migrate_enable();

 	stat(s, FREE_RCU_SHEAF);
 	lock_map_release(&kfree_rcu_sheaf_map);
@@ -5889,6 +5950,7 @@ static void free_to_pcs_bulk(struct kmem
 	void *remote_objects[PCS_BATCH_MAX];
 	unsigned int remote_nr = 0;
 	int node = numa_mem_id();
+	int cpu;

 next_remote_batch:
 	while (i < size) {
@@ -5918,7 +5980,9 @@ next_remote_batch:
 		goto flush_remote;

 next_batch:
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	migrate_disable();
+	cpu = smp_processor_id();
+	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu))
 		goto fallback;

 	pcs = this_cpu_ptr(s->cpu_sheaves);
@@ -5961,7 +6025,8 @@ do_free:
 	memcpy(main->objects + main->size, p, batch * sizeof(void *));
 	main->size += batch;

-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
+	migrate_enable();

 	stat_add(s, FREE_FASTPATH, batch);

@@ -5977,7 +6042,8 @@ do_free:
 	return;

 no_empty:
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
+	migrate_enable();

 	/*
 	 * if we depleted all empty sheaves in the barn or there are too
@@ -7377,7 +7443,7 @@ static int init_percpu_sheaves(struct km

 		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);

-		local_trylock_init(&pcs->lock);
+		qpw_trylock_init(&pcs->lock);

 		/*
 		 * Bootstrap sheaf has zero size so fast-path allocation fails.
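
For clarity, the caller-side pattern that the converted fast paths above
follow can be sketched as below. This is an illustrative fragment only,
condensed from the free_to_pcs() hunk; the helpers shown (qpw_trylock(),
qpw_unlock(), migrate_disable()/migrate_enable()) are the ones used in the
diff, and the error/slow paths are omitted:

	int cpu;

	/*
	 * Sketch, not part of the patch: migration is disabled so the cpu
	 * returned by smp_processor_id() stays the cpu whose "local" lock
	 * is taken, keeping this_cpu_ptr() and the qpw_trylock() /
	 * qpw_unlock() pair in agreement.
	 */
	migrate_disable();
	cpu = smp_processor_id();
	if (!qpw_trylock(&s->cpu_sheaves->lock, cpu)) {
		migrate_enable();
		return false;
	}

	pcs = this_cpu_ptr(s->cpu_sheaves);

	/* ... operate on pcs->main ... */

	qpw_unlock(&s->cpu_sheaves->lock, cpu);
	migrate_enable();

The remote side (flush_cpu_sheaves() and flush_rcu_sheaf() above) obtains
the target cpu from qpw_get_cpu() and takes the same lock via qpw_lock(),
which is why it can use per_cpu_ptr(s->cpu_sheaves, cpu) rather than
this_cpu_ptr().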