Hi,
I have decided to drive this series forward on behalf of Daniel Wagner, the
original author. The series has been rebased on v7.1-rc4-100-g8bc67e4db64a.
This series introduces a new CPU isolation feature, "isolcpus=io_queue",
designed to protect isolated cores from the disruptive hardware interrupts
generated by high-performance multi-queue devices.
When enabled, it fundamentally alters how the generic IRQ subsystem and the
block layer (blk-mq) map hardware queues:
1. Restricted IRQ Affinity: Managed hardware interrupts are strictly
confined to online housekeeping CPUs.
2. Transparent I/O Submission: Applications running on isolated CPUs
can still seamlessly submit I/O requests; however, the resulting
hardware completion interrupts are safely routed to a designated
housekeeping CPU.
3. Topology-Aware Queue Allocation: The generic CPU-to-hardware-queue
mapping logic is extended to distribute hardware contexts evenly
among the available housekeeping CPUs, preventing MSI-X vector
exhaustion while maintaining optimal cache locality where possible.
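For illustration only, the three points above can be modelled in user space
roughly as follows. This is a toy sketch and not code from the series: the
name model_map_queues(), the 64-bit mask representation and the round-robin
placement of isolated CPUs are all assumptions made for the example, and the
real mapping is topology-aware where this one is not.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_CPUS 64

/*
 * Toy user-space model, not kernel code: map every CPU to a queue index so
 * that only housekeeping CPUs "own" a queue (and would receive its IRQ).
 * hk_mask: bit n set means CPU n is a housekeeping CPU.
 * Returns the number of queues used, or 0 if there is no housekeeping CPU.
 */
static unsigned int model_map_queues(uint64_t hk_mask, unsigned int nr_cpus,
                                     unsigned int *cpu_to_queue)
{
    unsigned int cpu, nr_queues = 0, next = 0;

    assert(nr_cpus <= MODEL_MAX_CPUS);

    /* Pass 1: each housekeeping CPU gets its own queue. */
    for (cpu = 0; cpu < nr_cpus; cpu++)
        if (hk_mask & (UINT64_C(1) << cpu))
            cpu_to_queue[cpu] = nr_queues++;

    if (!nr_queues)
        return 0;

    /* Pass 2: isolated CPUs borrow a housekeeping queue, round-robin. */
    for (cpu = 0; cpu < nr_cpus; cpu++)
        if (!(hk_mask & (UINT64_C(1) << cpu)))
            cpu_to_queue[cpu] = next++ % nr_queues;

    return nr_queues;
}
```

With CPUs 0-1 as housekeeping CPUs on an 8-CPU system (hk_mask = 0x3), the
model allocates two queues; isolated CPUs 2-7 submit through them, but no
queue is owned by an isolated CPU.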
To prevent I/O stalls, the block layer is additionally hardened to reject
hot-plug requests that attempt to offline a housekeeping CPU if it is the
last remaining CPU actively serving an online isolated core.
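The hot-plug rule can be sketched in the same toy model. The helper name
model_can_offline_hk_cpu() and its parameters are hypothetical; the check in
the series operates on hardware contexts and cpumasks rather than on plain
integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy user-space model, not kernel code. Decide whether CPU 'cpu' may go
 * offline. The request is refused when 'cpu' is the last online housekeeping
 * CPU on its queue and an online isolated CPU still submits to that queue.
 */
static bool model_can_offline_hk_cpu(unsigned int cpu, uint64_t hk_mask,
                                     uint64_t online_mask,
                                     const unsigned int *cpu_to_queue,
                                     unsigned int nr_cpus)
{
    bool isolated_user = false;
    unsigned int other;

    assert(nr_cpus <= 64 && cpu < nr_cpus);

    /* Offlining an isolated CPU never strands a queue in this model. */
    if (!(hk_mask & (UINT64_C(1) << cpu)))
        return true;

    for (other = 0; other < nr_cpus; other++) {
        uint64_t bit = UINT64_C(1) << other;

        if (other == cpu || !(online_mask & bit))
            continue;
        if (cpu_to_queue[other] != cpu_to_queue[cpu])
            continue;
        if (hk_mask & bit)
            return true;        /* another online housekeeping CPU remains */
        isolated_user = true;   /* an online isolated CPU depends on 'cpu' */
    }

    return !isolated_user;
}
```

The early return for isolated CPUs mirrors the intent that only housekeeping
CPUs are subject to the restriction.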
The complex "top-down" mask plumbing introduced in v12, which modified
struct irq_affinity and expanded block layer APIs, has been abandoned. It
is replaced by a centralised approach: direct isolation querying via
housekeeping_cpumask(HK_TYPE_IO_QUEUE) within the genirq/affinity
subsystem. This simplification decouples the core changes from
driver-specific implementations.
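As a rough user-space illustration of the centralised approach, the filtering
step amounts to intersecting each spread mask with the housekeeping mask. The
function model_hk_filter() below is hypothetical, and the fallback to the
full housekeeping mask when an intersection is empty is an assumption of this
sketch rather than a statement about the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy user-space model, not kernel code: restrict every per-vector affinity
 * mask to housekeeping CPUs. If a mask contains no housekeeping CPU at all,
 * fall back to the whole housekeeping mask (assumption of this sketch).
 */
static void model_hk_filter(uint64_t *masks, size_t nr_masks, uint64_t hk_mask)
{
    size_t i;

    assert(hk_mask != 0);

    for (i = 0; i < nr_masks; i++) {
        uint64_t filtered = masks[i] & hk_mask;

        masks[i] = filtered ? filtered : hk_mask;
    }
}
```

Because the housekeeping mask is queried in one place, callers in this model
do not need to pass an isolation mask down through their own APIs.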
Please let me know your thoughts.
Changes since v13:
- Removed ineffective data_race() annotations around mask and
cpu_present_mask pointers. Wrapping the pointers failed to suppress
KCSAN warnings for the underlying inline bitmap memory accesses.
- Fixed a silent validation bypass in blk_mq_map_hw_queues() caused by
overlapping IRQ affinity masks by removing the short-circuiting
optimisation and evaluating the active_hctx bitmap in a secondary pass.
- Restored topology-aware multi-queue fallback by correctly routing
missing IRQ affinity masks to the map_software path instead of the naive
map-all fallback.
- Dropped hctx->queue->disk->disk_name from the warning to avoid a
use-after-free (UAF).
- Fixed an isolation leak where excess allocated hardware queues were
improperly padded with irq_default_affinity. Because these queues are
marked as managed, they bypassed user-space IRQ balancing; they are now
safely padded with the housekeeping mask.
- Enforced the housekeeping vector cap prior to evaluating driver-provided
calc_sets() callbacks, preventing modern multi-queue drivers from
bypassing the cap and wasting memory on dead queues.
- Introduced a safety net to the vector calculation to prevent fatal
-ENOSPC device probe aborts on heavily isolated systems where the
housekeeping CPU count is lower than the device's structural minimum.
- Removed an inaccurate claim stating that the io_queue isolation flag
takes precedence over managed_irq. Both flags are parsed, evaluated, and
enforced entirely independently by their respective subsystems.
- Linked to v13: https://lore.kernel.org/lkml/20260513005509.135966-1-atomlin@atomlin.com/
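The interaction between the housekeeping vector cap and the safety net
described in the v13 changes can be sketched as a simple clamp. The helper
model_calc_vectors() and its parameters are hypothetical and do not match the
signature of irq_calc_affinity_vectors(); the sketch only shows the ordering
of the two checks.

```c
#include <assert.h>

/*
 * Toy user-space model, not kernel code: cap the vector count at the number
 * of housekeeping CPUs, but never drop below the device's structural
 * minimum, so that a heavily isolated system can still probe the device.
 */
static unsigned int model_calc_vectors(unsigned int min_vecs,
                                       unsigned int max_vecs,
                                       unsigned int nr_hk_cpus)
{
    unsigned int vecs = max_vecs;

    assert(min_vecs <= max_vecs);

    /* Cap: vectors beyond the housekeeping CPU count cannot be routed. */
    if (vecs > nr_hk_cpus)
        vecs = nr_hk_cpus;

    /* Safety net: honour the device minimum to avoid a probe failure. */
    if (vecs < min_vecs)
        vecs = min_vecs;

    return vecs;
}
```

For example, a device asking for between 8 and 32 vectors on a system with
four housekeeping CPUs receives 8 in this model rather than failing.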
Changes since v12:
- Resolved TOCTOU race conditions against CPU hotplug events in
blk_mq_map_queues() and group_mask_cpus_evenly() by taking lockless
snapshots of the online CPU mask prior to algorithmic evaluation.
- Migrated the active_hctx tracking to a dynamically sized bitmap
(bitmap_zalloc), resolving a critical out-of-bounds memory write that
occurred when hardware queues exceeded the system CPU count.
- Wrapped the disk pointer fetch in blk_mq_hctx_can_offline_hk_cpu() with
READ_ONCE() to prevent a TOCTOU NULL pointer dereference against
concurrent device teardowns.
- Introduced bitmap_empty() checks to prevent the mapping logic from
routing unassigned CPUs into unallocated memory when all mapped CPUs are
offline, safely forcing a fallback mapping instead.
- Implemented a native two-stage distribution logic in
group_mask_cpus_evenly() that first prioritises physically present CPUs
to prevent I/O starvation before distributing remaining vectors to
non-present CPUs for hotplug safety.
- Restricted the maximum number of allocated vectors in
irq_calc_affinity_vectors() to the weight of the housekeeping mask,
preventing drivers from wasting memory on dead hardware queues that
physically cannot be routed.
- Added padding logic using irq_default_affinity for sets where isolation
constraints yield fewer masks than requested vectors, preserving the 1:1
hardware queue mapping sequence for subsequent sets.
- Fixed a logic flaw that prematurely rejected valid offline requests by
manually iterating over cpu_online_mask and reverse-mapping to
accurately detect isolated CPUs, properly permitting the offlining of
non-housekeeping CPUs.
- Corrected an absolute versus relative queue index calculation bug in
blk_mq_map_queues() that was overwriting loop iterations, by iterating
directly over the generated masks.
- Replaced scoped __free cleanups with traditional goto unwinding in the
block layer to align with subsystem styling guidelines.
- Refined the io_queue kernel command-line parameter documentation for
better clarity and precision.
- Linked to v12: https://lore.kernel.org/lkml/20260422185215.100929-1-atomlin@atomlin.com/
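The two-stage distribution mentioned in the v12 changes can be illustrated
with a deliberately simplified user-space model. The function
model_group_two_stage() is hypothetical and uses plain round-robin placement;
the real group_mask_cpus_evenly() is topology-aware, which this sketch makes
no attempt to reproduce.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy user-space model, not kernel code: spread the CPUs in 'mask' over
 * 'nr_groups' groups in two stages. Stage 0 places present CPUs first so
 * that no group is left with only absent CPUs while present ones are
 * available; stage 1 then places the remaining non-present CPUs.
 * Returns the number of non-empty groups.
 */
static unsigned int model_group_two_stage(uint64_t mask, uint64_t present_mask,
                                          unsigned int nr_groups,
                                          uint64_t *groups)
{
    uint64_t stage[2];
    unsigned int s, cpu, g, next = 0, used = 0;

    assert(nr_groups > 0);

    stage[0] = mask & present_mask;     /* present CPUs first */
    stage[1] = mask & ~present_mask;    /* then hot-pluggable CPUs */

    for (g = 0; g < nr_groups; g++)
        groups[g] = 0;

    for (s = 0; s < 2; s++)
        for (cpu = 0; cpu < 64; cpu++)
            if (stage[s] & (UINT64_C(1) << cpu))
                groups[next++ % nr_groups] |= UINT64_C(1) << cpu;

    for (g = 0; g < nr_groups; g++)
        if (groups[g])
            used++;

    return used;
}
```

Returning the count of non-empty groups lets a caller in this model detect
when fewer groups than requested could be populated.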
Changes since v11:
- Removed duplicate paragraph from the commit message in patch 11
(Marco Crivellari)
- Ensure ZERO_SIZE_PTR is not returned by group_mask_cpus_evenly()
(Marco Crivellari)
- Linked to v11: https://lore.kernel.org/lkml/20260416192942.1243421-1-atomlin@atomlin.com/
Changes since v10:
- Completely rewrote the isolcpus=io_queue documentation in
Documentation/admin-guide/kernel-parameters.txt to clarify its exclusive
application to managed IRQs, queue allocation limits, vector exhaustion
prevention, and hardware interrupt routing (Ming Lei)
- Fixed a stack frame bloat issue by avoiding the on-stack declaration of
struct cpumask (Waiman Long)
- Linked to v10: https://lore.kernel.org/linux-nvme/20260401222312.772334-1-atomlin@atomlin.com/
Changes since v9:
- Fixed a page fault regression encountered when initialising secondary
queue maps (e.g., NVMe poll queues). Restored the qmap->queue_offset to
the mq_map assignment to ensure CPUs are strictly mapped to absolute
hardware indices (Keith Busch)
- Corrected the active_hctx tracker to utilise relative queue indices,
preventing out-of-bounds mask assignments
- Fixed the blk_mq_validate() sanity check to properly evaluate absolute
queue indices against the offset-adjusted loop index
- Corrected typographical errors within block/blk-mq-cpumap.c
(Keith Busch)
- Clarified the commit message regarding the removal of the !SMP fallback
code, explicitly noting that the core scheduler now mandates SMP
unconditionally (Sebastian Andrzej Siewior)
- Added missing "Signed-off-by:" tags to properly record the patch series
chain of custody
- Linked to v9: https://lore.kernel.org/lkml/20260330221047.630206-1-atomlin@atomlin.com/
Changes since v8:
- Added "Reviewed-by:" tags
- Introduced irq_spread_hk_filter() to safely restrict managed IRQ
affinity to housekeeping CPUs (Thomas Gleixner)
- Removed the unsafe global static variable blk_hk_online_mask from
blk-mq-cpumap.c and blk-mq.c. blk_mq_online_queue_affinity() now returns
a stable pointer, delegating safe intersection to the callers to prevent
concurrent modification races (Thomas Gleixner, Hannes Reinecke)
- Resolved BUG: kernel NULL pointer dereference in __blk_mq_all_tag_iter
reported by the kernel test robot during cpuhotplug rcutorture stress
testing
- Linked to v8: https://lore.kernel.org/lkml/20250905-isolcpus-io-queues-v8-0-885984c5daca@kernel.org/
Changes since v7:
- Added commit 524f5eea4bbe ("lib/group_cpus: remove !SMP code")
- Merged the new mapping logic directly into the existing function to
avoid special casing
- Refined the group_mask_cpus_evenly() implementation with the following
updates:
- Corrected the function name typo (changed group_masks_cpus_evenly to
group_mask_cpus_evenly)
- Updated the documentation comment to accurately reflect the function's
behavior
- Renamed the cpu_mask argument to mask for consistency
- Added a new patch for aacraid to include the missing number of queues
calculation
- Restricted updates to only affect SCSI drivers that support
PCI_IRQ_AFFINITY and do not utilise nvme-fabrics
- Removed the __free cleanup attribute usage for cpumask_var_t allocations
due to compatibility issues
- Updated the documentation to explicitly highlight the limitations
surrounding CPU offlining
- Collected accumulated Reviewed-by and Acked-by tags
- Linked to v7: https://patch.msgid.link/20250702-isolcpus-io-queues-v7-0-557aa7eacce4@kernel.org
Changes since v6:
- Sent out the first part of the series independently:
https://lore.kernel.org/all/20250617-isolcpus-queue-counters-v1-0-13923686b54b@kernel.org/
- Added comprehensive kernel command-line documentation
- Added validation logic to ensure the resulting CPU-to-queue mapping is
fully operational
- Rewrote the isolcpus mapping code to properly account for active
hardware contexts (hctx)
- Introduced blk_mq_map_hk_irq_queues, which utilizes the mask retrieved
from irq_get_affinity()
- Refactored blk_mq_map_hk_queues to require the caller to explicitly test
for HK_TYPE_MANAGED_IRQ
- Linked to v6: https://patch.msgid.link/20250424-isolcpus-io-queues-v6-0-9a53a870ca1f@kernel.org
Changes since v5:
- Reintroduced the io_queue type for the isolcpus kernel parameter
- Prevented the offlining of a housekeeping CPU if an isolated CPU is
still present, upgrading this behavior from a simple warning to a hard
restriction
- Linked to v5: https://lore.kernel.org/r/20250110-isolcpus-io-queues-v5-0-0e4f118680b0@kernel.org
Changes since v4:
- Rebased the series onto the latest for-6.14/block branch.
- Updated the documentation regarding the managed_irq parameters
- Reworded the commit message for "blk-mq: issue warning when offlining
hctx with online isolcpus" for better clarity
- Split the input and output parameters in the patch "lib/group_cpus: let
group_cpu_evenly return number of groups"
- Dropped the patch "sched/isolation: document HK_TYPE housekeeping
option"
- Linked to v4: https://lore.kernel.org/r/20241217-isolcpus-io-queues-v4-0-5d355fbb1e14@kernel.org
Changes since v3:
- Added the patch "blk-mq: issue warning when offlining hctx with online
isolcpus"
- Fixed the check in group_cpus_evenly(); the condition now uses
housekeeping_enabled() instead of cpumask_weight(), because the
housekeeping mask is always valid and its weight is therefore never zero
- Dropped the Fixes: tag from "lib/group_cpus.c: honor housekeeping config
when grouping CPUs"
- Fixed an overlong line warning in the patch "scsi: use block layer
helpers to calculate num of queues"
- Dropped the patch "sched/isolation: Add io_queue housekeeping option" in
favor of simply documenting the housekeeping hk_type enum
- Added the patch "lib/group_cpus: let group_cpu_evenly return number of
groups"
- Collected accumulated Reviewed-by and Acked-by tags
- Split the patchset by moving foundational changes into a separate
preparation series:
https://lore.kernel.org/linux-nvme/20241202-refactor-blk-affinity-helpers-v6-0-27211e9c2cd5@kernel.org/
- Linked to v3: https://lore.kernel.org/r/20240806-isolcpus-io-queues-v3-0-da0eecfeaf8b@suse.de
Changes since v2:
- Integrated patches from Ming Lei
(https://lore.kernel.org/all/20210709081005.421340-1-ming.lei@redhat.com/):
"virtio: add APIs for retrieving vq affinity" and "blk-mq: introduce
blk_mq_dev_map_queues"
- Replaced all instances of blk_mq_pci_map_queues and
blk_mq_virtio_map_queues with the new unified blk_mq_dev_map_queues
- Updated and expanded the helper functions used for calculating the
number of queues
- Added the CPU-to-hctx mapping function specifically to support the
isolcpus=io_queue parameter
- Documented the hk_type enum and the newly introduced isolcpus=io_queue
parameter
- Added the patch "scsi: pm8001: do not overwrite PCI queue mapping"
- Linked to v2: https://lore.kernel.org/r/20240627-isolcpus-io-queues-v2-0-26a32e3c4f75@suse.de
Changes since v1:
- Updated the feature documentation for clarity and completeness
- Split the blk/nvme-pci patch into smaller, logical commits
- Dropped the HK_TYPE_IO_QUEUE macro in favor of reusing
HK_TYPE_MANAGED_IRQ
- Linked to v1: https://lore.kernel.org/r/20240621-isolcpus-io-queues-v1-0-8b169bf41083@suse.de
Aaron Tomlin (1):
genirq/affinity: Restrict managed IRQ affinity to housekeeping CPUs
Daniel Wagner (7):
scsi: aacraid: use block layer helpers to calculate num of queues
lib/group_cpus: remove dead !SMP code
lib/group_cpus: Add group_mask_cpus_evenly()
isolation: Introduce io_queue isolcpus type
blk-mq: use hk cpus only when isolcpus=io_queue is enabled
blk-mq: prevent offlining hk CPUs with associated online isolated CPUs
docs: add io_queue flag to isolcpus
.../admin-guide/kernel-parameters.txt | 26 +-
block/blk-mq-cpumap.c | 233 ++++++++++++++++--
block/blk-mq.c | 49 ++++
drivers/scsi/aacraid/comminit.c | 3 +-
include/linux/group_cpus.h | 3 +
include/linux/sched/isolation.h | 1 +
kernel/irq/affinity.c | 31 ++-
kernel/sched/isolation.c | 7 +
lib/group_cpus.c | 108 +++++++-
9 files changed, 422 insertions(+), 39 deletions(-)
base-commit: 8bc67e4db64aa72732c474b44ea8622062c903f0
--
2.51.0