Hi all,
** Overview **
This patchset introduces NUMA-node-aware synchronous probing.
Drivers can initialize and allocate memory on the device’s local
node without scattering kmalloc_node() calls throughout the code.
NUMA-aware probing was added to PCI drivers in 2005 and has
benefited them ever since.
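As a purely illustrative example of that benefit (foo_probe() and
struct foo_ctx below are hypothetical names, not part of this series),
a probe routine today has to carry explicit node hints, whereas a probe
that runs on the device's node gets local allocations for free:

	/* Hypothetical driver context; illustrative only. */
	struct foo_ctx {
		void *buf;
	};

	static int foo_probe(struct device *dev)
	{
		struct foo_ctx *ctx;

		/*
		 * Without node-aware probing, every allocation needs an
		 * explicit node hint to stay local to the device:
		 */
		ctx = kzalloc_node(sizeof(*ctx), GFP_KERNEL, dev_to_node(dev));
		if (!ctx)
			return -ENOMEM;

		/*
		 * When the probe itself runs on a CPU of dev_to_node(dev),
		 * a plain kzalloc(sizeof(*ctx), GFP_KERNEL) already prefers
		 * memory from the local node, so the hint becomes
		 * unnecessary.
		 */
		dev_set_drvdata(dev, ctx);
		return 0;
	}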
The asynchronous probe path already supports NUMA-node-aware
probing via async_schedule_dev() in the driver core. Since NUMA
affinity is orthogonal to sync/async probing, this patchset adds
NUMA-node-aware support to the synchronous probe path.
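For reference, the node affinity on the async side comes from how
async_schedule_dev() is defined; roughly (simplified from
include/linux/async.h and kernel/async.c, details may vary by kernel
version):

	/* The async probe inherits the device's NUMA node: */
	#define async_schedule_dev(func, dev) \
		async_schedule_node(func, dev, dev_to_node(dev))

	/*
	 * async_schedule_node() ultimately queues the async entry on
	 * system_unbound_wq via queue_work_node(), so the worker that ends
	 * up running the probe prefers a CPU of dev_to_node(dev).
	 */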
** Background **
The idea arose from a discussion with Bjorn and Danilo about a
PCI-probe issue [1]:
when PCI devices on the same NUMA node are probed asynchronously,
pci_call_probe() calls work_on_cpu(), pins every probe worker to
the same CPU inside that node, and forces the probes to run serially.
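The pinning comes from roughly the following pattern in
pci_call_probe() (heavily simplified sketch, not the exact upstream
code; housekeeping-CPU handling and error paths are omitted):

	static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
				  const struct pci_device_id *id)
	{
		struct drv_dev_and_id ddi = { drv, dev, id };
		int node = dev_to_node(&dev->dev);
		int cpu = nr_cpu_ids;

		if (node >= 0 && node_online(node))
			/*
			 * cpumask_any_and() resolves to the first eligible
			 * CPU of the node in practice, so every device on
			 * that node picks the same CPU...
			 */
			cpu = cpumask_any_and(cpumask_of_node(node),
					      cpu_online_mask);

		if (cpu < nr_cpu_ids)
			/*
			 * ...and work_on_cpu() runs synchronously on that
			 * one CPU, which serializes the probes.
			 */
			return work_on_cpu(cpu, local_pci_probe, &ddi);

		return local_pci_probe(&ddi);
	}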
Testing three NVMe devices on the same NUMA node of an AMD EPYC 9A64
2.4 GHz processor shows all three probes pinned to CPU 0:
nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, probe cost: 53372612 ns
nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, probe cost: 49532941 ns
nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:3, probe cost: 47315175 ns
Since the driver core already provides NUMA-node-aware asynchronous
probing, we can extend the same capability to the synchronous probe
path. This solves the issue and lets other drivers benefit from
NUMA-local initialization as well.
[1] https://lore.kernel.org/all/20251227113326.964-1-guojinhui.liam@bytedance.com/
** Changes **
The series makes three main changes:
1. Adds helper __device_attach_driver_scan() to eliminate duplication
   between __device_attach() and __device_attach_async_helper().
2. Introduces helper __driver_probe_device_node() and uses it to enable
   NUMA-local synchronous probing in __device_attach(),
   device_driver_attach(), and __driver_attach() (a rough sketch of the
   shape such a helper could take follows this list).
3. Removes the now-redundant NUMA code from the PCI driver.
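Purely as orientation, and assuming nothing beyond what the list above
says (the real __driver_probe_device_node() is in patch 2 and may differ
in naming and mechanism; struct probe_work and probe_work_fn() are
invented here), the synchronous helper could take roughly this shape,
reusing queue_work_node() the same way the async path does:

	struct probe_work {
		struct work_struct work;
		const struct device_driver *drv;
		struct device *dev;
		int ret;
	};

	static void probe_work_fn(struct work_struct *work)
	{
		struct probe_work *pw = container_of(work, struct probe_work, work);

		pw->ret = driver_probe_device(pw->drv, pw->dev);
	}

	static int __driver_probe_device_node(const struct device_driver *drv,
					      struct device *dev)
	{
		struct probe_work pw = { .drv = drv, .dev = dev };
		int node = dev_to_node(dev);

		/* No NUMA hint, or already local: probe in the current context. */
		if (node < 0 || node == numa_node_id())
			return driver_probe_device(drv, dev);

		/* Otherwise run the probe on an unbound worker preferring @node. */
		INIT_WORK_ONSTACK(&pw.work, probe_work_fn);
		queue_work_node(node, system_unbound_wq, &pw.work);
		flush_work(&pw.work);
		destroy_work_on_stack(&pw.work);
		return pw.ret;
	}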
** Test **
I added debug prints to nvme, mlx5, usbhid, and intel_rapl_msr and
ran tests on an AMD EPYC 9A64 system (the test results below have been
updated for this version of the patchset):
NUMA topology of the test machine:
# lscpu |grep NUMA
NUMA node(s): 2
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
1. Without the patchset
- PCI drivers (nvme, mlx5) probe sequentially on CPU 0
- USB and platform drivers pick random CPUs in the udev worker
nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, cost: 54013202 ns
nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, cost: 53968911 ns
nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:4, cost: 48077276 ns
mlx5_core 0000:41:00.0: CPU: 0, COMM: kworker/0:2 cost: 506256717 ns
mlx5_core 0000:41:00.1: CPU: 0, COMM: kworker/0:2 cost: 514289394 ns
usb 1-2.4: CPU: 163, COMM: (udev-worker), cost 854131 ns
usb 1-2.6: CPU: 163, COMM: (udev-worker), cost 967993 ns
intel_rapl_msr intel_rapl_msr.0: CPU: 61, COMM: (udev-worker), cost: 3717567 ns
2. With the patchset
- PCI probes are spread across CPUs inside the device’s NUMA node
- Asynchronous nvme probes are ~35% faster; synchronous mlx5 times
  are unchanged
- USB probe times are virtually identical
- Platform driver (no NUMA node) falls back to the original path
nvme 0000:01:00.0: CPU: 3, COMM: kworker/u1025:1, cost: 34244206 ns
nvme 0000:02:00.0: CPU: 1, COMM: kworker/u1025:2, cost: 33883391 ns
nvme 0000:03:00.0: CPU: 2, COMM: kworker/u1025:3, cost: 33943040 ns
mlx5_core 0000:41:00.0: CPU: 3, COMM: kworker/u1025:1, cost: 507206174 ns
mlx5_core 0000:41:00.1: CPU: 3, COMM: kworker/u1025:1, cost: 514927642 ns
usb 1-2.4: CPU: 4, COMM: kworker/u1025:8, cost: 991417 ns
usb 1-2.6: CPU: 2, COMM: kworker/u1025:5, cost: 935112 ns
intel_rapl_msr intel_rapl_msr.0: CPU: 17, COMM: (udev-worker), cost: 4849967 ns
3. With the patchset, unbind/bind cycles also spread PCI probes across
CPUs within the device’s NUMA node:
nvme 0000:02:00.0: CPU: 130, COMM: kworker/u1025:4, cost: 35086209 ns
** Final **
Comments and suggestions are welcome.
Best Regards,
Jinhui
---
v1: https://lore.kernel.org/all/20260107175548.1792-1-guojinhui.liam@bytedance.com/
Changelog in v1 -> v2:
- Reword the first patch’s commit message for accuracy and add
Reviewed-by tags; no code changes.
- Refactor the second patch to reduce complexity: introduce
__driver_probe_device_node() and update the signature of
driver_probe_device() to support NUMA-node-aware synchronous
probing. (suggested by Danilo)
- The third patch resolves conflicts with three patches from
patchset [2] that have since been merged into linux-next.git.
- Update the test data in the cover letter for the new patchset.
[2] https://lore.kernel.org/all/20260101221359.22298-1-frederic@kernel.org/
Jinhui Guo (3):
driver core: Introduce helper function __device_attach_driver_scan()
driver core: Add NUMA-node awareness to the synchronous probe path
PCI: Clean up NUMA-node awareness in pci_bus_type probe
drivers/base/dd.c | 147 +++++++++++++++++++++++++++++----------
drivers/pci/pci-driver.c | 116 +++---------------------------
include/linux/pci.h | 4 --
kernel/sched/isolation.c | 2 -
4 files changed, 118 insertions(+), 151 deletions(-)
--
2.20.1
Jinhui Guo wrote:
[..]
> Since the driver core already provides NUMA-node-aware asynchronous
> probing, we can extend the same capability to the synchronous probe
> path. This solves the issue and lets other drivers benefit from
> NUMA-local initialization as well.

I like that from a global benefit perspective, but not necessarily from
a regression perspective. Is there a minimal fix to PCI to make its
current workqueue unbound, then if that goes well come back and move all
devices into this scheme?
On Fri Jan 23, 2026 17:04:27 -0800, Dan Williams wrote:
[..]
> I like that from a global benefit perspective, but not necessarily from
> a regression perspective. Is there a minimal fix to PCI to make its
> current workqueue unbound, then if that goes well come back and move all
> devices into this scheme?

Hi Dan,

Thank you for your time, and apologies for the delayed reply.

I understand your concerns about stability and hope for better PCI
regression handling. However, I believe introducing NUMA-node awareness
to the driver core's asynchronous probe path is the better solution:

1. The asynchronous path already uses async_schedule_dev() with
   queue_work_node() to bind workers to specific NUMA nodes; this causes
   no side effects to driver probing.

2. I initially submitted a PCI-only fix [1], but handling asynchronous
   probing in the PCI driver proved difficult. Using current_is_async()
   works but feels fragile. After discussions with Bjorn and Danilo
   [2][3], moving the solution to the driver core makes distinguishing
   async/sync probing straightforward. Testing shows minimal impact on
   synchronous probe time.

3. If you prefer a PCI-only approach, we could add a flag in struct
   device_driver (default false) that PCI sets during registration. This
   limits the new path to PCI devices while others retain existing
   behavior. The extra code is ~10 lines and can be removed once
   confidence is established.

4. I'm committed to supporting this: I'll include "Fixes:" tags for any
   fallout and provide patches within a month of any report. Since the
   logic mirrors the core async helper, risk should be low, but I'll
   take full responsibility regardless.

Please let me know if you have other concerns.

[1] https://lore.kernel.org/all/20251230142736.1168-1-guojinhui.liam@bytedance.com/
[2] https://lore.kernel.org/all/20251231165503.GA159243@bhelgaas/
[3] https://lore.kernel.org/all/DFFXIZR1AGTV.2WZ1G2JAU0HFQ@kernel.org/

Best Regards,
Jinhui
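For concreteness, the opt-in mentioned in point 3 could look roughly
like the following; the flag name is hypothetical and none of this is
taken from the series, it only illustrates the ~10 lines of plumbing:

	/*
	 * Hypothetical opt-in check in the driver core; the flag
	 * sync_probe_prefers_dev_node would be a new field in struct
	 * device_driver, set by PCI at driver registration time.
	 */
	static int driver_probe_device(const struct device_driver *drv,
				       struct device *dev)
	{
		if (drv->sync_probe_prefers_dev_node)
			return __driver_probe_device_node(drv, dev);

		/* Everyone else probes exactly as before. */
		return __driver_probe_device(drv, dev);
	}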
Jinhui Guo wrote:
[..]
> > I like that from a global benefit perspective, but not necessarily
> > from a regression perspective. Is there a minimal fix to PCI to make
> > its current workqueue unbound, then if that goes well come back and
> > move all devices into this scheme?
>
> Hi Dan,
>
> Thank you for your time, and apologies for the delayed reply.

I would not have read an earlier reply over this weekend anyway, so no
worries.

> I understand your concerns about stability and hope for better PCI
> regression handling. However, I believe introducing NUMA-node awareness
> to the driver core's asynchronous probe path is the better solution:
>
> 1. The asynchronous path already uses async_schedule_dev() with
>    queue_work_node() to bind workers to specific NUMA nodes; this
>    causes no side effects to driver probing.
>
> 2. I initially submitted a PCI-only fix [1], but handling asynchronous
>    probing in the PCI driver proved difficult. Using current_is_async()
>    works but feels fragile. After discussions with Bjorn and Danilo
>    [2][3], moving the solution to the driver core makes distinguishing
>    async/sync probing straightforward. Testing shows minimal impact on
>    synchronous probe time.
>
> 3. If you prefer a PCI-only approach, we could add a flag in struct
>    device_driver (default false) that PCI sets during registration.
>    This limits the new path to PCI devices while others retain existing
>    behavior. The extra code is ~10 lines and can be removed once
>    confidence is established.

I am open to this option.

One demonstration of how this conversion can cause odd surprises is what
it does to locking assumptions. For example, I ran into the
work_on_cpu(..., local_pci_probe...) behavior with some of the
work-in-progress confidential device work [1]. I was surprised when
lockdep_assert_held() returned false in a driver probe context.

I like that buses can opt-in to this behavior vs it being forced.
Similar to how async-behavior is handled as an opt-in.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/devsec/tsm.git/tree/drivers/base/coco.c?h=staging#n86

> 4. I'm committed to supporting this: I'll include "Fixes:" tags for any
>    fallout and provide patches within a month of any report. Since the
>    logic mirrors the core async helper, risk should be low, but I'll
>    take full responsibility regardless.

Sounds good. With the above change you can add:

Acked-by: Dan Williams <dan.j.williams@intel.com>

...and I may carve out some time to upgrade that to Reviewed-by on the
next posting.
On Mon Jan 26, 2026 17:17:49 -0800, Jinhui Guo wrote:
> Hi Dan,
>
> Thank you for your time, and apologies for the delayed reply.
>
> I understand your concerns about stability and hope for better PCI
> regression handling. However, I believe introducing NUMA-node awareness
> to the driver core's asynchronous probe path is the better solution:

"asynchronous probe path" -> "synchronous probe path"

Apologies for the typo.

Best Regards,
Jinhui