Hi all,
** Overview **
This patchset introduces NUMA-node-aware synchronous probing.
Drivers can initialize and allocate memory on the device’s local
node without scattering kmalloc_node() calls throughout the code.
NUMA-aware probing was added to PCI drivers in 2005 and has
benefited them ever since.
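As a purely illustrative example of that benefit (foo_probe() and
struct foo_ctx below are hypothetical names, not part of this series),
a probe routine today has to carry explicit node hints, whereas a probe
that runs on the device's node gets local allocations for free:

	/* Hypothetical driver context; illustrative only. */
	struct foo_ctx {
		void *buf;
	};

	static int foo_probe(struct device *dev)
	{
		struct foo_ctx *ctx;

		/*
		 * Without node-aware probing, every allocation needs an
		 * explicit node hint to stay local to the device:
		 */
		ctx = kzalloc_node(sizeof(*ctx), GFP_KERNEL, dev_to_node(dev));
		if (!ctx)
			return -ENOMEM;

		/*
		 * When the probe itself runs on a CPU of dev_to_node(dev),
		 * a plain kzalloc(sizeof(*ctx), GFP_KERNEL) already prefers
		 * memory from the local node, so the hint becomes
		 * unnecessary.
		 */
		dev_set_drvdata(dev, ctx);
		return 0;
	}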
The asynchronous probe path already supports NUMA-node-aware
probing via async_schedule_dev() in the driver core. Since NUMA
affinity is orthogonal to sync/async probing, this patchset adds
NUMA-node-aware support to the synchronous probe path.
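For reference, the node affinity on the async side comes from how
async_schedule_dev() is defined; roughly (simplified from
include/linux/async.h and kernel/async.c, details may vary by kernel
version):

	/* The async probe inherits the device's NUMA node: */
	#define async_schedule_dev(func, dev) \
		async_schedule_node(func, dev, dev_to_node(dev))

	/*
	 * async_schedule_node() ultimately queues the async entry on
	 * system_unbound_wq via queue_work_node(), so the worker that ends
	 * up running the probe prefers a CPU of dev_to_node(dev).
	 */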
** Background **
The idea arose from a discussion with Bjorn and Danilo about a
PCI-probe issue [1]:
when PCI devices on the same NUMA node are probed asynchronously,
pci_call_probe() calls work_on_cpu(), pins every probe worker to
the same CPU inside that node, and forces the probes to run serially.
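The pinning comes from roughly the following pattern in
pci_call_probe() (heavily simplified sketch, not the exact upstream
code; housekeeping-CPU handling and error paths are omitted):

	static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
				  const struct pci_device_id *id)
	{
		struct drv_dev_and_id ddi = { drv, dev, id };
		int node = dev_to_node(&dev->dev);
		int cpu = nr_cpu_ids;

		if (node >= 0 && node_online(node))
			/*
			 * cpumask_any_and() resolves to the first eligible
			 * CPU of the node in practice, so every device on
			 * that node picks the same CPU...
			 */
			cpu = cpumask_any_and(cpumask_of_node(node),
					      cpu_online_mask);

		if (cpu < nr_cpu_ids)
			/*
			 * ...and work_on_cpu() runs synchronously on that
			 * one CPU, which serializes the probes.
			 */
			return work_on_cpu(cpu, local_pci_probe, &ddi);

		return local_pci_probe(&ddi);
	}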
Testing three NVMe devices on the same NUMA node of an AMD EPYC 9A64
2.4 GHz processor shows all three probes pinned to CPU 0:
nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, probe cost: 53372612 ns
nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, probe cost: 49532941 ns
nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:3, probe cost: 47315175 ns
Since the driver core already provides NUMA-node-aware asynchronous
probing, we can extend the same capability to the synchronous probe
path. This solves the issue and lets other drivers benefit from
NUMA-local initialization as well.
[1] https://lore.kernel.org/all/20251227113326.964-1-guojinhui.liam@bytedance.com/
** Changes **
The series makes three main changes:
1. Adds helper __device_attach_driver_scan() to eliminate duplication
   between __device_attach() and __device_attach_async_helper().
2. Introduces helper __driver_probe_device_node() and uses it to enable
   NUMA-local synchronous probing in __device_attach(),
   device_driver_attach(), and __driver_attach() (a rough sketch of the
   shape such a helper could take follows this list).
3. Removes the now-redundant NUMA code from the PCI driver.
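Purely as orientation, and assuming nothing beyond what the list above
says (the real __driver_probe_device_node() is in patch 2 and may differ
in naming and mechanism; struct probe_work and probe_work_fn() are
invented here), the synchronous helper could take roughly this shape,
reusing queue_work_node() the same way the async path does:

	struct probe_work {
		struct work_struct work;
		const struct device_driver *drv;
		struct device *dev;
		int ret;
	};

	static void probe_work_fn(struct work_struct *work)
	{
		struct probe_work *pw = container_of(work, struct probe_work, work);

		pw->ret = driver_probe_device(pw->drv, pw->dev);
	}

	static int __driver_probe_device_node(const struct device_driver *drv,
					      struct device *dev)
	{
		struct probe_work pw = { .drv = drv, .dev = dev };
		int node = dev_to_node(dev);

		/* No NUMA hint, or already local: probe in the current context. */
		if (node < 0 || node == numa_node_id())
			return driver_probe_device(drv, dev);

		/* Otherwise run the probe on an unbound worker preferring @node. */
		INIT_WORK_ONSTACK(&pw.work, probe_work_fn);
		queue_work_node(node, system_unbound_wq, &pw.work);
		flush_work(&pw.work);
		destroy_work_on_stack(&pw.work);
		return pw.ret;
	}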
** Test **
I added debug prints to nvme, mlx5, usbhid, and intel_rapl_msr and
ran tests on an AMD EPYC 9A64 system (the test results below have been
updated for this version of the patchset):
NUMA topology of the test machine:
# lscpu |grep NUMA
NUMA node(s): 2
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
1. Without the patchset
- PCI drivers (nvme, mlx5) probe sequentially on CPU 0
- USB and platform drivers pick random CPUs in the udev worker
nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, cost: 54013202 ns
nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, cost: 53968911 ns
nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:4, cost: 48077276 ns
mlx5_core 0000:41:00.0: CPU: 0, COMM: kworker/0:2 cost: 506256717 ns
mlx5_core 0000:41:00.1: CPU: 0, COMM: kworker/0:2 cost: 514289394 ns
usb 1-2.4: CPU: 163, COMM: (udev-worker), cost 854131 ns
usb 1-2.6: CPU: 163, COMM: (udev-worker), cost 967993 ns
intel_rapl_msr intel_rapl_msr.0: CPU: 61, COMM: (udev-worker), cost: 3717567 ns
2. With the patchset
- PCI probes are spread across CPUs inside the device’s NUMA node
- Asynchronous nvme probes are ~35% faster; synchronous mlx5 times
  are unchanged
- USB probe times are virtually identical
- Platform driver (no NUMA node) falls back to the original path
nvme 0000:01:00.0: CPU: 3, COMM: kworker/u1025:1, cost: 34244206 ns
nvme 0000:02:00.0: CPU: 1, COMM: kworker/u1025:2, cost: 33883391 ns
nvme 0000:03:00.0: CPU: 2, COMM: kworker/u1025:3, cost: 33943040 ns
mlx5_core 0000:41:00.0: CPU: 3, COMM: kworker/u1025:1, cost: 507206174 ns
mlx5_core 0000:41:00.1: CPU: 3, COMM: kworker/u1025:1, cost: 514927642 ns
usb 1-2.4: CPU: 4, COMM: kworker/u1025:8, cost: 991417 ns
usb 1-2.6: CPU: 2, COMM: kworker/u1025:5, cost: 935112 ns
intel_rapl_msr intel_rapl_msr.0: CPU: 17, COMM: (udev-worker), cost: 4849967 ns
3. With the patchset, unbind/bind cycles also spread PCI probes across
CPUs within the device’s NUMA node:
nvme 0000:02:00.0: CPU: 130, COMM: kworker/u1025:4, cost: 35086209 ns
** Final **
Comments and suggestions are welcome.
Best Regards,
Jinhui
---
v1: https://lore.kernel.org/all/20260107175548.1792-1-guojinhui.liam@bytedance.com/
Changelog in v1 -> v2:
- Reword the first patch’s commit message for accuracy and add
Reviewed-by tags; no code changes.
- Refactor the second patch to reduce complexity: introduce
__driver_probe_device_node() and update the signature of
driver_probe_device() to support NUMA-node-aware synchronous
probing. (suggested by Danilo)
- The third patch resolves conflicts with three patches from
patchset [2] that have since been merged into linux-next.git.
- Update the test data in the cover letter for the new patchset.
[2] https://lore.kernel.org/all/20260101221359.22298-1-frederic@kernel.org/
Jinhui Guo (3):
driver core: Introduce helper function __device_attach_driver_scan()
driver core: Add NUMA-node awareness to the synchronous probe path
PCI: Clean up NUMA-node awareness in pci_bus_type probe
drivers/base/dd.c | 147 +++++++++++++++++++++++++++++----------
drivers/pci/pci-driver.c | 116 +++---------------------------
include/linux/pci.h | 4 --
kernel/sched/isolation.c | 2 -
4 files changed, 118 insertions(+), 151 deletions(-)
--
2.20.1
Jinhui Guo wrote:
[..]
> Since the driver core already provides NUMA-node-aware asynchronous
> probing, we can extend the same capability to the synchronous probe
> path. This solves the issue and lets other drivers benefit from
> NUMA-local initialization as well.

I like that from a global benefit perspective, but not necessarily from
a regression perspective. Is there a minimal fix to PCI to make its
current workqueue unbound, then if that goes well come back and move all
devices into this scheme?
On Fri Jan 23, 2026 17:04:27 -0800, Dan Williams wrote:
[..]
> I like that from a global benefit perspective, but not necessarily from
> a regression perspective. Is there a minimal fix to PCI to make its
> current workqueue unbound, then if that goes well come back and move all
> devices into this scheme?

Hi Dan,

Thank you for your time, and apologies for the delayed reply.

I understand your concerns about stability and hope for better PCI
regression handling. However, I believe introducing NUMA-node awareness
to the driver core's asynchronous probe path is the better solution:

1. The asynchronous path already uses async_schedule_dev() with
   queue_work_node() to bind workers to specific NUMA nodes; this causes
   no side effects to driver probing.

2. I initially submitted a PCI-only fix [1], but handling asynchronous
   probing in the PCI driver proved difficult. Using current_is_async()
   works but feels fragile. After discussions with Bjorn and Danilo
   [2][3], moving the solution to the driver core makes distinguishing
   async/sync probing straightforward. Testing shows minimal impact on
   synchronous probe time.

3. If you prefer a PCI-only approach, we could add a flag in struct
   device_driver (default false) that PCI sets during registration. This
   limits the new path to PCI devices while others retain existing
   behavior. The extra code is ~10 lines and can be removed once
   confidence is established.

4. I'm committed to supporting this: I'll include "Fixes:" tags for any
   fallout and provide patches within a month of any report. Since the
   logic mirrors the core async helper, risk should be low, but I'll
   take full responsibility regardless.

Please let me know if you have other concerns.

[1] https://lore.kernel.org/all/20251230142736.1168-1-guojinhui.liam@bytedance.com/
[2] https://lore.kernel.org/all/20251231165503.GA159243@bhelgaas/
[3] https://lore.kernel.org/all/DFFXIZR1AGTV.2WZ1G2JAU0HFQ@kernel.org/

Best Regards,
Jinhui
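For concreteness, the opt-in mentioned in point 3 could look roughly
like the following; the flag name is hypothetical and none of this is
taken from the series, it only illustrates the ~10 lines of plumbing:

	/*
	 * Hypothetical opt-in check in the driver core; the flag
	 * sync_probe_prefers_dev_node would be a new field in struct
	 * device_driver, set by PCI at driver registration time.
	 */
	static int driver_probe_device(const struct device_driver *drv,
				       struct device *dev)
	{
		if (drv->sync_probe_prefers_dev_node)
			return __driver_probe_device_node(drv, dev);

		/* Everyone else probes exactly as before. */
		return __driver_probe_device(drv, dev);
	}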
Jinhui Guo wrote:
[..]
> > I like that from a global benefit perspective, but not necessarily
> > from a regression perspective. Is there a minimal fix to PCI to make
> > its current workqueue unbound, then if that goes well come back and
> > move all devices into this scheme?
>
> Hi Dan,
>
> Thank you for your time, and apologies for the delayed reply.

I would not have read an earlier reply over this weekend anyway, so no
worries.

> I understand your concerns about stability and hope for better PCI
> regression handling. However, I believe introducing NUMA-node awareness
> to the driver core's asynchronous probe path is the better solution:
>
> 1. The asynchronous path already uses async_schedule_dev() with
>    queue_work_node() to bind workers to specific NUMA nodes; this
>    causes no side effects to driver probing.
>
> 2. I initially submitted a PCI-only fix [1], but handling asynchronous
>    probing in the PCI driver proved difficult. Using current_is_async()
>    works but feels fragile. After discussions with Bjorn and Danilo
>    [2][3], moving the solution to the driver core makes distinguishing
>    async/sync probing straightforward. Testing shows minimal impact on
>    synchronous probe time.
>
> 3. If you prefer a PCI-only approach, we could add a flag in struct
>    device_driver (default false) that PCI sets during registration.
>    This limits the new path to PCI devices while others retain existing
>    behavior. The extra code is ~10 lines and can be removed once
>    confidence is established.

I am open to this option.

One demonstration of how this conversion can cause odd surprises is what
it does to locking assumptions. For example, I ran into the
work_on_cpu(..., local_pci_probe...) behavior with some of the
work-in-progress confidential device work [1]. I was surprised when
lockdep_assert_held() returned false in a driver probe context.

I like that buses can opt-in to this behavior vs it being forced.
Similar to how async-behavior is handled as an opt-in.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/devsec/tsm.git/tree/drivers/base/coco.c?h=staging#n86

> 4. I'm committed to supporting this: I'll include "Fixes:" tags for any
>    fallout and provide patches within a month of any report. Since the
>    logic mirrors the core async helper, risk should be low, but I'll
>    take full responsibility regardless.

Sounds good. With the above change you can add:

Acked-by: Dan Williams <dan.j.williams@intel.com>

...and I may carve out some time to upgrade that to Reviewed-by on the
next posting.
On Mon Jan 26, 2026 17:17:49 -0800, Jinhui Guo wrote:
> Hi Dan,
>
> Thank you for your time, and apologies for the delayed reply.
>
> I understand your concerns about stability and hope for better PCI
> regression handling. However, I believe introducing NUMA-node awareness
> to the driver core's asynchronous probe path is the better solution:

"asynchronous probe path" -> "synchronous probe path"

Apologies for the typo.

Best Regards,
Jinhui