Commit ef0ff68351be ("driver core: Probe devices asynchronously instead of
the driver") speeds up the loading of large numbers of device drivers by
submitting asynchronous probe workers to an unbounded workqueue and binding
each worker to the CPU near the device’s NUMA node. These workers are not
scheduled on isolated CPUs because their cpumask is restricted to
housekeeping_cpumask(HK_TYPE_WQ) and housekeeping_cpumask(HK_TYPE_DOMAIN).
However, when PCI devices reside on the same NUMA node, all their
drivers’ probe workers are bound to the same CPU within that node, so the
probes run serially instead of in parallel, because pci_call_probe() invokes
work_on_cpu(). Introduced by commit 873392ca514f ("PCI: work_on_cpu: use
in drivers/pci/pci-driver.c"), work_on_cpu() queues a worker on
system_percpu_wq to bind the probe thread to the first CPU in the
device’s NUMA node (chosen via cpumask_any_and() in pci_call_probe()).
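For context, the CPU chosen there is deterministic: assuming current mainline
<linux/cpumask.h>, cpumask_any_and() is simply an alias for cpumask_first_and(),
so every device on the node resolves to the lowest-numbered housekeeping CPU of
that node. A minimal sketch of the existing selection:

    /*
     * Simplified from pci_call_probe(); wq_domain_mask is the housekeeping
     * (HK_TYPE_WQ & HK_TYPE_DOMAIN) mask built just above this call.
     */
    cpu = cpumask_any_and(cpumask_of_node(node), wq_domain_mask);
    /*
     * Assumption: cpumask_any_and() expands to cpumask_first_and(), so for a
     * fixed node this always yields the same CPU (e.g. CPU 0 in the test below).
     */
    error = work_on_cpu(cpu, local_pci_probe, &ddi);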
1. The function __driver_attach() submits an asynchronous worker with
callback __driver_attach_async_helper().
__driver_attach()
  async_schedule_dev(__driver_attach_async_helper, dev)
    async_schedule_node(func, dev, dev_to_node(dev))
      async_schedule_node_domain(func, data, node, &async_dfl_domain)
        __async_schedule_node_domain(func, data, node, domain, entry)
          queue_work_node(node, async_wq, &entry->work)
2. The asynchronous probe worker ultimately calls work_on_cpu() in
pci_call_probe(), binding the worker to the same CPU within the
device’s NUMA node.
__driver_attach_async_helper()
  driver_probe_device(drv, dev)
    __driver_probe_device(drv, dev)
      really_probe(dev, drv)
        call_driver_probe(dev, drv)
          dev->bus->probe(dev)
            pci_device_probe(dev)
              __pci_device_probe(drv, pci_dev)
                pci_call_probe(drv, pci_dev, id)
                  cpu = cpumask_any_and(cpumask_of_node(node), wq_domain_mask)
                  error = work_on_cpu(cpu, local_pci_probe, &ddi)
                    schedule_work_on(cpu, &wfc.work);
                      queue_work_on(cpu, system_percpu_wq, work)
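For reference, work_on_cpu() is roughly the following (simplified from
kernel/workqueue.c; the exact mainline version goes through work_on_cpu_key()
with lockdep annotations): it queues an on-stack work item on the per-CPU
system workqueue of the chosen CPU and waits for it, so the probe always runs
on a kworker bound to that single CPU:

    struct work_for_cpu {
        struct work_struct work;
        long (*fn)(void *);
        void *arg;
        long ret;
    };

    static void work_for_cpu_fn(struct work_struct *work)
    {
        struct work_for_cpu *wfc = container_of(work, struct work_for_cpu, work);

        wfc->ret = wfc->fn(wfc->arg);
    }

    long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
    {
        struct work_for_cpu wfc = { .fn = fn, .arg = arg };

        INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
        /* schedule_work_on() queues on the per-CPU system workqueue */
        schedule_work_on(cpu, &wfc.work);
        flush_work(&wfc.work);
        destroy_work_on_stack(&wfc.work);
        return wfc.ret;
    }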
To fix the issue, pci_call_probe() must not call work_on_cpu() when it is
already running inside an unbounded asynchronous worker. Because a driver
can be probed asynchronously either by probe_type or by the kernel command
line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone. Instead, we test
the PF_WQ_WORKER flag in current->flags; if it is set, pci_call_probe() is
executing within an unbounded workqueue worker and should skip the extra
work_on_cpu() call.
Testing three NVMe devices on the same NUMA node of an AMD EPYC 9A64
2.4 GHz processor shows a 35 % probe-time improvement with the patch:
Before (all on CPU 0):
nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, probe cost: 53372612 ns
nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, probe cost: 49532941 ns
nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:3, probe cost: 47315175 ns
After (spread across CPUs 1, 2, 5):
nvme 0000:01:00.0: CPU: 5, COMM: kworker/u1025:5, probe cost: 34765890 ns
nvme 0000:02:00.0: CPU: 1, COMM: kworker/u1025:2, probe cost: 34696433 ns
nvme 0000:03:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 33233323 ns
The improvement grows with more PCI devices because fewer probes contend
for the same CPU.
Fixes: ef0ff68351be ("driver core: Probe devices asynchronously instead of the driver")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
drivers/pci/pci-driver.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 7c2d9d596258..4bc47a84d330 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -366,9 +366,11 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
/*
* Prevent nesting work_on_cpu() for the case where a Virtual Function
* device is probed from work_on_cpu() of the Physical device.
+ * Check PF_WQ_WORKER to prevent invoking work_on_cpu() in an asynchronous
+ * probe worker when the driver allows asynchronous probing.
*/
if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
- pci_physfn_is_probed(dev)) {
+ pci_physfn_is_probed(dev) || (current->flags & PF_WQ_WORKER)) {
cpu = nr_cpu_ids;
} else {
cpumask_var_t wq_domain_mask;
--
2.20.1
On Sat, Dec 27, 2025 at 07:33:26PM +0800, Jinhui Guo wrote:
> To fix the issue, pci_call_probe() must not call work_on_cpu() when it is
> already running inside an unbounded asynchronous worker. Because a driver
> can be probed asynchronously either by probe_type or by the kernel command
> line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone. Instead, we test
> the PF_WQ_WORKER flag in current->flags; if it is set, pci_call_probe() is
> executing within an unbounded workqueue worker and should skip the extra
> work_on_cpu() call.

Why not just use queue_work_on() on system_dfl_wq (or any other unbound
workqueue)? Those are soft-affine to cache domain but can overflow to other
CPUs?

Thanks.

--
tejun
On Mon, Dec 29, 2025 at 08:08:57AM -1000, Tejun Heo wrote:
> On Sat, Dec 27, 2025 at 07:33:26PM +0800, Jinhui Guo wrote:
> > To fix the issue, pci_call_probe() must not call work_on_cpu() when it is
> > already running inside an unbounded asynchronous worker. Because a driver
> > can be probed asynchronously either by probe_type or by the kernel command
> > line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone. Instead, we test
> > the PF_WQ_WORKER flag in current->flags; if it is set, pci_call_probe() is
> > executing within an unbounded workqueue worker and should skip the extra
> > work_on_cpu() call.
>
> Why not just use queue_work_on() on system_dfl_wq (or any other unbound
> workqueue)? Those are soft-affine to cache domain but can overflow to other
> CPUs?
Hi, tejun,
Thank you for your time and helpful suggestions.
I had considered replacing work_on_cpu() with queue_work_on(system_dfl_wq) +
flush_work(), but that would be a refactor rather than a fix for the specific
problem we hit.
Let me restate the issue:
1. With PROBE_PREFER_ASYNCHRONOUS enabled, the driver core queues work on
async_wq to speed up driver probe.
2. The PCI core then calls work_on_cpu() to tie the probe thread to the PCI
device’s NUMA node, but it always picks the same CPU for every device on
that node, forcing the PCI probes to run serially.
Therefore I test current->flags & PF_WQ_WORKER to detect that we are already
inside an async_wq worker and skip the extra nested work queue.
I agree with your point—using queue_work_on(system_dfl_wq) + flush_work()
would be cleaner and would let different vendors’ drivers probe in parallel
instead of fighting over the same CPU. I’ve prepared and tested another patch,
but I’m still unsure it’s the better approach; any further suggestions would
be greatly appreciated.
Test results for that patch:
nvme 0000:01:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 34904955 ns
nvme 0000:02:00.0: CPU: 134, COMM: kworker/u1025:1, probe cost: 34774235 ns
nvme 0000:03:00.0: CPU: 1, COMM: kworker/u1025:4, probe cost: 34573054 ns
Key changes in the patch:
1. Keep the current->flags & PF_WQ_WORKER test to avoid nested workers.
2. Replace work_on_cpu() with queue_work_node(system_dfl_wq) + flush_work()
to enable parallel probing when PROBE_PREFER_ASYNCHRONOUS is disabled.
3. Remove all cpumask operations.
4. Drop cpu_hotplug_disable() since both cpumask manipulation and work_on_cpu()
are gone.
The patch is shown below.
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 7c2d9d5962586..e66a67c48f28d 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -347,10 +347,24 @@ static bool pci_physfn_is_probed(struct pci_dev *dev)
#endif
}
+struct pci_probe_work {
+ struct work_struct work;
+ struct drv_dev_and_id ddi;
+ int result;
+};
+
+static void pci_probe_work_func(struct work_struct *work)
+{
+ struct pci_probe_work *pw = container_of(work, struct pci_probe_work, work);
+
+ pw->result = local_pci_probe(&pw->ddi);
+}
+
static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
const struct pci_device_id *id)
{
int error, node, cpu;
+ struct pci_probe_work pw;
struct drv_dev_and_id ddi = { drv, dev, id };
/*
@@ -361,38 +375,25 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
node = dev_to_node(&dev->dev);
dev->is_probed = 1;
- cpu_hotplug_disable();
-
/*
* Prevent nesting work_on_cpu() for the case where a Virtual Function
* device is probed from work_on_cpu() of the Physical device.
*/
if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
- pci_physfn_is_probed(dev)) {
- cpu = nr_cpu_ids;
- } else {
- cpumask_var_t wq_domain_mask;
-
- if (!zalloc_cpumask_var(&wq_domain_mask, GFP_KERNEL)) {
- error = -ENOMEM;
- goto out;
- }
- cpumask_and(wq_domain_mask,
- housekeeping_cpumask(HK_TYPE_WQ),
- housekeeping_cpumask(HK_TYPE_DOMAIN));
-
- cpu = cpumask_any_and(cpumask_of_node(node),
- wq_domain_mask);
- free_cpumask_var(wq_domain_mask);
+ pci_physfn_is_probed(dev) || (current->flags & PF_WQ_WORKER)) {
+ error = local_pci_probe(&ddi);
+ goto out;
}
- if (cpu < nr_cpu_ids)
- error = work_on_cpu(cpu, local_pci_probe, &ddi);
- else
- error = local_pci_probe(&ddi);
+ INIT_WORK_ONSTACK(&pw.work, pci_probe_work_func);
+ pw.ddi = ddi;
+ queue_work_node(node, system_dfl_wq, &pw.work);
+ flush_work(&pw.work);
+ error = pw.result;
+ destroy_work_on_stack(&pw.work);
+
out:
dev->is_probed = 0;
- cpu_hotplug_enable();
return error;
}
Best Regards,
Jinhui
On Tue, Dec 30, 2025 at 10:27:36PM +0800, Jinhui Guo wrote:
> On Mon, Dec 29, 2025 at 08:08:57AM -1000, Tejun Heo wrote:
> > On Sat, Dec 27, 2025 at 07:33:26PM +0800, Jinhui Guo wrote:
> > > To fix the issue, pci_call_probe() must not call work_on_cpu() when it is
> > > already running inside an unbounded asynchronous worker. Because a driver
> > > can be probed asynchronously either by probe_type or by the kernel command
> > > line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone. Instead, we test
> > > the PF_WQ_WORKER flag in current->flags; if it is set, pci_call_probe() is
> > > executing within an unbounded workqueue worker and should skip the extra
> > > work_on_cpu() call.
> >
> > Why not just use queue_work_on() on system_dfl_wq (or any other unbound
> > workqueue)? Those are soft-affine to cache domain but can overflow to other
> > CPUs?
>
> Hi, tejun,
>
> Thank you for your time and helpful suggestions.
> I had considered replacing work_on_cpu() with queue_work_on(system_dfl_wq) +
> flush_work(), but that would be a refactor rather than a fix for the specific
> problem we hit.
>
> Let me restate the issue:
>
> 1. With PROBE_PREFER_ASYNCHRONOUS enabled, the driver core queues work on
> async_wq to speed up driver probe.
> 2. The PCI core then calls work_on_cpu() to tie the probe thread to the PCI
> device’s NUMA node, but it always picks the same CPU for every device on
> that node, forcing the PCI probes to run serially.
>
> Therefore I test current->flags & PF_WQ_WORKER to detect that we are already
> inside an async_wq worker and skip the extra nested work queue.
>
> I agree with your point—using queue_work_on(system_dfl_wq) + flush_work()
> would be cleaner and would let different vendors’ drivers probe in parallel
> instead of fighting over the same CPU. I’ve prepared and tested another patch,
> but I’m still unsure it’s the better approach; any further suggestions would
> be greatly appreciated.
>
> Test results for that patch:
> nvme 0000:01:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 34904955 ns
> nvme 0000:02:00.0: CPU: 134, COMM: kworker/u1025:1, probe cost: 34774235 ns
> nvme 0000:03:00.0: CPU: 1, COMM: kworker/u1025:4, probe cost: 34573054 ns
>
> Key changes in the patch:
>
> 1. Keep the current->flags & PF_WQ_WORKER test to avoid nested workers.
> 2. Replace work_on_cpu() with queue_work_node(system_dfl_wq) + flush_work()
> to enable parallel probing when PROBE_PREFER_ASYNCHRONOUS is disabled.
> 3. Remove all cpumask operations.
> 4. Drop cpu_hotplug_disable() since both cpumask manipulation and work_on_cpu()
> are gone.
>
> The patch is shown below.
I love this patch because it makes pci_call_probe() so much simpler.
I *would* like a short higher-level description of the issue that
doesn't assume so much workqueue background.
I'm not an expert, but IIUC __driver_attach() schedules async workers
so driver probes can run in parallel, but the problem is that the
workers for devices on node X are currently serialized because they
all bind to the same CPU on that node.
Naive questions: It looks like async_schedule_dev() already schedules
an async worker on the device node, so why does pci_call_probe() need
to use queue_work_node() again?
pci_call_probe() dates to 2005 (d42c69972b85 ("[PATCH] PCI: Run PCI
driver initialization on local node")), but the async_schedule_dev()
looks like it was only added in 2019 (c37e20eaf4b2 ("driver core:
Attach devices on CPU local to device node")). Maybe the
pci_call_probe() node awareness is no longer necessary?
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 7c2d9d5962586..e66a67c48f28d 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -347,10 +347,24 @@ static bool pci_physfn_is_probed(struct pci_dev *dev)
> #endif
> }
>
> +struct pci_probe_work {
> + struct work_struct work;
> + struct drv_dev_and_id ddi;
> + int result;
> +};
> +
> +static void pci_probe_work_func(struct work_struct *work)
> +{
> + struct pci_probe_work *pw = container_of(work, struct pci_probe_work, work);
> +
> + pw->result = local_pci_probe(&pw->ddi);
> +}
> +
> static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
> const struct pci_device_id *id)
> {
> int error, node, cpu;
> + struct pci_probe_work pw;
> struct drv_dev_and_id ddi = { drv, dev, id };
>
> /*
> @@ -361,38 +375,25 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
> node = dev_to_node(&dev->dev);
> dev->is_probed = 1;
>
> - cpu_hotplug_disable();
> -
> /*
> * Prevent nesting work_on_cpu() for the case where a Virtual Function
> * device is probed from work_on_cpu() of the Physical device.
> */
> if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
> - pci_physfn_is_probed(dev)) {
> - cpu = nr_cpu_ids;
> - } else {
> - cpumask_var_t wq_domain_mask;
> -
> - if (!zalloc_cpumask_var(&wq_domain_mask, GFP_KERNEL)) {
> - error = -ENOMEM;
> - goto out;
> - }
> - cpumask_and(wq_domain_mask,
> - housekeeping_cpumask(HK_TYPE_WQ),
> - housekeeping_cpumask(HK_TYPE_DOMAIN));
> -
> - cpu = cpumask_any_and(cpumask_of_node(node),
> - wq_domain_mask);
> - free_cpumask_var(wq_domain_mask);
> + pci_physfn_is_probed(dev) || (current->flags & PF_WQ_WORKER)) {
> + error = local_pci_probe(&ddi);
> + goto out;
> }
>
> - if (cpu < nr_cpu_ids)
> - error = work_on_cpu(cpu, local_pci_probe, &ddi);
> - else
> - error = local_pci_probe(&ddi);
> + INIT_WORK_ONSTACK(&pw.work, pci_probe_work_func);
> + pw.ddi = ddi;
> + queue_work_node(node, system_dfl_wq, &pw.work);
> + flush_work(&pw.work);
> + error = pw.result;
> + destroy_work_on_stack(&pw.work);
> +
> out:
> dev->is_probed = 0;
> - cpu_hotplug_enable();
> return error;
> }
>
>
> Best Regards,
> Jinhui
On Tue, Dec 30, 2025 at 03:52:41PM -0600, Bjorn Helgaas wrote:
> On Tue, Dec 30, 2025 at 10:27:36PM +0800, Jinhui Guo wrote:
> > On Mon, Dec 29, 2025 at 08:08:57AM -1000, Tejun Heo wrote:
> > > On Sat, Dec 27, 2025 at 07:33:26PM +0800, Jinhui Guo wrote:
> > > > To fix the issue, pci_call_probe() must not call work_on_cpu() when it is
> > > > already running inside an unbounded asynchronous worker. Because a driver
> > > > can be probed asynchronously either by probe_type or by the kernel command
> > > > line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone. Instead, we test
> > > > the PF_WQ_WORKER flag in current->flags; if it is set, pci_call_probe() is
> > > > executing within an unbounded workqueue worker and should skip the extra
> > > > work_on_cpu() call.
> > >
> > > Why not just use queue_work_on() on system_dfl_wq (or any other unbound
> > > workqueue)? Those are soft-affine to cache domain but can overflow to other
> > > CPUs?
> >
> > Hi, tejun,
> >
> > Thank you for your time and helpful suggestions.
> > I had considered replacing work_on_cpu() with queue_work_on(system_dfl_wq) +
> > flush_work(), but that would be a refactor rather than a fix for the specific
> > problem we hit.
> >
> > Let me restate the issue:
> >
> > 1. With PROBE_PREFER_ASYNCHRONOUS enabled, the driver core queues work on
> > async_wq to speed up driver probe.
> > 2. The PCI core then calls work_on_cpu() to tie the probe thread to the PCI
> > device’s NUMA node, but it always picks the same CPU for every device on
> > that node, forcing the PCI probes to run serially.
> >
> > Therefore I test current->flags & PF_WQ_WORKER to detect that we are already
> > inside an async_wq worker and skip the extra nested work queue.
> >
> > I agree with your point—using queue_work_on(system_dfl_wq) + flush_work()
> > would be cleaner and would let different vendors’ drivers probe in parallel
> > instead of fighting over the same CPU. I’ve prepared and tested another patch,
> > but I’m still unsure it’s the better approach; any further suggestions would
> > be greatly appreciated.
> >
> > Test results for that patch:
> > nvme 0000:01:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 34904955 ns
> > nvme 0000:02:00.0: CPU: 134, COMM: kworker/u1025:1, probe cost: 34774235 ns
> > nvme 0000:03:00.0: CPU: 1, COMM: kworker/u1025:4, probe cost: 34573054 ns
> >
> > Key changes in the patch:
> >
> > 1. Keep the current->flags & PF_WQ_WORKER test to avoid nested workers.
> > 2. Replace work_on_cpu() with queue_work_node(system_dfl_wq) + flush_work()
> > to enable parallel probing when PROBE_PREFER_ASYNCHRONOUS is disabled.
> > 3. Remove all cpumask operations.
> > 4. Drop cpu_hotplug_disable() since both cpumask manipulation and work_on_cpu()
> > are gone.
> >
> > The patch is shown below.
>
> I love this patch because it makes pci_call_probe() so much simpler.
>
> I *would* like a short higher-level description of the issue that
> doesn't assume so much workqueue background.
>
> I'm not an expert, but IIUC __driver_attach() schedules async workers
> so driver probes can run in parallel, but the problem is that the
> workers for devices on node X are currently serialized because they
> all bind to the same CPU on that node.
>
> Naive questions: It looks like async_schedule_dev() already schedules
> an async worker on the device node, so why does pci_call_probe() need
> to use queue_work_node() again?
>
> pci_call_probe() dates to 2005 (d42c69972b85 ("[PATCH] PCI: Run PCI
> driver initialization on local node")), but the async_schedule_dev()
> looks like it was only added in 2019 (c37e20eaf4b2 ("driver core:
> Attach devices on CPU local to device node")). Maybe the
> pci_call_probe() node awareness is no longer necessary?
Hi, Bjorn
Thank you for your time and kind reply.
As I see it, two scenarios should be borne in mind:
1. Driver allowed to probe asynchronously
The driver core schedules async workers via async_schedule_dev(),
so pci_call_probe() needs no extra queue_work_node().
2. Driver not allowed to probe asynchronously
The driver core (__driver_attach() or __device_attach()) calls
pci_call_probe() directly, without any async worker from
async_schedule_dev(). NUMA-node awareness in pci_call_probe()
is therefore still required.
Best Regards,
Jinhui
[+cc Rafael, Danilo (driver core question), update Alexander's email]
On Wed, Dec 31, 2025 at 03:51:05PM +0800, Jinhui Guo wrote:
> On Tue, Dec 30, 2025 at 03:52:41PM -0600, Bjorn Helgaas wrote:
> > On Tue, Dec 30, 2025 at 10:27:36PM +0800, Jinhui Guo wrote:
> > > On Mon, Dec 29, 2025 at 08:08:57AM -1000, Tejun Heo wrote:
> > > > On Sat, Dec 27, 2025 at 07:33:26PM +0800, Jinhui Guo wrote:
> > > > > To fix the issue, pci_call_probe() must not call work_on_cpu() when it is
> > > > > already running inside an unbounded asynchronous worker. Because a driver
> > > > > can be probed asynchronously either by probe_type or by the kernel command
> > > > > line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone. Instead, we test
> > > > > the PF_WQ_WORKER flag in current->flags; if it is set, pci_call_probe() is
> > > > > executing within an unbounded workqueue worker and should skip the extra
> > > > > work_on_cpu() call.
> > > >
> > > > Why not just use queue_work_on() on system_dfl_wq (or any other unbound
> > > > workqueue)? Those are soft-affine to cache domain but can overflow to other
> > > > CPUs?
> > >
> > > Hi, tejun,
> > >
> > > Thank you for your time and helpful suggestions.
> > > I had considered replacing work_on_cpu() with queue_work_on(system_dfl_wq) +
> > > flush_work(), but that would be a refactor rather than a fix for the specific
> > > problem we hit.
> > >
> > > Let me restate the issue:
> > >
> > > 1. With PROBE_PREFER_ASYNCHRONOUS enabled, the driver core queues work on
> > > async_wq to speed up driver probe.
> > > 2. The PCI core then calls work_on_cpu() to tie the probe thread to the PCI
> > > device’s NUMA node, but it always picks the same CPU for every device on
> > > that node, forcing the PCI probes to run serially.
> > >
> > > Therefore I test current->flags & PF_WQ_WORKER to detect that we are already
> > > inside an async_wq worker and skip the extra nested work queue.
> > >
> > > I agree with your point—using queue_work_on(system_dfl_wq) + flush_work()
> > > would be cleaner and would let different vendors’ drivers probe in parallel
> > > instead of fighting over the same CPU. I’ve prepared and tested another patch,
> > > but I’m still unsure it’s the better approach; any further suggestions would
> > > be greatly appreciated.
> > >
> > > Test results for that patch:
> > > nvme 0000:01:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 34904955 ns
> > > nvme 0000:02:00.0: CPU: 134, COMM: kworker/u1025:1, probe cost: 34774235 ns
> > > nvme 0000:03:00.0: CPU: 1, COMM: kworker/u1025:4, probe cost: 34573054 ns
> > >
> > > Key changes in the patch:
> > >
> > > 1. Keep the current->flags & PF_WQ_WORKER test to avoid nested workers.
> > > 2. Replace work_on_cpu() with queue_work_node(system_dfl_wq) + flush_work()
> > > to enable parallel probing when PROBE_PREFER_ASYNCHRONOUS is disabled.
> > > 3. Remove all cpumask operations.
> > > 4. Drop cpu_hotplug_disable() since both cpumask manipulation and work_on_cpu()
> > > are gone.
> > >
> > > The patch is shown below.
> >
> > I love this patch because it makes pci_call_probe() so much simpler.
> >
> > I *would* like a short higher-level description of the issue that
> > doesn't assume so much workqueue background.
> >
> > I'm not an expert, but IIUC __driver_attach() schedules async workers
> > so driver probes can run in parallel, but the problem is that the
> > workers for devices on node X are currently serialized because they
> > all bind to the same CPU on that node.
> >
> > Naive questions: It looks like async_schedule_dev() already schedules
> > an async worker on the device node, so why does pci_call_probe() need
> > to use queue_work_node() again?
> >
> > pci_call_probe() dates to 2005 (d42c69972b85 ("[PATCH] PCI: Run PCI
> > driver initialization on local node")), but the async_schedule_dev()
> > looks like it was only added in 2019 (c37e20eaf4b2 ("driver core:
> > Attach devices on CPU local to device node")). Maybe the
> > pci_call_probe() node awareness is no longer necessary?
>
> Hi, Bjorn
>
> Thank you for your time and kind reply.
>
> As I see it, two scenarios should be borne in mind:
>
> 1. Driver allowed to probe asynchronously
> The driver core schedules async workers via async_schedule_dev(),
> so pci_call_probe() needs no extra queue_work_node().
>
> 2. Driver not allowed to probe asynchronously
> The driver core (__driver_attach() or __device_attach()) calls
> pci_call_probe() directly, without any async worker from
> async_schedule_dev(). NUMA-node awareness in pci_call_probe()
> is therefore still required.
Good point, we need the NUMA awareness in both sync and async probe
paths.
But node affinity is orthogonal to the sync/async question, so it
seems weird to deal with affinity in two separate places. It also
seems sub-optimal to have node affinity in the driver core async path
but not the synchronous probe path.
Maybe driver_probe_device() should do something about NUMA affinity?
Bjorn
On Wed Dec 31, 2025 at 5:55 PM CET, Bjorn Helgaas wrote:
> On Wed, Dec 31, 2025 at 03:51:05PM +0800, Jinhui Guo wrote:
>> Hi, Bjorn
>>
>> Thank you for your time and kind reply.
>>
>> As I see it, two scenarios should be borne in mind:
>>
>> 1. Driver allowed to probe asynchronously
>>    The driver core schedules async workers via async_schedule_dev(),
>>    so pci_call_probe() needs no extra queue_work_node().
>>
>> 2. Driver not allowed to probe asynchronously
>>    The driver core (__driver_attach() or __device_attach()) calls
>>    pci_call_probe() directly, without any async worker from
>>    async_schedule_dev(). NUMA-node awareness in pci_call_probe()
>>    is therefore still required.
>
> Good point, we need the NUMA awareness in both sync and async probe
> paths.
>
> But node affinity is orthogonal to the sync/async question, so it
> seems weird to deal with affinity in two separate places.

In general I agree, but implementation wise it might make a difference:

In the async path we ultimately use queue_work_node(), which may fall back
to default queue_work() behavior or explicitly picks the current CPU to
queue work, if we are on the corresponding NUMA node already.

In the sync path however - if we want to do something about NUMA affinity -
we probably want to queue work as well and wait for completion, but in the
fallback case always execute the code ourselves, i.e. do not queue any work
at all.

> It also
> seems sub-optimal to have node affinity in the driver core async path
> but not the synchronous probe path.
>
> Maybe driver_probe_device() should do something about NUMA affinity?

driver_probe_device() might not be the best place, as it is the helper
executed from within the async path (work queue) and sync path (unless you
have something else in mind than what I mentioned above).

I think __device_attach() and __driver_attach() - probably through a common
helper - should handle NUMA affinity instead.
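Purely as an illustration of that idea (not existing driver core code; every
name below is invented for this sketch), such a common helper might look
roughly like this: queue the probe near the device's node on an unbound
workqueue, and fall back to executing it directly when no usable node exists
or when we are already running inside a workqueue worker:

    struct node_probe_work {
        struct work_struct work;
        struct device_driver *drv;
        struct device *dev;
        int ret;
    };

    static void node_probe_work_fn(struct work_struct *work)
    {
        struct node_probe_work *npw = container_of(work, struct node_probe_work, work);

        npw->ret = driver_probe_device(npw->drv, npw->dev);
    }

    /* Hypothetical common helper for __device_attach()/__driver_attach() */
    static int driver_probe_device_on_node(struct device_driver *drv,
                                           struct device *dev)
    {
        struct node_probe_work npw = { .drv = drv, .dev = dev };
        int node = dev_to_node(dev);

        /* Fallback: no usable node, or we are already a workqueue worker */
        if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
            (current->flags & PF_WQ_WORKER))
                return driver_probe_device(drv, dev);

        INIT_WORK_ONSTACK(&npw.work, node_probe_work_fn);
        queue_work_node(node, system_dfl_wq, &npw.work);
        flush_work(&npw.work);
        destroy_work_on_stack(&npw.work);

        return npw.ret;
    }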
On Tue, Dec 30, 2025 at 10:27:36PM +0800, Jinhui Guo wrote:
> 2. Replace work_on_cpu() with queue_work_node(system_dfl_wq) + flush_work()
>    to enable parallel probing when PROBE_PREFER_ASYNCHRONOUS is disabled.

Sorry for the mis-statement; probing is serial, not parallel:

2. Replace work_on_cpu() with queue_work_node(system_dfl_wq) + flush_work()
   to enable serial probing when PROBE_PREFER_ASYNCHRONOUS is disabled.

Thanks,
Jinhui
[+cc Marco, Tejun; just FYI since you have ongoing per-CPU wq work]
On Sat, Dec 27, 2025 at 07:33:26PM +0800, Jinhui Guo wrote:
> Commit ef0ff68351be ("driver core: Probe devices asynchronously instead of
> the driver") speeds up the loading of large numbers of device drivers by
> submitting asynchronous probe workers to an unbounded workqueue and binding
> each worker to the CPU near the device’s NUMA node. These workers are not
> scheduled on isolated CPUs because their cpumask is restricted to
> housekeeping_cpumask(HK_TYPE_WQ) and housekeeping_cpumask(HK_TYPE_DOMAIN).
>
> However, when PCI devices reside on the same NUMA node, all their
> drivers’ probe workers are bound to the same CPU within that node, so the
> probes run serially instead of in parallel, because pci_call_probe() invokes
> work_on_cpu(). Introduced by commit 873392ca514f ("PCI: work_on_cpu: use
> in drivers/pci/pci-driver.c"), work_on_cpu() queues a worker on
> system_percpu_wq to bind the probe thread to the first CPU in the
> device’s NUMA node (chosen via cpumask_any_and() in pci_call_probe()).
>
> 1. The function __driver_attach() submits an asynchronous worker with
> callback __driver_attach_async_helper().
>
> __driver_attach()
> async_schedule_dev(__driver_attach_async_helper, dev)
> async_schedule_node(func, dev, dev_to_node(dev))
> async_schedule_node_domain(func, data, node, &async_dfl_domain)
> __async_schedule_node_domain(func, data, node, domain, entry)
> queue_work_node(node, async_wq, &entry->work)
>
> 2. The asynchronous probe worker ultimately calls work_on_cpu() in
> pci_call_probe(), binding the worker to the same CPU within the
> device’s NUMA node.
>
> __driver_attach_async_helper()
> driver_probe_device(drv, dev)
> __driver_probe_device(drv, dev)
> really_probe(dev, drv)
> call_driver_probe(dev, drv)
> dev->bus->probe(dev)
> pci_device_probe(dev)
> __pci_device_probe(drv, pci_dev)
> pci_call_probe(drv, pci_dev, id)
> cpu = cpumask_any_and(cpumask_of_node(node), wq_domain_mask)
> error = work_on_cpu(cpu, local_pci_probe, &ddi)
> schedule_work_on(cpu, &wfc.work);
> queue_work_on(cpu, system_percpu_wq, work)
>
> To fix the issue, pci_call_probe() must not call work_on_cpu() when it is
> already running inside an unbounded asynchronous worker. Because a driver
> can be probed asynchronously either by probe_type or by the kernel command
> line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone. Instead, we test
> the PF_WQ_WORKER flag in current->flags; if it is set, pci_call_probe() is
> executing within an unbounded workqueue worker and should skip the extra
> work_on_cpu() call.
>
> Testing three NVMe devices on the same NUMA node of an AMD EPYC 9A64
> 2.4 GHz processor shows a 35 % probe-time improvement with the patch:
>
> Before (all on CPU 0):
> nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, probe cost: 53372612 ns
> nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, probe cost: 49532941 ns
> nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:3, probe cost: 47315175 ns
>
> After (spread across CPUs 1, 2, 5):
> nvme 0000:01:00.0: CPU: 5, COMM: kworker/u1025:5, probe cost: 34765890 ns
> nvme 0000:02:00.0: CPU: 1, COMM: kworker/u1025:2, probe cost: 34696433 ns
> nvme 0000:03:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 33233323 ns
>
> The improvement grows with more PCI devices because fewer probes contend
> for the same CPU.
>
> Fixes: ef0ff68351be ("driver core: Probe devices asynchronously instead of the driver")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
> ---
> drivers/pci/pci-driver.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 7c2d9d596258..4bc47a84d330 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -366,9 +366,11 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
> /*
> * Prevent nesting work_on_cpu() for the case where a Virtual Function
> * device is probed from work_on_cpu() of the Physical device.
> + * Check PF_WQ_WORKER to prevent invoking work_on_cpu() in an asynchronous
> + * probe worker when the driver allows asynchronous probing.
> */
> if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
> - pci_physfn_is_probed(dev)) {
> + pci_physfn_is_probed(dev) || (current->flags & PF_WQ_WORKER)) {
> cpu = nr_cpu_ids;
> } else {
> cpumask_var_t wq_domain_mask;
> --
> 2.20.1