From nobody Sun Feb 8 14:31:15 2026 Received: from sg-1-103.ptr.blmpb.com (sg-1-103.ptr.blmpb.com [118.26.132.103]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C200531327F for ; Sat, 27 Dec 2025 11:34:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=118.26.132.103 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1766835258; cv=none; b=k4JCuqTr979aD/6NPkQCxdeJpDJad6uKWPfkrK00MZt7BejYidKlEwXt/v0SnusrFAofyNRJPkgfUdC0bDIus0GfhwxisBl53/EHdfJx+buxZtG+nMRoVBDopKuOnWYyGs3oCRKyjiKH11YbfBPLejEs54vVRD6OPNspvuq7plk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1766835258; c=relaxed/simple; bh=BuIMaNT+UI0KsqsbHtmd0bHo46Jp/KXJIvwHEITBcSs=; h=Message-Id:Mime-Version:To:Cc:From:Subject:Date:Content-Type; b=bmxWKlyG5Mj7QldDJAFkMgU7X7DNV8YQjKZtYOjc6CI3g+jjvd29ka5LOCWCSdo5kplzz+MmTni9eA8X0Se27bUB1940D9IIhh40FwxySuB3EJyncK2ra+4O5MMltHQs485tIdwrPtOu3xBBtRyBJxIyCdWtksA32rjh6sZafBw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=ADqDne/d; arc=none smtp.client-ip=118.26.132.103 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="ADqDne/d" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; s=2212171451; d=bytedance.com; t=1766835243; h=from:subject: mime-version:from:date:message-id:subject:to:cc:reply-to:content-type: mime-version:in-reply-to:message-id; bh=zpBY2sbAf3ugjQp1+eDZumXA4vCMZxwygzISFCEom34=; b=ADqDne/dNqiQY/oKhsg8YHEli8etR5VzSQZGOBquF+f2yUFiTBCHWxhwKKCskgBk/yKR4c YTfbIMaC9+5E/vK3v6saXGxWJz1BRv+4iRePamQBgWx6vdvqUgMU4PP5fqwcUvtSqyYT3l 4IP8GPsin8B9UWDxDWr2XVdW2d3An5xbNS7ztzuxIpSGA49MvF+CKpIvOmwgdnhsBnuhUU JHk0WeHxCq3bwaUxitcPrySe1KUw1uFEXovrHF6wNwIlGrPjntL4rDIHhrf0Pp1HNLrXhb 4ft1rNFmxsGscujhNi6rdIV8CbE25lIj9rTcPtSwSpce2lo4tuMQk90mtvPlmw== X-Mailer: git-send-email 2.17.1 Message-Id: <20251227113326.964-1-guojinhui.liam@bytedance.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 X-Lms-Return-Path: Content-Transfer-Encoding: quoted-printable X-Original-From: Jinhui Guo To: , , , , Cc: , , , From: "Jinhui Guo" Subject: [PATCH] PCI: Avoid work_on_cpu() in async probe workers Date: Sat, 27 Dec 2025 19:33:26 +0800 Content-Type: text/plain; charset="utf-8" Commit ef0ff68351be ("driver core: Probe devices asynchronously instead of the driver") speeds up the loading of large numbers of device drivers by submitting asynchronous probe workers to an unbounded workqueue and binding each worker to the CPU near the device=E2=80=99s NUMA node. These workers a= re not scheduled on isolated CPUs because their cpumask is restricted to housekeeping_cpumask(HK_TYPE_WQ) and housekeeping_cpumask(HK_TYPE_DOMAIN). However, when PCI devices reside on the same NUMA node, all their drivers=E2=80=99 probe workers are bound to the same CPU within that node, = yet the probes still run in parallel because pci_call_probe() invokes work_on_cpu(). Introduced by commit 873392ca514f ("PCI: work_on_cpu: use in drivers/pci/pci-driver.c"), work_on_cpu() queues a worker on system_percpu_wq to bind the probe thread to the first CPU in the device=E2=80=99s NUMA node (chosen via cpumask_any_and() in pci_call_probe(= )). 1. The function __driver_attach() submits an asynchronous worker with callback __driver_attach_async_helper(). __driver_attach() async_schedule_dev(__driver_attach_async_helper, dev) async_schedule_node(func, dev, dev_to_node(dev)) async_schedule_node_domain(func, data, node, &async_dfl_domain) __async_schedule_node_domain(func, data, node, domain, entry) queue_work_node(node, async_wq, &entry->work) 2. The asynchronous probe worker ultimately calls work_on_cpu() in pci_call_probe(), binding the worker to the same CPU within the device=E2=80=99s NUMA node. __driver_attach_async_helper() driver_probe_device(drv, dev) __driver_probe_device(drv, dev) really_probe(dev, drv) call_driver_probe(dev, drv) dev->bus->probe(dev) pci_device_probe(dev) __pci_device_probe(drv, pci_dev) pci_call_probe(drv, pci_dev, id) cpu =3D cpumask_any_and(cpumask_of_node(node), wq_domain_mask) error =3D work_on_cpu(cpu, local_pci_probe, &ddi) schedule_work_on(cpu, &wfc.work); queue_work_on(cpu, system_percpu_wq, work) To fix the issue, pci_call_probe() must not call work_on_cpu() when it is already running inside an unbounded asynchronous worker. Because a driver can be probed asynchronously either by probe_type or by the kernel command line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone. Instead, we test the PF_WQ_WORKER flag in current->flags; if it is set, pci_call_probe() is executing within an unbounded workqueue worker and should skip the extra work_on_cpu() call. Testing three NVMe devices on the same NUMA node of an AMD EPYC 9A64 2.4 GHz processor shows a 35 % probe-time improvement with the patch: Before (all on CPU 0): nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, probe cost: 53372612 ns nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, probe cost: 49532941 ns nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:3, probe cost: 47315175 ns After (spread across CPUs 1, 2, 5): nvme 0000:01:00.0: CPU: 5, COMM: kworker/u1025:5, probe cost: 34765890 ns nvme 0000:02:00.0: CPU: 1, COMM: kworker/u1025:2, probe cost: 34696433 ns nvme 0000:03:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 33233323 ns The improvement grows with more PCI devices because fewer probes contend for the same CPU. Fixes: ef0ff68351be ("driver core: Probe devices asynchronously instead of = the driver") Cc: stable@vger.kernel.org Signed-off-by: Jinhui Guo --- drivers/pci/pci-driver.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 7c2d9d596258..4bc47a84d330 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -366,9 +366,11 @@ static int pci_call_probe(struct pci_driver *drv, stru= ct pci_dev *dev, /* * Prevent nesting work_on_cpu() for the case where a Virtual Function * device is probed from work_on_cpu() of the Physical device. + * Check PF_WQ_WORKER to prevent invoking work_on_cpu() in an asynchronous + * probe worker when the driver allows asynchronous probing. */ if (node < 0 || node >=3D MAX_NUMNODES || !node_online(node) || - pci_physfn_is_probed(dev)) { + pci_physfn_is_probed(dev) || (current->flags & PF_WQ_WORKER)) { cpu =3D nr_cpu_ids; } else { cpumask_var_t wq_domain_mask; --=20 2.20.1