[PATCH v2] PCI: Prevent workqueue code nesting in local_pci_probe()

Waiman Long posted 1 patch 1 week, 6 days ago
drivers/pci/pci-driver.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
[PATCH v2] PCI: Prevent workqueue code nesting in local_pci_probe()
Posted by Waiman Long 1 week, 6 days ago
local_pci_probe() and hence pci_call_probe() can be called
recursively. If the recursive calls are done indirectly via workqueue
kworker, a lockdep recursive warning can be produced.

There are older commits that tries to prevent that. One example is
commit 12c3156f10c5 ("PCI: Avoid unnecessary CPU switch when calling
driver .probe() method") which prevents work_on_cpu() recursion when
the current device is a virtual function and the physical device has
been probed. However, there are still other cases where workqueue code
nesting is possible leading to a lockdep recursive locking warning like
the following stack trace on a 4-socket x86-64 Skylake server.

  <TASK>
    :
  start_flush_work+0x40b/0x9b0
  __flush_work+0xbd/0x1a0
  pci_call_probe+0x510/0x700
  pci_device_probe+0x17c/0x270
  call_driver_probe+0x68/0x1f0
  really_probe+0x197/0x7b0
  __driver_probe_device+0x32d/0x460
  driver_probe_device+0x49/0x120
  __device_attach_driver+0x162/0x290
  bus_for_each_drv+0x109/0x190
  __device_attach+0x1a2/0x3f0
  device_initial_probe+0x7d/0xa0
  pci_bus_add_device+0x93/0xe0
  pci_bus_add_devices+0x83/0x190
  vmd_enable_domain+0x11fb/0x1b80
  vmd_probe+0x34c/0x4b0
  local_pci_probe+0xdf/0x190
  local_pci_probe_callback+0x35/0x80
  process_one_work+0x919/0x1af0
  worker_thread+0x5a6/0xd10
    :
  </TASK>

Fix that by adding a new wq_kworker() helper to check if the current
task is a workqueue kworker. If so, call local_pci_probe() directly
instead of calling into workqueue code recursively. However, this patch
will increase the level of local_pci_probe() nesting that is possible.

For this particular system with a patched kernel built with a RHEL
based kernel config file, the local_pci_probe() nesting is only 2
levels deep. The additional stack usage from one local_pci_probe()
to the next is measured to be 792 bytes.  With a 16k kernel stack,
it can sustain up to 20 local_pci_probe() nesting loop. The mileage
may vary depending on what kernel options are enabled and future code
changes in the PCI driver code base.

It is unlikely that we will see a real system with a very deeply nested
PCIe topology requiring more levels of nesting than is supportable by
a 16k kernel stack. To be cautious, a comment is added to mention that
kernel stack exhaustion can be a possibility if a very deeply nested
PCIe topology is to be supported as further change to how PCI device
probing works may be needed.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 drivers/pci/pci-driver.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

 [v2] Directly check PF_WQ_WORKER as suggested by Sashiko and document
 the kernel stack exhaustion issue with deeply nested PCIe topology.

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index e3f59001785a..6d9944b42c3c 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -370,6 +370,14 @@ static bool pci_physfn_is_probed(struct pci_dev *dev)
 #endif
 }
 
+/*
+ * Return true if current task is a workqueue kworker
+ */
+static bool wq_kworker(void)
+{
+	return current->flags & PF_WQ_WORKER;
+}
+
 static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 			  const struct pci_device_id *id)
 {
@@ -387,10 +395,17 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	cpu_hotplug_disable();
 	/*
 	 * Prevent nesting work_on_cpu() for the case where a Virtual Function
-	 * device is probed from work_on_cpu() of the Physical device.
+	 * device is probed from work_on_cpu() of the Physical device or when
+	 * the current task is a workqueue kworker.
+	 *
+	 * TODO: With a deeply nested PCIe topology, pci_call_probe() can be
+	 * recursively called multiple times. If the nesting is deep enough,
+	 * it may cause exhaustion of the kernel stack. So some additional
+	 * changes will be needed if such a deeply nested topology is to be
+	 * supported.
 	 */
 	if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
-	    pci_physfn_is_probed(dev)) {
+	    pci_physfn_is_probed(dev) || wq_kworker()) {
 		error = local_pci_probe(&ddi);
 	} else {
 		struct pci_probe_arg arg = { .ddi = &ddi };
-- 
2.54.0