Fix the longstanding race between hv_pci_query_relations() and
survey_child_resources() by flushing the workqueue before we exit from
hv_pci_query_relations().
Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
With the below debug code:
@@ -2103,6 +2103,8 @@ static void survey_child_resources(struct hv_pcibus_device *hbus)
}
spin_unlock_irqrestore(&hbus->device_list_lock, flags);
+ ssleep(15);
+ printk("%s: completing %px\n", __func__, event);
complete(event);
}
@@ -3305,8 +3307,12 @@ static int hv_pci_query_relations(struct hv_device *hdev)
ret = vmbus_sendpacket(hdev->channel, &message, sizeof(message),
0, VM_PKT_DATA_INBAND, 0);
- if (!ret)
+ if (!ret) {
+ ssleep(10); // unassign the PCI device on the host during the 10s
ret = wait_for_response(hdev, &comp);
+ printk("%s: comp=%px is becoming invalid! ret=%d\n",
+ __func__, &comp, ret);
+ }
return ret;
}
@@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
retry:
ret = hv_pci_query_relations(hdev);
+ printk("hv_pci_query_relations() exited\n");
+
if (ret)
goto free_irq_domain;
I'm able to repro the below hang issue:
[ 74.544744] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI VMBus probing: Using version 0x10004
[ 76.886944] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF slot 1 removed
[ 84.788266] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: The device is gone.
[ 84.792586] hv_pci_query_relations: comp=ffffa7504012fb58 is becoming invalid! ret=-19
[ 84.797505] hv_pci_query_relations() exited
[ 89.652268] survey_child_resources: completing ffffa7504012fb58
[ 150.392242] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 150.398447] rcu: 15-...0: (2 ticks this GP) idle=867c/1/0x4000000000000000 softirq=947/947 fqs=5234
[ 150.405851] rcu: (detected by 14, t=15004 jiffies, g=2553, q=4833 ncpus=16)
[ 150.410870] Sending NMI from CPU 14 to CPUs 15:
[ 150.414836] NMI backtrace for cpu 15
[ 150.414840] CPU: 15 PID: 10 Comm: kworker/u32:0 Tainted: G W E 6.3.0-rc3-decui-dirty #34
...
[ 150.414849] Workqueue: hv_pci_468b pci_devices_present_work [pci_hyperv]
[ 150.414866] RIP: 0010:__pv_queued_spin_lock_slowpath+0x10f/0x3c0
...
[ 150.414905] Call Trace:
[ 150.414907] <TASK>
[ 150.414911] _raw_spin_lock_irqsave+0x40/0x50
[ 150.414917] complete+0x1d/0x60
[ 150.414924] pci_devices_present_work+0x5dd/0x680 [pci_hyperv]
[ 150.414946] process_one_work+0x21f/0x430
[ 150.414952] worker_thread+0x4a/0x3c0
With this patch, the hang issue goes away:
[ 186.143612] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: The device is gone.
[ 186.148034] hv_pci_query_relations: comp=ffffa7cfd0aa3b50 is becoming invalid! ret=-19
[ 191.263611] survey_child_resources: completing ffffa7cfd0aa3b50
[ 191.267732] hv_pci_query_relations() exited
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index f33370b75628..b82c7cde19e6 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3308,6 +3308,19 @@ static int hv_pci_query_relations(struct hv_device *hdev)
if (!ret)
ret = wait_for_response(hdev, &comp);
+ /*
+ * In the case of fast device addition/removal, it's possible that
+ * vmbus_sendpacket() or wait_for_response() returns -ENODEV but we
+ * already got a PCI_BUS_RELATIONS* message from the host and the
+ * channel callback already scheduled a work to hbus->wq, which can be
+ * running survey_child_resources() -> complete(&hbus->survey_event),
+ * even after hv_pci_query_relations() exits and the stack variable
+ * 'comp' is no longer valid. This can cause a strange hang issue
+ * or sometimes a page fault. Flush hbus->wq before we exit from
+ * hv_pci_query_relations() to avoid the issues.
+ */
+ flush_workqueue(hbus->wq);
+
return ret;
}
--
2.25.1
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci- > hyperv.c > index f33370b75628..b82c7cde19e6 100644 > --- a/drivers/pci/controller/pci-hyperv.c > +++ b/drivers/pci/controller/pci-hyperv.c > @@ -3308,6 +3308,19 @@ static int hv_pci_query_relations(struct hv_device > *hdev) > if (!ret) > ret = wait_for_response(hdev, &comp); > > + /* > + * In the case of fast device addition/removal, it's possible that > + * vmbus_sendpacket() or wait_for_response() returns -ENODEV but > we > + * already got a PCI_BUS_RELATIONS* message from the host and the > + * channel callback already scheduled a work to hbus->wq, which can > be > + * running survey_child_resources() -> complete(&hbus- > >survey_event), > + * even after hv_pci_query_relations() exits and the stack variable > + * 'comp' is no longer valid. This can cause a strange hang issue > + * or sometimes a page fault. Flush hbus->wq before we exit from > + * hv_pci_query_relations() to avoid the issues. > + */ > + flush_workqueue(hbus->wq); Is it possible for PCI_BUS_RELATIONS to be scheduled arrive after calling flush_workqueue(hbus->wq)? > + > return ret; > } > > -- > 2.25.1
> From: Long Li <longli@microsoft.com> > Sent: Tuesday, March 28, 2023 9:49 AM > > --- a/drivers/pci/controller/pci-hyperv.c > > +++ b/drivers/pci/controller/pci-hyperv.c > > @@ -3308,6 +3308,19 @@ static int hv_pci_query_relations(struct > hv_device > > *hdev) > > if (!ret) > > ret = wait_for_response(hdev, &comp); > > > > + /* > > + * In the case of fast device addition/removal, it's possible that > > + * vmbus_sendpacket() or wait_for_response() returns -ENODEV but > > we > > + * already got a PCI_BUS_RELATIONS* message from the host and the > > + * channel callback already scheduled a work to hbus->wq, which can > > be > > + * running survey_child_resources() -> complete(&hbus- > > >survey_event), > > + * even after hv_pci_query_relations() exits and the stack variable > > + * 'comp' is no longer valid. This can cause a strange hang issue > > + * or sometimes a page fault. Flush hbus->wq before we exit from > > + * hv_pci_query_relations() to avoid the issues. > > + */ > > + flush_workqueue(hbus->wq); > > Is it possible for PCI_BUS_RELATIONS to be scheduled arrive after calling > flush_workqueue(hbus->wq)? It's possible, but that doesn't matter: hv_pci_query_relations() is called only once, and it sets hbus->survey_event to point to the stack variable 'comp'. The first survey_child_resources() calls complete() for the 'comp' and sets hbus->survey_event to NULL. When the second survey_child_resources() is called, hbus->survey_event is NULL, so survey_child_resources() returns immediately. According to my test, after hv_pci_enter_d0() posts PCI_BUS_D0ENTRY, the guest receives a second PCI_BUS_RELATIONS2 message, which is the same as the first PCI_BUS_RELATIONS2 message, which is basically a no-op in pci_devices_present_work(), especially with the newly-introduced per-hbus state_lock mutex.
> -----Original Message-----
> From: Dexuan Cui <decui@microsoft.com>
> Sent: Tuesday, March 28, 2023 10:21 AM
> To: bhelgaas@google.com; davem@davemloft.net; Dexuan Cui
> <decui@microsoft.com>; edumazet@google.com; Haiyang Zhang
> <haiyangz@microsoft.com>; Jake Oshins <jakeo@microsoft.com>;
> kuba@kernel.org; kw@linux.com; KY Srinivasan <kys@microsoft.com>;
> leon@kernel.org; linux-pci@vger.kernel.org; lpieralisi@kernel.org; Michael
> Kelley (LINUX) <mikelley@microsoft.com>; pabeni@redhat.com;
> robh@kernel.org; saeedm@nvidia.com; wei.liu@kernel.org; Long Li
> <longli@microsoft.com>; boqun.feng@gmail.com
> Cc: linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org; netdev@vger.kernel.org
> Subject: [EXTERNAL] [PATCH 1/6] PCI: hv: fix a race condition bug in
> hv_pci_query_relations()
>
> Fix the longstanding race between hv_pci_query_relations() and
> survey_child_resources() by flushing the workqueue before we exit from
> hv_pci_query_relations().
>
> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft
> Hyper-V VMs")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
>
> ---
> drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> With the below debug code:
>
> @@ -2103,6 +2103,8 @@ static void survey_child_resources(struct
> hv_pcibus_device *hbus)
> }
>
> spin_unlock_irqrestore(&hbus->device_list_lock, flags);
> + ssleep(15);
> + printk("%s: completing %px\n", __func__, event);
> complete(event);
> }
>
> @@ -3305,8 +3307,12 @@ static int hv_pci_query_relations(struct hv_device
> *hdev)
>
> ret = vmbus_sendpacket(hdev->channel, &message, sizeof(message),
> 0, VM_PKT_DATA_INBAND, 0);
> - if (!ret)
> + if (!ret) {
> + ssleep(10); // unassign the PCI device on the host during the
> 10s
> ret = wait_for_response(hdev, &comp);
> + printk("%s: comp=%px is becoming invalid! ret=%d\n",
> + __func__, &comp, ret);
> + }
>
> return ret;
> }
> @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
>
> retry:
> ret = hv_pci_query_relations(hdev);
> + printk("hv_pci_query_relations() exited\n");
Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).
> +
> if (ret)
> goto free_irq_domain;
>
> I'm able to repro the below hang issue:
>
> [ 74.544744] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI VMBus
> probing: Using version 0x10004
> [ 76.886944] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF slot
> 1 removed
> [ 84.788266] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: The device is
> gone.
> [ 84.792586] hv_pci_query_relations: comp=ffffa7504012fb58 is becoming
> invalid! ret=-19
> [ 84.797505] hv_pci_query_relations() exited
> [ 89.652268] survey_child_resources: completing ffffa7504012fb58
> [ 150.392242] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 150.398447] rcu: 15-...0: (2 ticks this GP)
> idle=867c/1/0x4000000000000000 softirq=947/947 fqs=5234
> [ 150.405851] rcu: (detected by 14, t=15004 jiffies, g=2553, q=4833
> ncpus=16)
> [ 150.410870] Sending NMI from CPU 14 to CPUs 15:
> [ 150.414836] NMI backtrace for cpu 15
> [ 150.414840] CPU: 15 PID: 10 Comm: kworker/u32:0 Tainted: G W E
> 6.3.0-rc3-decui-dirty #34
> ...
> [ 150.414849] Workqueue: hv_pci_468b pci_devices_present_work
> [pci_hyperv] [ 150.414866] RIP:
> 0010:__pv_queued_spin_lock_slowpath+0x10f/0x3c0
> ...
> [ 150.414905] Call Trace:
> [ 150.414907] <TASK>
> [ 150.414911] _raw_spin_lock_irqsave+0x40/0x50 [ 150.414917]
> complete+0x1d/0x60 [ 150.414924] pci_devices_present_work+0x5dd/0x680
> [pci_hyperv] [ 150.414946] process_one_work+0x21f/0x430 [ 150.414952]
> worker_thread+0x4a/0x3c0
>
> With this patch, the hang issue goes away:
>
> [ 186.143612] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: The device is
> gone.
> [ 186.148034] hv_pci_query_relations: comp=ffffa7cfd0aa3b50 is becoming
> invalid! ret=-19 [ 191.263611] survey_child_resources: completing
> ffffa7cfd0aa3b50 [ 191.267732] hv_pci_query_relations() exited
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-
> hyperv.c
> index f33370b75628..b82c7cde19e6 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -3308,6 +3308,19 @@ static int hv_pci_query_relations(struct hv_device
> *hdev)
> if (!ret)
> ret = wait_for_response(hdev, &comp);
>
> + /*
> + * In the case of fast device addition/removal, it's possible that
> + * vmbus_sendpacket() or wait_for_response() returns -ENODEV but
> we
> + * already got a PCI_BUS_RELATIONS* message from the host and the
> + * channel callback already scheduled a work to hbus->wq, which can
> be
> + * running survey_child_resources() -> complete(&hbus-
> >survey_event),
> + * even after hv_pci_query_relations() exits and the stack variable
> + * 'comp' is no longer valid. This can cause a strange hang issue
> + * or sometimes a page fault. Flush hbus->wq before we exit from
> + * hv_pci_query_relations() to avoid the issues.
> + */
> + flush_workqueue(hbus->wq);
> +
> return ret;
> }
>
> --
> 2.25.1
> From: Saurabh Singh Sengar <ssengar@microsoft.com>
> Sent: Monday, March 27, 2023 10:29 PM
> > ...
> > ---
Please note this special line "---".
Anything after the special line and before the line "diff --git" is discarded
automaticaly by 'git' and 'patch'.
> > drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> > 1 file changed, 13 insertions(+)
> >
> > @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> >
> > retry:
> > ret = hv_pci_query_relations(hdev);
> > + printk("hv_pci_query_relations() exited\n");
>
> Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).
This is not part of the real patch :-)
I just thought the debug code can help understand the issues
resolved by the patches.
I'll remove the debug code to avoid confusion if I need to post v2.
On Tue, Mar 28, 2023 at 05:38:59AM +0000, Dexuan Cui wrote:
> > From: Saurabh Singh Sengar <ssengar@microsoft.com>
> > Sent: Monday, March 27, 2023 10:29 PM
> > > ...
> > > ---
>
> Please note this special line "---".
> Anything after the special line and before the line "diff --git" is discarded
> automaticaly by 'git' and 'patch'.
>
> > > drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> > > 1 file changed, 13 insertions(+)
> > >
> > > @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> > >
> > > retry:
> > > ret = hv_pci_query_relations(hdev);
> > > + printk("hv_pci_query_relations() exited\n");
> >
> > Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).
>
> This is not part of the real patch :-)
> I just thought the debug code can help understand the issues
> resolved by the patches.
> I'll remove the debug code to avoid confusion if I need to post v2.
I guess that means you *will* post a v2, right? Or do you expect
somebody else to remove the debug code? If you do keep any debug or
other logging, use pci_info() (or dev_info()) whenever possible.
Also capitalize the subject line to match the others in the series.
Bjorn
> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Tuesday, March 28, 2023 10:25 AM
> To: Dexuan Cui <decui@microsoft.com>
> ...
> On Tue, Mar 28, 2023 at 05:38:59AM +0000, Dexuan Cui wrote:
> > > From: Saurabh Singh Sengar <ssengar@microsoft.com>
> > > Sent: Monday, March 27, 2023 10:29 PM
> > > > ...
> > > > ---
> >
> > Please note this special line "---".
> > Anything after the special line and before the line "diff --git" is discarded
> > automaticaly by 'git' and 'patch'.
> >
> > > > drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> > > > 1 file changed, 13 insertions(+)
> > > >
> > > > @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> > > >
> > > > retry:
> > > > ret = hv_pci_query_relations(hdev);
> > > > + printk("hv_pci_query_relations() exited\n");
> > >
> > > Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).
> >
> > This is not part of the real patch :-)
> > I just thought the debug code can help understand the issues
> > resolved by the patches.
> > I'll remove the debug code to avoid confusion if I need to post v2.
>
> I guess that means you *will* post a v2, right?
I guess I didn't make myself clear, sorry. The "debug code" is not
part of the real patch body -- if we run the "patch" program or "git am"
to apply the patches, the "debug code" is automatically dropped because
it's between the special "---" line and the real start of the patch body (i.e.
the "diff --git" line).
So far, IMO I don't have to post v2 because the patch body and the patch
description (except for the part that's automatically removed by 'patch'
and 'git') don't need any change.
> Or do you expect
> somebody else to remove the debug code? If you do keep any debug or
> other logging, use pci_info() (or dev_info()) whenever possible.
As I explained above, 'patch' and 'git' automatically remove the part that
don't have to be in the git history.
>
> Also capitalize the subject line to match the others in the series.
>
> Bjorn
Thanks for catching this! If this is the only thing to be fixed, I hope the
PCI folks can help fix this when accepting the patch. If you think I should
post v2, please let me know.
On Wed, Mar 29, 2023 at 08:37:20AM +0000, Dexuan Cui wrote:
> > From: Bjorn Helgaas <helgaas@kernel.org>
> > Sent: Tuesday, March 28, 2023 10:25 AM
> > To: Dexuan Cui <decui@microsoft.com>
> > ...
> > On Tue, Mar 28, 2023 at 05:38:59AM +0000, Dexuan Cui wrote:
> > > > From: Saurabh Singh Sengar <ssengar@microsoft.com>
> > > > Sent: Monday, March 27, 2023 10:29 PM
> > > > > ...
> > > > > ---
> > >
> > > Please note this special line "---".
> > > Anything after the special line and before the line "diff --git" is discarded
> > > automaticaly by 'git' and 'patch'.
> > >
> > > > > drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> > > > > 1 file changed, 13 insertions(+)
> > > > >
> > > > > @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> > > > >
> > > > > retry:
> > > > > ret = hv_pci_query_relations(hdev);
> > > > > + printk("hv_pci_query_relations() exited\n");
> > > >
> > > > Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).
> > >
> > > This is not part of the real patch :-)
> > > I just thought the debug code can help understand the issues
> > > resolved by the patches.
> > > I'll remove the debug code to avoid confusion if I need to post v2.
> >
> > I guess that means you *will* post a v2, right?
>
> I guess I didn't make myself clear, sorry. The "debug code" is not
> part of the real patch body -- if we run the "patch" program or "git am"
> to apply the patches, the "debug code" is automatically dropped because
> it's between the special "---" line and the real start of the patch body (i.e.
> the "diff --git" line).
>
> So far, IMO I don't have to post v2 because the patch body and the patch
> description (except for the part that's automatically removed by 'patch'
> and 'git') don't need any change.
Ah, sorry, I didn't notice that, even though you explicitly pointed it
out earlier. No need to repost just to capitalize the subject,
Lorenzo can do it or not as he chooses.
Bjorn
© 2016 - 2026 Red Hat, Inc.