[PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()

Dexuan Cui posted 6 patches 2 years, 10 months ago
[PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()
Posted by Dexuan Cui 2 years, 10 months ago
Fix the longstanding race between hv_pci_query_relations() and
survey_child_resources() by flushing the workqueue before we exit from
hv_pci_query_relations().

Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
Signed-off-by: Dexuan Cui <decui@microsoft.com>

---
 drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

With the below debug code:

@@ -2103,6 +2103,8 @@ static void survey_child_resources(struct hv_pcibus_device *hbus)
 	}

 	spin_unlock_irqrestore(&hbus->device_list_lock, flags);
+	ssleep(15);
+	printk("%s: completing %px\n", __func__, event);
 	complete(event);
 }

@@ -3305,8 +3307,12 @@ static int hv_pci_query_relations(struct hv_device *hdev)

 	ret = vmbus_sendpacket(hdev->channel, &message, sizeof(message),
 			       0, VM_PKT_DATA_INBAND, 0);
-	if (!ret)
+	if (!ret) {
+		ssleep(10); // unassign the PCI device on the host during the 10s
 		ret = wait_for_response(hdev, &comp);
+		printk("%s: comp=%px is becoming invalid! ret=%d\n",
+			__func__, &comp, ret);
+	}

 	return ret;
 }
@@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,

 retry:
 	ret = hv_pci_query_relations(hdev);
+	printk("hv_pci_query_relations() exited\n");
+
 	if (ret)
 		goto free_irq_domain;

I'm able to repro the below hang issue:

[   74.544744] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI VMBus probing: Using version 0x10004
[   76.886944] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF slot 1 removed
[   84.788266] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: The device is gone.
[   84.792586] hv_pci_query_relations: comp=ffffa7504012fb58 is becoming invalid! ret=-19
[   84.797505] hv_pci_query_relations() exited
[   89.652268] survey_child_resources: completing ffffa7504012fb58
[  150.392242] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  150.398447] rcu:     15-...0: (2 ticks this GP) idle=867c/1/0x4000000000000000 softirq=947/947 fqs=5234
[  150.405851] rcu:     (detected by 14, t=15004 jiffies, g=2553, q=4833 ncpus=16)
[  150.410870] Sending NMI from CPU 14 to CPUs 15:
[  150.414836] NMI backtrace for cpu 15
[  150.414840] CPU: 15 PID: 10 Comm: kworker/u32:0 Tainted: G        W   E      6.3.0-rc3-decui-dirty #34
...
[  150.414849] Workqueue: hv_pci_468b pci_devices_present_work [pci_hyperv]
[  150.414866] RIP: 0010:__pv_queued_spin_lock_slowpath+0x10f/0x3c0
...
[  150.414905] Call Trace:
[  150.414907]  <TASK>
[  150.414911]  _raw_spin_lock_irqsave+0x40/0x50
[  150.414917]  complete+0x1d/0x60
[  150.414924]  pci_devices_present_work+0x5dd/0x680 [pci_hyperv]
[  150.414946]  process_one_work+0x21f/0x430
[  150.414952]  worker_thread+0x4a/0x3c0

With this patch, the hang issue goes away:

[  186.143612] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: The device is gone.
[  186.148034] hv_pci_query_relations: comp=ffffa7cfd0aa3b50 is becoming invalid! ret=-19
[  191.263611] survey_child_resources: completing ffffa7cfd0aa3b50
[  191.267732] hv_pci_query_relations() exited

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index f33370b75628..b82c7cde19e6 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3308,6 +3308,19 @@ static int hv_pci_query_relations(struct hv_device *hdev)
 	if (!ret)
 		ret = wait_for_response(hdev, &comp);
 
+	/*
+	 * In the case of fast device addition/removal, it's possible that
+	 * vmbus_sendpacket() or wait_for_response() returns -ENODEV but we
+	 * already got a PCI_BUS_RELATIONS* message from the host and the
+	 * channel callback already scheduled a work to hbus->wq, which can be
+	 * running survey_child_resources() -> complete(&hbus->survey_event),
+	 * even after hv_pci_query_relations() exits and the stack variable
+	 * 'comp' is no longer valid. This can cause a strange hang issue
+	 * or sometimes a page fault. Flush hbus->wq before we exit from
+	 * hv_pci_query_relations() to avoid the issues.
+	 */
+	flush_workqueue(hbus->wq);
+
 	return ret;
 }
 
-- 
2.25.1
RE: [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()
Posted by Long Li 2 years, 10 months ago
 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-
> hyperv.c
> index f33370b75628..b82c7cde19e6 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -3308,6 +3308,19 @@ static int hv_pci_query_relations(struct hv_device
> *hdev)
>  	if (!ret)
>  		ret = wait_for_response(hdev, &comp);
> 
> +	/*
> +	 * In the case of fast device addition/removal, it's possible that
> +	 * vmbus_sendpacket() or wait_for_response() returns -ENODEV but
> we
> +	 * already got a PCI_BUS_RELATIONS* message from the host and the
> +	 * channel callback already scheduled a work to hbus->wq, which can
> be
> +	 * running survey_child_resources() -> complete(&hbus-
> >survey_event),
> +	 * even after hv_pci_query_relations() exits and the stack variable
> +	 * 'comp' is no longer valid. This can cause a strange hang issue
> +	 * or sometimes a page fault. Flush hbus->wq before we exit from
> +	 * hv_pci_query_relations() to avoid the issues.
> +	 */
> +	flush_workqueue(hbus->wq);

Is it possible for PCI_BUS_RELATIONS to be scheduled arrive after calling flush_workqueue(hbus->wq)?

> +
>  	return ret;
>  }
> 
> --
> 2.25.1
RE: [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()
Posted by Dexuan Cui 2 years, 10 months ago
> From: Long Li <longli@microsoft.com>
> Sent: Tuesday, March 28, 2023 9:49 AM
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -3308,6 +3308,19 @@ static int hv_pci_query_relations(struct
> hv_device
> > *hdev)
> >  	if (!ret)
> >  		ret = wait_for_response(hdev, &comp);
> >
> > +	/*
> > +	 * In the case of fast device addition/removal, it's possible that
> > +	 * vmbus_sendpacket() or wait_for_response() returns -ENODEV but
> > we
> > +	 * already got a PCI_BUS_RELATIONS* message from the host and the
> > +	 * channel callback already scheduled a work to hbus->wq, which can
> > be
> > +	 * running survey_child_resources() -> complete(&hbus-
> > >survey_event),
> > +	 * even after hv_pci_query_relations() exits and the stack variable
> > +	 * 'comp' is no longer valid. This can cause a strange hang issue
> > +	 * or sometimes a page fault. Flush hbus->wq before we exit from
> > +	 * hv_pci_query_relations() to avoid the issues.
> > +	 */
> > +	flush_workqueue(hbus->wq);
> 
> Is it possible for PCI_BUS_RELATIONS to be scheduled arrive after calling
> flush_workqueue(hbus->wq)?

It's possible, but that doesn't matter:

hv_pci_query_relations() is called only once, and it sets hbus->survey_event
to point to the stack variable 'comp'. The first survey_child_resources()
calls complete() for the 'comp' and sets hbus->survey_event to NULL.

When the second survey_child_resources() is called, hbus->survey_event
is NULL, so survey_child_resources() returns immediately.

According to my test, after hv_pci_enter_d0() posts PCI_BUS_D0ENTRY,
the guest receives a second PCI_BUS_RELATIONS2 message, which is
the same as the first PCI_BUS_RELATIONS2 message, which is basically
a no-op in pci_devices_present_work(), especially with the
newly-introduced per-hbus state_lock mutex.
RE: [EXTERNAL] [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()
Posted by Saurabh Singh Sengar 2 years, 10 months ago

> -----Original Message-----
> From: Dexuan Cui <decui@microsoft.com>
> Sent: Tuesday, March 28, 2023 10:21 AM
> To: bhelgaas@google.com; davem@davemloft.net; Dexuan Cui
> <decui@microsoft.com>; edumazet@google.com; Haiyang Zhang
> <haiyangz@microsoft.com>; Jake Oshins <jakeo@microsoft.com>;
> kuba@kernel.org; kw@linux.com; KY Srinivasan <kys@microsoft.com>;
> leon@kernel.org; linux-pci@vger.kernel.org; lpieralisi@kernel.org; Michael
> Kelley (LINUX) <mikelley@microsoft.com>; pabeni@redhat.com;
> robh@kernel.org; saeedm@nvidia.com; wei.liu@kernel.org; Long Li
> <longli@microsoft.com>; boqun.feng@gmail.com
> Cc: linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org; netdev@vger.kernel.org
> Subject: [EXTERNAL] [PATCH 1/6] PCI: hv: fix a race condition bug in
> hv_pci_query_relations()
> 
> Fix the longstanding race between hv_pci_query_relations() and
> survey_child_resources() by flushing the workqueue before we exit from
> hv_pci_query_relations().
> 
> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft
> Hyper-V VMs")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> 
> ---
>  drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> With the below debug code:
> 
> @@ -2103,6 +2103,8 @@ static void survey_child_resources(struct
> hv_pcibus_device *hbus)
>  	}
> 
>  	spin_unlock_irqrestore(&hbus->device_list_lock, flags);
> +	ssleep(15);
> +	printk("%s: completing %px\n", __func__, event);
>  	complete(event);
>  }
> 
> @@ -3305,8 +3307,12 @@ static int hv_pci_query_relations(struct hv_device
> *hdev)
> 
>  	ret = vmbus_sendpacket(hdev->channel, &message, sizeof(message),
>  			       0, VM_PKT_DATA_INBAND, 0);
> -	if (!ret)
> +	if (!ret) {
> +		ssleep(10); // unassign the PCI device on the host during the
> 10s
>  		ret = wait_for_response(hdev, &comp);
> +		printk("%s: comp=%px is becoming invalid! ret=%d\n",
> +			__func__, &comp, ret);
> +	}
> 
>  	return ret;
>  }
> @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> 
>  retry:
>  	ret = hv_pci_query_relations(hdev);
> +	printk("hv_pci_query_relations() exited\n");

Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).

> +
>  	if (ret)
>  		goto free_irq_domain;
> 
> I'm able to repro the below hang issue:
> 
> [   74.544744] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI VMBus
> probing: Using version 0x10004
> [   76.886944] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF slot
> 1 removed
> [   84.788266] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: The device is
> gone.
> [   84.792586] hv_pci_query_relations: comp=ffffa7504012fb58 is becoming
> invalid! ret=-19
> [   84.797505] hv_pci_query_relations() exited
> [   89.652268] survey_child_resources: completing ffffa7504012fb58
> [  150.392242] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  150.398447] rcu:     15-...0: (2 ticks this GP)
> idle=867c/1/0x4000000000000000 softirq=947/947 fqs=5234
> [  150.405851] rcu:     (detected by 14, t=15004 jiffies, g=2553, q=4833
> ncpus=16)
> [  150.410870] Sending NMI from CPU 14 to CPUs 15:
> [  150.414836] NMI backtrace for cpu 15
> [  150.414840] CPU: 15 PID: 10 Comm: kworker/u32:0 Tainted: G        W   E
> 6.3.0-rc3-decui-dirty #34
> ...
> [  150.414849] Workqueue: hv_pci_468b pci_devices_present_work
> [pci_hyperv] [  150.414866] RIP:
> 0010:__pv_queued_spin_lock_slowpath+0x10f/0x3c0
> ...
> [  150.414905] Call Trace:
> [  150.414907]  <TASK>
> [  150.414911]  _raw_spin_lock_irqsave+0x40/0x50 [  150.414917]
> complete+0x1d/0x60 [  150.414924]  pci_devices_present_work+0x5dd/0x680
> [pci_hyperv] [  150.414946]  process_one_work+0x21f/0x430 [  150.414952]
> worker_thread+0x4a/0x3c0
> 
> With this patch, the hang issue goes away:
> 
> [  186.143612] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: The device is
> gone.
> [  186.148034] hv_pci_query_relations: comp=ffffa7cfd0aa3b50 is becoming
> invalid! ret=-19 [  191.263611] survey_child_resources: completing
> ffffa7cfd0aa3b50 [  191.267732] hv_pci_query_relations() exited
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-
> hyperv.c
> index f33370b75628..b82c7cde19e6 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -3308,6 +3308,19 @@ static int hv_pci_query_relations(struct hv_device
> *hdev)
>  	if (!ret)
>  		ret = wait_for_response(hdev, &comp);
> 
> +	/*
> +	 * In the case of fast device addition/removal, it's possible that
> +	 * vmbus_sendpacket() or wait_for_response() returns -ENODEV but
> we
> +	 * already got a PCI_BUS_RELATIONS* message from the host and the
> +	 * channel callback already scheduled a work to hbus->wq, which can
> be
> +	 * running survey_child_resources() -> complete(&hbus-
> >survey_event),
> +	 * even after hv_pci_query_relations() exits and the stack variable
> +	 * 'comp' is no longer valid. This can cause a strange hang issue
> +	 * or sometimes a page fault. Flush hbus->wq before we exit from
> +	 * hv_pci_query_relations() to avoid the issues.
> +	 */
> +	flush_workqueue(hbus->wq);
> +
>  	return ret;
>  }
> 
> --
> 2.25.1
RE: [EXTERNAL] [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()
Posted by Dexuan Cui 2 years, 10 months ago
> From: Saurabh Singh Sengar <ssengar@microsoft.com>
> Sent: Monday, March 27, 2023 10:29 PM
> > ...
> > ---

Please note this special line "---". 
Anything after the special line and before the line "diff --git" is discarded
automaticaly by 'git' and 'patch'. 

> >  drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> >
> > @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> >
> >  retry:
> >  	ret = hv_pci_query_relations(hdev);
> > +	printk("hv_pci_query_relations() exited\n");
> 
> Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).

This is not part of the real patch :-)
I just thought the debug code can help understand the issues
resolved by the patches.
I'll remove the debug code to avoid confusion if I need to post v2.
Re: [EXTERNAL] [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()
Posted by Bjorn Helgaas 2 years, 10 months ago
On Tue, Mar 28, 2023 at 05:38:59AM +0000, Dexuan Cui wrote:
> > From: Saurabh Singh Sengar <ssengar@microsoft.com>
> > Sent: Monday, March 27, 2023 10:29 PM
> > > ...
> > > ---
> 
> Please note this special line "---". 
> Anything after the special line and before the line "diff --git" is discarded
> automaticaly by 'git' and 'patch'. 
> 
> > >  drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> > >  1 file changed, 13 insertions(+)
> > >
> > > @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> > >
> > >  retry:
> > >  	ret = hv_pci_query_relations(hdev);
> > > +	printk("hv_pci_query_relations() exited\n");
> > 
> > Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).
> 
> This is not part of the real patch :-)
> I just thought the debug code can help understand the issues
> resolved by the patches.
> I'll remove the debug code to avoid confusion if I need to post v2.

I guess that means you *will* post a v2, right?  Or do you expect
somebody else to remove the debug code?  If you do keep any debug or
other logging, use pci_info() (or dev_info()) whenever possible.

Also capitalize the subject line to match the others in the series.

Bjorn
RE: [EXTERNAL] [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()
Posted by Dexuan Cui 2 years, 10 months ago
> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Tuesday, March 28, 2023 10:25 AM
> To: Dexuan Cui <decui@microsoft.com>
> ...
> On Tue, Mar 28, 2023 at 05:38:59AM +0000, Dexuan Cui wrote:
> > > From: Saurabh Singh Sengar <ssengar@microsoft.com>
> > > Sent: Monday, March 27, 2023 10:29 PM
> > > > ...
> > > > ---
> >
> > Please note this special line "---".
> > Anything after the special line and before the line "diff --git" is discarded
> > automaticaly by 'git' and 'patch'.
> >
> > > >  drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> > > >  1 file changed, 13 insertions(+)
> > > >
> > > > @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> > > >
> > > >  retry:
> > > >  	ret = hv_pci_query_relations(hdev);
> > > > +	printk("hv_pci_query_relations() exited\n");
> > >
> > > Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).
> >
> > This is not part of the real patch :-)
> > I just thought the debug code can help understand the issues
> > resolved by the patches.
> > I'll remove the debug code to avoid confusion if I need to post v2.
> 
> I guess that means you *will* post a v2, right?  

I guess I didn't make myself clear, sorry. The "debug code" is not
part of the real patch body -- if we run the "patch" program or "git am"
to apply the patches, the "debug code" is automatically dropped because
it's between the special "---" line and the real start of the patch body (i.e.
the "diff --git" line). 

So far, IMO I don't have to post v2 because the patch body and the patch
description (except for the part that's automatically removed by 'patch'
and 'git') don't need any change.

> Or do you expect
> somebody else to remove the debug code?  If you do keep any debug or
> other logging, use pci_info() (or dev_info()) whenever possible.

As I explained above, 'patch' and 'git' automatically remove the part that
don't have to be in the git history.

> 
> Also capitalize the subject line to match the others in the series.
> 
> Bjorn

Thanks for catching this! If this is the only thing to be fixed, I hope the
PCI folks can help fix this when accepting the patch. If you think I should
post v2, please let me know. 
Re: [EXTERNAL] [PATCH 1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()
Posted by Bjorn Helgaas 2 years, 10 months ago
On Wed, Mar 29, 2023 at 08:37:20AM +0000, Dexuan Cui wrote:
> > From: Bjorn Helgaas <helgaas@kernel.org>
> > Sent: Tuesday, March 28, 2023 10:25 AM
> > To: Dexuan Cui <decui@microsoft.com>
> > ...
> > On Tue, Mar 28, 2023 at 05:38:59AM +0000, Dexuan Cui wrote:
> > > > From: Saurabh Singh Sengar <ssengar@microsoft.com>
> > > > Sent: Monday, March 27, 2023 10:29 PM
> > > > > ...
> > > > > ---
> > >
> > > Please note this special line "---".
> > > Anything after the special line and before the line "diff --git" is discarded
> > > automaticaly by 'git' and 'patch'.
> > >
> > > > >  drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> > > > >  1 file changed, 13 insertions(+)
> > > > >
> > > > > @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> > > > >
> > > > >  retry:
> > > > >  	ret = hv_pci_query_relations(hdev);
> > > > > +	printk("hv_pci_query_relations() exited\n");
> > > >
> > > > Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).
> > >
> > > This is not part of the real patch :-)
> > > I just thought the debug code can help understand the issues
> > > resolved by the patches.
> > > I'll remove the debug code to avoid confusion if I need to post v2.
> > 
> > I guess that means you *will* post a v2, right?  
> 
> I guess I didn't make myself clear, sorry. The "debug code" is not
> part of the real patch body -- if we run the "patch" program or "git am"
> to apply the patches, the "debug code" is automatically dropped because
> it's between the special "---" line and the real start of the patch body (i.e.
> the "diff --git" line). 
> 
> So far, IMO I don't have to post v2 because the patch body and the patch
> description (except for the part that's automatically removed by 'patch'
> and 'git') don't need any change.

Ah, sorry, I didn't notice that, even though you explicitly pointed it
out earlier.  No need to repost just to capitalize the subject,
Lorenzo can do it or not as he chooses.

Bjorn