[PATCH v2] cxl/core/port: defer endpoint probes when ACPI likely hasn't finished

Gregory Price posted 1 patch 1 month, 3 weeks ago
drivers/cxl/core/port.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
[PATCH v2] cxl/core/port: defer endpoint probes when ACPI likely hasn't finished
Posted by Gregory Price 1 month, 3 weeks ago
In cxl_acpi_probe, we add dports and uports to host bridges iteratively:
- bus_for_each_dev(adev->dev.bus, NULL, root_port, add_host_bridge_dport);
- bus_for_each_dev(adev->dev.bus, NULL, root_port, add_host_bridge_uport);

Simultaneously, as ports are probed, memdev endpoints can also be
probed. This creates a race condition, where an endpoint can perceive
its path to the root being broken in devm_cxl_enumerate_ports.

The memdev/endpoint probe will see a heirarchy that may look like:
    mem1
      parent => 0000:c1:00.0
        parent => 0000:c0:01.1
	  parent->parent => NULL

This results in find_cxl_port() returning NULL (since the port hasn't
been associated with the host bridge yet), and add_port_attach_ep
fails because the grandparent's grandparent is NULL.

When the latter condition is detected, the comments suggest:
    /*
     * The iteration reached the topology root without finding the
     * CXL-root 'cxl_port' on a previous iteration, fail for now to
     * be re-probed after platform driver attaches.
     */

This case results in an -ENXIO; however, a re-probe never occurs. Change
this return condition to -EPROBE_DEFER to explicitly cause a reprobe.

(v2: additional debug information)

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 drivers/cxl/core/port.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 1d5007e3795a..d6bebf70d142 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1553,7 +1553,7 @@ static int add_port_attach_ep(struct cxl_memdev *cxlmd,
 		 */
 		dev_dbg(&cxlmd->dev, "%s is a root dport\n",
 			dev_name(dport_dev));
-		return -ENXIO;
+		return -EPROBE_DEFER;
 	}
 
 	parent_port = find_cxl_port(dparent, &parent_dport);
-- 
2.43.0
Re: [PATCH v2] cxl/core/port: defer endpoint probes when ACPI likely hasn't finished
Posted by Dan Williams 1 month, 3 weeks ago
Gregory Price wrote:
> In cxl_acpi_probe, we add dports and uports to host bridges iteratively:
> - bus_for_each_dev(adev->dev.bus, NULL, root_port, add_host_bridge_dport);
> - bus_for_each_dev(adev->dev.bus, NULL, root_port, add_host_bridge_uport);
> 
> Simultaneously, as ports are probed, memdev endpoints can also be
> probed. This creates a race condition, where an endpoint can perceive
> its path to the root being broken in devm_cxl_enumerate_ports.
> 
> The memdev/endpoint probe will see a heirarchy that may look like:
>     mem1
>       parent => 0000:c1:00.0
>         parent => 0000:c0:01.1
> 	  parent->parent => NULL
> 
> This results in find_cxl_port() returning NULL (since the port hasn't
> been associated with the host bridge yet), and add_port_attach_ep
> fails because the grandparent's grandparent is NULL.
> 
> When the latter condition is detected, the comments suggest:
>     /*
>      * The iteration reached the topology root without finding the
>      * CXL-root 'cxl_port' on a previous iteration, fail for now to
>      * be re-probed after platform driver attaches.
>      */
> 
> This case results in an -ENXIO; however, a re-probe never occurs. Change
> this return condition to -EPROBE_DEFER to explicitly cause a reprobe.

Ok, thanks for the additional debug. Like we chatted on the CXL Discord
I think this is potentially pointing to a bug in bus_rescan_devices()
where it checks dev->driver without holding the lock.

Can you give this fix a try to see if it also resolves the issue?
Effectively, cxl_bus_rescan() is always needed in case the cxl_acpi
driver loads waaaay after deferred probing has given up, and if this
works then EPROBE_DEFER can remain limited to cases where it is
absolutely known that no other device_attach() kick is coming to save
the day.

-- 8< --
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 1d5007e3795a..6c0cd94888a3 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -2088,11 +2088,18 @@ static void cxl_bus_remove(struct device *dev)
 
 static struct workqueue_struct *cxl_bus_wq;
 
-static void cxl_bus_rescan_queue(struct work_struct *w)
+static int attach_device(struct device *dev, void *data)
 {
-	int rc = bus_rescan_devices(&cxl_bus_type);
+	int rc = device_attach(dev);
+
+	dev_vdbg(dev, "rescan: %s\n", rc ? "attach" : "detached");
 
-	pr_debug("CXL bus rescan result: %d\n", rc);
+	return 0;
+}
+
+static void cxl_bus_rescan_queue(struct work_struct *w)
+{
+	bus_for_each_dev(&cxl_bus_type, NULL, NULL, attach_device);
 }
 
 void cxl_bus_rescan(void)
Re: [PATCH v2] cxl/core/port: defer endpoint probes when ACPI likely hasn't finished
Posted by Gregory Price 1 month, 3 weeks ago
On Fri, Oct 04, 2024 at 05:08:03PM -0700, Dan Williams wrote:
> Gregory Price wrote:
> > In cxl_acpi_probe, we add dports and uports to host bridges iteratively:
> > - bus_for_each_dev(adev->dev.bus, NULL, root_port, add_host_bridge_dport);
> > - bus_for_each_dev(adev->dev.bus, NULL, root_port, add_host_bridge_uport);
> > 
> > Simultaneously, as ports are probed, memdev endpoints can also be
> > probed. This creates a race condition, where an endpoint can perceive
> > its path to the root being broken in devm_cxl_enumerate_ports.
> > 
> > The memdev/endpoint probe will see a heirarchy that may look like:
> >     mem1
> >       parent => 0000:c1:00.0
> >         parent => 0000:c0:01.1
> > 	  parent->parent => NULL
> > 
> > This results in find_cxl_port() returning NULL (since the port hasn't
> > been associated with the host bridge yet), and add_port_attach_ep
> > fails because the grandparent's grandparent is NULL.
> > 
> > When the latter condition is detected, the comments suggest:
> >     /*
> >      * The iteration reached the topology root without finding the
> >      * CXL-root 'cxl_port' on a previous iteration, fail for now to
> >      * be re-probed after platform driver attaches.
> >      */
> > 
> > This case results in an -ENXIO; however, a re-probe never occurs. Change
> > this return condition to -EPROBE_DEFER to explicitly cause a reprobe.
> 
> Ok, thanks for the additional debug. Like we chatted on the CXL Discord
> I think this is potentially pointing to a bug in bus_rescan_devices()
> where it checks dev->driver without holding the lock.
> 
> Can you give this fix a try to see if it also resolves the issue?
> Effectively, cxl_bus_rescan() is always needed in case the cxl_acpi
> driver loads waaaay after deferred probing has given up, and if this
> works then EPROBE_DEFER can remain limited to cases where it is
> absolutely known that no other device_attach() kick is coming to save
> the day.
> 

Funny enough, not only did it not work, but now i get neither endpoint lol

$ ls /sys/bus/cxl/devices/
decoder0.0  decoder1.0  decoder2.0  decoder3.1  mem0  port1  port3  root0
decoder0.1  decoder1.1  decoder3.0  decoder4.0  mem1  port2  port4

w/ reprobe patch

# ls /sys/bus/cxl/devices
decoder0.0  decoder1.0  decoder2.0  decoder3.1  decoder5.0  decoder6.0  endpoint5  mem0  port1  port3  root0
decoder0.1  decoder1.1  decoder3.0  decoder4.0  decoder5.1  decoder6.1  endpoint6  mem1  port2  port4

> -- 8< --
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 1d5007e3795a..6c0cd94888a3 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -2088,11 +2088,18 @@ static void cxl_bus_remove(struct device *dev)
>  
>  static struct workqueue_struct *cxl_bus_wq;
>  
> -static void cxl_bus_rescan_queue(struct work_struct *w)
> +static int attach_device(struct device *dev, void *data)
>  {
> -	int rc = bus_rescan_devices(&cxl_bus_type);
> +	int rc = device_attach(dev);
> +
> +	dev_vdbg(dev, "rescan: %s\n", rc ? "attach" : "detached");
>  
> -	pr_debug("CXL bus rescan result: %d\n", rc);
> +	return 0;
> +}
> +
> +static void cxl_bus_rescan_queue(struct work_struct *w)
> +{
> +	bus_for_each_dev(&cxl_bus_type, NULL, NULL, attach_device);
>  }
>  
>  void cxl_bus_rescan(void)
Re: [PATCH v2] cxl/core/port: defer endpoint probes when ACPI likely hasn't finished
Posted by Dan Williams 1 month, 2 weeks ago
Gregory Price wrote:
> On Fri, Oct 04, 2024 at 05:08:03PM -0700, Dan Williams wrote:
> > Gregory Price wrote:
> > > In cxl_acpi_probe, we add dports and uports to host bridges iteratively:
> > > - bus_for_each_dev(adev->dev.bus, NULL, root_port, add_host_bridge_dport);
> > > - bus_for_each_dev(adev->dev.bus, NULL, root_port, add_host_bridge_uport);
> > > 
> > > Simultaneously, as ports are probed, memdev endpoints can also be
> > > probed. This creates a race condition, where an endpoint can perceive
> > > its path to the root being broken in devm_cxl_enumerate_ports.
> > > 
> > > The memdev/endpoint probe will see a heirarchy that may look like:
> > >     mem1
> > >       parent => 0000:c1:00.0
> > >         parent => 0000:c0:01.1
> > > 	  parent->parent => NULL
> > > 
> > > This results in find_cxl_port() returning NULL (since the port hasn't
> > > been associated with the host bridge yet), and add_port_attach_ep
> > > fails because the grandparent's grandparent is NULL.
> > > 
> > > When the latter condition is detected, the comments suggest:
> > >     /*
> > >      * The iteration reached the topology root without finding the
> > >      * CXL-root 'cxl_port' on a previous iteration, fail for now to
> > >      * be re-probed after platform driver attaches.
> > >      */
> > > 
> > > This case results in an -ENXIO; however, a re-probe never occurs. Change
> > > this return condition to -EPROBE_DEFER to explicitly cause a reprobe.
> > 
> > Ok, thanks for the additional debug. Like we chatted on the CXL Discord
> > I think this is potentially pointing to a bug in bus_rescan_devices()
> > where it checks dev->driver without holding the lock.
> > 
> > Can you give this fix a try to see if it also resolves the issue?
> > Effectively, cxl_bus_rescan() is always needed in case the cxl_acpi
> > driver loads waaaay after deferred probing has given up, and if this
> > works then EPROBE_DEFER can remain limited to cases where it is
> > absolutely known that no other device_attach() kick is coming to save
> > the day.
> > 
> 
> Funny enough, not only did it not work, but now i get neither endpoint lol
> 
> $ ls /sys/bus/cxl/devices/
> decoder0.0  decoder1.0  decoder2.0  decoder3.1  mem0  port1  port3  root0
> decoder0.1  decoder1.1  decoder3.0  decoder4.0  mem1  port2  port4
> 
> w/ reprobe patch
> 
> # ls /sys/bus/cxl/devices
> decoder0.0  decoder1.0  decoder2.0  decoder3.1  decoder5.0  decoder6.0  endpoint5  mem0  port1  port3  root0
> decoder0.1  decoder1.1  decoder3.0  decoder4.0  decoder5.1  decoder6.1  endpoint6  mem1  port2  port4

Such a violent result is interesting! While I would have preferred an
"all fixed!" version of "interesting", making something fail reliably and
completely is at least indication that the starting point was more
fragile than expected.

Now, I tried to get cxl_test to fail by probing memdevs asynchronously
alongside the ACPI root driver. That did reveal a use-after-free bug
when out-of-order shutdown causes some cleanup to be skipped, will send
a separate fixup for that, but it failed to reproduce the bug you are
seeing.

The incremental fix here, that applies on top of the device_attach()
fixup, is to make sure that all cxl_port instances registered by
cxl_acpi_probe() are active before cxl_bus_rescan() runs. That can only
be guaranteed when the cxl_port driver is pre-loaded.

If you are running from an initial ram disk make sure cxl_port.ko is
included there...

-- 8< --
diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index 82b78e331d8e..432b7cfd12a8 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -924,6 +924,13 @@ static void __exit cxl_acpi_exit(void)
 
 /* load before dax_hmem sees 'Soft Reserved' CXL ranges */
 subsys_initcall(cxl_acpi_init);
+
+/*
+ * Arrange for host-bridge ports to be active synchronous with
+ * cxl_acpi_probe() exit.
+ */
+MODULE_SOFTDEP("pre: cxl_port");
+
 module_exit(cxl_acpi_exit);
 MODULE_DESCRIPTION("CXL ACPI: Platform Support");
 MODULE_LICENSE("GPL v2");
Re: [PATCH v2] cxl/core/port: defer endpoint probes when ACPI likely hasn't finished
Posted by Gregory Price 1 month, 2 weeks ago
On Tue, Oct 08, 2024 at 03:37:45PM -0700, Dan Williams wrote:
> Gregory Price wrote:
> > On Fri, Oct 04, 2024 at 05:08:03PM -0700, Dan Williams wrote:
> > > Gregory Price wrote:
> > > > In cxl_acpi_probe, we add dports and uports to host bridges iteratively:
> > > > - bus_for_each_dev(adev->dev.bus, NULL, root_port, add_host_bridge_dport);
> > > > - bus_for_each_dev(adev->dev.bus, NULL, root_port, add_host_bridge_uport);
> > > > 
> > > > Simultaneously, as ports are probed, memdev endpoints can also be
> > > > probed. This creates a race condition, where an endpoint can perceive
> > > > its path to the root being broken in devm_cxl_enumerate_ports.
> > > > 
> > > > The memdev/endpoint probe will see a heirarchy that may look like:
> > > >     mem1
> > > >       parent => 0000:c1:00.0
> > > >         parent => 0000:c0:01.1
> > > > 	  parent->parent => NULL
> > > > 
> > > > This results in find_cxl_port() returning NULL (since the port hasn't
> > > > been associated with the host bridge yet), and add_port_attach_ep
> > > > fails because the grandparent's grandparent is NULL.
> > > > 
> > > > When the latter condition is detected, the comments suggest:
> > > >     /*
> > > >      * The iteration reached the topology root without finding the
> > > >      * CXL-root 'cxl_port' on a previous iteration, fail for now to
> > > >      * be re-probed after platform driver attaches.
> > > >      */
> > > > 
> > > > This case results in an -ENXIO; however, a re-probe never occurs. Change
> > > > this return condition to -EPROBE_DEFER to explicitly cause a reprobe.
> > > 
> > > Ok, thanks for the additional debug. Like we chatted on the CXL Discord
> > > I think this is potentially pointing to a bug in bus_rescan_devices()
> > > where it checks dev->driver without holding the lock.
> > > 
> > > Can you give this fix a try to see if it also resolves the issue?
> > > Effectively, cxl_bus_rescan() is always needed in case the cxl_acpi
> > > driver loads waaaay after deferred probing has given up, and if this
> > > works then EPROBE_DEFER can remain limited to cases where it is
> > > absolutely known that no other device_attach() kick is coming to save
> > > the day.
> > > 
> > 
> > Funny enough, not only did it not work, but now i get neither endpoint lol
> > 
> > $ ls /sys/bus/cxl/devices/
> > decoder0.0  decoder1.0  decoder2.0  decoder3.1  mem0  port1  port3  root0
> > decoder0.1  decoder1.1  decoder3.0  decoder4.0  mem1  port2  port4
> > 
> > w/ reprobe patch
> > 
> > # ls /sys/bus/cxl/devices
> > decoder0.0  decoder1.0  decoder2.0  decoder3.1  decoder5.0  decoder6.0  endpoint5  mem0  port1  port3  root0
> > decoder0.1  decoder1.1  decoder3.0  decoder4.0  decoder5.1  decoder6.1  endpoint6  mem1  port2  port4
> 
> Such a violent result is interesting! While I would have preferred an
> "all fixed!" version of "interesting", making something fail reliably and
> completely is at least indication that the starting point was more
> fragile than expected.
> 
> Now, I tried to get cxl_test to fail by probing memdevs asynchronously
> alongside the ACPI root driver. That did reveal a use-after-free bug
> when out-of-order shutdown causes some cleanup to be skipped, will send
> a separate fixup for that, but it failed to reproduce the bug you are
> seeing.
> 
> The incremental fix here, that applies on top of the device_attach()
> fixup, is to make sure that all cxl_port instances registered by
> cxl_acpi_probe() are active before cxl_bus_rescan() runs. That can only
> be guaranteed when the cxl_port driver is pre-loaded.
> 

:< unfortunately the result is... the same but also different? 

the worst kind of result

# ls /sys/bus/cxl/devices/
decoder0.0  decoder0.2  decoder2.0  decoder3.1  mem0  port1  port3  root0
decoder0.1  decoder1.0  decoder3.0  decoder4.0  mem1  port2  port4

apparently sometimes we get decoder0.2 and sometimes we get decoder1.1

after a reboot we get it the other way

# ls /sys/bus/cxl/devices/
decoder0.0  decoder0.2  decoder2.0  decoder3.1  mem0  port1  port3  root0
decoder0.1  decoder1.0  decoder3.0  decoder4.0  mem1  port2  port4

in my experimental kernel where everything "works" I get something like

# ls /sys/bus/cxl/devices/
dax_region0  decoder0.0  decoder1.0  decoder3.1  decoder5.1  endpoint5  mem1   port3    region1
dax_region1  decoder0.1  decoder2.0  decoder4.0  decoder6.0  endpoint6  port1  port4    region2
dax_region2  decoder0.2  decoder3.0  decoder5.0  decoder6.1  mem0       port2  region0  root0

which does not have decoder1.1 - but i haven't confirmed whether this is
consistent or not.  I don't know what causes 0.N vs 1.M, maybe you do?

but in the first result (past email) you can see the reprobe patch also generated
decoder1.1 instead of decoder0.2

probably not related, more fun side quests!

Regardless, the additional patch did not resolve the problem :<

> If you are running from an initial ram disk make sure cxl_port.ko is
> included there...
> 
> -- 8< --
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index 82b78e331d8e..432b7cfd12a8 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -924,6 +924,13 @@ static void __exit cxl_acpi_exit(void)
>  
>  /* load before dax_hmem sees 'Soft Reserved' CXL ranges */
>  subsys_initcall(cxl_acpi_init);
> +
> +/*
> + * Arrange for host-bridge ports to be active synchronous with
> + * cxl_acpi_probe() exit.
> + */
> +MODULE_SOFTDEP("pre: cxl_port");
> +
>  module_exit(cxl_acpi_exit);
>  MODULE_DESCRIPTION("CXL ACPI: Platform Support");
>  MODULE_LICENSE("GPL v2");