[PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs

Posted by Manivannan Sadhasivam 1 week, 4 days ago
The Sandisk SN740 NVMe SSDs cause below AER errors on the upstream Root
Port of PCIe controller in Microsoft Surface Laptop 7, when ASPM L1 is
enabled:

  pcieport 0006:00:00.0: AER: Correctable error message received from 0006:01:00.0
  nvme 0006:01:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
  nvme 0006:01:00.0:   device [15b7:5015] error status/mask=00000001/0000e000
  nvme 0006:01:00.0:    [ 0] RxErr

Hence, add a quirk to disable L1 by removing the ASPM_L1 CAP for this SSD.

Reported-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
---

Changes in v2:

* Fixed the laptop name
* Rebased on top of v6.18-rc6 for pcie_aspm_remove_cap()
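
Editorial note on reading the log quoted in the commit message: the line
"error status/mask=00000001/0000e000" carries the AER Correctable Error Status
and Mask registers. Bit 0 is Receiver Error, which matches the "[ 0] RxErr"
line; the mask 0xe000 covers bits 13 to 15. Replay Timer Timeout would be bit
12, which is not set here. The following is a minimal userspace sketch of that
decoding, not kernel code. The bit values follow include/uapi/linux/pci_regs.h;
decode_cor_status() is a hypothetical helper, the labels other than "RxErr" are
illustrative, and skipping masked bits is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Correctable Error Status bits (values as in include/uapi/linux/pci_regs.h) */
#define PCI_ERR_COR_RCVR	0x00000001	/* Receiver Error */
#define PCI_ERR_COR_BAD_TLP	0x00000040	/* Bad TLP */
#define PCI_ERR_COR_BAD_DLLP	0x00000080	/* Bad DLLP */
#define PCI_ERR_COR_REP_ROLL	0x00000100	/* REPLAY_NUM Rollover */
#define PCI_ERR_COR_REP_TIMER	0x00001000	/* Replay Timer Timeout */
#define PCI_ERR_COR_ADV_NFAT	0x00002000	/* Advisory Non-Fatal */
#define PCI_ERR_COR_INTERNAL	0x00004000	/* Corrected Internal Error */
#define PCI_ERR_COR_LOG_OVER	0x00008000	/* Header Log Overflow */

struct cor_bit {
	uint32_t bit;
	const char *label;	/* illustrative label, chosen for this sketch */
};

static const struct cor_bit cor_bits[] = {
	{ PCI_ERR_COR_RCVR,      "RxErr" },
	{ PCI_ERR_COR_BAD_TLP,   "BadTLP" },
	{ PCI_ERR_COR_BAD_DLLP,  "BadDLLP" },
	{ PCI_ERR_COR_REP_ROLL,  "Rollover" },
	{ PCI_ERR_COR_REP_TIMER, "Timeout" },
	{ PCI_ERR_COR_ADV_NFAT,  "NonFatalErr" },
	{ PCI_ERR_COR_INTERNAL,  "CorrIntErr" },
	{ PCI_ERR_COR_LOG_OVER,  "HeaderOF" },
};

/*
 * Hypothetical helper: write the space-separated labels of all bits that are
 * set in @status and not set in @mask into @buf, and return how many were
 * written. Labels that no longer fit in @buf are dropped.
 */
static int decode_cor_status(uint32_t status, uint32_t mask,
			     char *buf, size_t len)
{
	uint32_t active = status & ~mask;
	size_t used = 0;
	int count = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < sizeof(cor_bits) / sizeof(cor_bits[0]); i++) {
		const char *label = cor_bits[i].label;
		size_t n = strlen(label);
		size_t sep = count ? 1 : 0;

		if (!(active & cor_bits[i].bit))
			continue;
		if (used + sep + n + 1 > len)
			break;	/* out of room: keep what already fits */
		if (sep)
			buf[used++] = ' ';
		memcpy(buf + used, label, n);
		used += n;
		buf[used] = '\0';
		count++;
	}
	return count;
}
```

Calling decode_cor_status(0x00000001, 0x0000e000, buf, sizeof(buf)) with the
values from the log leaves "RxErr" in buf and returns 1.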

 drivers/pci/quirks.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index b9c252aa6fe0..adc54533df7f 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -2527,6 +2527,17 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_FREESCALE, 0x0451, quirk_disable_aspm_l0s
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_PASEMI, 0xa002, quirk_disable_aspm_l0s_l1);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_HUAWEI, 0x1105, quirk_disable_aspm_l0s_l1);
 
+static void quirk_disable_aspm_l1(struct pci_dev *dev)
+{
+	pcie_aspm_remove_cap(dev, PCI_EXP_LNKCAP_ASPM_L1);
+}
+
+/*
+ * Sandisk SN740 NVMe SSDs cause AER timeout errors on the upstream PCIe Root
+ * Port when ASPM L1 is enabled.
+ */
+DECLARE_PCI_FIXUP_HEADER(0x15b7, 0x5015, quirk_disable_aspm_l1);
+
 /*
  * Some Pericom PCIe-to-PCI bridges in reverse mode need the PCIe Retrain
  * Link bit cleared after starting the link retrain process to allow this
-- 
2.48.1
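
Editorial note: the quirk above passes PCI_EXP_LNKCAP_ASPM_L1 to
pcie_aspm_remove_cap() so that L1 is treated as unsupported for this device.
The sketch below is a userspace model of that effect on a Link Capabilities
value only; it makes no claim about how pcie_aspm_remove_cap() is implemented
in the kernel. The bit values follow include/uapi/linux/pci_regs.h, while
lnkcap_remove_aspm() and the two query helpers are hypothetical names
introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Link Capabilities ASPM Support field (values as in pci_regs.h) */
#define PCI_EXP_LNKCAP_ASPMS	0x00000c00	/* whole ASPM Support field */
#define PCI_EXP_LNKCAP_ASPM_L0S	0x00000400	/* L0s supported */
#define PCI_EXP_LNKCAP_ASPM_L1	0x00000800	/* L1 supported */

/*
 * Hypothetical model: return @lnkcap with the ASPM support bits in @remove
 * cleared. Only bits inside the ASPM Support field may be removed.
 */
static uint32_t lnkcap_remove_aspm(uint32_t lnkcap, uint32_t remove)
{
	assert((remove & ~(uint32_t)PCI_EXP_LNKCAP_ASPMS) == 0);
	return lnkcap & ~remove;
}

/* Does this (possibly adjusted) Link Capabilities value advertise L1? */
static int lnkcap_supports_l1(uint32_t lnkcap)
{
	return (lnkcap & PCI_EXP_LNKCAP_ASPM_L1) != 0;
}

/* Does this (possibly adjusted) Link Capabilities value advertise L0s? */
static int lnkcap_supports_l0s(uint32_t lnkcap)
{
	return (lnkcap & PCI_EXP_LNKCAP_ASPM_L0S) != 0;
}
```

Removing only PCI_EXP_LNKCAP_ASPM_L1 leaves the L0s bit untouched, which is the
difference from the existing quirk_disable_aspm_l0s_l1 entries visible in the
diff context.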
Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs
Posted by Bjorn Helgaas 1 week ago
[+cc Alexey, Jeffrey, Avinash]

On Thu, Nov 20, 2025 at 09:42:53PM +0530, Manivannan Sadhasivam wrote:
> The Sandisk SN740 NVMe SSDs cause below AER errors on the upstream Root
> Port of PCIe controller in Microsoft Surface Laptop 7, when ASPM L1 is
> enabled:
> 
>   pcieport 0006:00:00.0: AER: Correctable error message received from 0006:01:00.0
>   nvme 0006:01:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
>   nvme 0006:01:00.0:   device [15b7:5015] error status/mask=00000001/0000e000
>   nvme 0006:01:00.0:    [ 0] RxErr

Do we have any information about whether this error happens with the
SN740 on platforms other than the Surface Laptop 7?  Or whether it
happens on the Surface with other endpoints?

I'm a little hesitant about quirking devices and claiming they are
defective without a solid root cause.

Sandisk folks, do you have any insight into this?  Any known errata or
possibility of looking into this with an analyzer?

> Hence, add a quirk to disable L1 by removing the ASPM_L1 CAP for this SSD.
> 
> Reported-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
> Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
> ---
> 
> Changes in v2:
> 
> * Fixed the laptop name
> * Rebased on top of v6.18-rc6 for pcie_aspm_remove_cap()
> 
>  drivers/pci/quirks.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index b9c252aa6fe0..adc54533df7f 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -2527,6 +2527,17 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_FREESCALE, 0x0451, quirk_disable_aspm_l0s
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_PASEMI, 0xa002, quirk_disable_aspm_l0s_l1);
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_HUAWEI, 0x1105, quirk_disable_aspm_l0s_l1);
>  
> +static void quirk_disable_aspm_l1(struct pci_dev *dev)
> +{
> +	pcie_aspm_remove_cap(dev, PCI_EXP_LNKCAP_ASPM_L1);
> +}
> +
> +/*
> + * Sandisk SN740 NVMe SSDs cause AER timeout errors on the upstream PCIe Root
> + * Port when ASPM L1 is enabled.
> + */
> +DECLARE_PCI_FIXUP_HEADER(0x15b7, 0x5015, quirk_disable_aspm_l1);
Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs
Posted by Manivannan Sadhasivam 6 days, 20 hours ago
On Mon, Nov 24, 2025 at 05:53:07PM -0600, Bjorn Helgaas wrote:
> [+cc Alexey, Jeffrey, Avinash]
> 
> On Thu, Nov 20, 2025 at 09:42:53PM +0530, Manivannan Sadhasivam wrote:
> > The Sandisk SN740 NVMe SSDs cause below AER errors on the upstream Root
> > Port of PCIe controller in Microsoft Surface Laptop 7, when ASPM L1 is
> > enabled:
> > 
> >   pcieport 0006:00:00.0: AER: Correctable error message received from 0006:01:00.0
> >   nvme 0006:01:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
> >   nvme 0006:01:00.0:   device [15b7:5015] error status/mask=00000001/0000e000
> >   nvme 0006:01:00.0:    [ 0] RxErr
> 
> Do we have any information about whether this error happens with the
> SN740 on platforms other than the Surface Laptop 7?  Or whether it
> happens on the Surface with other endpoints?
> 

I believe this device comes preinstalled in the Surface Laptop 7. It is not
very convenient to replace the NVMe drive in a laptop for testing.

> I'm a little hesitant about quirking devices and claiming they are
> defective without a solid root cause.
> 

There are a few points that convinced me:

* Other X1E laptops are working fine with ASPM L1.
* This laptop has a WCN785x WiFi/BT combo card connected to the other
controller instance, and L1 is working fine for it.
* There is no known issue with ASPM L1 in X1E chipsets.

Because of these, I was quite certain that the NVMe is at fault here.

> Sandisk folks, do you have any insight into this?  Any known errata or
> possibility of looking into this with an analyzer?
> 

I don't think Konrad has access to an analyzer, and neither do any of us.

If you are still hesitant, I'd suggest adding a platform check so that this
quirk is limited to the Surface Laptop 7:

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index adc54533df7f..1655757ba66a 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -29,6 +29,7 @@
 #include <linux/ktime.h>
 #include <linux/mm.h>
 #include <linux/nvme.h>
+#include <linux/of.h>
 #include <linux/platform_data/x86/apple.h>
 #include <linux/pm_runtime.h>
 #include <linux/sizes.h>
@@ -2527,15 +2528,19 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_FREESCALE, 0x0451, quirk_disable_aspm_l0s
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_PASEMI, 0xa002, quirk_disable_aspm_l0s_l1);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_HUAWEI, 0x1105, quirk_disable_aspm_l0s_l1);
 
+/*
+ * Sandisk SN740 NVMe SSDs in Microsoft Surface Laptop 7 cause AER timeout
+ * errors on the upstream PCIe Root Port when ASPM L1 is enabled.
+ */
 static void quirk_disable_aspm_l1(struct pci_dev *dev)
 {
-       pcie_aspm_remove_cap(dev, PCI_EXP_LNKCAP_ASPM_L1);
+       struct device_node *root __free(device_node) = of_find_node_by_path("/");
+       const char *model = of_get_property(root, "compatible", NULL);
+
+       if (!strcmp(model, "microsoft,romulus13"))
+               pcie_aspm_remove_cap(dev, PCI_EXP_LNKCAP_ASPM_L1);
 }
 
-/*
- * Sandisk SN740 NVMe SSDs cause AER timeout errors on the upstream PCIe Root
- * Port when ASPM L1 is enabled.
- */
 DECLARE_PCI_FIXUP_HEADER(0x15b7, 0x5015, quirk_disable_aspm_l1);
 
 /*

This check is similar to the DMI checks we currently have for non-DT platforms.
In fact, we could also use DMI checks on this laptop, as it comes with SMBIOS.

Note: I'm not sure if Konrad's laptop is based on "microsoft,romulus13" or
"microsoft,romulus15".
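
Editorial note on the platform check in the diff above: a devicetree
"compatible" property is a list of NUL-terminated strings packed back to back,
so strcmp() on the raw property value compares only the first entry, and the
pointer would presumably be NULL on a system booted without a devicetree. In
kernel code the existing of_machine_is_compatible() helper may be the more
natural fit for matching the root node. The following is only a userspace
sketch of the matching logic, covering both board names mentioned in the note
above; compat_list_has() and is_surface_laptop7() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of matching one entry in a devicetree "compatible"
 * property. @prop points at @len bytes holding NUL-terminated strings packed
 * back to back. Returns 1 if any complete entry equals @wanted, else 0.
 */
static int compat_list_has(const char *prop, size_t len, const char *wanted)
{
	size_t off = 0;

	if (!prop || !wanted)
		return 0;	/* no root node or no property: no match */

	while (off < len) {
		const char *entry = prop + off;
		const char *end = memchr(entry, '\0', len - off);

		if (!end)
			break;	/* unterminated trailing bytes: ignore them */
		if (strcmp(entry, wanted) == 0)
			return 1;
		off += (size_t)(end - entry) + 1;
	}
	return 0;
}

/* Match either of the two board names mentioned in the thread. */
static int is_surface_laptop7(const char *prop, size_t len)
{
	return compat_list_has(prop, len, "microsoft,romulus13") ||
	       compat_list_has(prop, len, "microsoft,romulus15");
}
```

Walking the whole list keeps the check working even if the board string is not
the first entry, and the NULL test keeps it safe on machines without a
devicetree.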

- Mani

-- 
மணிவண்ணன் சதாசிவம்
Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs
Posted by Val Packett 18 hours ago
On 11/25/25 2:21 AM, Manivannan Sadhasivam wrote:
> [..]
> There are a couple of points that made me convince myself:
>
> * Other X1E laptops are working fine with ASPM L1.
> * This laptop has WCN785x WiFi/BT combo card connected to the other controller
> instance and L1 is working fine for it.
> * There is no known issue with ASPM L1 in X1E chipsets.
>
> Because of these, I was so certain that the NVMe is the fault here.

There is *a* known issue with ASPM L1 on X1E, reported by maaaany users 
on #aarch64-laptops, that we discussed in another thread..

But it is a full system freeze, **not** a correctable AER message, and 
it definitely happens with a bunch of various SSDs on various laptops. I 
personally have had it happen both with the SN740 and an SK Hynix drive, 
on a Latitude 7455. It's an SSD-only issue (disabling ASPM just for the 
drive, but keeping it on for the WiFi, was enough to get to month-long 
uptime) but not specific to any SSD model.

One bit of news I have about it is that I recently started using EL2 
(slbounce), and I did see something that looked like that hang.. but 
unlike in EL1, right before the reboot the panic LED did start blinking. 
So if that was indeed from the same issue, I should now be able to catch 
it into pstore (if pstore works.. trying blk with sdhc instead of efi 
now 0.o) Maybe QHEE was eating the fault and itself crashing, since it 
"owns" the PCIe IOMMU when it's running.. (???)

~val
Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs
Posted by Manivannan Sadhasivam 14 hours ago
On Mon, Dec 01, 2025 at 03:48:13AM -0300, Val Packett wrote:
> 
> On 11/25/25 2:21 AM, Manivannan Sadhasivam wrote:
> > [..]
> > There are a couple of points that made me convince myself:
> > 
> > * Other X1E laptops are working fine with ASPM L1.
> > * This laptop has WCN785x WiFi/BT combo card connected to the other controller
> > instance and L1 is working fine for it.
> > * There is no known issue with ASPM L1 in X1E chipsets.
> > 
> > Because of these, I was so certain that the NVMe is the fault here.
> 
> There is *a* known issue with ASPM L1 on X1E, reported by maaaany users on
> #aarch64-laptops, that we discussed in another thread..
> 

The other thread you are referring to is this one I believe:
https://lore.kernel.org/linux-pci/21398de7-3dd9-4c43-97d9-7c3002c401e5@packett.cool/

From this, I cannot conclude that the controller was at fault. At least, not so
far.

> But it is a full system freeze, **not** a correctable AER message, and it
> definitely happens with a bunch of various SSDs on various laptops. I
> personally have had it happen both with the SN740 and an SK Hynix drive, on
> a Latitude 7455. It's an SSD-only issue (disabling ASPM just for the drive,
> but keeping it on for the WiFi, was enough to get to month-long uptime) but
> not specific to any SSD model.
> 

Please confirm whether you disabled all ASPM states (L0s, L1 and L1ss) or just
L1ss for the controller instance where the SSD is connected. Starting from
v6.18-rc3, only L0s and L1 are enabled by default without any cmdline/Kconfig
changes.

> One bit of news I have about it is that I recently started using EL2
> (slbounce), and I did see something that looked like that hang.. but unlike
> in EL1, right before the reboot the panic LED did start blinking. So if that
> was indeed from the same issue, I should now be able to catch it into pstore
> (if pstore works.. trying blk with sdhc instead of efi now 0.o)

That would be helpful. I guess Abel did it on the XPS13, but I need to check
further.

> Maybe QHEE
> was eating the fault and itself crashing, since it "owns" the PCIe IOMMU
> when it's running.. (???)
> 

Yes, they are all captured by QHEE for post-mortem analysis, which can only be
performed using Qcom tools and on non-production devices. I don't know how to
capture those logs on production laptops.

Anyhow, to isolate this issue to ASPM L1 on the X1E PCIe controller, please
disable all ASPM states by selecting CONFIG_PCIEASPM_PERFORMANCE in Kconfig and
let it run. If you do not see the crash at all for some time (or days), then the
crash was related to an ASPM issue in the controller (since you said the crash
was reproducible with other SSDs as well). If not, there is something else going
wrong.
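
Editorial note: one way to confirm which of L0s and L1 are actually enabled on
a link during such a test is to look at the ASPM Control field, bits 1:0 of the
PCIe Link Control register. The sketch below decodes a raw 16-bit Link Control
value in userspace. The bit values follow include/uapi/linux/pci_regs.h; how
the value is read from the device is left out, and struct aspm_enabled and
lnkctl_aspm_enabled() are hypothetical names. L1 substates (L1ss) are
controlled through a separate capability and are not covered here.

```c
#include <assert.h>
#include <stdint.h>

/* Link Control ASPM Control field (values as in pci_regs.h) */
#define PCI_EXP_LNKCTL_ASPMC	0x0003	/* ASPM Control field */
#define PCI_EXP_LNKCTL_ASPM_L0S	0x0001	/* L0s enable */
#define PCI_EXP_LNKCTL_ASPM_L1	0x0002	/* L1 enable */

/* The two enable bits together make up the whole ASPM Control field. */
static_assert((PCI_EXP_LNKCTL_ASPM_L0S | PCI_EXP_LNKCTL_ASPM_L1) ==
	      PCI_EXP_LNKCTL_ASPMC, "ASPM Control is a two-bit field");

/* Hypothetical result type: 1 means the state is enabled on the link. */
struct aspm_enabled {
	int l0s;
	int l1;
};

/* Decode the ASPM Control field of a raw Link Control register value. */
static struct aspm_enabled lnkctl_aspm_enabled(uint16_t lnkctl)
{
	struct aspm_enabled st;

	st.l0s = (lnkctl & PCI_EXP_LNKCTL_ASPM_L0S) != 0;
	st.l1 = (lnkctl & PCI_EXP_LNKCTL_ASPM_L1) != 0;
	return st;
}
```

With all ASPM states disabled, both fields should read 0 for the Root Port and
for the SSD.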

- Mani

-- 
மணிவண்ணன் சதாசிவம்
Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs
Posted by Konrad Dybcio 4 days, 6 hours ago
On 11/25/25 6:21 AM, Manivannan Sadhasivam wrote:
> On Mon, Nov 24, 2025 at 05:53:07PM -0600, Bjorn Helgaas wrote:
>> [+cc Alexey, Jeffrey, Avinash]
>>
>> On Thu, Nov 20, 2025 at 09:42:53PM +0530, Manivannan Sadhasivam wrote:
>>> The Sandisk SN740 NVMe SSDs cause below AER errors on the upstream Root
>>> Port of PCIe controller in Microsoft Surface Laptop 7, when ASPM L1 is
>>> enabled:
>>>
>>>   pcieport 0006:00:00.0: AER: Correctable error message received from 0006:01:00.0
>>>   nvme 0006:01:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
>>>   nvme 0006:01:00.0:   device [15b7:5015] error status/mask=00000001/0000e000
>>>   nvme 0006:01:00.0:    [ 0] RxErr
>>
>> Do we have any information about whether this error happens with the
>> SN740 on platforms other than the Surface Laptop 7?  Or whether it
>> happens on the Surface with other endpoints?

[...]

> -/*
> - * Sandisk SN740 NVMe SSDs cause AER timeout errors on the upstream PCIe Root
> - * Port when ASPM L1 is enabled.
> - */
>  DECLARE_PCI_FIXUP_HEADER(0x15b7, 0x5015, quirk_disable_aspm_l1);
>  
>  /*
> 
> This check is similar to the DMI checks we currently have for non-DT platforms.
> In fact, we could also use DMI checks on this laptop, as it comes with SMBIOS.
> 
> Note: I'm not sure if Konrad's laptop is based on "microsoft,romulus13" or
> "microsoft,romulus15".

15, but they're otherwise identical hardware. Please quirk off both if you
go this route.

Konrad
Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs
Posted by Konrad Dybcio 1 week ago
On 11/20/25 5:12 PM, Manivannan Sadhasivam wrote:
> The Sandisk SN740 NVMe SSDs cause below AER errors on the upstream Root
> Port of PCIe controller in Microsoft Surface Laptop 7, when ASPM L1 is
> enabled:
> 
>   pcieport 0006:00:00.0: AER: Correctable error message received from 0006:01:00.0
>   nvme 0006:01:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
>   nvme 0006:01:00.0:   device [15b7:5015] error status/mask=00000001/0000e000
>   nvme 0006:01:00.0:    [ 0] RxErr
> 
> Hence, add a quirk to disable L1 by removing the ASPM_L1 CAP for this SSD.
> 
> Reported-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
> Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
> ---

This revision also works for me, thank you

Tested-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> # X1E80100 Romulus

Konrad