From: Johnny-CC Chang <Johnny-CC.Chang@mediatek.com>
Nvidia GB10 PCIe hosts will encounter problem occasionally
after SBR(secondary bus reset) is applied.
Enable NO_BUS_RESET quirk for Nvidia GB10 PCIe hosts.
Signed-off-by: Johnny-CC Chang <Johnny-CC.Chang@mediatek.com>
---
drivers/pci/quirks.c | 11 +++++++++++
include/linux/pci_ids.h | 2 ++
2 files changed, 13 insertions(+)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index b94264cd3833..12a10fa84c8a 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3746,6 +3746,17 @@ static void quirk_no_bus_reset(struct pci_dev *dev)
 	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
 }
 
+/*
+ * Nvidia GB10 PCIe hosts will encounter problem occasionally
+ * after SBR (secondary bus reset) is applied.
+ * SBR needs to be prevented for these PCIe hosts.
+ */
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_GB10_GEN5_X4,
+			 quirk_no_bus_reset);
+
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_GB10_GEN4_X1,
+			 quirk_no_bus_reset);
+
 /*
  * Some NVIDIA GPU devices do not work with bus reset, SBR needs to be
  * prevented for those affected devices.
diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index 92ffc4373f6d..661dc1594213 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -1382,6 +1382,8 @@
 #define PCI_DEVICE_ID_NVIDIA_GEFORCE_320M 0x08A0
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP79_SMBUS 0x0AA2
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP89_SATA 0x0D85
+#define PCI_DEVICE_ID_NVIDIA_GB10_GEN5_X4 0x22CE
+#define PCI_DEVICE_ID_NVIDIA_GB10_GEN4_X1 0x22D0
 
 #define PCI_VENDOR_ID_IMS 0x10e0
 #define PCI_DEVICE_ID_IMS_TT128 0x9128
--
2.45.2

On Thu, Nov 13, 2025 at 04:44:06PM +0800, Johnny Chang wrote:
> From: Johnny-CC Chang <Johnny-CC.Chang@mediatek.com>
>
> Nvidia GB10 PCIe hosts will encounter problem occasionally
> after SBR(secondary bus reset) is applied.
> Enable NO_BUS_RESET quirk for Nvidia GB10 PCIe hosts.
>
> Signed-off-by: Johnny-CC Chang <Johnny-CC.Chang@mediatek.com>

Applied with the commit log below to pci/virtualization for v6.20,
thanks!

  PCI: Mark Nvidia GB10 to avoid bus reset

  After asserting Secondary Bus Reset to downstream devices via a GB10
  Root Port, the link may not retrain correctly, e.g., the link may
  retrain with a lower lane count or config accesses to downstream
  devices may fail.

  Prevent use of Secondary Bus Reset for devices below GB10.

  Signed-off-by: Johnny-CC Chang <Johnny-CC.Chang@mediatek.com>
  [bhelgaas: drop pci_ids.h update (only used once), update commit log]
  Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
  Link: https://patch.msgid.link/20251113084441.2124737-1-Johnny-CC.Chang@mediatek.com

> ---
> drivers/pci/quirks.c | 11 +++++++++++
> include/linux/pci_ids.h | 2 ++
> 2 files changed, 13 insertions(+)
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index b94264cd3833..12a10fa84c8a 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3746,6 +3746,17 @@ static void quirk_no_bus_reset(struct pci_dev *dev)
> 	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> }
>
> +/*
> + * Nvidia GB10 PCIe hosts will encounter problem occasionally
> + * after SBR (secondary bus reset) is applied.
> + * SBR needs to be prevented for these PCIe hosts.
> + */
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_GB10_GEN5_X4,
> +			 quirk_no_bus_reset);
> +
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_GB10_GEN4_X1,
> +			 quirk_no_bus_reset);
> +
> /*
>  * Some NVIDIA GPU devices do not work with bus reset, SBR needs to be
>  * prevented for those affected devices.
> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> index 92ffc4373f6d..661dc1594213 100644
> --- a/include/linux/pci_ids.h
> +++ b/include/linux/pci_ids.h
> @@ -1382,6 +1382,8 @@
> #define PCI_DEVICE_ID_NVIDIA_GEFORCE_320M 0x08A0
> #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP79_SMBUS 0x0AA2
> #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP89_SATA 0x0D85
> +#define PCI_DEVICE_ID_NVIDIA_GB10_GEN5_X4 0x22CE
> +#define PCI_DEVICE_ID_NVIDIA_GB10_GEN4_X1 0x22D0
>
> #define PCI_VENDOR_ID_IMS 0x10e0
> #define PCI_DEVICE_ID_IMS_TT128 0x9128
> --
> 2.45.2
>
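
Editorial sketch: the quirk in this patch works by matching a device's vendor/device ID pair at enumeration time and setting a flag that the reset path later consults. The block below is a small user-space model of that idea, not kernel code. The `model_*` names and the flag value are hypothetical and exist only for illustration; the device IDs 0x22CE and 0x22D0 come from the posted patch, and 0x10de is NVIDIA's PCI vendor ID.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit for this model; the kernel defines its own value. */
#define MODEL_FLAG_NO_BUS_RESET (1u << 0)

#define MODEL_VENDOR_NVIDIA 0x10de /* NVIDIA PCI vendor ID */
#define MODEL_GB10_GEN5_X4  0x22ce /* device IDs taken from the posted patch */
#define MODEL_GB10_GEN4_X1  0x22d0

/* Minimal stand-in for a PCI device: IDs plus a flags word. */
struct model_dev {
	uint16_t vendor;
	uint16_t device;
	unsigned int flags;
};

/* One table entry: run hook() on every device matching vendor/device. */
struct model_fixup {
	uint16_t vendor;
	uint16_t device;
	void (*hook)(struct model_dev *dev);
};

/* Mirrors the shape of quirk_no_bus_reset(): set the flag, nothing else. */
static void model_no_bus_reset(struct model_dev *dev)
{
	dev->flags |= MODEL_FLAG_NO_BUS_RESET;
}

static const struct model_fixup model_fixups[] = {
	{ MODEL_VENDOR_NVIDIA, MODEL_GB10_GEN5_X4, model_no_bus_reset },
	{ MODEL_VENDOR_NVIDIA, MODEL_GB10_GEN4_X1, model_no_bus_reset },
};

/* Run every fixup whose vendor/device pair matches the device. */
static void model_apply_fixups(struct model_dev *dev)
{
	size_t i;

	for (i = 0; i < sizeof(model_fixups) / sizeof(model_fixups[0]); i++) {
		if (model_fixups[i].vendor == dev->vendor &&
		    model_fixups[i].device == dev->device)
			model_fixups[i].hook(dev);
	}
}

/* Returns 1 if a secondary bus reset below this bridge is permitted. */
static int model_bus_reset_allowed(const struct model_dev *bridge)
{
	return !(bridge->flags & MODEL_FLAG_NO_BUS_RESET);
}
```

In this model, a device with IDs 0x10de:0x22ce ends up with the flag set after `model_apply_fixups()`, so `model_bus_reset_allowed()` returns 0 for it, while any unlisted device keeps its default and may still be reset.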

On Thu, Nov 13, 2025 at 04:44:06PM +0800, Johnny Chang wrote:
> Nvidia GB10 PCIe hosts will encounter problem occasionally
> after SBR(secondary bus reset) is applied.

Could you elaborate what kinds of problems occur, how often they
occur, etc?

Thanks,

Lukas

On Thu, 2025-11-13 at 10:39 +0100, Lukas Wunner wrote:
> On Thu, Nov 13, 2025 at 04:44:06PM +0800, Johnny Chang wrote:
> > Nvidia GB10 PCIe hosts will encounter problem occasionally
> > after SBR(secondary bus reset) is applied.
>
> Could you elaborate what kinds of problems occur, how often they
> occur, etc?

There is about 1/1000 chance that after SBR is applied, any further
access via this root port will be blocked and make system crash.

Thanks,

Johnny

On Tue, 2025-11-18 at 17:39 +0800, Johnny-CC Chang wrote:
> On Thu, 2025-11-13 at 10:39 +0100, Lukas Wunner wrote:
> > On Thu, Nov 13, 2025 at 04:44:06PM +0800, Johnny Chang wrote:
> > > Nvidia GB10 PCIe hosts will encounter problem occasionally
> > > after SBR(secondary bus reset) is applied.
> >
> > Could you elaborate what kinds of problems occur, how often they
> > occur, etc?
>
> There is about 1/1000 chance that after SBR is applied, any further
> access via this root port will be blocked and make system crash.
>
> Thanks,
>
> Johnny

I would like to replace the original comment in the v1 patch with the
description below. Is this information sufficient?

--------
/*
 * After SBR(secondary bus reset) is applied on an Nvidia GB10
 * PCIe root port, there is 1/1000 chance that further requests
 * via this root port will be blocked and cause system unstable.
 */
--------

[+cc Jason, Alex for Nvidia input]

On Wed, Jan 14, 2026 at 06:39:24AM +0000, Johnny-CC Chang (張晋嘉) wrote:
> On Tue, 2025-11-18 at 17:39 +0800, Johnny-CC Chang wrote:
> > On Thu, 2025-11-13 at 10:39 +0100, Lukas Wunner wrote:
> > > On Thu, Nov 13, 2025 at 04:44:06PM +0800, Johnny Chang wrote:
> > > > Nvidia GB10 PCIe hosts will encounter problem occasionally
> > > > after SBR(secondary bus reset) is applied.
> > >
> > > Could you elaborate what kinds of problems occur, how often they
> > > occur, etc?
> >
> > There is about 1/1000 chance that after SBR is applied, any further
> > access via this root port will be blocked and make system crash.

What sort of crash happens? It's useful if we can include a bread
crumb that will help people identify the crash and find a fix.

What I would expect is some kind of PCIe error like a config read
timeout or Unsupported Request error. But usually those just result
in ~0 data back to the CPU, which usually doesn't directly cause a
crash.

> I would like to update below description to replace original comment in
> v1 patch, is this information sufficient?
> --------
> /*
>  * After SBR(secondary bus reset) is applied on an Nvidia GB10
>  * PCIe root port, there is 1/1000 chance that further requests
>  * via this root port will be blocked and cause system unstable.

I'm confused about what the topology is. I first assumed GB10 was a
PCIe Endpoint, since Secondary Bus Reset only applies to devices below
a bridge, so SBR would be applied to a device by a config write to
that bridge.

But you mention a GB10 Root Port here, which obviously is not an
Endpoint, so there's no bridge upstream from the GB10 that could
initiate SBR to the GB10.

If this is actually a GB10 issue, it sounds like a hardware erratum
that lots of users would see and Nvidia would likely be aware of.

Bjorn

On 1/14/26 09:28, Bjorn Helgaas wrote:
> What sort of crash happens? It's useful if we can include a bread
> crumb that will help people identify the crash and find a fix.

We observed retraining to lower PCIe lane count and config read
timeout. So yes crash is not the best way to describe it.

> I'm confused about what the topology is. I first assumed GB10 was
> a PCIe Endpoint, since Secondary Bus Reset only applies to devices
> below a bridge, so SBR would be applied to a device by a config
> write to that bridge.

gb10 is an SoC designed by NVIDIA and Mediatek in collaboration.
It's not an endpoint, but has its own PCIe controller for connecting
PCIe peripherals like NVMe drives, NIC, etc.

> If this is actually a GB10 issue, it sounds like a hardware erratum
> that lots of users would see and Nvidia would likely be aware of.

We're aware. We've maintained a quirk in a kernel tree for DGX Spark
and other gb10 powered products until this gets upstreamed.

Terje

On Thu, Jan 15, 2026 at 12:11:09PM -0800, Terje Bergstrom wrote:
> On 1/14/26 09:28, Bjorn Helgaas wrote:
>
> > What sort of crash happens? It's useful if we can include a bread
> > crumb that will help people identify the crash and find a fix.
>
> We observed retraining to lower PCIe lane count and config read
> timeout. So yes crash is not the best way to describe it.
>
> > I'm confused about what the topology is. I first assumed GB10 was
> > a PCIe Endpoint, since Secondary Bus Reset only applies to devices
> > below a bridge, so SBR would be applied to a device by a config
> > write to that bridge.
>
> gb10 is an SoC designed by NVIDIA and Mediatek in collaboration.
> It's not an endpoint, but has its own PCIe controller for connecting
> PCIe peripherals like NVMe drives, NIC, etc.

OK, so you do SBR to some endpoint below a GB10 Root Port, and after
the SBR, the link to the endpoint retrains with a lower lane count
and config reads to the endpoint time out?

I see you're from NVIDIA, so if you're confirming that this is a
hardware erratum (not an issue with the GB10 PCI controller driver),
we should definitely apply this, and I'll wordsmith the commit log
and comment something like this:

  When asserting Secondary Bus Reset to downstream devices via a GB10
  Root Port, the link doesn't retrain correctly. The link may retrain
  with a lower lane count, and config accesses to downstream devices
  may fail.

Bjorn

On 1/15/26 12:53, Bjorn Helgaas wrote:
> OK, so you do SBR to some endpoint below a GB10 Root Port, and after
> the SBR, the link to the endpoint retrains with a lower lane count
> and config reads to the endpoint time out?

That's right. The symptoms can vary, i.e. sometimes it retrains with
lower lane count, and sometimes config reads start timing out, and
very often it works just fine.

> I see you're from NVIDIA, so if you're confirming that this is a
> hardware erratum (not an issue with the GB10 PCI controller driver),
> we should definitely apply this, and I'll wordsmith the commit log
> and comment something like this:
>
>   When asserting Secondary Bus Reset to downstream devices via a GB10
>   Root Port, the link doesn't retrain correctly. The link may retrain
>   with a lower lane count, and config accesses to downstream devices
>   may fail.

Yes, I confirm this is a HW erratum. The problem doesn't occur every
time, so "the link may not retrain correctly" would be more correct,
but that's a minor comment.

Terje
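
Editorial sketch: the two symptoms confirmed above (a link that retrains with fewer lanes, and config reads that fail) can both be recognised from raw register values. The helpers below are a user-space illustration, not code from the patch. They assume the caller has already read the PCIe Link Capabilities and Link Status registers by some means; the function names are hypothetical. The bit positions used (Maximum Link Width in Link Capabilities bits 9:4, Negotiated Link Width in Link Status bits 9:4) follow the PCIe register layout, and an all-ones value is the conventional result of a failed config read, as noted earlier in the thread.

```c
#include <stdint.h>

#define SKETCH_LNKCAP_MLW   0x000003f0u /* Link Capabilities: Maximum Link Width, bits 9:4 */
#define SKETCH_LNKSTA_NLW   0x03f0u     /* Link Status: Negotiated Link Width, bits 9:4 */
#define SKETCH_WIDTH_SHIFT  4

/* Lane count the port is capable of, e.g. 4 for an x4 port. */
static unsigned int sketch_max_link_width(uint32_t lnkcap)
{
	return (lnkcap & SKETCH_LNKCAP_MLW) >> SKETCH_WIDTH_SHIFT;
}

/* Lane count the link actually trained to. */
static unsigned int sketch_negotiated_link_width(uint16_t lnksta)
{
	return (lnksta & SKETCH_LNKSTA_NLW) >> SKETCH_WIDTH_SHIFT;
}

/* Returns 1 if the link trained with fewer lanes than the port supports. */
static int sketch_link_width_degraded(uint32_t lnkcap, uint16_t lnksta)
{
	return sketch_negotiated_link_width(lnksta) <
	       sketch_max_link_width(lnkcap);
}

/* Returns 1 if a 32-bit config read came back as all ones (~0). */
static int sketch_config_read_failed(uint32_t value)
{
	return value == 0xffffffffu;
}
```

A narrower link is not by itself proof of this erratum, since a link can also train below the port's maximum when the downstream device supports fewer lanes, so a check like this is only a hint to compare against the width seen before the reset.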