> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, April 17, 2026 1:20 AM
>
> On Thu, Apr 16, 2026 at 05:49:24PM +0100, Robin Murphy wrote:
> > On 15/04/2026 10:17 pm, Nicolin Chen wrote:
> > > When transitioning to a kdump kernel, the primary kernel might have crashed
> > > while endpoint devices were actively bus-mastering DMA. Currently, the SMMU
> > > driver aggressively resets the hardware during probe by clearing CR0_SMMUEN
> > > and setting the Global Bypass Attribute (GBPA) to ABORT.
> > >
> > > In a kdump scenario, this aggressive reset is highly destructive:
> > > a) If GBPA is set to ABORT, in-flight DMA will be aborted, generating fatal
> > >    PCIe AER or SErrors that may panic the kdump kernel
> > > b) If GBPA is set to BYPASS, in-flight DMA targeting some IOVAs will bypass
> > >    the SMMU and corrupt the physical memory at those 1:1 mapped IOVAs.
> >
> > But wasn't that rather the point? The kdump kernel doesn't know the scope of
> > how much could have gone wrong (including potentially the SMMU configuration
> > itself), so it just blocks everything, resets and reenables the devices it
> > cares about, and ignores whatever else might be on fire.
>
> The purpose of kdump is to have the maximum chance to capture a dump
> from the blown up kernel.
>
> Yes, on a perfect platform aborting the entire SMMU should improve the
> chance of getting that dump.
>
> But sadly there are so many busted up platforms where if you start
> messing with the IOMMU they will explode and blow up the kdump. x86
> and "firmware first" error handling systems are particularly notorious
> for nasty behavior like this.
>
> Seems like there are now ARM systems too. :(

Is there any report on such systems? It might be informative to include
a link to the report so it's clear that this series fixes real issues rather
than preparing for upcoming systems...

>
> So, the iommu drivers have been preserving the IOMMU and not
> disrupting the DMAs on x86 for a long time. This is established kdump
> practice.
>
> > If AER can panic a kdump kernel, that seems like a failing of the kdump
> > kernel itself more than anything else (especially given the likelihood that
> > additional AER events could follow from whatever initial crash/failure
> > triggered kdump to begin with).
>
> Probably the kdump wasn't triggered by AER. You want kdump to not
> trigger more RAS events that might blow up the kdump while it is
> trying to run. That increases the chance of success
>

btw the DMA is allowed after the previous kernel is hung until the point
where the smmu driver blocks it. In cases where in-flight DMAs are considered
dangerous to kdump, this series just makes it worse instead of creating
a new issue. While for the majority of other failures not related to DMAs,
unblocking then increases the chance of success...
On Fri, Apr 17, 2026 at 07:48:46AM +0000, Tian, Kevin wrote:

> Is there any report on such systems? It might be informative to include
> a link to the report so it's clear that this series fixes real issues rather
> than preparing for upcoming systems...

Yeah, we have an internal report and this was confirmed to fix it.

> btw the DMA is allowed after the previous kernel is hung until the point
> where the smmu driver blocks it. In cases where in-flight DMAs are considered
> dangerous to kdump, this series just makes it worse instead of creating
> a new issue. While for the majority of other failures not related to DMAs,
> unblocking then increases the chance of success...

Right, exactly. If DMAs are splattering over the kdump carve-out memory, it
is probably dead no matter what.

Jason