RE: [PATCH rc v2 0/5] iommu/arm-smmu-v3: Fix device crash on kdump kernel

Posted by Tian, Kevin 1 month, 4 weeks ago

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, April 17, 2026 1:20 AM
> 
> On Thu, Apr 16, 2026 at 05:49:24PM +0100, Robin Murphy wrote:
> > On 15/04/2026 10:17 pm, Nicolin Chen wrote:
> > > When transitioning to a kdump kernel, the primary kernel might have
> > > crashed while endpoint devices were actively bus-mastering DMA.
> > > Currently, the SMMU driver aggressively resets the hardware during
> > > probe by clearing CR0_SMMUEN and setting the Global Bypass Attribute
> > > (GBPA) to ABORT.
> > >
> > > In a kdump scenario, this aggressive reset is highly destructive:
> > > a) If GBPA is set to ABORT, in-flight DMA will be aborted, generating
> > >    fatal PCIe AER or SErrors that may panic the kdump kernel
> > > b) If GBPA is set to BYPASS, in-flight DMA targeting some IOVAs will
> > >    bypass the SMMU and corrupt the physical memory at those 1:1
> > >    mapped IOVAs.
> >
> > But wasn't that rather the point? The kdump kernel doesn't know the scope
> > of how much could have gone wrong (including potentially the SMMU
> > configuration itself), so it just blocks everything, resets and reenables
> > the devices it cares about, and ignores whatever else might be on fire.
> 
> The purpose of kdump is to have the maximum chance to capture a dump
> from the blown up kernel.
> 
> Yes, on a perfect platform aborting the entire SMMU should improve the
> chance of getting that dump.
> 
> But sadly there are so many busted up platforms where if you start
> messing with the IOMMU they will explode and blow up the kdump. x86
> and "firmware first" error handling systems are particularly notorious
> for nasty behavior like this.
> 
> Seems like there are now ARM systems too. :(

Is there any report on such systems? It might be informative to include
a link to the report so it's clear that this series fixes real issues rather
than preparing for upcoming systems...

> 
> So, the iommu drivers have been preserving the IOMMU and not
> disrupting the DMAs on x86 for a long time. This is established kdump
> practice.
> 
> > If AER can panic a kdump kernel, that seems like a failing of the kdump
> > kernel itself more than anything else (especially given the likelihood that
> > additional AER events could follow from whatever initial crash/failure
> > triggered kdump to begin with).
> 
> Probably the kdump wasn't triggered by AER. You want kdump to not
> trigger more RAS events that might blow up the kdump while it is
> trying to run. That increases the chance of success.
> 

Btw, DMA is already allowed from the point where the previous kernel hangs
until the point where the SMMU driver blocks it. In cases where in-flight
DMAs are considered dangerous to kdump, this series just makes an existing
problem worse rather than creating a new one. For the majority of other
failures, which are not related to DMA, leaving DMA unblocked increases
the chance of success...

Re: [PATCH rc v2 0/5] iommu/arm-smmu-v3: Fix device crash on kdump kernel

Posted by Jason Gunthorpe 1 month, 4 weeks ago

On Fri, Apr 17, 2026 at 07:48:46AM +0000, Tian, Kevin wrote:
> Is there any report on such systems? It might be informative to include
> a link to the report so it's clear that this series fixes real issues rather
> than preparing for upcoming systems...

Yeah, we have an internal report and this was confirmed to fix it.

> Btw, DMA is already allowed from the point where the previous kernel hangs
> until the point where the SMMU driver blocks it. In cases where in-flight
> DMAs are considered dangerous to kdump, this series just makes an existing
> problem worse rather than creating a new one. For the majority of other
> failures, which are not related to DMA, leaving DMA unblocked increases
> the chance of success...

Right, exactly.

If DMAs are splattering over the kdump carve-out memory, it is
probably dead no matter what.

Jason
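
For readers following the thread, the trade-off discussed above can be
modelled in a few lines of plain C. This is only a userspace sketch, not
code from the series or from the arm-smmu-v3 driver: the type and function
names (smmu_probe_state, smmu_choose_probe_action) are hypothetical, and
it assumes the approach is the one described for x86, i.e. a kdump kernel
that finds the SMMU already enabled leaves it alone instead of forcing
GBPA to ABORT.

```c
#include <stdbool.h>

/* Hypothetical model of the probe-time decision; not driver code. */
enum smmu_probe_action {
    SMMU_RESET_ABORT, /* clear CR0_SMMUEN and set GBPA to ABORT */
    SMMU_PRESERVE,    /* leave the inherited SMMU state untouched */
};

struct smmu_probe_state {
    bool is_kdump;       /* assumption: true in the crash-capture kernel */
    bool smmuen_was_set; /* assumption: previous kernel left the SMMU enabled */
};

/*
 * A normal boot always takes the aggressive reset. A kdump boot preserves
 * the SMMU only when the previous kernel had it enabled, so in-flight DMA
 * keeps translating instead of being aborted or bypassed.
 */
static enum smmu_probe_action
smmu_choose_probe_action(const struct smmu_probe_state *s)
{
    if (s->is_kdump && s->smmuen_was_set)
        return SMMU_PRESERVE;
    return SMMU_RESET_ABORT;
}
```

In this model only the kdump case with a previously enabled SMMU avoids
the reset; every other combination falls back to the existing behaviour,
which matches Robin's point that blocking everything is the safe default
when nothing is known about the inherited state.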