We are hitting the following soft lockup in production on v6.6 and
v6.12, but the bug exists in all versions:
watchdog: BUG: soft lockup - CPU#24 stuck for 31s! [tokio-runtime-w:1274919]
CPU: 24 PID: 1274919 Comm: tokio-runtime-w Not tainted 6.6.105+ #1
Hardware name: Google Google Compute Engine/Google Comput Engine, BIOS Google 10/25/2025
RIP: 0010:__raw_spin_unlock_irqrestore+0x21/0x30
Call Trace:
<TASK>
amd_iommu_attach_device+0x69/0x450
__iommu_device_set_domain+0x7b/0x190
__iommu_group_set_core_domain+0x61/0xd0
iommu_detach_group+0x27/0x40
vfio_iommu_type1_detach_group+0x157/0x780 [vfio_iommu_type1]
vfio_group_detach_container+0x59/0x160 [vfio]
vfio_group_fops_release+0x4d/0x90 [vfio]
__fput+0x95/0x2a0
task_work_run+0x93/0xc0
do_exit+0x321/0x950
do_group_exit+0x7f/0xa0
get_signal+0x77d/0x780
</TASK>
This occurs because we are running in a VM and we split up the special
size CMD_INV_IOMMU_ALL_PAGES_ADDRESS that we get from
amd_iommu_domain_flush_tlb_pde() into a bunch of smaller flushes. Each
of these flushes traps into the host, all while we hold the domain lock
with IRQs disabled.
Fix this by not splitting up this special size and instead sending the
whole command in one go, so perhaps the host will decide to be gracious
and not spend seven business years doing the flush.
Cc: stable@vger.kernel.org
Fixes: a270be1b3fdf ("iommu/amd: Use only natural aligned flushes in a VM")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
drivers/iommu/amd/iommu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 81c4d7733872..f0d3e06734ef 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1769,7 +1769,8 @@ void amd_iommu_domain_flush_pages(struct protection_domain *domain,
 {
 	lockdep_assert_held(&domain->lock);
 
-	if (likely(!amd_iommu_np_cache)) {
+	if (likely(!amd_iommu_np_cache) ||
+	    size == CMD_INV_IOMMU_ALL_PAGES_ADDRESS) {
 		__domain_flush_pages(domain, address, size);
 
 		/* Wait until IOMMU TLB and all device IOTLB flushes are complete */
--
2.53.0
On Wed, Mar 04, 2026 at 04:30:03PM -0500, Josef Bacik wrote:
> We are hitting the following soft lockup in production on v6.6 and
> v6.12, but the bug exists in all versions

Can I get this reviewed/merged? I'm hitting this softlockup hundreds of
times a day in production and I need it in stable so I can have it
backported to our kernels. Thanks,

Josef
On Wed, Mar 04, 2026 at 04:30:03PM -0500, Josef Bacik wrote:
> We are hitting the following soft lockup in production on v6.6 and
> v6.12, but the bug exists in all versions
>
> [trace snipped]
>
> This occurs because we're a VM and we're splitting up the size
> CMD_INV_IOMMU_ALL_PAGES_ADDRESS we get from
> amd_iommu_domain_flush_tlb_pde() into a bunch of smaller flushes.

This function doesn't exist in the upstream kernel anymore, and the
new code doesn't generate CMD_INV_IOMMU_ALL_PAGES_ADDRESS flushes at
all, AFAIK.

Your patch makes sense, but it needs to go to stable only somehow.

Jason
On Thu, Mar 12, 2026 at 9:40 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Wed, Mar 04, 2026 at 04:30:03PM -0500, Josef Bacik wrote:
> > We are hitting the following soft lockup in production on v6.6 and
> > v6.12, but the bug exists in all versions
> >
> > [trace snipped]
> >
> > This occurs because we're a VM and we're splitting up the size
> > CMD_INV_IOMMU_ALL_PAGES_ADDRESS we get from
> > amd_iommu_domain_flush_tlb_pde() into a bunch of smaller flushes.
>
> This function doesn't exist in the upstream kernel anymore, and the
> new code doesn't generate CMD_INV_IOMMU_ALL_PAGES_ADDRESS flushes at
> all, AFAIK.

This was based on linus/master as of March 4th, and we get here via
amd_iommu_flush_tlb_all, which definitely still exists, so what
specifically are you talking about? Thanks,

Josef
On Sat, Mar 14, 2026 at 02:24:11PM -0400, Josef Bacik wrote:
> On Thu, Mar 12, 2026 at 9:40 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Wed, Mar 04, 2026 at 04:30:03PM -0500, Josef Bacik wrote:
> > > We are hitting the following soft lockup in production on v6.6 and
> > > v6.12, but the bug exists in all versions
> > >
> > > [trace snipped]
> > >
> > > This occurs because we're a VM and we're splitting up the size
> > > CMD_INV_IOMMU_ALL_PAGES_ADDRESS we get from
> > > amd_iommu_domain_flush_tlb_pde() into a bunch of smaller flushes.
> >
> > This function doesn't exist in the upstream kernel anymore, and the
> > new code doesn't generate CMD_INV_IOMMU_ALL_PAGES_ADDRESS flushes at
> > all, AFAIK.
>
> This was based on linus/master as of March 4th, and we get here via
> amd_iommu_flush_tlb_all, which definitely still exists, so what
> specifically are you talking about? Thanks,

$ git grep amd_iommu_domain_flush_tlb_pde | wc -l
0

The entire page table logic was rewritten. The stuff that caused these
issues is gone and the new stuff doesn't appear to have this bug of
passing size == CMD_INV_IOMMU_ALL_PAGES_ADDRESS.

If it does please explain it in terms of the new stuff without
referencing deleted functions.

I don't know how you get something like this into -stable.

Jason
> On Thu, Mar 26, 2026 19:05:12 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Sat, Mar 14, 2026 at 02:24:11PM -0400, Josef Bacik wrote:
> > On Thu, Mar 12, 2026 at 9:40 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Wed, Mar 04, 2026 at 04:30:03PM -0500, Josef Bacik wrote:
> > > > We are hitting the following soft lockup in production on v6.6 and
> > > > v6.12, but the bug exists in all versions
> > > >
> > > > [trace snipped]
> > > >
> > > > This occurs because we're a VM and we're splitting up the size
> > > > CMD_INV_IOMMU_ALL_PAGES_ADDRESS we get from
> > > > amd_iommu_domain_flush_tlb_pde() into a bunch of smaller flushes.
> > >
> > > This function doesn't exist in the upstream kernel anymore, and the
> > > new code doesn't generate CMD_INV_IOMMU_ALL_PAGES_ADDRESS flushes at
> > > all, AFAIK.
> >
> > This was based on linus/master as of March 4th, and we get here via
> > amd_iommu_flush_tlb_all, which definitely still exists, so what
> > specifically are you talking about? Thanks,
>
> $ git grep amd_iommu_domain_flush_tlb_pde | wc -l
> 0
>
> The entire page table logic was rewritten. The stuff that caused these
> issues is gone and the new stuff doesn't appear to have this bug of
> passing size == CMD_INV_IOMMU_ALL_PAGES_ADDRESS.
>
> If it does please explain it in terms of the new stuff without
> referencing deleted functions.
>
> I don't know how you get something like this into -stable.
I believe the function Josef is referring to on linus/master is
amd_iommu_domain_flush_all().
https://elixir.bootlin.com/linux/v7.0-rc7/source/drivers/iommu/amd/iommu.c#L1820
The potential call sequence appears to be:
```
blocked_domain_attach_device() or amd_iommu_attach_device()
  -> detach_device()
     -> amd_iommu_domain_flush_all()
        -> amd_iommu_domain_flush_pages(..., CMD_INV_IOMMU_ALL_PAGES_ADDRESS)
```
Based on the code in build_inv_address() [1], it doesn't make sense to
break the full range into smaller sizes and perform multiple flushes:
any chunk larger than 1 << 51 is turned into a full flush anyway.
[1] https://elixir.bootlin.com/linux/v7.0-rc7/source/drivers/iommu/amd/iommu.c#L1289
On Thu, Apr 09, 2026 at 08:12:25AM +0000, Weinan Liu wrote:
> > On Thu, Mar 26, 2026 19:05:12 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Sat, Mar 14, 2026 at 02:24:11PM -0400, Josef Bacik wrote:
> > > On Thu, Mar 12, 2026 at 9:40 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > >
> > > > On Wed, Mar 04, 2026 at 04:30:03PM -0500, Josef Bacik wrote:
> > > > > We are hitting the following soft lockup in production on v6.6 and
> > > > > v6.12, but the bug exists in all versions
> > > > >
> > > > > [trace snipped]
> > > > >
> > > > > This occurs because we're a VM and we're splitting up the size
> > > > > CMD_INV_IOMMU_ALL_PAGES_ADDRESS we get from
> > > > > amd_iommu_domain_flush_tlb_pde() into a bunch of smaller flushes.
> > > >
> > > > This function doesn't exist in the upstream kernel anymore, and the
> > > > new code doesn't generate CMD_INV_IOMMU_ALL_PAGES_ADDRESS flushes at
> > > > all, AFAIK.
> > >
> > > This was based on linus/master as of March 4th, and we get here via
> > > amd_iommu_flush_tlb_all, which definitely still exists, so what
> > > specifically are you talking about? Thanks,
> >
> > $ git grep amd_iommu_domain_flush_tlb_pde | wc -l
> > 0
> >
> > The entire page table logic was rewritten. The stuff that caused these
> > issues is gone and the new stuff doesn't appear to have this bug of
> > passing size == CMD_INV_IOMMU_ALL_PAGES_ADDRESS.
> >
> > If it does please explain it in terms of the new stuff without
> > referencing deleted functions.
> >
> > I don't know how you get something like this into -stable.
>
> I believe the function Josef is referring to on linux/master is amd_iommu_domain_flush_all().
> https://elixir.bootlin.com/linux/v7.0-rc7/source/drivers/iommu/amd/iommu.c#L1820
That does seem to be an issue, but it is not going to be triggered by a
VFIO trace like Josef is showing. I've already fixed this properly in
my series:
https://lore.kernel.org/all/3-v2-90ddd19c0894+13561-iommupt_inv_amd_jgg@nvidia.com/
+ if (likely(!amd_iommu_np_cache) ||
+ unlikely(address == 0 && last == U64_MAX)) {
+ __domain_flush_pages(domain, address, last);
By fully getting rid of the wrong use of
CMD_INV_IOMMU_ALL_PAGES_ADDRESS as a size in the callers.
So there is a small window when this patch could land with a commit
message to address amd_iommu_domain_flush_all() and be backported
before it all gets reworked and backporting will become hard. Respin
it quickly?
Jason