RE: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates

RE: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
Posted by Tian, Kevin 1 month ago
> From: Dmytro Maluka <dmaluka@chromium.org>
> Sent: Tuesday, January 6, 2026 11:50 PM
> 
> On Tue, Jan 06, 2026 at 10:23:01AM -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 06, 2026 at 02:51:38PM +0100, Dmytro Maluka wrote:
> > > Regarding flushing caches right after that - what for? (BTW the Intel
> > > driver doesn't do that either.) If we don't do that and as a result the
> > > HW is using an old entry cached before we cleared the present bit, it
> > > is not affected by our later modifications anyway.
> >
> > You don't know what state the HW fetcher is in. This kind of race is possible:
> >
> >      CPU                 FETCHER
> >                         read present = 1
> >     present = 0
> >     mangle qword 1
> >                         read qword 1
> >                         < fail - HW sees a corrupted entry >
> >
> > The flush is not just a flush but a barrier to synchronize with the HW
> > that it has finished all fetches that may have been dependent on seeing
> > present = 1.
> >
> > So missing a flush after clearing present is possibly a bug today - I
> > don't remember what the guaranteed atomic size is for the Intel IOMMU,
> > though; if the atomic size is the whole entry it is OK since there is
> > only one fetcher read. Though AMD is 128 bits and ARM is 64 bits.
> 
> Indeed, may be a bug... In the VT-d spec I don't immediately see a
> guarantee that context and PASID entries are fetched atomically. (And
> for PASID entries, which are 512 bits, that seems particularly
> unlikely.)
> 

512-bit atomicity is possible, but not for the PASID entry.

VT-d spec, head of section 9 (Translation Structure Formats):

"
This chapter describes the memory-resident structures for DMA and
interrupt remapping. Hardware must access structure entries that
are 64-bit or 128-bit atomically. Hardware must update a 512-bit
Posted Interrupt Descriptor (see Section 9.11 for details) atomically.
Other than the Posted Interrupt Descriptor (PID), hardware is allowed 
to break access to larger than 128-bit entries into multiple aligned
128-bit accesses.
"

The root entry, scalable-mode root entry, context entry and IRTE are
128 bits, so they are OK.

The scalable-mode context entry is 256 bits, but only the lower 128 bits
are defined, so it's OK for now.

The scalable-mode PASID directory entry is 64 bits: OK.

The posted interrupt descriptor is 512 bits, with atomicity guaranteed.

But we do have a problem with the scalable-mode PASID entry, which is
512 bits:
  - bits beyond 191 are for future hardware, not a problem now
  - bits 128-191 are for the 1st stage
  - bits 0-127 manage stage selection, the 2nd stage, and some 1st-stage
    fields

So in theory 1st-stage and nested configurations are affected by this
bug; a rough layout sketch follows.
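
In terms of the driver's own representation (struct pasid_entry in
drivers/iommu/intel/pasid.h is a plain u64 val[8]; the chunk comments
below are just an illustration of the split described above):

struct pasid_entry {
	u64 val[8];
};

/*
 * How the aligned 128-bit fetch chunks line up with the qwords:
 *
 *   val[0]-val[1] (bits 0-127):   Present bit, PGTT (stage selection),
 *                                 2nd-stage pointer, some 1st-stage
 *                                 control fields
 *   val[2]        (bits 128-191): 1st-stage table pointer and flags
 *   val[3]-val[7] (bits 192-511): reserved for future hardware
 *
 * Hardware may fetch each aligned 128-bit chunk independently, so
 * val[0]/val[1] and val[2] are not guaranteed to be read together.
 */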

In reality:
  - the iommu driver shouldn't receive an attach request for an in-use
    PASID entry, so the cache should remain clear (either in its initial
    state or flushed by a previous teardown), and hw won't use a partial
    1st-stage config after seeing the entry as non-present.

  - replace is already broken, as the entry should not be cleared in the
    first place; this bug will be fixed when replace is reworked.

If I haven't overlooked anything (Baolu?), we probably don't need to fix
it by strictly following Jason's pseudo logic at this point. Instead,
just rename pasid_clear_entry() to pasid_clear_entry_no_flush() for now
(with a comment to clarify the expectation), and rework the replace path
in parallel.
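
A minimal sketch of that rename (illustrative only; the loop mirrors the
per-qword WRITE_ONCE zeroing that pasid_clear_entry() does today):

static inline void pasid_clear_entry_no_flush(struct pasid_entry *pe)
{
	int i;

	/*
	 * No cache flush here: only valid when the entry cannot be
	 * live in hardware, i.e. it was never marked present or a
	 * previous teardown already flushed it.
	 */
	for (i = 0; i < ARRAY_SIZE(pe->val); i++)
		WRITE_ONCE(pe->val[i], 0);
}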

We may never require a pasid_clear_entry_flush_cache() once hitless
replace is in place. 😊

Re: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
Posted by Baolu Lu 1 month ago
On 1/8/26 10:09, Tian, Kevin wrote:
>> From: Dmytro Maluka <dmaluka@chromium.org>
>> Sent: Tuesday, January 6, 2026 11:50 PM
>>
>> On Tue, Jan 06, 2026 at 10:23:01AM -0400, Jason Gunthorpe wrote:
>>> On Tue, Jan 06, 2026 at 02:51:38PM +0100, Dmytro Maluka wrote:
>>>> Regarding flushing caches right after that - what for? (BTW the Intel
>>>> driver doesn't do that either.) If we don't do that and as a result the
>>>> HW is using an old entry cached before we cleared the present bit, it
>>>> is not affected by our later modifications anyway.
>>>
>>> You don't know what state the HW fetcher is in. This kind of race is possible:
>>>
>>>       CPU                 FETCHER
>>>                          read present = 1
>>>      present = 0
>>>      mangle qword 1
>>>                          read qword 1
>>>                          < fail - HW sees a corrupted entry >
>>>
>>> The flush is not just a flush but a barrier to synchronize with the HW
>>> that it has finished all fetches that may have been dependent on seeing
>>> present = 1.
>>>
>>> So missing a flush after clearing present is possibly a bug today - I
>>> don't remember what the guaranteed atomic size is for the Intel IOMMU,
>>> though; if the atomic size is the whole entry it is OK since there is
>>> only one fetcher read. Though AMD is 128 bits and ARM is 64 bits.
>>
>> Indeed, may be a bug... In the VT-d spec I don't immediately see a
>> guarantee that context and PASID entries are fetched atomically. (And
>> for PASID entries, which are 512 bits, that seems particularly
>> unlikely.)
>>
> 
> 512-bit atomicity is possible, but not for the PASID entry.
> 
> VT-d spec, head of section 9 (Translation Structure Formats):
> 
> "
> This chapter describes the memory-resident structures for DMA and
> interrupt remapping. Hardware must access structure entries that
> are 64-bit or 128-bit atomically. Hardware must update a 512-bit
> Posted Interrupt Descriptor (see Section 9.11 for details) atomically.
> Other than the Posted Interrupt Descriptor (PID), hardware is allowed
> to break access to larger than 128-bit entries into multiple aligned
> 128-bit accesses.
> "
> 
> The root entry, scalable-mode root entry, context entry and IRTE are
> 128 bits, so they are OK.
> 
> The scalable-mode context entry is 256 bits, but only the lower 128 bits
> are defined, so it's OK for now.
> 
> The scalable-mode PASID directory entry is 64 bits: OK.
> 
> The posted interrupt descriptor is 512 bits, with atomicity guaranteed.
> 
> But we do have a problem with the scalable-mode PASID entry, which is
> 512 bits:
>    - bits beyond 191 are for future hardware, not a problem now
>    - bits 128-191 are for the 1st stage
>    - bits 0-127 manage stage selection, the 2nd stage, and some
>      1st-stage fields
> 
> So in theory 1st-stage and nested configurations are affected by this bug.

Yes. This is a real software bug. The hardware is legally allowed to
tear the PASID table entry read into four 128-bit chunks. For first-stage
(first-only or nested) translation, chunk 1 (bits 0-127) and chunk 2
(bits 128-191) both contain active configuration data, so hardware could
read an entry composed of half-old and half-new data.

The VT-d spec provides software guidance for PASID table entry
manipulation in Section 6.5.3.3 (Guidance to Software for Invalidations);
I don't think the Linux driver handles the handshake between the CPU and
the IOMMU hardware in the right way.

The correct way is a clear-flush-update sequence; that is, when tearing
down a present PASID table entry, the software should (see the sketch
after this list):

1. Clear the Present bit;
2. Invalidate the caches according to Section 6.5.3.3;
[now the CPU owns the PASID table entry]
3. Update the other fields;
4. Set the Present bit;
[now the VT-d hardware owns the PASID table entry].
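
In code it would look roughly like the sketch below, where
pasid_entry_flush_caches() is a placeholder name for whatever combination
of PASID-cache, IOTLB and device-TLB invalidations Section 6.5.3.3
requires:

static void pasid_entry_update_sketch(struct pasid_entry *pe,
				      const u64 *new_val)
{
	int i;

	/* 1. Clear the Present bit (bit 0 of qword 0). */
	WRITE_ONCE(pe->val[0], READ_ONCE(pe->val[0]) & ~1ULL);

	/*
	 * 2. Invalidate per Section 6.5.3.3. Completion of the
	 *    invalidations is the barrier: no in-flight fetch can still
	 *    be acting on Present = 1. The CPU now owns the entry.
	 */
	pasid_entry_flush_caches(pe);		/* placeholder */

	/* 3. Update the other fields while hardware ignores the entry. */
	for (i = 1; i < 8; i++)
		WRITE_ONCE(pe->val[i], new_val[i]);

	/* 4. Set Present last; hardware owns the entry again. */
	wmb();	/* order the field updates before the Present bit */
	WRITE_ONCE(pe->val[0], new_val[0] | 1ULL);
}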

> 
> In reality:
>    - the iommu driver shouldn't receive an attach request for an in-use
>      PASID entry, so the cache should remain clear (either in its
>      initial state or flushed by a previous teardown), and hw won't use
>      a partial 1st-stage config after seeing the entry as non-present.

Yes. But the current teardown process is buggy, as described above.

> 
>    - replace is already broken, as the entry should not be cleared in
>      the first place; this bug will be fixed when replace is reworked.

We still have a long way to go before achieving a real replace; since
the hardware doesn't guarantee 512-bit atomicity, "hitless" is very
difficult.

> 
> If I haven't overlooked anything (Baolu?), we probably don't need to fix
> it by strictly following Jason's pseudo logic at this point. Instead,
> just rename pasid_clear_entry() to pasid_clear_entry_no_flush() for now
> (with a comment to clarify the expectation), and rework the replace path
> in parallel.
> 
> We may never require a pasid_clear_entry_flush_cache() once hitless
> replace is in place. 😊

Perhaps we can first add pasid_entry_clear_present() to fulfill the
clear-flush-update handshake between the software and the IOMMU hardware
while reworking the PASID replace path?
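
Something like this, perhaps (shape only; pasid_entry_flush_caches() is
again a placeholder for the Section 6.5.3.3 invalidations):

/*
 * Clear Present and flush, leaving the rest of the entry intact for
 * the subsequent update; afterwards the CPU owns the entry.
 */
static void pasid_entry_clear_present(struct pasid_entry *pe)
{
	WRITE_ONCE(pe->val[0], READ_ONCE(pe->val[0]) & ~1ULL);
	pasid_entry_flush_caches(pe);		/* placeholder */
}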

Thanks,
baolu