As discussed in [1], we don't currently prevent the compiler from
reordering memory writes when updating context entries, which is
potentially dangerous: it may set the present bit (i.e. enable DMA
translation for the given device) before the other bits in the context
entry are fully set up, creating a time window in which a DMA from the
device may result in unpredictable behavior.

Fix this the same way it is already addressed for PASID entries, i.e.
by using READ_ONCE/WRITE_ONCE in the helpers that set individual bits
in context entries, so that the memory writes done by those helpers are
ordered relative to each other (and load/store tearing and so on are
prevented as well).

While at it, similarly paranoidally fix updating root entries as well:
use WRITE_ONCE to make sure that the present bit is set atomically
together with the context table address bits, not before them.

[1] https://lore.kernel.org/all/aTG7gc7I5wExai3S@google.com/

v1 -> v2:
- Sanitize bits to not exceed the mask (suggested by Baolu)
- Reuse pasid_set_bits() for context entries as well (rename it to
  entry_set_bits())
- Add extra barrier in *_set_present() (suggested by Baolu)

Dmytro Maluka (5):
  iommu/vt-d: Sanitize set bits in pasid_set_bits()
  iommu/vt-d: Generalize pasid_set_bits()
  iommu/vt-d: Ensure memory ordering in context entry updates
  iommu/vt-d: Use smp_wmb() before setting context/pasid present bit
  iommu/vt-d: Use WRITE_ONCE for setting root table entries

 drivers/iommu/intel/iommu.c |  2 +-
 drivers/iommu/intel/iommu.h | 49 +++++++++++++++++++++++++------------
 drivers/iommu/intel/pasid.c |  3 ++-
 drivers/iommu/intel/pasid.h | 46 +++++++++++++++++-----------------
 4 files changed, 59 insertions(+), 41 deletions(-)

-- 
2.47.3
On Sat, Dec 27, 2025 at 06:57:23PM +0100, Dmytro Maluka wrote:
> As discussed in [1], we don't currently prevent the compiler from
> reordering memory writes when updating context entries, which is
> potentially dangerous, as it may cause setting the present bit (i.e.
> enabling DMA translation for the given device) before finishing setting
> up other bits in the context entry (and thus creating a time window when
> a DMA from the device may result in an unpredicted behavior).
>
> Fix this in the same way as how this is already addressed for PASID
> entries, i.e. by using READ_ONCE/WRITE_ONCE in the helpers used for
> setting individual bits in context entries, so that memory writes done
> by those helpers are ordered in relation to each other (plus, prevent
> load/store tearing and so on).
>
> While at it, similarly paranoidally fix updating root entries as well:
> use WRITE_ONCE to make sure that the present bit is set atomically
> together with the context table address bits, not before them.

The PASID entries should not be manipulated 'live' in a haphazard way
like this in the first place!

Like AMD and ARM, build the new PASID entry on the stack and then copy
it to the DMA'able memory in a way that is consistent with the HW's
atomicity granule, paying attention not to 'tear' it.

This manipulate-in-place is just asking for trouble, and can never
support replace or full viommu requirements.. :\

So while it is perhaps an improvement to do this work, it would be
better to fix the root cause issue if someone has time..

Jason
On 1/6/26 02:12, Jason Gunthorpe wrote:
> On Sat, Dec 27, 2025 at 06:57:23PM +0100, Dmytro Maluka wrote:
>> As discussed in [1], we don't currently prevent the compiler from
>> reordering memory writes when updating context entries, which is
>> potentially dangerous, as it may cause setting the present bit (i.e.
>> enabling DMA translation for the given device) before finishing setting
>> up other bits in the context entry (and thus creating a time window when
>> a DMA from the device may result in an unpredicted behavior).
>>
>> Fix this in the same way as how this is already addressed for PASID
>> entries, i.e. by using READ_ONCE/WRITE_ONCE in the helpers used for
>> setting individual bits in context entries, so that memory writes done
>> by those helpers are ordered in relation to each other (plus, prevent
>> load/store tearing and so on).
>>
>> While at it, similarly paranoidally fix updating root entries as well:
>> use WRITE_ONCE to make sure that the present bit is set atomically
>> together with the context table address bits, not before them.
> The PASID entries should not be manipulated 'live' in a haphazard way
> like this in the first place!
>
> Like AMD and ARM build the new PASID entry on the stack and then it
> should be copied to the DMA'able memory in a way that is consistent
> with the HW's atomicity granule, paying attention not to 'tear' it.
>
> This manipulate-in-place is just asking for trouble, and can never
> support replace or full viommu requirements.. :\
>
> So while it is perhaps an improvement to do this work, it would be
> better to fix the root cause issue if someone has time..

Agreed. The current Intel IOMMU driver uses a 'clear-populate-set'
pattern protected by a spinlock, which is why it doesn't support
'replace' yet. Dmytro's patch addresses the immediate risk of the
compiler reordering those writes and exposing invalid data to the
hardware.

Moving to an on-stack construction (like AMD/ARM) and updating
atomically is the right direction for the driver. We'll look into that
refactoring as a follow-up series to modernize the entry manipulation
logic.

Thanks,
baolu
On Mon, Jan 05, 2026 at 02:12:00PM -0400, Jason Gunthorpe wrote:
> On Sat, Dec 27, 2025 at 06:57:23PM +0100, Dmytro Maluka wrote:
> > As discussed in [1], we don't currently prevent the compiler from
> > reordering memory writes when updating context entries, which is
> > potentially dangerous, as it may cause setting the present bit (i.e.
> > enabling DMA translation for the given device) before finishing setting
> > up other bits in the context entry (and thus creating a time window when
> > a DMA from the device may result in an unpredicted behavior).
> >
> > Fix this in the same way as how this is already addressed for PASID
> > entries, i.e. by using READ_ONCE/WRITE_ONCE in the helpers used for
> > setting individual bits in context entries, so that memory writes done
> > by those helpers are ordered in relation to each other (plus, prevent
> > load/store tearing and so on).
> >
> > While at it, similarly paranoidally fix updating root entries as well:
> > use WRITE_ONCE to make sure that the present bit is set atomically
> > together with the context table address bits, not before them.
>
> The PASID entries should not be manipulated 'live' in a haphazard way
> like this in the first place!
>
> Like AMD and ARM build the new PASID entry on the stack and then it
> should be copied to the DMA'able memory in a way that is consistent
> with the HW's atomicity granule, paying attention not to 'tear' it.
As I understand, the "consistent with the HW's atomicity granule, paying
attention not to 'tear' it" part is already fulfilled for PASID entries
(and with this series, for context entries as well):
static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
{
	u64 old;

	old = READ_ONCE(*ptr);
	WRITE_ONCE(*ptr, (old & ~mask) | bits);
}
I've been assuming it's ok to manipulate other bits in place as long as
we take care to only do that while the present bit is cleared (i.e.
while the entry is ignored by hardware)?
So IIUC the only problem with this approach is the redundancy: we do
this READ_ONCE+WRITE_ONCE for each individual field in a PASID entry.

So while I agree it would be much more natural to build whole entries,
and the existing way looks strange and not the most efficient, I'm
wondering if it is causing any actual correctness issues (apart from
those addressed by this series).
> This manipulate-in-place is just asking for trouble, and can never
> support replace or full viommu requirements.. :\
>
> So while it is perhaps an improvement to do this work, it would be
> better to fix the root cause issue if someone has time..
>
> Jason
On Mon, Jan 05, 2026 at 07:54:53PM +0100, Dmytro Maluka wrote:
> > Like AMD and ARM build the new PASID entry on the stack and then it
> > should be copied to the DMA'able memory in a way that is consistent
> > with the HW's atomicity granule, paying attention not to 'tear' it.
>
> As I understand, the "consistent with the HW's atomicity granule, paying
> attention not to 'tear' it" part is already fulfilled for PASID entries
> (and with this series, for context entries as well):
>
> static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
> {
> 	u64 old;
>
> 	old = READ_ONCE(*ptr);
> 	WRITE_ONCE(*ptr, (old & ~mask) | bits);
> }
>
> I've been assuming it's ok to manipulate other bits in place as long as
> we take care to only do that while the present bit is cleared (i.e.
> while the entry is ignored by hardware)?
If these are only done while non-present then the only issue is
missing a barrier before setting present, that should be a one line
patch, no?
> So IIUC the only problem with this approach is the redundancy: we do
> this READ_ONCE+WRITE_ONCE for each individual field in a PASID entry.
You don't need READ_ONCE if there isn't another thread concurrently
writing, and WRITE_ONCE is pointless if the HW is promising not to
read it due to non-present.
> So while I agree it would be much more natural to build whole entries,
> and the existing way looks strange and not the most efficient, I'm
> wondering if it is causing any actual correctness issues (apart from
> those addressed by this series).
It prevents doing the replace operation, which is a correctness issue
for VMs.
Jason
On Mon, Jan 05, 2026 at 03:14:10PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 05, 2026 at 07:54:53PM +0100, Dmytro Maluka wrote:
> > > Like AMD and ARM build the new PASID entry on the stack and then it
> > > should be copied to the DMA'able memory in a way that is consistent
> > > with the HW's atomicity granule, paying attention not to 'tear' it.
> >
> > As I understand, the "consistent with the HW's atomicity granule, paying
> > attention not to 'tear' it" part is already fulfilled for PASID entries
> > (and with this series, for context entries as well):
> >
> > static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
> > {
> > 	u64 old;
> >
> > 	old = READ_ONCE(*ptr);
> > 	WRITE_ONCE(*ptr, (old & ~mask) | bits);
> > }
> >
> > I've been assuming it's ok to manipulate other bits in place as long as
> > we take care to only do that while the present bit is cleared (i.e.
> > while the entry is ignored by hardware)?
>
> If these are only done while non-present then the only issue is
> missing a barrier before setting present, that should be a one line
> patch, no?
Yes, a barrier, or alternatively WRITE_ONCE for any updates of entries.
The latter is how it was already done for PASID entries, before this
series (so IIUC for PASID entries there was already no issue), and my
main goal was to fix the issue for context entries as well, so I just
did it similarly.
> > So IIUC the only problem with this approach is the redundancy: we do
> > this READ_ONCE+WRITE_ONCE for each individual field in a PASID entry.
>
> You don't need READ_ONCE if there isn't another thread concurrently
> writing,
Good point. But it was there before this series.
> and WRITE_ONCE is pointless if the HW is promising not to
> read it due to non-present.
As long as we use a barrier. And IIUC vice versa, if we use WRITE_ONCE
for any updates, a barrier is not necessary (on x86). And WRITE_ONCE for
any updates (for PASID entries) is what was already done before this
series.
Although, perhaps even with a barrier, WRITE_ONCE is still desirable to
prevent the compiler from doing strange but legal things that involve
transiently setting and then clearing the present bit behind our back?
(Not that I'm aware of any compilers doing that, but I'm no compiler
expert.)
> > So while I agree it would be much more natural to build whole entries,
> > and the existing way looks strange and not the most efficient, I'm
> > wondering if it is causing any actual correctness issues (apart from
> > those addressed by this series).
>
> It prevents doing the replace operation, which is a correctness issue
> for VMs.
>
> Jason
On Mon, Jan 05, 2026 at 09:05:36PM +0100, Dmytro Maluka wrote:
> > and WRITE_ONCE is pointless if the HW is promising not to
> > read it due to non-present.
>
> As long as we use a barrier. And IIUC vice versa, if we use WRITE_ONCE
> for any updates, a barrier is not necessary (on x86). And WRITE_ONCE for
> any updates (for PASID entries) is what was already done before this
> series.

That is an x86 ism and it shouldn't be needlessly leaked into drivers.
As this is not performance critical it should have the portable flow:

	WRITE_ONCE(non-present)
	dma_wmb()
	<cmd to flush caches>

	[..]
	<writes to the entry>

	dma_wmb()
	WRITE_ONCE(present)

So, it seems to me like all you need here is the one line to add the
dma_wmb() prior to setting present.

> Although, perhaps even with a barrier, WRITE_ONCE is still desirable to
> prevent the compiler from doing strange but legal things that involve
> transiently setting and then clearing the present bit behind our back?
> (Not that I'm aware of any compilers doing that, but I'm no compiler
> expert.)

You do have to write once the present bit, but not the others. The
dma_wmb() will make sure the HW cannot observe other states when it
observes present.

Jason
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Tuesday, January 6, 2026 8:14 AM
>
> On Mon, Jan 05, 2026 at 09:05:36PM +0100, Dmytro Maluka wrote:
> > > and WRITE_ONCE is pointless if the HW is promising not to
> > > read it due to non-present.
> >
> > As long as we use a barrier. And IIUC vice versa, if we use WRITE_ONCE
> > for any updates, a barrier is not necessary (on x86). And WRITE_ONCE for
> > any updates (for PASID entries) is what was already done before this
> > series.
>
> That is an x86 ism and it shouldn't be needlessly leaked into drivers.

yeah WRITE_ONCE() is not by definition to guarantee the ordering between
CPU and device. lots of READ_ONCE()/WRITE_ONCE() in existing code are
meaningless, as 1) between CPUs there is already lock protection;
2) between CPU and device it requires dma_wmb() to guarantee the order.

> As this is not performance it should have the portable flow:
>
> 	WRITE_ONCE(non-present)
> 	dma_wmb()
> 	<cmd to flush caches>
>
> 	[..]
> 	<writes to the entry>
>
> 	dma_wmb()
> 	WRITE_ONCE(present)
>
> So, it seems to me like all you need here is the one line to add the
> dma_wmb() prior to setting present.

+1
On Tue, Jan 06, 2026 at 07:48:50AM +0000, Tian, Kevin wrote:
> yeah WRITE_ONCE() is not by definition to guarantee the ordering between
> CPU and device.

Yes, WRITE_ONCE is not about HW guarantees at all, it is about compiler
guarantees. And it is not about ordering, it is about the compiler's
guarantee to store the given 64-bit value once, with one instruction.
But this compiler guarantee is exactly my point (see my last reply to
Jason).

> lots of READ_ONCE()/WRITE_ONCE() in existing code are meaningless,
> as 1) between CPUs there is already lock protection; 2) between CPU and
> device it requires dma_wmb() to guarantee the order.

As I see it, those WRITE_ONCEs (maybe not READ_ONCEs) haven't been
meaningless (I mean, they have been actually useful) so long as we
haven't been using any barriers. Again, on x86, store ordering requires
just compiler ordering, and dma_wmb() is just a compiler barrier. So,
assuming this driver is only used on x86 (which is, well, true :)),
we are lacking even compiler barriers, but at least we have those
WRITE_ONCEs, which provide compiler ordering too (although only between
each other, not with any other memory accesses, but that seems enough
for our case).

And again, I agree it is not pretty to rely on arch-specific ordering
assumptions, and doing in-place updates via those context_xxx() and
pasid_xxx() helpers all over the place instead of updating whole entries
seems a strange choice. But that is how it was implemented 10 or so
years ago, and overhauling that hasn't been my goal.
> From: Dmytro Maluka <dmaluka@chromium.org>
> Sent: Tuesday, January 6, 2026 10:40 PM
>
> On Tue, Jan 06, 2026 at 07:48:50AM +0000, Tian, Kevin wrote:
> > yeah WRITE_ONCE() is not by definition to guarantee the ordering between
> > CPU and device.
>
> Yes, WRITE_ONCE is not about HW guarantees at all, it is about compiler
> guarantees. And it is not about ordering, it is about the compiler's
> guarantee to store the given 64-bit value once, with one instruction.
> But this compiler guarantee is exactly my point (see my last reply to
> Jason).
>
> > lots of READ_ONCE()/WRITE_ONCE() in existing code are meaningless,
> > as 1) between CPUs there is already lock protection; 2) between CPU and
> > device it requires dma_wmb() to guarantee the order.
>
> As I see it, those WRITE_ONCEs (maybe not READ_ONCEs) haven't been
> meaningless (I mean, they have been actually useful) so long as we
> haven't been using any barriers. Again, on x86, store ordering requires
> just compiler ordering, and dma_wmb() is just a compiler barrier. So,
> assuming this driver is only used on x86 (which is, well, true :)),
> we are lacking even compiler barriers, but at least we have those
> WRITE_ONCEs, which provide compiler ordering too (although only between
> each other, not with any other memory accesses, but that seems enough
> for our case).
>
> And again, I agree it is not pretty to rely on arch-specific ordering
> assumptions, and doing in-place updates via those context_xxx() and
> pasid_xxx() helpers all over the place instead of updating whole entries
> seems a strange choice. But that is how it was implemented 10 or so
> years ago, and overhauling that hasn't been my goal.

sure. the point here is to align on what is the right thing to do. then
we could have short-term fixes plus bigger refactoring later.
On Mon, Jan 05, 2026 at 08:14:18PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 05, 2026 at 09:05:36PM +0100, Dmytro Maluka wrote:
> > > and WRITE_ONCE is pointless if the HW is promising not to
> > > read it due to non-present.
> >
> > As long as we use a barrier. And IIUC vice versa, if we use WRITE_ONCE
> > for any updates, a barrier is not necessary (on x86). And WRITE_ONCE for
> > any updates (for PASID entries) is what was already done before this
> > series.
>
> That is an x86 ism and it shouldn't be needlessly leaked into drivers.
> As this is not performance it should have the portable flow:

Agree. And as a matter of fact, patch 4 in this series already does
that - it adds a barrier before setting the present bit, despite it
being "redundant on x86". It adds smp_wmb() as suggested by Baolu but I
can change it to dma_wmb() which indeed seems more suitable. (FWIW on
x86 there is no difference between smp_wmb() and dma_wmb(), both are
just barrier().)

> 	WRITE_ONCE(non-present)
> 	dma_wmb()
> 	<cmd to flush caches>

Regarding a barrier after clearing the present bit - good point, I
should probably add that to my patch 4 as well.

Regarding flushing caches right after that - what for? (BTW the Intel
driver doesn't do that either.) If we don't do that and as a result the
HW is using an old entry cached before we cleared the present bit, it
is not affected by our later modifications anyway. Of course we flush
caches at the end, after setting the present bit, but that's a
different story.

> > Although, perhaps even with a barrier, WRITE_ONCE is still desirable to
> > prevent the compiler from doing strange but legal things that involve
> > transiently setting and then clearing the present bit behind our back?
> > (Not that I'm aware of any compilers doing that, but I'm no compiler
> > expert.)
>
> You do have to write once the present bit, but not the others. The
> dma_wmb() will make sure the HW cannot observe other states when it
> observes present.

I was talking about compiler guarantees, not HW guarantees. I mean: when
setting some other bit in the entry before the barrier, if we do that
without WRITE_ONCE, with a mere "foo |= bar", are we certain the
compiler will not implement that as, for example, setting the value to
0xffffffffffffffff and then clearing other bits (for whatever crazy
reason)? That would still be a legal thing for the compiler to do, in
terms of its single-thread guarantees?

So it still seems a good idea to use WRITE_ONCE (or anything not weaker
than WRITE_ONCE) for any updates of memory structures used by hardware,
to prevent such things? (And using WRITE_ONCE for that doesn't really
cost us anything anyway.)
On Tue, Jan 06, 2026 at 02:51:38PM +0100, Dmytro Maluka wrote:
> > WRITE_ONCE(non-present)
> > dma_wmb()
> > <cmd to flush caches>
>
> Regarding a barrier after clearing present bit - good point, I should
> probably add that to my patch 4 as well.
It is already integrated into <cmd to flush caches> I showed it here
for clarity.
> Regarding flushing caches right after that - what for? (BTW the Intel
> driver doesn't do that either.) If we don't do that and as a result the
> HW is using an old entry cached before we cleared the present bit, it
> is not affected by our later modifications anyway.
You don't know what state the HW fetcher is in. This kind of race is possible:
	CPU                             FETCHER
	                                read present = 1
	present = 0
	mangle qword 1
	                                read qword 1
	                                < fail - HW sees a corrupted entry >
The flush is not just a flush but a barrier to synchronize with the HW
that it is done all fetches that may have been dependent on seeing
present = 1.
So missing a flush after clearing present is possibly a bug today - I
don't remember what the guaranteed atomic size is for the Intel IOMMU
though; if the atomic size is the whole entry it is OK since there is
only one fetcher read. Though AMD is 128 bits and ARM is 64 bits.
> I was talking about compiler guarantees, not HW guarantees. I mean: when
> setting some other bit in the entry before the barrier, if we do that
> without WRITE_ONCE, with a mere "foo |= bar", are we certain the
> compiler will not implement that as, for example, setting the value to
> 0xffffffffffffffff and then clearing other bits (for whatever crazy
> reason)? That would be still a legal thing for the compiler to do, in
> terms of its single-thread guarantees?
The HW doesn't read the values the CPU is writing, so it doesn't
matter if the compiler does something strange. That is the whole
justification for why it is possible to code it like this at all.
The dma_mb() is also a compiler barrier and ensures all that
uncertainty is resolved. Once it completes a DMA from the HW will see
the program defined values only.
Jason
On Tue, Jan 06, 2026 at 10:23:01AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 06, 2026 at 02:51:38PM +0100, Dmytro Maluka wrote:
> > Regarding flushing caches right after that - what for? (BTW the Intel
> > driver doesn't do that either.) If we don't do that and as a result the
> > HW is using an old entry cached before we cleared the present bit, it
> > is not affected by our later modifications anyway.
>
> You don't know what state the HW fetcher is in. This kind of race is possible:
>
> 	CPU                             FETCHER
> 	                                read present = 1
> 	present = 0
> 	mangle qword 1
> 	                                read qword 1
> 	                                < fail - HW sees a corrupted entry >
>
> The flush is not just a flush but a barrier to synchronize with the HW
> that it is done all fetches that may have been dependent on seeing
> present = 1.
>
> So missing a flush after clearing present is possibly a bug today - I
> don't remember what the guaranteed atomic size is for the Intel IOMMU
> though; if the atomic size is the whole entry it is OK since there is
> only one fetcher read. Though AMD is 128 bits and ARM is 64 bits.

Indeed, may be a bug... In the VT-d spec I don't immediately see a
guarantee that context and PASID entries are fetched atomically. (And
for PASID entries, which are 512 bits, that seems particularly
unlikely.)

> > I was talking about compiler guarantees, not HW guarantees. I mean: when
> > setting some other bit in the entry before the barrier, if we do that
> > without WRITE_ONCE, with a mere "foo |= bar", are we certain the
> > compiler will not implement that as, for example, setting the value to
> > 0xffffffffffffffff and then clearing other bits (for whatever crazy
> > reason)? That would still be a legal thing for the compiler to do, in
> > terms of its single-thread guarantees?
>
> The HW doesn't read the values the CPU is writing, so it doesn't
> matter if the compiler does something strange. That is the whole
> justification for why it is possible to code it like this at all.
>
> The dma_mb() is also a compiler barrier and ensures all that
> uncertainty is resolved. Once it completes a DMA from the HW will see
> the program defined values only.

The HW (the IOMMU) may read context / PASID / page table entries from
memory at any time, whenever a device makes a DMA request so the IOMMU
needs to translate it? We don't know if it happens before or after the
barrier? So we'd better make sure that if it happens before the barrier
(i.e. when the device is not supposed to do DMA), the compiler (and thus
the CPU) doesn't set the present bit, so it stays non-present, so the
IOMMU will block this unexpected/malicious DMA?
> From: Dmytro Maluka <dmaluka@chromium.org>
> Sent: Tuesday, January 6, 2026 11:50 PM
>
> On Tue, Jan 06, 2026 at 10:23:01AM -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 06, 2026 at 02:51:38PM +0100, Dmytro Maluka wrote:
> > > Regarding flushing caches right after that - what for? (BTW the Intel
> > > driver doesn't do that either.) If we don't do that and as a result the
> > > HW is using an old entry cached before we cleared the present bit, it
> > > is not affected by our later modifications anyway.
> >
> > You don't know what state the HW fetcher is in. This kind of race is possible:
> >
> > 	CPU                             FETCHER
> > 	                                read present = 1
> > 	present = 0
> > 	mangle qword 1
> > 	                                read qword 1
> > 	                                < fail - HW sees a corrupted entry >
> >
> > The flush is not just a flush but a barrier to synchronize with the HW
> > that it is done all fetches that may have been dependent on seeing
> > present = 1.
> >
> > So missing a flush after clearing present is possibly a bug today - I
> > don't remember what guarenteed the atomic size is for Intel IOMMU
> > though, if the atomic size is the whole entry it is OK since there is
> > only one fetcher read. Though AMD is 128 bits and ARM is 64 bits.
>
> Indeed, may be a bug... In the VT-d spec I don't immediately see a
> guarantee that context and PASID entries are fetched atomically. (And
> for PASID entries, which are 512 bits, that seems particularly
> unlikely.)
>
512bits atomicity is possible, but not on the PASID entry.
VT-d spec, head of section 9 (Translation Structure Formats):
"
This chapter describes the memory-resident structures for DMA and
interrupt remapping. Hardware must access structure entries that
are 64-bit or 128-bit atomically. Hardware must update a 512-bit
Posted Interrupt Descriptor (see Section 9.11 for details) atomically.
Other than the Posted Interrupt Descriptor (PID), hardware is allowed
to break access to larger than 128-bit entries into multiple aligned
128-bit accesses.
"
root entry, scalable root entry, context entry and IRTE are 128bits
so they are OK.
scalable context entry are 256bits but only the lower 128bits are
defined so it's OK for now.
scalable PASID directory entry is 64bits. ok.
posted interrupt descriptor is 512bits with atomicity guaranteed.
but we do have problem on scalable pasid entry which is 512bits.
- bits beyond 191 are for future hardware, not a problem now
- bits 128-191 are for 1st-stage
- bits 0-127 manages stage selection, 2nd-stage, and some 1st-stage
so in theory 1st-stage and nesting are affected by this bug.
In reality:
- iommu driver shouldn't receive an attach request on an in-use pasid
entry, so the cache should remain cleared (either at initial state or
flushed by previous teardown) then hw won't use a partial 1st-stage
config after seeing the entry as non-present.
- replace is already broken, as the entry should not be cleared in the
1st place then this bug will be fixed when replace is reworked.
If no oversight (Baolu?), probably we don't need to fix it strictly following
Jason's pseudo logic at this point. Instead, just rename pasid_clear_entry()
to pasid_clear_entry_no_flush() for now (with some comment to clarify
the expectation), and rework the replace path in parallel.
We may never require a pasid_clear_entry_flush_cache() once hitless
replace is in place. 😊
On 1/8/26 10:09, Tian, Kevin wrote:
>> From: Dmytro Maluka <dmaluka@chromium.org>
>> Sent: Tuesday, January 6, 2026 11:50 PM
>>
>> On Tue, Jan 06, 2026 at 10:23:01AM -0400, Jason Gunthorpe wrote:
>>> On Tue, Jan 06, 2026 at 02:51:38PM +0100, Dmytro Maluka wrote:
>>>> Regarding flushing caches right after that - what for? (BTW the Intel
>>>> driver doesn't do that either.) If we don't do that and as a result the
>>>> HW is using an old entry cached before we cleared the present bit, it
>>>> is not affected by our later modifications anyway.
>>>
>>> You don't know what state the HW fetcher is in. This kind of race is possible:
>>>
>>> 	CPU                             FETCHER
>>> 	                                read present = 1
>>> 	present = 0
>>> 	mangle qword 1
>>> 	                                read qword 1
>>> 	                                < fail - HW sees a corrupted entry >
>>>
>>> The flush is not just a flush but a barrier to synchronize with the HW
>>> that it is done all fetches that may have been dependent on seeing
>>> present = 1.
>>>
>>> So missing a flush after clearing present is possibly a bug today - I
>>> don't remember what the guaranteed atomic size is for the Intel IOMMU
>>> though, if the atomic size is the whole entry it is OK since there is
>>> only one fetcher read. Though AMD is 128 bits and ARM is 64 bits.
>>
>> Indeed, may be a bug... In the VT-d spec I don't immediately see a
>> guarantee that context and PASID entries are fetched atomically. (And
>> for PASID entries, which are 512 bits, that seems particularly
>> unlikely.)
>
> 512bits atomicity is possible, but not on the PASID entry.
>
> VT-d spec, head of section 9 (Translation Structure Formats):
>
> "
> This chapter describes the memory-resident structures for DMA and
> interrupt remapping. Hardware must access structure entries that
> are 64-bit or 128-bit atomically. Hardware must update a 512-bit
> Posted Interrupt Descriptor (see Section 9.11 for details) atomically.
> Other than the Posted Interrupt Descriptor (PID), hardware is allowed
> to break access to larger than 128-bit entries into multiple aligned
> 128-bit accesses.
> "
>
> root entry, scalable root entry, context entry and IRTE are 128bits
> so they are OK.
>
> scalable context entry are 256bits but only the lower 128bits are
> defined so it's OK for now.
>
> scalable PASID directory entry is 64bits. ok.
>
> posted interrupt descriptor is 512bits with atomicity guaranteed.
>
> but we do have problem on scalable pasid entry which is 512bits.
> - bits beyond 191 are for future hardware, not a problem now
> - bits 128-191 are for 1st-stage
> - bits 0-127 manages stage selection, 2nd-stage, and some 1st-stage
>
> so in theory 1st-stage and nesting are affected by this bug.

Yes. This is a real software bug. The hardware is legally allowed to
tear the pasid table entry read into 4 128-bit chunks. For first-stage
(first-only or nested) translation, chunk 1 (bits 0-127) and chunk 2
(bits 128-191) both contain active configuration data, so hardware could
possibly read an entry composed of half-old and half-new data.

The VT-d spec defines software guidance for pasid table entry
manipulation in section 6.5.3.3 (Guidance to Software for
Invalidations). I think the Linux driver doesn't handle the handshake
between the CPU and the IOMMU hardware in the right way. The correct way
should be a clear-flush-update sequence, that is, when tearing down a
present pasid table entry, the software should:

1. Clear the Present bit;
2. Invalidate the cache according to section 6.5.3.3;
   [now the CPU owns the pasid table entry]
3. Update other fields;
4. Set the Present bit;
   [now the VT-d hardware owns the pasid table entry].

> In reality:
> - iommu driver shouldn't receive an attach request on an in-use pasid
>   entry, so the cache should remain cleared (either at initial state or
>   flushed by previous teardown) then hw won't use a partial 1st-stage
>   config after seeing the entry as non-present.

Yes. But the current tear down process is buggy as described above.

> - replace is already broken, as the entry should not be cleared in the
>   1st place then this bug will be fixed when replace is reworked.

We still have a long way to go before achieving the real replace since
the hardware doesn't guarantee 512-bit atomicity, so "hitless" is very
difficult.

> If no oversight (Baolu?), probably we don't need to fix it strictly
> following Jason's pseudo logic at this point. Instead, just rename
> pasid_clear_entry() to pasid_clear_entry_no_flush() for now (with some
> comment to clarify the expectation), and rework the replace path in
> parallel. We may never require a pasid_clear_entry_flush_cache() once
> hitless replace is in place. 😊

Perhaps we can first add pasid_entry_clear_present() to fulfill the
clear-flush-update handshake between the software and the IOMMU
hardware while reworking the PASID replace path?

Thanks,
baolu
On Tue, Jan 06, 2026 at 04:50:11PM +0100, Dmytro Maluka wrote:
> So we'd better make sure that if it happens before the barrier (i.e.
> when the device is not supposed to do DMA), the compiler (and thus
> the CPU) doesn't set the present bit, so it stays non-present, so
> the IOMMU will block this unexpected/malicious DMA?

It is true that any write to the dword containing the present bit
(only!) should probably use WRITE_ONCE.

Jason
On Tue, Jan 06, 2026 at 12:45:18PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 06, 2026 at 04:50:11PM +0100, Dmytro Maluka wrote:
> > So we'd better make sure that if it happens before the barrier (i.e.
> > when the device is not supposed to do DMA), the compiler (and thus
> > the CPU) doesn't set the present bit, so it stays non-present, so
> > the IOMMU will block this unexpected/malicious DMA?
>
> It is true that any write to the dword containing the present bit
> (only!) should probably use WRITE_ONCE.

Yeah, I agree that for other dwords within an entry it isn't necessary.

So, when it comes to the actual code, it's not like the Intel IOMMU
driver has lots of WRITE_ONCE spread all over the code. It's only in
pasid_set_bits() and pasid_clear_entry(). Though, since pasid_set_bits()
is used for setting any bits, as a result, WRITE_ONCE is also done for
dwords that don't need it, but that doesn't hurt either.