As discussed in [1], we don't currently prevent the compiler from
reordering memory writes when updating context entries, which is
potentially dangerous: it may set the present bit (i.e. enable DMA
translation for the given device) before the other bits in the context
entry are fully set up, creating a time window in which a DMA from the
device may result in unpredictable behavior.

Fix this the same way it is already addressed for PASID entries, i.e.
by using READ_ONCE/WRITE_ONCE in the helpers that set individual bits
in context entries, so that the memory writes done by those helpers are
ordered relative to each other (and load/store tearing and so on are
prevented as well).

While at it, similarly paranoidally fix updating root entries as well:
use WRITE_ONCE to make sure that the present bit is set atomically
together with the context table address bits, not before them.

[1] https://lore.kernel.org/all/aTG7gc7I5wExai3S@google.com/

v1 -> v2:
- Sanitize bits to not exceed the mask (suggested by Baolu)
- Reuse pasid_set_bits() for context entries as well (rename it to
  entry_set_bits())
- Add extra barrier in *_set_present() (suggested by Baolu)

Dmytro Maluka (5):
  iommu/vt-d: Sanitize set bits in pasid_set_bits()
  iommu/vt-d: Generalize pasid_set_bits()
  iommu/vt-d: Ensure memory ordering in context entry updates
  iommu/vt-d: Use smp_wmb() before setting context/pasid present bit
  iommu/vt-d: Use WRITE_ONCE for setting root table entries

 drivers/iommu/intel/iommu.c |  2 +-
 drivers/iommu/intel/iommu.h | 49 +++++++++++++++++++++++++------------
 drivers/iommu/intel/pasid.c |  3 ++-
 drivers/iommu/intel/pasid.h | 46 +++++++++++++++++-----------------
 4 files changed, 59 insertions(+), 41 deletions(-)

-- 
2.47.3
On Sat, Dec 27, 2025 at 06:57:23PM +0100, Dmytro Maluka wrote:
> As discussed in [1], we don't currently prevent the compiler from
> reordering memory writes when updating context entries, which is
> potentially dangerous, as it may cause setting the present bit (i.e.
> enabling DMA translation for the given device) before finishing setting
> up other bits in the context entry (and thus creating a time window when
> a DMA from the device may result in an unpredicted behavior).
>
> Fix this in the same way as how this is already addressed for PASID
> entries, i.e. by using READ_ONCE/WRITE_ONCE in the helpers used for
> setting individual bits in context entries, so that memory writes done
> by those helpers are ordered in relation to each other (plus, prevent
> load/store tearing and so on).
>
> While at it, similarly paranoidally fix updating root entries as well:
> use WRITE_ONCE to make sure that the present bit is set atomically
> together with the context table address bits, not before them.

The PASID entries should not be manipulated 'live' in a haphazard way
like this in the first place!

Like AMD and ARM, build the new PASID entry on the stack and then copy
it to the DMA'able memory in a way that is consistent with the HW's
atomicity granule, paying attention not to 'tear' it.

This manipulate-in-place is just asking for trouble, and can never
support replace or full viommu requirements.. :\

So while it is perhaps an improvement to do this work, it would be
better to fix the root cause issue if someone has time..

Jason
On 1/6/26 02:12, Jason Gunthorpe wrote:
> On Sat, Dec 27, 2025 at 06:57:23PM +0100, Dmytro Maluka wrote:
>> As discussed in [1], we don't currently prevent the compiler from
>> reordering memory writes when updating context entries, which is
>> potentially dangerous, as it may cause setting the present bit (i.e.
>> enabling DMA translation for the given device) before finishing setting
>> up other bits in the context entry (and thus creating a time window when
>> a DMA from the device may result in an unpredicted behavior).
>>
>> Fix this in the same way as how this is already addressed for PASID
>> entries, i.e. by using READ_ONCE/WRITE_ONCE in the helpers used for
>> setting individual bits in context entries, so that memory writes done
>> by those helpers are ordered in relation to each other (plus, prevent
>> load/store tearing and so on).
>>
>> While at it, similarly paranoidally fix updating root entries as well:
>> use WRITE_ONCE to make sure that the present bit is set atomically
>> together with the context table address bits, not before them.
> The PASID entries should not be manipulated 'live' in a haphazard way
> like this in the first place!
>
> Like AMD and ARM build the new PASID entry on the stack and then it
> should be copied to the DMA'able memory in a way that is consistent
> with the HW's atomicity granule, paying attention not to 'tear' it.
>
> This manipulate-in-place is just asking for trouble, and can never
> support replace or full viommu requirements.. :\
>
> So while it is perhaps an improvement to do this work, it would be
> better to fix the root cause issue if someone has time..

Agreed. The current Intel IOMMU driver uses a 'clear-populate-set'
pattern protected by a spinlock, which is why it doesn't support
'replace' yet. Dmytro's patch addresses the immediate risk of the
compiler reordering those writes and exposing invalid data to the
hardware.

Moving to an on-stack construction (like AMD/ARM) and updating
atomically is the right direction for the driver. We'll look into that
refactoring as a follow-up series to modernize the entry manipulation
logic.

Thanks,
baolu
On Mon, Jan 05, 2026 at 02:12:00PM -0400, Jason Gunthorpe wrote:
> On Sat, Dec 27, 2025 at 06:57:23PM +0100, Dmytro Maluka wrote:
> > As discussed in [1], we don't currently prevent the compiler from
> > reordering memory writes when updating context entries, which is
> > potentially dangerous, as it may cause setting the present bit (i.e.
> > enabling DMA translation for the given device) before finishing setting
> > up other bits in the context entry (and thus creating a time window when
> > a DMA from the device may result in an unpredicted behavior).
> >
> > Fix this in the same way as how this is already addressed for PASID
> > entries, i.e. by using READ_ONCE/WRITE_ONCE in the helpers used for
> > setting individual bits in context entries, so that memory writes done
> > by those helpers are ordered in relation to each other (plus, prevent
> > load/store tearing and so on).
> >
> > While at it, similarly paranoidally fix updating root entries as well:
> > use WRITE_ONCE to make sure that the present bit is set atomically
> > together with the context table address bits, not before them.
>
> The PASID entries should not be manipulated 'live' in a haphazard way
> like this in the first place!
>
> Like AMD and ARM build the new PASID entry on the stack and then it
> should be copied to the DMA'able memory in a way that is consistent
> with the HW's atomicity granule, paying attention not to 'tear' it.
As I understand, the "consistent with the HW's atomicity granule, paying
attention not to 'tear' it" part is already fulfilled for PASID entries
(and with this series, for context entries as well):
static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
{
	u64 old;

	old = READ_ONCE(*ptr);
	WRITE_ONCE(*ptr, (old & ~mask) | bits);
}
I've been assuming it's ok to manipulate other bits in place as long as
we take care to only do that while the present bit is cleared (i.e.
while the entry is ignored by hardware)?
So IIUC the only problem with this approach is the redundancy: we do
this READ_ONCE+WRITE_ONCE for each individual field in a PASID entry.

So while I agree it would be much more natural to build whole entries,
and the existing way looks strange and not the most efficient, I'm
wondering if it is causing any actual correctness issues (apart from
those addressed by this series).
> This manipulate-in-place is just asking for trouble, and can never
> support replace or full viommu requirements.. :\
>
> So while it is perhaps an improvement to do this work, it would be
> better to fix the root cause issue if someone has time..
>
> Jason
On Mon, Jan 05, 2026 at 07:54:53PM +0100, Dmytro Maluka wrote:
> > Like AMD and ARM build the new PASID entry on the stack and then it
> > should be copied to the DMA'able memory in a way that is consistent
> > with the HW's atomicity granule, paying attention not to 'tear' it.
>
> As I understand, the "consistent with the HW's atomicity granule, paying
> attention not to 'tear' it" part is already fulfilled for PASID entries
> (and with this series, for context entries as well):
>
> static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
> {
> 	u64 old;
>
> 	old = READ_ONCE(*ptr);
> 	WRITE_ONCE(*ptr, (old & ~mask) | bits);
> }
>
> I've been assuming it's ok to manipulate other bits in place as long as
> we take care to only do that while the present bit is cleared (i.e.
> while the entry is ignored by hardware)?
If these are only done while non-present then the only issue is
missing a barrier before setting present, that should be a one line
patch, no?
> So IIUC the only problem with this approach is the redundancy: we do
> this READ_ONCE+WRITE_ONCE for each individual field in a PASID entry.
You don't need READ_ONCE if there isn't another thread concurrently
writing, and WRITE_ONCE is pointless if the HW is promising not to
read it due to non-present.
> So while I agree it would be much more natural to build whole entries,
> and the existing way looks strange and not the most efficient, I'm
> wondering if it is causing any actual correctness issues (apart from
> those addressed by this series).
It prevents doing the replace operation, which is a correctness issue
for VMs.
Jason
On Mon, Jan 05, 2026 at 03:14:10PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 05, 2026 at 07:54:53PM +0100, Dmytro Maluka wrote:
> > > Like AMD and ARM build the new PASID entry on the stack and then it
> > > should be copied to the DMA'able memory in a way that is consistent
> > > with the HW's atomicity granule, paying attention not to 'tear' it.
> >
> > As I understand, the "consistent with the HW's atomicity granule, paying
> > attention not to 'tear' it" part is already fulfilled for PASID entries
> > (and with this series, for context entries as well):
> >
> > static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
> > {
> > 	u64 old;
> >
> > 	old = READ_ONCE(*ptr);
> > 	WRITE_ONCE(*ptr, (old & ~mask) | bits);
> > }
> >
> > I've been assuming it's ok to manipulate other bits in place as long as
> > we take care to only do that while the present bit is cleared (i.e.
> > while the entry is ignored by hardware)?
>
> If these are only done while non-present then the only issue is
> missing a barrier before setting present, that should be a one line
> patch, no?
Yes, a barrier, or alternatively WRITE_ONCE for any updates of entries.
The latter is how it was already done for PASID entries, before this
series (so IIUC for PASID entries there was already no issue), and my
main goal was to fix the issue for context entries as well, so I just
did it similarly.
> > So IIUC the only problem with this approach is the redundancy: we do
> > this READ_ONCE+WRITE_ONCE for each individual field in a PASID entry.
>
> You don't need READ_ONCE if there isn't another thread concurrently
> writing,
Good point. But it was there before this series.
> and WRITE_ONCE is pointless if the HW is promising not to
> read it due to non-present.
As long as we use a barrier. And IIUC vice versa, if we use WRITE_ONCE
for any updates, a barrier is not necessary (on x86). And WRITE_ONCE for
any updates (for PASID entries) is what was already done before this
series.
Although, perhaps even with a barrier, WRITE_ONCE is still desirable to
prevent the compiler from doing strange but legal things that involve
transiently setting and then clearing the present bit behind our back?
(Not that I'm aware of any compilers doing that, but I'm no compiler
expert.)
> > So while I agree it would be much more natural to build whole entries,
> > and the existing way looks strange and not the most efficient, I'm
> > wondering if it is causing any actual correctness issues (apart from
> > those addressed by this series).
>
> It prevents doing the replace operation, which is a correctness issue
> for VMs.
>
> Jason
On Mon, Jan 05, 2026 at 09:05:36PM +0100, Dmytro Maluka wrote:
> > and WRITE_ONCE is pointless if the HW is promising not to
> > read it due to non-present.
>
> As long as we use a barrier. And IIUC vice versa, if we use WRITE_ONCE
> for any updates, a barrier is not necessary (on x86). And WRITE_ONCE for
> any updates (for PASID entries) is what was already done before this
> series.

That is an x86 ism and it shouldn't be needlessly leaked into drivers.
As this is not performance critical it should have the portable flow:

	WRITE_ONCE(non-present)
	dma_wmb()
	<cmd to flush caches>

	[..]
	<writes to the entry>

	dma_wmb()
	WRITE_ONCE(present)

So, it seems to me like all you need here is the one line to add the
dma_wmb() prior to setting present.

> Although, perhaps even with a barrier, WRITE_ONCE is still desirable to
> prevent the compiler from doing strange but legal things that involve
> transiently setting and then clearing the present bit behind our back?
> (Not that I'm aware of any compilers doing that, but I'm no compiler
> expert.)

You do have to write once the present bit, but not the others. The
dma_wmb() will make sure the HW cannot observe other states when it
observes present.

Jason
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Tuesday, January 6, 2026 8:14 AM
>
> On Mon, Jan 05, 2026 at 09:05:36PM +0100, Dmytro Maluka wrote:
> > > and WRITE_ONCE is pointless if the HW is promising not to
> > > read it due to non-present.
> >
> > As long as we use a barrier. And IIUC vice versa, if we use WRITE_ONCE
> > for any updates, a barrier is not necessary (on x86). And WRITE_ONCE for
> > any updates (for PASID entries) is what was already done before this
> > series.
>
> That is an x86 ism and it shouldn't be needlessly leaked into drivers.

yeah WRITE_ONCE() is not by definition to guarantee the ordering between
CPU and device. lots of READ_ONCE()/WRITE_ONCE() in existing code are
meaningless, as 1) between CPUs there is already lock protection;
2) between CPU and device it requires dma_wmb() to guarantee the order.

> As this is not performance it should have the portable flow:
>
> 	WRITE_ONCE(non-present)
> 	dma_wmb()
> 	<cmd to flush caches>
>
> 	[..]
> 	<writes to the entry>
>
> 	dma_wmb()
> 	WRITE_ONCE(present)
>
> So, it seems to me like all you need here is the one line to add the
> dma_wmb() prior to setting present.

+1
On Tue, Jan 06, 2026 at 07:48:50AM +0000, Tian, Kevin wrote:
> yeah WRITE_ONCE() is not by definition to guarantee the ordering between
> CPU and device.

Yes, WRITE_ONCE is not about HW guarantees at all, it is about compiler
guarantees. And it is not about ordering, it is about the compiler's
guarantee to store the given 64-bit value once, with one instruction.
But this compiler guarantee is exactly my point (see my last reply to
Jason).

> lots of READ_ONCE()/WRITE_ONCE() in existing code are meaningless,
> as 1) between CPUs there is already lock protection; 2) between CPU and
> device it requires dma_wmb() to guarantee the order.

As I see it, those WRITE_ONCEs (maybe not READ_ONCEs) haven't been
meaningless (I mean, they have been actually useful) so long as we
haven't been using any barriers. Again, on x86, store ordering requires
just compiler ordering, and dma_wmb() is just a compiler barrier. So,
assuming this driver is only used on x86 (which is, well, true :)),
we are lacking even compiler barriers, but at least we have those
WRITE_ONCEs, which provide compiler ordering too (although only between
each other, not with any other memory accesses, but that seems enough
for our case).

And again, I agree it is not pretty to rely on arch-specific ordering
assumptions, and doing in-place updates via those context_xxx() and
pasid_xxx() helpers all over the place instead of updating whole entries
seems a strange choice. But that is how it was implemented 10 or so
years ago, and overhauling that hasn't been my goal.
> From: Dmytro Maluka <dmaluka@chromium.org>
> Sent: Tuesday, January 6, 2026 10:40 PM
>
> On Tue, Jan 06, 2026 at 07:48:50AM +0000, Tian, Kevin wrote:
> > yeah WRITE_ONCE() is not by definition to guarantee the ordering between
> > CPU and device.
>
> Yes, WRITE_ONCE is not about HW guarantees at all, it is about compiler
> guarantees. And it is not about ordering, it is about the compiler's
> guarantee to store the given 64-bit value once, with one instruction.
> But this compiler guarantee is exactly my point (see my last reply to
> Jason).
>
> > lots of READ_ONCE()/WRITE_ONCE() in existing code are meaningless,
> > as 1) between CPUs there is already lock protection; 2) between CPU and
> > device it requires dma_wmb() to guarantee the order.
>
> As I see it, those WRITE_ONCEs (maybe not READ_ONCEs) haven't been
> meaningless (I mean, they have been actually useful) so long as we
> haven't been using any barriers. Again, on x86, store ordering requires
> just compiler ordering, and dma_wmb() is just a compiler barrier. So,
> assuming this driver is only used on x86 (which is, well, true :)),
> we are lacking even compiler barriers, but at least we have those
> WRITE_ONCEs, which provide compiler ordering too (although only between
> each other, not with any other memory accesses, but that seems enough
> for our case).
>
> And again, I agree it is not pretty to rely on arch-specific ordering
> assumptions, and doing in-place updates via those context_xxx() and
> pasid_xxx() helpers all over the place instead of updating whole entries
> seems a strange choice. But that is how it was implemented 10 or so
> years ago, and overhauling that hasn't been my goal.

sure. the point here is to align on what is the right thing to do. then
we could have short-term fixes plus bigger refactoring later.
On Mon, Jan 05, 2026 at 08:14:18PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 05, 2026 at 09:05:36PM +0100, Dmytro Maluka wrote:
> > > and WRITE_ONCE is pointless if the HW is promising not to
> > > read it due to non-present.
> >
> > As long as we use a barrier. And IIUC vice versa, if we use WRITE_ONCE
> > for any updates, a barrier is not necessary (on x86). And WRITE_ONCE for
> > any updates (for PASID entries) is what was already done before this
> > series.
>
> That is an x86 ism and it shouldn't be needlessly leaked into drivers.
> As this is not performance it should have the portable flow:

Agree. And as a matter of fact, patch 4 in this series already does
that - it adds a barrier before setting the present bit, despite it
being "redundant on x86". It adds smp_wmb() as suggested by Baolu but I
can change it to dma_wmb() which indeed seems more suitable. (FWIW on
x86 there is no difference between smp_wmb() and dma_wmb(), both are
just barrier().)

> 	WRITE_ONCE(non-present)
> 	dma_wmb()
> 	<cmd to flush caches>

Regarding a barrier after clearing the present bit - good point, I
should probably add that to my patch 4 as well.

Regarding flushing caches right after that - what for? (BTW the Intel
driver doesn't do that either.) If we don't do that and as a result the
HW is using an old entry cached before we cleared the present bit, it
is not affected by our later modifications anyway. Of course we flush
caches at the end, after setting the present bit, but that's a
different story.

> > Although, perhaps even with a barrier, WRITE_ONCE is still desirable to
> > prevent the compiler from doing strange but legal things that involve
> > transiently setting and then clearing the present bit behind our back?
> > (Not that I'm aware of any compilers doing that, but I'm no compiler
> > expert.)
>
> You do have to write once the present bit, but not the others. The
> dma_wmb() will make sure the HW cannot observe other states when it
> observes present.

I was talking about compiler guarantees, not HW guarantees. I mean: when
setting some other bit in the entry before the barrier, if we do that
without WRITE_ONCE, with a mere "foo |= bar", are we certain the
compiler will not implement that as, for example, setting the value to
0xffffffffffffffff and then clearing other bits (for whatever crazy
reason)? That would still be a legal thing for the compiler to do, in
terms of its single-thread guarantees?

So it still seems a good idea to use WRITE_ONCE (or anything not weaker
than WRITE_ONCE) for any updates of memory structures used by hardware,
to prevent such things? (And using WRITE_ONCE for that doesn't really
cost us anything anyway.)
On Tue, Jan 06, 2026 at 02:51:38PM +0100, Dmytro Maluka wrote:
> > WRITE_ONCE(non-present)
> > dma_wmb()
> > <cmd to flush caches>
>
> Regarding a barrier after clearing present bit - good point, I should
> probably add that to my patch 4 as well.
It is already integrated into <cmd to flush caches> I showed it here
for clarity.
> Regarding flushing caches right after that - what for? (BTW the Intel
> driver doesn't do that either.) If we don't do that and as a result the
> HW is using an old entry cached before we cleared the present bit, it
> is not affected by our later modifications anyway.
You don't know what state the HW fetcher is in. This kind of race is possible:
	CPU                             FETCHER
	                                read present = 1
	present = 0
	mangle qword 1
	                                read qword 1
	                                < fail - HW sees a corrupted entry >
The flush is not just a flush but a barrier to synchronize with the HW
that it is done all fetches that may have been dependent on seeing
present = 1.
So missing a flush after clearing present is possibly a bug today - I
don't remember what the guaranteed atomic size is for the Intel IOMMU
though; if the atomic size is the whole entry it is OK since there is
only one fetcher read. Though AMD is 128 bits and ARM is 64 bits.
> I was talking about compiler guarantees, not HW guarantees. I mean: when
> setting some other bit in the entry before the barrier, if we do that
> without WRITE_ONCE, with a mere "foo |= bar", are we certain the
> compiler will not implement that as, for example, setting the value to
> 0xffffffffffffffff and then clearing other bits (for whatever crazy
> reason)? That would be still a legal thing for the compiler to do, in
> terms of its single-thread guarantees?
The HW doesn't read the values the CPU is writing, so it doesn't
matter if the compiler does something strange. That is the whole
justification for why it is possible to code it like this at all.
The dma_mb() is also a compiler barrier and ensures all that
uncertainty is resolved. Once it completes a DMA from the HW will see
the program defined values only.
Jason
On Tue, Jan 06, 2026 at 10:23:01AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 06, 2026 at 02:51:38PM +0100, Dmytro Maluka wrote:
> > Regarding flushing caches right after that - what for? (BTW the Intel
> > driver doesn't do that either.) If we don't do that and as a result the
> > HW is using an old entry cached before we cleared the present bit, it
> > is not affected by our later modifications anyway.
>
> You don't know what state the HW fetcher is in. This kind of race is possible:
>
> 	CPU                             FETCHER
> 	                                read present = 1
> 	present = 0
> 	mangle qword 1
> 	                                read qword 1
> 	                                < fail - HW sees a corrupted entry >
>
> The flush is not just a flush but a barrier to synchronize with the HW
> that it is done all fetches that may have been dependent on seeing
> present = 1.
>
> So missing a flush after clearing present is possibly a bug today - I
> don't remember what the guaranteed atomic size is for the Intel IOMMU
> though; if the atomic size is the whole entry it is OK since there is
> only one fetcher read. Though AMD is 128 bits and ARM is 64 bits.

Indeed, may be a bug... In the VT-d spec I don't immediately see a
guarantee that context and PASID entries are fetched atomically. (And
for PASID entries, which are 512 bits, that seems particularly
unlikely.)

> > I was talking about compiler guarantees, not HW guarantees. I mean: when
> > setting some other bit in the entry before the barrier, if we do that
> > without WRITE_ONCE, with a mere "foo |= bar", are we certain the
> > compiler will not implement that as, for example, setting the value to
> > 0xffffffffffffffff and then clearing other bits (for whatever crazy
> > reason)? That would still be a legal thing for the compiler to do, in
> > terms of its single-thread guarantees?
>
> The HW doesn't read the values the CPU is writing, so it doesn't
> matter if the compiler does something strange. That is the whole
> justification for why it is possible to code it like this at all.
>
> The dma_mb() is also a compiler barrier and ensures all that
> uncertainty is resolved. Once it completes a DMA from the HW will see
> the program defined values only.

The HW (the IOMMU) may read context / PASID / page table entries from
memory at any time, whenever a device makes a DMA request so the IOMMU
needs to translate it? We don't know if it happens before or after the
barrier? So we'd better make sure that if it happens before the barrier
(i.e. when the device is not supposed to do DMA), the compiler (and thus
the CPU) doesn't set the present bit, so it stays non-present, so the
IOMMU will block this unexpected/malicious DMA?
> From: Dmytro Maluka <dmaluka@chromium.org>
> Sent: Tuesday, January 6, 2026 11:50 PM
>
> On Tue, Jan 06, 2026 at 10:23:01AM -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 06, 2026 at 02:51:38PM +0100, Dmytro Maluka wrote:
> > > Regarding flushing caches right after that - what for? (BTW the Intel
> > > driver doesn't do that either.) If we don't do that and as a result the
> > > HW is using an old entry cached before we cleared the present bit, it
> > > is not affected by our later modifications anyway.
> >
> > You don't know what state the HW fetcher is in. This kind of race is possible:
> >
> > 	CPU                             FETCHER
> > 	                                read present = 1
> > 	present = 0
> > 	mangle qword 1
> > 	                                read qword 1
> > 	                                < fail - HW sees a corrupted entry >
> >
> > The flush is not just a flush but a barrier to synchronize with the HW
> > that it is done all fetches that may have been dependent on seeing
> > present = 1.
> >
> > So missing a flush after clearing present is possibly a bug today - I
> > don't remember what guarenteed the atomic size is for Intel IOMMU
> > though, if the atomic size is the whole entry it is OK since there is
> > only one fetcher read. Though AMD is 128 bits and ARM is 64 bits.
>
> Indeed, may be a bug... In the VT-d spec I don't immediately see a
> guarantee that context and PASID entries are fetched atomically. (And
> for PASID entries, which are 512 bits, that seems particularly
> unlikely.)
>
512bits atomicity is possible, but not on the PASID entry.
VT-d spec, head of section 9 (Translation Structure Formats):
"
This chapter describes the memory-resident structures for DMA and
interrupt remapping. Hardware must access structure entries that
are 64-bit or 128-bit atomically. Hardware must update a 512-bit
Posted Interrupt Descriptor (see Section 9.11 for details) atomically.
Other than the Posted Interrupt Descriptor (PID), hardware is allowed
to break access to larger than 128-bit entries into multiple aligned
128-bit accesses.
"
root entry, scalable root entry, context entry and IRTE are 128bits
so they are OK.
scalable context entry are 256bits but only the lower 128bits are
defined so it's OK for now.
scalable PASID directory entry is 64bits. ok.
posted interrupt descriptor is 512bits with atomicity guaranteed.
but we do have problem on scalable pasid entry which is 512bits.
- bits beyond 191 are for future hardware, not a problem now
- bits 128-191 are for 1st-stage
- bits 0-127 manages stage selection, 2nd-stage, and some 1st-stage
so in theory 1st-stage and nesting are affected by this bug.
In reality:
- iommu driver shouldn't receive an attach request on an in-use pasid
entry, so the cache should remain cleared (either at initial state or
flushed by previous teardown) then hw won't use a partial 1st-stage
config after seeing the entry as non-present.
- replace is already broken, as the entry should not be cleared in the
1st place then this bug will be fixed when replace is reworked.
If no oversight (Baolu?), probably we don't need to fix it strictly following
Jason's pseudo logic at this point. Instead, just rename pasid_clear_entry()
to pasid_clear_entry_no_flush() for now (with some comment to clarify
the expectation), and rework the replace path in parallel.
We may never require a pasid_clear_entry_flush_cache() once hitless
replace is in place. 😊
On 1/8/26 10:09, Tian, Kevin wrote:
>> From: Dmytro Maluka <dmaluka@chromium.org>
>> Sent: Tuesday, January 6, 2026 11:50 PM
>>
>> On Tue, Jan 06, 2026 at 10:23:01AM -0400, Jason Gunthorpe wrote:
>>> On Tue, Jan 06, 2026 at 02:51:38PM +0100, Dmytro Maluka wrote:
>>>> Regarding flushing caches right after that - what for? (BTW the Intel
>>>> driver doesn't do that either.) If we don't do that and as a result the
>>>> HW is using an old entry cached before we cleared the present bit, it
>>>> is not affected by our later modifications anyway.
>>>
>>> You don't know what state the HW fetcher is in. This kind of race is possible:
>>>
>>> 	CPU                             FETCHER
>>> 	                                read present = 1
>>> 	present = 0
>>> 	mangle qword 1
>>> 	                                read qword 1
>>> 	                                < fail - HW sees a corrupted entry >
>>>
>>> The flush is not just a flush but a barrier to synchronize with the HW
>>> that it is done all fetches that may have been dependent on seeing
>>> present = 1.
>>>
>>> So missing a flush after clearing present is possibly a bug today - I
>>> don't remember what the guaranteed atomic size is for the Intel IOMMU
>>> though, if the atomic size is the whole entry it is OK since there is
>>> only one fetcher read. Though AMD is 128 bits and ARM is 64 bits.
>>
>> Indeed, may be a bug... In the VT-d spec I don't immediately see a
>> guarantee that context and PASID entries are fetched atomically. (And
>> for PASID entries, which are 512 bits, that seems particularly
>> unlikely.)
>
> 512bits atomicity is possible, but not on the PASID entry.
>
> VT-d spec, head of section 9 (Translation Structure Formats):
>
> "
> This chapter describes the memory-resident structures for DMA and
> interrupt remapping. Hardware must access structure entries that
> are 64-bit or 128-bit atomically. Hardware must update a 512-bit
> Posted Interrupt Descriptor (see Section 9.11 for details) atomically.
> Other than the Posted Interrupt Descriptor (PID), hardware is allowed
> to break access to larger than 128-bit entries into multiple aligned
> 128-bit accesses.
> "
>
> root entry, scalable root entry, context entry and IRTE are 128bits
> so they are OK.
>
> scalable context entry are 256bits but only the lower 128bits are
> defined so it's OK for now.
>
> scalable PASID directory entry is 64bits. ok.
>
> posted interrupt descriptor is 512bits with atomicity guaranteed.
>
> but we do have problem on scalable pasid entry which is 512bits.
> - bits beyond 191 are for future hardware, not a problem now
> - bits 128-191 are for 1st-stage
> - bits 0-127 manages stage selection, 2nd-stage, and some 1st-stage
>
> so in theory 1st-stage and nesting are affected by this bug.

Yes. This is a real software bug. The hardware is legally allowed to
tear the pasid table entry read into 4 128-bit chunks. For first-stage
(first-only or nested) translation, chunk 1 (bits 0-127) and chunk 2
(bits 128-191) both contain active configuration data, so hardware could
possibly read an entry composed of half-old and half-new data.

The VT-d spec defines software guidance for pasid table entry
manipulation in section 6.5.3.3 (Guidance to Software for
Invalidations). I think the Linux driver doesn't handle the handshake
between the CPU and the IOMMU hardware in the right way. The correct way
should be a clear-flush-update sequence, that is, when tearing down a
present pasid table entry, the software should:

1. Clear the Present bit;
2. Invalidate the cache according to section 6.5.3.3;
   [now the CPU owns the pasid table entry]
3. Update other fields;
4. Set the Present bit;
   [now the VT-d hardware owns the pasid table entry].

> In reality:
> - iommu driver shouldn't receive an attach request on an in-use pasid
>   entry, so the cache should remain cleared (either at initial state or
>   flushed by previous teardown) then hw won't use a partial 1st-stage
>   config after seeing the entry as non-present.

Yes. But the current tear down process is buggy as described above.

> - replace is already broken, as the entry should not be cleared in the
>   1st place then this bug will be fixed when replace is reworked.

We still have a long way to go before achieving the real replace since
the hardware doesn't guarantee 512-bit atomicity, so "hitless" is very
difficult.

> If no oversight (Baolu?), probably we don't need to fix it strictly
> following Jason's pseudo logic at this point. Instead, just rename
> pasid_clear_entry() to pasid_clear_entry_no_flush() for now (with some
> comment to clarify the expectation), and rework the replace path in
> parallel. We may never require a pasid_clear_entry_flush_cache() once
> hitless replace is in place. 😊

Perhaps we can first add pasid_entry_clear_present() to fulfill the
clear-flush-update handshake between the software and the IOMMU
hardware while reworking the PASID replace path?

Thanks,
baolu
On Tue, Jan 06, 2026 at 04:50:11PM +0100, Dmytro Maluka wrote:
> So we'd better make sure that if it happens before the barrier (i.e.
> when the device is not supposed to do DMA), the compiler (and thus
> the CPU) doesn't set the present bit, so it stays non-present, so
> the IOMMU will block this unexpected/malicious DMA?

It is true that any write to the dword containing the present bit
(only!) should probably use WRITE_ONCE.

Jason
On Tue, Jan 06, 2026 at 12:45:18PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 06, 2026 at 04:50:11PM +0100, Dmytro Maluka wrote:
> > So we'd better make sure that if it happens before the barrier (i.e.
> > when the device is not supposed to do DMA), the compiler (and thus
> > the CPU) doesn't set the present bit, so it stays non-present, so
> > the IOMMU will block this unexpected/malicious DMA?
>
> It is true that any write to the dword containing the present bit
> (only!) should probably use WRITE_ONCE.

Yeah, I agree that for other dwords within an entry it isn't necessary.

So, when it comes to the actual code, it's not like the Intel IOMMU
driver has lots of WRITE_ONCE spread all over the code. It's only in
pasid_set_bits() and pasid_clear_entry(). Though, since pasid_set_bits()
is used for setting any bits, as a result, WRITE_ONCE is also done for
dwords that don't need it, but that doesn't hurt either.