[PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX

Sean Christopherson posted 2 patches 1 year ago
arch/x86/include/asm/mtrr.h        |  5 +++--
arch/x86/kernel/cpu/mtrr/generic.c | 11 +++++++----
arch/x86/kernel/kvm.c              | 31 ++++++++++++++++++++++++++++--
3 files changed, 39 insertions(+), 8 deletions(-)
[PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
Posted by Sean Christopherson 1 year ago
Attempt to hack around the SNP/TDX guest MTRR disaster by hijacking
x86_platform.is_untracked_pat_range() to force the legacy PCI hole, i.e.
memory from TOLUD => 4GiB, as unconditionally writeback.

TDX in particular has created an impossible situation with MTRRs.  Because
TDX disallows toggling CR0.CD, TDX enabling decided the easiest solution
was to ignore MTRRs entirely (because omitting CR0.CD write is obviously
too simple).

Unfortunately, under KVM at least, the kernel subtly relies on MTRRs to
make ACPI play nice with device drivers.  ACPI tries to map ranges it finds
as WB, which in turn prevents device drivers from mapping device memory as
WC/UC-.

For the record, I hate this hack.  But it's the safest approach I can come
up with.  E.g. forcing ioremap() to always use WB scares me because it's
possible, however unlikely, that the kernel could try to map non-emulated
memory (that is presented as MMIO to the guest) as WC/UC-, and silently
forcing those mappings to WB could do weird things.

My initial thought was to effectively revert the offending commit and
skip the cache disabling/enabling, i.e. the problematic CR0.CD toggling,
but unfortunately OVMF/EDKII has also added code to skip MTRR setup. :-(

Sean Christopherson (2):
  x86/mtrr: Return success vs. "failure" from guest_force_mtrr_state()
  x86/kvm: Override low memory above TOLUD to WB when MTRRs are forced
    WB

 arch/x86/include/asm/mtrr.h        |  5 +++--
 arch/x86/kernel/cpu/mtrr/generic.c | 11 +++++++----
 arch/x86/kernel/kvm.c              | 31 ++++++++++++++++++++++++++++--
 3 files changed, 39 insertions(+), 8 deletions(-)


base-commit: fd8c09ad0d87783b9b6a27900d66293be45b7bad
-- 
2.48.1.362.g079036d154-goog
Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
Posted by Nikolay Borisov 7 months ago

On 2/1/25 02:50, Sean Christopherson wrote:
> Attempt to hack around the SNP/TDX guest MTRR disaster by hijacking
> x86_platform.is_untracked_pat_range() to force the legacy PCI hole, i.e.
> memory from TOLUD => 4GiB, as unconditionally writeback.
> 
> TDX in particular has created an impossible situation with MTRRs.  Because
> TDX disallows toggling CR0.CD, TDX enabling decided the easiest solution
> was to ignore MTRRs entirely (because omitting CR0.CD write is obviously
> too simple).
> 
> Unfortunately, under KVM at least, the kernel subtly relies on MTRRs to
> make ACPI play nice with device drivers.  ACPI tries to map ranges it finds
> as WB, which in turn prevents device drivers from mapping device memory as
> WC/UC-.
> 
> For the record, I hate this hack.  But it's the safest approach I can come
> up with.  E.g. forcing ioremap() to always use WB scares me because it's
> possible, however unlikely, that the kernel could try to map non-emulated
> memory (that is presented as MMIO to the guest) as WC/UC-, and silently
> forcing those mappings to WB could do weird things.
> 
> My initial thought was to effectively revert the offending commit and
> skip the cache disabling/enabling, i.e. the problematic CR0.CD toggling,
> but unfortunately OVMF/EDKII has also added code to skip MTRR setup. :-(
> 
> Sean Christopherson (2):
>    x86/mtrr: Return success vs. "failure" from guest_force_mtrr_state()
>    x86/kvm: Override low memory above TOLUD to WB when MTRRs are forced
>      WB
> 
>   arch/x86/include/asm/mtrr.h        |  5 +++--
>   arch/x86/kernel/cpu/mtrr/generic.c | 11 +++++++----
>   arch/x86/kernel/kvm.c              | 31 ++++++++++++++++++++++++++++--
>   3 files changed, 39 insertions(+), 8 deletions(-)
> 
> 
> base-commit: fd8c09ad0d87783b9b6a27900d66293be45b7bad


This prevents TPM from functioning which in turn breaks attestation on 
TDX enabled guests. So what's the status of it?
Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
Posted by Edgecombe, Rick P 1 year ago
On Fri, 2025-01-31 at 16:50 -0800, Sean Christopherson wrote:
> Attempt to hack around the SNP/TDX guest MTRR disaster by hijacking
> x86_platform.is_untracked_pat_range() to force the legacy PCI hole, i.e.
> memory from TOLUD => 4GiB, as unconditionally writeback.
> 
> TDX in particular has created an impossible situation with MTRRs.  Because
> TDX disallows toggling CR0.CD, TDX enabling decided the easiest solution
> was to ignore MTRRs entirely (because omitting CR0.CD write is obviously
> too simple).
> 
> Unfortunately, under KVM at least, the kernel subtly relies on MTRRs to
> make ACPI play nice with device drivers.  ACPI tries to map ranges it finds
> as WB, which in turn prevents device drivers from mapping device memory as
> WC/UC-.
> 
> For the record, I hate this hack.  But it's the safest approach I can come
> up with.  E.g. forcing ioremap() to always use WB scares me because it's
> possible, however unlikely, that the kernel could try to map non-emulated
> memory (that is presented as MMIO to the guest) as WC/UC-, and silently
> forcing those mappings to WB could do weird things.
> 
> My initial thought was to effectively revert the offending commit and
> skip the cache disabling/enabling, i.e. the problematic CR0.CD toggling,
> but unfortunately OVMF/EDKII has also added code to skip MTRR setup. :-(

Oof. The missing context in 8e690b817e38 ("x86/kvm: Override default caching
mode for SEV-SNP and TDX"), is that since it is impossible to virtualize MTRRs
on TDX private memory (in the old way KVM used to do it) and there was no
upstream support yet, there looked like an opportunity to avoid strange "happens
to work" support that normal VMs ended up with. Instead KVM could just not
support TDX KVM MTRRs from the beginning. So part of the thinking was that we
could drop all TDX KVM MTRR hacks. (which I guess turned out to be wrong).

Since there is no upstream KVM TDX support yet, why isn't it an option to still
revert the EDKII commit too? It was a relatively recent change.


To me it seems that the normal KVM MTRR support is not ideal, because it is
still lying about what it is doing. For example, in the past there was an
attempt to use UC to prevent speculative execution accesses to sensitive data.
The KVM MTRR support only happens to work with existing guests, but not all
possible MTRR usages.

Since diverging from the architecture creates loose ends like that, we could
instead define some other way for EDKII to communicate the ranges to the kernel.
Like some simple KVM PV MSRs that are for communication only, and not
overlapping with architecture that expects to cause memory behavior. EDKII and
the kernel could be changed to use them.
Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
Posted by Sean Christopherson 1 year ago
On Mon, Feb 03, 2025, Rick P Edgecombe wrote:
> On Fri, 2025-01-31 at 16:50 -0800, Sean Christopherson wrote:
> > Attempt to hack around the SNP/TDX guest MTRR disaster by hijacking
> > x86_platform.is_untracked_pat_range() to force the legacy PCI hole, i.e.
> > memory from TOLUD => 4GiB, as unconditionally writeback.
> > 
> > TDX in particular has created an impossible situation with MTRRs.  Because
> > TDX disallows toggling CR0.CD, TDX enabling decided the easiest solution
> > was to ignore MTRRs entirely (because omitting CR0.CD write is obviously
> > too simple).
> > 
> > Unfortunately, under KVM at least, the kernel subtly relies on MTRRs to
> > make ACPI play nice with device drivers.  ACPI tries to map ranges it finds
> > as WB, which in turn prevents device drivers from mapping device memory as
> > WC/UC-.
> > 
> > For the record, I hate this hack.  But it's the safest approach I can come
> > up with.  E.g. forcing ioremap() to always use WB scares me because it's
> > possible, however unlikely, that the kernel could try to map non-emulated
> > memory (that is presented as MMIO to the guest) as WC/UC-, and silently
> > forcing those mappings to WB could do weird things.
> > 
> > My initial thought was to effectively revert the offending commit and
> > skip the cache disabling/enabling, i.e. the problematic CR0.CD toggling,
> > but unfortunately OVMF/EDKII has also added code to skip MTRR setup. :-(
> 
> Oof. The missing context in 8e690b817e38 ("x86/kvm: Override default caching
> mode for SEV-SNP and TDX"), is that since it is impossible to virtualize MTRRs
> on TDX private memory (in the old way KVM used to do it) and there was no
> upstream support yet, there looked like an opportunity to avoid strange "happens
> to work" support that normal VMs ended up with. Instead KVM could just not
> support TDX KVM MTRRs from the beginning. So part of the thinking was that we
> could drop all TDX KVM MTRR hacks. (which I guess turned out to be wrong).
> 
> Since there is no upstream KVM TDX support yet, why isn't it an option to still
> revert the EDKII commit too? It was a relatively recent change.

I'm fine with that route too, but it too is a band-aid.  Relying on the *untrusted*
hypervisor to essentially communicate memory maps is not a winning strategy. 

> To me it seems that the normal KVM MTRR support is not ideal, because it is
> still lying about what it is doing. For example, in the past there was an
> attempt to use UC to prevent speculative execution accesses to sensitive data.
> The KVM MTRR support only happens to work with existing guests, but not all
> possible MTRR usages.
> 
> Since diverging from the architecture creates loose ends like that, we could
> instead define some other way for EDKII to communicate the ranges to the kernel.
> Like some simple KVM PV MSRs that are for communication only, and not

Hard "no" to any PV solution.  This isn't KVM specific, and as above, bouncing
through the hypervisor to communicate information within the guest is asinine,
especially for CoCo VMs.

> overlapping with architecture that expects to cause memory behavior. EDKII and
> the kernel could be changed to use them.
Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
Posted by Edgecombe, Rick P 1 year ago
On Mon, 2025-02-03 at 12:33 -0800, Sean Christopherson wrote:
> > Since there is no upstream KVM TDX support yet, why isn't it an option to
> > still
> > revert the EDKII commit too? It was a relatively recent change.
> 
> I'm fine with that route too, but it too is a band-aid.  Relying on the
> *untrusted*
> hypervisor to essentially communicate memory maps is not a winning strategy. 
> 
> > To me it seems that the normal KVM MTRR support is not ideal, because it is
> > still lying about what it is doing. For example, in the past there was an
> > attempt to use UC to prevent speculative execution accesses to sensitive
> > data.
> > The KVM MTRR support only happens to work with existing guests, but not all
> > possible MTRR usages.
> > 
> > Since diverging from the architecture creates loose ends like that, we could
> > instead define some other way for EDKII to communicate the ranges to the
> > kernel.
> > Like some simple KVM PV MSRs that are for communication only, and not
> 
> Hard "no" to any PV solution.  This isn't KVM specific, and as above, bouncing
> through the hypervisor to communicate information within the guest is asinine,
> especially for CoCo VMs.

Hmm, right.

So the other options could be:

1. Some TDX module feature to hold the ranges:
 - Con: Not shared with AMD

2. Re-use MTRRs for the communication, revert changes in guest and edk2:
 - Con: Creating more half support, when it's technically not required
 - Con: Still bouncing through the hypervisor
 - Pro: Design and code is clear

3. Create some new architectural definition, like a bit that means "MTRRs don't
actually work:
 - Con: Takes a long time, need to get agreement
 - Con: Still bouncing through the hypervisor
 - Pro: More pure solution

4. Do this series:
 - Pro: Looks ok to me
 - Cons: As explained in the patches, it's a bit hacky.
 - Cons: Could there be more cases than the legacy PCI hole?


I would kind of like to see something like 3, but 2 or 4 seem the only feasible
ones if we want to resolve this soon.
Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
Posted by Sean Christopherson 1 year ago
On Mon, Feb 03, 2025, Rick P Edgecombe wrote:
> On Mon, 2025-02-03 at 12:33 -0800, Sean Christopherson wrote:
> > > Since there is no upstream KVM TDX support yet, why isn't it an option to
> > > still
> > > revert the EDKII commit too? It was a relatively recent change.
> > 
> > I'm fine with that route too, but it too is a band-aid.  Relying on the
> > *untrusted*
> > hypervisor to essentially communicate memory maps is not a winning strategy. 
> > 
> > > To me it seems that the normal KVM MTRR support is not ideal, because it is
> > > still lying about what it is doing. For example, in the past there was an
> > > attempt to use UC to prevent speculative execution accesses to sensitive
> > > data.
> > > The KVM MTRR support only happens to work with existing guests, but not all
> > > possible MTRR usages.
> > > 
> > > Since diverging from the architecture creates loose ends like that, we could
> > > instead define some other way for EDKII to communicate the ranges to the
> > > kernel.
> > > Like some simple KVM PV MSRs that are for communication only, and not
> > 
> > Hard "no" to any PV solution.  This isn't KVM specific, and as above, bouncing
> > through the hypervisor to communicate information within the guest is asinine,
> > especially for CoCo VMs.
> 
> Hmm, right.
> 
> So the other options could be:
> 
> 1. Some TDX module feature to hold the ranges:
>  - Con: Not shared with AMD
> 
> 2. Re-use MTRRs for the communication, revert changes in guest and edk2:

Thinking more about how EDK2 is consumed downstream, I think reverting the EDK2
changes is necessary regardless of what happens in the kernel.  Or at the least,
somehow communicate to EDK2 users that ingesting those changes is a bad idea
unless the kernel has also been updated.

AFAIK, Bring Your Own Firmware[*] isn't widely adopted, which means that the CSP
is shipping the firmware.  And shipping OVMF/EDK2 with the "ignores MTRRs" code
will cause problems for guests without commit 8e690b817e38 ("x86/kvm: Override
default caching mode for SEV-SNP and TDX").  Since the host doesn't control the
guest kernel, there's no way to know if deploying those EDK2 changes is safe.
 
[*] https://kvm-forum.qemu.org/2024/BYOF_-_KVM_Forum_2024_iWTioIP.pdf

>  - Con: Creating more half support, when it's technically not required
>  - Con: Still bouncing through the hypervisor

I assume by "Re-use MTRRs for the communication" you also mean updating the guest
to address the "everything is UC!" flaw, otherwise another con is:

   - Con: Doesn't address the performance issue with TDX guests "using" UC
          memory by default (unless there's yet more enabled).

Presumably that can be accomplished by simply skipping the CR0.CD toggling, and
doing MTRR stuff as nonrmal?

>  - Pro: Design and code is clear
>
> 3. Create some new architectural definition, like a bit that means "MTRRs don't
> actually work:
>  - Con: Takes a long time, need to get agreement
>  - Con: Still bouncing through the hypervisor

Not for KVM guests.  As I laid out in my bug report, it's safe to assume MTRRs
don't actually affect the memory type when running under KVM.

FWIW, PAT doesn't "work" on most KVM Intel setups either, because of misguided
KVM code that resulted in "Ignore Guest PAT" being set in all EPTEs for the
overwhelming majority of guests.  That's not desirable long term because it
prevents the guest from using WC (via PAT) in situations where doing so is needed
for performance and/or correctness.

>  - Pro: More pure solution

MTRRs "not working" is a red herring.  The problem isn't that MTRRs don't work,
it's that the kernel is (somewhat unknowingly) using MTRRs as a crutch to get the
desired memtype for devices.  E.g. for emulated MMIO, MTRRs _can't_ be virtualized,
because there's never a valid mapping, i.e. there is no physical memory and thus
no memtype.  In other words, under KVM guests (and possibly other hypervisors),
MTRRs end up being nothing more than a communication channel between guest firmware
and the kernel.

The gap for CoCo VMs is that using MTRRs is undesirable because they are controlled
by the untrusted host.  But that's largely a future problem, unless someone has a
clever way to fix the kernel mess.

> 4. Do this series:
>  - Pro: Looks ok to me
>  - Cons: As explained in the patches, it's a bit hacky.
>  - Cons: Could there be more cases than the legacy PCI hole?
> 
> I would kind of like to see something like 3, but 2 or 4 seem the only feasible
> ones if we want to resolve this soon.
Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
Posted by Binbin Wu 12 months ago

On 2/4/2025 8:27 AM, Sean Christopherson wrote:
> On Mon, Feb 03, 2025, Rick P Edgecombe wrote:
>> On Mon, 2025-02-03 at 12:33 -0800, Sean Christopherson wrote:
>>>> Since there is no upstream KVM TDX support yet, why isn't it an option to
>>>> still
>>>> revert the EDKII commit too? It was a relatively recent change.
>>> I'm fine with that route too, but it too is a band-aid.  Relying on the
>>> *untrusted*
>>> hypervisor to essentially communicate memory maps is not a winning strategy.
>>>
>>>> To me it seems that the normal KVM MTRR support is not ideal, because it is
>>>> still lying about what it is doing. For example, in the past there was an
>>>> attempt to use UC to prevent speculative execution accesses to sensitive
>>>> data.
>>>> The KVM MTRR support only happens to work with existing guests, but not all
>>>> possible MTRR usages.
>>>>
>>>> Since diverging from the architecture creates loose ends like that, we could
>>>> instead define some other way for EDKII to communicate the ranges to the
>>>> kernel.
>>>> Like some simple KVM PV MSRs that are for communication only, and not
>>> Hard "no" to any PV solution.  This isn't KVM specific, and as above, bouncing
>>> through the hypervisor to communicate information within the guest is asinine,
>>> especially for CoCo VMs.
>> Hmm, right.
>>
>> So the other options could be:
>>
>> 1. Some TDX module feature to hold the ranges:
>>   - Con: Not shared with AMD
>>
>> 2. Re-use MTRRs for the communication, revert changes in guest and edk2:
> Thinking more about how EDK2 is consumed downstream, I think reverting the EDK2
> changes is necessary regardless of what happens in the kernel.
IIUC, 071d2cfab8 ("OvmfPkg/Sec: Skip setup MTRR early in TD-Guest") was added
to avoid changing the setting of MTRR, which will trigger #VE by setting
CR0.CD=1.

And recently, 3a3b12cbda ("UefiCpuPkg/MtrrLib: MtrrLibIsMtrrSupported always
return FALSE in TD-Guest") was added to avoid touching MTRR MSRs at all,
so that the MTRR MSRs support for TDX guests was dropped as described in
[PATCH 00/18] KVM: TDX: TDX "the rest" part [1].

If we want to revert the two commits, we need to:
1. Make sure that OVMF will not set CR0.CD to 1 for TDX guests, probably
    needs some kind of hack in OVMF.

2. Need to bring back the support of MTRR MSRs in KVM for TDX guests.
    TDX KVM basic support patch set v19 and earlier versions enforce default
    MTRR type as WB and disabled fixed/variable MTRR (by reporting MSR_MTRRcap
    as 0) for TDX guests, which was kind of half support and needed
    "clearcpuid=mtrr" to avoid toggling of CR0.CD.

    If we really want to rely on the OVMF to program the MTRR values, maybe we
    can treat MTRR MSRs for TDX guests as normal VMs, i.e., allow guests to
    read/write the values without any further virtualization.
    Of course, it needs to prompt the guest kernel to skip toggling CR0.CD for
    TDX guests.

[1] https://lore.kernel.org/kvm/20241210004946.3718496-1-binbin.wu@linux.intel.com


>    Or at the least,
> somehow communicate to EDK2 users that ingesting those changes is a bad idea
> unless the kernel has also been updated.
>
> AFAIK, Bring Your Own Firmware[*] isn't widely adopted, which means that the CSP
> is shipping the firmware.  And shipping OVMF/EDK2 with the "ignores MTRRs" code
> will cause problems for guests without commit 8e690b817e38 ("x86/kvm: Override
> default caching mode for SEV-SNP and TDX").  Since the host doesn't control the
> guest kernel, there's no way to know if deploying those EDK2 changes is safe.

Oh, yea.

And if we drop the MTRR MSRs access in KVM for TDX guests, an old guest kernel
without commit 8e690b817e38 ("x86/kvm: Override default caching mode for
SEV-SNP and TDX") will require "clearcpuid=mtrr". :(

>   
> [*] https://kvm-forum.qemu.org/2024/BYOF_-_KVM_Forum_2024_iWTioIP.pdf
>
>>   - Con: Creating more half support, when it's technically not required
>>   - Con: Still bouncing through the hypervisor
> I assume by "Re-use MTRRs for the communication" you also mean updating the guest
> to address the "everything is UC!" flaw, otherwise another con is:
>
>     - Con: Doesn't address the performance issue with TDX guests "using" UC
>            memory by default (unless there's yet more enabled).
>
> Presumably that can be accomplished by simply skipping the CR0.CD toggling, and
> doing MTRR stuff as nonrmal?
>
>>   - Pro: Design and code is clear
>>
>> 3. Create some new architectural definition, like a bit that means "MTRRs don't
>> actually work:
>>   - Con: Takes a long time, need to get agreement
>>   - Con: Still bouncing through the hypervisor
> Not for KVM guests.  As I laid out in my bug report, it's safe to assume MTRRs
> don't actually affect the memory type when running under KVM.
>
> FWIW, PAT doesn't "work" on most KVM Intel setups either, because of misguided
> KVM code that resulted in "Ignore Guest PAT" being set in all EPTEs for the
> overwhelming majority of guests.  That's not desirable long term because it
> prevents the guest from using WC (via PAT) in situations where doing so is needed
> for performance and/or correctness.
>
>>   - Pro: More pure solution
> MTRRs "not working" is a red herring.  The problem isn't that MTRRs don't work,
> it's that the kernel is (somewhat unknowingly) using MTRRs as a crutch to get the
> desired memtype for devices.  E.g. for emulated MMIO, MTRRs _can't_ be virtualized,
> because there's never a valid mapping, i.e. there is no physical memory and thus
> no memtype.  In other words, under KVM guests (and possibly other hypervisors),
> MTRRs end up being nothing more than a communication channel between guest firmware
> and the kernel.
>
> The gap for CoCo VMs is that using MTRRs is undesirable because they are controlled
> by the untrusted host.  But that's largely a future problem, unless someone has a
> clever way to fix the kernel mess.
>
>> 4. Do this series:
>>   - Pro: Looks ok to me
It looks OK to me too.
But as mentioned above, mismatch of OVMF, guest kernel, host kernel
version will cause problem.

>>   - Cons: As explained in the patches, it's a bit hacky.
Skipping toggling CR0.CD in guest kernel seems also a bit hacky.

>>   - Cons: Could there be more cases than the legacy PCI hole?
>>
>> I would kind of like to see something like 3, but 2 or 4 seem the only feasible
>> ones if we want to resolve this soon.

Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
Posted by Edgecombe, Rick P 1 year ago
+Min, can you comment?

3a3b12cbda ("UefiCpuPkg/MtrrLib: MtrrLibIsMtrrSupported always return FALSE in
TD-Guest") turned out to be problematic in practice.

Full thread:
https://lore.kernel.org/kvm/20250201005048.657470-1-seanjc@google.com/

On Mon, 2025-02-03 at 16:27 -0800, Sean Christopherson wrote:
> On Mon, Feb 03, 2025, Rick P Edgecombe wrote:
> > On Mon, 2025-02-03 at 12:33 -0800, Sean Christopherson wrote:
> > > > Since there is no upstream KVM TDX support yet, why isn't it an option to
> > > > still
> > > > revert the EDKII commit too? It was a relatively recent change.
> > > 
> > > I'm fine with that route too, but it too is a band-aid.  Relying on the
> > > *untrusted*
> > > hypervisor to essentially communicate memory maps is not a winning strategy. 
> > > 
> > > > To me it seems that the normal KVM MTRR support is not ideal, because it is
> > > > still lying about what it is doing. For example, in the past there was an
> > > > attempt to use UC to prevent speculative execution accesses to sensitive
> > > > data.
> > > > The KVM MTRR support only happens to work with existing guests, but not all
> > > > possible MTRR usages.
> > > > 
> > > > Since diverging from the architecture creates loose ends like that, we could
> > > > instead define some other way for EDKII to communicate the ranges to the
> > > > kernel.
> > > > Like some simple KVM PV MSRs that are for communication only, and not
> > > 
> > > Hard "no" to any PV solution.  This isn't KVM specific, and as above, bouncing
> > > through the hypervisor to communicate information within the guest is asinine,
> > > especially for CoCo VMs.
> > 
> > Hmm, right.
> > 
> > So the other options could be:
> > 
> > 1. Some TDX module feature to hold the ranges:
> >  - Con: Not shared with AMD
> > 
> > 2. Re-use MTRRs for the communication, revert changes in guest and edk2:
> 
> Thinking more about how EDK2 is consumed downstream, I think reverting the EDK2
> changes is necessary regardless of what happens in the kernel.  Or at the least,
> somehow communicate to EDK2 users that ingesting those changes is a bad idea
> unless the kernel has also been updated.
> 
> AFAIK, Bring Your Own Firmware[*] isn't widely adopted, which means that the CSP
> is shipping the firmware.  And shipping OVMF/EDK2 with the "ignores MTRRs" code
> will cause problems for guests without commit 8e690b817e38 ("x86/kvm: Override
> default caching mode for SEV-SNP and TDX").  Since the host doesn't control the
> guest kernel, there's no way to know if deploying those EDK2 changes is safe.
>  
> [*] https://kvm-forum.qemu.org/2024/BYOF_-_KVM_Forum_2024_iWTioIP.pdf
> 

Hmm. Since there is no upstream TDX KVM support, for it's part, I guess KVM
should still get a chance to define a cleaner solution (if there actually was a
cleaner solution). But yea, it would mean only components from after the
solution was settled could be used together for a fully working stack. And
it should probably be called out somehow. Maybe could be in the KVM TDX docs or
something.

Still seems like a thing to avoid if possible.

> >  - Con: Creating more half support, when it's technically not required
> >  - Con: Still bouncing through the hypervisor
> 
> I assume by "Re-use MTRRs for the communication" you also mean updating the guest
> to address the "everything is UC!" flaw, otherwise another con is:
> 
>    - Con: Doesn't address the performance issue with TDX guests "using" UC
>           memory by default (unless there's yet more enabled).

Hmm. This is quite the tangled corner.

> 
> Presumably that can be accomplished by simply skipping the CR0.CD toggling, and
> doing MTRR stuff as nonrmal?

I'll have to get back to you on this one. Kirill probably could give a better
answer, but likely will not be able to follow up on this thread until next week.

> 
> >  - Pro: Design and code is clear
> > 
> > 3. Create some new architectural definition, like a bit that means "MTRRs don't
> > actually work:
> >  - Con: Takes a long time, need to get agreement
> >  - Con: Still bouncing through the hypervisor
> 
> Not for KVM guests.  As I laid out in my bug report, it's safe to assume MTRRs
> don't actually affect the memory type when running under KVM.
> 
> FWIW, PAT doesn't "work" on most KVM Intel setups either, because of misguided
> KVM code that resulted in "Ignore Guest PAT" being set in all EPTEs for the
> overwhelming majority of guests.  That's not desirable long term because it
> prevents the guest from using WC (via PAT) in situations where doing so is needed
> for performance and/or correctness.
> 
> >  - Pro: More pure solution
> 
> MTRRs "not working" is a red herring.  The problem isn't that MTRRs don't work,
> it's that the kernel is (somewhat unknowingly) using MTRRs as a crutch to get the
> desired memtype for devices.  E.g. for emulated MMIO, MTRRs _can't_ be virtualized,
> because there's never a valid mapping, i.e. there is no physical memory and thus
> no memtype.  In other words, under KVM guests (and possibly other hypervisors),
> MTRRs end up being nothing more than a communication channel between guest firmware
> and the kernel.

Yea.

> 
> The gap for CoCo VMs is that using MTRRs is undesirable because they are controlled
> by the untrusted host.  But that's largely a future problem, unless someone has a
> clever way to fix the kernel mess.
> 
> 

Yea, I wondered about that too. I imagine the thinking was that since it is only
controlling shared memory, it can be untrusted.

And I guess the solution in this patchset is hypothetically a bit more locked
down in that respect.
RE: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
Posted by Xu, Min M 1 year ago
On Wednesday, February 5, 2025 11:51 AM, Edgecombe, Rick P wrote:
> 
> +Min, can you comment?
> 
> 3a3b12cbda ("UefiCpuPkg/MtrrLib: MtrrLibIsMtrrSupported always return FALSE
> in
> TD-Guest") turned out to be problematic in practice.
> 
As the commit(3a3b12cbda) message mentioned that "For Linux kernel, there is a mechanism called SW defined MTRR introduced  by the patch https://lore.kernel.org/all/20230502120931.20719-4-jgross@suse.com/". 
From the discussion in below thread it seems the patch (SW defined MTR in kernel) has not been introduced into kernel master branch, right?
We need some time to investigate it and will give an update here. 
> Full thread:
> https://lore.kernel.org/kvm/20250201005048.657470-1-seanjc@google.com/
> 

Thanks
Min
Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
Posted by Dionna Amalie Glaze 1 year ago
On Fri, Jan 31, 2025 at 4:50 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Attempt to hack around the SNP/TDX guest MTRR disaster by hijacking
> x86_platform.is_untracked_pat_range() to force the legacy PCI hole, i.e.
> memory from TOLUD => 4GiB, as unconditionally writeback.
>
> TDX in particular has created an impossible situation with MTRRs.  Because
> TDX disallows toggling CR0.CD, TDX enabling decided the easiest solution
> was to ignore MTRRs entirely (because omitting CR0.CD write is obviously
> too simple).
>
> Unfortunately, under KVM at least, the kernel subtly relies on MTRRs to
> make ACPI play nice with device drivers.  ACPI tries to map ranges it finds
> as WB, which in turn prevents device drivers from mapping device memory as
> WC/UC-.
>
> For the record, I hate this hack.  But it's the safest approach I can come
> up with.  E.g. forcing ioremap() to always use WB scares me because it's
> possible, however unlikely, that the kernel could try to map non-emulated
> memory (that is presented as MMIO to the guest) as WC/UC-, and silently
> forcing those mappings to WB could do weird things.
>
> My initial thought was to effectively revert the offending commit and
> skip the cache disabling/enabling, i.e. the problematic CR0.CD toggling,
> but unfortunately OVMF/EDKII has also added code to skip MTRR setup. :-(
>

EDK2 has a bug tracker. Maybe this is still fixable on Intel's end.
Adding Qinglan, Isaku, and Min to comment.

> Sean Christopherson (2):
>   x86/mtrr: Return success vs. "failure" from guest_force_mtrr_state()
>   x86/kvm: Override low memory above TOLUD to WB when MTRRs are forced
>     WB
>
>  arch/x86/include/asm/mtrr.h        |  5 +++--
>  arch/x86/kernel/cpu/mtrr/generic.c | 11 +++++++----
>  arch/x86/kernel/kvm.c              | 31 ++++++++++++++++++++++++++++--
>  3 files changed, 39 insertions(+), 8 deletions(-)
>
>
> base-commit: fd8c09ad0d87783b9b6a27900d66293be45b7bad
> --
> 2.48.1.362.g079036d154-goog
>


-- 
-Dionna Glaze, PhD, CISSP, CCSP (she/her)