We can't make correctness of our own behavior dependent upon a
hypervisor underneath us correctly telling us the true physical address
width hardware uses. Without knowing this, we can't be certain reserved
bit faults can actually be observed. Therefore, besides evaluating the
number of address bits when deciding whether to use the optimization,
also check whether we're running virtualized ourselves.
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
--- a/xen/arch/x86/mm/shadow/types.h
+++ b/xen/arch/x86/mm/shadow/types.h
@@ -282,10 +282,16 @@ shadow_put_page_from_l1e(shadow_l1e_t sl
*
* This is only feasible for PAE and 64bit Xen: 32-bit non-PAE PTEs don't
* have reserved bits that we can use for this. And even there it can only
- * be used if the processor doesn't use all 52 address bits.
+ * be used if we can be certain the processor doesn't use all 52 address bits.
*/
#define SH_L1E_MAGIC 0xffffffff00000001ULL
+
+static inline bool sh_have_pte_rsvd_bits(void)
+{
+ return paddr_bits < PADDR_BITS && !cpu_has_hypervisor;
+}
+
static inline bool sh_l1e_is_magic(shadow_l1e_t sl1e)
{
return (sl1e.l1 & SH_L1E_MAGIC) == SH_L1E_MAGIC;
@@ -303,7 +309,7 @@ static inline shadow_l1e_t sh_l1e_gnp(vo
* On systems with no reserved physical address bits we can't engage the
* fast fault path.
*/
- return paddr_bits < PADDR_BITS ? sh_l1e_gnp_raw()
+ return sh_have_pte_rsvd_bits() ? sh_l1e_gnp_raw()
: shadow_l1e_empty();
}
@@ -326,7 +332,7 @@ static inline shadow_l1e_t sh_l1e_mmio(g
{
unsigned long gfn_val = MASK_INSR(gfn_x(gfn), SH_L1E_MMIO_GFN_MASK);
- if ( paddr_bits >= PADDR_BITS ||
+ if ( !sh_have_pte_rsvd_bits() ||
gfn_x(gfn) != MASK_EXTR(gfn_val, SH_L1E_MMIO_GFN_MASK) )
return shadow_l1e_empty();
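
[For readers less familiar with the CPUID leaves involved, the two
conditions the new sh_have_pte_rsvd_bits() helper combines can be probed
from a standalone program as sketched below. This is illustration only,
not Xen code: it assumes a GCC/Clang toolchain on x86-64, and relies on
CPUID leaf 0x80000008 EAX[7:0] reporting the physical address width and
CPUID leaf 1 ECX bit 31 being the hypervisor-present flag.]

/*
 * Standalone illustration (not part of the patch): probe the same two
 * conditions sh_have_pte_rsvd_bits() combines.  Build with
 * "gcc -O2 probe.c" on an x86-64 host.
 */
#include <cpuid.h>
#include <stdbool.h>
#include <stdio.h>

#define PADDR_BITS 52 /* architectural maximum physical address width */

static bool have_usable_pte_rsvd_bits(void)
{
    unsigned int eax, ebx, ecx, edx;
    unsigned int paddr_bits = 36; /* conservative default if the leaf is absent */
    bool virtualized = false;

    /* CPUID leaf 0x80000008, EAX[7:0]: physical address width in bits. */
    if ( __get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx) )
        paddr_bits = eax & 0xff;

    /* CPUID leaf 1, ECX bit 31: set when running under a hypervisor. */
    if ( __get_cpuid(1, &eax, &ebx, &ecx, &edx) )
        virtualized = ecx & (1u << 31);

    /* Mirrors the patch: the reserved-bit trick is only usable when both hold. */
    return paddr_bits < PADDR_BITS && !virtualized;
}

int main(void)
{
    printf("fast fault path usable: %s\n",
           have_usable_pte_rsvd_bits() ? "yes" : "no");
    return 0;
}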
On 05/03/2021 15:37, Jan Beulich wrote:
> We can't make correctness of our own behavior dependent upon a
> hypervisor underneath us correctly telling us the true physical address
> width hardware uses. Without knowing this, we can't be certain reserved
> bit faults can actually be observed. Therefore, besides evaluating the
> number of address bits when deciding whether to use the optimization,
> also check whether we're running virtualized ourselves.

I think it would be helpful to point out why we can't even probe at boot
- the behaviour may genuinely change as we migrate, and if we ever end
up on an IceLake system levelled down for compatibility with older CPUs,
the "paddr_bits < PADDR_BITS" check will malfunction in an unsafe way.

> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

This wants backporting to stable releases, so I would recommend for 4.15
even at this point.

~Andrew
On 05.03.2021 16:47, Andrew Cooper wrote:
> On 05/03/2021 15:37, Jan Beulich wrote:
>> We can't make correctness of our own behavior dependent upon a
>> hypervisor underneath us correctly telling us the true physical address
>> width hardware uses. Without knowing this, we can't be certain reserved
>> bit faults can actually be observed. Therefore, besides evaluating the
>> number of address bits when deciding whether to use the optimization,
>> also check whether we're running virtualized ourselves.
>
> I think it would be helpful to point out why we can't even probe at boot
> - the behaviour may genuinely change as we migrate, and if we ever end
> up on an IceLake system levelled down for compatibility with older CPUs,
> the "paddr_bits < PADDR_BITS" check will malfunction in an unsafe way.

I've added a sentence to this effect.

>> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

Thanks.

Jan
Andrew Cooper writes ("Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized"):
> This wants backporting to stable releases, so I would recommend for 4.15
> even at this point.
Can someone explain to me the implications of not taking these patches,
and the risks of taking them ?
AFAICT the implications of not taking 1/ are that we would misbehave
in a security relevant way, sometimes, when we are running under
another hypervisor ?
And the implications of not taking 2/ is a performance problem ?
As to the risks, 1/ looks obviously correct even to me.
2/ seems complex. What would go wrong if there were a misplaced ) or
confused bit-twiddling or something ?
Ian.
On 05/03/2021 16:40, Ian Jackson wrote:
> Andrew Cooper writes ("Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized"):
>> This wants backporting to stable releases, so I would recommend for 4.15
>> even at this point.
> Can someone explain to me the implications of not taking these patches,
> and the risks of taking them ?
>
> AFAICT the implications of not taking 1/ are that we would misbehave
> in a security relevant way, sometimes, when we are running under
> another hypervisor ?
Correct. Specifically if you've got a migration pool containing an
IceLake server and something older.
> And the implications of not taking 2/ is a performance problem ?
Correct (I believe).
> As to the risks, 1/ looks obviously correct even to me.
I agree, although Tim has the deciding maintainer vote.
> 2/ seems complex. What would go wrong if there were a misplaced ) or
> confused bit-twiddling or something ?
The bit twiddling can be independently checked by disassembling the binary.
However, I have some concerns with the patch as-is, in relation to L1TF
/ XSA-273.
~Andrew
Andrew Cooper writes ("Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized"):
> On 05/03/2021 16:40, Ian Jackson wrote:
> > Andrew Cooper writes ("Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized"):
> >> This wants backporting to stable releases, so I would recommend for 4.15
> >> even at this point.
> > Can someone explain to me the implications of not taking these patches,
> > and the risks of taking them ?
> >
> > AFAICT the implications of not taking 1/ are that we would misbehave
> > in a security relevant way, sometimes, when we are running under
> > another hypervisor ?
>
> Correct. Specifically if you've got a migration pool containing an
> IceLake server and something older.
>
> > As to the risks, 1/ looks obviously correct even to me.
>
> I agree, although Tim has the deciding maintainer vote.
Right, well, for patch 1 then
Release-Acked-by: Ian Jackson <iwj@xenproject.org>
> > And the implications of not taking 2/ is a performance problem ?
>
> Correct (I believe).
>
> > 2/ seems complex. What would go wrong if there were a misplaced ) or
> > confused bit-twiddling or something ?
>
> The bit twiddling can be independently checked by disassembling the binary.
>
> However, I have some concerns with the patch as-is, in relation to L1TF
> / XSA-273.
I'm going to hold off on this for now. I think to give it a
release-ack I would want someone to argue the case. Concerns would
include Andy's comments (which I saw earlier but do not fully
understand) and me wanting to know (i) how bad is the perf impact
without it (ii) how has this bit-twiddling been checked.
I hope that makes sense.
Thanks,
Ian.
At 16:37 +0100 on 05 Mar (1614962224), Jan Beulich wrote:
> We can't make correctness of our own behavior dependent upon a
> hypervisor underneath us correctly telling us the true physical address
> width hardware uses. Without knowing this, we can't be certain reserved
> bit faults can actually be observed. Therefore, besides evaluating the
> number of address bits when deciding whether to use the optimization,
> also check whether we're running virtualized ourselves.
>
> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Tim Deegan <tim@xen.org>

I would consider this to be a bug in the underlying hypervisor, but I
agree that in practice it won't be safe to rely on it being correct.

These checks are getting fiddly now. I think that if we end up adding
any more to them it might be good to set a read-mostly variable at boot
time rather than do them on every MMIO/NP fault.

Cheers,

Tim.
On 08.03.2021 10:25, Tim Deegan wrote:
> At 16:37 +0100 on 05 Mar (1614962224), Jan Beulich wrote:
>> We can't make correctness of our own behavior dependent upon a
>> hypervisor underneath us correctly telling us the true physical address
>> width hardware uses. Without knowing this, we can't be certain reserved
>> bit faults can actually be observed. Therefore, besides evaluating the
>> number of address bits when deciding whether to use the optimization,
>> also check whether we're running virtualized ourselves.
>>
>> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> Acked-by: Tim Deegan <tim@xen.org>

Thanks.

> I would consider this to be a bug in the underlying hypervisor, but I
> agree that in practice it won't be safe to rely on it being correct.

Suffice it to say that I don't think we present a correct value to our
guests. Plus, as said elsewhere, what would you suggest to hand to the
guest in case it may need migrating (to a host with a different number
of PA bits)?

> These checks are getting fiddly now. I think that if we end up adding
> any more to them it might be good to set a read-mostly variable at boot
> time rather than do them on every MMIO/NP fault.

Maybe, but I'd like to point out that the fault path uses only the
sh_l1e_is_*() functions (plus sh_l1e_mmio_get_gfn()), and hence isn't
affected by the added fiddly-ness.

Jan
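
[For concreteness, Tim's boot-time caching idea might look roughly like
the sketch below. The names are illustrative only and not taken from the
Xen tree; it assumes the existing paddr_bits, PADDR_BITS and
cpu_has_hypervisor symbols.]

/* Hypothetical sketch of caching the check at boot, per Tim's suggestion. */
static bool __read_mostly sh_pte_rsvd_bits_usable;

static void __init sh_init_pte_rsvd_bits(void)
{
    /* Evaluate once at boot instead of on every call. */
    sh_pte_rsvd_bits_usable = paddr_bits < PADDR_BITS && !cpu_has_hypervisor;
}

static inline bool sh_have_pte_rsvd_bits(void)
{
    return sh_pte_rsvd_bits_usable;
}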
On 08/03/2021 09:25, Tim Deegan wrote:
> At 16:37 +0100 on 05 Mar (1614962224), Jan Beulich wrote:
>> We can't make correctness of our own behavior dependent upon a
>> hypervisor underneath us correctly telling us the true physical address
>> width hardware uses. Without knowing this, we can't be certain reserved
>> bit faults can actually be observed. Therefore, besides evaluating the
>> number of address bits when deciding whether to use the optimization,
>> also check whether we're running virtualized ourselves.
>>
>> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Acked-by: Tim Deegan <tim@xen.org>
>
> I would consider this to be a bug in the underlying hypervisor, but I
> agree that in practice it won't be safe to rely on it being correct.

I'd argue against this being a hypervisor bug. If anything, it is a
weakness in how x86 virtualisation works.

For booting on a single host, then yes - vMAXPHYSADDR really ought to be
the same as MAXPHYSADDR, and is what happens in the common case.

For booting in a heterogeneous pool, the only safe value is the min of
MAXPHYSADDR across the resource pool. Anything higher, and the VM will
malfunction (get #PF[rsvd] for apparently-legal PTEs) on the smallest
pool member(s).

Address widths vary greatly between generations and SKUs, so blocking
migrate on a MAXPHYSADDR mismatch isn't a viable option. VM migration
works in practice because native kernels don't tend to use reserved bit
optimisations in the first place.

The fault lies with Xen. We're using a property of reserved bit
behaviour which was always going to change eventually, and can't be
levelled in common heterogeneous scenarios.

~Andrew
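
[A toolstack levelling the virtual address width down to the pool
minimum, as described above, would amount to little more than the sketch
below. The helper and its callers are hypothetical, not libxl code.]

#include <stddef.h>

/* Hypothetical helper: pick the lowest vMAXPHYSADDR present in a pool. */
static unsigned int level_vmaxphysaddr(const unsigned int *host_paddr_bits,
                                       size_t nr_hosts)
{
    unsigned int min_bits = 52; /* architectural maximum */

    for ( size_t i = 0; i < nr_hosts; i++ )
        if ( host_paddr_bits[i] < min_bits )
            min_bits = host_paddr_bits[i];

    return min_bits;
}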
On 08.03.2021 14:47, Andrew Cooper wrote:
> On 08/03/2021 09:25, Tim Deegan wrote:
>> At 16:37 +0100 on 05 Mar (1614962224), Jan Beulich wrote:
>>> We can't make correctness of our own behavior dependent upon a
>>> hypervisor underneath us correctly telling us the true physical address
>>> width hardware uses. Without knowing this, we can't be certain reserved
>>> bit faults can actually be observed. Therefore, besides evaluating the
>>> number of address bits when deciding whether to use the optimization,
>>> also check whether we're running virtualized ourselves.
>>>
>>> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Acked-by: Tim Deegan <tim@xen.org>
>>
>> I would consider this to be a bug in the underlying hypervisor, but I
>> agree that in practice it won't be safe to rely on it being correct.
>
> I'd argue against this being a hypervisor bug. If anything, it is a
> weakness in how x86 virtualisation works.
>
> For booting on a single host, then yes - vMAXPHYSADDR really ought to be
> the same as MAXPHYSADDR, and is what happens in the common case.
>
> For booting in a heterogeneous pool, the only safe value is the min of
> MAXPHYSADDR across the resource pool. Anything higher, and the VM will
> malfunction (get #PF[rsvd] for apparently-legal PTEs) on the smallest
> pool member(s).

Except that min isn't safe either - the guest may then expect reserved
bit faults where none surface.

Jan
On 08/03/2021 13:51, Jan Beulich wrote:
> On 08.03.2021 14:47, Andrew Cooper wrote:
>> On 08/03/2021 09:25, Tim Deegan wrote:
>>> At 16:37 +0100 on 05 Mar (1614962224), Jan Beulich wrote:
>>>> We can't make correctness of our own behavior dependent upon a
>>>> hypervisor underneath us correctly telling us the true physical address
>>>> width hardware uses. Without knowing this, we can't be certain reserved
>>>> bit faults can actually be observed. Therefore, besides evaluating the
>>>> number of address bits when deciding whether to use the optimization,
>>>> also check whether we're running virtualized ourselves.
>>>>
>>>> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> Acked-by: Tim Deegan <tim@xen.org>
>>>
>>> I would consider this to be a bug in the underlying hypervisor, but I
>>> agree that in practice it won't be safe to rely on it being correct.
>> I'd argue against this being a hypervisor bug. If anything, it is a
>> weakness in how x86 virtualisation works.
>>
>> For booting on a single host, then yes - vMAXPHYSADDR really ought to be
>> the same as MAXPHYSADDR, and is what happens in the common case.
>>
>> For booting in a heterogeneous pool, the only safe value is the min of
>> MAXPHYSADDR across the resource pool. Anything higher, and the VM will
>> malfunction (get #PF[rsvd] for apparently-legal PTEs) on the smallest
>> pool member(s).
> Except that min isn't safe either - the guest may then expect reserved
> bit faults where none surface.

Such a guest is buggy, and in clear violation of the rules set out in
the architecture. All reserved behaviour is subject to change in the
future.

Any software (Xen included) deliberately choosing to depend on the
specifics of reserved behaviour gets to keep all resulting pieces.

~Andrew
On 08.03.2021 14:59, Andrew Cooper wrote:
> On 08/03/2021 13:51, Jan Beulich wrote:
>> On 08.03.2021 14:47, Andrew Cooper wrote:
>>> On 08/03/2021 09:25, Tim Deegan wrote:
>>>> At 16:37 +0100 on 05 Mar (1614962224), Jan Beulich wrote:
>>>>> We can't make correctness of our own behavior dependent upon a
>>>>> hypervisor underneath us correctly telling us the true physical address
>>>>> width hardware uses. Without knowing this, we can't be certain reserved
>>>>> bit faults can actually be observed. Therefore, besides evaluating the
>>>>> number of address bits when deciding whether to use the optimization,
>>>>> also check whether we're running virtualized ourselves.
>>>>>
>>>>> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> Acked-by: Tim Deegan <tim@xen.org>
>>>>
>>>> I would consider this to be a bug in the underlying hypervisor, but I
>>>> agree that in practice it won't be safe to rely on it being correct.
>>> I'd argue against this being a hypervisor bug. If anything, it is a
>>> weakness in how x86 virtualisation works.
>>>
>>> For booting on a single host, then yes - vMAXPHYSADDR really ought to be
>>> the same as MAXPHYSADDR, and is what happens in the common case.
>>>
>>> For booting in a heterogeneous pool, the only safe value is the min of
>>> MAXPHYSADDR across the resource pool. Anything higher, and the VM will
>>> malfunction (get #PF[rsvd] for apparently-legal PTEs) on the smallest
>>> pool member(s).
>> Except that min isn't safe either - the guest may then expect reserved
>> bit faults where none surface.
>
> Such a guest is buggy, and in clear violation of the rules set out in
> the architecture. All reserved behaviour is subject to change in the
> future.
>
> Any software (Xen included) deliberately choosing to depend on the
> specifics of reserved behaviour gets to keep all resulting pieces.

While I could understand what you're saying when considering our
prior unconditional reliance on getting reserved bit faults, are you
suggesting the recently adjusted behavior is still "in clear
violation of the rules set out in the architecture"? And hence are
you suggesting we should have outright dropped that optimization?

Jan
On 08/03/2021 14:29, Jan Beulich wrote:
> On 08.03.2021 14:59, Andrew Cooper wrote:
>> On 08/03/2021 13:51, Jan Beulich wrote:
>>> On 08.03.2021 14:47, Andrew Cooper wrote:
>>>> On 08/03/2021 09:25, Tim Deegan wrote:
>>>>> At 16:37 +0100 on 05 Mar (1614962224), Jan Beulich wrote:
>>>>>> We can't make correctness of our own behavior dependent upon a
>>>>>> hypervisor underneath us correctly telling us the true physical address
>>>>>> width hardware uses. Without knowing this, we can't be certain reserved
>>>>>> bit faults can actually be observed. Therefore, besides evaluating the
>>>>>> number of address bits when deciding whether to use the optimization,
>>>>>> also check whether we're running virtualized ourselves.
>>>>>>
>>>>>> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>>> Acked-by: Tim Deegan <tim@xen.org>
>>>>>
>>>>> I would consider this to be a bug in the underlying hypervisor, but I
>>>>> agree that in practice it won't be safe to rely on it being correct.
>>>> I'd argue against this being a hypervisor bug. If anything, it is a
>>>> weakness in how x86 virtualisation works.
>>>>
>>>> For booting on a single host, then yes - vMAXPHYSADDR really ought to be
>>>> the same as MAXPHYSADDR, and is what happens in the common case.
>>>>
>>>> For booting in a heterogeneous pool, the only safe value is the min of
>>>> MAXPHYSADDR across the resource pool. Anything higher, and the VM will
>>>> malfunction (get #PF[rsvd] for apparently-legal PTEs) on the smallest
>>>> pool member(s).
>>> Except that min isn't safe either - the guest may then expect reserved
>>> bit faults where none surface.
>> Such a guest is buggy, and in clear violation of the rules set out in
>> the architecture. All reserved behaviour is subject to change in the
>> future.
>>
>> Any software (Xen included) deliberately choosing to depend on the
>> specifics of reserved behaviour gets to keep all resulting pieces.
> While I could understand what you're saying when considering our
> prior unconditional reliance on getting reserved bit faults, are you
> suggesting the recently adjusted behavior is still "in clear
> violation of the rules set out in the architecture"? And hence are
> you suggesting we should have outright dropped that optimization?

Strictly speaking, we shouldn't use reserved bits at all. That is the
most compatible option available.

Given that we have chosen to use reserved bits, it is our responsibility
to ensure that they are safe to use. That is why we elect not to use
them when virtualised (for this bug), or on native when we think they
won't work (the previous change for IceLake).

From that point of view, I think we're fine. We're (knowingly) operating
outside of the rules, and now taking appropriate precautions to cover
the corner cases we are aware of. The behaviour of already-shipped CPUs,
while not architecturally guaranteed, is known and can reasonably be
depended upon.

~Andrew