Since we don't need to encode all of the PTE flags, we have enough bits
in the shadow entry to store the full GFN. Don't use literal numbers -
instead derive the involved values. Or, where derivation would become
too ugly, sanity-check the result (invoking #error to identify failure).
This then allows dropping from sh_l1e_mmio() again the guarding against
too large GFNs.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I wonder if the respective check in sh_audit_l1_table() is actually
useful to retain with these changes.
--- a/xen/arch/x86/mm/shadow/types.h
+++ b/xen/arch/x86/mm/shadow/types.h
@@ -283,9 +283,17 @@ shadow_put_page_from_l1e(shadow_l1e_t sl
* This is only feasible for PAE and 64bit Xen: 32-bit non-PAE PTEs don't
* have reserved bits that we can use for this. And even there it can only
* be used if we can be certain the processor doesn't use all 52 address bits.
+ *
+ * For the MMIO encoding (see below) we need the bottom 4 bits for
+ * identifying the kind of entry and a full GFN's worth of bits to encode
+ * the originating frame number. Set all remaining bits to trigger
+ * reserved bit faults, if (see above) the hardware permits triggering such.
*/
-#define SH_L1E_MAGIC 0xffffffff00000001ULL
+#define SH_L1E_MAGIC_NR_META_BITS 4
+#define SH_L1E_MAGIC_MASK ((~0ULL << (PADDR_BITS - PAGE_SHIFT + \
+ SH_L1E_MAGIC_NR_META_BITS)) | \
+ _PAGE_PRESENT)
static inline bool sh_have_pte_rsvd_bits(void)
{
@@ -294,7 +302,8 @@ static inline bool sh_have_pte_rsvd_bits
static inline bool sh_l1e_is_magic(shadow_l1e_t sl1e)
{
- return (sl1e.l1 & SH_L1E_MAGIC) == SH_L1E_MAGIC;
+ BUILD_BUG_ON(!(PADDR_MASK & SH_L1E_MAGIC_MASK));
+ return (sl1e.l1 & SH_L1E_MAGIC_MASK) == SH_L1E_MAGIC_MASK;
}
/* Guest not present: a single magic value */
@@ -320,20 +329,26 @@ static inline bool sh_l1e_is_gnp(shadow_
/*
* MMIO: an invalid PTE that contains the GFN of the equivalent guest l1e.
- * We store 28 bits of GFN in bits 4:32 of the entry.
+ * We store the GFN in bits 4:43 of the entry.
* The present bit is set, and the U/S and R/W bits are taken from the guest.
* Bit 3 is always 0, to differentiate from gnp above.
*/
-#define SH_L1E_MMIO_MAGIC 0xffffffff00000001ULL
-#define SH_L1E_MMIO_MAGIC_MASK 0xffffffff00000009ULL
-#define SH_L1E_MMIO_GFN_MASK 0x00000000fffffff0ULL
+#define SH_L1E_MMIO_MAGIC SH_L1E_MAGIC_MASK
+#define SH_L1E_MMIO_MAGIC_BIT ((_PAGE_PRESENT | _PAGE_RW | _PAGE_USER) + 1)
+#if SH_L1E_MMIO_MAGIC_BIT & (SH_L1E_MMIO_MAGIC_BIT - 1)
+# error SH_L1E_MMIO_MAGIC_BIT needs to be a power of 2
+#endif
+#if SH_L1E_MMIO_MAGIC_BIT >> SH_L1E_MAGIC_NR_META_BITS
+# error SH_L1E_MMIO_MAGIC_BIT and SH_L1E_MAGIC_NR_META_BITS are out of sync
+#endif
+#define SH_L1E_MMIO_MAGIC_MASK (SH_L1E_MAGIC_MASK | SH_L1E_MMIO_MAGIC_BIT)
+#define SH_L1E_MMIO_GFN_MASK ~(SH_L1E_MMIO_MAGIC_MASK | _PAGE_RW | _PAGE_USER)
static inline shadow_l1e_t sh_l1e_mmio(gfn_t gfn, u32 gflags)
{
unsigned long gfn_val = MASK_INSR(gfn_x(gfn), SH_L1E_MMIO_GFN_MASK);
- if ( !sh_have_pte_rsvd_bits() ||
- gfn_x(gfn) != MASK_EXTR(gfn_val, SH_L1E_MMIO_GFN_MASK) )
+ if ( !sh_have_pte_rsvd_bits() )
return shadow_l1e_empty();
return (shadow_l1e_t) { (SH_L1E_MMIO_MAGIC | gfn_val |
On 05/03/2021 15:37, Jan Beulich wrote:
> Since we don't need to encode all of the PTE flags, we have enough bits
> in the shadow entry to store the full GFN. Don't use literal numbers -
> instead derive the involved values. Or, where derivation would become
> too ugly, sanity-check the result (invoking #error to identify failure).
>
> This then allows dropping from sh_l1e_mmio() again the guarding against
> too large GFNs.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I wonder if the respective check in sh_audit_l1_table() is actually
> useful to retain with these changes.
>
> --- a/xen/arch/x86/mm/shadow/types.h
> +++ b/xen/arch/x86/mm/shadow/types.h
> @@ -283,9 +283,17 @@ shadow_put_page_from_l1e(shadow_l1e_t sl
> * This is only feasible for PAE and 64bit Xen: 32-bit non-PAE PTEs don't
> * have reserved bits that we can use for this. And even there it can only
> * be used if we can be certain the processor doesn't use all 52 address bits.
> + *
> + * For the MMIO encoding (see below) we need the bottom 4 bits for
> + * identifying the kind of entry and a full GFN's worth of bits to encode
> + * the originating frame number. Set all remaining bits to trigger
> + * reserved bit faults, if (see above) the hardware permits triggering such.
> */
>
> -#define SH_L1E_MAGIC 0xffffffff00000001ULL
> +#define SH_L1E_MAGIC_NR_META_BITS 4
> +#define SH_L1E_MAGIC_MASK ((~0ULL << (PADDR_BITS - PAGE_SHIFT + \
> + SH_L1E_MAGIC_NR_META_BITS)) | \
> + _PAGE_PRESENT)
>
> static inline bool sh_have_pte_rsvd_bits(void)
> {
> @@ -294,7 +302,8 @@ static inline bool sh_have_pte_rsvd_bits
>
> static inline bool sh_l1e_is_magic(shadow_l1e_t sl1e)
> {
> - return (sl1e.l1 & SH_L1E_MAGIC) == SH_L1E_MAGIC;
> + BUILD_BUG_ON(!(PADDR_MASK & SH_L1E_MAGIC_MASK));
> + return (sl1e.l1 & SH_L1E_MAGIC_MASK) == SH_L1E_MAGIC_MASK;
> }
>
> /* Guest not present: a single magic value */
> @@ -320,20 +329,26 @@ static inline bool sh_l1e_is_gnp(shadow_
>
> /*
> * MMIO: an invalid PTE that contains the GFN of the equivalent guest l1e.
> - * We store 28 bits of GFN in bits 4:32 of the entry.
> + * We store the GFN in bits 4:43 of the entry.
> * The present bit is set, and the U/S and R/W bits are taken from the guest.
> * Bit 3 is always 0, to differentiate from gnp above.
> */
> -#define SH_L1E_MMIO_MAGIC 0xffffffff00000001ULL
> -#define SH_L1E_MMIO_MAGIC_MASK 0xffffffff00000009ULL
> -#define SH_L1E_MMIO_GFN_MASK 0x00000000fffffff0ULL
> +#define SH_L1E_MMIO_MAGIC SH_L1E_MAGIC_MASK
> +#define SH_L1E_MMIO_MAGIC_BIT ((_PAGE_PRESENT | _PAGE_RW | _PAGE_USER) + 1)
> +#if SH_L1E_MMIO_MAGIC_BIT & (SH_L1E_MMIO_MAGIC_BIT - 1)
> +# error SH_L1E_MMIO_MAGIC_BIT needs to be a power of 2
> +#endif
> +#if SH_L1E_MMIO_MAGIC_BIT >> SH_L1E_MAGIC_NR_META_BITS
> +# error SH_L1E_MMIO_MAGIC_BIT and SH_L1E_MAGIC_NR_META_BITS are out of sync
> +#endif
> +#define SH_L1E_MMIO_MAGIC_MASK (SH_L1E_MAGIC_MASK | SH_L1E_MMIO_MAGIC_BIT)
> +#define SH_L1E_MMIO_GFN_MASK ~(SH_L1E_MMIO_MAGIC_MASK | _PAGE_RW | _PAGE_USER)
In practice, it is 4:36, because we have to limit shadow guests to 32
bits of gfn for XSA-173 (width of the superpage backpointer IIRC).
Also, this property is important for L1TF. The more guest-controllable
bits we permit in here, the greater the chance of being vulnerable to
L1TF on massive machines.
(I'm a little concerned that I can't spot an L1TF comment which has been
made stale by these changes... Probably the fault of whichever numpty
prepared the L1TF patches, because I'm certain we discussed this at the
time)
Going from 32 to 36 bits moves the upper safety barrier from TOP-4G to
TOP-64G but I recall us agreeing that that was ok, especially as the main
safety guesstimate is "no RAM in the top quarter of the address space".
However, I don't think we want to accidentally creep beyond bit 36, so
I'd suggest that the easy fix here is just adjusting a nibble in the
MMIO masks.
~Andrew
On 05.03.2021 17:32, Andrew Cooper wrote:
> On 05/03/2021 15:37, Jan Beulich wrote:
>> Since we don't need to encode all of the PTE flags, we have enough bits
>> in the shadow entry to store the full GFN. Don't use literal numbers -
>> instead derive the involved values. Or, where derivation would become
>> too ugly, sanity-check the result (invoking #error to identify failure).
>>
>> This then allows dropping from sh_l1e_mmio() again the guarding against
>> too large GFNs.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I wonder if the respective check in sh_audit_l1_table() is actually
>> useful to retain with these changes.
>>
>> --- a/xen/arch/x86/mm/shadow/types.h
>> +++ b/xen/arch/x86/mm/shadow/types.h
>> @@ -283,9 +283,17 @@ shadow_put_page_from_l1e(shadow_l1e_t sl
>> * This is only feasible for PAE and 64bit Xen: 32-bit non-PAE PTEs don't
>> * have reserved bits that we can use for this. And even there it can only
>> * be used if we can be certain the processor doesn't use all 52 address bits.
>> + *
>> + * For the MMIO encoding (see below) we need the bottom 4 bits for
>> + * identifying the kind of entry and a full GFN's worth of bits to encode
>> + * the originating frame number. Set all remaining bits to trigger
>> + * reserved bit faults, if (see above) the hardware permits triggering such.
>> */
>>
>> -#define SH_L1E_MAGIC 0xffffffff00000001ULL
>> +#define SH_L1E_MAGIC_NR_META_BITS 4
>> +#define SH_L1E_MAGIC_MASK ((~0ULL << (PADDR_BITS - PAGE_SHIFT + \
>> + SH_L1E_MAGIC_NR_META_BITS)) | \
>> + _PAGE_PRESENT)
>>
>> static inline bool sh_have_pte_rsvd_bits(void)
>> {
>> @@ -294,7 +302,8 @@ static inline bool sh_have_pte_rsvd_bits
>>
>> static inline bool sh_l1e_is_magic(shadow_l1e_t sl1e)
>> {
>> - return (sl1e.l1 & SH_L1E_MAGIC) == SH_L1E_MAGIC;
>> + BUILD_BUG_ON(!(PADDR_MASK & SH_L1E_MAGIC_MASK));
>> + return (sl1e.l1 & SH_L1E_MAGIC_MASK) == SH_L1E_MAGIC_MASK;
>> }
>>
>> /* Guest not present: a single magic value */
>> @@ -320,20 +329,26 @@ static inline bool sh_l1e_is_gnp(shadow_
>>
>> /*
>> * MMIO: an invalid PTE that contains the GFN of the equivalent guest l1e.
>> - * We store 28 bits of GFN in bits 4:32 of the entry.
>> + * We store the GFN in bits 4:43 of the entry.
>> * The present bit is set, and the U/S and R/W bits are taken from the guest.
>> * Bit 3 is always 0, to differentiate from gnp above.
>> */
>> -#define SH_L1E_MMIO_MAGIC 0xffffffff00000001ULL
>> -#define SH_L1E_MMIO_MAGIC_MASK 0xffffffff00000009ULL
>> -#define SH_L1E_MMIO_GFN_MASK 0x00000000fffffff0ULL
>> +#define SH_L1E_MMIO_MAGIC SH_L1E_MAGIC_MASK
>> +#define SH_L1E_MMIO_MAGIC_BIT ((_PAGE_PRESENT | _PAGE_RW | _PAGE_USER) + 1)
>> +#if SH_L1E_MMIO_MAGIC_BIT & (SH_L1E_MMIO_MAGIC_BIT - 1)
>> +# error SH_L1E_MMIO_MAGIC_BIT needs to be a power of 2
>> +#endif
>> +#if SH_L1E_MMIO_MAGIC_BIT >> SH_L1E_MAGIC_NR_META_BITS
>> +# error SH_L1E_MMIO_MAGIC_BIT and SH_L1E_MAGIC_NR_META_BITS are out of sync
>> +#endif
>> +#define SH_L1E_MMIO_MAGIC_MASK (SH_L1E_MAGIC_MASK | SH_L1E_MMIO_MAGIC_BIT)
>> +#define SH_L1E_MMIO_GFN_MASK ~(SH_L1E_MMIO_MAGIC_MASK | _PAGE_RW | _PAGE_USER)
>
> In practice, it is 4:36, because we have to limit shadow guests to 32
> bits of gfn for XSA-173 (width of the superpage backpointer IIRC).
When !BIGMEM - yes.
> Also, this property is important for L1TF. The more guest-controllable
> bits we permit in here, the greater the chance of being vulnerable to
> L1TF on massive machines.
>
> (I'm a little concerned that I can't spot an L1TF comment which has been
> made stale by these changes... Probably the fault of whichever numpty
> prepared the L1TF patches, because I'm certain we discussed this at the
> time)
>
> Going from 32 to 36 bits moves the upper safety barrier from TOP-4G to
> TOP-64G but I recall us agreeing that that was ok, especially as the main
> safety guesstimate is "no RAM in the top quarter of the address space".
>
> However, I don't think we want to accidentally creep beyond bit 36, so
> I'd suggest that the easy fix here is just adjusting a nibble in the
> MMIO masks.
With BIGMEM I'm not sure we want to be this strict. Nor do we need
to as long as we only need 4 bits at the bottom - we only go up to
bit 43 with what we allow guests control over. IOW we will need to
be careful on old hardware when l1d_maxphysaddr == 44, but on
anything newer we're still far enough away I would think. So I
guess instead of outright dropping the GFN check from sh_l1e_mmio()
I want to replace it by use of is_l1tf_safe_maddr() (on the
produced shadow_l1e_t).
Jan
© 2016 - 2026 Red Hat, Inc.