Especially the latter three patches provide only small possible gains,
from all I can tell. I nevertheless wanted to put up the entire set for
consideration.

1: consistently inline {,un}adjust_guest_l<N>e()
2: fold redundant calls to adjust_guest_l<N>e()
3: _PAGE_RW changes may take fast path of mod_l[234]_entry()
4: restrict TLB flushing after mod_l[234]_entry()
5: avoid TLB flushing after mod_l3_entry()

Jan
Commit 8a74707a7c ("x86/nospec: Use always_inline to fix code gen for
evaluate_nospec") converted inline to always_inline for
adjust_guest_l[134]e(), but left adjust_guest_l2e() and
unadjust_guest_l3e() alone without saying why these two would differ in
the needed / wanted treatment. Adjust these two as well.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Actually I question the need for always_inline here, for two reasons:

1) All that's guarded are updates to local variables, depending on
merely the values held by these local variables (really: function
arguments) on input. As a result it would look to me as if we wanted
evaluate_nospec()-free variants of is_pv{,_32bit}_domain() and alike,
to be used e.g. here.

2) These functions don't act as predicates, and hence the concern
expressed in said commit doesn't apply here: Callers wouldn't observe
misplaced LFENCEs, as they don't use the results of these helpers for
further (direct) code flow control.
---
So far I've observed only clang to not inline the two functions when
just "inline" is in place.

--- a/xen/arch/x86/pv/mm.h
+++ b/xen/arch/x86/pv/mm.h
@@ -99,8 +99,8 @@ static always_inline l1_pgentry_t adjust
     return l1e;
 }
 
-static inline l2_pgentry_t adjust_guest_l2e(l2_pgentry_t l2e,
-                                            const struct domain *d)
+static always_inline l2_pgentry_t adjust_guest_l2e(l2_pgentry_t l2e,
+                                                   const struct domain *d)
 {
     if ( likely(l2e_get_flags(l2e) & _PAGE_PRESENT) &&
          likely(!is_pv_32bit_domain(d)) )
@@ -119,8 +119,8 @@ static always_inline l3_pgentry_t adjust
     return l3e;
 }
 
-static inline l3_pgentry_t unadjust_guest_l3e(l3_pgentry_t l3e,
-                                              const struct domain *d)
+static always_inline l3_pgentry_t unadjust_guest_l3e(l3_pgentry_t l3e,
+                                                     const struct domain *d)
 {
     if ( unlikely(is_pv_32bit_domain(d)) &&
          likely(l3e_get_flags(l3e) & _PAGE_PRESENT) )
On Tue, Nov 03, 2020 at 11:56:16AM +0100, Jan Beulich wrote:
> Commit 8a74707a7c ("x86/nospec: Use always_inline to fix code gen for
> evaluate_nospec") converted inline to always_inline for
> adjust_guest_l[134]e(), but left adjust_guest_l2e() and
> unadjust_guest_l3e() alone without saying why these two would differ in
> the needed / wanted treatment. Adjust these two as well.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

I'm also unsure why those were left without always_inline. Maybe it's
because of the way they are used? Turning them into always_inline
shouldn't do any damage, so I think it's fine.

Thanks, Roger.
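[Editor's illustration — not part of the thread.] The effect of the conversion can be sketched outside Xen with a minimal, self-contained model. The macro mirrors the common `always_inline` idiom; the simplified entry type, the flag values, and the helper name `adjust_guest_l2e_model` are invented for illustration and are not Xen's actual definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Common definition of always_inline: ask the compiler to inline even
 * when its cost heuristics (e.g. clang's, as Jan observed) would not. */
#define always_inline inline __attribute__((__always_inline__))

#define _PAGE_PRESENT 0x001U
#define _PAGE_USER    0x004U

typedef struct { uint64_t l2; } l2_pgentry_t;

/* Simplified stand-in for adjust_guest_l2e(): force _PAGE_USER onto
 * present entries of 64-bit PV guests.  With always_inline the 32-bit
 * check is folded into every caller rather than left as a call. */
static always_inline l2_pgentry_t adjust_guest_l2e_model(l2_pgentry_t l2e,
                                                         int is_32bit)
{
    if ( (l2e.l2 & _PAGE_PRESENT) && !is_32bit )
        l2e.l2 |= _PAGE_USER;
    return l2e;
}
```

The behavior is unchanged relative to plain `inline`; only the generated code differs, which is the whole point of the patch.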
At least from an abstract perspective it is quite odd for us to compare
adjusted old and unadjusted new page table entries when determining
whether the fast path can be used. This is largely benign because
FASTPATH_FLAG_WHITELIST covers most of the flags which the adjustments
may set, and the flags getting set don't affect the outcome of
get_page_from_l<N>e(). There's one exception: 32-bit L3 entries get
_PAGE_RW set, but get_page_from_l3e() doesn't allow linear page tables
to be created at this level for such guests. Apart from this _PAGE_RW
is unused by get_page_from_l<N>e() (for N > 1), and hence forcing the
bit on early has no functional effect.

The main reason for the change, however, is that adjust_guest_l<N>e()
aren't exactly cheap - both in terms of pure code size and because each
one has at least one evaluate_nospec() by way of containing
is_pv_32bit_domain() conditionals.

Call the functions once ahead of the fast path checks, instead of twice
after.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2199,10 +2199,11 @@ static int mod_l1_entry(l1_pgentry_t *pl
         nl1e = l1e_from_page(page, l1e_get_flags(nl1e));
     }
 
+    nl1e = adjust_guest_l1e(nl1e, pt_dom);
+
     /* Fast path for sufficiently-similar mappings. */
     if ( !l1e_has_changed(ol1e, nl1e, ~FASTPATH_FLAG_WHITELIST) )
     {
-        nl1e = adjust_guest_l1e(nl1e, pt_dom);
         rc = UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, pt_vcpu,
                           preserve_ad);
         if ( page )
@@ -2227,7 +2228,6 @@ static int mod_l1_entry(l1_pgentry_t *pl
         if ( page )
             put_page(page);
 
-        nl1e = adjust_guest_l1e(nl1e, pt_dom);
         if ( unlikely(!UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, pt_vcpu,
                                     preserve_ad)) )
         {
@@ -2279,10 +2279,11 @@ static int mod_l2_entry(l2_pgentry_t *pl
         return -EINVAL;
     }
 
+    nl2e = adjust_guest_l2e(nl2e, d);
+
     /* Fast path for sufficiently-similar mappings. */
     if ( !l2e_has_changed(ol2e, nl2e, ~FASTPATH_FLAG_WHITELIST) )
     {
-        nl2e = adjust_guest_l2e(nl2e, d);
         if ( UPDATE_ENTRY(l2, pl2e, ol2e, nl2e, mfn, vcpu, preserve_ad) )
             return 0;
         return -EBUSY;
@@ -2291,7 +2292,6 @@ static int mod_l2_entry(l2_pgentry_t *pl
     if ( unlikely((rc = get_page_from_l2e(nl2e, mfn, d, 0)) < 0) )
         return rc;
 
-    nl2e = adjust_guest_l2e(nl2e, d);
     if ( unlikely(!UPDATE_ENTRY(l2, pl2e, ol2e, nl2e, mfn, vcpu,
                                 preserve_ad)) )
     {
@@ -2341,10 +2341,11 @@ static int mod_l3_entry(l3_pgentry_t *pl
         return -EINVAL;
     }
 
+    nl3e = adjust_guest_l3e(nl3e, d);
+
     /* Fast path for sufficiently-similar mappings. */
     if ( !l3e_has_changed(ol3e, nl3e, ~FASTPATH_FLAG_WHITELIST) )
     {
-        nl3e = adjust_guest_l3e(nl3e, d);
         rc = UPDATE_ENTRY(l3, pl3e, ol3e, nl3e, mfn, vcpu, preserve_ad);
         return rc ? 0 : -EFAULT;
     }
@@ -2354,7 +2355,6 @@ static int mod_l3_entry(l3_pgentry_t *pl
         return rc;
     rc = 0;
 
-    nl3e = adjust_guest_l3e(nl3e, d);
     if ( unlikely(!UPDATE_ENTRY(l3, pl3e, ol3e, nl3e, mfn, vcpu,
                                 preserve_ad)) )
     {
@@ -2403,10 +2403,11 @@ static int mod_l4_entry(l4_pgentry_t *pl
         return -EINVAL;
     }
 
+    nl4e = adjust_guest_l4e(nl4e, d);
+
     /* Fast path for sufficiently-similar mappings. */
     if ( !l4e_has_changed(ol4e, nl4e, ~FASTPATH_FLAG_WHITELIST) )
     {
-        nl4e = adjust_guest_l4e(nl4e, d);
         rc = UPDATE_ENTRY(l4, pl4e, ol4e, nl4e, mfn, vcpu, preserve_ad);
         return rc ? 0 : -EFAULT;
     }
@@ -2416,7 +2417,6 @@ static int mod_l4_entry(l4_pgentry_t *pl
         return rc;
     rc = 0;
 
-    nl4e = adjust_guest_l4e(nl4e, d);
     if ( unlikely(!UPDATE_ENTRY(l4, pl4e, ol4e, nl4e, mfn, vcpu,
                                 preserve_ad)) )
     {
On Tue, Nov 03, 2020 at 11:56:44AM +0100, Jan Beulich wrote:
> At least from an abstract perspective it is quite odd for us to compare
> adjusted old and unadjusted new page table entries when determining
> whether the fast path can be used. This is largely benign because
> FASTPATH_FLAG_WHITELIST covers most of the flags which the adjustments
> may set, and the flags getting set don't affect the outcome of
> get_page_from_l<N>e(). There's one exception: 32-bit L3 entries get
> _PAGE_RW set, but get_page_from_l3e() doesn't allow linear page tables
> to be created at this level for such guests. Apart from this _PAGE_RW
> is unused by get_page_from_l<N>e() (for N > 1), and hence forcing the
> bit on early has no functional effect.
>
> The main reason for the change, however, is that adjust_guest_l<N>e()
> aren't exactly cheap - both in terms of pure code size and because each
> one has at least one evaluate_nospec() by way of containing
> is_pv_32bit_domain() conditionals.
>
> Call the functions once ahead of the fast path checks, instead of twice
> after.

I guess part of the reasoning for doing it that way is because you can
avoid the adjust_guest_l1e in the slow path if get_page_from_l1e fails?

In any case, adjust_guest_l1e was only called once from either the fast
or the slow paths.

> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.
On 11.01.2021 11:36, Roger Pau Monné wrote:
> On Tue, Nov 03, 2020 at 11:56:44AM +0100, Jan Beulich wrote:
>> At least from an abstract perspective it is quite odd for us to compare
>> adjusted old and unadjusted new page table entries when determining
>> whether the fast path can be used. This is largely benign because
>> FASTPATH_FLAG_WHITELIST covers most of the flags which the adjustments
>> may set, and the flags getting set don't affect the outcome of
>> get_page_from_l<N>e(). There's one exception: 32-bit L3 entries get
>> _PAGE_RW set, but get_page_from_l3e() doesn't allow linear page tables
>> to be created at this level for such guests. Apart from this _PAGE_RW
>> is unused by get_page_from_l<N>e() (for N > 1), and hence forcing the
>> bit on early has no functional effect.
>>
>> The main reason for the change, however, is that adjust_guest_l<N>e()
>> aren't exactly cheap - both in terms of pure code size and because each
>> one has at least one evaluate_nospec() by way of containing
>> is_pv_32bit_domain() conditionals.
>>
>> Call the functions once ahead of the fast path checks, instead of twice
>> after.
>
> I guess part of the reasoning for doing it that way is because you can
> avoid the adjust_guest_l1e in the slow path if get_page_from_l1e
> fails?

Possibly, but this is specifically the not generally interesting code
path (in particular not one where I'd consider performance relevant).

> In any case, adjust_guest_l1e was only called once from either the
> fast or the slow paths.
>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

Jan
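[Editor's illustration — not part of the thread.] The hoisting patch 2 performs can be modeled in isolation: the adjustment runs exactly once, before the fast-path comparison, so both paths install the same (adjusted) entry. Everything below — the flag values, the trimmed-down whitelist, and the `mod_entry()` skeleton — is an illustrative sketch, not Xen code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define _PAGE_PRESENT  0x001ULL
#define _PAGE_RW       0x002ULL
#define _PAGE_USER     0x004ULL
#define _PAGE_ACCESSED 0x020ULL
#define _PAGE_DIRTY    0x040ULL

/* Stand-in for FASTPATH_FLAG_WHITELIST (a subset of the real mask). */
#define WHITELIST (_PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_RW | _PAGE_USER)

/* Stand-in for adjust_guest_l<N>e(): force _PAGE_USER for 64-bit PV. */
static uint64_t adjust(uint64_t e, bool is_32bit)
{
    if ( (e & _PAGE_PRESENT) && !is_32bit )
        e |= _PAGE_USER;
    return e;
}

/* Stand-in for l<N>e_has_changed(): do o and n differ in mask bits? */
static bool has_changed(uint64_t o, uint64_t n, uint64_t mask)
{
    return ((o ^ n) & mask) != 0;
}

/* Shape of mod_l<N>_entry() after the patch: adjust once up front, then
 * compare adjusted-old against adjusted-new.  Returns true when the
 * fast path was taken. */
static bool mod_entry(uint64_t *pe, uint64_t ne, bool is_32bit)
{
    uint64_t oe = *pe;

    ne = adjust(ne, is_32bit);
    if ( !has_changed(oe, ne, ~WHITELIST) )
    {
        *pe = ne;       /* fast path: no re-validation needed */
        return true;
    }
    *pe = ne;           /* slow path: (re)validation would go here */
    return false;
}
```

Before the patch, `adjust()` would have been called once on each of the two paths, after the comparison; hoisting it removes the duplication and makes the comparison operate on like-for-like entries.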
The only time _PAGE_RW matters when validating an L2 or higher entry is
when a linear page table is tried to be installed. Therefore when we
disallow such at build time, we can allow _PAGE_RW changes to take the
fast paths there.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2140,6 +2140,18 @@ static void l3t_unlock(struct page_info
     (_PAGE_NX_BIT | _PAGE_AVAIL_HIGH | _PAGE_AVAIL | _PAGE_GLOBAL | \
      _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_USER)
 
+/*
+ * PDE flags that a guest may change without re-validating the PDE.
+ * All other bits affect translation, caching, or Xen's safety. When guest
+ * created linear page tables aren't allowed, intermediate page tables may
+ * have _PAGE_RW altered without this requiring re-validation.
+ */
+#ifndef CONFIG_PV_LINEAR_PT
+# define FASTPATH_PDE_FLAG_WHITELIST (FASTPATH_FLAG_WHITELIST | _PAGE_RW)
+#else
+# define FASTPATH_PDE_FLAG_WHITELIST FASTPATH_FLAG_WHITELIST
+#endif
+
 /* Update the L1 entry at pl1e to new value nl1e. */
 static int mod_l1_entry(l1_pgentry_t *pl1e, l1_pgentry_t nl1e,
                         mfn_t gl1mfn, unsigned int cmd,
@@ -2282,7 +2294,7 @@ static int mod_l2_entry(l2_pgentry_t *pl
     nl2e = adjust_guest_l2e(nl2e, d);
 
     /* Fast path for sufficiently-similar mappings. */
-    if ( !l2e_has_changed(ol2e, nl2e, ~FASTPATH_FLAG_WHITELIST) )
+    if ( !l2e_has_changed(ol2e, nl2e, ~FASTPATH_PDE_FLAG_WHITELIST) )
     {
         if ( UPDATE_ENTRY(l2, pl2e, ol2e, nl2e, mfn, vcpu, preserve_ad) )
             return 0;
@@ -2344,7 +2356,7 @@ static int mod_l3_entry(l3_pgentry_t *pl
     nl3e = adjust_guest_l3e(nl3e, d);
 
     /* Fast path for sufficiently-similar mappings. */
-    if ( !l3e_has_changed(ol3e, nl3e, ~FASTPATH_FLAG_WHITELIST) )
+    if ( !l3e_has_changed(ol3e, nl3e, ~FASTPATH_PDE_FLAG_WHITELIST) )
     {
         rc = UPDATE_ENTRY(l3, pl3e, ol3e, nl3e, mfn, vcpu, preserve_ad);
         return rc ? 0 : -EFAULT;
@@ -2406,7 +2418,7 @@ static int mod_l4_entry(l4_pgentry_t *pl
     nl4e = adjust_guest_l4e(nl4e, d);
 
     /* Fast path for sufficiently-similar mappings. */
-    if ( !l4e_has_changed(ol4e, nl4e, ~FASTPATH_FLAG_WHITELIST) )
+    if ( !l4e_has_changed(ol4e, nl4e, ~FASTPATH_PDE_FLAG_WHITELIST) )
     {
         rc = UPDATE_ENTRY(l4, pl4e, ol4e, nl4e, mfn, vcpu, preserve_ad);
         return rc ? 0 : -EFAULT;
On Tue, Nov 03, 2020 at 11:57:10AM +0100, Jan Beulich wrote:
> The only time _PAGE_RW matters when validating an L2 or higher entry is
> when a linear page table is tried to be installed. Therefore when we
> disallow such at build time, we can allow _PAGE_RW changes to take the
> fast paths there.

I think it would be helpful to note why PDEs with linear page tables
aren't allowed to have the RW bit set, likely here and in the comment
below.

Thanks, Roger.
On 11.01.2021 12:08, Roger Pau Monné wrote:
> On Tue, Nov 03, 2020 at 11:57:10AM +0100, Jan Beulich wrote:
>> The only time _PAGE_RW matters when validating an L2 or higher entry is
>> when a linear page table is tried to be installed. Therefore when we
>> disallow such at build time, we can allow _PAGE_RW changes to take the
>> fast paths there.
>
> I think it would be helpful to note why PDEs with linear page tables
> aren't allowed to have the RW bit set, likely here and in the comment
> below.

I've changed the description to

"The only time _PAGE_RW matters when validating an L2 or higher entry is
when a linear page table is tried to be installed (see the comment ahead
of define_get_linear_pagetable()). Therefore when we disallow such at
build time, we can allow _PAGE_RW changes to take the fast paths there."

Considering there already is a code comment explaining this, I'm less
convinced of also editing the comment. An option might be to move up the
#define-s next to define_get_linear_pagetable(), but imo this should
then involve moving FASTPATH_FLAG_WHITELIST as well.

Jan
On Mon, Jan 11, 2021 at 02:31:10PM +0100, Jan Beulich wrote:
> On 11.01.2021 12:08, Roger Pau Monné wrote:
>> On Tue, Nov 03, 2020 at 11:57:10AM +0100, Jan Beulich wrote:
>>> The only time _PAGE_RW matters when validating an L2 or higher entry is
>>> when a linear page table is tried to be installed. Therefore when we
>>> disallow such at build time, we can allow _PAGE_RW changes to take the
>>> fast paths there.
>>
>> I think it would be helpful to note why PDEs with linear page tables
>> aren't allowed to have the RW bit set, likely here and in the comment
>> below.
>
> I've changed the description to
>
> "The only time _PAGE_RW matters when validating an L2 or higher entry is
> when a linear page table is tried to be installed (see the comment ahead
> of define_get_linear_pagetable()). Therefore when we disallow such at
> build time, we can allow _PAGE_RW changes to take the fast paths there."
>
> Considering there already is a code comment explaining this, I'm
> less convinced of also editing the comment. An option might be to
> move up the #define-s next to define_get_linear_pagetable(), but
> imo this should then involve moving FASTPATH_FLAG_WHITELIST as
> well.

Referencing the comment in define_get_linear_pagetable in the commit
message seems fine to me:

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.
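[Editor's illustration — not part of the thread.] The whitelist mechanics under discussion boil down to an XOR-and-mask test. The stand-alone sketch below (with invented flag values and a trimmed-down whitelist) shows why adding _PAGE_RW to the PDE whitelist lets an RW-only change take the fast path for L2+ entries while L1 entries keep the stricter mask:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define _PAGE_PRESENT  0x001ULL
#define _PAGE_RW       0x002ULL
#define _PAGE_ACCESSED 0x020ULL
#define _PAGE_DIRTY    0x040ULL

/* Trimmed-down stand-ins for the real masks in xen/arch/x86/mm.c. */
#define FASTPATH_FLAG_WHITELIST     (_PAGE_ACCESSED | _PAGE_DIRTY)
#define FASTPATH_PDE_FLAG_WHITELIST (FASTPATH_FLAG_WHITELIST | _PAGE_RW)

/* Stand-in for l<N>e_has_changed(): the entries "have changed" iff they
 * differ in any bit of mask, i.e. in any non-whitelisted bit when mask
 * is the complement of a whitelist. */
static bool has_changed(uint64_t o, uint64_t n, uint64_t mask)
{
    return ((o ^ n) & mask) != 0;
}
```

With `CONFIG_PV_LINEAR_PT` disabled, _PAGE_RW becomes irrelevant to validation of L2+ entries, so folding it into the mask widens the fast path without changing what gets validated.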
Just like we avoid to invoke remote root pt flushes when all uses of an
L4 table can be accounted for locally, the same can be done for all of
L[234] for the linear pt flush when the table is a "free floating" one,
i.e. it is pinned but not hooked up anywhere. While this situation
doesn't occur very often, it can be observed.

Since this breaks one of the implications of the XSA-286 fix, drop the
flush_root_pt_local variable again and set ->root_pgt_changed directly,
just like it was before that change.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
While adjusting the big comment that was added for XSA-286 I wondered
why it talks about the "construction of 32bit PV guests". How are 64-bit
PV guests different in this regard?

--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3903,8 +3903,7 @@ long do_mmu_update(
     struct vcpu *curr = current, *v = curr;
     struct domain *d = v->domain, *pt_owner = d, *pg_owner;
     mfn_t map_mfn = INVALID_MFN, mfn;
-    bool flush_linear_pt = false, flush_root_pt_local = false,
-         flush_root_pt_others = false;
+    bool flush_linear_pt = false, flush_root_pt_others = false;
     uint32_t xsm_needed = 0;
     uint32_t xsm_checked = 0;
     int rc = put_old_guest_table(curr);
@@ -4054,7 +4053,9 @@ long do_mmu_update(
                     break;
                 rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
                                   cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
-                if ( !rc )
+                if ( !rc &&
+                     (page->u.inuse.type_info & PGT_count_mask) >
+                     1 + !!(page->u.inuse.type_info & PGT_pinned) )
                     flush_linear_pt = true;
                 break;
@@ -4063,7 +4064,9 @@ long do_mmu_update(
                     break;
                 rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
                                   cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
-                if ( !rc )
+                if ( !rc &&
+                     (page->u.inuse.type_info & PGT_count_mask) >
+                     1 + !!(page->u.inuse.type_info & PGT_pinned) )
                     flush_linear_pt = true;
                 break;
@@ -4072,7 +4075,9 @@ long do_mmu_update(
                     break;
                 rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                   cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
-                if ( !rc )
+                if ( !rc &&
+                     (page->u.inuse.type_info & PGT_count_mask) >
+                     1 + !!(page->u.inuse.type_info & PGT_pinned) )
                     flush_linear_pt = true;
                 if ( !rc && pt_owner->arch.pv.xpti )
                 {
@@ -4082,7 +4087,7 @@ long do_mmu_update(
                                  mfn) )
                     {
                         local_in_use = true;
-                        flush_root_pt_local = true;
+                        get_cpu_info()->root_pgt_changed = true;
                     }
 
                     /*
@@ -4199,8 +4204,8 @@ long do_mmu_update(
     /*
      * Perform required TLB maintenance.
      *
-     * This logic currently depend on flush_linear_pt being a superset of the
-     * flush_root_pt_* conditions.
+     * This logic currently depends on flush_linear_pt being a superset of the
+     * flush_root_pt_others condition.
      *
      * pt_owner may not be current->domain. This may occur during
      * construction of 32bit PV guests, or debugging of PV guests. The
@@ -4219,7 +4224,7 @@ long do_mmu_update(
      * pt_owner->dirty_cpumask), and/or all *other* dirty CPUs as there are
      * references we can't account for locally.
      */
-    if ( flush_linear_pt /* || flush_root_pt_local || flush_root_pt_others */ )
+    if ( flush_linear_pt /* || flush_root_pt_others */ )
     {
         unsigned int cpu = smp_processor_id();
         cpumask_t *mask = pt_owner->dirty_cpumask;
@@ -4236,12 +4241,8 @@ long do_mmu_update(
             cpumask_copy(mask, pt_owner->dirty_cpumask);
             __cpumask_clear_cpu(cpu, mask);
 
-            flush_local(FLUSH_TLB |
-                        (flush_root_pt_local ? FLUSH_ROOT_PGTBL : 0));
+            flush_local(FLUSH_TLB);
         }
-        else
-            /* Sanity check. flush_root_pt_local implies local cpu is dirty. */
-            ASSERT(!flush_root_pt_local);
 
         /* Flush the remote dirty CPUs. Does not include the local CPU. */
         if ( !cpumask_empty(mask) )
@@ -4249,8 +4250,8 @@ long do_mmu_update(
                         (flush_root_pt_others ? FLUSH_ROOT_PGTBL : 0));
     }
     else
-        /* Sanity check. flush_root_pt_* implies flush_linear_pt. */
-        ASSERT(!flush_root_pt_local && !flush_root_pt_others);
+        /* Sanity check. flush_root_pt_others implies flush_linear_pt. */
+        ASSERT(!flush_root_pt_others);
 
     perfc_add(num_page_updates, i);
On 03/11/2020 10:57, Jan Beulich wrote:
> Just like we avoid to invoke remote root pt flushes when all uses of an
> L4 table can be accounted for locally, the same can be done for all of
> L[234] for the linear pt flush when the table is a "free floating" one,
> i.e. it is pinned but not hooked up anywhere. While this situation
> doesn't occur very often, it can be observed.
>
> Since this breaks one of the implications of the XSA-286 fix, drop the
> flush_root_pt_local variable again and set ->root_pgt_changed directly,
> just like it was before that change.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> While adjusting the big comment that was added for XSA-286 I wondered
> why it talks about the "construction of 32bit PV guests". How are 64-bit
> PV guests different in this regard?

Because the sole caller is move_l3_below_4G() for 32bit PV guests which
don't support folded CR3's.

It's not impossible that future changes to PV construction might change
this, but it is highly unlikely in practice.

~Andrew
On Tue, Nov 03, 2020 at 11:57:33AM +0100, Jan Beulich wrote:
> Just like we avoid to invoke remote root pt flushes when all uses of an
> L4 table can be accounted for locally, the same can be done for all of
> L[234] for the linear pt flush when the table is a "free floating" one,
> i.e. it is pinned but not hooked up anywhere. While this situation
> doesn't occur very often, it can be observed.
>
> Since this breaks one of the implications of the XSA-286 fix, drop the
> flush_root_pt_local variable again and set ->root_pgt_changed directly,
> just like it was before that change.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

LGTM, so:

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

It would be good however if Andrew can give his opinion also, since he
was the one to introduce such logic, and it's not trivial.

> ---
> While adjusting the big comment that was added for XSA-286 I wondered
> why it talks about the "construction of 32bit PV guests". How are 64-bit
> PV guests different in this regard?
>
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -3903,8 +3903,7 @@ long do_mmu_update(
>      struct vcpu *curr = current, *v = curr;
>      struct domain *d = v->domain, *pt_owner = d, *pg_owner;
>      mfn_t map_mfn = INVALID_MFN, mfn;
> -    bool flush_linear_pt = false, flush_root_pt_local = false,
> -         flush_root_pt_others = false;
> +    bool flush_linear_pt = false, flush_root_pt_others = false;
>      uint32_t xsm_needed = 0;
>      uint32_t xsm_checked = 0;
>      int rc = put_old_guest_table(curr);
> @@ -4054,7 +4053,9 @@ long do_mmu_update(
>                      break;
>                  rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
>                                    cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
> -                if ( !rc )
> +                if ( !rc &&
> +                     (page->u.inuse.type_info & PGT_count_mask) >
> +                     1 + !!(page->u.inuse.type_info & PGT_pinned) )

I think adding a macro to encapsulate the added condition would make
the code clearer, maybe PAGETABLE_IN_USE, _LOADED or _ACTIVE?

Thanks, Roger.
On 11.01.2021 14:00, Roger Pau Monné wrote:
> On Tue, Nov 03, 2020 at 11:57:33AM +0100, Jan Beulich wrote:
>> Just like we avoid to invoke remote root pt flushes when all uses of an
>> L4 table can be accounted for locally, the same can be done for all of
>> L[234] for the linear pt flush when the table is a "free floating" one,
>> i.e. it is pinned but not hooked up anywhere. While this situation
>> doesn't occur very often, it can be observed.
>>
>> Since this breaks one of the implications of the XSA-286 fix, drop the
>> flush_root_pt_local variable again and set ->root_pgt_changed directly,
>> just like it was before that change.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> LGTM, so:
>
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

> It would be good however if Andrew can give his opinion also, since he
> was the one to introduce such logic, and it's not trivial.

I can certainly wait some, also to hopefully get a comment on

>> ---
>> While adjusting the big comment that was added for XSA-286 I wondered
>> why it talks about the "construction of 32bit PV guests". How are 64-bit
>> PV guests different in this regard?

... this, but his email troubles make it hard to judge for how long to
wait.

>> @@ -4054,7 +4053,9 @@ long do_mmu_update(
>>                      break;
>>                  rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
>>                                    cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>> -                if ( !rc )
>> +                if ( !rc &&
>> +                     (page->u.inuse.type_info & PGT_count_mask) >
>> +                     1 + !!(page->u.inuse.type_info & PGT_pinned) )
>
> I think adding a macro to encapsulate the added condition would make
> the code clearer, maybe PAGETABLE_IN_USE, _LOADED or _ACTIVE?

A macro or inline function may certainly be nice (indeed I did consider
adding one), but I think here more than in some other cases it is
crucial that the name properly reflect what it does. Unfortunately none
of the ones you suggest nor any of the ones I did consider will meet
this requirement. In particular, the check is about "is there any
_other_ use". Hence I went with the not very fortunate redundant
open-coding.

Jan
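[Editor's illustration — not part of the thread.] For reference, the helper Roger suggests (which the patch deliberately open-codes) might look like the sketch below. The function name is invented, and the bit values are illustrative rather than Xen's actual `type_info` layout:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins: low bits hold the type reference count, one
 * flag marks the page table as pinned.  Not Xen's real bit layout. */
#define PGT_count_mask 0x0000ffffUL
#define PGT_pinned     0x00010000UL

/* Hypothetical helper for the open-coded condition: does the page table
 * have any use beyond the reference we're updating through, plus its
 * pin reference (if pinned)?  Only then might some other CPU hold a
 * stale linear-pt mapping that needs flushing. */
static bool pt_has_other_users(unsigned long type_info)
{
    return (type_info & PGT_count_mask) >
           1 + !!(type_info & PGT_pinned);
}
```

A "free floating" table — pinned but not hooked up anywhere — has exactly the pin reference plus the one being used for the update, so the helper returns false and the flush can be skipped.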
On 11.01.2021 14:22, Jan Beulich wrote:
> On 11.01.2021 14:00, Roger Pau Monné wrote:
>> On Tue, Nov 03, 2020 at 11:57:33AM +0100, Jan Beulich wrote:
>>> Just like we avoid to invoke remote root pt flushes when all uses of an
>>> L4 table can be accounted for locally, the same can be done for all of
>>> L[234] for the linear pt flush when the table is a "free floating" one,
>>> i.e. it is pinned but not hooked up anywhere. While this situation
>>> doesn't occur very often, it can be observed.
>>>
>>> Since this breaks one of the implications of the XSA-286 fix, drop the
>>> flush_root_pt_local variable again and set ->root_pgt_changed directly,
>>> just like it was before that change.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> LGTM, so:
>>
>> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
>
> Thanks.
>
>> It would be good however if Andrew can give his opinion also, since he
>> was the one to introduce such logic, and it's not trivial.
>
> I can certainly wait some, also to hopefully get a comment on
>
>>> ---
>>> While adjusting the big comment that was added for XSA-286 I wondered
>>> why it talks about the "construction of 32bit PV guests". How are 64-bit
>>> PV guests different in this regard?
>
> ... this, but his email troubles make it hard to judge for how long to
> wait.

About like for the other one just pinged, I intend to commit this one
together with the other two still pending ones in this series once the
tree is fully open again, unless I hear otherwise by then. All three
have been acked by Roger.

Jan
32-bit guests may not depend upon the side effect of using ordinary
4-level paging when running on a 64-bit hypervisor. For L3 entry updates
to take effect, they have to use a CR3 reload. Therefore there's no need
to issue a paging structure invalidating TLB flush in this case.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4066,7 +4066,8 @@ long do_mmu_update(
                                   cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
                 if ( !rc &&
                      (page->u.inuse.type_info & PGT_count_mask) >
-                     1 + !!(page->u.inuse.type_info & PGT_pinned) )
+                     1 + !!(page->u.inuse.type_info & PGT_pinned) &&
+                     !is_pv_32bit_domain(pt_owner) )
                     flush_linear_pt = true;
                 break;
On Tue, Nov 03, 2020 at 11:58:16AM +0100, Jan Beulich wrote:
> 32-bit guests may not depend upon the side effect of using ordinary
> 4-level paging when running on a 64-bit hypervisor. For L3 entry updates
> to take effect, they have to use a CR3 reload. Therefore there's no need
> to issue a paging structure invalidating TLB flush in this case.

I assume it's fine for the Xen linear page tables to likely be out of
sync during the window between the entry update and the CR3 reload?

I wonder, won't something similar also apply to 64bit and L4 entries?

Thanks, Roger.
On 11.01.2021 15:23, Roger Pau Monné wrote:
> On Tue, Nov 03, 2020 at 11:58:16AM +0100, Jan Beulich wrote:
>> 32-bit guests may not depend upon the side effect of using ordinary
>> 4-level paging when running on a 64-bit hypervisor. For L3 entry updates
>> to take effect, they have to use a CR3 reload. Therefore there's no need
>> to issue a paging structure invalidating TLB flush in this case.
>
> I assume it's fine for the Xen linear page tables to likely be out of
> sync during the window between the entry update and the CR3 reload?

Yes, because ...

> I wonder, won't something similar also apply to 64bit and L4 entries?

... unlike 64-bit paging, PAE paging special cases the treatment of the
4 top level table entries. On bare metal they get loaded by the CPU upon
CR3 load, not when walking page tables.

Jan
On Mon, Jan 11, 2021 at 03:28:23PM +0100, Jan Beulich wrote:
> On 11.01.2021 15:23, Roger Pau Monné wrote:
>> On Tue, Nov 03, 2020 at 11:58:16AM +0100, Jan Beulich wrote:
>>> 32-bit guests may not depend upon the side effect of using ordinary
>>> 4-level paging when running on a 64-bit hypervisor. For L3 entry updates
>>> to take effect, they have to use a CR3 reload. Therefore there's no need
>>> to issue a paging structure invalidating TLB flush in this case.
>>
>> I assume it's fine for the Xen linear page tables to likely be out of
>> sync during the window between the entry update and the CR3 reload?
>
> Yes, because ...
>
>> I wonder, won't something similar also apply to 64bit and L4 entries?
>
> ... unlike 64-bit paging, PAE paging special cases the treatment of the
> 4 top level table entries. On bare metal they get loaded by the CPU upon
> CR3 load, not when walking page tables.

I wouldn't mind having this added to the commit message. In any case:

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.
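[Editor's illustration — not part of the thread.] Jan's point about PAE can be captured in a toy model: under PAE the four top-level entries (PDPTEs) are latched into the CPU at CR3-load time, so a memory update to them is invisible to walks until the next reload, whereas 4-level paging reads the top level from memory on every walk that misses the caches. All names below are invented for illustration; this is not Xen code:

```c
#include <assert.h>
#include <stdint.h>

#define NTOP 4

struct cpu_model {
    uint64_t *cr3;          /* in-memory top-level table */
    uint64_t  pdpte[NTOP];  /* PAE: copies latched at CR3 load */
};

/* Loading CR3 latches the top-level entries (PAE behavior). */
static void load_cr3(struct cpu_model *c, uint64_t *table)
{
    c->cr3 = table;
    for ( unsigned int i = 0; i < NTOP; i++ )
        c->pdpte[i] = table[i];
}

/* PAE walk consults the latched registers... */
static uint64_t walk_top_pae(const struct cpu_model *c, unsigned int i)
{
    return c->pdpte[i];
}

/* ...while a 4-level walk reads the table in memory. */
static uint64_t walk_top_4level(const struct cpu_model *c, unsigned int i)
{
    return c->cr3[i];
}
```

This is why a 32-bit PV guest must reload CR3 for an L3 update to take effect, and why Xen can then skip the paging-structure-invalidating TLB flush for such updates.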
On 03.11.2020 11:54, Jan Beulich wrote:
> Especially the latter three patches provide only small possible
> gains, from all I can tell. I nevertheless wanted to put up the
> entire set for consideration.
>
> 1: consistently inline {,un}adjust_guest_l<N>e()
> 2: fold redundant calls to adjust_guest_l<N>e()
> 3: _PAGE_RW changes may take fast path of mod_l[234]_entry()
> 4: restrict TLB flushing after mod_l[234]_entry()
> 5: avoid TLB flushing after mod_l3_entry()

Ping?

Jan