[PATCH] x86/ept: simplify detection of special pages for EMT calculation

[PATCH] x86/ept: simplify detection of special pages for EMT calculation
Posted by Roger Pau Monne 1 year, 7 months ago
The current way to detect whether a page handed to
epte_get_entry_emt() is special and needs a forced write-back cache
attribute involves iterating over all the smaller 4K pages for
superpages.

Such a loop consumes a significant amount of CPU time for 1GiB pages
(order 18): on a Xeon® Silver 4216 (Cascade Lake) at 2GHz it takes
1.5ms on average.  Note that this figure just accounts
for the is_special_page() loop, and not the whole code of
epte_get_entry_emt().  Also the resolve_misconfig() operation that
calls into epte_get_entry_emt() is done while holding the p2m lock in
write (exclusive) mode, which blocks concurrent EPT_MISCONFIG faults
and prevents most guest hypercalls from progressing due to the need to
take the p2m lock in read mode to access any guest provided hypercall
buffers.

Simplify the checking in epte_get_entry_emt() and remove the loop,
assuming that no superpage will be only partially special.

So far we have no special superpages added to the guest p2m, and in
any case the forcing of the write-back cache attribute is a courtesy
to the guest to avoid such ranges being accessed as uncached when not
really needed.  It's not acceptable for such assistance to tax the
system so badly.

Fixes: 60d1adfa18 ('x86/ept: fix shattering of special pages')
Fixes: ca24b2ffdb ('x86/hvm: set 'ipat' in EPT for special pages')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/mm/p2m-ept.c | 24 ++++++++----------------
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index b4919bad51..d0e1c31612 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -491,7 +491,6 @@ int epte_get_entry_emt(struct domain *d, gfn_t gfn, mfn_t mfn,
 {
     int gmtrr_mtype, hmtrr_mtype;
     struct vcpu *v = current;
-    unsigned long i, special_pgs;
 
     *ipat = false;
 
@@ -518,26 +517,19 @@ int epte_get_entry_emt(struct domain *d, gfn_t gfn, mfn_t mfn,
         return MTRR_TYPE_UNCACHABLE;
     }
 
-    if ( type != p2m_mmio_direct && !is_iommu_enabled(d) &&
-         !cache_flush_permitted(d) )
+    if ( (type != p2m_mmio_direct && !is_iommu_enabled(d) &&
+          !cache_flush_permitted(d)) ||
+         /*
+          * Assume the whole page to be special if the first 4K chunk is:
+          * iterating over all possible 4K sub-pages for higher order pages is
+          * too expensive.
+          */
+         is_special_page(mfn_to_page(mfn)) )
     {
         *ipat = true;
         return MTRR_TYPE_WRBACK;
     }
 
-    for ( special_pgs = i = 0; i < (1ul << order); i++ )
-        if ( is_special_page(mfn_to_page(mfn_add(mfn, i))) )
-            special_pgs++;
-
-    if ( special_pgs )
-    {
-        if ( special_pgs != (1ul << order) )
-            return -1;
-
-        *ipat = true;
-        return MTRR_TYPE_WRBACK;
-    }
-
     switch ( type )
     {
     case p2m_mmio_direct:
-- 
2.37.3


Re: [PATCH] x86/ept: simplify detection of special pages for EMT calculation
Posted by Jan Beulich 1 year, 7 months ago
On 23.09.2022 12:56, Roger Pau Monne wrote:
> The current way to detect whether a page handed to
> epte_get_entry_emt() is special and needs a forced write-back cache
> attribute involves iterating over all the smaller 4K pages for
> superpages.
> 
> Such loop consumes a high amount of CPU time for 1GiB pages (order
> 18): on a Xeon® Silver 4216 (Cascade Lake) at 2GHz this takes an
> average amount of time of 1.5ms.  Note that this figure just accounts
> for the is_special_page() loop, and not the whole code of
> epte_get_entry_emt().  Also the resolve_misconfig() operation that
> calls into epte_get_entry_emt() is done while holding the p2m lock in
> write (exclusive) mode, which blocks concurrent EPT_MISCONFIG faults
> and prevents most guest hypercalls from progressing due to the need to
> take the p2m lock in read mode to access any guest provided hypercall
> buffers.
> 
> Simplify the checking in epte_get_entry_emt() and remove the loop,
> assuming that no superpage will be only partially special.
> 
> So far we have no special superpages added to the guest p2m,

We may not be adding them as superpages, but what a guest makes of
the pages it is given access to for e.g. grant handling, or what Dom0
makes of e.g. the (per-CPU) trace buffers is unknown. And I guess
Dom0 ending up with a non-WB mapping of the trace buffers might
impact tracing quite a bit. I don't think we can rely on guests never
making any such pages the subject of a large-range mapping attempt, which
might end up suitable for a superpage mapping (recall that rather
sooner than later we ought to finally re-combine suitable ranges of
contiguous 4k mappings into 2M ones, just like we [now] do in IOMMU
code).

Since for data structures like the ones named above 2M mappings
might be enough (i.e. there might be little "risk" of even needing to
go to 1G ones), could we maybe take a "middle" approach and check all
pages when order == 9, but use your approach for higher orders? The
to-be-added re-coalescing would then need to be taught to refuse
re-coalescing of such ranges to larger than 2M mappings, while still
at least allowing for 2M ones. (Special casing at that boundary is
going to be necessary also for shadow code, at the very least.) But
see also below as to caveats.

> and in
> any case the forcing of the write-back cache attribute is a courtesy
> to the guest to avoid such ranges being accessed as uncached when not
> really needed.  It's not acceptable for such assistance to tax the
> system so badly.

I agree we would better improve the situation, but I don't think we
can do so by ...

> @@ -518,26 +517,19 @@ int epte_get_entry_emt(struct domain *d, gfn_t gfn, mfn_t mfn,
>          return MTRR_TYPE_UNCACHABLE;
>      }
>  
> -    if ( type != p2m_mmio_direct && !is_iommu_enabled(d) &&
> -         !cache_flush_permitted(d) )
> +    if ( (type != p2m_mmio_direct && !is_iommu_enabled(d) &&
> +          !cache_flush_permitted(d)) ||
> +         /*
> +          * Assume the whole page to be special if the first 4K chunk is:
> +          * iterating over all possible 4K sub-pages for higher order pages is
> +          * too expensive.
> +          */
> +         is_special_page(mfn_to_page(mfn)) )

... building in assumptions like this one. The more that here you may
also produce too weak a memory type (think of a later page in the range
requiring a stronger-ordered memory type).

While it may not help much, ...

>      {
>          *ipat = true;
>          return MTRR_TYPE_WRBACK;
>      }
>  
> -    for ( special_pgs = i = 0; i < (1ul << order); i++ )
> -        if ( is_special_page(mfn_to_page(mfn_add(mfn, i))) )
> -            special_pgs++;
> -
> -    if ( special_pgs )
> -    {
> -        if ( special_pgs != (1ul << order) )
> -            return -1;
> -
> -        *ipat = true;
> -        return MTRR_TYPE_WRBACK;
> -    }

... this logic could be improved to at least bail from the loop once it's
clear that the "-1" return path will be taken. Improvements beyond that
would likely involve adding some data structure (rangeset?) to track
special pages.

Jan

Re: [PATCH] x86/ept: simplify detection of special pages for EMT calculation
Posted by Roger Pau Monné 1 year, 7 months ago
On Mon, Sep 26, 2022 at 10:38:40AM +0200, Jan Beulich wrote:
> On 23.09.2022 12:56, Roger Pau Monne wrote:
> > The current way to detect whether a page handed to
> > epte_get_entry_emt() is special and needs a forced write-back cache
> > attribute involves iterating over all the smaller 4K pages for
> > superpages.
> > 
> > Such loop consumes a high amount of CPU time for 1GiB pages (order
> > 18): on a Xeon® Silver 4216 (Cascade Lake) at 2GHz this takes an
> > average amount of time of 1.5ms.  Note that this figure just accounts
> > for the is_special_page() loop, and not the whole code of
> > epte_get_entry_emt().  Also the resolve_misconfig() operation that
> > calls into epte_get_entry_emt() is done while holding the p2m lock in
> > write (exclusive) mode, which blocks concurrent EPT_MISCONFIG faults
> > and prevents most guest hypercalls for progressing due to the need to
> > take the p2m lock in read mode to access any guest provided hypercall
> > buffers.
> > 
> > Simplify the checking in epte_get_entry_emt() and remove the loop,
> > assuming that no superpage will be only partially special.
> > 
> > So far we have no special superpages added to the guest p2m,
> 
> We may not be adding them as superpages, but what a guest makes of
> the pages it is given access to for e.g. grant handling, or what Dom0
> makes of e.g. the (per-CPU) trace buffers is unknown. And I guess
> Dom0 ending up with a non-WB mapping of the trace buffers might
> impact tracing quite a bit. I don't think we can build on guests not
> making any such the subject of a large-range mapping attempt, which
> might end up suitable for a superpage mapping (recall that rather
> sooner than later we ought to finally re-combine suitable ranges of
> contiguous 4k mappings into 2M ones, just like we [now] do in IOMMU
> code).

Hm, don't pages used for grant handling (XENMAPSPACE_grant_table)
end up mapped as 4K entries in the p2m page tables?  The
code in xenmem_add_to_physmap_one() seems to remove and re-add them
with order 0.  The same goes for the trace buffers: they are added as
order 0 to the p2m.

Note that when coalescing we would need to be careful then to not
coalesce special pages.

This might not be the best model, though, because I'm not sure why we
require XENMAPSPACE_grant_table to force entries to not be mapped as
part of a superpage in the guest p2m.

> Since for data structures like the ones named above 2M mappings
> might be enough (i.e. there might be little "risk" of even needing to
> go to 1G ones), could we maybe take a "middle" approach and check all
> pages when order == 9, but use your approach for higher orders? The
> to-be-added re-coalescing would then need to be taught to refuse
> re-coalescing of such ranges to larger than 2M mappings, while still
> at least allowing for 2M ones. (Special casing at that boundary is
> going to be necessary also for shadow code, at the very least.) But
> see also below as to caveats.

I guess a rangeset would be more future proof than anything else.

> > and in
> > any case the forcing of the write-back cache attribute is a courtesy
> > to the guest to avoid such ranges being accessed as uncached when not
> > really needed.  It's not acceptable for such assistance to tax the
> > system so badly.
> 
> I agree we would better improve the situation, but I don't think we
> can do so by ...
> 
> > @@ -518,26 +517,19 @@ int epte_get_entry_emt(struct domain *d, gfn_t gfn, mfn_t mfn,
> >          return MTRR_TYPE_UNCACHABLE;
> >      }
> >  
> > -    if ( type != p2m_mmio_direct && !is_iommu_enabled(d) &&
> > -         !cache_flush_permitted(d) )
> > +    if ( (type != p2m_mmio_direct && !is_iommu_enabled(d) &&
> > +          !cache_flush_permitted(d)) ||
> > +         /*
> > +          * Assume the whole page to be special if the first 4K chunk is:
> > +          * iterating over all possible 4K sub-pages for higher order pages is
> > +          * too expensive.
> > +          */
> > +         is_special_page(mfn_to_page(mfn)) )
> 
> ... building in assumptions like this one. The more that here you may
> also produce too weak a memory type (think of a later page in the range
> requiring a stronger-ordered memory type).
> 
> While it may not help much, ...
> 
> >      {
> >          *ipat = true;
> >          return MTRR_TYPE_WRBACK;
> >      }
> >  
> > -    for ( special_pgs = i = 0; i < (1ul << order); i++ )
> > -        if ( is_special_page(mfn_to_page(mfn_add(mfn, i))) )
> > -            special_pgs++;
> > -
> > -    if ( special_pgs )
> > -    {
> > -        if ( special_pgs != (1ul << order) )
> > -            return -1;
> > -
> > -        *ipat = true;
> > -        return MTRR_TYPE_WRBACK;
> > -    }
> 
> ... this logic could be improved to at least bail from the loop once it's
> clear that the "-1" return path will be taken. Improvements beyond that
> would likely involve adding some data structure (rangeset?) to track
> special pages.

For the guest I was running the loop didn't find any special pages in
order 18 mappings, which are the most troublesome to handle in the
loop.  I'm not sure bailing early would make that much of a difference
in practice TBH.

I did also consider using a rangeset, but that would use more
per-domain memory and also require coordination to add special pages
to it.

Thanks, Roger.

Re: [PATCH] x86/ept: simplify detection of special pages for EMT calculation
Posted by Jan Beulich 1 year, 7 months ago
On 26.09.2022 16:27, Roger Pau Monné wrote:
> On Mon, Sep 26, 2022 at 10:38:40AM +0200, Jan Beulich wrote:
>> On 23.09.2022 12:56, Roger Pau Monne wrote:
>>> The current way to detect whether a page handed to
>>> epte_get_entry_emt() is special and needs a forced write-back cache
>>> attribute involves iterating over all the smaller 4K pages for
>>> superpages.
>>>
>>> Such loop consumes a high amount of CPU time for 1GiB pages (order
>>> 18): on a Xeon® Silver 4216 (Cascade Lake) at 2GHz this takes an
>>> average amount of time of 1.5ms.  Note that this figure just accounts
>>> for the is_special_page() loop, and not the whole code of
>>> epte_get_entry_emt().  Also the resolve_misconfig() operation that
>>> calls into epte_get_entry_emt() is done while holding the p2m lock in
>>> write (exclusive) mode, which blocks concurrent EPT_MISCONFIG faults
>>> and prevents most guest hypercalls from progressing due to the need to
>>> take the p2m lock in read mode to access any guest provided hypercall
>>> buffers.
>>>
>>> Simplify the checking in epte_get_entry_emt() and remove the loop,
>>> assuming that no superpage will be only partially special.
>>>
>>> So far we have no special superpages added to the guest p2m,
>>
>> We may not be adding them as superpages, but what a guest makes of
>> the pages it is given access to for e.g. grant handling, or what Dom0
>> makes of e.g. the (per-CPU) trace buffers is unknown. And I guess
>> Dom0 ending up with a non-WB mapping of the trace buffers might
>> impact tracing quite a bit. I don't think we can build on guests not
>> making any such the subject of a large-range mapping attempt, which
>> might end up suitable for a superpage mapping (recall that rather
>> sooner than later we ought to finally re-combine suitable ranges of
>> contiguous 4k mappings into 2M ones, just like we [now] do in IOMMU
>> code).
> 
> Hm, don't pages used for grant handling (XENMAPSPACE_grant_table)
> end up mapped as 4K entries in the p2m page tables?  The
> code in xenmem_add_to_physmap_one() seems to remove and re-add them
> with order 0. Same with the trace buffers, they are added as order 0
> to the p2m.

Indeed. I was halfway through writing the earlier response when
recalling that aspect; I may not have succeeded in adjusting the text
to properly convey that the concern is applicable only to future code,
not what we have right now.

> Note that when coalescing we would need to be careful then to not
> coalesce special pages.

Well, no, ...

> Might not be the best model because I'm not sure why we require
> XENMAPSPACE_grant_table to force entries to not be mapped as part of a
> super page in the guest p2m.

... as you say here, there may actually be benefits from allowing such
(re-)coalescing.

>> Since for data structures like the ones named above 2M mappings
>> might be enough (i.e. there might be little "risk" of even needing to
>> go to 1G ones), could we maybe take a "middle" approach and check all
>> pages when order == 9, but use your approach for higher orders? The
>> to-be-added re-coalescing would then need to be taught to refuse
>> re-coalescing of such ranges to larger than 2M mappings, while still
>> at least allowing for 2M ones. (Special casing at that boundary is
>> going to be necessary also for shadow code, at the very least.) But
>> see also below as to caveats.
> 
> I guess a rangeset would be more future proof than anything else.
> 
>>> and in
>>> any case the forcing of the write-back cache attribute is a courtesy
>>> to the guest to avoid such ranges being accessed as uncached when not
>>> really needed.  It's not acceptable for such assistance to tax the
>>> system so badly.
>>
>> I agree we would better improve the situation, but I don't think we
>> can do so by ...
>>
>>> @@ -518,26 +517,19 @@ int epte_get_entry_emt(struct domain *d, gfn_t gfn, mfn_t mfn,
>>>          return MTRR_TYPE_UNCACHABLE;
>>>      }
>>>  
>>> -    if ( type != p2m_mmio_direct && !is_iommu_enabled(d) &&
>>> -         !cache_flush_permitted(d) )
>>> +    if ( (type != p2m_mmio_direct && !is_iommu_enabled(d) &&
>>> +          !cache_flush_permitted(d)) ||
>>> +         /*
>>> +          * Assume the whole page to be special if the first 4K chunk is:
>>> +          * iterating over all possible 4K sub-pages for higher order pages is
>>> +          * too expensive.
>>> +          */
>>> +         is_special_page(mfn_to_page(mfn)) )
>>
>> ... building in assumptions like this one. The more that here you may
>> also produce too weak a memory type (think of a later page in the range
>> requiring a stronger-ordered memory type).
>>
>> While it may not help much, ...
>>
>>>      {
>>>          *ipat = true;
>>>          return MTRR_TYPE_WRBACK;
>>>      }
>>>  
>>> -    for ( special_pgs = i = 0; i < (1ul << order); i++ )
>>> -        if ( is_special_page(mfn_to_page(mfn_add(mfn, i))) )
>>> -            special_pgs++;
>>> -
>>> -    if ( special_pgs )
>>> -    {
>>> -        if ( special_pgs != (1ul << order) )
>>> -            return -1;
>>> -
>>> -        *ipat = true;
>>> -        return MTRR_TYPE_WRBACK;
>>> -    }
>>
>> ... this logic could be improved to at least bail from the loop once it's
>> clear that the "-1" return path will be taken. Improvements beyond that
>> would likely involve adding some data structure (rangeset?) to track
>> special pages.
> 
> For the guest I was running the loop didn't find any special pages in
> order 18 mappings, which are the most troublesome to handle in the
> loop.  I'm not sure bailing early would make that much of a difference
> in practice TBH.

As said - it may not help much.

> I did also consider using a rangeset, but that would use more
> per-domain memory and also require coordination to add special pages
> to it.

"Special" is a global property, so I don't think this would need to be a
per-domain data structure.

Jan