[PATCH v2 00/17] xvmalloc() / x86 xstate area / x86 CPUID / AMX beginnings
Posted by Jan Beulich 3 years, 4 months ago
While the sub-groups may seem somewhat unrelated, there are inter-
dependencies (logical, functional, or contextual). The majority of
the patches are new in v2; one previously standalone patch has been
integrated here.

01: mm: check for truncation in vmalloc_type()
02: mm: introduce xvmalloc() et al and use for grant table allocations
03: x86/xstate: use xvzalloc() for save area allocation
04: x86/xstate: re-size save area when CPUID policy changes
05: x86/xstate: re-use valid_xcr0() for boot-time checks
06: x86/xstate: drop xstate_offsets[] and xstate_sizes[]
07: x86/xstate: replace xsave_cntxt_size and drop XCNTXT_MASK
08: x86/xstate: avoid accounting for unsupported components
09: x86: use xvmalloc() for extended context buffer allocations
10: x86/xstate: enable AMX components
11: x86/CPUID: adjust extended leaves out of range clearing
12: x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents
13: x86/CPUID: move bounding of max_{,sub}leaf fields to library code
14: x86/CPUID: enable AMX leaves
15: x86emul: introduce X86EMUL_FPU_tile
16: x86emul: support TILERELEASE
17: x86emul: support {LD,ST}TILECFG

Jan

[PATCH v2 01/17] mm: check for truncation in vmalloc_type()
Posted by Jan Beulich 3 years, 4 months ago
While it's currently implied by the checking xmalloc_array() does,
let's make this more explicit in the function itself. As a result,
neither of the involved local variables needs to be of type size_t
anymore. This brings them in line with the rest of the code in this
file.

Requested-by: Julien Grall <julien@xen.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/xen/common/vmap.c
+++ b/xen/common/vmap.c
@@ -242,13 +242,15 @@ void vunmap(const void *va)
 static void *vmalloc_type(size_t size, enum vmap_region type)
 {
     mfn_t *mfn;
-    size_t pages, i;
+    unsigned int i, pages = PFN_UP(size);
     struct page_info *pg;
     void *va;
 
     ASSERT(size);
 
-    pages = PFN_UP(size);
+    if ( PFN_DOWN(size) > pages )
+        return NULL;
+
     mfn = xmalloc_array(mfn_t, pages);
     if ( mfn == NULL )
         return NULL;


Re: [PATCH v2 01/17] mm: check for truncation in vmalloc_type()
Posted by Julien Grall 3 years, 4 months ago

On 23/11/2020 14:23, Jan Beulich wrote:
> While it's currently implied from the checking xmalloc_array() does,
> let's make this more explicit in the function itself. As a result both
> involved local variables don't need to have size_t type anymore. This
> brings them in line with the rest of the code in this file.
> 
> Requested-by: Julien Grall <julien@xen.org>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Julien Grall <jgrall@amazon.com>

Cheers,

-- 
Julien Grall

[PATCH v2 02/17] mm: introduce xvmalloc() et al and use for grant table allocations
Posted by Jan Beulich 3 years, 4 months ago
All of the array allocations in grant_table_init() can exceed a page's
worth of memory, which xmalloc()-based interfaces aren't really suitable
for after boot. We also don't need any of these allocations to be
physically contiguous. Introduce interfaces dynamically switching
between xmalloc() et al and vmalloc() et al, based on requested size,
and use them instead.

All the wrappers in the new header get cloned mostly verbatim from
xmalloc.h, with the sole adjustment to switch unsigned long to size_t
for sizes and to unsigned int for alignments.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Actually edit a copy-and-pasted comment in xvmalloc.h which was
    meant to be edited from the beginning.
---
I'm unconvinced of the mentioning of "physically contiguous" in the
comment at the top of the new header: I don't think xmalloc() provides
such a guarantee. Any use assuming so would look (latently) broken to
me.

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -37,7 +37,7 @@
 #include <xen/iommu.h>
 #include <xen/paging.h>
 #include <xen/keyhandler.h>
-#include <xen/vmap.h>
+#include <xen/xvmalloc.h>
 #include <xen/nospec.h>
 #include <xsm/xsm.h>
 #include <asm/flushtlb.h>
@@ -1925,27 +1925,28 @@ int grant_table_init(struct domain *d, i
     d->grant_table = gt;
 
     /* Active grant table. */
-    gt->active = xzalloc_array(struct active_grant_entry *,
-                               max_nr_active_grant_frames(gt));
+    gt->active = xvzalloc_array(struct active_grant_entry *,
+                                max_nr_active_grant_frames(gt));
     if ( gt->active == NULL )
         goto out;
 
     /* Tracking of mapped foreign frames table */
     if ( gt->max_maptrack_frames )
     {
-        gt->maptrack = vzalloc(gt->max_maptrack_frames * sizeof(*gt->maptrack));
+        gt->maptrack = xvzalloc_array(struct grant_mapping *,
+                                      gt->max_maptrack_frames);
         if ( gt->maptrack == NULL )
             goto out;
     }
 
     /* Shared grant table. */
-    gt->shared_raw = xzalloc_array(void *, gt->max_grant_frames);
+    gt->shared_raw = xvzalloc_array(void *, gt->max_grant_frames);
     if ( gt->shared_raw == NULL )
         goto out;
 
     /* Status pages for grant table - for version 2 */
-    gt->status = xzalloc_array(grant_status_t *,
-                               grant_to_status_frames(gt->max_grant_frames));
+    gt->status = xvzalloc_array(grant_status_t *,
+                                grant_to_status_frames(gt->max_grant_frames));
     if ( gt->status == NULL )
         goto out;
 
@@ -3870,19 +3871,19 @@ grant_table_destroy(
 
     for ( i = 0; i < nr_grant_frames(t); i++ )
         free_xenheap_page(t->shared_raw[i]);
-    xfree(t->shared_raw);
+    xvfree(t->shared_raw);
 
     for ( i = 0; i < nr_maptrack_frames(t); i++ )
         free_xenheap_page(t->maptrack[i]);
-    vfree(t->maptrack);
+    xvfree(t->maptrack);
 
     for ( i = 0; i < nr_active_grant_frames(t); i++ )
         free_xenheap_page(t->active[i]);
-    xfree(t->active);
+    xvfree(t->active);
 
     for ( i = 0; i < nr_status_frames(t); i++ )
         free_xenheap_page(t->status[i]);
-    xfree(t->status);
+    xvfree(t->status);
 
     xfree(t);
     d->grant_table = NULL;
--- a/xen/common/vmap.c
+++ b/xen/common/vmap.c
@@ -7,6 +7,7 @@
 #include <xen/spinlock.h>
 #include <xen/types.h>
 #include <xen/vmap.h>
+#include <xen/xvmalloc.h>
 #include <asm/page.h>
 
 static DEFINE_SPINLOCK(vm_lock);
@@ -301,11 +302,29 @@ void *vzalloc(size_t size)
     return p;
 }
 
-void vfree(void *va)
+static void _vfree(const void *va, unsigned int pages, enum vmap_region type)
 {
-    unsigned int i, pages;
+    unsigned int i;
     struct page_info *pg;
     PAGE_LIST_HEAD(pg_list);
+
+    ASSERT(pages);
+
+    for ( i = 0; i < pages; i++ )
+    {
+        pg = vmap_to_page(va + i * PAGE_SIZE);
+        ASSERT(pg);
+        page_list_add(pg, &pg_list);
+    }
+    vunmap(va);
+
+    while ( (pg = page_list_remove_head(&pg_list)) != NULL )
+        free_domheap_page(pg);
+}
+
+void vfree(void *va)
+{
+    unsigned int pages;
     enum vmap_region type = VMAP_DEFAULT;
 
     if ( !va )
@@ -317,18 +336,71 @@ void vfree(void *va)
         type = VMAP_XEN;
         pages = vm_size(va, type);
     }
-    ASSERT(pages);
 
-    for ( i = 0; i < pages; i++ )
+    _vfree(va, pages, type);
+}
+
+void xvfree(void *va)
+{
+    unsigned int pages = vm_size(va, VMAP_DEFAULT);
+
+    if ( pages )
+        _vfree(va, pages, VMAP_DEFAULT);
+    else
+        xfree(va);
+}
+
+void *_xvmalloc(size_t size, unsigned int align)
+{
+    ASSERT(align <= PAGE_SIZE);
+    return size <= PAGE_SIZE ? _xmalloc(size, align) : vmalloc(size);
+}
+
+void *_xvzalloc(size_t size, unsigned int align)
+{
+    ASSERT(align <= PAGE_SIZE);
+    return size <= PAGE_SIZE ? _xzalloc(size, align) : vzalloc(size);
+}
+
+void *_xvrealloc(void *va, size_t size, unsigned int align)
+{
+    size_t pages = vm_size(va, VMAP_DEFAULT);
+    void *ptr;
+
+    ASSERT(align <= PAGE_SIZE);
+
+    if ( !pages )
     {
-        struct page_info *page = vmap_to_page(va + i * PAGE_SIZE);
+        if ( size <= PAGE_SIZE )
+            return _xrealloc(va, size, align);
 
-        ASSERT(page);
-        page_list_add(page, &pg_list);
+        ptr = vmalloc(size);
+        if ( ptr && va && va != ZERO_BLOCK_PTR )
+        {
+            /*
+             * xmalloc-based allocations up to PAGE_SIZE don't cross page
+             * boundaries. Therefore, without needing to know the exact
+             * prior allocation size, simply copy the entire tail of the
+             * page containing the earlier allocation.
+             */
+            memcpy(ptr, va, PAGE_SIZE - PAGE_OFFSET(va));
+            xfree(va);
+        }
+    }
+    else if ( pages == PFN_UP(size) )
+        ptr = va;
+    else
+    {
+        ptr = _xvmalloc(size, align);
+        if ( ptr )
+        {
+            memcpy(ptr, va, min(size, pages << PAGE_SHIFT));
+            vfree(va);
+        }
+        else if ( pages > PFN_UP(size) )
+            ptr = va;
     }
-    vunmap(va);
 
-    while ( (pg = page_list_remove_head(&pg_list)) != NULL )
-        free_domheap_page(pg);
+    return ptr;
 }
 #endif
--- /dev/null
+++ b/xen/include/xen/xvmalloc.h
@@ -0,0 +1,73 @@
+
+#ifndef __XVMALLOC_H__
+#define __XVMALLOC_H__
+
+#include <xen/cache.h>
+#include <xen/types.h>
+
+/*
+ * Xen malloc/free-style interface for allocations possibly exceeding a page's
+ * worth of memory, as long as there's no need to have physically contiguous
+ * memory allocated.  These should be used in preference to xmalloc() et al
+ * whenever the size is not known to be constrained to at most a single page.
+ */
+
+/* Allocate space for typed object. */
+#define xvmalloc(_type) ((_type *)_xvmalloc(sizeof(_type), __alignof__(_type)))
+#define xvzalloc(_type) ((_type *)_xvzalloc(sizeof(_type), __alignof__(_type)))
+
+/* Allocate space for array of typed objects. */
+#define xvmalloc_array(_type, _num) \
+    ((_type *)_xvmalloc_array(sizeof(_type), __alignof__(_type), _num))
+#define xvzalloc_array(_type, _num) \
+    ((_type *)_xvzalloc_array(sizeof(_type), __alignof__(_type), _num))
+
+/* Allocate space for a structure with a flexible array of typed objects. */
+#define xvzalloc_flex_struct(type, field, nr) \
+    ((type *)_xvzalloc(offsetof(type, field[nr]), __alignof__(type)))
+
+#define xvmalloc_flex_struct(type, field, nr) \
+    ((type *)_xvmalloc(offsetof(type, field[nr]), __alignof__(type)))
+
+/* Re-allocate space for a structure with a flexible array of typed objects. */
+#define xvrealloc_flex_struct(ptr, field, nr)                          \
+    ((typeof(ptr))_xvrealloc(ptr, offsetof(typeof(*(ptr)), field[nr]), \
+                             __alignof__(typeof(*(ptr)))))
+
+/* Allocate untyped storage. */
+#define xvmalloc_bytes(_bytes) _xvmalloc(_bytes, SMP_CACHE_BYTES)
+#define xvzalloc_bytes(_bytes) _xvzalloc(_bytes, SMP_CACHE_BYTES)
+
+/* Free any of the above. */
+extern void xvfree(void *);
+
+/* Free an allocation, and zero the pointer to it. */
+#define XVFREE(p) do { \
+    xvfree(p);         \
+    (p) = NULL;        \
+} while ( false )
+
+/* Underlying functions */
+extern void *_xvmalloc(size_t size, unsigned int align);
+extern void *_xvzalloc(size_t size, unsigned int align);
+extern void *_xvrealloc(void *ptr, size_t size, unsigned int align);
+
+static inline void *_xvmalloc_array(
+    size_t size, unsigned int align, unsigned long num)
+{
+    /* Check for overflow. */
+    if ( size && num > UINT_MAX / size )
+        return NULL;
+    return _xvmalloc(size * num, align);
+}
+
+static inline void *_xvzalloc_array(
+    size_t size, unsigned int align, unsigned long num)
+{
+    /* Check for overflow. */
+    if ( size && num > UINT_MAX / size )
+        return NULL;
+    return _xvzalloc(size * num, align);
+}
+
+#endif /* __XVMALLOC_H__ */
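
To illustrate how the new interfaces are meant to be used outside the grant
table code, here is a hypothetical caller (a sketch only; "struct foo",
"nr_entries", and the function names are invented, while the xv*() calls are
the ones introduced above).

#include <xen/errno.h>
#include <xen/xvmalloc.h>

/* Hypothetical example type; not part of the patch. */
struct foo {
    unsigned long id;
};

static struct foo *table;

static int alloc_table(unsigned int nr_entries)
{
    /* Transparently falls back to vzalloc() once the array exceeds a page. */
    table = xvzalloc_array(struct foo, nr_entries);

    return table ? 0 : -ENOMEM;
}

static void free_table(void)
{
    /* Frees via xfree() or vfree(), whichever matches the allocation. */
    XVFREE(table);
}

Callers thus don't need to know at which size the switch between the two
underlying allocators happens.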


Re: [PATCH v2 02/17] mm: introduce xvmalloc() et al and use for grant table allocations
Posted by Julien Grall 3 years, 4 months ago
Hi Jan,

On 23/11/2020 14:23, Jan Beulich wrote:
> All of the array allocations in grant_table_init() can exceed a page's
> worth of memory, which xmalloc()-based interfaces aren't really suitable
> for after boot. We also don't need any of these allocations to be
> physically contiguous.. Introduce interfaces dynamically switching
> between xmalloc() et al and vmalloc() et al, based on requested size,
> and use them instead.
> 
> All the wrappers in the new header get cloned mostly verbatim from
> xmalloc.h, with the sole adjustment to switch unsigned long to size_t
> for sizes and to unsigned int for alignments.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: Actually edit a copy-and-pasted comment in xvmalloc.h which was
>      meant to be edited from the beginning.
> ---
> I'm unconvinced of the mentioning of "physically contiguous" in the
> comment at the top of the new header: I don't think xmalloc() provides
> such a guarantee. Any use assuming so would look (latently) broken to
> me.

I haven't had the chance to reply to the first version about this. So I 
will reply here to avoid confusion.

I can at least spot one user in Arm that would use xmalloc() that way 
(see the allocation of itt_addr in arch/arm/gic-v3-its.c).

AFAIK, the memory is for the sole purpose of the ITS and should not be 
accessed by Xen. So I think we can replace it with a new version of 
alloc_domheap_pages().

However, I still question the usefulness of introducing yet another way 
to allocate memory (we already have alloc_xenheap_pages(), xmalloc(), 
alloc_domheap_pages(), vmap()) if you think users cannot rely on 
xmalloc() to allocate memory physically contiguous.

It definitely makes it more difficult to figure out when to use xmalloc() 
vs xvalloc().

I would like to hear an opinion from the other maintainers.

> 
> --- a/xen/common/grant_table.c
> +++ b/xen/common/grant_table.c
> @@ -37,7 +37,7 @@
>   #include <xen/iommu.h>
>   #include <xen/paging.h>
>   #include <xen/keyhandler.h>
> -#include <xen/vmap.h>
> +#include <xen/xvmalloc.h>
>   #include <xen/nospec.h>
>   #include <xsm/xsm.h>
>   #include <asm/flushtlb.h>
> @@ -1925,27 +1925,28 @@ int grant_table_init(struct domain *d, i
>       d->grant_table = gt;
>   
>       /* Active grant table. */
> -    gt->active = xzalloc_array(struct active_grant_entry *,
> -                               max_nr_active_grant_frames(gt));
> +    gt->active = xvzalloc_array(struct active_grant_entry *,
> +                                max_nr_active_grant_frames(gt));
>       if ( gt->active == NULL )
>           goto out;
>   
>       /* Tracking of mapped foreign frames table */
>       if ( gt->max_maptrack_frames )
>       {
> -        gt->maptrack = vzalloc(gt->max_maptrack_frames * sizeof(*gt->maptrack));
> +        gt->maptrack = xvzalloc_array(struct grant_mapping *,
> +                                      gt->max_maptrack_frames);
>           if ( gt->maptrack == NULL )
>               goto out;
>       }
>   
>       /* Shared grant table. */
> -    gt->shared_raw = xzalloc_array(void *, gt->max_grant_frames);
> +    gt->shared_raw = xvzalloc_array(void *, gt->max_grant_frames);
>       if ( gt->shared_raw == NULL )
>           goto out;
>   
>       /* Status pages for grant table - for version 2 */
> -    gt->status = xzalloc_array(grant_status_t *,
> -                               grant_to_status_frames(gt->max_grant_frames));
> +    gt->status = xvzalloc_array(grant_status_t *,
> +                                grant_to_status_frames(gt->max_grant_frames));
>       if ( gt->status == NULL )
>           goto out;
>   
> @@ -3870,19 +3871,19 @@ grant_table_destroy(
>   
>       for ( i = 0; i < nr_grant_frames(t); i++ )
>           free_xenheap_page(t->shared_raw[i]);
> -    xfree(t->shared_raw);
> +    xvfree(t->shared_raw);
>   
>       for ( i = 0; i < nr_maptrack_frames(t); i++ )
>           free_xenheap_page(t->maptrack[i]);
> -    vfree(t->maptrack);
> +    xvfree(t->maptrack);
>   
>       for ( i = 0; i < nr_active_grant_frames(t); i++ )
>           free_xenheap_page(t->active[i]);
> -    xfree(t->active);
> +    xvfree(t->active);
>   
>       for ( i = 0; i < nr_status_frames(t); i++ )
>           free_xenheap_page(t->status[i]);
> -    xfree(t->status);
> +    xvfree(t->status);
>   
>       xfree(t);
>       d->grant_table = NULL;
> --- a/xen/common/vmap.c
> +++ b/xen/common/vmap.c
> @@ -7,6 +7,7 @@
>   #include <xen/spinlock.h>
>   #include <xen/types.h>
>   #include <xen/vmap.h>
> +#include <xen/xvmalloc.h>
>   #include <asm/page.h>
>   
>   static DEFINE_SPINLOCK(vm_lock);
> @@ -301,11 +302,29 @@ void *vzalloc(size_t size)
>       return p;
>   }
>   
> -void vfree(void *va)
> +static void _vfree(const void *va, unsigned int pages, enum vmap_region type)
>   {
> -    unsigned int i, pages;
> +    unsigned int i;
>       struct page_info *pg;
>       PAGE_LIST_HEAD(pg_list);
> +
> +    ASSERT(pages);
> +
> +    for ( i = 0; i < pages; i++ )
> +    {
> +        pg = vmap_to_page(va + i * PAGE_SIZE);
> +        ASSERT(pg);
> +        page_list_add(pg, &pg_list);
> +    }
> +    vunmap(va);
> +
> +    while ( (pg = page_list_remove_head(&pg_list)) != NULL )
> +        free_domheap_page(pg);
> +}
> +
> +void vfree(void *va)
> +{
> +    unsigned int pages;
>       enum vmap_region type = VMAP_DEFAULT;
>   
>       if ( !va )
> @@ -317,18 +336,71 @@ void vfree(void *va)
>           type = VMAP_XEN;
>           pages = vm_size(va, type);
>       }
> -    ASSERT(pages);
>   
> -    for ( i = 0; i < pages; i++ )
> +    _vfree(va, pages, type);
> +}
> +
> +void xvfree(void *va)
> +{
> +    unsigned int pages = vm_size(va, VMAP_DEFAULT);
> +
> +    if ( pages )
> +        _vfree(va, pages, VMAP_DEFAULT);
> +    else
> +        xfree(va);
> +}
> +
> +void *_xvmalloc(size_t size, unsigned int align)
> +{
> +    ASSERT(align <= PAGE_SIZE);
> +    return size <= PAGE_SIZE ? _xmalloc(size, align) : vmalloc(size);
> +}
> +
> +void *_xvzalloc(size_t size, unsigned int align)
> +{
> +    ASSERT(align <= PAGE_SIZE);
> +    return size <= PAGE_SIZE ? _xzalloc(size, align) : vzalloc(size);
> +}
> +
> +void *_xvrealloc(void *va, size_t size, unsigned int align)
> +{
> +    size_t pages = vm_size(va, VMAP_DEFAULT);
> +    void *ptr;
> +
> +    ASSERT(align <= PAGE_SIZE);
> +
> +    if ( !pages )
>       {
> -        struct page_info *page = vmap_to_page(va + i * PAGE_SIZE);
> +        if ( size <= PAGE_SIZE )
> +            return _xrealloc(va, size, align);
>   
> -        ASSERT(page);
> -        page_list_add(page, &pg_list);
> +        ptr = vmalloc(size);
> +        if ( ptr && va && va != ZERO_BLOCK_PTR )
> +        {
> +            /*
> +             * xmalloc-based allocations up to PAGE_SIZE don't cross page
> +             * boundaries. Therefore, without needing to know the exact
> +             * prior allocation size, simply copy the entire tail of the
> +             * page containing the earlier allocation.
> +             */
> +            memcpy(ptr, va, PAGE_SIZE - PAGE_OFFSET(va));
> +            xfree(va);
> +        }
> +    }
> +    else if ( pages == PFN_UP(size) )
> +        ptr = va;
> +    else
> +    {
> +        ptr = _xvmalloc(size, align);
> +        if ( ptr )
> +        {
> +            memcpy(ptr, va, min(size, pages << PAGE_SHIFT));
> +            vfree(va);
> +        }
> +        else if ( pages > PFN_UP(size) )
> +            ptr = va;
>       }
> -    vunmap(va);
>   
> -    while ( (pg = page_list_remove_head(&pg_list)) != NULL )
> -        free_domheap_page(pg);
> +    return ptr;
>   }
>   #endif
> --- /dev/null
> +++ b/xen/include/xen/xvmalloc.h
> @@ -0,0 +1,73 @@
> +
> +#ifndef __XVMALLOC_H__
> +#define __XVMALLOC_H__
> +
> +#include <xen/cache.h>
> +#include <xen/types.h>
> +
> +/*
> + * Xen malloc/free-style interface for allocations possibly exceeding a page's
> + * worth of memory, as long as there's no need to have physically contiguous
> + * memory allocated.  These should be used in preference to xmalloc() et al
> + * whenever the size is not known to be constrained to at most a single page.
> + */
> +
> +/* Allocate space for typed object. */
> +#define xvmalloc(_type) ((_type *)_xvmalloc(sizeof(_type), __alignof__(_type)))
> +#define xvzalloc(_type) ((_type *)_xvzalloc(sizeof(_type), __alignof__(_type)))
> +
> +/* Allocate space for array of typed objects. */
> +#define xvmalloc_array(_type, _num) \
> +    ((_type *)_xvmalloc_array(sizeof(_type), __alignof__(_type), _num))
> +#define xvzalloc_array(_type, _num) \
> +    ((_type *)_xvzalloc_array(sizeof(_type), __alignof__(_type), _num))
> +
> +/* Allocate space for a structure with a flexible array of typed objects. */
> +#define xvzalloc_flex_struct(type, field, nr) \
> +    ((type *)_xvzalloc(offsetof(type, field[nr]), __alignof__(type)))
> +
> +#define xvmalloc_flex_struct(type, field, nr) \
> +    ((type *)_xvmalloc(offsetof(type, field[nr]), __alignof__(type)))
> +
> +/* Re-allocate space for a structure with a flexible array of typed objects. */
> +#define xvrealloc_flex_struct(ptr, field, nr)                          \
> +    ((typeof(ptr))_xvrealloc(ptr, offsetof(typeof(*(ptr)), field[nr]), \
> +                             __alignof__(typeof(*(ptr)))))
> +
> +/* Allocate untyped storage. */
> +#define xvmalloc_bytes(_bytes) _xvmalloc(_bytes, SMP_CACHE_BYTES)
> +#define xvzalloc_bytes(_bytes) _xvzalloc(_bytes, SMP_CACHE_BYTES)
> +
> +/* Free any of the above. */
> +extern void xvfree(void *);
> +
> +/* Free an allocation, and zero the pointer to it. */
> +#define XVFREE(p) do { \
> +    xvfree(p);         \
> +    (p) = NULL;        \
> +} while ( false )
> +
> +/* Underlying functions */
> +extern void *_xvmalloc(size_t size, unsigned int align);
> +extern void *_xvzalloc(size_t size, unsigned int align);
> +extern void *_xvrealloc(void *ptr, size_t size, unsigned int align);
> +
> +static inline void *_xvmalloc_array(
> +    size_t size, unsigned int align, unsigned long num)
> +{
> +    /* Check for overflow. */
> +    if ( size && num > UINT_MAX / size )
> +        return NULL;
> +    return _xvmalloc(size * num, align);
> +}
> +
> +static inline void *_xvzalloc_array(
> +    size_t size, unsigned int align, unsigned long num)
> +{
> +    /* Check for overflow. */
> +    if ( size && num > UINT_MAX / size )
> +        return NULL;
> +    return _xvzalloc(size * num, align);
> +}
> +
> +#endif /* __XVMALLOC_H__ */
> 

-- 
Julien Grall

Re: [PATCH v2 02/17] mm: introduce xvmalloc() et al and use for grant table allocations
Posted by Jan Beulich 3 years, 4 months ago
On 25.11.2020 13:15, Julien Grall wrote:
> On 23/11/2020 14:23, Jan Beulich wrote:
>> All of the array allocations in grant_table_init() can exceed a page's
>> worth of memory, which xmalloc()-based interfaces aren't really suitable
>> for after boot. We also don't need any of these allocations to be
>> physically contiguous.. Introduce interfaces dynamically switching
>> between xmalloc() et al and vmalloc() et al, based on requested size,
>> and use them instead.
>>
>> All the wrappers in the new header get cloned mostly verbatim from
>> xmalloc.h, with the sole adjustment to switch unsigned long to size_t
>> for sizes and to unsigned int for alignments.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> v2: Actually edit a copy-and-pasted comment in xvmalloc.h which was
>>      meant to be edited from the beginning.
>> ---
>> I'm unconvinced of the mentioning of "physically contiguous" in the
>> comment at the top of the new header: I don't think xmalloc() provides
>> such a guarantee. Any use assuming so would look (latently) broken to
>> me.
> 
> I haven't had the chance to reply to the first version about this. So I 
> will reply here to avoid confusion.
> 
> I can at least spot one user in Arm that would use xmalloc() that way 
> (see the allocation of itt_addr in arch/arm/gic-v3-its.c).

And I surely wouldn't have spotted this, even if I had tried
to find "offenders", i.e. as said before not wanting to alter
the behavior of existing code (beyond the explicit changes
done here) was ...

> AFAIK, the memory is for the sole purpose of the ITS and should not be 
> accessed by Xen. So I think we can replace by a new version of 
> alloc_domheap_pages().
> 
> However, I still question the usefulness of introducing yet another way 
> to allocate memory (we already have alloc_xenheap_pages(), xmalloc(), 
> alloc_domheap_pages(), vmap()) if you think users cannot rely on 
> xmalloc() to allocate memory physically contiguous.

... the reason to introduce a separate new interface. Plus of
course this parallels what Linux has.

> It definitely makes more difficult to figure out when to use xmalloc() 
> vs xvalloc().

I don't see the difficulty:
- if you need physically contiguous memory, use alloc_xen*_pages(),
- if you know the allocation size is always no more than a page,
  use xmalloc(),
- if you know the allocation size is always more than a page, use
  vmalloc(),
- otherwise use xvmalloc(). Exceptions may of course apply, i.e.
  this is just a rule of thumb.

> I would like to hear an opinion from the other maintainers.

Let's hope at least one will voice theirs.

Jan
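
For concreteness, the rule of thumb in the preceding mail might map onto code
along the following lines. This is purely illustrative: all types, sizes, and
names are made up, only the allocation interfaces are the existing (or newly
proposed) ones.

#include <xen/errno.h>
#include <xen/mm.h>
#include <xen/vmap.h>
#include <xen/xmalloc.h>
#include <xen/xvmalloc.h>
#include <asm/page.h>

struct small_ctl { unsigned int state; };   /* always fits in a page */
struct entry { unsigned long val; };

static int example(unsigned int nr_entries, size_t hw_buf_bytes)
{
    /* Physically contiguous (e.g. handed to a device): page allocator. */
    void *hw_buf = alloc_xenheap_pages(get_order_from_bytes(hw_buf_bytes), 0);

    /* Size known to never exceed a page: xmalloc() et al. */
    struct small_ctl *ctl = xzalloc(struct small_ctl);

    /* Size known to always exceed a page: vmalloc() et al. */
    unsigned long *bitmap = vzalloc(2 * PAGE_SIZE);

    /* Size may fall on either side of a page: xvmalloc() et al. */
    struct entry *tbl = xvzalloc_array(struct entry, nr_entries);

    if ( !hw_buf || !ctl || !bitmap || !tbl )
        return -ENOMEM;    /* error unwinding elided for brevity */

    return 0;
}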

Re: [PATCH v2 02/17] mm: introduce xvmalloc() et al and use for grant table allocations
Posted by Stefano Stabellini 3 years, 4 months ago
On Wed, 25 Nov 2020, Jan Beulich wrote:
> On 25.11.2020 13:15, Julien Grall wrote:
> > On 23/11/2020 14:23, Jan Beulich wrote:
> >> All of the array allocations in grant_table_init() can exceed a page's
> >> worth of memory, which xmalloc()-based interfaces aren't really suitable
> >> for after boot. We also don't need any of these allocations to be
> >> physically contiguous.. Introduce interfaces dynamically switching
> >> between xmalloc() et al and vmalloc() et al, based on requested size,
> >> and use them instead.
> >>
> >> All the wrappers in the new header get cloned mostly verbatim from
> >> xmalloc.h, with the sole adjustment to switch unsigned long to size_t
> >> for sizes and to unsigned int for alignments.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> v2: Actually edit a copy-and-pasted comment in xvmalloc.h which was
> >>      meant to be edited from the beginning.
> >> ---
> >> I'm unconvinced of the mentioning of "physically contiguous" in the
> >> comment at the top of the new header: I don't think xmalloc() provides
> >> such a guarantee. Any use assuming so would look (latently) broken to
> >> me.
> > 
> > I haven't had the chance to reply to the first version about this. So I 
> > will reply here to avoid confusion.
> > 
> > I can at least spot one user in Arm that would use xmalloc() that way 
> > (see the allocation of itt_addr in arch/arm/gic-v3-its.c).
> 
> And I surely wouldn't have spotted this, even if I had tried
> to find "offenders", i.e. as said before not wanting to alter
> the behavior of existing code (beyond the explicit changes
> done here) was ...
> 
> > AFAIK, the memory is for the sole purpose of the ITS and should not be 
> > accessed by Xen. So I think we can replace by a new version of 
> > alloc_domheap_pages().
> > 
> > However, I still question the usefulness of introducing yet another way 
> > to allocate memory (we already have alloc_xenheap_pages(), xmalloc(), 
> > alloc_domheap_pages(), vmap()) if you think users cannot rely on 
> > xmalloc() to allocate memory physically contiguous.
> 
> ... the reason to introduce a separate new interface. Plus of
> course this parallels what Linux has.
> 
> > It definitely makes more difficult to figure out when to use xmalloc() 
> > vs xvalloc().
> 
> I don't see the difficulty:
> - if you need physically contiguous memory, use alloc_xen*_pages(),
> - if you know the allocation size is always no more than a page,
>   use xmalloc(),

What if you need memory physically contiguous but not necessarily an
order of pages, such as for instance 5200 bytes?

If xmalloc can't do physically contiguous allocations, we need something
else that does physically contiguous allocations not only at page
granularity, right?

The other issue is semantics. If xmalloc is unable to allocate more than
a page of contiguous memory, then it is identical to vmalloc from the
caller's point of view: both xmalloc and vmalloc return a virtual
address for an allocation that might not be physically contiguous.

Maybe we should get rid of xmalloc entirely and improve the
implementation of vmalloc so that it falls back to xmalloc for
sub-page allocations. Which in fact is almost the same thing that you
did.


> - if you know the allocation size is always more than a page, use
>   vmalloc(),
> - otherwise use xvmalloc(). Exceptions may of course apply, i.e.
> this is just a rule of thumb.
> 
> > I would like to hear an opinion from the other maintainers.
> 
> Let's hope at least one will voice theirs.

If we take a step back, I think we only really need two memory
allocators:

1) one that allocates physically contiguous memory
2) one that allocates non-physically contiguous memory

That's it, right?

In addition to that, I understand it could be convenient to have a little
wrapper that automatically chooses between 1) and 2) depending on
circumstances.

But if the circumstance is just size < PAGE_SIZE then I don't think we
need any convenience wrappers: we should just be able to call 2), which
is vmalloc, once we improve the vmalloc implementation.

Or do you see any reasons to keep the current vmalloc implementation as
is for sub-page allocations?

Re: [PATCH v2 02/17] mm: introduce xvmalloc() et al and use for grant table allocations
Posted by Jan Beulich 3 years, 4 months ago
On 25.11.2020 20:48, Stefano Stabellini wrote:
> On Wed, 25 Nov 2020, Jan Beulich wrote:
>> On 25.11.2020 13:15, Julien Grall wrote:
>>> On 23/11/2020 14:23, Jan Beulich wrote:
>>>> All of the array allocations in grant_table_init() can exceed a page's
>>>> worth of memory, which xmalloc()-based interfaces aren't really suitable
>>>> for after boot. We also don't need any of these allocations to be
>>>> physically contiguous.. Introduce interfaces dynamically switching
>>>> between xmalloc() et al and vmalloc() et al, based on requested size,
>>>> and use them instead.
>>>>
>>>> All the wrappers in the new header get cloned mostly verbatim from
>>>> xmalloc.h, with the sole adjustment to switch unsigned long to size_t
>>>> for sizes and to unsigned int for alignments.
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> ---
>>>> v2: Actually edit a copy-and-pasted comment in xvmalloc.h which was
>>>>      meant to be edited from the beginning.
>>>> ---
>>>> I'm unconvinced of the mentioning of "physically contiguous" in the
>>>> comment at the top of the new header: I don't think xmalloc() provides
>>>> such a guarantee. Any use assuming so would look (latently) broken to
>>>> me.
>>>
>>> I haven't had the chance to reply to the first version about this. So I 
>>> will reply here to avoid confusion.
>>>
>>> I can at least spot one user in Arm that would use xmalloc() that way 
>>> (see the allocation of itt_addr in arch/arm/gic-v3-its.c).
>>
>> And I surely wouldn't have spotted this, even if I had tried
>> to find "offenders", i.e. as said before not wanting to alter
>> the behavior of existing code (beyond the explicit changes
>> done here) was ...
>>
>>> AFAIK, the memory is for the sole purpose of the ITS and should not be 
>>> accessed by Xen. So I think we can replace by a new version of 
>>> alloc_domheap_pages().
>>>
>>> However, I still question the usefulness of introducing yet another way 
>>> to allocate memory (we already have alloc_xenheap_pages(), xmalloc(), 
>>> alloc_domheap_pages(), vmap()) if you think users cannot rely on 
>>> xmalloc() to allocate memory physically contiguous.
>>
>> ... the reason to introduce a separate new interface. Plus of
>> course this parallels what Linux has.
>>
>>> It definitely makes more difficult to figure out when to use xmalloc() 
>>> vs xvalloc().
>>
>> I don't see the difficulty:
>> - if you need physically contiguous memory, use alloc_xen*_pages(),
>> - if you know the allocation size is always no more than a page,
>>   use xmalloc(),
> 
> What if you need memory physically contiguous but not necessarily an
> order of pages, such as for instance 5200 bytes?

This case is, I think, rare enough (in particular in Xen) that the
waste of space can be tolerated imo.

> If xmalloc can't do physically contiguous allocations, we need something
> else that does physically contiguous allocations not only at page
> granularity, right?

Well, we first need to settle on what guarantees xmalloc() is meant
to provide. It may be just me assuming it doesn't provide the same
ones which Linux'es kmalloc() makes. I'm first and foremost
judging by the comment near the top of xmalloc.h, which compares
with malloc() / free(), not kmalloc() / kfree().

> The other issue is semantics. If xmalloc is unable to allocate more than
> a page of contiguous memory, then it is identical to vmalloc from the
> caller's point of view: both xmalloc and vmalloc return a virtual
> address for an allocation that might not be physically contiguous.

Almost. vmalloc() puts guard pages around the allocation and
guarantees page alignment.

> Maybe we should get rid of xmalloc entirely and improve the
> implementation of vmalloc so that it falls back to xmalloc for
> sub-page allocations. Which in fact is almost the same thing that you
> did.

This would break callers assuming page alignment (and - shouldn't
be an issue in practice - granularity). If anything, as Julien
did suggest, we could modify xmalloc() accordingly, but then of
course making sure we also honor alignment requests beyond page
size.

Neither of these is the goal here, hence this "intermediate"
implementation, which is only almost "redundant".

>> - if you know the allocation size is always more than a page, use
>>   vmalloc(),
>> - otherwise use xvmalloc(). Exceptions may of course apply, i.e.
>> this is just a rule of thumb.
>>
>>> I would like to hear an opinion from the other maintainers.
>>
>> Let's hope at least one will voice theirs.
> 
> If we take a step back, I think we only really need two memory
> allocators:
> 
> 1) one that allocates physically contiguous memory
> 2) one that allocates non-physically contiguous memory
> 
> That's it, right?
> 
> In addition to that, I understand it could be convient to have a little
> wrapper that automatically chooses between 1) and 2) depending on
> circumstances.
> 
> But if the circumstance is just size < PAGE_SIZE then I don't think we
> need any convenience wrappers: we should just be able to call 2), which
> is vmalloc, once we improve the vmalloc implementation.
> 
> Or do you see any reasons to keep the current vmalloc implementation as
> is for sub-page allocations?

See my "Almost. ..." above.

As an aside, I also find it quite puzzling that in one of the rare
cases where I propose to clone an interface from Linux without much
deviation from their model, I get objections. It typically was the
other way around in the past ...

Jan

Re: [PATCH v2 02/17] mm: introduce xvmalloc() et al and use for grant table allocations
Posted by Julien Grall 3 years, 4 months ago
Hi Jan,

On 26/11/2020 11:34, Jan Beulich wrote:
> On 25.11.2020 20:48, Stefano Stabellini wrote:
>> On Wed, 25 Nov 2020, Jan Beulich wrote:
>>> On 25.11.2020 13:15, Julien Grall wrote:
>>>> On 23/11/2020 14:23, Jan Beulich wrote:
>>>>> All of the array allocations in grant_table_init() can exceed a page's
>>>>> worth of memory, which xmalloc()-based interfaces aren't really suitable
>>>>> for after boot. We also don't need any of these allocations to be
>>>>> physically contiguous.. Introduce interfaces dynamically switching
>>>>> between xmalloc() et al and vmalloc() et al, based on requested size,
>>>>> and use them instead.
>>>>>
>>>>> All the wrappers in the new header get cloned mostly verbatim from
>>>>> xmalloc.h, with the sole adjustment to switch unsigned long to size_t
>>>>> for sizes and to unsigned int for alignments.
>>>>>
>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>>> ---
>>>>> v2: Actually edit a copy-and-pasted comment in xvmalloc.h which was
>>>>>       meant to be edited from the beginning.
>>>>> ---
>>>>> I'm unconvinced of the mentioning of "physically contiguous" in the
>>>>> comment at the top of the new header: I don't think xmalloc() provides
>>>>> such a guarantee. Any use assuming so would look (latently) broken to
>>>>> me.
>>>>
>>>> I haven't had the chance to reply to the first version about this. So I
>>>> will reply here to avoid confusion.
>>>>
>>>> I can at least spot one user in Arm that would use xmalloc() that way
>>>> (see the allocation of itt_addr in arch/arm/gic-v3-its.c).
>>>
>>> And I surely wouldn't have spotted this, even if I had tried
>>> to find "offenders", i.e. as said before not wanting to alter
>>> the behavior of existing code (beyond the explicit changes
>>> done here) was ...
>>>
>>>> AFAIK, the memory is for the sole purpose of the ITS and should not be
>>>> accessed by Xen. So I think we can replace by a new version of
>>>> alloc_domheap_pages().
>>>>
>>>> However, I still question the usefulness of introducing yet another way
>>>> to allocate memory (we already have alloc_xenheap_pages(), xmalloc(),
>>>> alloc_domheap_pages(), vmap()) if you think users cannot rely on
>>>> xmalloc() to allocate memory physically contiguous.
>>>
>>> ... the reason to introduce a separate new interface. Plus of
>>> course this parallels what Linux has.
>>>
>>>> It definitely makes more difficult to figure out when to use xmalloc()
>>>> vs xvalloc().
>>>
>>> I don't see the difficulty:
>>> - if you need physically contiguous memory, use alloc_xen*_pages(),
>>> - if you know the allocation size is always no more than a page,
>>>    use xmalloc(),

If that's the intention, then may I ask why xmalloc() is able to 
support multi-page allocations?

Your assumption is Xen will always be built with the same page size 
across all the architectures. While Xen only works with 4KB pages today, 
Arm can support 16KB and 64KB. I have long-term plans to add support for 
them.

So I don't think you can use the page size as a way to distinguish which 
one to use.

>>
>> What if you need memory physically contiguous but not necessarily an
>> order of pages, such as for instance 5200 bytes?
> 
> This case is, I think, rare enough (in particular in Xen) that the
> waste of space can be tolerated imo.

This is quite departure from:

commit b829a0ff5794ee5b0f96a0c872f6a4ed7b1007c7
Author: Jan Beulich <jbeulich@suse.com>
Date:   Thu Oct 13 10:03:43 2011 +0200

     xmalloc: return unused full pages on multi-page allocations

     Certain (boot time) allocations are relatively large (particularly when
     building with high NR_CPUS), but can also happen to be pretty far away
     from a power-of-two size. Utilize the page allocator's (other than
     Linux'es) capability of allowing to return space from higher-order
     allocations in smaller pieces to return the unused parts immediately.

     Signed-off-by: Jan Beulich <jbeulich@suse.com>
     Acked-by: Keir Fraser <keir@xen.org>

I am curious to know what changed...

Anyway, what you wrote is very server-focused. On Arm, we have plans to 
run Xen on smaller hardware, where wasting memory means less usable RAM 
for guests.

The problem with using an order is that the bigger the order, the more 
chance there is of wasting space...

Allocating more than a page is fairly common on Arm, so we really want 
to reduce the amount of memory wasted.

> 
>> If xmalloc can't do physically contiguous allocations, we need something
>> else that does physically contiguous allocations not only at page
>> granularity, right?
> 
> Well, we first need to settle on what guarantees xmalloc() is meant
> to provide. It may be just me assuming it doesn't provide the same
> ones which Linux'es kmalloc() makes. I'm first and foremost
> judging by the comment near the top of xmalloc.h, which compares
> with malloc() / free(), not kmalloc() / kfree().
> 
>> The other issue is semantics. If xmalloc is unable to allocate more than
>> a page of contiguous memory, then it is identical to vmalloc from the
>> caller's point of view: both xmalloc and vmalloc return a virtual
>> address for an allocation that might not be physically contiguous.
> 
> Almost. vmalloc() puts guard pages around the allocation and
> guarantees page alignment.
> 
>> Maybe we should get rid of xmalloc entirely and improve the
>> implementation of vmalloc so that it falls back to xmalloc for
>> sub-page allocations. Which in fact is almost the same thing that you
>> did.
> 
> This would break callers assuming page alignment (and - shouldn't
> be an issue in practice - granularity). If anything, as Julien
> did suggest, we could modify xmalloc() accordingly, but then of
> course making sure we also honor alignment requests beyond page
> size.
> 
> Neither of these is the goal here, hence this "intermediate"
> implementation, which is only almost "redundant".
> 
>>> - if you know the allocation size is always more than a page, use
>>>    vmalloc(),
>>> - otherwise use xvmalloc(). Exceptions may of course apply, i.e.
>>> this is just a rule of thumb.
>>>
>>>> I would like to hear an opinion from the other maintainers.
>>>
>>> Let's hope at least one will voice theirs.
>>
>> If we take a step back, I think we only really need two memory
>> allocators:
>>
>> 1) one that allocates physically contiguous memory
>> 2) one that allocates non-physically contiguous memory
>>
>> That's it, right?
>>
>> In addition to that, I understand it could be convient to have a little
>> wrapper that automatically chooses between 1) and 2) depending on
>> circumstances.
>>
>> But if the circumstance is just size < PAGE_SIZE then I don't think we
>> need any convenience wrappers: we should just be able to call 2), which
>> is vmalloc, once we improve the vmalloc implementation.
>>
>> Or do you see any reasons to keep the current vmalloc implementation as
>> is for sub-page allocations?
> 
> See my "Almost. ..." above.
> 
> As an aside, I also find it quite puzzling that in one of the rare
> cases where I propose to clone an interface from Linux without much
> deviation from their model, I get objections. It typically was the
> other way around in the past ...

If we were really following Linux, then we would have two interfaces:
    - xmalloc() which is the same as kmalloc()
    - xvalloc() which is the same a kvalloc()

However, you seem to be the one objecting to the behavior of xmalloc().

I can't speak for Stefano, but I don't object on following Linux. 
Instead I am objecting to the growing number of ways to allocate memory 
in Xen, which differ depending on the system_state.

Cheers,

-- 
Julien Grall

Re: [PATCH v2 02/17] mm: introduce xvmalloc() et al and use for grant table allocations
Posted by Jan Beulich 3 years, 4 months ago
On 26.11.2020 14:22, Julien Grall wrote:
> On 26/11/2020 11:34, Jan Beulich wrote:
>> On 25.11.2020 20:48, Stefano Stabellini wrote:
>>> On Wed, 25 Nov 2020, Jan Beulich wrote:
>>>> On 25.11.2020 13:15, Julien Grall wrote:
>>>>> On 23/11/2020 14:23, Jan Beulich wrote:
>>>>>> I'm unconvinced of the mentioning of "physically contiguous" in the
>>>>>> comment at the top of the new header: I don't think xmalloc() provides
>>>>>> such a guarantee. Any use assuming so would look (latently) broken to
>>>>>> me.
>>>>>
>>>>> I haven't had the chance to reply to the first version about this. So I
>>>>> will reply here to avoid confusion.
>>>>>
>>>>> I can at least spot one user in Arm that would use xmalloc() that way
>>>>> (see the allocation of itt_addr in arch/arm/gic-v3-its.c).
>>>>
>>>> And I surely wouldn't have spotted this, even if I had tried
>>>> to find "offenders", i.e. as said before not wanting to alter
>>>> the behavior of existing code (beyond the explicit changes
>>>> done here) was ...
>>>>
>>>>> AFAIK, the memory is for the sole purpose of the ITS and should not be
>>>>> accessed by Xen. So I think we can replace by a new version of
>>>>> alloc_domheap_pages().
>>>>>
>>>>> However, I still question the usefulness of introducing yet another way
>>>>> to allocate memory (we already have alloc_xenheap_pages(), xmalloc(),
>>>>> alloc_domheap_pages(), vmap()) if you think users cannot rely on
>>>>> xmalloc() to allocate memory physically contiguous.
>>>>
>>>> ... the reason to introduce a separate new interface. Plus of
>>>> course this parallels what Linux has.
>>>>
>>>>> It definitely makes more difficult to figure out when to use xmalloc()
>>>>> vs xvalloc().
>>>>
>>>> I don't see the difficulty:
>>>> - if you need physically contiguous memory, use alloc_xen*_pages(),
>>>> - if you know the allocation size is always no more than a page,
>>>>    use xmalloc(),
> 
> If that's then intention, then may I ask why xmalloc() is able to 
> support multiple pages allocation?

Because support for this pre-dates even the introduction of vmalloc()?

> Your assumption is Xen will always be built with the same page size 
> across all the architecture. While Xen only works with 4KB pages today, 
> Arm can support 16KB and 64KB. I have long term plan to add support for it.
> 
> So I don't think you can use the page size as a way to distinguish which 
> one to use.

Then let's abstract this one level further:

- if you know the allocation size is always no more than the smallest
  possible page size, use xmalloc()

>>> What if you need memory physically contiguous but not necessarily an
>>> order of pages, such as for instance 5200 bytes?
>>
>> This case is, I think, rare enough (in particular in Xen) that the
>> waste of space can be tolerated imo.
> 
> This is quite departure from:
> 
> commit b829a0ff5794ee5b0f96a0c872f6a4ed7b1007c7
> Author: Jan Beulich <jbeulich@suse.com>
> Date:   Thu Oct 13 10:03:43 2011 +0200
> 
>      xmalloc: return unused full pages on multi-page allocations
> 
>      Certain (boot time) allocations are relatively large (particularly when
>      building with high NR_CPUS), but can also happen to be pretty far away
>      from a power-of-two size. Utilize the page allocator's (other than
>      Linux'es) capability of allowing to return space from higher-order
>      allocations in smaller pieces to return the unused parts immediately.
> 
>      Signed-off-by: Jan Beulich <jbeulich@suse.com>
>      Acked-by: Keir Fraser <keir@xen.org>
> 
> I am curious to know what changed...

Nothing. But even if something had, citing a 9 year old commit is
not likely to point out any actual contradiction.

> Anyway, what you wrote is very server focused. On Arm, we have plan to 
> run Xen on smaller hardware where wasting memory mean less usable RAM 
> for guests.
> 
> The problem with using an order is the bigger the order is the more 
> change you will waste space...
> 
> Allocating more than a page is fairly common on Arm, so we really want 
> to reduce the amount of memory wasted.

The amount of space wasted is the same - the tail of the trailing
page. I'm afraid I don't see what your point is.

>>> If xmalloc can't do physically contiguous allocations, we need something
>>> else that does physically contiguous allocations not only at page
>>> granularity, right?
>>
>> Well, we first need to settle on what guarantees xmalloc() is meant
>> to provide. It may be just me assuming it doesn't provide the same
>> ones which Linux'es kmalloc() makes. I'm first and foremost
>> judging by the comment near the top of xmalloc.h, which compares
>> with malloc() / free(), not kmalloc() / kfree().
>>
>>> The other issue is semantics. If xmalloc is unable to allocate more than
>>> a page of contiguous memory, then it is identical to vmalloc from the
>>> caller's point of view: both xmalloc and vmalloc return a virtual
>>> address for an allocation that might not be physically contiguous.
>>
>> Almost. vmalloc() puts guard pages around the allocation and
>> guarantees page alignment.
>>
>>> Maybe we should get rid of xmalloc entirely and improve the
>>> implementation of vmalloc so that it falls back to xmalloc for
>>> sub-page allocations. Which in fact is almost the same thing that you
>>> did.
>>
>> This would break callers assuming page alignment (and - shouldn't
>> be an issue in practice - granularity). If anything, as Julien
>> did suggest, we could modify xmalloc() accordingly, but then of
>> course making sure we also honor alignment requests beyond page
>> size.
>>
>> Neither of these is the goal here, hence this "intermediate"
>> implementation, which is only almost "redundant".
>>
>>>> - if you know the allocation size is always more than a page, use
>>>>    vmalloc(),
>>>> - otherwise use xvmalloc(). Exceptions may of course apply, i.e.
>>>> this is just a rule of thumb.
>>>>
>>>>> I would like to hear an opinion from the other maintainers.
>>>>
>>>> Let's hope at least one will voice theirs.
>>>
>>> If we take a step back, I think we only really need two memory
>>> allocators:
>>>
>>> 1) one that allocates physically contiguous memory
>>> 2) one that allocates non-physically contiguous memory
>>>
>>> That's it, right?
>>>
>>> In addition to that, I understand it could be convient to have a little
>>> wrapper that automatically chooses between 1) and 2) depending on
>>> circumstances.
>>>
>>> But if the circumstance is just size < PAGE_SIZE then I don't think we
>>> need any convenience wrappers: we should just be able to call 2), which
>>> is vmalloc, once we improve the vmalloc implementation.
>>>
>>> Or do you see any reasons to keep the current vmalloc implementation as
>>> is for sub-page allocations?
>>
>> See my "Almost. ..." above.
>>
>> As an aside, I also find it quite puzzling that in one of the rare
>> cases where I propose to clone an interface from Linux without much
>> deviation from their model, I get objections. It typically was the
>> other way around in the past ...
> 
> If we were really following Linux, then we would have two interfaces:
>     - xmalloc() which is the same as kmalloc()
>     - xvalloc() which is the same a kvalloc()

(correction: xvmalloc() and kvmalloc())

- vmalloc() (named identically in Linux and Xen)

IOW the same set of _three_ interface groups.

> However, you seem to be the one objecting on the behavior of xmalloc().

I don't think I'm objecting to any existing behavior. What I did
is state my view on (non-)guarantees by xmalloc(). And I've
already said - maybe I'm wrong and, like Linux'es kmalloc(),
there is a guarantee of it producing contiguous memory, and I
merely didn't find where that's said.

> I can't speak for Stefano, but I don't object on following Linux. 
> Instead I am objecting on the growing number of way to allocate memory 
> in Xen and that differ depending on the system_state.

But as per above the addition only brings us on par with Linux.
There, kvmalloc_node() is simply a wrapper (with different logic
when to try what) around kmalloc_node() and __vmalloc_node(). No
different (in the basic idea) from what I'm doing here.

Jan

Re: [PATCH v2 02/17] mm: introduce xvmalloc() et al and use for grant table allocations
Posted by Julien Grall 3 years, 4 months ago
Hi Jan,

On 26/11/2020 15:18, Jan Beulich wrote:
> On 26.11.2020 14:22, Julien Grall wrote:
>> On 26/11/2020 11:34, Jan Beulich wrote:
>>> On 25.11.2020 20:48, Stefano Stabellini wrote:
>>>> On Wed, 25 Nov 2020, Jan Beulich wrote:
>>>>> On 25.11.2020 13:15, Julien Grall wrote:
>>>>>> On 23/11/2020 14:23, Jan Beulich wrote:
>>>>>>> I'm unconvinced of the mentioning of "physically contiguous" in the
>>>>>>> comment at the top of the new header: I don't think xmalloc() provides
>>>>>>> such a guarantee. Any use assuming so would look (latently) broken to
>>>>>>> me.
>>>>>>
>>>>>> I haven't had the chance to reply to the first version about this. So I
>>>>>> will reply here to avoid confusion.
>>>>>>
>>>>>> I can at least spot one user in Arm that would use xmalloc() that way
>>>>>> (see the allocation of itt_addr in arch/arm/gic-v3-its.c).
>>>>>
>>>>> And I surely wouldn't have spotted this, even if I had tried
>>>>> to find "offenders", i.e. as said before not wanting to alter
>>>>> the behavior of existing code (beyond the explicit changes
>>>>> done here) was ...
>>>>>
>>>>>> AFAIK, the memory is for the sole purpose of the ITS and should not be
>>>>>> accessed by Xen. So I think we can replace by a new version of
>>>>>> alloc_domheap_pages().
>>>>>>
>>>>>> However, I still question the usefulness of introducing yet another way
>>>>>> to allocate memory (we already have alloc_xenheap_pages(), xmalloc(),
>>>>>> alloc_domheap_pages(), vmap()) if you think users cannot rely on
>>>>>> xmalloc() to allocate memory physically contiguous.
>>>>>
>>>>> ... the reason to introduce a separate new interface. Plus of
>>>>> course this parallels what Linux has.
>>>>>
>>>>>> It definitely makes more difficult to figure out when to use xmalloc()
>>>>>> vs xvalloc().
>>>>>
>>>>> I don't see the difficulty:
>>>>> - if you need physically contiguous memory, use alloc_xen*_pages(),
>>>>> - if you know the allocation size is always no more than a page,
>>>>>     use xmalloc(),
>>
>> If that's then intention, then may I ask why xmalloc() is able to
>> support multiple pages allocation?
> 
> Because support for this pre-dates even the introduction of vmalloc()?

Right, so the code should disappear if we want people to avoid making 
that assumption with xmalloc().

> 
>> Your assumption is Xen will always be built with the same page size
>> across all the architecture. While Xen only works with 4KB pages today,
>> Arm can support 16KB and 64KB. I have long term plan to add support for it.
>>
>> So I don't think you can use the page size as a way to distinguish which
>> one to use.
> 
> The let's abstract this one level further
> 
> - if you know the allocation size is always no more than the smallest
>    possible page size, use xmalloc()

So basically, xmalloc() is becoming pointless when xvmalloc() can do the 
same for you (as it would call xmalloc()).

> 
>>>> What if you need memory physically contiguous but not necessarily an
>>>> order of pages, such as for instance 5200 bytes?
>>>
>>> This case is, I think, rare enough (in particular in Xen) that the
>>> waste of space can be tolerated imo.
>>
>> This is quite departure from:
>>
>> commit b829a0ff5794ee5b0f96a0c872f6a4ed7b1007c7
>> Author: Jan Beulich <jbeulich@suse.com>
>> Date:   Thu Oct 13 10:03:43 2011 +0200
>>
>>       xmalloc: return unused full pages on multi-page allocations
>>
>>       Certain (boot time) allocations are relatively large (particularly when
>>       building with high NR_CPUS), but can also happen to be pretty far away
>>       from a power-of-two size. Utilize the page allocator's (other than
>>       Linux'es) capability of allowing to return space from higher-order
>>       allocations in smaller pieces to return the unused parts immediately.
>>
>>       Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>       Acked-by: Keir Fraser <keir@xen.org>
>>
>> I am curious to know what changed...
> 
> Nothing. But even if something had, citing a 9 year old commit is
> not likely to point out any actual contradiction.

You can't say it is tolerable now when in the past you suggested it was 
not (otherwise why would you hand back memory?).

Therefore I would like to understand why in the past this was not 
tolerable but now it is... What changed?

> 
>> Anyway, what you wrote is very server focused. On Arm, we have plan to
>> run Xen on smaller hardware where wasting memory mean less usable RAM
>> for guests.
>>
>> The problem with using an order is the bigger the order is the more
>> change you will waste space...
>>
>> Allocating more than a page is fairly common on Arm, so we really want
>> to reduce the amount of memory wasted.
> 
> The amount of space wasted is the same - the tail of the trailing
> page. I'm afraid I don't see what your point is.

There would obviously be no difference if one wants to allocate more 
than one page but less than 2 pages....

But that was not my point. My point is that when you allocate with an 
order greater than or equal to 2, you will start wasting memory when not 
using xmalloc().

For instance, if you want to allocate 20kB then you will need to 
allocate 32kB and lose 12kB. To make it sound bigger, you could replace 
kB with MB.

>>>> If xmalloc can't do physically contiguous allocations, we need something
>>>> else that does physically contiguous allocations not only at page
>>>> granularity, right?
>>>
>>> Well, we first need to settle on what guarantees xmalloc() is meant
>>> to provide. It may be just me assuming it doesn't provide the same
>>> ones which Linux'es kmalloc() makes. I'm first and foremost
>>> judging by the comment near the top of xmalloc.h, which compares
>>> with malloc() / free(), not kmalloc() / kfree().
>>>
>>>> The other issue is semantics. If xmalloc is unable to allocate more than
>>>> a page of contiguous memory, then it is identical to vmalloc from the
>>>> caller's point of view: both xmalloc and vmalloc return a virtual
>>>> address for an allocation that might not be physically contiguous.
>>>
>>> Almost. vmalloc() puts guard pages around the allocation and
>>> guarantees page alignment.
>>>
>>>> Maybe we should get rid of xmalloc entirely and improve the
>>>> implementation of vmalloc so that it falls back to xmalloc for
>>>> sub-page allocations. Which in fact is almost the same thing that you
>>>> did.
>>>
>>> This would break callers assuming page alignment (and - shouldn't
>>> be an issue in practice - granularity). If anything, as Julien
>>> did suggest, we could modify xmalloc() accordingly, but then of
>>> course making sure we also honor alignment requests beyond page
>>> size.
>>>
>>> Neither of these is the goal here, hence this "intermediate"
>>> implementation, which is only almost "redundant".
>>>
>>>>> - if you know the allocation size is always more than a page, use
>>>>>     vmalloc(),
>>>>> - otherwise use xvmalloc(). Exceptions may of course apply, i.e.
>>>>> this is just a rule of thumb.
>>>>>
>>>>>> I would like to hear an opinion from the other maintainers.
>>>>>
>>>>> Let's hope at least one will voice theirs.
>>>>
>>>> If we take a step back, I think we only really need two memory
>>>> allocators:
>>>>
>>>> 1) one that allocates physically contiguous memory
>>>> 2) one that allocates non-physically contiguous memory
>>>>
>>>> That's it, right?
>>>>
>>>> In addition to that, I understand it could be convient to have a little
>>>> wrapper that automatically chooses between 1) and 2) depending on
>>>> circumstances.
>>>>
>>>> But if the circumstance is just size < PAGE_SIZE then I don't think we
>>>> need any convenience wrappers: we should just be able to call 2), which
>>>> is vmalloc, once we improve the vmalloc implementation.
>>>>
>>>> Or do you see any reasons to keep the current vmalloc implementation as
>>>> is for sub-page allocations?
>>>
>>> See my "Almost. ..." above.
>>>
>>> As an aside, I also find it quite puzzling that in one of the rare
>>> cases where I propose to clone an interface from Linux without much
>>> deviation from their model, I get objections. It typically was the
>>> other way around in the past ...
>>
>> If we were really following Linux, then we would have two interfaces:
>>      - xmalloc() which is the same as kmalloc()
>>      - xvalloc() which is the same as kvalloc()
> 
> (correction: xvmalloc() and kvmalloc())
> 
> - vmalloc() (named identically in Linux and Xen)
> 
> IOW the same set of _three_ interface groups.
> 
>> However, you seem to be the one objecting to the behavior of xmalloc().
> 
> I don't think I'm objecting to any existing behavior. What I did
> is state my view on (non-)guarantees by xmalloc(). And I've
> already said - maybe I'm wrong and, like Linux'es kmalloc(),
> there is a guarantee of it producing contiguous memory, and I
> merely didn't find where that's said.

I can find quite a few places in Linux that use kmalloc() with sizes
that are bigger than the page size.

That's enough for me to think that while this may not have been the 
original intention, people are using it like that (same in Xen). So we 
can't dismiss it.

> 
>> I can't speak for Stefano, but I don't object to following Linux.
>> Instead I am objecting to the growing number of ways to allocate memory
>> in Xen, which differ depending on the system_state.
> 
> But as per above the addition only brings us on par with Linux.
> There, kvmalloc_node() is simply a wrapper (with different logic
> when to try what) around kmalloc_node() and __vmalloc_node(). No
> different (in the basic idea) from what I'm doing here.

There are at least two more in Xen so far:

  - alloc_domheap_pages()
  - alloc_xenheap_pages()

I still maintain that the way you suggest to use each of them is not 
clear. In particular, that xmalloc() doesn't guarantee physically 
contiguous allocations for allocations larger than a page.

In summary, I would be happy with the introduction of xvmalloc() as long 
as we guarantee that xmalloc() is allocating physically contiguous memory.

Users that don't care about this guarantee would have to be switched 
to use xvmalloc() (this doesn't need to be done here).

Cheers,

-- 
Julien Grall

Re: [PATCH v2 02/17] mm: introduce xvmalloc() et al and use for grant table allocations
Posted by Jan Beulich 3 years, 4 months ago
On 26.11.2020 16:53, Julien Grall wrote:
> On 26/11/2020 15:18, Jan Beulich wrote:
>> On 26.11.2020 14:22, Julien Grall wrote:
>>> On 26/11/2020 11:34, Jan Beulich wrote:
>>>> On 25.11.2020 20:48, Stefano Stabellini wrote:
>>>>> On Wed, 25 Nov 2020, Jan Beulich wrote:
>>>>>> On 25.11.2020 13:15, Julien Grall wrote:
>>>>>>> On 23/11/2020 14:23, Jan Beulich wrote:
>>>>>>>> I'm unconvinced of the mentioning of "physically contiguous" in the
>>>>>>>> comment at the top of the new header: I don't think xmalloc() provides
>>>>>>>> such a guarantee. Any use assuming so would look (latently) broken to
>>>>>>>> me.
>>>>>>>
>>>>>>> I haven't had the chance to reply to the first version about this. So I
>>>>>>> will reply here to avoid confusion.
>>>>>>>
>>>>>>> I can at least spot one user in Arm that would use xmalloc() that way
>>>>>>> (see the allocation of itt_addr in arch/arm/gic-v3-its.c).
>>>>>>
>>>>>> And I surely wouldn't have spotted this, even if I had tried
>>>>>> to find "offenders", i.e. as said before not wanting to alter
>>>>>> the behavior of existing code (beyond the explicit changes
>>>>>> done here) was ...
>>>>>>
>>>>>>> AFAIK, the memory is for the sole purpose of the ITS and should not be
>>>>>>> accessed by Xen. So I think we can replace it with a new version of
>>>>>>> alloc_domheap_pages().
>>>>>>>
>>>>>>> However, I still question the usefulness of introducing yet another way
>>>>>>> to allocate memory (we already have alloc_xenheap_pages(), xmalloc(),
>>>>>>> alloc_domheap_pages(), vmap()) if you think users cannot rely on
>>>>>>> xmalloc() to allocate memory physically contiguous.
>>>>>>
>>>>>> ... the reason to introduce a separate new interface. Plus of
>>>>>> course this parallels what Linux has.
>>>>>>
>>>>>>> It definitely makes it more difficult to figure out when to use xmalloc()
>>>>>>> vs xvalloc().
>>>>>>
>>>>>> I don't see the difficulty:
>>>>>> - if you need physically contiguous memory, use alloc_xen*_pages(),
>>>>>> - if you know the allocation size is always no more than a page,
>>>>>>     use xmalloc(),
>>>
>>> If that's the intention, then may I ask why xmalloc() is able to
>>> support multi-page allocations?
>>
>> Because support for this pre-dates even the introduction of vmalloc()?
> 
> Right, so the code should disappear if we want people to avoid making 
> that assumption with xmalloc().

You mean once all users of xmalloc() needing more than a page have
disappeared? We can't drop that code any earlier.

>>> Your assumption is Xen will always be built with the same page size
>>> across all the architecture. While Xen only works with 4KB pages today,
>>> Arm can support 16KB and 64KB. I have long term plan to add support for it.
>>>
>>> So I don't think you can use the page size as a way to distinguish which
>>> one to use.
>>
>> Then let's abstract this one level further
>>
>> - if you know the allocation size is always no more than the smallest
>>    possible page size, use xmalloc()
> 
> So basically, xmalloc() is becoming pointless when xvmalloc() can do the 
> same for you (as it would call xmalloc()).

At the expense of one extra call layer. I still think directly
calling xmalloc() for small blocks of memory is a sensible
thing to do. So no, I don't view it as becoming pointless.

>>>>> What if you need memory physically contiguous but not necessarily an
>>>>> order of pages, such as for instance 5200 bytes?
>>>>
>>>> This case is, I think, rare enough (in particular in Xen) that the
>>>> waste of space can be tolerated imo.
>>>
>>> This is quite a departure from:
>>>
>>> commit b829a0ff5794ee5b0f96a0c872f6a4ed7b1007c7
>>> Author: Jan Beulich <jbeulich@suse.com>
>>> Date:   Thu Oct 13 10:03:43 2011 +0200
>>>
>>>       xmalloc: return unused full pages on multi-page allocations
>>>
>>>       Certain (boot time) allocations are relatively large (particularly when
>>>       building with high NR_CPUS), but can also happen to be pretty far away
>>>       from a power-of-two size. Utilize the page allocator's (other than
>>>       Linux'es) capability of allowing to return space from higher-order
>>>       allocations in smaller pieces to return the unused parts immediately.
>>>
>>>       Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>       Acked-by: Keir Fraser <keir@xen.org>
>>>
>>> I am curious to know what changed...
>>
>> Nothing. But even if something had, citing a 9 year old commit is
>> not likely to point out any actual contradiction.
> 
> You can't say it is tolerable now when in the past you suggested that it
> was not (otherwise why would you hand back memory?).
> 
> Therefore I would like to understand why in the past this was not 
> tolerable but now it is... What changed?

Oh, so it looks like I misunderstood, taking together this and ...

>>> Anyway, what you wrote is very server focused. On Arm, we have plan to
>>> run Xen on smaller hardware where wasting memory mean less usable RAM
>>> for guests.
>>>
>>> The problem with using an order is that the bigger the order is, the more
>>> chance you will waste space...
>>>
>>> Allocating more than a page is fairly common on Arm, so we really want
>>> to reduce the amount of memory wasted.
>>
>> The amount of space wasted is the same - the tail of the trailing
>> page. I'm afraid I don't see what your point is.
> 
> There would obviously be no difference if one wants to allocate more 
> than one page but less than 2 pages....
> 
> But that was not my point. My point is that when you allocate with an
> order greater than or equal to 2, you will start wasting memory when not
> using xmalloc().
> 
> For instance, if you want to allocate 20kB then you will need to 
> allocate 32kB and lose 12kB. To make it sound bigger, you could replace 
> kB by MB.

... this. You're asking why the order based alloc_*heap_pages()
induced waste is acceptable? It's not. But that's not the point
here. vmalloc() doesn't incur such a waste, and as was said
earlier perhaps we should gain a count-based page allocator
interface (which could simply be the code from xmalloc() broken
out). But all of this is mostly unrelated to the change here,
not least because call sites caring to avoid the waste can
easily return what they don't need. It is again a property of
the current xmalloc() implementation, but not a guarantee, that
such returning happens.
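
For illustration, the 20 kB example quoted above works out like this with
an order-based interface on 4 kB pages. This is a sketch only, merely to
show the arithmetic, using the existing get_order_from_bytes() helper:

    /* Order-based rounding of a 20 kB request, assuming 4 kB pages. */
    size_t       want  = 20 * 1024;
    unsigned int order = get_order_from_bytes(want);  /* 3 */
    size_t       got   = PAGE_SIZE << order;          /* 32 kB */
    size_t       waste = got - want;                  /* 12 kB unless returned */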

>>> However, you seem to be the one objecting to the behavior of xmalloc().
>>
>> I don't think I'm objecting to any existing behavior. What I did
>> is state my view on (non-)guarantees by xmalloc(). And I've
>> already said - maybe I'm wrong and, like Linux'es kmalloc(),
>> there is a guarantee of it producing contiguous memory, and I
>> merely didn't find where that's said.
> 
> I can find quite a few places in Linux that use kmalloc() with sizes
> that are bigger than the page size.
> 
> That's enough for me to think that while this may not have been the 
> original intention, people are using it like that (same in Xen). So we 
> can't dismiss it.

Once there are such uses, replacing them takes (often a lot of)
time. So unless it's clear they intentionally use kmalloc() despite
the big size, I'd generally suspect people simply didn't know or
care to use vmalloc(). It is also quite common for allocation
sizes to cross the page size boundary without the person doing
the "offending" change actually noticing.

>>> I can't speak for Stefano, but I don't object to following Linux.
>>> Instead I am objecting to the growing number of ways to allocate memory
>>> in Xen, which differ depending on the system_state.
>>
>> But as per above the addition only brings us on par with Linux.
>> There, kvmalloc_node() is simply a wrapper (with different logic
>> when to try what) around kmalloc_node() and __vmalloc_node(). No
>> different (in the basic idea) from what I'm doing here.
> 
> There are at least two more in Xen so far:
> 
>   - alloc_domheap_pages()
>   - alloc_xenheap_pages()

Which have an equivalent in Linux, too.

> I still maintain that the way you suggest to use each of them is not
> clear. In particular, that xmalloc() doesn't guarantee physically
> contiguous allocations for allocations larger than a page.
> 
> In summary, I would be happy with the introduction of xvmalloc() as long 
> as we guarantee that xmalloc() is allocating physically contiguous memory.

So the main reason why I think xmalloc() and kmalloc() can be
considered different in this regard is that we have basically no
device drivers in Xen, and hence there's pretty limited need for
physically contiguous memory (outside of cases going straight to
the page allocator anyway).

Yet once again - that's my view on it, and the original intentions
may have been different.

All I want is code that allows me to avoid, at call sites, having
to explicitly decide whether I want xmalloc() or vmalloc().
The grant table code changed here is one good example. The uses
of the new functions in subsequent patches are further ones. I
do specifically _not_ care to replace any existing functionality.
I only want these library-like helpers. Optimization and code
folding could come later.
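
Roughly, the dispatch at the heart of these helpers can be pictured as
below. This is a sketch only, not the actual patch 02 code, and
is_vmalloc_addr() is merely a placeholder for however the real
implementation tells the two kinds of allocation apart:

    /* Sketch of the size-based dispatch; not the real xvmalloc() code. */
    void *_xvmalloc(size_t size, unsigned int align)
    {
        ASSERT(align <= PAGE_SIZE);
        return size <= PAGE_SIZE ? _xmalloc(size, align) : vmalloc(size);
    }

    void xvfree(void *va)
    {
        if ( is_vmalloc_addr(va) )   /* placeholder predicate */
            vfree(va);
        else
            xfree(va);
    }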

> Users that don't care about this guarantee would have to be switched
> to use xvmalloc() (this doesn't need to be done here).

And that's what I dislike: I don't see any "have to" here. I view
"have to" as applicable only to runtime allocations which may end
up being non-order-0 once they hit the underlying page allocator.
Don't forget that vmalloc(), besides the memory to be allocated,
also consumes VA space. xmalloc() doesn't as long as we have a
directmap. Hence at least during boot there may be good reasons
why someone prefers to use xmalloc() even for larger bodies of
memory.

Jan

[PATCH v2 03/17] x86/xstate: use xvzalloc() for save area allocation
Posted by Jan Beulich 3 years, 4 months ago
This is in preparation for the area size exceeding a page's worth of
space, as will happen with AMX as well as Architectural LBR.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -8,6 +8,7 @@
 #include <xen/param.h>
 #include <xen/percpu.h>
 #include <xen/sched.h>
+#include <xen/xvmalloc.h>
 #include <asm/current.h>
 #include <asm/processor.h>
 #include <asm/hvm/support.h>
@@ -522,7 +523,7 @@ int xstate_alloc_save_area(struct vcpu *
 
     /* XSAVE/XRSTOR requires the save area be 64-byte-boundary aligned. */
     BUILD_BUG_ON(__alignof(*save_area) < 64);
-    save_area = _xzalloc(size, __alignof(*save_area));
+    save_area = _xvzalloc(size, __alignof(*save_area));
     if ( save_area == NULL )
         return -ENOMEM;
 
@@ -543,8 +544,7 @@ int xstate_alloc_save_area(struct vcpu *
 
 void xstate_free_save_area(struct vcpu *v)
 {
-    xfree(v->arch.xsave_area);
-    v->arch.xsave_area = NULL;
+    XVFREE(v->arch.xsave_area);
 }
 
 static unsigned int _xstate_ctxt_size(u64 xcr0)


[PATCH v2 04/17] x86/xstate: re-size save area when CPUID policy changes
Posted by Jan Beulich 3 years, 4 months ago
vCPU-s get maximum size areas allocated initially. Hidden (and in
particular default-off) features may allow for a smaller size area to
suffice.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Use 1ul instead of 1ull.
---
This could be further shrunk if we used XSAVEC / if we really used
XSAVES, as then we don't need to also cover the holes. But since we
currently use neither of the two in reality, this would require more
work than just adding the alternative size calculation here.

Seeing that both vcpu_init_fpu() and cpuid_policy_updated() get called
from arch_vcpu_create(), I'm not sure we really need this two-stage
approach - the slightly longer period of time during which
v->arch.xsave_area would remain NULL doesn't look all that problematic.
But since xstate_alloc_save_area() gets called for idle vCPU-s, it has
to stay anyway in some form, so the extra code churn may not be worth
it.

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -294,7 +294,21 @@ void update_guest_memory_policy(struct v
     }
 }
 
-void domain_cpu_policy_changed(struct domain *d)
+/*
+ * Called during vcpu construction, and each time the toolstack changes the
+ * CPUID configuration for the domain.
+ */
+static int __must_check cpuid_policy_updated(struct vcpu *v)
+{
+    int rc = xstate_update_save_area(v);
+
+    if ( !rc && is_hvm_vcpu(v) )
+        hvm_cpuid_policy_changed(v);
+
+    return rc;
+}
+
+int domain_cpu_policy_changed(struct domain *d)
 {
     const struct cpuid_policy *p = d->arch.cpuid;
     struct vcpu *v;
@@ -452,13 +466,18 @@ void domain_cpu_policy_changed(struct do
 
     for_each_vcpu ( d, v )
     {
-        cpuid_policy_updated(v);
+        int rc = cpuid_policy_updated(v);
+
+        if ( rc )
+            return rc;
 
         /* If PMU version is zero then the guest doesn't have VPMU */
         if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
              p->basic.pmu_version == 0 )
             vpmu_destroy(v);
     }
+
+    return 0;
 }
 
 #ifndef CONFIG_BIGMEM
@@ -597,7 +616,7 @@ int arch_vcpu_create(struct vcpu *v)
     {
         vpmu_initialise(v);
 
-        cpuid_policy_updated(v);
+        rc = cpuid_policy_updated(v);
     }
 
     return rc;
@@ -841,9 +860,9 @@ int arch_domain_create(struct domain *d,
      */
     d->arch.x87_fip_width = cpu_has_fpu_sel ? 0 : 8;
 
-    domain_cpu_policy_changed(d);
-
-    return 0;
+    rc = domain_cpu_policy_changed(d);
+    if ( !rc )
+        return 0;
 
  fail:
     d->is_dying = DOMDYING_dead;
@@ -2434,16 +2453,6 @@ int domain_relinquish_resources(struct d
     return 0;
 }
 
-/*
- * Called during vcpu construction, and each time the toolstack changes the
- * CPUID configuration for the domain.
- */
-void cpuid_policy_updated(struct vcpu *v)
-{
-    if ( is_hvm_vcpu(v) )
-        hvm_cpuid_policy_changed(v);
-}
-
 void arch_dump_domain_info(struct domain *d)
 {
     paging_dump_domain_info(d);
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -91,7 +91,7 @@ static int update_domain_cpu_policy(stru
     recalculate_cpuid_policy(d);
 
     /* Recalculate relevant dom/vcpu state now the policy has changed. */
-    domain_cpu_policy_changed(d);
+    ret = domain_cpu_policy_changed(d);
 
  out:
     /* Free whichever cpuid/msr structs are not installed in struct domain. */
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -541,6 +541,41 @@ int xstate_alloc_save_area(struct vcpu *
 
     return 0;
 }
+
+int xstate_update_save_area(struct vcpu *v)
+{
+    unsigned int i, size, old;
+    struct xsave_struct *save_area;
+    uint64_t xcr0_max = cpuid_policy_xcr0_max(v->domain->arch.cpuid);
+
+    ASSERT(!is_idle_vcpu(v));
+
+    if ( !cpu_has_xsave )
+        return 0;
+
+    if ( v->arch.xcr0_accum & ~xcr0_max )
+        return -EBUSY;
+
+    for ( size = old = XSTATE_AREA_MIN_SIZE, i = 2; i < xstate_features; ++i )
+    {
+        if ( xcr0_max & (1ul << i) )
+            size = max(size, xstate_offsets[i] + xstate_sizes[i]);
+        if ( v->arch.xcr0_accum & (1ul << i) )
+            old = max(old, xstate_offsets[i] + xstate_sizes[i]);
+    }
+
+    save_area = _xvrealloc(v->arch.xsave_area, size, __alignof(*save_area));
+    if ( !save_area )
+        return -ENOMEM;
+
+    ASSERT(old <= size);
+    memset((void *)save_area + old, 0, size - old);
+
+    v->arch.xsave_area = save_area;
+    v->arch.fpu_ctxt = &v->arch.xsave_area->fpu_sse;
+
+    return 0;
+}
 
 void xstate_free_save_area(struct vcpu *v)
 {
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -78,8 +78,6 @@ void toggle_guest_mode(struct vcpu *);
 /* x86/64: toggle guest page tables between kernel and user modes. */
 void toggle_guest_pt(struct vcpu *);
 
-void cpuid_policy_updated(struct vcpu *v);
-
 /*
  * Initialise a hypercall-transfer page. The given pointer must be mapped
  * in Xen virtual address space (accesses are not validated or checked).
@@ -667,7 +665,7 @@ struct guest_memory_policy
 void update_guest_memory_policy(struct vcpu *v,
                                 struct guest_memory_policy *policy);
 
-void domain_cpu_policy_changed(struct domain *d);
+int __must_check domain_cpu_policy_changed(struct domain *d);
 
 bool update_runstate_area(struct vcpu *);
 bool update_secondary_system_time(struct vcpu *,
--- a/xen/include/asm-x86/xstate.h
+++ b/xen/include/asm-x86/xstate.h
@@ -106,6 +106,7 @@ void compress_xsave_states(struct vcpu *
 /* extended state init and cleanup functions */
 void xstate_free_save_area(struct vcpu *v);
 int xstate_alloc_save_area(struct vcpu *v);
+int xstate_update_save_area(struct vcpu *v);
 void xstate_init(struct cpuinfo_x86 *c);
 unsigned int xstate_ctxt_size(u64 xcr0);
 


[PATCH v2 05/17] x86/xstate: re-use valid_xcr0() for boot-time checks
Posted by Jan Beulich 3 years, 4 months ago
Instead of (just partially) open-coding it, re-use the function after
suitably moving it up.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -611,6 +611,34 @@ unsigned int xstate_ctxt_size(u64 xcr0)
     return _xstate_ctxt_size(xcr0);
 }
 
+static bool valid_xcr0(uint64_t xcr0)
+{
+    /* FP must be unconditionally set. */
+    if ( !(xcr0 & X86_XCR0_FP) )
+        return false;
+
+    /* YMM depends on SSE. */
+    if ( (xcr0 & X86_XCR0_YMM) && !(xcr0 & X86_XCR0_SSE) )
+        return false;
+
+    if ( xcr0 & (X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM) )
+    {
+        /* OPMASK, ZMM, and HI_ZMM require YMM. */
+        if ( !(xcr0 & X86_XCR0_YMM) )
+            return false;
+
+        /* OPMASK, ZMM, and HI_ZMM must be the same. */
+        if ( ~xcr0 & (X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM) )
+            return false;
+    }
+
+    /* BNDREGS and BNDCSR must be the same. */
+    if ( !(xcr0 & X86_XCR0_BNDREGS) != !(xcr0 & X86_XCR0_BNDCSR) )
+        return false;
+
+    return true;
+}
+
 /* Collect the information of processor's extended state */
 void xstate_init(struct cpuinfo_x86 *c)
 {
@@ -646,10 +674,9 @@ void xstate_init(struct cpuinfo_x86 *c)
     }
 
     cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
-
-    BUG_ON((eax & XSTATE_FP_SSE) != XSTATE_FP_SSE);
-    BUG_ON((eax & X86_XCR0_YMM) && !(eax & X86_XCR0_SSE));
     feature_mask = (((u64)edx << 32) | eax) & XCNTXT_MASK;
+    BUG_ON(!valid_xcr0(feature_mask));
+    BUG_ON(!(feature_mask & X86_XCR0_SSE));
 
     /*
      * Set CR4_OSXSAVE and run "cpuid" to get xsave_cntxt_size.
@@ -679,31 +706,6 @@ void xstate_init(struct cpuinfo_x86 *c)
         BUG();
 }
 
-static bool valid_xcr0(u64 xcr0)
-{
-    /* FP must be unconditionally set. */
-    if ( !(xcr0 & X86_XCR0_FP) )
-        return false;
-
-    /* YMM depends on SSE. */
-    if ( (xcr0 & X86_XCR0_YMM) && !(xcr0 & X86_XCR0_SSE) )
-        return false;
-
-    if ( xcr0 & (X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM) )
-    {
-        /* OPMASK, ZMM, and HI_ZMM require YMM. */
-        if ( !(xcr0 & X86_XCR0_YMM) )
-            return false;
-
-        /* OPMASK, ZMM, and HI_ZMM must be the same. */
-        if ( ~xcr0 & (X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM) )
-            return false;
-    }
-
-    /* BNDREGS and BNDCSR must be the same. */
-    return !(xcr0 & X86_XCR0_BNDREGS) == !(xcr0 & X86_XCR0_BNDCSR);
-}
-
 int validate_xstate(const struct domain *d, uint64_t xcr0, uint64_t xcr0_accum,
                     const struct xsave_hdr *hdr)
 {


[PATCH v2 06/17] x86/xstate: drop xstate_offsets[] and xstate_sizes[]
Posted by Jan Beulich 3 years, 4 months ago
They're redundant with respective fields from the raw CPUID policy; no
need to keep two copies of the same data. This also breaks
recalculate_xstate()'s dependency on xstate_init(), allowing host CPUID
policy calculation to be moved together with that of the raw one (which
a subsequent change will require anyway).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -498,6 +498,8 @@ void identify_cpu(struct cpuinfo_x86 *c)
 	}
 
 	/* Now the feature flags better reflect actual CPU features! */
+	if (c == &boot_cpu_data)
+		init_host_cpuid();
 
 	xstate_init(c);
 
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -167,32 +167,32 @@ static void recalculate_xstate(struct cp
     {
         xstates |= X86_XCR0_YMM;
         xstate_size = max(xstate_size,
-                          xstate_offsets[X86_XCR0_YMM_POS] +
-                          xstate_sizes[X86_XCR0_YMM_POS]);
+                          xstate_offset(X86_XCR0_YMM_POS) +
+                          xstate_size(X86_XCR0_YMM_POS));
     }
 
     if ( p->feat.mpx )
     {
         xstates |= X86_XCR0_BNDREGS | X86_XCR0_BNDCSR;
         xstate_size = max(xstate_size,
-                          xstate_offsets[X86_XCR0_BNDCSR_POS] +
-                          xstate_sizes[X86_XCR0_BNDCSR_POS]);
+                          xstate_offset(X86_XCR0_BNDCSR_POS) +
+                          xstate_size(X86_XCR0_BNDCSR_POS));
     }
 
     if ( p->feat.avx512f )
     {
         xstates |= X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM;
         xstate_size = max(xstate_size,
-                          xstate_offsets[X86_XCR0_HI_ZMM_POS] +
-                          xstate_sizes[X86_XCR0_HI_ZMM_POS]);
+                          xstate_offset(X86_XCR0_HI_ZMM_POS) +
+                          xstate_size(X86_XCR0_HI_ZMM_POS));
     }
 
     if ( p->feat.pku )
     {
         xstates |= X86_XCR0_PKRU;
         xstate_size = max(xstate_size,
-                          xstate_offsets[X86_XCR0_PKRU_POS] +
-                          xstate_sizes[X86_XCR0_PKRU_POS]);
+                          xstate_offset(X86_XCR0_PKRU_POS) +
+                          xstate_size(X86_XCR0_PKRU_POS));
     }
 
     p->xstate.max_size  =  xstate_size;
@@ -215,8 +215,8 @@ static void recalculate_xstate(struct cp
         if ( !(xstates & curr_xstate) )
             continue;
 
-        p->xstate.comp[i].size   = xstate_sizes[i];
-        p->xstate.comp[i].offset = xstate_offsets[i];
+        p->xstate.comp[i].size   = xstate_size(i);
+        p->xstate.comp[i].offset = xstate_offset(i);
         p->xstate.comp[i].xss    = curr_xstate & XSTATE_XSAVES_ONLY;
         p->xstate.comp[i].align  = curr_xstate & xstate_align;
     }
@@ -512,10 +512,16 @@ static void __init calculate_hvm_def_pol
     recalculate_xstate(p);
 }
 
-void __init init_guest_cpuid(void)
+void __init init_host_cpuid(void)
 {
     calculate_raw_policy();
     calculate_host_policy();
+}
+
+void __init init_guest_cpuid(void)
+{
+    /* Do this a 2nd time to account for setup_{clear,force}_cpu_cap() uses. */
+    calculate_host_policy();
 
     if ( IS_ENABLED(CONFIG_PV) )
     {
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -9,6 +9,7 @@
 #include <xen/percpu.h>
 #include <xen/sched.h>
 #include <xen/xvmalloc.h>
+#include <asm/cpuid.h>
 #include <asm/current.h>
 #include <asm/processor.h>
 #include <asm/hvm/support.h>
@@ -26,8 +27,6 @@ static u32 __read_mostly xsave_cntxt_siz
 /* A 64-bit bitmask of the XSAVE/XRSTOR features supported by processor. */
 u64 __read_mostly xfeature_mask;
 
-unsigned int *__read_mostly xstate_offsets;
-unsigned int *__read_mostly xstate_sizes;
 u64 __read_mostly xstate_align;
 static unsigned int __read_mostly xstate_features;
 
@@ -93,34 +92,19 @@ static int setup_xstate_features(bool bs
     unsigned int leaf, eax, ebx, ecx, edx;
 
     if ( bsp )
-    {
         xstate_features = flsl(xfeature_mask);
-        xstate_offsets = xzalloc_array(unsigned int, xstate_features);
-        if ( !xstate_offsets )
-            return -ENOMEM;
-
-        xstate_sizes = xzalloc_array(unsigned int, xstate_features);
-        if ( !xstate_sizes )
-            return -ENOMEM;
-    }
 
     for ( leaf = 2; leaf < xstate_features; leaf++ )
     {
-        if ( bsp )
-        {
-            cpuid_count(XSTATE_CPUID, leaf, &xstate_sizes[leaf],
-                        &xstate_offsets[leaf], &ecx, &edx);
-            if ( ecx & XSTATE_ALIGN64 )
-                __set_bit(leaf, &xstate_align);
-        }
+        cpuid_count(XSTATE_CPUID, leaf, &eax,
+                    &ebx, &ecx, &edx);
+        BUG_ON(eax != xstate_size(leaf));
+        BUG_ON(ebx != xstate_offset(leaf));
+
+        if ( bsp && (ecx & XSTATE_ALIGN64) )
+            __set_bit(leaf, &xstate_align);
         else
-        {
-            cpuid_count(XSTATE_CPUID, leaf, &eax,
-                        &ebx, &ecx, &edx);
-            BUG_ON(eax != xstate_sizes[leaf]);
-            BUG_ON(ebx != xstate_offsets[leaf]);
             BUG_ON(!(ecx & XSTATE_ALIGN64) != !test_bit(leaf, &xstate_align));
-        }
     }
 
     return 0;
@@ -150,7 +134,7 @@ static void setup_xstate_comp(uint16_t *
             if ( test_bit(i, &xstate_align) )
                 offset = ROUNDUP(offset, 64);
             comp_offsets[i] = offset;
-            offset += xstate_sizes[i];
+            offset += xstate_size(i);
         }
     }
     ASSERT(offset <= xsave_cntxt_size);
@@ -213,10 +197,10 @@ void expand_xsave_states(struct vcpu *v,
          * comp_offsets[] information, something is very broken.
          */
         BUG_ON(!comp_offsets[index]);
-        BUG_ON((xstate_offsets[index] + xstate_sizes[index]) > size);
+        BUG_ON((xstate_offset(index) + xstate_size(index)) > size);
 
-        memcpy(dest + xstate_offsets[index], src + comp_offsets[index],
-               xstate_sizes[index]);
+        memcpy(dest + xstate_offset(index), src + comp_offsets[index],
+               xstate_size(index));
 
         valid &= ~feature;
     }
@@ -279,10 +263,10 @@ void compress_xsave_states(struct vcpu *
          * comp_offset[] information, something is very broken.
          */
         BUG_ON(!comp_offsets[index]);
-        BUG_ON((xstate_offsets[index] + xstate_sizes[index]) > size);
+        BUG_ON((xstate_offset(index) + xstate_size(index)) > size);
 
-        memcpy(dest + comp_offsets[index], src + xstate_offsets[index],
-               xstate_sizes[index]);
+        memcpy(dest + comp_offsets[index], src + xstate_offset(index),
+               xstate_size(index));
 
         valid &= ~feature;
     }
@@ -516,8 +500,8 @@ int xstate_alloc_save_area(struct vcpu *
         unsigned int i;
 
         for ( size = 0, i = 2; i < xstate_features; ++i )
-            if ( size < xstate_sizes[i] )
-                size = xstate_sizes[i];
+            if ( size < xstate_size(i) )
+                size = xstate_size(i);
         size += XSTATE_AREA_MIN_SIZE;
     }
 
@@ -560,9 +544,9 @@ int xstate_update_save_area(struct vcpu
     for ( size = old = XSTATE_AREA_MIN_SIZE, i = 2; i < xstate_features; ++i )
     {
         if ( xcr0_max & (1ul << i) )
-            size = max(size, xstate_offsets[i] + xstate_sizes[i]);
+            size = max(size, xstate_offset(i) + xstate_size(i));
         if ( v->arch.xcr0_accum & (1ul << i) )
-            old = max(old, xstate_offsets[i] + xstate_sizes[i]);
+            old = max(old, xstate_offset(i) + xstate_size(i));
     }
 
     save_area = _xvrealloc(v->arch.xsave_area, size, __alignof(*save_area));
@@ -821,7 +805,7 @@ uint64_t read_bndcfgu(void)
               : "=m" (*xstate)
               : "a" (X86_XCR0_BNDCSR), "d" (0), "D" (xstate) );
 
-        bndcsr = (void *)xstate + xstate_offsets[X86_XCR0_BNDCSR_POS];
+        bndcsr = (void *)xstate + xstate_offset(X86_XCR0_BNDCSR_POS);
     }
 
     if ( cr0 & X86_CR0_TS )
--- a/xen/include/asm-x86/cpuid.h
+++ b/xen/include/asm-x86/cpuid.h
@@ -16,6 +16,7 @@
 extern const uint32_t known_features[FSCAPINTS];
 extern const uint32_t special_features[FSCAPINTS];
 
+void init_host_cpuid(void);
 void init_guest_cpuid(void);
 
 /*
--- a/xen/include/asm-x86/xstate.h
+++ b/xen/include/asm-x86/xstate.h
@@ -44,8 +44,9 @@ extern uint32_t mxcsr_mask;
 
 extern u64 xfeature_mask;
 extern u64 xstate_align;
-extern unsigned int *xstate_offsets;
-extern unsigned int *xstate_sizes;
+
+#define xstate_offset(n) (raw_cpuid_policy.xstate.comp[n].offset)
+#define xstate_size(n)   (raw_cpuid_policy.xstate.comp[n].size)
 
 /* extended state save area */
 struct __attribute__((aligned (64))) xsave_struct


[PATCH v2 07/17] x86/xstate: replace xsave_cntxt_size and drop XCNTXT_MASK
Posted by Jan Beulich 3 years, 4 months ago
XCNTXT_MASK is effectively embedded in recalculate_xstate(), and
xsave_cntxt_size was redundant with the host CPUID policy's
xstate.max_size field.

Use the host CPUID policy as input (requiring it to be calculated
earlier), thus allowing e.g. "cpuid=no-avx512f" to also result in
avoiding allocation of space for ZMM and mask register state.

Also drop a stale part of an adjacent comment.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -20,9 +20,10 @@
 /*
  * Maximum size (in byte) of the XSAVE/XRSTOR save area required by all
  * the supported and enabled features on the processor, including the
- * XSAVE.HEADER. We only enable XCNTXT_MASK that we have known.
+ * XSAVE.HEADER. We only enable cpuid_policy_xcr0_max(&host_cpuid_policy).
+ * Note that this identifier should not be usable as an lvalue.
  */
-static u32 __read_mostly xsave_cntxt_size;
+#define xsave_cntxt_size (host_cpuid_policy.xstate.max_size | 0)
 
 /* A 64-bit bitmask of the XSAVE/XRSTOR features supported by processor. */
 u64 __read_mostly xfeature_mask;
@@ -577,8 +578,23 @@ static unsigned int _xstate_ctxt_size(u6
     ASSERT(ok);
     cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
     ASSERT(ebx <= ecx);
-    ok = set_xcr0(act_xcr0);
-    ASSERT(ok);
+
+    /*
+     * When called the very first time from xstate_init(), act_xcr0 (as read
+     * from per-CPU data) is still zero. xstate_init() wants this function to
+     * leave xfeature_mask in place, so avoid restoration in this case (which
+     * would fail anyway).
+     */
+    if ( act_xcr0 )
+    {
+        ok = set_xcr0(act_xcr0);
+        ASSERT(ok);
+    }
+    else
+    {
+        BUG_ON(!ok);
+        ASSERT(xcr0 == xfeature_mask);
+    }
 
     return ebx;
 }
@@ -650,42 +666,35 @@ void xstate_init(struct cpuinfo_x86 *c)
         return;
 
     if ( (bsp && !use_xsave) ||
-         boot_cpu_data.cpuid_level < XSTATE_CPUID )
+         c->cpuid_level < XSTATE_CPUID )
     {
         BUG_ON(!bsp);
         setup_clear_cpu_cap(X86_FEATURE_XSAVE);
         return;
     }
 
-    cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
-    feature_mask = (((u64)edx << 32) | eax) & XCNTXT_MASK;
-    BUG_ON(!valid_xcr0(feature_mask));
-    BUG_ON(!(feature_mask & X86_XCR0_SSE));
-
-    /*
-     * Set CR4_OSXSAVE and run "cpuid" to get xsave_cntxt_size.
-     */
-    set_in_cr4(X86_CR4_OSXSAVE);
-    if ( !set_xcr0(feature_mask) )
-        BUG();
-
     if ( bsp )
     {
+        feature_mask = cpuid_policy_xcr0_max(&host_cpuid_policy);
+        BUG_ON(!valid_xcr0(feature_mask));
+        BUG_ON(!(feature_mask & X86_XCR0_SSE));
+
         xfeature_mask = feature_mask;
-        /*
-         * xsave_cntxt_size is the max size required by enabled features.
-         * We know FP/SSE and YMM about eax, and nothing about edx at present.
-         */
-        xsave_cntxt_size = _xstate_ctxt_size(feature_mask);
+        /* xsave_cntxt_size is the max size required by enabled features. */
         printk("xstate: size: %#x and states: %#"PRIx64"\n",
-               xsave_cntxt_size, xfeature_mask);
-    }
-    else
-    {
-        BUG_ON(xfeature_mask != feature_mask);
-        BUG_ON(xsave_cntxt_size != _xstate_ctxt_size(feature_mask));
+               xsave_cntxt_size, feature_mask);
+
+        set_in_cr4(X86_CR4_OSXSAVE);
     }
 
+    cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
+    feature_mask = (((uint64_t)edx << 32) | eax) & xfeature_mask;
+    BUG_ON(xfeature_mask != feature_mask);
+
+    /* This has the side effect of set_xcr0(feature_mask). */
+    if ( xsave_cntxt_size != _xstate_ctxt_size(feature_mask) )
+        BUG();
+
     if ( setup_xstate_features(bsp) && bsp )
         BUG();
 }
--- a/xen/include/asm-x86/xstate.h
+++ b/xen/include/asm-x86/xstate.h
@@ -30,9 +30,6 @@ extern uint32_t mxcsr_mask;
 #define XSTATE_AREA_MIN_SIZE      (FXSAVE_SIZE + XSAVE_HDR_SIZE)
 
 #define XSTATE_FP_SSE  (X86_XCR0_FP | X86_XCR0_SSE)
-#define XCNTXT_MASK    (X86_XCR0_FP | X86_XCR0_SSE | X86_XCR0_YMM | \
-                        X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM | \
-                        XSTATE_NONLAZY)
 
 #define XSTATE_ALL     (~(1ULL << 63))
 #define XSTATE_NONLAZY (X86_XCR0_BNDREGS | X86_XCR0_BNDCSR | X86_XCR0_PKRU)


[PATCH v2 08/17] x86/xstate: avoid accounting for unsupported components
Posted by Jan Beulich 3 years, 4 months ago
There's no point in including unsupported components in the size
calculations of xstate_{alloc,update}_save_area().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -501,8 +501,12 @@ int xstate_alloc_save_area(struct vcpu *
         unsigned int i;
 
         for ( size = 0, i = 2; i < xstate_features; ++i )
+        {
+            if ( !(xfeature_mask & (1ul << i)) )
+                continue;
             if ( size < xstate_size(i) )
                 size = xstate_size(i);
+        }
         size += XSTATE_AREA_MIN_SIZE;
     }
 
@@ -544,6 +548,8 @@ int xstate_update_save_area(struct vcpu
 
     for ( size = old = XSTATE_AREA_MIN_SIZE, i = 2; i < xstate_features; ++i )
     {
+        if ( !(xfeature_mask & (1ul << i)) )
+            continue;
         if ( xcr0_max & (1ul << i) )
             size = max(size, xstate_offset(i) + xstate_size(i));
         if ( v->arch.xcr0_accum & (1ul << i) )


[PATCH v2 09/17] x86: use xvmalloc() for extended context buffer allocations
Posted by Jan Beulich 3 years, 4 months ago
This is in preparation for the buffer sizes exceeding a page's worth of
space, as will happen with AMX as well as Architectural LBR.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -30,6 +30,7 @@
 #include <xsm/xsm.h>
 #include <xen/iommu.h>
 #include <xen/vm_event.h>
+#include <xen/xvmalloc.h>
 #include <public/vm_event.h>
 #include <asm/mem_sharing.h>
 #include <asm/xstate.h>
@@ -331,7 +332,7 @@ long arch_do_domctl(
             goto sethvmcontext_out;
 
         ret = -ENOMEM;
-        if ( (c.data = xmalloc_bytes(c.size)) == NULL )
+        if ( (c.data = xvmalloc_bytes(c.size)) == NULL )
             goto sethvmcontext_out;
 
         ret = -EFAULT;
@@ -343,7 +344,7 @@ long arch_do_domctl(
         domain_unpause(d);
 
     sethvmcontext_out:
-        xfree(c.data);
+        xvfree(c.data);
         break;
     }
 
@@ -373,7 +374,7 @@ long arch_do_domctl(
 
         /* Allocate our own marshalling buffer */
         ret = -ENOMEM;
-        if ( (c.data = xmalloc_bytes(c.size)) == NULL )
+        if ( (c.data = xvmalloc_bytes(c.size)) == NULL )
             goto gethvmcontext_out;
 
         domain_pause(d);
@@ -386,7 +387,7 @@ long arch_do_domctl(
 
     gethvmcontext_out:
         copyback = true;
-        xfree(c.data);
+        xvfree(c.data);
         break;
     }
 
@@ -904,7 +905,7 @@ long arch_do_domctl(
             if ( !ret && size > PV_XSAVE_HDR_SIZE )
             {
                 unsigned int xsave_size = size - PV_XSAVE_HDR_SIZE;
-                void *xsave_area = xmalloc_bytes(xsave_size);
+                void *xsave_area = xvmalloc_bytes(xsave_size);
 
                 if ( !xsave_area )
                 {
@@ -918,7 +919,7 @@ long arch_do_domctl(
                 if ( copy_to_guest_offset(evc->buffer, offset, xsave_area,
                                           xsave_size) )
                      ret = -EFAULT;
-                xfree(xsave_area);
+                xvfree(xsave_area);
            }
 
             vcpu_unpause(v);
@@ -938,7 +939,7 @@ long arch_do_domctl(
                  evc->size > PV_XSAVE_SIZE(xfeature_mask) )
                 goto vcpuextstate_out;
 
-            receive_buf = xmalloc_bytes(evc->size);
+            receive_buf = xvmalloc_bytes(evc->size);
             if ( !receive_buf )
             {
                 ret = -ENOMEM;
@@ -948,7 +949,7 @@ long arch_do_domctl(
                                         offset, evc->size) )
             {
                 ret = -EFAULT;
-                xfree(receive_buf);
+                xvfree(receive_buf);
                 goto vcpuextstate_out;
             }
 
@@ -966,7 +967,7 @@ long arch_do_domctl(
                 ret = 0;
             if ( ret )
             {
-                xfree(receive_buf);
+                xvfree(receive_buf);
                 goto vcpuextstate_out;
             }
 
@@ -994,7 +995,7 @@ long arch_do_domctl(
                 vcpu_unpause(v);
             }
 
-            xfree(receive_buf);
+            xvfree(receive_buf);
         }
 
 #undef PV_XSAVE_HDR_SIZE
--- a/xen/arch/x86/hvm/save.c
+++ b/xen/arch/x86/hvm/save.c
@@ -23,6 +23,7 @@
 #include <xen/guest_access.h>
 #include <xen/softirq.h>
 #include <xen/version.h>
+#include <xen/xvmalloc.h>
 
 #include <asm/hvm/support.h>
 
@@ -154,7 +155,7 @@ int hvm_save_one(struct domain *d, unsig
     else
         v = d->vcpu[instance];
     ctxt.size = hvm_sr_handlers[typecode].size;
-    ctxt.data = xmalloc_bytes(ctxt.size);
+    ctxt.data = xvmalloc_bytes(ctxt.size);
     if ( !ctxt.data )
         return -ENOMEM;
 
@@ -200,7 +201,7 @@ int hvm_save_one(struct domain *d, unsig
     else
         domain_unpause(d);
 
-    xfree(ctxt.data);
+    xvfree(ctxt.data);
     return rv;
 }
 


[PATCH v2 10/17] x86/xstate: enable AMX components
Posted by Jan Beulich 3 years, 4 months ago
These being controlled by XCR0, enabling support is relatively
straightforward. Note however that there won't be any use of them until
their dependent ISA extension CPUID flags get exposed, not the least due
to the way recalculate_xstate() handles the dependencies in kind of a
reverse manner.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/tools/libs/light/libxl_cpuid.c
+++ b/tools/libs/light/libxl_cpuid.c
@@ -219,6 +219,9 @@ int libxl_cpuid_parse_config(libxl_cpuid
         {"md-clear",     0x00000007,  0, CPUID_REG_EDX, 10,  1},
         {"serialize",    0x00000007,  0, CPUID_REG_EDX, 14,  1},
         {"cet-ibt",      0x00000007,  0, CPUID_REG_EDX, 20,  1},
+        {"amx-bf16",     0x00000007,  0, CPUID_REG_EDX, 22,  1},
+        {"amx-tile",     0x00000007,  0, CPUID_REG_EDX, 24,  1},
+        {"amx-int8",     0x00000007,  0, CPUID_REG_EDX, 25,  1},
         {"ibrsb",        0x00000007,  0, CPUID_REG_EDX, 26,  1},
         {"stibp",        0x00000007,  0, CPUID_REG_EDX, 27,  1},
         {"l1d-flush",    0x00000007,  0, CPUID_REG_EDX, 28,  1},
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -167,7 +167,8 @@ static const char *const str_7d0[32] =
 
     [18] = "pconfig",
     [20] = "cet-ibt",
-
+    [22] = "amx-bf16",
+    [24] = "amx-tile",      [25] = "amx-int8",
     [26] = "ibrsb",         [27] = "stibp",
     [28] = "l1d-flush",     [29] = "arch-caps",
     [30] = "core-caps",     [31] = "ssbd",
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -195,6 +195,14 @@ static void recalculate_xstate(struct cp
                           xstate_size(X86_XCR0_PKRU_POS));
     }
 
+    if ( p->feat.amx_tile )
+    {
+        xstates |= X86_XCR0_TILECFG | X86_XCR0_TILEDATA;
+        xstate_size = max(xstate_size,
+                          xstate_offset(X86_XCR0_TILEDATA_POS) +
+                          xstate_size(X86_XCR0_TILEDATA_POS));
+    }
+
     p->xstate.max_size  =  xstate_size;
     p->xstate.xcr0_low  =  xstates & ~XSTATE_XSAVES_ONLY;
     p->xstate.xcr0_high = (xstates & ~XSTATE_XSAVES_ONLY) >> 32;
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -642,6 +642,10 @@ static bool valid_xcr0(uint64_t xcr0)
     if ( !(xcr0 & X86_XCR0_BNDREGS) != !(xcr0 & X86_XCR0_BNDCSR) )
         return false;
 
+    /* TILECFG and TILEDATA must be the same. */
+    if ( !(xcr0 & X86_XCR0_TILECFG) != !(xcr0 & X86_XCR0_TILEDATA) )
+        return false;
+
     return true;
 }
 
--- a/xen/include/asm-x86/x86-defns.h
+++ b/xen/include/asm-x86/x86-defns.h
@@ -96,6 +96,10 @@
 #define X86_XCR0_HI_ZMM           (1ULL << X86_XCR0_HI_ZMM_POS)
 #define X86_XCR0_PKRU_POS         9
 #define X86_XCR0_PKRU             (1ULL << X86_XCR0_PKRU_POS)
+#define X86_XCR0_TILECFG_POS      17
+#define X86_XCR0_TILECFG          (1ULL << X86_XCR0_TILECFG_POS)
+#define X86_XCR0_TILEDATA_POS     18
+#define X86_XCR0_TILEDATA         (1ULL << X86_XCR0_TILEDATA_POS)
 #define X86_XCR0_LWP_POS          62
 #define X86_XCR0_LWP              (1ULL << X86_XCR0_LWP_POS)
 
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -265,6 +265,9 @@ XEN_CPUFEATURE(MD_CLEAR,      9*32+10) /
 XEN_CPUFEATURE(TSX_FORCE_ABORT, 9*32+13) /* MSR_TSX_FORCE_ABORT.RTM_ABORT */
 XEN_CPUFEATURE(SERIALIZE,     9*32+14) /*a  SERIALIZE insn */
 XEN_CPUFEATURE(CET_IBT,       9*32+20) /*   CET - Indirect Branch Tracking */
+XEN_CPUFEATURE(AMX_BF16,      9*32+22) /*   AMX BFloat16 instructions */
+XEN_CPUFEATURE(AMX_TILE,      9*32+24) /*   AMX tile architecture */
+XEN_CPUFEATURE(AMX_INT8,      9*32+25) /*   AMX 8-bit integer instructions */
 XEN_CPUFEATURE(IBRSB,         9*32+26) /*A  IBRS and IBPB support (used by Intel) */
 XEN_CPUFEATURE(STIBP,         9*32+27) /*A  STIBP */
 XEN_CPUFEATURE(L1D_FLUSH,     9*32+28) /*S  MSR_FLUSH_CMD and L1D flush. */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -222,7 +222,7 @@ def crunch_numbers(state):
         # instruction groups which are specified to require XSAVE for state
         # management.
         XSAVE: [XSAVEOPT, XSAVEC, XGETBV1, XSAVES,
-                AVX, MPX, PKU, LWP],
+                AVX, MPX, PKU, AMX_TILE, LWP],
 
         # AVX is taken to mean hardware support for 256bit registers (which in
         # practice depends on the VEX prefix to encode), and the instructions
@@ -288,6 +288,11 @@ def crunch_numbers(state):
 
         # In principle the TSXLDTRK insns could also be considered independent.
         RTM: [TSXLDTRK],
+
+        # AMX-TILE means hardware support for tile registers and general non-
+        # computational instructions.  All further AMX features are built on top
+        # of AMX-TILE.
+        AMX_TILE: [AMX_BF16, AMX_INT8],
     }
 
     deep_features = tuple(sorted(deps.keys()))


[PATCH v2 11/17] x86/CPUID: adjust extended leaves out of range clearing
Posted by Jan Beulich 3 years, 4 months ago
A maximum extended leaf input value with the high half different from
0x8000 should not be considered valid - all leaves should be cleared in
this case.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Integrate into series.

--- a/tools/tests/cpu-policy/test-cpu-policy.c
+++ b/tools/tests/cpu-policy/test-cpu-policy.c
@@ -516,11 +516,22 @@ static void test_cpuid_out_of_range_clea
             },
         },
         {
+            .name = "no extd",
+            .nr_markers = 0,
+            .p = {
+                /* Clears all markers. */
+                .extd.max_leaf = 0,
+
+                .extd.vendor_ebx = 0xc2,
+                .extd.raw_fms = 0xc2,
+            },
+        },
+        {
             .name = "extd",
             .nr_markers = 1,
             .p = {
                 /* Retains marker in leaf 0.  Clears others. */
-                .extd.max_leaf = 0,
+                .extd.max_leaf = 0x80000000,
                 .extd.vendor_ebx = 0xc2,
 
                 .extd.raw_fms = 0xc2,
--- a/xen/lib/x86/cpuid.c
+++ b/xen/lib/x86/cpuid.c
@@ -232,7 +232,9 @@ void x86_cpuid_policy_clear_out_of_range
                     ARRAY_SIZE(p->xstate.raw) - 1);
     }
 
-    zero_leaves(p->extd.raw, (p->extd.max_leaf & 0xffff) + 1,
+    zero_leaves(p->extd.raw,
+                ((p->extd.max_leaf >> 16) == 0x8000
+                 ? (p->extd.max_leaf & 0xffff) + 1 : 0),
                 ARRAY_SIZE(p->extd.raw) - 1);
 }
 

Re: [PATCH v2 11/17] x86/CPUID: adjust extended leaves out of range clearing
Posted by Roger Pau Monné 2 years, 11 months ago
On Mon, Nov 23, 2020 at 03:32:35PM +0100, Jan Beulich wrote:
> A maximum extended leaf input value with the high half different from
> 0x8000 should not be considered valid - all leaves should be cleared in
> this case.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.

Re: [PATCH v2 11/17] x86/CPUID: adjust extended leaves out of range clearing
Posted by Andrew Cooper 2 years, 11 months ago
On 23/11/2020 14:32, Jan Beulich wrote:
> A maximum extended leaf input value with the high half different from
> 0x8000 should not be considered valid - all leaves should be cleared in
> this case.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: Integrate into series.
>
> --- a/tools/tests/cpu-policy/test-cpu-policy.c
> +++ b/tools/tests/cpu-policy/test-cpu-policy.c
> @@ -516,11 +516,22 @@ static void test_cpuid_out_of_range_clea
>              },
>          },
>          {
> +            .name = "no extd",
> +            .nr_markers = 0,
> +            .p = {
> +                /* Clears all markers. */
> +                .extd.max_leaf = 0,
> +
> +                .extd.vendor_ebx = 0xc2,
> +                .extd.raw_fms = 0xc2,
> +            },
> +        },
> +        {
>              .name = "extd",
>              .nr_markers = 1,
>              .p = {
>                  /* Retains marker in leaf 0.  Clears others. */
> -                .extd.max_leaf = 0,
> +                .extd.max_leaf = 0x80000000,
>                  .extd.vendor_ebx = 0xc2,
>  
>                  .extd.raw_fms = 0xc2,
> --- a/xen/lib/x86/cpuid.c
> +++ b/xen/lib/x86/cpuid.c
> @@ -232,7 +232,9 @@ void x86_cpuid_policy_clear_out_of_range
>                      ARRAY_SIZE(p->xstate.raw) - 1);
>      }
>  
> -    zero_leaves(p->extd.raw, (p->extd.max_leaf & 0xffff) + 1,
> +    zero_leaves(p->extd.raw,
> +                ((p->extd.max_leaf >> 16) == 0x8000
> +                 ? (p->extd.max_leaf & 0xffff) + 1 : 0),
>                  ARRAY_SIZE(p->extd.raw) - 1);

Honestly, this is unnecessary complexity and overhead, and the logic is
already hard enough to follow.

There won't be extd.max_leaf with the high half != 0x8000 in real
policies, because of how we fill them.  Nor ought there to be, given the
intended meaning of this part of the union.

I think we simply forbid this case, rather than taking extra complexity
to cope with it.  Approximately all VMs will have 0x80000008 as a
minimum, and I don't think catering to pre-64bit Intel CPUs is worth our
effort either.

~Andrew


Re: [PATCH v2 11/17] x86/CPUID: adjust extended leaves out of range clearing
Posted by Jan Beulich 2 years, 11 months ago
On 15.04.2021 14:48, Andrew Cooper wrote:
> On 23/11/2020 14:32, Jan Beulich wrote:
>> --- a/xen/lib/x86/cpuid.c
>> +++ b/xen/lib/x86/cpuid.c
>> @@ -232,7 +232,9 @@ void x86_cpuid_policy_clear_out_of_range
>>                      ARRAY_SIZE(p->xstate.raw) - 1);
>>      }
>>  
>> -    zero_leaves(p->extd.raw, (p->extd.max_leaf & 0xffff) + 1,
>> +    zero_leaves(p->extd.raw,
>> +                ((p->extd.max_leaf >> 16) == 0x8000
>> +                 ? (p->extd.max_leaf & 0xffff) + 1 : 0),
>>                  ARRAY_SIZE(p->extd.raw) - 1);
> 
> Honestly, this is unnecessary complexity and overhead, and the logic is
> already hard enough to follow.
> 
> There won't be extd.max_leaf with the high half != 0x8000 in real
> policies, because of how we fill them.  Nor ought there to be, given the
> intended meaning of this part of the union.
> 
> I think we simply forbid this case, rather than taking extra complexity
> to cope with it.  Approximately all VMs will have 0x80000008 as a
> minimum, and I don't think catering to pre-64bit Intel CPUs is worth our
> effort either.

"Forbid" has got to mean something other than "assume all input is fine".
Aiui guest config files can restrict the maximum sub-leaves exposed to a
guest. If we don't want to handle the case here, where else do you
suggest we at least check for and complain about the broken
assumption? (IOW I'd
be okay dropping this patch if your "forbid" actually means some
enforcement elsewhere, and then perhaps a comment pointing there ahead
of the zero_leaves() invocation here.)

Jan

Ping: [PATCH v2 11/17] x86/CPUID: adjust extended leaves out of range clearing
Posted by Jan Beulich 3 years, 1 month ago
On 23.11.2020 15:32, Jan Beulich wrote:
> A maximum extended leaf input value with the high half different from
> 0x8000 should not be considered valid - all leaves should be cleared in
> this case.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: Integrate into series.

While most other parts of this series are to be delayed until
(at least) 4.16, I consider this one a bug fix.

Jan

> --- a/tools/tests/cpu-policy/test-cpu-policy.c
> +++ b/tools/tests/cpu-policy/test-cpu-policy.c
> @@ -516,11 +516,22 @@ static void test_cpuid_out_of_range_clea
>              },
>          },
>          {
> +            .name = "no extd",
> +            .nr_markers = 0,
> +            .p = {
> +                /* Clears all markers. */
> +                .extd.max_leaf = 0,
> +
> +                .extd.vendor_ebx = 0xc2,
> +                .extd.raw_fms = 0xc2,
> +            },
> +        },
> +        {
>              .name = "extd",
>              .nr_markers = 1,
>              .p = {
>                  /* Retains marker in leaf 0.  Clears others. */
> -                .extd.max_leaf = 0,
> +                .extd.max_leaf = 0x80000000,
>                  .extd.vendor_ebx = 0xc2,
>  
>                  .extd.raw_fms = 0xc2,
> --- a/xen/lib/x86/cpuid.c
> +++ b/xen/lib/x86/cpuid.c
> @@ -232,7 +232,9 @@ void x86_cpuid_policy_clear_out_of_range
>                      ARRAY_SIZE(p->xstate.raw) - 1);
>      }
>  
> -    zero_leaves(p->extd.raw, (p->extd.max_leaf & 0xffff) + 1,
> +    zero_leaves(p->extd.raw,
> +                ((p->extd.max_leaf >> 16) == 0x8000
> +                 ? (p->extd.max_leaf & 0xffff) + 1 : 0),
>                  ARRAY_SIZE(p->extd.raw) - 1);
>  }
>  
> 


[PATCH v2 12/17] x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents
Posted by Jan Beulich 3 years, 4 months ago
Zapping leaf data for out of range leaves is just one half of it: To
avoid guests (bogusly or worse) inferring information from mere leaf
presence, also shrink maximum indicators such that the respective
trailing entry is not all blank (unless of course it's the initial
subleaf of a leaf that's not the final one).

This is also in preparation for bumping the maximum basic leaf we
support, to ensure guests that don't get the related features exposed
won't observe a change in behavior.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.
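
For reference, the shrinking itself can be thought of as scanning a leaf
array backwards for the last non-blank entry. The sketch below is purely
illustrative - it is not the actual x86_cpuid_policy_shrink_max_leaves()
implementation and ignores the "initial subleaf of a non-final leaf"
special case mentioned above:

    /* Sketch only; not the real x86_cpuid_policy_shrink_max_leaves(). */
    static unsigned int shrink_max(const struct cpuid_leaf *l, unsigned int nr)
    {
        while ( nr > 1 &&
                !(l[nr - 1].a | l[nr - 1].b | l[nr - 1].c | l[nr - 1].d) )
            --nr;

        return nr - 1;   /* New max (sub)leaf value to report. */
    }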

--- a/tools/tests/cpu-policy/test-cpu-policy.c
+++ b/tools/tests/cpu-policy/test-cpu-policy.c
@@ -8,10 +8,13 @@
 #include <err.h>
 
 #include <xen-tools/libs.h>
+#include <xen/asm/x86-defns.h>
 #include <xen/asm/x86-vendors.h>
 #include <xen/lib/x86/cpu-policy.h>
 #include <xen/domctl.h>
 
+#define XSTATE_FP_SSE  (X86_XCR0_FP | X86_XCR0_SSE)
+
 static unsigned int nr_failures;
 #define fail(fmt, ...)                          \
 ({                                              \
@@ -564,6 +567,103 @@ static void test_cpuid_out_of_range_clea
     }
 }
 
+static void test_cpuid_maximum_leaf_shrinking(void)
+{
+    static const struct test {
+        const char *name;
+        struct cpuid_policy p;
+    } tests[] = {
+        {
+            .name = "basic",
+            .p = {
+                /* Very basic information only. */
+                .basic.max_leaf = 1,
+                .basic.raw_fms = 0xc2,
+            },
+        },
+        {
+            .name = "cache",
+            .p = {
+                /* Cache subleaves present. */
+                .basic.max_leaf = 4,
+                .cache.subleaf[0].type = 1,
+            },
+        },
+        {
+            .name = "feat#0",
+            .p = {
+                /* Subleaf 0 only with some valid bit. */
+                .basic.max_leaf = 7,
+                .feat.max_subleaf = 0,
+                .feat.fsgsbase = 1,
+            },
+        },
+        {
+            .name = "feat#1",
+            .p = {
+                /* Subleaf 1 only with some valid bit. */
+                .basic.max_leaf = 7,
+                .feat.max_subleaf = 1,
+                .feat.avx_vnni = 1,
+            },
+        },
+        {
+            .name = "topo",
+            .p = {
+                /* Topology subleaves present. */
+                .basic.max_leaf = 0xb,
+                .topo.subleaf[0].type = 1,
+            },
+        },
+        {
+            .name = "xstate",
+            .p = {
+                /* First subleaf always valid (and then non-zero). */
+                .basic.max_leaf = 0xd,
+                .xstate.xcr0_low = XSTATE_FP_SSE,
+            },
+        },
+        {
+            .name = "extd",
+            .p = {
+                /* Commonly available information only. */
+                .extd.max_leaf = 0x80000008,
+                .extd.maxphysaddr = 0x28,
+                .extd.maxlinaddr = 0x30,
+            },
+        },
+    };
+
+    printf("Testing CPUID maximum leaf shrinking:\n");
+
+    for ( size_t i = 0; i < ARRAY_SIZE(tests); ++i )
+    {
+        const struct test *t = &tests[i];
+        struct cpuid_policy *p = memdup(&t->p);
+
+        p->basic.max_leaf = ARRAY_SIZE(p->basic.raw) - 1;
+        p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
+        p->extd.max_leaf = 0x80000000 | (ARRAY_SIZE(p->extd.raw) - 1);
+
+        x86_cpuid_policy_shrink_max_leaves(p);
+
+        /* Check the resulting max (sub)leaf values against expectations. */
+        if ( p->basic.max_leaf != t->p.basic.max_leaf )
+             fail("  Test %s basic fail - expected %#x, got %#x\n",
+                  t->name, t->p.basic.max_leaf, p->basic.max_leaf);
+
+        if ( p->extd.max_leaf != t->p.extd.max_leaf )
+             fail("  Test %s extd fail - expected %#x, got %#x\n",
+                  t->name, t->p.extd.max_leaf, p->extd.max_leaf);
+
+        if ( p->feat.max_subleaf != t->p.feat.max_subleaf )
+             fail("  Test %s feat fail - expected %#x, got %#x\n",
+                  t->name, t->p.feat.max_subleaf, p->feat.max_subleaf);
+
+        free(p);
+    }
+}
+
 static void test_is_compatible_success(void)
 {
     static struct test {
@@ -679,6 +779,7 @@ int main(int argc, char **argv)
     test_cpuid_serialise_success();
     test_cpuid_deserialise_failure();
     test_cpuid_out_of_range_clearing();
+    test_cpuid_maximum_leaf_shrinking();
 
     test_msr_serialise_success();
     test_msr_deserialise_failure();
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -346,6 +346,8 @@ static void __init calculate_host_policy
         p->extd.raw[0xa].d |= ((1u << SVM_FEATURE_VMCBCLEAN) |
                                (1u << SVM_FEATURE_TSCRATEMSR));
     }
+
+    x86_cpuid_policy_shrink_max_leaves(p);
 }
 
 static void __init guest_common_default_feature_adjustments(uint32_t *fs)
@@ -415,6 +417,8 @@ static void __init calculate_pv_max_poli
     recalculate_xstate(p);
 
     p->extd.raw[0xa] = EMPTY_LEAF; /* No SVM for PV guests. */
+
+    x86_cpuid_policy_shrink_max_leaves(p);
 }
 
 static void __init calculate_pv_def_policy(void)
@@ -435,6 +439,8 @@ static void __init calculate_pv_def_poli
     sanitise_featureset(pv_featureset);
     cpuid_featureset_to_policy(pv_featureset, p);
     recalculate_xstate(p);
+
+    x86_cpuid_policy_shrink_max_leaves(p);
 }
 
 static void __init calculate_hvm_max_policy(void)
@@ -494,6 +500,8 @@ static void __init calculate_hvm_max_pol
     sanitise_featureset(hvm_featureset);
     cpuid_featureset_to_policy(hvm_featureset, p);
     recalculate_xstate(p);
+
+    x86_cpuid_policy_shrink_max_leaves(p);
 }
 
 static void __init calculate_hvm_def_policy(void)
@@ -518,6 +526,8 @@ static void __init calculate_hvm_def_pol
     sanitise_featureset(hvm_featureset);
     cpuid_featureset_to_policy(hvm_featureset, p);
     recalculate_xstate(p);
+
+    x86_cpuid_policy_shrink_max_leaves(p);
 }
 
 void __init init_host_cpuid(void)
@@ -704,6 +714,8 @@ void recalculate_cpuid_policy(struct dom
 
     if ( !p->extd.page1gb )
         p->extd.raw[0x19] = EMPTY_LEAF;
+
+    x86_cpuid_policy_shrink_max_leaves(p);
 }
 
 int init_domain_cpuid_policy(struct domain *d)
--- a/xen/arch/x86/hvm/viridian/viridian.c
+++ b/xen/arch/x86/hvm/viridian/viridian.c
@@ -121,7 +121,9 @@ void cpuid_viridian_leaves(const struct
     switch ( leaf )
     {
     case 0:
-        res->a = 0x40000006; /* Maximum leaf */
+        /* Maximum leaf */
+        cpuid_viridian_leaves(v, 0x40000006, 0, res);
+        res->a = res->a | res->b | res->c | res->d ? 0x40000006 : 0x40000004;
         memcpy(&res->b, "Micr", 4);
         memcpy(&res->c, "osof", 4);
         memcpy(&res->d, "t Hv", 4);
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -964,13 +964,15 @@ void cpuid_hypervisor_leaves(const struc
     uint32_t base = is_viridian_domain(d) ? 0x40000100 : 0x40000000;
     uint32_t idx  = leaf - base;
     unsigned int limit = is_viridian_domain(d) ? p->hv2_limit : p->hv_limit;
+    unsigned int dflt = is_pv_domain(d) ? XEN_CPUID_MAX_PV_NUM_LEAVES
+                                        : XEN_CPUID_MAX_HVM_NUM_LEAVES;
 
     if ( limit == 0 )
         /* Default number of leaves */
-        limit = XEN_CPUID_MAX_NUM_LEAVES;
+        limit = dflt;
     else
         /* Clamp toolstack value between 2 and MAX_NUM_LEAVES. */
-        limit = min(max(limit, 2u), XEN_CPUID_MAX_NUM_LEAVES + 0u);
+        limit = min(max(limit, 2u), dflt);
 
     if ( idx > limit )
         return;
--- a/xen/include/public/arch-x86/cpuid.h
+++ b/xen/include/public/arch-x86/cpuid.h
@@ -113,6 +113,10 @@
 /* Max. address width in bits taking memory hotplug into account. */
 #define XEN_CPUID_MACHINE_ADDRESS_WIDTH_MASK (0xffu << 0)
 
-#define XEN_CPUID_MAX_NUM_LEAVES 5
+#define XEN_CPUID_MAX_PV_NUM_LEAVES 5
+#define XEN_CPUID_MAX_HVM_NUM_LEAVES 4
+#define XEN_CPUID_MAX_NUM_LEAVES \
+    (XEN_CPUID_MAX_PV_NUM_LEAVES > XEN_CPUID_MAX_HVM_NUM_LEAVES ? \
+     XEN_CPUID_MAX_PV_NUM_LEAVES : XEN_CPUID_MAX_HVM_NUM_LEAVES)
 
 #endif /* __XEN_PUBLIC_ARCH_X86_CPUID_H__ */
--- a/xen/include/xen/lib/x86/cpuid.h
+++ b/xen/include/xen/lib/x86/cpuid.h
@@ -351,6 +351,13 @@ void x86_cpuid_policy_fill_native(struct
  */
 void x86_cpuid_policy_clear_out_of_range_leaves(struct cpuid_policy *p);
 
+/**
+ * Shrink max leaf/subleaf values such that the last respective valid entry
+ * isn't all blank.  While permitted by the spec, such extraneous leaves may
+ * provide undue "hints" to guests.
+ */
+void x86_cpuid_policy_shrink_max_leaves(struct cpuid_policy *p);
+
 #ifdef __XEN__
 #include <public/arch-x86/xen.h>
 typedef XEN_GUEST_HANDLE_64(xen_cpuid_leaf_t) cpuid_leaf_buffer_t;
--- a/xen/lib/x86/cpuid.c
+++ b/xen/lib/x86/cpuid.c
@@ -238,6 +238,45 @@ void x86_cpuid_policy_clear_out_of_range
                 ARRAY_SIZE(p->extd.raw) - 1);
 }
 
+void x86_cpuid_policy_shrink_max_leaves(struct cpuid_policy *p)
+{
+    unsigned int i;
+
+    p->basic.raw[0x4] = p->cache.raw[0];
+
+    for ( i = p->feat.max_subleaf; i; --i )
+        if ( p->feat.raw[i].a | p->feat.raw[i].b |
+             p->feat.raw[i].c | p->feat.raw[i].d )
+            break;
+    p->feat.max_subleaf = i;
+    p->basic.raw[0x7] = p->feat.raw[0];
+
+    p->basic.raw[0xb] = p->topo.raw[0];
+
+    /*
+     * Due to the way xstate gets handled in the hypervisor (see
+     * recalculate_xstate()) there is (for now at least) no need to fiddle
+     * with the xstate subleaves (IOW we assume they're already in consistent
+     * shape, as they come from either hardware or recalculate_xstate()).
+     */
+    p->basic.raw[0xd] = p->xstate.raw[0];
+
+    for ( i = p->basic.max_leaf; i; --i )
+        if ( p->basic.raw[i].a | p->basic.raw[i].b |
+             p->basic.raw[i].c | p->basic.raw[i].d )
+            break;
+    p->basic.max_leaf = i;
+
+    for ( i = p->extd.max_leaf & 0xffff; i; --i )
+        if ( p->extd.raw[i].a | p->extd.raw[i].b |
+             p->extd.raw[i].c | p->extd.raw[i].d )
+            break;
+    if ( i | p->extd.raw[0].b | p->extd.raw[0].c | p->extd.raw[0].d )
+        p->extd.max_leaf = 0x80000000 | i;
+    else
+        p->extd.max_leaf = 0;
+}
+
 const uint32_t *x86_cpuid_lookup_deep_deps(uint32_t feature)
 {
     static const uint32_t deep_features[] = INIT_DEEP_FEATURES;


Re: [PATCH v2 12/17] x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents
Posted by Jan Beulich 3 years, 1 month ago
On 23.11.2020 15:33, Jan Beulich wrote:
> Zapping leaf data for out of range leaves is just one half of it: To
> avoid guests (bogusly or worse) inferring information from mere leaf
> presence, also shrink maximum indicators such that the respective
> trailing entry is not all blank (unless of course it's the initial
> subleaf of a leaf that's not the final one).
> 
> This is also in preparation for bumping the maximum basic leaf we
> support, to ensure guests not getting exposed related features won't
> observe a change in behavior.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: New.

While most other parts of this series are to be delayed until
(at least) 4.16, I consider this one a bug fix as well, just
like the previous one. I also realize only now that I forgot to
Cc Paul on the original submission - sorry.

Jan

> --- a/tools/tests/cpu-policy/test-cpu-policy.c
> +++ b/tools/tests/cpu-policy/test-cpu-policy.c
> @@ -8,10 +8,13 @@
>  #include <err.h>
>  
>  #include <xen-tools/libs.h>
> +#include <xen/asm/x86-defns.h>
>  #include <xen/asm/x86-vendors.h>
>  #include <xen/lib/x86/cpu-policy.h>
>  #include <xen/domctl.h>
>  
> +#define XSTATE_FP_SSE  (X86_XCR0_FP | X86_XCR0_SSE)
> +
>  static unsigned int nr_failures;
>  #define fail(fmt, ...)                          \
>  ({                                              \
> @@ -564,6 +567,103 @@ static void test_cpuid_out_of_range_clea
>      }
>  }
>  
> +static void test_cpuid_maximum_leaf_shrinking(void)
> +{
> +    static const struct test {
> +        const char *name;
> +        struct cpuid_policy p;
> +    } tests[] = {
> +        {
> +            .name = "basic",
> +            .p = {
> +                /* Very basic information only. */
> +                .basic.max_leaf = 1,
> +                .basic.raw_fms = 0xc2,
> +            },
> +        },
> +        {
> +            .name = "cache",
> +            .p = {
> +                /* Cache subleaves present. */
> +                .basic.max_leaf = 4,
> +                .cache.subleaf[0].type = 1,
> +            },
> +        },
> +        {
> +            .name = "feat#0",
> +            .p = {
> +                /* Subleaf 0 only with some valid bit. */
> +                .basic.max_leaf = 7,
> +                .feat.max_subleaf = 0,
> +                .feat.fsgsbase = 1,
> +            },
> +        },
> +        {
> +            .name = "feat#1",
> +            .p = {
> +                /* Subleaf 1 only with some valid bit. */
> +                .basic.max_leaf = 7,
> +                .feat.max_subleaf = 1,
> +                .feat.avx_vnni = 1,
> +            },
> +        },
> +        {
> +            .name = "topo",
> +            .p = {
> +                /* Topology subleaves present. */
> +                .basic.max_leaf = 0xb,
> +                .topo.subleaf[0].type = 1,
> +            },
> +        },
> +        {
> +            .name = "xstate",
> +            .p = {
> +                /* First subleaf always valid (and then non-zero). */
> +                .basic.max_leaf = 0xd,
> +                .xstate.xcr0_low = XSTATE_FP_SSE,
> +            },
> +        },
> +        {
> +            .name = "extd",
> +            .p = {
> +                /* Commonly available information only. */
> +                .extd.max_leaf = 0x80000008,
> +                .extd.maxphysaddr = 0x28,
> +                .extd.maxlinaddr = 0x30,
> +            },
> +        },
> +    };
> +
> +    printf("Testing CPUID maximum leaf shrinking:\n");
> +
> +    for ( size_t i = 0; i < ARRAY_SIZE(tests); ++i )
> +    {
> +        const struct test *t = &tests[i];
> +        struct cpuid_policy *p = memdup(&t->p);
> +
> +        p->basic.max_leaf = ARRAY_SIZE(p->basic.raw) - 1;
> +        p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
> +        p->extd.max_leaf = 0x80000000 | (ARRAY_SIZE(p->extd.raw) - 1);
> +
> +        x86_cpuid_policy_shrink_max_leaves(p);
> +
> +        /* Check the resulting max (sub)leaf values against expectations. */
> +        if ( p->basic.max_leaf != t->p.basic.max_leaf )
> +             fail("  Test %s basic fail - expected %#x, got %#x\n",
> +                  t->name, t->p.basic.max_leaf, p->basic.max_leaf);
> +
> +        if ( p->extd.max_leaf != t->p.extd.max_leaf )
> +             fail("  Test %s extd fail - expected %#x, got %#x\n",
> +                  t->name, t->p.extd.max_leaf, p->extd.max_leaf);
> +
> +        if ( p->feat.max_subleaf != t->p.feat.max_subleaf )
> +             fail("  Test %s feat fail - expected %#x, got %#x\n",
> +                  t->name, t->p.feat.max_subleaf, p->feat.max_subleaf);
> +
> +        free(p);
> +    }
> +}
> +
>  static void test_is_compatible_success(void)
>  {
>      static struct test {
> @@ -679,6 +779,7 @@ int main(int argc, char **argv)
>      test_cpuid_serialise_success();
>      test_cpuid_deserialise_failure();
>      test_cpuid_out_of_range_clearing();
> +    test_cpuid_maximum_leaf_shrinking();
>  
>      test_msr_serialise_success();
>      test_msr_deserialise_failure();
> --- a/xen/arch/x86/cpuid.c
> +++ b/xen/arch/x86/cpuid.c
> @@ -346,6 +346,8 @@ static void __init calculate_host_policy
>          p->extd.raw[0xa].d |= ((1u << SVM_FEATURE_VMCBCLEAN) |
>                                 (1u << SVM_FEATURE_TSCRATEMSR));
>      }
> +
> +    x86_cpuid_policy_shrink_max_leaves(p);
>  }
>  
>  static void __init guest_common_default_feature_adjustments(uint32_t *fs)
> @@ -415,6 +417,8 @@ static void __init calculate_pv_max_poli
>      recalculate_xstate(p);
>  
>      p->extd.raw[0xa] = EMPTY_LEAF; /* No SVM for PV guests. */
> +
> +    x86_cpuid_policy_shrink_max_leaves(p);
>  }
>  
>  static void __init calculate_pv_def_policy(void)
> @@ -435,6 +439,8 @@ static void __init calculate_pv_def_poli
>      sanitise_featureset(pv_featureset);
>      cpuid_featureset_to_policy(pv_featureset, p);
>      recalculate_xstate(p);
> +
> +    x86_cpuid_policy_shrink_max_leaves(p);
>  }
>  
>  static void __init calculate_hvm_max_policy(void)
> @@ -494,6 +500,8 @@ static void __init calculate_hvm_max_pol
>      sanitise_featureset(hvm_featureset);
>      cpuid_featureset_to_policy(hvm_featureset, p);
>      recalculate_xstate(p);
> +
> +    x86_cpuid_policy_shrink_max_leaves(p);
>  }
>  
>  static void __init calculate_hvm_def_policy(void)
> @@ -518,6 +526,8 @@ static void __init calculate_hvm_def_pol
>      sanitise_featureset(hvm_featureset);
>      cpuid_featureset_to_policy(hvm_featureset, p);
>      recalculate_xstate(p);
> +
> +    x86_cpuid_policy_shrink_max_leaves(p);
>  }
>  
>  void __init init_host_cpuid(void)
> @@ -704,6 +714,8 @@ void recalculate_cpuid_policy(struct dom
>  
>      if ( !p->extd.page1gb )
>          p->extd.raw[0x19] = EMPTY_LEAF;
> +
> +    x86_cpuid_policy_shrink_max_leaves(p);
>  }
>  
>  int init_domain_cpuid_policy(struct domain *d)
> --- a/xen/arch/x86/hvm/viridian/viridian.c
> +++ b/xen/arch/x86/hvm/viridian/viridian.c
> @@ -121,7 +121,9 @@ void cpuid_viridian_leaves(const struct
>      switch ( leaf )
>      {
>      case 0:
> -        res->a = 0x40000006; /* Maximum leaf */
> +        /* Maximum leaf */
> +        cpuid_viridian_leaves(v, 0x40000006, 0, res);
> +        res->a = res->a | res->b | res->c | res->d ? 0x40000006 : 0x40000004;
>          memcpy(&res->b, "Micr", 4);
>          memcpy(&res->c, "osof", 4);
>          memcpy(&res->d, "t Hv", 4);
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -964,13 +964,15 @@ void cpuid_hypervisor_leaves(const struc
>      uint32_t base = is_viridian_domain(d) ? 0x40000100 : 0x40000000;
>      uint32_t idx  = leaf - base;
>      unsigned int limit = is_viridian_domain(d) ? p->hv2_limit : p->hv_limit;
> +    unsigned int dflt = is_pv_domain(d) ? XEN_CPUID_MAX_PV_NUM_LEAVES
> +                                        : XEN_CPUID_MAX_HVM_NUM_LEAVES;
>  
>      if ( limit == 0 )
>          /* Default number of leaves */
> -        limit = XEN_CPUID_MAX_NUM_LEAVES;
> +        limit = dflt;
>      else
>          /* Clamp toolstack value between 2 and MAX_NUM_LEAVES. */
> -        limit = min(max(limit, 2u), XEN_CPUID_MAX_NUM_LEAVES + 0u);
> +        limit = min(max(limit, 2u), dflt);
>  
>      if ( idx > limit )
>          return;
> --- a/xen/include/public/arch-x86/cpuid.h
> +++ b/xen/include/public/arch-x86/cpuid.h
> @@ -113,6 +113,10 @@
>  /* Max. address width in bits taking memory hotplug into account. */
>  #define XEN_CPUID_MACHINE_ADDRESS_WIDTH_MASK (0xffu << 0)
>  
> -#define XEN_CPUID_MAX_NUM_LEAVES 5
> +#define XEN_CPUID_MAX_PV_NUM_LEAVES 5
> +#define XEN_CPUID_MAX_HVM_NUM_LEAVES 4
> +#define XEN_CPUID_MAX_NUM_LEAVES \
> +    (XEN_CPUID_MAX_PV_NUM_LEAVES > XEN_CPUID_MAX_HVM_NUM_LEAVES ? \
> +     XEN_CPUID_MAX_PV_NUM_LEAVES : XEN_CPUID_MAX_HVM_NUM_LEAVES)
>  
>  #endif /* __XEN_PUBLIC_ARCH_X86_CPUID_H__ */
> --- a/xen/include/xen/lib/x86/cpuid.h
> +++ b/xen/include/xen/lib/x86/cpuid.h
> @@ -351,6 +351,13 @@ void x86_cpuid_policy_fill_native(struct
>   */
>  void x86_cpuid_policy_clear_out_of_range_leaves(struct cpuid_policy *p);
>  
> +/**
> + * Shrink max leaf/subleaf values such that the last respective valid entry
> + * isn't all blank.  While permitted by the spec, such extraneous leaves may
> + * provide undue "hints" to guests.
> + */
> +void x86_cpuid_policy_shrink_max_leaves(struct cpuid_policy *p);
> +
>  #ifdef __XEN__
>  #include <public/arch-x86/xen.h>
>  typedef XEN_GUEST_HANDLE_64(xen_cpuid_leaf_t) cpuid_leaf_buffer_t;
> --- a/xen/lib/x86/cpuid.c
> +++ b/xen/lib/x86/cpuid.c
> @@ -238,6 +238,45 @@ void x86_cpuid_policy_clear_out_of_range
>                  ARRAY_SIZE(p->extd.raw) - 1);
>  }
>  
> +void x86_cpuid_policy_shrink_max_leaves(struct cpuid_policy *p)
> +{
> +    unsigned int i;
> +
> +    p->basic.raw[0x4] = p->cache.raw[0];
> +
> +    for ( i = p->feat.max_subleaf; i; --i )
> +        if ( p->feat.raw[i].a | p->feat.raw[i].b |
> +             p->feat.raw[i].c | p->feat.raw[i].d )
> +            break;
> +    p->feat.max_subleaf = i;
> +    p->basic.raw[0x7] = p->feat.raw[0];
> +
> +    p->basic.raw[0xb] = p->topo.raw[0];
> +
> +    /*
> +     * Due to the way xstate gets handled in the hypervisor (see
> +     * recalculate_xstate()) there is (for now at least) no need to fiddle
> +     * with the xstate subleaves (IOW we assume they're already in consistent
> +     * shape, as they come from either hardware or recalculate_xstate()).
> +     */
> +    p->basic.raw[0xd] = p->xstate.raw[0];
> +
> +    for ( i = p->basic.max_leaf; i; --i )
> +        if ( p->basic.raw[i].a | p->basic.raw[i].b |
> +             p->basic.raw[i].c | p->basic.raw[i].d )
> +            break;
> +    p->basic.max_leaf = i;
> +
> +    for ( i = p->extd.max_leaf & 0xffff; i; --i )
> +        if ( p->extd.raw[i].a | p->extd.raw[i].b |
> +             p->extd.raw[i].c | p->extd.raw[i].d )
> +            break;
> +    if ( i | p->extd.raw[0].b | p->extd.raw[0].c | p->extd.raw[0].d )
> +        p->extd.max_leaf = 0x80000000 | i;
> +    else
> +        p->extd.max_leaf = 0;
> +}
> +
>  const uint32_t *x86_cpuid_lookup_deep_deps(uint32_t feature)
>  {
>      static const uint32_t deep_features[] = INIT_DEEP_FEATURES;
> 
> 


Re: [PATCH v2 12/17] x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents
Posted by Roger Pau Monné 2 years, 11 months ago
On Mon, Nov 23, 2020 at 03:33:03PM +0100, Jan Beulich wrote:
> Zapping leaf data for out of range leaves is just one half of it: To
> avoid guests (bogusly or worse) inferring information from mere leaf
> presence, also shrink maximum indicators such that the respective
> trailing entry is not all blank (unless of course it's the initial
> subleaf of a leaf that's not the final one).
> 
> This is also in preparation for bumping the maximum basic leaf we
> support, to ensure guests not getting exposed related features won't
> observe a change in behavior.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: New.
> 
> --- a/tools/tests/cpu-policy/test-cpu-policy.c
> +++ b/tools/tests/cpu-policy/test-cpu-policy.c
> @@ -8,10 +8,13 @@
>  #include <err.h>
>  
>  #include <xen-tools/libs.h>
> +#include <xen/asm/x86-defns.h>
>  #include <xen/asm/x86-vendors.h>
>  #include <xen/lib/x86/cpu-policy.h>
>  #include <xen/domctl.h>
>  
> +#define XSTATE_FP_SSE  (X86_XCR0_FP | X86_XCR0_SSE)

This gets used only once...

> +
>  static unsigned int nr_failures;
>  #define fail(fmt, ...)                          \
>  ({                                              \
> @@ -564,6 +567,103 @@ static void test_cpuid_out_of_range_clea
>      }
>  }
>  
> +static void test_cpuid_maximum_leaf_shrinking(void)
> +{
> +    static const struct test {
> +        const char *name;
> +        struct cpuid_policy p;
> +    } tests[] = {
> +        {
> +            .name = "basic",
> +            .p = {
> +                /* Very basic information only. */
> +                .basic.max_leaf = 1,
> +                .basic.raw_fms = 0xc2,
> +            },
> +        },
> +        {
> +            .name = "cache",
> +            .p = {
> +                /* Cache subleaves present. */
> +                .basic.max_leaf = 4,
> +                .cache.subleaf[0].type = 1,
> +            },
> +        },
> +        {
> +            .name = "feat#0",
> +            .p = {
> +                /* Subleaf 0 only with some valid bit. */
> +                .basic.max_leaf = 7,
> +                .feat.max_subleaf = 0,
> +                .feat.fsgsbase = 1,
> +            },
> +        },
> +        {
> +            .name = "feat#1",
> +            .p = {
> +                /* Subleaf 1 only with some valid bit. */
> +                .basic.max_leaf = 7,
> +                .feat.max_subleaf = 1,
> +                .feat.avx_vnni = 1,
> +            },
> +        },
> +        {
> +            .name = "topo",
> +            .p = {
> +                /* Topology subleaves present. */
> +                .basic.max_leaf = 0xb,
> +                .topo.subleaf[0].type = 1,
> +            },
> +        },
> +        {
> +            .name = "xstate",
> +            .p = {
> +                /* First subleaf always valid (and then non-zero). */
> +                .basic.max_leaf = 0xd,
> +                .xstate.xcr0_low = XSTATE_FP_SSE,

...here. And then I also wonder whether this requires having any
specific values rather than just using ~0 or any random non-0 value.

> +            },
> +        },
> +        {
> +            .name = "extd",
> +            .p = {
> +                /* Commonly available information only. */
> +                .extd.max_leaf = 0x80000008,
> +                .extd.maxphysaddr = 0x28,
> +                .extd.maxlinaddr = 0x30,
> +            },
> +        },
> +    };
> +
> +    printf("Testing CPUID maximum leaf shrinking:\n");
> +
> +    for ( size_t i = 0; i < ARRAY_SIZE(tests); ++i )
> +    {
> +        const struct test *t = &tests[i];
> +        struct cpuid_policy *p = memdup(&t->p);
> +
> +        p->basic.max_leaf = ARRAY_SIZE(p->basic.raw) - 1;
> +        p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
> +        p->extd.max_leaf = 0x80000000 | (ARRAY_SIZE(p->extd.raw) - 1);
> +
> +        x86_cpuid_policy_shrink_max_leaves(p);
> +
> +        /* Check the resulting max (sub)leaf values against expectations. */
> +        if ( p->basic.max_leaf != t->p.basic.max_leaf )
> +             fail("  Test %s basic fail - expected %#x, got %#x\n",
> +                  t->name, t->p.basic.max_leaf, p->basic.max_leaf);
> +
> +        if ( p->extd.max_leaf != t->p.extd.max_leaf )
> +             fail("  Test %s extd fail - expected %#x, got %#x\n",
> +                  t->name, t->p.extd.max_leaf, p->extd.max_leaf);
> +
> +        if ( p->feat.max_subleaf != t->p.feat.max_subleaf )
> +             fail("  Test %s feat fail - expected %#x, got %#x\n",
> +                  t->name, t->p.feat.max_subleaf, p->feat.max_subleaf);
> +
> +        free(p);
> +    }
> +}
> +
>  static void test_is_compatible_success(void)
>  {
>      static struct test {
> @@ -679,6 +779,7 @@ int main(int argc, char **argv)
>      test_cpuid_serialise_success();
>      test_cpuid_deserialise_failure();
>      test_cpuid_out_of_range_clearing();
> +    test_cpuid_maximum_leaf_shrinking();
>  
>      test_msr_serialise_success();
>      test_msr_deserialise_failure();
> --- a/xen/arch/x86/cpuid.c
> +++ b/xen/arch/x86/cpuid.c
> @@ -346,6 +346,8 @@ static void __init calculate_host_policy
>          p->extd.raw[0xa].d |= ((1u << SVM_FEATURE_VMCBCLEAN) |
>                                 (1u << SVM_FEATURE_TSCRATEMSR));
>      }
> +
> +    x86_cpuid_policy_shrink_max_leaves(p);
>  }
>  
>  static void __init guest_common_default_feature_adjustments(uint32_t *fs)
> @@ -415,6 +417,8 @@ static void __init calculate_pv_max_poli
>      recalculate_xstate(p);
>  
>      p->extd.raw[0xa] = EMPTY_LEAF; /* No SVM for PV guests. */
> +
> +    x86_cpuid_policy_shrink_max_leaves(p);
>  }
>  
>  static void __init calculate_pv_def_policy(void)
> @@ -435,6 +439,8 @@ static void __init calculate_pv_def_poli
>      sanitise_featureset(pv_featureset);
>      cpuid_featureset_to_policy(pv_featureset, p);
>      recalculate_xstate(p);
> +
> +    x86_cpuid_policy_shrink_max_leaves(p);
>  }
>  
>  static void __init calculate_hvm_max_policy(void)
> @@ -494,6 +500,8 @@ static void __init calculate_hvm_max_pol
>      sanitise_featureset(hvm_featureset);
>      cpuid_featureset_to_policy(hvm_featureset, p);
>      recalculate_xstate(p);
> +
> +    x86_cpuid_policy_shrink_max_leaves(p);
>  }
>  
>  static void __init calculate_hvm_def_policy(void)
> @@ -518,6 +526,8 @@ static void __init calculate_hvm_def_pol
>      sanitise_featureset(hvm_featureset);
>      cpuid_featureset_to_policy(hvm_featureset, p);
>      recalculate_xstate(p);
> +
> +    x86_cpuid_policy_shrink_max_leaves(p);
>  }
>  
>  void __init init_host_cpuid(void)
> @@ -704,6 +714,8 @@ void recalculate_cpuid_policy(struct dom
>  
>      if ( !p->extd.page1gb )
>          p->extd.raw[0x19] = EMPTY_LEAF;
> +
> +    x86_cpuid_policy_shrink_max_leaves(p);
>  }
>  
>  int init_domain_cpuid_policy(struct domain *d)
> --- a/xen/arch/x86/hvm/viridian/viridian.c
> +++ b/xen/arch/x86/hvm/viridian/viridian.c
> @@ -121,7 +121,9 @@ void cpuid_viridian_leaves(const struct
>      switch ( leaf )
>      {
>      case 0:
> -        res->a = 0x40000006; /* Maximum leaf */
> +        /* Maximum leaf */
> +        cpuid_viridian_leaves(v, 0x40000006, 0, res);
> +        res->a = res->a | res->b | res->c | res->d ? 0x40000006 : 0x40000004;

I think you would need to adjust this chunk to also take into account
leaf 0x40000005 now.

I also wonder whether we should actually limit HyperV leaves. I think
it's perfectly fine to report up to the maximum supported by Xen, even
if it turns out none of the advertised features are present, as in: Xen
supports those leaves, but none of the features exposed are
available.

>          memcpy(&res->b, "Micr", 4);
>          memcpy(&res->c, "osof", 4);
>          memcpy(&res->d, "t Hv", 4);
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -964,13 +964,15 @@ void cpuid_hypervisor_leaves(const struc
>      uint32_t base = is_viridian_domain(d) ? 0x40000100 : 0x40000000;
>      uint32_t idx  = leaf - base;
>      unsigned int limit = is_viridian_domain(d) ? p->hv2_limit : p->hv_limit;
> +    unsigned int dflt = is_pv_domain(d) ? XEN_CPUID_MAX_PV_NUM_LEAVES
> +                                        : XEN_CPUID_MAX_HVM_NUM_LEAVES;
>  
>      if ( limit == 0 )
>          /* Default number of leaves */
> -        limit = XEN_CPUID_MAX_NUM_LEAVES;
> +        limit = dflt;
>      else
>          /* Clamp toolstack value between 2 and MAX_NUM_LEAVES. */
> -        limit = min(max(limit, 2u), XEN_CPUID_MAX_NUM_LEAVES + 0u);
> +        limit = min(max(limit, 2u), dflt);
>  
>      if ( idx > limit )
>          return;
> --- a/xen/include/public/arch-x86/cpuid.h
> +++ b/xen/include/public/arch-x86/cpuid.h
> @@ -113,6 +113,10 @@
>  /* Max. address width in bits taking memory hotplug into account. */
>  #define XEN_CPUID_MACHINE_ADDRESS_WIDTH_MASK (0xffu << 0)
>  
> -#define XEN_CPUID_MAX_NUM_LEAVES 5
> +#define XEN_CPUID_MAX_PV_NUM_LEAVES 5
> +#define XEN_CPUID_MAX_HVM_NUM_LEAVES 4
> +#define XEN_CPUID_MAX_NUM_LEAVES \
> +    (XEN_CPUID_MAX_PV_NUM_LEAVES > XEN_CPUID_MAX_HVM_NUM_LEAVES ? \
> +     XEN_CPUID_MAX_PV_NUM_LEAVES : XEN_CPUID_MAX_HVM_NUM_LEAVES)

I guess we don't have any form of max macro available here to use?

>  
>  #endif /* __XEN_PUBLIC_ARCH_X86_CPUID_H__ */
> --- a/xen/include/xen/lib/x86/cpuid.h
> +++ b/xen/include/xen/lib/x86/cpuid.h
> @@ -351,6 +351,13 @@ void x86_cpuid_policy_fill_native(struct
>   */
>  void x86_cpuid_policy_clear_out_of_range_leaves(struct cpuid_policy *p);
>  
> +/**
> + * Shrink max leaf/subleaf values such that the last respective valid entry
> + * isn't all blank.  While permitted by the spec, such extraneous leaves may
> + * provide undue "hints" to guests.
> + */
> +void x86_cpuid_policy_shrink_max_leaves(struct cpuid_policy *p);
> +
>  #ifdef __XEN__
>  #include <public/arch-x86/xen.h>
>  typedef XEN_GUEST_HANDLE_64(xen_cpuid_leaf_t) cpuid_leaf_buffer_t;
> --- a/xen/lib/x86/cpuid.c
> +++ b/xen/lib/x86/cpuid.c
> @@ -238,6 +238,45 @@ void x86_cpuid_policy_clear_out_of_range
>                  ARRAY_SIZE(p->extd.raw) - 1);
>  }
>  
> +void x86_cpuid_policy_shrink_max_leaves(struct cpuid_policy *p)
> +{
> +    unsigned int i;
> +
> +    p->basic.raw[0x4] = p->cache.raw[0];
> +
> +    for ( i = p->feat.max_subleaf; i; --i )
> +        if ( p->feat.raw[i].a | p->feat.raw[i].b |
> +             p->feat.raw[i].c | p->feat.raw[i].d )
> +            break;
> +    p->feat.max_subleaf = i;
> +    p->basic.raw[0x7] = p->feat.raw[0];

Do you need to use p->feat.raw[i] to ensure the basic 0x7 leaf is seen
as populated?

I think in theory raw[0] could be clear while raw[i] cannot as long as
there's a populated leaf > 0 due to the check above.
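
(To make that concrete, a hypothetical policy - field values chosen purely
for illustration:)

    /* Hypothetical: leaf 7 subleaf 0 all blank, subleaf 1 not. */
    p->feat.max_subleaf = 1;
    p->feat.raw[1].a = 1;

    /*
     * The loop above keeps max_subleaf at 1, yet
     *     p->basic.raw[0x7] = p->feat.raw[0];
     * copies an all-zero entry, so the later basic.max_leaf pass can shrink
     * past leaf 7 even though subleaf 1 has content.  Copying
     * p->feat.raw[i] (or anything non-blank) would avoid that.
     */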

Thanks, Roger.

Re: [PATCH v2 12/17] x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents
Posted by Jan Beulich 2 years, 11 months ago
On 15.04.2021 11:52, Roger Pau Monné wrote:
> On Mon, Nov 23, 2020 at 03:33:03PM +0100, Jan Beulich wrote:
>> --- a/tools/tests/cpu-policy/test-cpu-policy.c
>> +++ b/tools/tests/cpu-policy/test-cpu-policy.c
>> @@ -8,10 +8,13 @@
>>  #include <err.h>
>>  
>>  #include <xen-tools/libs.h>
>> +#include <xen/asm/x86-defns.h>
>>  #include <xen/asm/x86-vendors.h>
>>  #include <xen/lib/x86/cpu-policy.h>
>>  #include <xen/domctl.h>
>>  
>> +#define XSTATE_FP_SSE  (X86_XCR0_FP | X86_XCR0_SSE)
> 
> This gets used only once...
> 
>> +
>>  static unsigned int nr_failures;
>>  #define fail(fmt, ...)                          \
>>  ({                                              \
>> @@ -564,6 +567,103 @@ static void test_cpuid_out_of_range_clea
>>      }
>>  }
>>  
>> +static void test_cpuid_maximum_leaf_shrinking(void)
>> +{
>> +    static const struct test {
>> +        const char *name;
>> +        struct cpuid_policy p;
>> +    } tests[] = {
>> +        {
>> +            .name = "basic",
>> +            .p = {
>> +                /* Very basic information only. */
>> +                .basic.max_leaf = 1,
>> +                .basic.raw_fms = 0xc2,
>> +            },
>> +        },
>> +        {
>> +            .name = "cache",
>> +            .p = {
>> +                /* Cache subleaves present. */
>> +                .basic.max_leaf = 4,
>> +                .cache.subleaf[0].type = 1,
>> +            },
>> +        },
>> +        {
>> +            .name = "feat#0",
>> +            .p = {
>> +                /* Subleaf 0 only with some valid bit. */
>> +                .basic.max_leaf = 7,
>> +                .feat.max_subleaf = 0,
>> +                .feat.fsgsbase = 1,
>> +            },
>> +        },
>> +        {
>> +            .name = "feat#1",
>> +            .p = {
>> +                /* Subleaf 1 only with some valid bit. */
>> +                .basic.max_leaf = 7,
>> +                .feat.max_subleaf = 1,
>> +                .feat.avx_vnni = 1,
>> +            },
>> +        },
>> +        {
>> +            .name = "topo",
>> +            .p = {
>> +                /* Topology subleaves present. */
>> +                .basic.max_leaf = 0xb,
>> +                .topo.subleaf[0].type = 1,
>> +            },
>> +        },
>> +        {
>> +            .name = "xstate",
>> +            .p = {
>> +                /* First subleaf always valid (and then non-zero). */
>> +                .basic.max_leaf = 0xd,
>> +                .xstate.xcr0_low = XSTATE_FP_SSE,
> 
> ...here.

For now, yes. I'm introducing the constant because I think it wants
using in other places too, to avoid using literal numbers. See e.g.

                .xstate.xcr0_low = 7,

in test_cpuid_serialise_success().

> And then I also wonder whether this requires having any
> specific values rather than just using ~0 or any random non-0 value.

I'm afraid I don't understand: There's no ~0 here and no random
non-zero value - all other structure elements are left default-
initialized.

>> --- a/xen/arch/x86/hvm/viridian/viridian.c
>> +++ b/xen/arch/x86/hvm/viridian/viridian.c
>> @@ -121,7 +121,9 @@ void cpuid_viridian_leaves(const struct
>>      switch ( leaf )
>>      {
>>      case 0:
>> -        res->a = 0x40000006; /* Maximum leaf */
>> +        /* Maximum leaf */
>> +        cpuid_viridian_leaves(v, 0x40000006, 0, res);
>> +        res->a = res->a | res->b | res->c | res->d ? 0x40000006 : 0x40000004;
> 
> I think you would need to adjust this chunk to also take into account
> leaf 0x40000005 now.

Hmm, yes, looks like I failed to take note that I need to re-base
over that addition.

> I also wonder whether we should actually limit HyperV leaves. I think
> it's perfectly fine to report up to the maximum supported by Xen, even
> if it turns out none of the advertised features are present, as in: Xen
> supports those leaves, but none of the features exposed are
> available.

Well, if the Viridian maintainers (I realize I failed to Cc Paul on the
original submission) think I should leave the Viridian leaves alone
(rather than handling consistently with other leaves), I can drop this
part of the change.

>> --- a/xen/include/public/arch-x86/cpuid.h
>> +++ b/xen/include/public/arch-x86/cpuid.h
>> @@ -113,6 +113,10 @@
>>  /* Max. address width in bits taking memory hotplug into account. */
>>  #define XEN_CPUID_MACHINE_ADDRESS_WIDTH_MASK (0xffu << 0)
>>  
>> -#define XEN_CPUID_MAX_NUM_LEAVES 5
>> +#define XEN_CPUID_MAX_PV_NUM_LEAVES 5
>> +#define XEN_CPUID_MAX_HVM_NUM_LEAVES 4
>> +#define XEN_CPUID_MAX_NUM_LEAVES \
>> +    (XEN_CPUID_MAX_PV_NUM_LEAVES > XEN_CPUID_MAX_HVM_NUM_LEAVES ? \
>> +     XEN_CPUID_MAX_PV_NUM_LEAVES : XEN_CPUID_MAX_HVM_NUM_LEAVES)
> 
> I guess we don't have any form of max macro available here to use?

I'm not aware of one that could be used in pre-processor directives
as well.
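
(A small hypothetical illustration of the difference, with all names made
up: a ?:-based maximum remains a plain constant expression, so it works
both for array sizing and in #if directives, which a max() built from
typeof or statement expressions cannot.)

    #define PV_N  5
    #define HVM_N 4
    #define MAX_N (PV_N > HVM_N ? PV_N : HVM_N)

    unsigned int leaves[MAX_N];   /* integer constant expression */

    #if MAX_N < 4                 /* ... also evaluated by the preprocessor */
    # error "too few leaves"
    #endif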

>> --- a/xen/lib/x86/cpuid.c
>> +++ b/xen/lib/x86/cpuid.c
>> @@ -238,6 +238,45 @@ void x86_cpuid_policy_clear_out_of_range
>>                  ARRAY_SIZE(p->extd.raw) - 1);
>>  }
>>  
>> +void x86_cpuid_policy_shrink_max_leaves(struct cpuid_policy *p)
>> +{
>> +    unsigned int i;
>> +
>> +    p->basic.raw[0x4] = p->cache.raw[0];
>> +
>> +    for ( i = p->feat.max_subleaf; i; --i )
>> +        if ( p->feat.raw[i].a | p->feat.raw[i].b |
>> +             p->feat.raw[i].c | p->feat.raw[i].d )
>> +            break;
>> +    p->feat.max_subleaf = i;
>> +    p->basic.raw[0x7] = p->feat.raw[0];
> 
> Do you need to use p->feat.raw[i] to ensure the basic 0x7 leaf is seen
> as populated?
> 
> I think in theory raw[0] could be clear while raw[i] cannot as long as
> there's a populated leaf > 0 due to the check above.

Hmm, yes, this may not be very likely, but is definitely possible.

Jan

Re: [PATCH v2 12/17] x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents
Posted by Roger Pau Monné 2 years, 11 months ago
On Thu, Apr 15, 2021 at 12:37:20PM +0200, Jan Beulich wrote:
> On 15.04.2021 11:52, Roger Pau Monné wrote:
> > On Mon, Nov 23, 2020 at 03:33:03PM +0100, Jan Beulich wrote:
> >> --- a/tools/tests/cpu-policy/test-cpu-policy.c
> >> +++ b/tools/tests/cpu-policy/test-cpu-policy.c
> >> @@ -8,10 +8,13 @@
> >>  #include <err.h>
> >>  
> >>  #include <xen-tools/libs.h>
> >> +#include <xen/asm/x86-defns.h>
> >>  #include <xen/asm/x86-vendors.h>
> >>  #include <xen/lib/x86/cpu-policy.h>
> >>  #include <xen/domctl.h>
> >>  
> >> +#define XSTATE_FP_SSE  (X86_XCR0_FP | X86_XCR0_SSE)
> > 
> > This gets used only once...
> > 
> >> +
> >>  static unsigned int nr_failures;
> >>  #define fail(fmt, ...)                          \
> >>  ({                                              \
> >> @@ -564,6 +567,103 @@ static void test_cpuid_out_of_range_clea
> >>      }
> >>  }
> >>  
> >> +static void test_cpuid_maximum_leaf_shrinking(void)
> >> +{
> >> +    static const struct test {
> >> +        const char *name;
> >> +        struct cpuid_policy p;
> >> +    } tests[] = {
> >> +        {
> >> +            .name = "basic",
> >> +            .p = {
> >> +                /* Very basic information only. */
> >> +                .basic.max_leaf = 1,
> >> +                .basic.raw_fms = 0xc2,
> >> +            },
> >> +        },
> >> +        {
> >> +            .name = "cache",
> >> +            .p = {
> >> +                /* Cache subleaves present. */
> >> +                .basic.max_leaf = 4,
> >> +                .cache.subleaf[0].type = 1,
> >> +            },
> >> +        },
> >> +        {
> >> +            .name = "feat#0",
> >> +            .p = {
> >> +                /* Subleaf 0 only with some valid bit. */
> >> +                .basic.max_leaf = 7,
> >> +                .feat.max_subleaf = 0,
> >> +                .feat.fsgsbase = 1,
> >> +            },
> >> +        },
> >> +        {
> >> +            .name = "feat#1",
> >> +            .p = {
> >> +                /* Subleaf 1 only with some valid bit. */
> >> +                .basic.max_leaf = 7,
> >> +                .feat.max_subleaf = 1,
> >> +                .feat.avx_vnni = 1,
> >> +            },
> >> +        },
> >> +        {
> >> +            .name = "topo",
> >> +            .p = {
> >> +                /* Topology subleaves present. */
> >> +                .basic.max_leaf = 0xb,
> >> +                .topo.subleaf[0].type = 1,
> >> +            },
> >> +        },
> >> +        {
> >> +            .name = "xstate",
> >> +            .p = {
> >> +                /* First subleaf always valid (and then non-zero). */
> >> +                .basic.max_leaf = 0xd,
> >> +                .xstate.xcr0_low = XSTATE_FP_SSE,
> > 
> > ...here.
> 
> For now, yes. I'm introducing the constant because I think it wants
> using in other places too, to avoid using literal numbers. See e.g.
> 
>                 .xstate.xcr0_low = 7,
> 
> in test_cpuid_serialise_success().
> 
> > And then I also wonder whether this requires having any
> > specific values rather than just using ~0 or any random non-0 value.
> 
> I'm afraid I don't understand: There's no ~0 here and no random
> non-zero value - all other structure elements are left default-
> initialized.

Oh, I've worded that sentence wrongly, I think. What I meant to say is
that for the purposes of the test here you could just fill the fields
with ~0 instead of using things like XSTATE_FP_SSE?

> >> --- a/xen/arch/x86/hvm/viridian/viridian.c
> >> +++ b/xen/arch/x86/hvm/viridian/viridian.c
> >> @@ -121,7 +121,9 @@ void cpuid_viridian_leaves(const struct
> >>      switch ( leaf )
> >>      {
> >>      case 0:
> >> -        res->a = 0x40000006; /* Maximum leaf */
> >> +        /* Maximum leaf */
> >> +        cpuid_viridian_leaves(v, 0x40000006, 0, res);
> >> +        res->a = res->a | res->b | res->c | res->d ? 0x40000006 : 0x40000004;
> > 
> > I think you would need to adjust this chunk to also take into account
> > leaf 0x40000005 now.
> 
> Hmm, yes, looks like I failed to take note that I need to re-base
> over that addition.
> 
> > I also wonder whether we should actually limit HyperV leaves. I think
> > it's perfectly fine to report up to the maximum supported by Xen, even
> > if it turns out none of the advertised features are present, as in: Xen
> > supports those leaves, but none of the features exposed are
> > available.
> 
> Well, if the Viridian maintainers (I realize I failed to Cc Paul on the
> original submission) think I should leave the Viridian leaves alone
> (rather than handling consistently with other leaves), I can drop this
> part of the change.

As I understand it, this change is partially motivated by the desire to avoid leaking
the hardware max number of leaves when not required. With Viridian
it's all software based, so we are not leaking any hardware details
AFAICT, and hence I would be fine with just using a fixed value.

Thanks, Roger.

Re: [PATCH v2 12/17] x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents
Posted by Jan Beulich 2 years, 11 months ago
On 15.04.2021 14:30, Roger Pau Monné wrote:
> On Thu, Apr 15, 2021 at 12:37:20PM +0200, Jan Beulich wrote:
>> On 15.04.2021 11:52, Roger Pau Monné wrote:
>>> On Mon, Nov 23, 2020 at 03:33:03PM +0100, Jan Beulich wrote:
>>>> --- a/tools/tests/cpu-policy/test-cpu-policy.c
>>>> +++ b/tools/tests/cpu-policy/test-cpu-policy.c
>>>> @@ -8,10 +8,13 @@
>>>>  #include <err.h>
>>>>  
>>>>  #include <xen-tools/libs.h>
>>>> +#include <xen/asm/x86-defns.h>
>>>>  #include <xen/asm/x86-vendors.h>
>>>>  #include <xen/lib/x86/cpu-policy.h>
>>>>  #include <xen/domctl.h>
>>>>  
>>>> +#define XSTATE_FP_SSE  (X86_XCR0_FP | X86_XCR0_SSE)
>>>
>>> This gets used only once...
>>>
>>>> +
>>>>  static unsigned int nr_failures;
>>>>  #define fail(fmt, ...)                          \
>>>>  ({                                              \
>>>> @@ -564,6 +567,103 @@ static void test_cpuid_out_of_range_clea
>>>>      }
>>>>  }
>>>>  
>>>> +static void test_cpuid_maximum_leaf_shrinking(void)
>>>> +{
>>>> +    static const struct test {
>>>> +        const char *name;
>>>> +        struct cpuid_policy p;
>>>> +    } tests[] = {
>>>> +        {
>>>> +            .name = "basic",
>>>> +            .p = {
>>>> +                /* Very basic information only. */
>>>> +                .basic.max_leaf = 1,
>>>> +                .basic.raw_fms = 0xc2,
>>>> +            },
>>>> +        },
>>>> +        {
>>>> +            .name = "cache",
>>>> +            .p = {
>>>> +                /* Cache subleaves present. */
>>>> +                .basic.max_leaf = 4,
>>>> +                .cache.subleaf[0].type = 1,
>>>> +            },
>>>> +        },
>>>> +        {
>>>> +            .name = "feat#0",
>>>> +            .p = {
>>>> +                /* Subleaf 0 only with some valid bit. */
>>>> +                .basic.max_leaf = 7,
>>>> +                .feat.max_subleaf = 0,
>>>> +                .feat.fsgsbase = 1,
>>>> +            },
>>>> +        },
>>>> +        {
>>>> +            .name = "feat#1",
>>>> +            .p = {
>>>> +                /* Subleaf 1 only with some valid bit. */
>>>> +                .basic.max_leaf = 7,
>>>> +                .feat.max_subleaf = 1,
>>>> +                .feat.avx_vnni = 1,
>>>> +            },
>>>> +        },
>>>> +        {
>>>> +            .name = "topo",
>>>> +            .p = {
>>>> +                /* Topology subleaves present. */
>>>> +                .basic.max_leaf = 0xb,
>>>> +                .topo.subleaf[0].type = 1,
>>>> +            },
>>>> +        },
>>>> +        {
>>>> +            .name = "xstate",
>>>> +            .p = {
>>>> +                /* First subleaf always valid (and then non-zero). */
>>>> +                .basic.max_leaf = 0xd,
>>>> +                .xstate.xcr0_low = XSTATE_FP_SSE,
>>>
>>> ...here.
>>
>> For now, yes. I'm introducing the constant because I think it wants
>> using in other places too, to avoid using literal numbers. See e.g.
>>
>>                 .xstate.xcr0_low = 7,
>>
>> in test_cpuid_serialise_success().
>>
>>> And then I also wonder whether this requires having any
>>> specific values rather than just using ~0 or any random non-0 value.
>>
>> I'm afraid I don't understand: There's no ~0 here and no random
>> non-zero value - all other structure elements are left default-
>> initialized.
> 
> Oh, I've worded that sentence wrongly, I think. What I meant to say is
> that for the purposes of the test here you could just fill the fields
> with ~0 instead of using things like XSTATE_FP_SSE?

The test would perhaps be fine, at least right now. But ~0 is not
really a legitimate value, especially if - for consistency - also
putting it in .xcr0_high. I wanted to have a well-defined, always
valid value here, avoiding the risk of needing to change the value
again later on.

Jan

[PATCH v2 13/17] x86/CPUID: move bounding of max_{,sub}leaf fields to library code
Posted by Jan Beulich 3 years, 4 months ago
Break out this logic from calculate_host_policy() to also use it in the
x86 emulator harness, where subsequently we'll want to avoid open-coding
AMX maximum palette bounding.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.
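
(The intended call pattern in a consumer, sketched here for illustration
only; the x86-emulate.c hunk below is the authoritative version:)

    struct cpuid_policy cp;

    x86_cpuid_policy_fill_native(&cp);       /* raw CPUID data from hardware */
    x86_cpuid_policy_bound_max_leaves(&cp);  /* clamp to the raw[] array sizes */
    /* ... adjust individual leaves / features as needed ... */
    x86_cpuid_policy_shrink_max_leaves(&cp); /* drop trailing all-blank leaves */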

--- a/tools/tests/x86_emulator/x86-emulate.c
+++ b/tools/tests/x86_emulator/x86-emulate.c
@@ -79,6 +79,7 @@ bool emul_test_init(void)
     unsigned long sp;
 
     x86_cpuid_policy_fill_native(&cp);
+    x86_cpuid_policy_bound_max_leaves(&cp);
 
     /*
      * The emulator doesn't use these instructions, so can always emulate
@@ -91,6 +92,8 @@ bool emul_test_init(void)
     cp.feat.rdpid = true;
     cp.extd.clzero = true;
 
+    x86_cpuid_policy_shrink_max_leaves(&cp);
+
     if ( cpu_has_xsave )
     {
         unsigned int tmp, ebx;
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -319,12 +319,7 @@ static void __init calculate_host_policy
 
     *p = raw_cpuid_policy;
 
-    p->basic.max_leaf =
-        min_t(uint32_t, p->basic.max_leaf,   ARRAY_SIZE(p->basic.raw) - 1);
-    p->feat.max_subleaf =
-        min_t(uint32_t, p->feat.max_subleaf, ARRAY_SIZE(p->feat.raw) - 1);
-    p->extd.max_leaf = 0x80000000 | min_t(uint32_t, p->extd.max_leaf & 0xffff,
-                                          ARRAY_SIZE(p->extd.raw) - 1);
+    x86_cpuid_policy_bound_max_leaves(p);
 
     cpuid_featureset_to_policy(boot_cpu_data.x86_capability, p);
     recalculate_xstate(p);
--- a/xen/include/xen/lib/x86/cpuid.h
+++ b/xen/include/xen/lib/x86/cpuid.h
@@ -352,6 +352,12 @@ void x86_cpuid_policy_fill_native(struct
 void x86_cpuid_policy_clear_out_of_range_leaves(struct cpuid_policy *p);
 
 /**
+ * Bound max leaf/subleaf values according to the capacity of the respective
+ * arrays in struct cpuid_policy.
+ */
+void x86_cpuid_policy_bound_max_leaves(struct cpuid_policy *p);
+
+/**
  * Shrink max leaf/subleaf values such that the last respective valid entry
  * isn't all blank.  While permitted by the spec, such extraneous leaves may
  * provide undue "hints" to guests.
--- a/xen/lib/x86/cpuid.c
+++ b/xen/lib/x86/cpuid.c
@@ -238,6 +238,16 @@ void x86_cpuid_policy_clear_out_of_range
                 ARRAY_SIZE(p->extd.raw) - 1);
 }
 
+void x86_cpuid_policy_bound_max_leaves(struct cpuid_policy *p)
+{
+    p->basic.max_leaf =
+        min_t(uint32_t, p->basic.max_leaf, ARRAY_SIZE(p->basic.raw) - 1);
+    p->feat.max_subleaf =
+        min_t(uint32_t, p->feat.max_subleaf, ARRAY_SIZE(p->feat.raw) - 1);
+    p->extd.max_leaf = 0x80000000 | min_t(uint32_t, p->extd.max_leaf & 0xffff,
+                                          ARRAY_SIZE(p->extd.raw) - 1);
+}
+
 void x86_cpuid_policy_shrink_max_leaves(struct cpuid_policy *p)
 {
     unsigned int i;


[PATCH v2 14/17] x86/CPUID: enable AMX leaves
Posted by Jan Beulich 3 years, 4 months ago
This requires bumping the number of basic leaves we support. Apart from
this the logic is modeled as closely as possible to that of leaf 7
handling.

The checks in x86_cpu_policies_are_compatible() may be more strict than
they ultimately need to be, but I'd rather start being on the safe side.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.
---
It's not clear to me to what extent libxl_cpuid.c would want extending: it
doesn't appear to offer a way to override the maximum subleaf of leaf 7.
In fact I can't seem to spot a max extended leaf override mechanism
either.
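
(For orientation, a minimal user-space sketch of the native enumeration the
new leaves model, on AMX-capable hardware and assuming the field layout
implied by the masks in recalculate_tile() below - purely illustrative, not
part of the patch:)

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int a, b, c, d;

        /* Leaf 0x1d, subleaf 0: maximum supported tile palette in EAX. */
        __cpuid_count(0x1d, 0, a, b, c, d);
        printf("max palette: %u\n", a);

        /* Leaf 0x1e, subleaf 0: TMUL maxk/maxn in the low 24 bits of EBX. */
        __cpuid_count(0x1e, 0, a, b, c, d);
        printf("tmul maxk %u, maxn %u\n", b & 0xff, (b >> 8) & 0xffff);

        return 0;
    }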

--- a/tools/tests/cpu-policy/test-cpu-policy.c
+++ b/tools/tests/cpu-policy/test-cpu-policy.c
@@ -190,6 +190,40 @@ static void test_cpuid_serialise_success
             },
             .nr_leaves = 4 + 0xd + 1 + 1,
         },
+
+        /* Leaf 0x1d serialisation stops at max_palette. */
+        {
+            .name = "empty leaf 0x1d",
+            .p = {
+                .basic.max_leaf = 0x1d,
+            },
+            .nr_leaves = 4 + 0x1d + 1,
+        },
+        {
+            .name = "partial leaf 0x1d",
+            .p = {
+                .basic.max_leaf = 0x1d,
+                .tile.max_palette = 1,
+            },
+            .nr_leaves = 4 + 0x1d + 1 + 1,
+        },
+
+        /* Leaf 0x1e serialisation stops at 0. */
+        {
+            .name = "empty leaf 0x1e",
+            .p = {
+                .basic.max_leaf = 0x1e,
+            },
+            .nr_leaves = 4 + 0x1e + 1,
+        },
+        {
+            .name = "partial leaf 0x1e",
+            .p = {
+                .basic.max_leaf = 0x1e,
+                .tmul.maxk = 16,
+            },
+            .nr_leaves = 4 + 0x1e + 1,
+        },
     };
 
     printf("Testing CPUID serialise success:\n");
@@ -321,6 +355,14 @@ static void test_cpuid_deserialise_failu
             .leaf = { .leaf = 0xd, .subleaf = CPUID_GUEST_NR_XSTATE },
         },
         {
+            .name = "OoB tile leaf",
+            .leaf = { .leaf = 0x1d, .subleaf = CPUID_GUEST_NR_PALETTE },
+        },
+        {
+            .name = "OoB tmul leaf",
+            .leaf = { .leaf = 0x1e, .subleaf = CPUID_GUEST_NR_TMUL },
+        },
+        {
             .name = "OoB extd leaf",
             .leaf = { .leaf = 0x80000000 | CPUID_GUEST_NR_EXTD },
         },
@@ -432,6 +474,8 @@ static void test_cpuid_out_of_range_clea
                 .topo.raw[0].a = 0xc2,
                 .xstate.raw[0].a = 0xc2,
                 .xstate.raw[1].a = 0xc2,
+                .tile.raw[0].a = 0xc2,
+                .tmul.raw[0].a = 0xc2,
             },
         },
         {
@@ -447,6 +491,8 @@ static void test_cpuid_out_of_range_clea
                 .topo.raw[0].a = 0xc2,
                 .xstate.raw[0].a = 0xc2,
                 .xstate.raw[1].a = 0xc2,
+                .tile.raw[0].a = 0xc2,
+                .tmul.raw[0].a = 0xc2,
             },
         },
         {
@@ -461,6 +507,8 @@ static void test_cpuid_out_of_range_clea
                 .topo.raw[0].a = 0xc2,
                 .xstate.raw[0].a = 0xc2,
                 .xstate.raw[1].a = 0xc2,
+                .tile.raw[0].a = 0xc2,
+                .tmul.raw[0].a = 0xc2,
             },
         },
         {
@@ -474,6 +522,8 @@ static void test_cpuid_out_of_range_clea
                 .topo.raw[1].b = 0xc2,
                 .xstate.raw[0].a = 0xc2,
                 .xstate.raw[1].a = 0xc2,
+                .tile.raw[0].a = 0xc2,
+                .tmul.raw[0].a = 0xc2,
             },
         },
         {
@@ -488,6 +538,8 @@ static void test_cpuid_out_of_range_clea
 
                 .xstate.raw[2].b = 0xc2,
                 .xstate.raw[3].b = 0xc2,
+                .tile.raw[0].a = 0xc2,
+                .tmul.raw[0].a = 0xc2,
             },
         },
         {
@@ -530,6 +582,34 @@ static void test_cpuid_out_of_range_clea
             },
         },
         {
+            .name = "tile no palette",
+            .nr_markers = 0,
+            .p = {
+                /* First two subleaves invalid as a pair.  Others cleared. */
+                .basic.max_leaf = 0x1d,
+                .xstate.xcr0_low = XSTATE_FP_SSE,
+
+                .tile.raw[0].a = 0xc2,
+                .tile.raw[1].b = 0xc2,
+                .tmul.raw[0].a = 0xc2,
+            },
+        },
+        {
+            .name = "tile palette 1",
+            .nr_markers = 1,
+            .p = {
+                /* First two subleaves valid as a pair.  Others cleared. */
+                .basic.max_leaf = 0x1d,
+                .feat.amx_tile = 1,
+                .xstate.xcr0_low = XSTATE_FP_SSE | X86_XCR0_TILECFG |
+                                   X86_XCR0_TILEDATA,
+                .tile.raw[0].a = 1,
+                .tile.raw[1].b = 0xc2,
+
+                .tmul.raw[0].a = 0xc2,
+            },
+        },
+        {
             .name = "extd",
             .nr_markers = 1,
             .p = {
@@ -624,6 +704,24 @@ static void test_cpuid_maximum_leaf_shri
             },
         },
         {
+            .name = "tile",
+            .p = {
+                /* Subleaf 1 only with some valid value. */
+                .basic.max_leaf = 0x1d,
+                .tile.raw[0].a = 1,
+                .tile.raw[1].a = 1024,
+            },
+        },
+        {
+            .name = "tmul",
+            .p = {
+                /* Subleaf 0 only with some valid values. */
+                .basic.max_leaf = 0x1e,
+                .tmul.maxk = 16,
+                .tmul.maxn = 16,
+            },
+        },
+        {
             .name = "extd",
             .p = {
                 /* Commonly available information only. */
@@ -643,6 +741,7 @@ static void test_cpuid_maximum_leaf_shri
 
         p->basic.max_leaf = ARRAY_SIZE(p->basic.raw) - 1;
         p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
+        p->tile.max_palette = ARRAY_SIZE(p->tile.raw) - 1;
         p->extd.max_leaf = 0x80000000 | (ARRAY_SIZE(p->extd.raw) - 1);
 
         x86_cpuid_policy_shrink_max_leaves(p);
@@ -660,6 +759,10 @@ static void test_cpuid_maximum_leaf_shri
              fail("  Test %s feat fail - expected %#x, got %#x\n",
                   t->name, t->p.feat.max_subleaf, p->feat.max_subleaf);
 
+        if ( p->tile.max_palette != t->p.tile.max_palette )
+             fail("  Test %s tile fail - expected %#x, got %#x\n",
+                  t->name, t->p.tile.max_palette, p->tile.max_palette);
+
         free(p);
     }
 }
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -230,6 +230,29 @@ static void recalculate_xstate(struct cp
     }
 }
 
+static void recalculate_tile(struct cpuid_policy *p)
+{
+    unsigned int i;
+
+    if ( !p->feat.amx_tile )
+    {
+        memset(&p->tile, 0, sizeof(p->tile));
+        memset(&p->tmul, 0, sizeof(p->tmul));
+        return;
+    }
+
+    p->tile.raw[0].b = p->tile.raw[0].c = p->tile.raw[0].d = 0;
+
+    for ( i = 1; i <= p->tile.max_palette; ++i )
+    {
+        p->tile.raw[i].c &= 0x0000ffff;
+        p->tile.raw[i].d = 0;
+    }
+
+    p->tmul.raw[0].a = p->tmul.raw[0].c = p->tmul.raw[0].d = 0;
+    p->tmul.raw[0].b &= 0x00ffffff;
+}
+
 /*
  * Misc adjustments to the policy.  Mostly clobbering reserved fields and
  * duplicating shared fields.  Intentionally hidden fields are annotated.
@@ -249,6 +272,8 @@ static void recalculate_misc(struct cpui
 
     p->basic.raw[0xc] = EMPTY_LEAF;
 
+    zero_leaves(p->basic.raw, 0xe, 0x1c);
+
     p->extd.e1d &= ~CPUID_COMMON_1D_FEATURES;
 
     /* Most of Power/RAS hidden from guests. */
@@ -323,6 +348,7 @@ static void __init calculate_host_policy
 
     cpuid_featureset_to_policy(boot_cpu_data.x86_capability, p);
     recalculate_xstate(p);
+    recalculate_tile(p);
     recalculate_misc(p);
 
     /* When vPMU is disabled, drop it from the host policy. */
@@ -410,6 +436,7 @@ static void __init calculate_pv_max_poli
     sanitise_featureset(pv_featureset);
     cpuid_featureset_to_policy(pv_featureset, p);
     recalculate_xstate(p);
+    recalculate_tile(p);
 
     p->extd.raw[0xa] = EMPTY_LEAF; /* No SVM for PV guests. */
 
@@ -434,6 +461,7 @@ static void __init calculate_pv_def_poli
     sanitise_featureset(pv_featureset);
     cpuid_featureset_to_policy(pv_featureset, p);
     recalculate_xstate(p);
+    recalculate_tile(p);
 
     x86_cpuid_policy_shrink_max_leaves(p);
 }
@@ -495,6 +523,7 @@ static void __init calculate_hvm_max_pol
     sanitise_featureset(hvm_featureset);
     cpuid_featureset_to_policy(hvm_featureset, p);
     recalculate_xstate(p);
+    recalculate_tile(p);
 
     x86_cpuid_policy_shrink_max_leaves(p);
 }
@@ -521,6 +550,7 @@ static void __init calculate_hvm_def_pol
     sanitise_featureset(hvm_featureset);
     cpuid_featureset_to_policy(hvm_featureset, p);
     recalculate_xstate(p);
+    recalculate_tile(p);
 
     x86_cpuid_policy_shrink_max_leaves(p);
 }
@@ -591,6 +621,7 @@ void recalculate_cpuid_policy(struct dom
 
     p->basic.max_leaf   = min(p->basic.max_leaf,   max->basic.max_leaf);
     p->feat.max_subleaf = min(p->feat.max_subleaf, max->feat.max_subleaf);
+    p->tile.max_palette = min(p->tile.max_palette, max->tile.max_palette);
     p->extd.max_leaf    = 0x80000000 | min(p->extd.max_leaf & 0xffff,
                                            ((p->x86_vendor & (X86_VENDOR_AMD |
                                                               X86_VENDOR_HYGON))
@@ -681,6 +712,7 @@ void recalculate_cpuid_policy(struct dom
     p->extd.maxlinaddr = p->extd.lm ? 48 : 32;
 
     recalculate_xstate(p);
+    recalculate_tile(p);
     recalculate_misc(p);
 
     for ( i = 0; i < ARRAY_SIZE(p->cache.raw); ++i )
@@ -803,6 +835,22 @@ void guest_cpuid(const struct vcpu *v, u
             *res = array_access_nospec(p->xstate.raw, subleaf);
             break;
 
+        case 0x1d:
+            ASSERT(p->tile.max_palette < ARRAY_SIZE(p->tile.raw));
+            if ( subleaf > min_t(uint32_t, p->tile.max_palette,
+                                 ARRAY_SIZE(p->tile.raw) - 1) )
+                return;
+
+            *res = array_access_nospec(p->tile.raw, subleaf);
+            break;
+
+        case 0x1e:
+            if ( subleaf >= ARRAY_SIZE(p->tmul.raw) )
+                return;
+
+            *res = array_access_nospec(p->tmul.raw, subleaf);
+            break;
+
         default:
             *res = array_access_nospec(p->basic.raw, leaf);
             break;
@@ -1136,6 +1184,8 @@ static void __init __maybe_unused build_
                  sizeof(raw_cpuid_policy.feat.raw));
     BUILD_BUG_ON(sizeof(raw_cpuid_policy.xstate) !=
                  sizeof(raw_cpuid_policy.xstate.raw));
+    BUILD_BUG_ON(sizeof(raw_cpuid_policy.tile) !=
+                 sizeof(raw_cpuid_policy.tile.raw));
     BUILD_BUG_ON(sizeof(raw_cpuid_policy.extd) !=
                  sizeof(raw_cpuid_policy.extd.raw));
 }
--- a/xen/include/xen/lib/x86/cpuid.h
+++ b/xen/include/xen/lib/x86/cpuid.h
@@ -78,11 +78,13 @@ unsigned int x86_cpuid_lookup_vendor(uin
  */
 const char *x86_cpuid_vendor_to_str(unsigned int vendor);
 
-#define CPUID_GUEST_NR_BASIC      (0xdu + 1)
+#define CPUID_GUEST_NR_BASIC      (0x1eu + 1)
 #define CPUID_GUEST_NR_CACHE      (5u + 1)
 #define CPUID_GUEST_NR_FEAT       (1u + 1)
 #define CPUID_GUEST_NR_TOPO       (1u + 1)
 #define CPUID_GUEST_NR_XSTATE     (62u + 1)
+#define CPUID_GUEST_NR_PALETTE    (1u + 1)
+#define CPUID_GUEST_NR_TMUL       (0u + 1)
 #define CPUID_GUEST_NR_EXTD_INTEL (0x8u + 1)
 #define CPUID_GUEST_NR_EXTD_AMD   (0x1cu + 1)
 #define CPUID_GUEST_NR_EXTD       MAX(CPUID_GUEST_NR_EXTD_INTEL, \
@@ -225,6 +227,35 @@ struct cpuid_policy
         } comp[CPUID_GUEST_NR_XSTATE];
     } xstate;
 
+    /* Structured tile information leaf: 0x0000001d[xx] */
+    union {
+        struct cpuid_leaf raw[CPUID_GUEST_NR_PALETTE];
+        struct {
+            /* Subleaf 0. */
+            uint32_t max_palette;
+            uint32_t /* b */:32, /* c */:32, /* d */:32;
+        };
+
+        /* Per-palette common state.  Valid for i >= 1. */
+        struct {
+            uint16_t tot_bytes, bytes_per_tile;
+            uint16_t bytes_per_row, num_regs;
+            uint16_t max_rows, :16;
+            uint32_t /* d */:32;
+        } palette[CPUID_GUEST_NR_PALETTE];
+    } tile;
+
+    /* Structured tmul information leaf: 0x0000001e[xx] */
+    union {
+        struct cpuid_leaf raw[CPUID_GUEST_NR_TMUL];
+        struct {
+            /* Subleaf 0. */
+            uint32_t /* a */:32;
+            uint32_t maxk:8, maxn:16, :8;
+            uint32_t /* c */:32, /* d */:32;
+        };
+    } tmul;
+
     /* Extended leaves: 0x800000xx */
     union {
         struct cpuid_leaf raw[CPUID_GUEST_NR_EXTD];
--- a/xen/lib/x86/cpuid.c
+++ b/xen/lib/x86/cpuid.c
@@ -170,6 +170,18 @@ void x86_cpuid_policy_fill_native(struct
         }
     }
 
+    if ( p->basic.max_leaf >= 0x1d )
+    {
+        cpuid_count_leaf(0x1d, 0, &p->tile.raw[0]);
+
+        for ( i = 1; i <= MIN(p->tile.max_palette,
+                              ARRAY_SIZE(p->tile.raw) - 1); ++i )
+            cpuid_count_leaf(0x1d, i, &p->tile.raw[i]);
+    }
+
+    if ( p->basic.max_leaf >= 0x1e )
+        cpuid_count_leaf(0x1e, 0, &p->tmul.raw[0]);
+
     /* Extended leaves. */
     cpuid_leaf(0x80000000, &p->extd.raw[0]);
     for ( i = 1; i <= MIN(p->extd.max_leaf & 0xffffU,
@@ -232,6 +244,19 @@ void x86_cpuid_policy_clear_out_of_range
                     ARRAY_SIZE(p->xstate.raw) - 1);
     }
 
+    if ( p->basic.max_leaf < 0x1d ||
+         (cpuid_policy_xstates(p) &
+          (X86_XCR0_TILECFG | X86_XCR0_TILEDATA)) !=
+         (X86_XCR0_TILECFG | X86_XCR0_TILEDATA) )
+        memset(p->tile.raw, 0, sizeof(p->tile.raw));
+    else
+        zero_leaves(p->tile.raw, p->tile.max_palette + 1,
+                    ARRAY_SIZE(p->tile.raw) - 1);
+
+    if ( p->basic.max_leaf < 0x1e || !p->tile.max_palette ||
+         (!p->feat.amx_int8 && !p->feat.amx_bf16) )
+        memset(p->tmul.raw, 0, sizeof(p->tmul.raw));
+
     zero_leaves(p->extd.raw,
                 ((p->extd.max_leaf >> 16) == 0x8000
                  ? (p->extd.max_leaf & 0xffff) + 1 : 0),
@@ -244,6 +269,8 @@ void x86_cpuid_policy_bound_max_leaves(s
         min_t(uint32_t, p->basic.max_leaf, ARRAY_SIZE(p->basic.raw) - 1);
     p->feat.max_subleaf =
         min_t(uint32_t, p->feat.max_subleaf, ARRAY_SIZE(p->feat.raw) - 1);
+    p->tile.max_palette =
+        min_t(uint32_t, p->tile.max_palette, ARRAY_SIZE(p->tile.raw) - 1);
     p->extd.max_leaf = 0x80000000 | min_t(uint32_t, p->extd.max_leaf & 0xffff,
                                           ARRAY_SIZE(p->extd.raw) - 1);
 }
@@ -271,6 +298,21 @@ void x86_cpuid_policy_shrink_max_leaves(
      */
     p->basic.raw[0xd] = p->xstate.raw[0];
 
+    for ( i = p->tile.max_palette; i; --i )
+        if ( p->tile.raw[i].a | p->tile.raw[i].b |
+             p->tile.raw[i].c | p->tile.raw[i].d )
+            break;
+    if ( i )
+        p->tile.max_palette = i;
+    else
+    {
+        ASSERT(!p->feat.amx_tile);
+        zero_leaves(p->tile.raw, 0, 0);
+    }
+    p->basic.raw[0x1d] = p->tile.raw[0];
+
+    p->basic.raw[0x1e] = p->tmul.raw[0];
+
     for ( i = p->basic.max_leaf; i; --i )
         if ( p->basic.raw[i].a | p->basic.raw[i].b |
              p->basic.raw[i].c | p->basic.raw[i].d )
@@ -404,6 +446,19 @@ int x86_cpuid_copy_to_buffer(const struc
             break;
         }
 
+        case 0x1d:
+            for ( subleaf = 0;
+                  subleaf <= MIN(p->tile.max_palette,
+                                 ARRAY_SIZE(p->tile.raw) - 1); ++subleaf )
+                COPY_LEAF(leaf, subleaf, &p->tile.raw[subleaf]);
+            break;
+
+        case 0x1e:
+            for ( subleaf = 0;
+                  subleaf <= ARRAY_SIZE(p->tmul.raw) - 1; ++subleaf )
+                COPY_LEAF(leaf, subleaf, &p->tmul.raw[subleaf]);
+            break;
+
         default:
             COPY_LEAF(leaf, XEN_CPUID_NO_SUBLEAF, &p->basic.raw[leaf]);
             break;
@@ -496,6 +551,20 @@ int x86_cpuid_copy_from_buffer(struct cp
                 array_access_nospec(p->xstate.raw, data.subleaf) = l;
                 break;
 
+            case 0x1d:
+                if ( data.subleaf >= ARRAY_SIZE(p->tile.raw) )
+                    goto out_of_range;
+
+                array_access_nospec(p->tile.raw, data.subleaf) = l;
+                break;
+
+            case 0x1e:
+                if ( data.subleaf >= ARRAY_SIZE(p->tmul.raw) )
+                    goto out_of_range;
+
+                array_access_nospec(p->tmul.raw, data.subleaf) = l;
+                break;
+
             default:
                 if ( data.subleaf != XEN_CPUID_NO_SUBLEAF )
                     goto out_of_range;
--- a/xen/lib/x86/policy.c
+++ b/xen/lib/x86/policy.c
@@ -7,6 +7,7 @@ int x86_cpu_policies_are_compatible(cons
                                     struct cpu_policy_errors *err)
 {
     struct cpu_policy_errors e = INIT_CPU_POLICY_ERRORS;
+    unsigned int i;
     int ret = -EINVAL;
 
 #define NA XEN_CPUID_NO_SUBLEAF
@@ -21,6 +22,31 @@ int x86_cpu_policies_are_compatible(cons
     if ( guest->cpuid->feat.max_subleaf > host->cpuid->feat.max_subleaf )
         FAIL_CPUID(7, 0);
 
+    if ( (guest->cpuid->feat.amx_tile && !guest->cpuid->tile.max_palette) ||
+         guest->cpuid->tile.max_palette > host->cpuid->tile.max_palette )
+        FAIL_CPUID(0x1d, 0);
+
+    for ( i = 1; i <= guest->cpuid->tile.max_palette; ++i )
+    {
+        const typeof(guest->cpuid->tile.palette[0]) *gt, *ht;
+
+        gt = &guest->cpuid->tile.palette[i];
+        ht = &host->cpuid->tile.palette[i];
+
+        if ( gt->tot_bytes != ht->tot_bytes ||
+             gt->bytes_per_tile != ht->bytes_per_tile ||
+             gt->bytes_per_row != ht->bytes_per_row ||
+             !gt->num_regs || gt->num_regs > ht->num_regs ||
+             !gt->max_rows || gt->max_rows > ht->max_rows )
+            FAIL_CPUID(0x1d, i);
+    }
+
+    if ( ((guest->cpuid->feat.amx_int8 || guest->cpuid->feat.amx_bf16) &&
+          (!guest->cpuid->tmul.maxk || !guest->cpuid->tmul.maxn)) ||
+         guest->cpuid->tmul.maxk > host->cpuid->tmul.maxk ||
+         guest->cpuid->tmul.maxn > host->cpuid->tmul.maxn )
+        FAIL_CPUID(0x1e, 0);
+
     if ( guest->cpuid->extd.max_leaf > host->cpuid->extd.max_leaf )
         FAIL_CPUID(0x80000000, NA);
 
--- a/xen/lib/x86/private.h
+++ b/xen/lib/x86/private.h
@@ -17,13 +17,17 @@
 
 #else
 
+#include <assert.h>
 #include <errno.h>
 #include <inttypes.h>
 #include <stdbool.h>
 #include <stddef.h>
 #include <string.h>
 
+#define ASSERT assert
+
 #include <xen/asm/msr-index.h>
+#include <xen/asm/x86-defns.h>
 #include <xen/asm/x86-vendors.h>
 
 #include <xen-tools/libs.h>


[PATCH v2 15/17] x86emul: introduce X86EMUL_FPU_tile
Posted by Jan Beulich 3 years, 4 months ago
This will be used by AMX insns. Note that CR0.TS behavior is only
assumed to be similar to that of AVX* insns, as the ISA extensions
document (as of rev 041) doesn't specify this either way. But since XFD
is not supposed to be used for lazy context restore, it's unlikely to
behave any other way.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.
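
Illustration only, not part of the patch: the X86EMUL_FPU_tile case
added to _get_fpu() below requires both AMX components to be enabled in
XCR0 at the same time. A minimal standalone sketch of that gating (the
helper name and local bit definitions are mine, mirroring
X86_XCR0_TILECFG / X86_XCR0_TILEDATA):

#include <stdbool.h>
#include <stdint.h>

/* XCR0 bits 17 and 18 cover XTILECFG and XTILEDATA respectively. */
#define XCR0_TILECFG  (1ULL << 17)
#define XCR0_TILEDATA (1ULL << 18)

/* Tile insns are usable only with both AMX state components enabled. */
static bool xcr0_allows_tile(uint64_t xcr0)
{
    return (xcr0 & XCR0_TILECFG) && (xcr0 & XCR0_TILEDATA);
}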

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -1420,6 +1420,12 @@ static int _get_fpu(
             return X86EMUL_UNHANDLEABLE;
         break;
 
+    case X86EMUL_FPU_tile:
+        ASSERT(mode_64bit());
+        if ( !(xcr0 & X86_XCR0_TILECFG) || !(xcr0 & X86_XCR0_TILEDATA) )
+            return X86EMUL_UNHANDLEABLE;
+        break;
+
     default:
         break;
     }
--- a/xen/arch/x86/x86_emulate/x86_emulate.h
+++ b/xen/arch/x86/x86_emulate/x86_emulate.h
@@ -172,6 +172,7 @@ enum x86_emulate_fpu_type {
     X86EMUL_FPU_ymm, /* AVX/XOP instruction set (%ymm0-%ymm7/15) */
     X86EMUL_FPU_opmask, /* AVX512 opmask instruction set (%k0-%k7) */
     X86EMUL_FPU_zmm, /* AVX512 instruction set (%zmm0-%zmm7/31) */
+    X86EMUL_FPU_tile, /* AMX instruction set (%tmm0-%tmmN, tilecfg) */
     /* This sentinel will never be passed to ->get_fpu(). */
     X86EMUL_FPU_none
 };


[PATCH v2 16/17] x86emul: support TILERELEASE
Posted by Jan Beulich 3 years, 4 months ago
This is relatively straightforward, and hence best suited to introduce
a few other general pieces (decode table entry, the vcpu_has_amx_tile()
/ cpu_has_amx_tile feature checks, and first use of X86EMUL_FPU_tile).

Testing of this will be added once a sensible test can be put together,
i.e. when support for other insns is also in place.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.
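
Illustration only, not part of the patch: TILERELEASE takes no operands
and, lacking assembler support, can be emitted via its raw encoding
(VEX.NP.0F38.W0 49 /0 with ModRM C0), just like the test harness added
in the next patch does. The wrapper name below is hypothetical:

/* Reset all tile state, using the raw instruction bytes. */
static inline void tilerelease(void)
{
    asm volatile ( ".byte 0xC4, 0xE2, 0x78, 0x49, 0xC0" );
}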

--- a/tools/tests/x86_emulator/predicates.c
+++ b/tools/tests/x86_emulator/predicates.c
@@ -1335,6 +1335,7 @@ static const struct vex {
     { { 0x45 }, 2, T, R, pfx_66, Wn, Ln }, /* vpsrlv{d,q} */
     { { 0x46 }, 2, T, R, pfx_66, W0, Ln }, /* vpsravd */
     { { 0x47 }, 2, T, R, pfx_66, Wn, Ln }, /* vpsllv{d,q} */
+    { { 0x49, 0xc0 }, 2, F, N, pfx_no, W0, L0 }, /* tilerelease */
     { { 0x50 }, 2, T, R, pfx_66, W0, Ln }, /* vpdpbusd */
     { { 0x51 }, 2, T, R, pfx_66, W0, Ln }, /* vpdpbusds */
     { { 0x52 }, 2, T, R, pfx_66, W0, Ln }, /* vpdpwssd */
--- a/tools/tests/x86_emulator/x86-emulate.c
+++ b/tools/tests/x86_emulator/x86-emulate.c
@@ -247,6 +247,9 @@ int emul_test_get_fpu(
             break;
     default:
         return X86EMUL_UNHANDLEABLE;
+
+    case X86EMUL_FPU_tile:
+        return cpu_has_amx_tile ? X86EMUL_OKAY : X86EMUL_UNHANDLEABLE;
     }
     return X86EMUL_OKAY;
 }
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -475,6 +475,7 @@ static const struct ext0f38_table {
     [0x43] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x44] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x49] = { .simd_size = simd_other, .two_op = 1 },
     [0x4c] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
@@ -2014,6 +2015,7 @@ amd_like(const struct x86_emulate_ctxt *
 #define vcpu_has_avx512_4fmaps() (ctxt->cpuid->feat.avx512_4fmaps)
 #define vcpu_has_avx512_vp2intersect() (ctxt->cpuid->feat.avx512_vp2intersect)
 #define vcpu_has_serialize()   (ctxt->cpuid->feat.serialize)
+#define vcpu_has_amx_tile()    (ctxt->cpuid->feat.amx_tile)
 #define vcpu_has_avx_vnni()    (ctxt->cpuid->feat.avx_vnni)
 #define vcpu_has_avx512_bf16() (ctxt->cpuid->feat.avx512_bf16)
 
@@ -9460,6 +9462,24 @@ x86_emulate(
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_avx;
 
+    case X86EMUL_OPC_VEX(0x0f38, 0x49):
+        generate_exception_if(!mode_64bit() || vex.l || vex.w, EXC_UD);
+        if ( ea.type == OP_REG )
+        {
+            switch ( modrm )
+            {
+            case 0xc0: /* tilerelease */
+                host_and_vcpu_must_have(amx_tile);
+                get_fpu(X86EMUL_FPU_tile);
+                op_bytes = 1; /* fake */
+                goto simd_0f_common;
+
+            default:
+                goto unrecognized_insn;
+            }
+        }
+        goto unimplemented_insn;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x50): /* vpdpbusd [xy]mm/mem,[xy]mm,[xy]mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x51): /* vpdpbusds [xy]mm/mem,[xy]mm,[xy]mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x52): /* vpdpwssd [xy]mm/mem,[xy]mm,[xy]mm */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -131,6 +131,7 @@
 #define cpu_has_avx512_vp2intersect boot_cpu_has(X86_FEATURE_AVX512_VP2INTERSECT)
 #define cpu_has_tsx_force_abort boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)
 #define cpu_has_serialize       boot_cpu_has(X86_FEATURE_SERIALIZE)
+#define cpu_has_amx_tile        boot_cpu_has(X86_FEATURE_AMX_TILE)
 
 /* CPUID level 0x00000007:1.eax */
 #define cpu_has_avx_vnni        boot_cpu_has(X86_FEATURE_AVX_VNNI)


[PATCH v2 17/17] x86emul: support {LD,ST}TILECFG
Posted by Jan Beulich 3 years, 4 months ago
While rev 041 of the ISA extensions doc also specifies
xcr0_supports_palette() returning false as one of the #GP(0) reasons for
LDTILECFG, the earlier #UD conditions appear to make that check fully
dead.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.
---
SDE: -spr
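
Illustration only, not part of the patch: a standalone sketch of the
#GP(0) checks the LDTILECFG emulation below applies to the 64-byte
memory operand (the range check of the palette number against leaf 0x1d
subleaf 0 is omitted here). The struct mirrors the one added to
mmval_t; the palette_info type is a hypothetical stand-in for the
per-palette CPUID data:

#include <stdbool.h>
#include <stdint.h>

struct tilecfg {
    uint8_t  palette, start_row;
    uint8_t  res[14];
    uint16_t colsb[16];
    uint8_t  rows[16];
};

struct palette_info {
    uint16_t bytes_per_row;
    uint16_t num_regs;
    uint16_t max_rows;
};

/* Returns false where the emulation would raise #GP(0) (palette != 0). */
static bool tilecfg_valid(const struct tilecfg *cfg,
                          const struct palette_info *pal)
{
    unsigned int i;

    /* Reserved bytes must be zero. */
    for ( i = 0; i < sizeof(cfg->res); ++i )
        if ( cfg->res[i] )
            return false;

    /* Valid registers: within bounds, or colsb/rows both zero. */
    for ( i = 0; i < pal->num_regs; ++i )
        if ( cfg->colsb[i] > pal->bytes_per_row ||
             cfg->rows[i] > pal->max_rows ||
             !cfg->colsb[i] != !cfg->rows[i] )
            return false;

    /* All remaining entries must be zero. */
    for ( ; i < 16; ++i )
        if ( cfg->colsb[i] || cfg->rows[i] )
            return false;

    return true;
}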

--- a/tools/tests/x86_emulator/predicates.c
+++ b/tools/tests/x86_emulator/predicates.c
@@ -1335,6 +1335,8 @@ static const struct vex {
     { { 0x45 }, 2, T, R, pfx_66, Wn, Ln }, /* vpsrlv{d,q} */
     { { 0x46 }, 2, T, R, pfx_66, W0, Ln }, /* vpsravd */
     { { 0x47 }, 2, T, R, pfx_66, Wn, Ln }, /* vpsllv{d,q} */
+    { { 0x49, 0x00 }, 2, F, R, pfx_no, W0, L0 }, /* ldtilecfg */
+    { { 0x49, 0x00 }, 2, F, W, pfx_66, W0, L0 }, /* sttilecfg */
     { { 0x49, 0xc0 }, 2, F, N, pfx_no, W0, L0 }, /* tilerelease */
     { { 0x50 }, 2, T, R, pfx_66, W0, Ln }, /* vpdpbusd */
     { { 0x51 }, 2, T, R, pfx_66, W0, Ln }, /* vpdpbusds */
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -898,6 +898,16 @@ int main(int argc, char **argv)
     int rc;
 #ifdef __x86_64__
     unsigned int vendor_native;
+    static const struct {
+        uint8_t palette, start_row;
+        uint8_t res[14];
+        uint16_t colsb[16];
+        uint8_t rows[16];
+    } tilecfg = {
+        .palette = 1,
+        .colsb = { 2, 4, 5, 3 },
+        .rows = { 2, 4, 3, 5 },
+    };
 #else
     unsigned int bcdres_native, bcdres_emul;
 #endif
@@ -4463,6 +4473,74 @@ int main(int argc, char **argv)
         printf("skipped\n");
 
 #ifdef __x86_64__
+    printf("%-40s", "Testing tilerelease;sttilecfg 4(%rcx)...");
+    if ( stack_exec && cpu_has_amx_tile )
+    {
+        decl_insn(tilerelease);
+
+        asm volatile ( put_insn(tilerelease,
+                                /* tilerelease */
+                                ".byte 0xC4, 0xE2, 0x78, 0x49, 0xC0;"
+                                /* sttilecfg 4(%0) */
+                                ".byte 0xC4, 0xE2, 0x79, 0x49, 0x41, 0x04")
+                                :: "c" (NULL) );
+
+        memset(res, ~0, 72);
+        set_insn(tilerelease);
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc == X86EMUL_OKAY )
+            rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(tilerelease) ||
+             ~res[0] || ~res[17] || memchr_inv(res + 1, 0, 64) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing ldtilecfg (%rdx)...");
+    if ( stack_exec && cpu_has_amx_tile )
+    {
+        decl_insn(ldtilecfg);
+
+        asm volatile ( put_insn(ldtilecfg,
+                                /* ldtilecfg (%0) */
+                                ".byte 0xC4, 0xE2, 0x78, 0x49, 0x02")
+                                :: "d" (NULL) );
+
+        set_insn(ldtilecfg);
+        regs.edx = (unsigned long)&tilecfg;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(ldtilecfg) )
+            goto fail;
+        printf("pending\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing sttilecfg -4(%rcx)...");
+    if ( stack_exec && cpu_has_amx_tile )
+    {
+        decl_insn(sttilecfg);
+
+        asm volatile ( put_insn(sttilecfg,
+                                /* sttilecfg -4(%0) */
+                                ".byte 0xC4, 0xE2, 0x79, 0x49, 0x41, 0xfc")
+                                :: "c" (NULL) );
+
+        memset(res, ~0, 72);
+        set_insn(sttilecfg);
+        regs.ecx = (unsigned long)(res + 2);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(sttilecfg) ||
+             ~res[0] || ~res[17] || memcmp(res + 1, &tilecfg, 64) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
     printf("%-40s", "Testing vzeroupper (compat)...");
     if ( cpu_has_avx )
     {
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -67,6 +67,17 @@
 
 #define is_canonical_address(x) (((int64_t)(x) >> 47) == ((int64_t)(x) >> 63))
 
+static inline void *memchr_inv(const void *s, int c, size_t n)
+{
+    const unsigned char *p = s;
+
+    while ( n-- )
+        if ( (unsigned char)c != *p++ )
+            return (void *)(p - 1);
+
+    return NULL;
+}
+
 extern uint32_t mxcsr_mask;
 extern struct cpuid_policy cp;
 
@@ -170,6 +181,8 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512_4fmaps (cp.feat.avx512_4fmaps && xcr0_mask(0xe6))
 #define cpu_has_avx512_vp2intersect (cp.feat.avx512_vp2intersect && xcr0_mask(0xe6))
 #define cpu_has_serialize  cp.feat.serialize
+#define cpu_has_amx_tile   (cp.feat.amx_tile && \
+                            xcr0_mask(X86_XCR0_TILECFG | X86_XCR0_TILEDATA))
 #define cpu_has_avx_vnni   (cp.feat.avx_vnni && xcr0_mask(6))
 #define cpu_has_avx512_bf16 (cp.feat.avx512_bf16 && xcr0_mask(0xe6))
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -957,6 +957,12 @@ typedef union {
     uint64_t __attribute__ ((aligned(16))) xmm[2];
     uint64_t __attribute__ ((aligned(32))) ymm[4];
     uint64_t __attribute__ ((aligned(64))) zmm[8];
+    struct {
+        uint8_t palette, start_row;
+        uint8_t res[14];
+        uint16_t colsb[16];
+        uint8_t rows[16];
+    } tilecfg;
     uint32_t data32[16];
 } mmval_t;
 
@@ -2848,6 +2854,10 @@ x86_decode_0f38(
         state->simd_size = simd_scalar_vexw;
         break;
 
+    case X86EMUL_OPC_VEX_66(0, 0x49): /* sttilecfg */
+        state->desc = DstMem | SrcImplicit | Mov;
+        break;
+
     case X86EMUL_OPC_EVEX_66(0, 0x7a): /* vpbroadcastb */
     case X86EMUL_OPC_EVEX_66(0, 0x7b): /* vpbroadcastw */
     case X86EMUL_OPC_EVEX_66(0, 0x7c): /* vpbroadcast{d,q} */
@@ -9478,7 +9488,66 @@ x86_emulate(
                 goto unrecognized_insn;
             }
         }
-        goto unimplemented_insn;
+
+        switch ( modrm_reg & 7 )
+        {
+        case 0: /* ldtilecfg mem */
+            generate_exception_if(vex.reg != 0xf, EXC_UD);
+            host_and_vcpu_must_have(amx_tile);
+            get_fpu(X86EMUL_FPU_tile);
+            rc = ops->read(ea.mem.seg, ea.mem.off, mmvalp, 64, ctxt);
+            if ( rc != X86EMUL_OKAY )
+                goto done;
+            generate_exception_if((mmvalp->tilecfg.palette >
+                                   ctxt->cpuid->tile.max_palette),
+                                  EXC_GP, 0);
+            if ( mmvalp->tilecfg.palette )
+            {
+                const typeof(*ctxt->cpuid->tile.palette) *palette;
+
+                generate_exception_if(memchr_inv(mmvalp->tilecfg.res, 0,
+                                                 sizeof(mmvalp->tilecfg.res)),
+                                      EXC_GP, 0);
+
+                /*
+                 * Parameters for valid registers must be within bounds, or
+                 * both be zero at the same time.
+                 */
+                palette = &ctxt->cpuid->tile.palette[mmvalp->tilecfg.palette];
+                for ( i = 0; i < palette->num_regs; ++i )
+                    generate_exception_if(((mmvalp->tilecfg.colsb[i] >
+                                            palette->bytes_per_row) ||
+                                           (mmvalp->tilecfg.rows[i] >
+                                            palette->max_rows) ||
+                                           (!mmvalp->tilecfg.colsb[i] !=
+                                            !mmvalp->tilecfg.rows[i])),
+                                          EXC_GP, 0);
+
+                /* All remaining entries must be zero. */
+                for ( ; i < 16; ++i )
+                    generate_exception_if((mmvalp->tilecfg.colsb[i] ||
+                                           mmvalp->tilecfg.rows[i]),
+                                          EXC_GP, 0);
+            }
+            op_bytes = 64;
+            goto simd_0f_common;
+        }
+        goto unrecognized_insn;
+
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x49):
+        generate_exception_if(!mode_64bit() || vex.l || vex.w, EXC_UD);
+        if ( ea.type == OP_REG )
+            goto unrecognized_insn;
+
+        switch ( modrm_reg & 7 )
+        {
+        case 0: /* sttilecfg mem */
+            host_and_vcpu_must_have(amx_tile);
+            get_fpu(X86EMUL_FPU_tile);
+            op_bytes = 64;
+            goto simd_0f_common;
+        }
+        goto unrecognized_insn;
 
     case X86EMUL_OPC_VEX_66(0x0f38, 0x50): /* vpdpbusd [xy]mm/mem,[xy]mm,[xy]mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x51): /* vpdpbusds [xy]mm/mem,[xy]mm,[xy]mm */