[PATCH v2 06/17] xen/riscv: add root page table allocation

Posted by Oleksii Kurochko 4 months, 3 weeks ago
Introduce support for allocating and initializing the root page table
required for RISC-V stage-2 address translation.

To implement root page table allocation the following is introduced:
- p2m_allocate_root() helper to allocate and zero a 16 KiB root page
  table, as mandated by the RISC-V privileged specification for
  Sv39x4/Sv48x4 modes.
- Add hgatp_from_page() to construct the hgatp register value from the
  allocated root page.
- Update p2m_init() to allocate the root table and initialize
  p2m->root and p2m->hgatp.
- Add maddr_to_page() and page_to_maddr() macros for easier address
  manipulation.
- Allocate root p2m table after p2m pool is initialized.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in v2:
 - This patch was created from "xen/riscv: introduce things necessary for p2m
   initialization" with the following changes:
   - [clear_and_clean_page()] Add missing call of clean_dcache_va_range().
   - Drop p2m_get_clean_page() as it is only going to be used once, to
     allocate the root page table. Open-code it explicitly in
     p2m_allocate_root(). This also helps avoid duplicating the code
     connected to the order and nr_pages of the p2m root page table.
   - Instead of using order 2 for alloc_domheap_pages(), use
     get_order_from_bytes(KB(16)).
   - Clear and clean a proper amount of allocated pages in p2m_allocate_root().
   - Drop _info from the function name hgatp_from_page_info() and its argument
     page_info.
   - Introduce HGATP_MODE_MASK and use MASK_INSR() instead of shift to calculate
     value of hgatp.
   - Drop unnecessary parentheses in definition of page_to_maddr().
   - Add support of VMID.
   - Drop TLB flushing in p2m_alloc_root_table() and do that once when VMID
     is re-used. [Look at p2m_alloc_vmid()]
   - Allocate the p2m root table after the p2m pool is fully initialized:
     first return pages to the p2m pool, then allocate the p2m root table.
---
 xen/arch/riscv/include/asm/mm.h             |  4 +
 xen/arch/riscv/include/asm/p2m.h            |  6 ++
 xen/arch/riscv/include/asm/riscv_encoding.h |  4 +
 xen/arch/riscv/p2m.c                        | 94 +++++++++++++++++++++
 4 files changed, 108 insertions(+)

diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index 01bbd92a06..912bc79e1b 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -149,6 +149,10 @@ extern struct page_info *frametable_virt_start;
 #define mfn_to_page(mfn)    (frametable_virt_start + mfn_x(mfn))
 #define page_to_mfn(pg)     _mfn((pg) - frametable_virt_start)
 
+/* Convert between machine addresses and page-info structures. */
+#define maddr_to_page(ma) mfn_to_page(maddr_to_mfn(ma))
+#define page_to_maddr(pg) mfn_to_maddr(page_to_mfn(pg))
+
 static inline void *page_to_virt(const struct page_info *pg)
 {
     return mfn_to_virt(mfn_x(page_to_mfn(pg)));
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index 9570eff014..a31b05bd50 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -26,6 +26,12 @@ struct p2m_domain {
     /* Pages used to construct the p2m */
     struct page_list_head pages;
 
+    /* The root of the p2m tree. May be concatenated */
+    struct page_info *root;
+
+    /* Address Translation Table for the p2m */
+    paddr_t hgatp;
+
     /* Indicate if it is required to clean the cache when writing an entry */
     bool clean_pte;
 
diff --git a/xen/arch/riscv/include/asm/riscv_encoding.h b/xen/arch/riscv/include/asm/riscv_encoding.h
index 6cc8f4eb45..a71b7546ef 100644
--- a/xen/arch/riscv/include/asm/riscv_encoding.h
+++ b/xen/arch/riscv/include/asm/riscv_encoding.h
@@ -133,11 +133,13 @@
 #define HGATP_MODE_SV48X4		_UL(9)
 
 #define HGATP32_MODE_SHIFT		31
+#define HGATP32_MODE_MASK		_UL(0x80000000)
 #define HGATP32_VMID_SHIFT		22
 #define HGATP32_VMID_MASK		_UL(0x1FC00000)
 #define HGATP32_PPN			_UL(0x003FFFFF)
 
 #define HGATP64_MODE_SHIFT		60
+#define HGATP64_MODE_MASK		_ULL(0xF000000000000000)
 #define HGATP64_VMID_SHIFT		44
 #define HGATP64_VMID_MASK		_ULL(0x03FFF00000000000)
 #define HGATP64_PPN			_ULL(0x00000FFFFFFFFFFF)
@@ -170,6 +172,7 @@
 #define HGATP_VMID_SHIFT		HGATP64_VMID_SHIFT
 #define HGATP_VMID_MASK			HGATP64_VMID_MASK
 #define HGATP_MODE_SHIFT		HGATP64_MODE_SHIFT
+#define HGATP_MODE_MASK			HGATP64_MODE_MASK
 #else
 #define MSTATUS_SD			MSTATUS32_SD
 #define SSTATUS_SD			SSTATUS32_SD
@@ -181,6 +184,7 @@
 #define HGATP_VMID_SHIFT		HGATP32_VMID_SHIFT
 #define HGATP_VMID_MASK			HGATP32_VMID_MASK
 #define HGATP_MODE_SHIFT		HGATP32_MODE_SHIFT
+#define HGATP_MODE_MASK			HGATP32_MODE_MASK
 #endif
 
 #define TOPI_IID_SHIFT			16
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index e409997499..2419a61d8c 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -41,6 +41,91 @@ void p2m_write_unlock(struct p2m_domain *p2m)
     write_unlock(&p2m->lock);
 }
 
+static void clear_and_clean_page(struct page_info *page)
+{
+    clean_dcache_va_range(page, PAGE_SIZE);
+    clear_domain_page(page_to_mfn(page));
+}
+
+static struct page_info *p2m_allocate_root(struct domain *d)
+{
+    struct page_info *page;
+    unsigned int order = get_order_from_bytes(KB(16));
+    unsigned int nr_pages = _AC(1,U) << order;
+
+    /* Return back nr_pages necessary for p2m root table. */
+
+    if ( ACCESS_ONCE(d->arch.paging.p2m_total_pages) < nr_pages )
+        panic("Specify more xen,domain-p2m-mem-mb\n");
+
+    for ( unsigned int i = 0; i < nr_pages; i++ )
+    {
+        /* Return memory to domheap. */
+        page = page_list_remove_head(&d->arch.paging.p2m_freelist);
+        if( page )
+        {
+            ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
+            free_domheap_page(page);
+        }
+        else
+        {
+            printk(XENLOG_ERR
+                   "Failed to free P2M pages, P2M freelist is empty.\n");
+            return NULL;
+        }
+    }
+
+    /* Allocate memory for p2m root table. */
+
+    /*
+     * As mentioned in the Priviliged Architecture Spec (version 20240411)
+     * As explained in Section 18.5.1, for the paged virtual-memory schemes
+     * (Sv32x4, Sv39x4, Sv48x4, and Sv57x4), the root page table is 16 KiB
+     * and must be aligned to a 16-KiB boundary.
+     */
+    page = alloc_domheap_pages(d, order, MEMF_no_owner);
+    if ( page == NULL )
+        return NULL;
+
+    for ( unsigned int i = 0; i < nr_pages; i++ )
+        clear_and_clean_page(page + i);
+
+    return page;
+}
+
+static unsigned long hgatp_from_page(struct p2m_domain *p2m)
+{
+    struct page_info *p2m_root_page = p2m->root;
+    unsigned long ppn;
+    unsigned long hgatp_mode;
+
+    ppn = PFN_DOWN(page_to_maddr(p2m_root_page)) & HGATP_PPN;
+
+#if RV_STAGE1_MODE == SATP_MODE_SV39
+    hgatp_mode = HGATP_MODE_SV39X4;
+#elif RV_STAGE1_MODE == SATP_MODE_SV48
+    hgatp_mode = HGATP_MODE_SV48X4;
+#else
+#   error "add HGATP_MODE"
+#endif
+
+    return ppn | MASK_INSR(p2m->vmid, HGATP_VMID_MASK) |
+           MASK_INSR(hgatp_mode, HGATP_MODE_MASK);
+}
+
+static int p2m_alloc_root_table(struct domain *d)
+{
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+    p2m->root = p2m_allocate_root(d);
+    if ( !p2m->root )
+        return -ENOMEM;
+
+    p2m->hgatp = hgatp_from_page(p2m);
+
+    return 0;
+}
+
 static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
 
 /*
@@ -228,5 +313,14 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
         }
     }
 
+    /*
+    * First, wait for the p2m pool to be initialized. Then allocate the root
+    * table so that the necessary pages can be returned from the p2m pool,
+    * since the root table must be allocated using alloc_domheap_pages(...)
+    * to meet its specific requirements.
+    */
+    if ( !d->arch.p2m.root )
+        p2m_alloc_root_table(d);
+
     return 0;
 }
-- 
2.49.0
Re: [PATCH v2 06/17] xen/riscv: add root page table allocation
Posted by Jan Beulich 4 months ago
On 10.06.2025 15:05, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -26,6 +26,12 @@ struct p2m_domain {
>      /* Pages used to construct the p2m */
>      struct page_list_head pages;
>  
> +    /* The root of the p2m tree. May be concatenated */
> +    struct page_info *root;
> +
> +    /* Address Translation Table for the p2m */
> +    paddr_t hgatp;

Does this really need holding in a struct field? Can't it be re-created at
any time from "root" above? And such re-creation is apparently infrequent,
if happening at all after initial allocation. (But of course I don't know
what future patches of yours will bring.) This is even more so if ...

> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
> @@ -133,11 +133,13 @@
>  #define HGATP_MODE_SV48X4		_UL(9)
>  
>  #define HGATP32_MODE_SHIFT		31
> +#define HGATP32_MODE_MASK		_UL(0x80000000)
>  #define HGATP32_VMID_SHIFT		22
>  #define HGATP32_VMID_MASK		_UL(0x1FC00000)
>  #define HGATP32_PPN			_UL(0x003FFFFF)
>  
>  #define HGATP64_MODE_SHIFT		60
> +#define HGATP64_MODE_MASK		_ULL(0xF000000000000000)
>  #define HGATP64_VMID_SHIFT		44
>  #define HGATP64_VMID_MASK		_ULL(0x03FFF00000000000)

... VMID management is going to change as previously discussed, at which
point the value to put in hgatp will need (partly) re-calculating at certain
points anyway.

> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -41,6 +41,91 @@ void p2m_write_unlock(struct p2m_domain *p2m)
>      write_unlock(&p2m->lock);
>  }
>  
> +static void clear_and_clean_page(struct page_info *page)
> +{
> +    clean_dcache_va_range(page, PAGE_SIZE);
> +    clear_domain_page(page_to_mfn(page));
> +}

A function of this name can, imo, only clear and then clean. Question is why
it's the other way around, and what the underlying requirement is for the
cleaning part to be there in the first place. Maybe that's obvious for a
RISC-V person, but it's entirely non-obvious to me (Arm being different in
this regard because of running with caches disabled at certain points in
time).

> +static struct page_info *p2m_allocate_root(struct domain *d)
> +{
> +    struct page_info *page;
> +    unsigned int order = get_order_from_bytes(KB(16));

While better than a hard-coded order of 2, this still is lacking. Is there
a reason there can't be a suitable manifest constant in the header?

> +    unsigned int nr_pages = _AC(1,U) << order;

Nit (style): Missing blank after comma.

> +    /* Return back nr_pages necessary for p2m root table. */
> +
> +    if ( ACCESS_ONCE(d->arch.paging.p2m_total_pages) < nr_pages )
> +        panic("Specify more xen,domain-p2m-mem-mb\n");

You shouldn't panic() in anything involved in domain creation. You want to
return NULL in this case.

Further, to me the use of "more" looks misleading here. Do you perhaps mean
"larger" or "bigger"?

This also looks to be happening without any lock held. If that's intentional,
I think the "why" wants clarifying in a code comment.

> +    for ( unsigned int i = 0; i < nr_pages; i++ )
> +    {
> +        /* Return memory to domheap. */
> +        page = page_list_remove_head(&d->arch.paging.p2m_freelist);
> +        if( page )
> +        {
> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
> +            free_domheap_page(page);
> +        }
> +        else
> +        {
> +            printk(XENLOG_ERR
> +                   "Failed to free P2M pages, P2M freelist is empty.\n");
> +            return NULL;
> +        }
> +    }

The reason for doing this may also want to be put in a comment.

> +    /* Allocate memory for p2m root table. */
> +
> +    /*
> +     * As mentioned in the Priviliged Architecture Spec (version 20240411)
> +     * As explained in Section 18.5.1, for the paged virtual-memory schemes

The first sentence doesn't finish before the 2nd starts. Is there a piece missing?
Do the two sentences want to be joined together?

> +static unsigned long hgatp_from_page(struct p2m_domain *p2m)

Function name and parameter type/name don't fit together.

> +{
> +    struct page_info *p2m_root_page = p2m->root;

As always: pointer-to-const wherever possible, please. But: Is this local
variable really useful to have?

> +    unsigned long ppn;
> +    unsigned long hgatp_mode;
> +
> +    ppn = PFN_DOWN(page_to_maddr(p2m_root_page)) & HGATP_PPN;
> +
> +#if RV_STAGE1_MODE == SATP_MODE_SV39
> +    hgatp_mode = HGATP_MODE_SV39X4;
> +#elif RV_STAGE1_MODE == SATP_MODE_SV48
> +    hgatp_mode = HGATP_MODE_SV48X4;
> +#else
> +#   error "add HGATP_MODE"
> +#endif
> +
> +    return ppn | MASK_INSR(p2m->vmid, HGATP_VMID_MASK) |
> +           MASK_INSR(hgatp_mode, HGATP_MODE_MASK);
> +}
> +
> +static int p2m_alloc_root_table(struct domain *d)

As indicated earlier, in a wider context - this is a good candidate where
the caller rather wants to pass struct p2m_domain *. Once you get variations
on P2Ms (like x86'es altp2m or nestedp2m, the domain won't be meaningful
here to know which P2M to allocate the root for.

> +{
> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +
> +    p2m->root = p2m_allocate_root(d);
> +    if ( !p2m->root )
> +        return -ENOMEM;
> +
> +    p2m->hgatp = hgatp_from_page(p2m);
> +
> +    return 0;
> +}
> +
>  static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>  
>  /*
> @@ -228,5 +313,14 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>          }
>      }
>  
> +    /*
> +    * First, wait for the p2m pool to be initialized. Then allocate the root

Why "wait"? There's waiting here.

> +    * table so that the necessary pages can be returned from the p2m pool,
> +    * since the root table must be allocated using alloc_domheap_pages(...)
> +    * to meet its specific requirements.
> +    */
> +    if ( !d->arch.p2m.root )

Aren't you open-coding p2m_get_hostp2m() here?

Jan

> +        p2m_alloc_root_table(d);
> +
>      return 0;
>  }
Re: [PATCH v2 06/17] xen/riscv: add root page table allocation
Posted by Oleksii Kurochko 4 months ago
On 6/30/25 5:22 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> --- a/xen/arch/riscv/include/asm/p2m.h
>> +++ b/xen/arch/riscv/include/asm/p2m.h
>> @@ -26,6 +26,12 @@ struct p2m_domain {
>>       /* Pages used to construct the p2m */
>>       struct page_list_head pages;
>>   
>> +    /* The root of the p2m tree. May be concatenated */
>> +    struct page_info *root;
>> +
>> +    /* Address Translation Table for the p2m */
>> +    paddr_t hgatp;
> Does this really need holding in a struct field? Can't it be re-created at
> any time from "root" above?

Yes, with the current implementation, I agree root alone would be enough.
But as you noticed below...

> And such re-creation is apparently infrequent,
> if happening at all after initial allocation. (But of course I don't know
> what future patches of yours will bring.) This is even more so if ...
>
>> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
>> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
>> @@ -133,11 +133,13 @@
>>   #define HGATP_MODE_SV48X4		_UL(9)
>>   
>>   #define HGATP32_MODE_SHIFT		31
>> +#define HGATP32_MODE_MASK		_UL(0x80000000)
>>   #define HGATP32_VMID_SHIFT		22
>>   #define HGATP32_VMID_MASK		_UL(0x1FC00000)
>>   #define HGATP32_PPN			_UL(0x003FFFFF)
>>   
>>   #define HGATP64_MODE_SHIFT		60
>> +#define HGATP64_MODE_MASK		_ULL(0xF000000000000000)
>>   #define HGATP64_VMID_SHIFT		44
>>   #define HGATP64_VMID_MASK		_ULL(0x03FFF00000000000)
> ... VMID management is going to change as previously discussed, at which
> point the value to put in hgatp will need (partly) re-calculating at certain
> points anyway.

... after VMID management is changed to a per-CPU basis, hgatp will need to
be re-calculated each time the vCPU running on a pCPU changes.
In this case I prefer to have a partially calculated 'hgatp'.

>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -41,6 +41,91 @@ void p2m_write_unlock(struct p2m_domain *p2m)
>>       write_unlock(&p2m->lock);
>>   }
>>   
>> +static void clear_and_clean_page(struct page_info *page)
>> +{
>> +    clean_dcache_va_range(page, PAGE_SIZE);
>> +    clear_domain_page(page_to_mfn(page));
>> +}
> A function of this name can, imo, only clear and then clean. Question is why
> it's the other way around, and what the underlying requirement is for the
> cleaning part to be there in the first place. Maybe that's obvious for a
> RISC-V person, but it's entirely non-obvious to me (Arm being different in
> this regard because of running with caches disabled at certain points in
> time).

You're right, the current name clear_and_clean_page() implies that clearing
should come before cleaning, which contradicts the current implementation.
The intent here is to ensure that the page contents are consistent in RAM
(not just in the cache) before use by other entities (guests or devices).

The clean must follow the clear, so yes, the order needs to be reversed.
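
Roughly like this (a minimal sketch of the reversed order; I'm also assuming
the directmap address from page_to_virt() is what wants cleaning, rather
than the struct page_info itself):

    static void clear_and_clean_page(struct page_info *page)
    {
        /* Zero the page first ... */
        clear_domain_page(page_to_mfn(page));

        /* ... then write the zeroes back to RAM so other agents see them. */
        clean_dcache_va_range(page_to_virt(page), PAGE_SIZE);
    }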

>
>> +static struct page_info *p2m_allocate_root(struct domain *d)
>> +{
>> +    struct page_info *page;
>> +    unsigned int order = get_order_from_bytes(KB(16));
> While better than a hard-coded order of 2, this still is lacking. Is there
> a reason there can't be a suitable manifest constant in the header?

No specific reason, I just decided not to introduce a new definition as
it is going to be used only inside this function.

I think it will make sense to have in p2m.c:
  #define P2M_ROOT_PT_SIZE KB(16)
If that isn't the best option, then what about moving this definition
to config.h or asm/p2m.h?

>
>> +    unsigned int nr_pages = _AC(1,U) << order;
> Nit (style): Missing blank after comma.

I've changed that to BIT(order, U).
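
With the constant proposed above, the two lines would then read (just a
sketch):

    unsigned int order = get_order_from_bytes(P2M_ROOT_PT_SIZE);
    unsigned int nr_pages = BIT(order, U);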

>
>> +    /* Return back nr_pages necessary for p2m root table. */
>> +
>> +    if ( ACCESS_ONCE(d->arch.paging.p2m_total_pages) < nr_pages )
>> +        panic("Specify more xen,domain-p2m-mem-mb\n");
> You shouldn't panic() in anything involved in domain creation. You want to
> return NULL in this case.

It makes sense in this case just to return NULL.

>
> Further, to me the use of "more" looks misleading here. Do you perhaps mean
> "larger" or "bigger"?
>
> This also looks to be happening without any lock held. If that's intentional,
> I think the "why" wants clarifying in a code comment.

Agreed, returning the pages necessary for the p2m root table should be done
under spin_lock(&d->arch.paging.lock).
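
I.e. something like this (a sketch only):

    spin_lock(&d->arch.paging.lock);

    for ( unsigned int i = 0; i < nr_pages; i++ )
    {
        /*
         * Return memory to the domheap, keeping the root table accounted
         * against the P2M pool.
         */
        page = page_list_remove_head(&d->arch.paging.p2m_freelist);
        if ( !page )
        {
            spin_unlock(&d->arch.paging.lock);
            printk(XENLOG_ERR
                   "Failed to free P2M pages, P2M freelist is empty.\n");
            return NULL;
        }

        ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
        free_domheap_page(page);
    }

    spin_unlock(&d->arch.paging.lock);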

>
>> +    for ( unsigned int i = 0; i < nr_pages; i++ )
>> +    {
>> +        /* Return memory to domheap. */
>> +        page = page_list_remove_head(&d->arch.paging.p2m_freelist);
>> +        if( page )
>> +        {
>> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
>> +            free_domheap_page(page);
>> +        }
>> +        else
>> +        {
>> +            printk(XENLOG_ERR
>> +                   "Failed to free P2M pages, P2M freelist is empty.\n");
>> +            return NULL;
>> +        }
>> +    }
> The reason for doing this may also want to be put in a comment.

I thought the comment above would be enough: /* Return back nr_pages necessary for p2m root table. */

>
>> +    /* Allocate memory for p2m root table. */
>> +
>> +    /*
>> +     * As mentioned in the Priviliged Architecture Spec (version 20240411)
>> +     * As explained in Section 18.5.1, for the paged virtual-memory schemes
> The first sentence didn't finish when the 2nd starts. Is there a piece missing?
> Do the two sentences want to be joined together?

Nothing is missing, just bad wording. I will update it to:
   As mentioned in the Privileged Architecture Spec (version 20240411) in Section 18.5.1, ...

>
>> +static unsigned long hgatp_from_page(struct p2m_domain *p2m)
> Function name and parameter type/name don't fit together.

I'll update the argument to struct page_info *root.

>
>> +{
>> +    struct page_info *p2m_root_page = p2m->root;
> As always: pointer-to-const wherever possible, please. But: Is this local
> variable really useful to have?

No, it will just be passed as an argument.

>
>> +    unsigned long ppn;
>> +    unsigned long hgatp_mode;
>> +
>> +    ppn = PFN_DOWN(page_to_maddr(p2m_root_page)) & HGATP_PPN;
>> +
>> +#if RV_STAGE1_MODE == SATP_MODE_SV39
>> +    hgatp_mode = HGATP_MODE_SV39X4;
>> +#elif RV_STAGE1_MODE == SATP_MODE_SV48
>> +    hgatp_mode = HGATP_MODE_SV48X4;
>> +#else
>> +#   error "add HGATP_MODE"
>> +#endif
>> +
>> +    return ppn | MASK_INSR(p2m->vmid, HGATP_VMID_MASK) |
>> +           MASK_INSR(hgatp_mode, HGATP_MODE_MASK);
>> +}
>> +
>> +static int p2m_alloc_root_table(struct domain *d)
> As indicated earlier, in a wider context - this is a good candidate where
> the caller rather wants to pass struct p2m_domain *. Once you get variations
> on P2Ms (like x86'es altp2m or nestedp2m, the domain won't be meaningful
> here to know which P2M to allocate the root for.

Good point. I will re-work that.

>
>> +{
>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +
>> +    p2m->root = p2m_allocate_root(d);
>> +    if ( !p2m->root )
>> +        return -ENOMEM;
>> +
>> +    p2m->hgatp = hgatp_from_page(p2m);
>> +
>> +    return 0;
>> +}
>> +
>>   static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>>   
>>   /*
>> @@ -228,5 +313,14 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>>           }
>>       }
>>   
>> +    /*
>> +    * First, wait for the p2m pool to be initialized. Then allocate the root
> Why "wait"? There's waiting here.

I don't really get your question.

"wait" here refers to the initialization of the pool, which happens above this comment.

>
>> +    * table so that the necessary pages can be returned from the p2m pool,
>> +    * since the root table must be allocated using alloc_domheap_pages(...)
>> +    * to meet its specific requirements.
>> +    */
>> +    if ( !d->arch.p2m.root )
> Aren't you open-coding p2m_get_hostp2m() here?

Yes, p2m_get_hostp2m() should be used here.
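
I.e. (sketch):

    if ( !p2m_get_hostp2m(d)->root )
        p2m_alloc_root_table(d);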

~ Oleksii
Re: [PATCH v2 06/17] xen/riscv: add root page table allocation
Posted by Jan Beulich 4 months ago
On 30.06.2025 18:18, Oleksii Kurochko wrote:
> On 6/30/25 5:22 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>> @@ -26,6 +26,12 @@ struct p2m_domain {
>>>       /* Pages used to construct the p2m */
>>>       struct page_list_head pages;
>>>   
>>> +    /* The root of the p2m tree. May be concatenated */
>>> +    struct page_info *root;
>>> +
>>> +    /* Address Translation Table for the p2m */
>>> +    paddr_t hgatp;
>> Does this really need holding in a struct field? Can't it be re-created at
>> any time from "root" above?
> 
> Yes, with the current one implementation, I agree it would be enough only
> root. But as you noticed below...
> 
>> And such re-creation is apparently infrequent,
>> if happening at all after initial allocation. (But of course I don't know
>> what future patches of yours will bring.) This is even more so if ...
>>
>>> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
>>> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
>>> @@ -133,11 +133,13 @@
>>>   #define HGATP_MODE_SV48X4		_UL(9)
>>>   
>>>   #define HGATP32_MODE_SHIFT		31
>>> +#define HGATP32_MODE_MASK		_UL(0x80000000)
>>>   #define HGATP32_VMID_SHIFT		22
>>>   #define HGATP32_VMID_MASK		_UL(0x1FC00000)
>>>   #define HGATP32_PPN			_UL(0x003FFFFF)
>>>   
>>>   #define HGATP64_MODE_SHIFT		60
>>> +#define HGATP64_MODE_MASK		_ULL(0xF000000000000000)
>>>   #define HGATP64_VMID_SHIFT		44
>>>   #define HGATP64_VMID_MASK		_ULL(0x03FFF00000000000)
>> ... VMID management is going to change as previously discussed, at which
>> point the value to put in hgatp will need (partly) re-calculating at certain
>> points anyway.
> 
> ... after VMID management will changed to per-CPU base then it will be needed
> to update re-calculate hgatp each time vCPU on pCPU is changed.
> In this case I prefer to have partially calculated 'hgatp'.

But why, when you need to do some recalculation anyway?

>>> --- a/xen/arch/riscv/p2m.c
>>> +++ b/xen/arch/riscv/p2m.c
>>> @@ -41,6 +41,91 @@ void p2m_write_unlock(struct p2m_domain *p2m)
>>>       write_unlock(&p2m->lock);
>>>   }
>>>   
>>> +static void clear_and_clean_page(struct page_info *page)
>>> +{
>>> +    clean_dcache_va_range(page, PAGE_SIZE);
>>> +    clear_domain_page(page_to_mfn(page));
>>> +}
>> A function of this name can, imo, only clear and then clean. Question is why
>> it's the other way around, and what the underlying requirement is for the
>> cleaning part to be there in the first place. Maybe that's obvious for a
>> RISC-V person, but it's entirely non-obvious to me (Arm being different in
>> this regard because of running with caches disabled at certain points in
>> time).
> 
> You're right, the current name clear_and_clean_page() implies that clearing
> should come before cleaning, which contradicts the current implementation.
> The intent here is to ensure that the page contents are consistent in RAM
> (not just in cache) before use by other entities (guests or devices).
> 
> The clean must follow the clear — so yes, the order needs to be reversed.

What you don't address though - why's the cleaning needed in the first place?

>>> +static struct page_info *p2m_allocate_root(struct domain *d)
>>> +{
>>> +    struct page_info *page;
>>> +    unsigned int order = get_order_from_bytes(KB(16));
>> While better than a hard-coded order of 2, this still is lacking. Is there
>> a reason there can't be a suitable manifest constant in the header?
> 
> No any specific reason, I just decided not to introduce new definition as
> it is going to be used only inside this function.
> 
> I think it will make sense to have in p2m.c:
>   #define P2M_ROOT_PT_SIZE KB(16)
> If it isn't the best one option, then what about to move this defintion
> to config.h or asm/p2m.h.

It's defined by the hardware, so neither of the two headers looks to be a
good fit. Nor is the P2M_ prefix really in line with this being hardware-
defined. page.h has various paging-related hw definitions, and
riscv_encoding.h may also be a suitable place. There may be other candidates
that I'm presently overlooking.

>>> +    unsigned int nr_pages = _AC(1,U) << order;
>> Nit (style): Missing blank after comma.
> 
> I've changed that to BIT(order, U)
> 
>>
>>> +    /* Return back nr_pages necessary for p2m root table. */
>>> +
>>> +    if ( ACCESS_ONCE(d->arch.paging.p2m_total_pages) < nr_pages )
>>> +        panic("Specify more xen,domain-p2m-mem-mb\n");
>> You shouldn't panic() in anything involved in domain creation. You want to
>> return NULL in this case.
> 
> It makes sense in this case just to return NULL.
> 
>>
>> Further, to me the use of "more" looks misleading here. Do you perhaps mean
>> "larger" or "bigger"?
>>
>> This also looks to be happening without any lock held. If that's intentional,
>> I think the "why" wants clarifying in a code comment.
> 
> Agree, returning back pages necessary for p2m root table should be done under
> spin_lock(&d->arch.paging.lock).

Which should be acquired at the paging_*() layer then, not at the p2m_*() layer.
(As long as you mean to have that separation, that is. See the earlier discussion
on that matter.)

>>> +    for ( unsigned int i = 0; i < nr_pages; i++ )
>>> +    {
>>> +        /* Return memory to domheap. */
>>> +        page = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>> +        if( page )
>>> +        {
>>> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
>>> +            free_domheap_page(page);
>>> +        }
>>> +        else
>>> +        {
>>> +            printk(XENLOG_ERR
>>> +                   "Failed to free P2M pages, P2M freelist is empty.\n");
>>> +            return NULL;
>>> +        }
>>> +    }
>> The reason for doing this may also want to be put in a comment.
> 
> I thought it would be enough the comment above: /* Return back nr_pages necessary for p2m root table. */

That describes what the code does, but not why.

>>> +{
>>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>> +
>>> +    p2m->root = p2m_allocate_root(d);
>>> +    if ( !p2m->root )
>>> +        return -ENOMEM;
>>> +
>>> +    p2m->hgatp = hgatp_from_page(p2m);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>   static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>>>   
>>>   /*
>>> @@ -228,5 +313,14 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>>>           }
>>>       }
>>>   
>>> +    /*
>>> +    * First, wait for the p2m pool to be initialized. Then allocate the root
>> Why "wait"? There's waiting here.
> 
> I am not really get your question.
> 
> "wait" here is about the initialization of the pool which happens above this comment.

But there's no "waiting" involved. What you talk about is one thing needing to
happen after the other.

Jan

Re: [PATCH v2 06/17] xen/riscv: add root page table allocation
Posted by Oleksii Kurochko 4 months ago
On 7/1/25 8:29 AM, Jan Beulich wrote:
> On 30.06.2025 18:18, Oleksii Kurochko wrote:
>> On 6/30/25 5:22 PM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>>> @@ -26,6 +26,12 @@ struct p2m_domain {
>>>>        /* Pages used to construct the p2m */
>>>>        struct page_list_head pages;
>>>>    
>>>> +    /* The root of the p2m tree. May be concatenated */
>>>> +    struct page_info *root;
>>>> +
>>>> +    /* Address Translation Table for the p2m */
>>>> +    paddr_t hgatp;
>>> Does this really need holding in a struct field? Can't it be re-created at
>>> any time from "root" above?
>> Yes, with the current one implementation, I agree it would be enough only
>> root. But as you noticed below...
>>
>>> And such re-creation is apparently infrequent,
>>> if happening at all after initial allocation. (But of course I don't know
>>> what future patches of yours will bring.) This is even more so if ...
>>>
>>>> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
>>>> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
>>>> @@ -133,11 +133,13 @@
>>>>    #define HGATP_MODE_SV48X4		_UL(9)
>>>>    
>>>>    #define HGATP32_MODE_SHIFT		31
>>>> +#define HGATP32_MODE_MASK		_UL(0x80000000)
>>>>    #define HGATP32_VMID_SHIFT		22
>>>>    #define HGATP32_VMID_MASK		_UL(0x1FC00000)
>>>>    #define HGATP32_PPN			_UL(0x003FFFFF)
>>>>    
>>>>    #define HGATP64_MODE_SHIFT		60
>>>> +#define HGATP64_MODE_MASK		_ULL(0xF000000000000000)
>>>>    #define HGATP64_VMID_SHIFT		44
>>>>    #define HGATP64_VMID_MASK		_ULL(0x03FFF00000000000)
>>> ... VMID management is going to change as previously discussed, at which
>>> point the value to put in hgatp will need (partly) re-calculating at certain
>>> points anyway.
>> ... after VMID management will changed to per-CPU base then it will be needed
>> to update re-calculate hgatp each time vCPU on pCPU is changed.
>> In this case I prefer to have partially calculated 'hgatp'.
> But why, when you need to do some recalculation anyway?

Fewer operations will be needed.
If we have a partially prepared 'hgatp', we only have to update the VMID bits
instead of getting the ppn for the page and calculating hgatp_mode each time.
But if you think it isn't really needed, I can add a vmid argument to hgatp_from_page()
and just call this function whenever an update of hgatp is needed.

>
>>>> --- a/xen/arch/riscv/p2m.c
>>>> +++ b/xen/arch/riscv/p2m.c
>>>> @@ -41,6 +41,91 @@ void p2m_write_unlock(struct p2m_domain *p2m)
>>>>        write_unlock(&p2m->lock);
>>>>    }
>>>>    
>>>> +static void clear_and_clean_page(struct page_info *page)
>>>> +{
>>>> +    clean_dcache_va_range(page, PAGE_SIZE);
>>>> +    clear_domain_page(page_to_mfn(page));
>>>> +}
>>> A function of this name can, imo, only clear and then clean. Question is why
>>> it's the other way around, and what the underlying requirement is for the
>>> cleaning part to be there in the first place. Maybe that's obvious for a
>>> RISC-V person, but it's entirely non-obvious to me (Arm being different in
>>> this regard because of running with caches disabled at certain points in
>>> time).
>> You're right, the current name clear_and_clean_page() implies that clearing
>> should come before cleaning, which contradicts the current implementation.
>> The intent here is to ensure that the page contents are consistent in RAM
>> (not just in cache) before use by other entities (guests or devices).
>>
>> The clean must follow the clear — so yes, the order needs to be reversed.
> What you don't address though - why's the cleaning needed in the first place?

If we clean the data cache first, we flush the d-cache and then use the page to
perform the clear operation. As a result, the "cleared" value will be written into
the d-cache. To avoid polluting the d-cache with the "cleared" value, the correct
sequence is to clear the page first, then clean the data cache.

>>>> +    unsigned int nr_pages = _AC(1,U) << order;
>>> Nit (style): Missing blank after comma.
>> I've changed that to BIT(order, U)
>>
>>>> +    /* Return back nr_pages necessary for p2m root table. */
>>>> +
>>>> +    if ( ACCESS_ONCE(d->arch.paging.p2m_total_pages) < nr_pages )
>>>> +        panic("Specify more xen,domain-p2m-mem-mb\n");
>>> You shouldn't panic() in anything involved in domain creation. You want to
>>> return NULL in this case.
>> It makes sense in this case just to return NULL.
>>
>>> Further, to me the use of "more" looks misleading here. Do you perhaps mean
>>> "larger" or "bigger"?
>>>
>>> This also looks to be happening without any lock held. If that's intentional,
>>> I think the "why" wants clarifying in a code comment.
>> Agree, returning back pages necessary for p2m root table should be done under
>> spin_lock(&d->arch.paging.lock).
> Which should be acquired at the paging_*() layer then, not at the p2m_*() layer.
> (As long as you mean to have that separation, that is. See the earlier discussion
> on that matter.)

Then part of p2m_set_allocation() should be moved to paging_*() too.

>>>> +    for ( unsigned int i = 0; i < nr_pages; i++ )
>>>> +    {
>>>> +        /* Return memory to domheap. */
>>>> +        page = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>>> +        if( page )
>>>> +        {
>>>> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
>>>> +            free_domheap_page(page);
>>>> +        }
>>>> +        else
>>>> +        {
>>>> +            printk(XENLOG_ERR
>>>> +                   "Failed to free P2M pages, P2M freelist is empty.\n");
>>>> +            return NULL;
>>>> +        }
>>>> +    }
>>> The reason for doing this may also want to be put in a comment.
>> I thought it would be enough the comment above: /* Return back nr_pages necessary for p2m root table. */
> That describes what the code does, but not why.

I will add to the comment: "... to get the memory accounting right".

>
>>>> +{
>>>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>>> +
>>>> +    p2m->root = p2m_allocate_root(d);
>>>> +    if ( !p2m->root )
>>>> +        return -ENOMEM;
>>>> +
>>>> +    p2m->hgatp = hgatp_from_page(p2m);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>>    static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>>>>    
>>>>    /*
>>>> @@ -228,5 +313,14 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>>>>            }
>>>>        }
>>>>    
>>>> +    /*
>>>> +    * First, wait for the p2m pool to be initialized. Then allocate the root
>>> Why "wait"? There's waiting here.
>> I am not really get your question.
>>
>> "wait" here is about the initialization of the pool which happens above this comment.
> But there's no "waiting" involved. What you talk about is one thing needing to
> happen after the other.

Okay, then I will just reword the comment.

Thanks.

~ Oleksii
Re: [PATCH v2 06/17] xen/riscv: add root page table allocation
Posted by Jan Beulich 4 months ago
On 01.07.2025 11:44, Oleksii Kurochko wrote:
> On 7/1/25 8:29 AM, Jan Beulich wrote:
>> On 30.06.2025 18:18, Oleksii Kurochko wrote:
>>> On 6/30/25 5:22 PM, Jan Beulich wrote:
>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>>>> @@ -26,6 +26,12 @@ struct p2m_domain {
>>>>>        /* Pages used to construct the p2m */
>>>>>        struct page_list_head pages;
>>>>>    
>>>>> +    /* The root of the p2m tree. May be concatenated */
>>>>> +    struct page_info *root;
>>>>> +
>>>>> +    /* Address Translation Table for the p2m */
>>>>> +    paddr_t hgatp;
>>>> Does this really need holding in a struct field? Can't it be re-created at
>>>> any time from "root" above?
>>> Yes, with the current one implementation, I agree it would be enough only
>>> root. But as you noticed below...
>>>
>>>> And such re-creation is apparently infrequent,
>>>> if happening at all after initial allocation. (But of course I don't know
>>>> what future patches of yours will bring.) This is even more so if ...
>>>>
>>>>> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
>>>>> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
>>>>> @@ -133,11 +133,13 @@
>>>>>    #define HGATP_MODE_SV48X4		_UL(9)
>>>>>    
>>>>>    #define HGATP32_MODE_SHIFT		31
>>>>> +#define HGATP32_MODE_MASK		_UL(0x80000000)
>>>>>    #define HGATP32_VMID_SHIFT		22
>>>>>    #define HGATP32_VMID_MASK		_UL(0x1FC00000)
>>>>>    #define HGATP32_PPN			_UL(0x003FFFFF)
>>>>>    
>>>>>    #define HGATP64_MODE_SHIFT		60
>>>>> +#define HGATP64_MODE_MASK		_ULL(0xF000000000000000)
>>>>>    #define HGATP64_VMID_SHIFT		44
>>>>>    #define HGATP64_VMID_MASK		_ULL(0x03FFF00000000000)
>>>> ... VMID management is going to change as previously discussed, at which
>>>> point the value to put in hgatp will need (partly) re-calculating at certain
>>>> points anyway.
>>> ... after VMID management will changed to per-CPU base then it will be needed
>>> to update re-calculate hgatp each time vCPU on pCPU is changed.
>>> In this case I prefer to have partially calculated 'hgatp'.
>> But why, when you need to do some recalculation anyway?
> 
> Less operations will be needed to do.

Right; I wonder how big the savings would be.

> If we have partially prepared 'hgatp' then we have to only update VMID bits
> instead of getting ppn for page, then calculate hgatp_mode each time.
> But if you think it isn't really needed I can add vmid argument for hgatp_from_page()
> and just call this function when an update of hgatp is needed.

I think it'll need to be struct p2m_domain * that you (also?) pass in. In the
longer run I think you will want to support all three permitted modes, with
smaller guests using fewer page table levels.

As to "also" - maybe it's better to change the name of the function, and pass
in just (const if possible) struct p2m_domain *.
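
Something along these lines perhaps (name invented, just to illustrate):

    static unsigned long hgatp_from_p2m(const struct p2m_domain *p2m)
    {
        unsigned long ppn = PFN_DOWN(page_to_maddr(p2m->root)) & HGATP_PPN;
        /* Fixed at build time for now; eventually derived from the P2M. */
        unsigned long mode = HGATP_MODE_SV39X4;

        return ppn | MASK_INSR(p2m->vmid, HGATP_VMID_MASK) |
               MASK_INSR(mode, HGATP_MODE_MASK);
    }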

>>>>> --- a/xen/arch/riscv/p2m.c
>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>> @@ -41,6 +41,91 @@ void p2m_write_unlock(struct p2m_domain *p2m)
>>>>>        write_unlock(&p2m->lock);
>>>>>    }
>>>>>    
>>>>> +static void clear_and_clean_page(struct page_info *page)
>>>>> +{
>>>>> +    clean_dcache_va_range(page, PAGE_SIZE);
>>>>> +    clear_domain_page(page_to_mfn(page));
>>>>> +}
>>>> A function of this name can, imo, only clear and then clean. Question is why
>>>> it's the other way around, and what the underlying requirement is for the
>>>> cleaning part to be there in the first place. Maybe that's obvious for a
>>>> RISC-V person, but it's entirely non-obvious to me (Arm being different in
>>>> this regard because of running with caches disabled at certain points in
>>>> time).
>>> You're right, the current name clear_and_clean_page() implies that clearing
>>> should come before cleaning, which contradicts the current implementation.
>>> The intent here is to ensure that the page contents are consistent in RAM
>>> (not just in cache) before use by other entities (guests or devices).
>>>
>>> The clean must follow the clear — so yes, the order needs to be reversed.
>> What you don't address though - why's the cleaning needed in the first place?
> 
> If we clean the data cache first, we flush the d-cache and then use the page to
> perform the clear operation. As a result, the "cleared" value will be written into
> the d-cache. To avoid polluting the d-cache with the "cleared" value, the correct
> sequence is to clear the page first, then clean the data cache.

If you want to avoid cache pollution, I think you'd need to use a form of stores
which simply bypass the cache. Yet then - why would this matter here, but not
elsewhere? Wouldn't you better leave such to the hardware, unless you can prove
a (meaningful) performance gain?

>>>>> +    unsigned int nr_pages = _AC(1,U) << order;
>>>> Nit (style): Missing blank after comma.
>>> I've changed that to BIT(order, U)
>>>
>>>>> +    /* Return back nr_pages necessary for p2m root table. */
>>>>> +
>>>>> +    if ( ACCESS_ONCE(d->arch.paging.p2m_total_pages) < nr_pages )
>>>>> +        panic("Specify more xen,domain-p2m-mem-mb\n");
>>>> You shouldn't panic() in anything involved in domain creation. You want to
>>>> return NULL in this case.
>>> It makes sense in this case just to return NULL.
>>>
>>>> Further, to me the use of "more" looks misleading here. Do you perhaps mean
>>>> "larger" or "bigger"?
>>>>
>>>> This also looks to be happening without any lock held. If that's intentional,
>>>> I think the "why" wants clarifying in a code comment.
>>> Agree, returning back pages necessary for p2m root table should be done under
>>> spin_lock(&d->arch.paging.lock).
>> Which should be acquired at the paging_*() layer then, not at the p2m_*() layer.
>> (As long as you mean to have that separation, that is. See the earlier discussion
>> on that matter.)
> 
> Then partly p2m_set_allocation() should be moved to paging_*() too.

Not exactly sure what you mean. On x86 at least the paging layer part of
the function is pretty slim.

>>>>> +    for ( unsigned int i = 0; i < nr_pages; i++ )
>>>>> +    {
>>>>> +        /* Return memory to domheap. */
>>>>> +        page = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>>>> +        if( page )
>>>>> +        {
>>>>> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
>>>>> +            free_domheap_page(page);
>>>>> +        }
>>>>> +        else
>>>>> +        {
>>>>> +            printk(XENLOG_ERR
>>>>> +                   "Failed to free P2M pages, P2M freelist is empty.\n");
>>>>> +            return NULL;
>>>>> +        }
>>>>> +    }
>>>> The reason for doing this may also want to be put in a comment.
>>> I thought it would be enough the comment above: /* Return back nr_pages necessary for p2m root table. */
>> That describes what the code does, but not why.
> 
> I will add to the comment: "... to get the memory accounting right".

I'm sorry to be picky, but what is "right"? You want to assure the root table
memory is also accounted against the P2M pool of the domain. Can't you say
exactly that?

Jan

Re: [PATCH v2 06/17] xen/riscv: add root page table allocation
Posted by Oleksii Kurochko 4 months ago
On 7/1/25 12:27 PM, Jan Beulich wrote:
> On 01.07.2025 11:44, Oleksii Kurochko wrote:
>> On 7/1/25 8:29 AM, Jan Beulich wrote:
>>> On 30.06.2025 18:18, Oleksii Kurochko wrote:
>>>> On 6/30/25 5:22 PM, Jan Beulich wrote:
>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>>>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>>>>> @@ -26,6 +26,12 @@ struct p2m_domain {
>>>>>>         /* Pages used to construct the p2m */
>>>>>>         struct page_list_head pages;
>>>>>>     
>>>>>> +    /* The root of the p2m tree. May be concatenated */
>>>>>> +    struct page_info *root;
>>>>>> +
>>>>>> +    /* Address Translation Table for the p2m */
>>>>>> +    paddr_t hgatp;
>>>>> Does this really need holding in a struct field? Can't it be re-created at
>>>>> any time from "root" above?
>>>> Yes, with the current one implementation, I agree it would be enough only
>>>> root. But as you noticed below...
>>>>
>>>>> And such re-creation is apparently infrequent,
>>>>> if happening at all after initial allocation. (But of course I don't know
>>>>> what future patches of yours will bring.) This is even more so if ...
>>>>>
>>>>>> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
>>>>>> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
>>>>>> @@ -133,11 +133,13 @@
>>>>>>     #define HGATP_MODE_SV48X4		_UL(9)
>>>>>>     
>>>>>>     #define HGATP32_MODE_SHIFT		31
>>>>>> +#define HGATP32_MODE_MASK		_UL(0x80000000)
>>>>>>     #define HGATP32_VMID_SHIFT		22
>>>>>>     #define HGATP32_VMID_MASK		_UL(0x1FC00000)
>>>>>>     #define HGATP32_PPN			_UL(0x003FFFFF)
>>>>>>     
>>>>>>     #define HGATP64_MODE_SHIFT		60
>>>>>> +#define HGATP64_MODE_MASK		_ULL(0xF000000000000000)
>>>>>>     #define HGATP64_VMID_SHIFT		44
>>>>>>     #define HGATP64_VMID_MASK		_ULL(0x03FFF00000000000)
>>>>> ... VMID management is going to change as previously discussed, at which
>>>>> point the value to put in hgatp will need (partly) re-calculating at certain
>>>>> points anyway.
>>>> ... after VMID management will changed to per-CPU base then it will be needed
>>>> to update re-calculate hgatp each time vCPU on pCPU is changed.
>>>> In this case I prefer to have partially calculated 'hgatp'.
>>> But why, when you need to do some recalculation anyway?
>> Less operations will be needed to do.
> Right; I wonder how big the savings would be.

Probably not big.

>
>> If we have partially prepared 'hgatp' then we have to only update VMID bits
>> instead of getting ppn for page, then calculate hgatp_mode each time.
>> But if you think it isn't really needed I can add vmid argument for hgatp_from_page()
>> and just call this function when an update of hgatp is needed.
> I think it'll need to be struct p2m_domain * that you (also?) pass in. In the
> longer run I think you will want to support all three permitted modes, with
> smaller guests using fewer page table levels.

Yes, but these modes will be constant for a domain, I guess. I mean that once a mode has
been set, it isn't going to change. But the VMID is going to change each time one vCPU
gives control to another vCPU.
Anyway, I am okay with making the update of hgatp more generic and just recomputing it
fully each time it is needed.
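
The context switch path would then simply do something like (sketch, with
hgatp_from_p2m() being the reworked helper suggested above):

    csr_write(CSR_HGATP, hgatp_from_p2m(p2m));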

>
> As to "also" - maybe it's better to change the name of the function, and pass
> in just (const if possible) struct p2m_domain *.
>
>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>> @@ -41,6 +41,91 @@ void p2m_write_unlock(struct p2m_domain *p2m)
>>>>>>         write_unlock(&p2m->lock);
>>>>>>     }
>>>>>>     
>>>>>> +static void clear_and_clean_page(struct page_info *page)
>>>>>> +{
>>>>>> +    clean_dcache_va_range(page, PAGE_SIZE);
>>>>>> +    clear_domain_page(page_to_mfn(page));
>>>>>> +}
>>>>> A function of this name can, imo, only clear and then clean. Question is why
>>>>> it's the other way around, and what the underlying requirement is for the
>>>>> cleaning part to be there in the first place. Maybe that's obvious for a
>>>>> RISC-V person, but it's entirely non-obvious to me (Arm being different in
>>>>> this regard because of running with caches disabled at certain points in
>>>>> time).
>>>> You're right, the current name clear_and_clean_page() implies that clearing
>>>> should come before cleaning, which contradicts the current implementation.
>>>> The intent here is to ensure that the page contents are consistent in RAM
>>>> (not just in cache) before use by other entities (guests or devices).
>>>>
>>>> The clean must follow the clear — so yes, the order needs to be reversed.
>>> What you don't address though - why's the cleaning needed in the first place?
>> If we clean the data cache first, we flush the d-cache and then use the page to
>> perform the clear operation. As a result, the "cleared" value will be written into
>> the d-cache. To avoid polluting the d-cache with the "cleared" value, the correct
>> sequence is to clear the page first, then clean the data cache.
> If you want to avoid cache pollution, I think you'd need to use a form of stores
> which simply bypass the cache. Yet then - why would this matter here, but not
> elsewhere? Wouldn't you better leave such to the hardware, unless you can prove
> a (meaningful) performance gain?

I thought about the case when the IOMMU doesn't support coherent walks and p2m tables are
shared between the CPU and the IOMMU. Then my understanding is:
- clear_page(p) just zeroes a page in the CPU's cache.
- But the IOMMU can see old or uninitialized data if the zeroes are still in the cache.
- So clean_cache() is needed to write the data back from the cache to RAM before a
   page is used as part of a page table for the IOMMU.

>
>>>>>> +    unsigned int nr_pages = _AC(1,U) << order;
>>>>> Nit (style): Missing blank after comma.
>>>> I've changed that to BIT(order, U)
>>>>
>>>>>> +    /* Return back nr_pages necessary for p2m root table. */
>>>>>> +
>>>>>> +    if ( ACCESS_ONCE(d->arch.paging.p2m_total_pages) < nr_pages )
>>>>>> +        panic("Specify more xen,domain-p2m-mem-mb\n");
>>>>> You shouldn't panic() in anything involved in domain creation. You want to
>>>>> return NULL in this case.
>>>> It makes sense in this case just to return NULL.
>>>>
>>>>> Further, to me the use of "more" looks misleading here. Do you perhaps mean
>>>>> "larger" or "bigger"?
>>>>>
>>>>> This also looks to be happening without any lock held. If that's intentional,
>>>>> I think the "why" wants clarifying in a code comment.
>>>> Agree, returning back pages necessary for p2m root table should be done under
>>>> spin_lock(&d->arch.paging.lock).
>>> Which should be acquired at the paging_*() layer then, not at the p2m_*() layer.
>>> (As long as you mean to have that separation, that is. See the earlier discussion
>>> on that matter.)
>> Then partly p2m_set_allocation() should be moved to paging_*() too.
> Not exactly sure what you mean. On x86 at least the paging layer part of
> the function is pretty slim.

I meant that the part of the code between spin_lock(&d->arch.paging.lock) and spin_unlock(&d->arch.paging.lock)
in p2m_set_allocation() should be moved somewhere into the paging_*() layer, by the same logic by which you
suggested moving the part of p2m_allocate_root()'s code that is guarded by d->arch.paging.lock into the
paging_*() layer.

Or I just misunderstood your initial idea with this paging_*() layer and its necessity.

>
>>>>>> +    for ( unsigned int i = 0; i < nr_pages; i++ )
>>>>>> +    {
>>>>>> +        /* Return memory to domheap. */
>>>>>> +        page = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>>>>> +        if( page )
>>>>>> +        {
>>>>>> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
>>>>>> +            free_domheap_page(page);
>>>>>> +        }
>>>>>> +        else
>>>>>> +        {
>>>>>> +            printk(XENLOG_ERR
>>>>>> +                   "Failed to free P2M pages, P2M freelist is empty.\n");
>>>>>> +            return NULL;
>>>>>> +        }
>>>>>> +    }
>>>>> The reason for doing this may also want to be put in a comment.
>>>> I thought it would be enough the comment above: /* Return back nr_pages necessary for p2m root table. */
>>> That describes what the code does, but not why.
>> I will add to the comment: "... to get the memory accounting right".
> I'm sorry to be picky, but what is "right"? You want assure the root table
> memory is also accounted against the P2M pool of the domain. Can't you say
> exactly that?

It can be said in this way.

Thanks.

~ Oleksii
Re: [PATCH v2 06/17] xen/riscv: add root page table allocation
Posted by Jan Beulich 4 months ago
On 01.07.2025 16:02, Oleksii Kurochko wrote:
> On 7/1/25 12:27 PM, Jan Beulich wrote:
>> On 01.07.2025 11:44, Oleksii Kurochko wrote:
>>> On 7/1/25 8:29 AM, Jan Beulich wrote:
>>>> On 30.06.2025 18:18, Oleksii Kurochko wrote:
>>>>> On 6/30/25 5:22 PM, Jan Beulich wrote:
>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>> @@ -41,6 +41,91 @@ void p2m_write_unlock(struct p2m_domain *p2m)
>>>>>>>         write_unlock(&p2m->lock);
>>>>>>>     }
>>>>>>>     
>>>>>>> +static void clear_and_clean_page(struct page_info *page)
>>>>>>> +{
>>>>>>> +    clean_dcache_va_range(page, PAGE_SIZE);
>>>>>>> +    clear_domain_page(page_to_mfn(page));
>>>>>>> +}
>>>>>> A function of this name can, imo, only clear and then clean. Question is why
>>>>>> it's the other way around, and what the underlying requirement is for the
>>>>>> cleaning part to be there in the first place. Maybe that's obvious for a
>>>>>> RISC-V person, but it's entirely non-obvious to me (Arm being different in
>>>>>> this regard because of running with caches disabled at certain points in
>>>>>> time).
>>>>> You're right, the current name clear_and_clean_page() implies that clearing
>>>>> should come before cleaning, which contradicts the current implementation.
>>>>> The intent here is to ensure that the page contents are consistent in RAM
>>>>> (not just in cache) before use by other entities (guests or devices).
>>>>>
>>>>> The clean must follow the clear — so yes, the order needs to be reversed.
>>>> What you don't address though - why's the cleaning needed in the first place?
>>> If we clean the data cache first, we flush the d-cache and then use the page to
>>> perform the clear operation. As a result, the "cleared" value will be written into
>>> the d-cache. To avoid polluting the d-cache with the "cleared" value, the correct
>>> sequence is to clear the page first, then clean the data cache.
>> If you want to avoid cache pollution, I think you'd need to use a form of stores
>> which simply bypass the cache. Yet then - why would this matter here, but not
>> elsewhere? Wouldn't you better leave such to the hardware, unless you can prove
>> a (meaningful) performance gain?
> 
> I thought about a case when IOMMU doesn't support coherent walks and p2m tables are
> shared between CPU and IOMMU. Then my understanding is:
> - clear_page(p) just zero-ing a page in a CPU's cache.
> - But IOMMU can see old data or uninitialized, if they still in cache.
> - So, it is need to do clean_cache() to writeback data from cache to RAM, before a
>    page will be used as a part of page table for IOMMU.

Okay, so this is purely about something that doesn't matter at all for now
(until IOMMU support is introduced). Fair enough then to play safe from the
beginning.

>>>>>>> +    unsigned int nr_pages = _AC(1,U) << order;
>>>>>> Nit (style): Missing blank after comma.
>>>>> I've changed that to BIT(order, U)
>>>>>
>>>>>>> +    /* Return back nr_pages necessary for p2m root table. */
>>>>>>> +
>>>>>>> +    if ( ACCESS_ONCE(d->arch.paging.p2m_total_pages) < nr_pages )
>>>>>>> +        panic("Specify more xen,domain-p2m-mem-mb\n");
>>>>>> You shouldn't panic() in anything involved in domain creation. You want to
>>>>>> return NULL in this case.
>>>>> It makes sense in this case just to return NULL.
>>>>>
>>>>>> Further, to me the use of "more" looks misleading here. Do you perhaps mean
>>>>>> "larger" or "bigger"?
>>>>>>
>>>>>> This also looks to be happening without any lock held. If that's intentional,
>>>>>> I think the "why" wants clarifying in a code comment.
>>>>> Agree, returning back pages necessary for p2m root table should be done under
>>>>> spin_lock(&d->arch.paging.lock).
>>>> Which should be acquired at the paging_*() layer then, not at the p2m_*() layer.
>>>> (As long as you mean to have that separation, that is. See the earlier discussion
>>>> on that matter.)
>>> Then partly p2m_set_allocation() should be moved to paging_*() too.
>> Not exactly sure what you mean. On x86 at least the paging layer part of
>> the function is pretty slim.
> 
> I meant that part of code which is spin_lock(&d->arch.paging.lock); ... spin_unlock(&d->arch.paging.lock)
> in function p2m_set_allocation() should be moved somewhere to paging_*() layer for the same logic as you
> suggested to move part of  p2m_allocate_root()'s code which is guarded by d->arch.paging.lock to
> paging_*() layer.

Yes, of course.

Jan