[v4] xen/riscv: introduce p2m functionality

[PATCH v4 12/18] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration

Posted by Oleksii Kurochko 4 months, 3 weeks ago

This patch adds the initial logic for constructing PTEs from MFNs in the RISC-V
p2m subsystem. It includes:
- Implementation of p2m_pte_from_mfn(): Generates a valid PTE using the
  given MFN, p2m_type_t, including permission encoding and PBMT attribute
  setup.
- New helper p2m_set_permission(): Encodes access rights (r, w, x) into the
  PTE based on both p2m type and access permissions.
- p2m_set_type(): Stores the p2m type in PTE's bits. The storage of types,
  which don't fit PTE bits, will be implemented separately later.
- Add detection of Svade extension to properly handle a possible page-fault
  if A and D bits aren't set.

PBMT type encoding support:
- Introduces an enum pbmt_type_t to represent the PBMT field values.
- Maps types like p2m_mmio_direct_dev to p2m_mmio_direct_io, others default
  to pbmt_pma.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V4:
 - p2m_set_permission() updates:
   - Update permissions for p2m_ram_rw case, make it also executable.
   - Add pernissions setting for p2m_map_foreign_* types.
   - Drop setting peromissions for p2m_ext_storage.
   - Only turn off PTE_VALID bit for p2m_invalid, don't touch other bits.
 - p2m_pte_from_mfn() updates:
   - Update ASSERT(), add a check that mfn isn't INVALID_MFN (1)
     explicitly to avoid the case when PADDR_MASK isn't narrow enough to
     catch the case (1).
   - Drop unnessary check around call of p2m_set_type() as this check
     is already included inside p2m_set_type().
 - Introduce new p2m type p2m_first_external to detect that passed type
   is stored in external storage.
 - Add handling of PTE's A and D bits in pm2_set_permission. Also, set
   PTE_USER bit. For this cpufeatures.{h and c} were updated to be able
   to detect availability of Svade extension.
 - Drop grant table related code as it isn't going to be used at the moment.
---
Changes in V3:
 - s/p2m_entry_from_mfn/p2m_pte_from_mfn.
 - s/pbmt_type_t/pbmt_type.
 - s/pbmt_max/pbmt_count.
 - s/p2m_type_radix_set/p2m_set_type.
 - Rework p2m_set_type() to handle only types which are fited into PTEs bits.
   Other types will be covered separately.
   Update arguments of p2m_set_type(): there is no any reason for p2m anymore.
 - p2m_set_permissions() updates:
   - Update the code in p2m_set_permission() for cases p2m_raw_rw and
     p2m_mmio_direct_io to set proper type permissions.
   - Add cases for p2m_grant_map_rw and p2m_grant_map_ro.
   - Use ASSERT_UNEACHABLE() instead of BUG() in switch cases of
     p2m_set_permissions.
   - Add blank lines non-fall-through case blocks in switch cases.
 - Set MFN before permissions are set in p2m_pte_from_mfn().
 - Update prototype of p2m_entry_from_mfn().
---
Changes in V2:
 - New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
   functionality" which was splitted to smaller.
---
 xen/arch/riscv/cpufeature.c             |  1 +
 xen/arch/riscv/include/asm/cpufeature.h |  1 +
 xen/arch/riscv/include/asm/page.h       |  8 ++
 xen/arch/riscv/p2m.c                    | 97 ++++++++++++++++++++++++-
 4 files changed, 103 insertions(+), 4 deletions(-)

diff --git a/xen/arch/riscv/cpufeature.c b/xen/arch/riscv/cpufeature.c
index b846a106a3..02b68aeaa4 100644
--- a/xen/arch/riscv/cpufeature.c
+++ b/xen/arch/riscv/cpufeature.c
@@ -138,6 +138,7 @@ const struct riscv_isa_ext_data __initconst riscv_isa_ext[] = {
     RISCV_ISA_EXT_DATA(zbs),
     RISCV_ISA_EXT_DATA(smaia),
     RISCV_ISA_EXT_DATA(ssaia),
+    RISCV_ISA_EXT_DATA(svade),
     RISCV_ISA_EXT_DATA(svpbmt),
 };
 
diff --git a/xen/arch/riscv/include/asm/cpufeature.h b/xen/arch/riscv/include/asm/cpufeature.h
index 768b84b769..5f756c76db 100644
--- a/xen/arch/riscv/include/asm/cpufeature.h
+++ b/xen/arch/riscv/include/asm/cpufeature.h
@@ -37,6 +37,7 @@ enum riscv_isa_ext_id {
     RISCV_ISA_EXT_zbs,
     RISCV_ISA_EXT_smaia,
     RISCV_ISA_EXT_ssaia,
+    RISCV_ISA_EXT_svade,
     RISCV_ISA_EXT_svpbmt,
     RISCV_ISA_EXT_MAX
 };
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index cb303af0c0..4fa0556073 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -74,6 +74,14 @@
 #define PTE_SMALL       BIT(10, UL)
 #define PTE_POPULATE    BIT(11, UL)
 
+enum pbmt_type {
+    pbmt_pma,
+    pbmt_nc,
+    pbmt_io,
+    pbmt_rsvd,
+    pbmt_count,
+};
+
 #define PTE_ACCESS_MASK (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE)
 
 #define PTE_PBMT_MASK   (PTE_PBMT_NOCACHE | PTE_PBMT_IO)
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 10acfa0a9c..2d4433360d 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -10,6 +10,7 @@
 #include <xen/sched.h>
 #include <xen/sections.h>
 
+#include <asm/cpufeature.h>
 #include <asm/csr.h>
 #include <asm/flushtlb.h>
 #include <asm/paging.h>
@@ -288,6 +289,18 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
     return __map_domain_page(p2m->root + root_table_indx);
 }
 
+static int p2m_set_type(pte_t *pte, p2m_type_t t)
+{
+    int rc = 0;
+
+    if ( t > p2m_first_external )
+        panic("unimplemeted\n");
+    else
+        pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
+
+    return rc;
+}
+
 static p2m_type_t p2m_get_type(const pte_t pte)
 {
     p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
@@ -318,11 +331,87 @@ static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
     p2m_write_pte(p, pte, clean_pte);
 }
 
-static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
+static void p2m_set_permission(pte_t *e, p2m_type_t t)
 {
-    panic("%s: hasn't been implemented yet\n", __func__);
+    e->pte &= ~PTE_ACCESS_MASK;
+
+    e->pte |= PTE_USER;
+
+    /*
+     * Two schemes to manage the A and D bits are defined:
+     *   • The Svade extension: when a virtual page is accessed and the A bit
+     *     is clear, or is written and the D bit is clear, a page-fault
+     *     exception is raised.
+     *   • When the Svade extension is not implemented, the following scheme
+     *     applies.
+     *     When a virtual page is accessed and the A bit is clear, the PTE is
+     *     updated to set the A bit. When the virtual page is written and the
+     *     D bit is clear, the PTE is updated to set the D bit. When G-stage
+     *     address translation is in use and is not Bare, the G-stage virtual
+     *     pages may be accessed or written by implicit accesses to VS-level
+     *     memory management data structures, such as page tables.
+     * Thereby to avoid a page-fault in case of Svade is available, it is
+     * necesssary to set A and D bits.
+     */
+    if ( riscv_isa_extension_available(NULL, RISCV_ISA_EXT_svade) )
+        e->pte |= PTE_ACCESSED | PTE_DIRTY;
+
+    switch ( t )
+    {
+    case p2m_map_foreign_rw:
+    case p2m_mmio_direct_io:
+        e->pte |= PTE_READABLE | PTE_WRITABLE;
+        break;
+
+    case p2m_ram_rw:
+        e->pte |= PTE_ACCESS_MASK;
+        break;
+
+    case p2m_invalid:
+        e->pte &= ~PTE_VALID;
+        break;
+
+    case p2m_map_foreign_ro:
+        e->pte |= PTE_READABLE;
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+        break;
+    }
+}
+
+static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
+{
+    pte_t e = (pte_t) { PTE_VALID };
+
+    switch ( t )
+    {
+    case p2m_mmio_direct_io:
+        e.pte |= PTE_PBMT_IO;
+        break;
+
+    default:
+        break;
+    }
+
+    pte_set_mfn(&e, mfn);
+
+    ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK) || mfn_eq(mfn, INVALID_MFN));
+
+    if ( !is_table )
+    {
+        p2m_set_permission(&e, t);
+        p2m_set_type(&e, t);
+    }
+    else
+        /*
+         * According to the spec and table "Encoding of PTE R/W/X fields":
+         *   X=W=R=0 -> Pointer to next level of page table.
+         */
+        e.pte &= ~PTE_ACCESS_MASK;
 
-    return (pte_t) { .pte = 0 };
+    return e;
 }
 
 #define P2M_TABLE_MAP_NONE 0
@@ -580,7 +669,7 @@ static int p2m_set_entry(struct p2m_domain *p2m,
         p2m_clean_pte(entry, p2m->clean_dcache);
     else
     {
-        pte_t pte = p2m_pte_from_mfn(mfn, t);
+        pte_t pte = p2m_pte_from_mfn(mfn, t, false);
 
         p2m_write_pte(entry, pte, p2m->clean_dcache);
 
-- 
2.51.0

Re: [PATCH v4 12/18] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration

Posted by Jan Beulich 4 months, 2 weeks ago

On 17.09.2025 23:55, Oleksii Kurochko wrote:
> @@ -318,11 +331,87 @@ static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
>      p2m_write_pte(p, pte, clean_pte);
>  }
>  
> -static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
> +static void p2m_set_permission(pte_t *e, p2m_type_t t)
>  {
> -    panic("%s: hasn't been implemented yet\n", __func__);
> +    e->pte &= ~PTE_ACCESS_MASK;
> +
> +    e->pte |= PTE_USER;
> +
> +    /*
> +     * Two schemes to manage the A and D bits are defined:
> +     *   • The Svade extension: when a virtual page is accessed and the A bit
> +     *     is clear, or is written and the D bit is clear, a page-fault
> +     *     exception is raised.
> +     *   • When the Svade extension is not implemented, the following scheme
> +     *     applies.
> +     *     When a virtual page is accessed and the A bit is clear, the PTE is
> +     *     updated to set the A bit. When the virtual page is written and the
> +     *     D bit is clear, the PTE is updated to set the D bit. When G-stage
> +     *     address translation is in use and is not Bare, the G-stage virtual
> +     *     pages may be accessed or written by implicit accesses to VS-level
> +     *     memory management data structures, such as page tables.
> +     * Thereby to avoid a page-fault in case of Svade is available, it is
> +     * necesssary to set A and D bits.
> +     */
> +    if ( riscv_isa_extension_available(NULL, RISCV_ISA_EXT_svade) )
> +        e->pte |= PTE_ACCESSED | PTE_DIRTY;

All of this depending on menvcfg.ADUE anyway, is this really needed? Isn't
machine mode software responsible for dealing with this kind of page faults
(just like the hypervisor is reponsible for dealing with ones resulting
from henvcfg.ADUE being clear)?

> +    switch ( t )
> +    {
> +    case p2m_map_foreign_rw:
> +    case p2m_mmio_direct_io:
> +        e->pte |= PTE_READABLE | PTE_WRITABLE;
> +        break;
> +
> +    case p2m_ram_rw:
> +        e->pte |= PTE_ACCESS_MASK;
> +        break;
> +
> +    case p2m_invalid:
> +        e->pte &= ~PTE_VALID;
> +        break;
> +
> +    case p2m_map_foreign_ro:
> +        e->pte |= PTE_READABLE;
> +        break;
> +
> +    default:
> +        ASSERT_UNREACHABLE();
> +        break;
> +    }
> +}
> +
> +static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
> +{
> +    pte_t e = (pte_t) { PTE_VALID };
> +
> +    switch ( t )
> +    {
> +    case p2m_mmio_direct_io:
> +        e.pte |= PTE_PBMT_IO;
> +        break;

Shouldn't this be limited to the !is_table case (just like you have it ...

> +    default:
> +        break;
> +    }
> +
> +    pte_set_mfn(&e, mfn);
> +
> +    ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK) || mfn_eq(mfn, INVALID_MFN));
> +
> +    if ( !is_table )
> +    {
> +        p2m_set_permission(&e, t);

... here? Or else at least ASSERT(!is_table) up there? Personally I think
all of this !is_table stuff would best be done here.

Jan

Re: [PATCH v4 12/18] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration

Posted by Oleksii Kurochko 4 months, 1 week ago

On 9/22/25 6:28 PM, Jan Beulich wrote:
> On 17.09.2025 23:55, Oleksii Kurochko wrote:
>> @@ -318,11 +331,87 @@ static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
>>       p2m_write_pte(p, pte, clean_pte);
>>   }
>>   
>> -static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
>> +static void p2m_set_permission(pte_t *e, p2m_type_t t)
>>   {
>> -    panic("%s: hasn't been implemented yet\n", __func__);
>> +    e->pte &= ~PTE_ACCESS_MASK;
>> +
>> +    e->pte |= PTE_USER;
>> +
>> +    /*
>> +     * Two schemes to manage the A and D bits are defined:
>> +     *   • The Svade extension: when a virtual page is accessed and the A bit
>> +     *     is clear, or is written and the D bit is clear, a page-fault
>> +     *     exception is raised.
>> +     *   • When the Svade extension is not implemented, the following scheme
>> +     *     applies.
>> +     *     When a virtual page is accessed and the A bit is clear, the PTE is
>> +     *     updated to set the A bit. When the virtual page is written and the
>> +     *     D bit is clear, the PTE is updated to set the D bit. When G-stage
>> +     *     address translation is in use and is not Bare, the G-stage virtual
>> +     *     pages may be accessed or written by implicit accesses to VS-level
>> +     *     memory management data structures, such as page tables.
>> +     * Thereby to avoid a page-fault in case of Svade is available, it is
>> +     * necesssary to set A and D bits.
>> +     */
>> +    if ( riscv_isa_extension_available(NULL, RISCV_ISA_EXT_svade) )
>> +        e->pte |= PTE_ACCESSED | PTE_DIRTY;
> All of this depending on menvcfg.ADUE anyway, is this really needed? Isn't
> machine mode software responsible for dealing with this kind of page faults
> (just like the hypervisor is reponsible for dealing with ones resulting
> from henvcfg.ADUE being clear)?

In general, I think you are right.

In this case, though, I just wanted to avoid unnecessary page faults for now.
My understanding is that having such faults handled by the hypervisor can indeed
be useful, for example to track which pages are being accessed. However, since we
currently don’t track page usage, handling these traps would only result in
setting the A and D bits and then returning control to the guest.
To avoid this overhead, I chose to set the bits up front.

>
>> +    switch ( t )
>> +    {
>> +    case p2m_map_foreign_rw:
>> +    case p2m_mmio_direct_io:
>> +        e->pte |= PTE_READABLE | PTE_WRITABLE;
>> +        break;
>> +
>> +    case p2m_ram_rw:
>> +        e->pte |= PTE_ACCESS_MASK;
>> +        break;
>> +
>> +    case p2m_invalid:
>> +        e->pte &= ~PTE_VALID;
>> +        break;
>> +
>> +    case p2m_map_foreign_ro:
>> +        e->pte |= PTE_READABLE;
>> +        break;
>> +
>> +    default:
>> +        ASSERT_UNREACHABLE();
>> +        break;
>> +    }
>> +}
>> +
>> +static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
>> +{
>> +    pte_t e = (pte_t) { PTE_VALID };
>> +
>> +    switch ( t )
>> +    {
>> +    case p2m_mmio_direct_io:
>> +        e.pte |= PTE_PBMT_IO;
>> +        break;
> Shouldn't this be limited to the !is_table case (just like you have it ...
>
>> +    default:
>> +        break;
>> +    }
>> +
>> +    pte_set_mfn(&e, mfn);
>> +
>> +    ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK) || mfn_eq(mfn, INVALID_MFN));
>> +
>> +    if ( !is_table )
>> +    {
>> +        p2m_set_permission(&e, t);
> ... here? Or else at least ASSERT(!is_table) up there? Personally I think
> all of this !is_table stuff would best be done here.

Agree, that this should be done only for leaf PTEs.
I think that I will move that inside p2m_set_permissions() where
p2m_mmio_direct_io case is handled.

Thanks.

~ Oleksii

Re: [PATCH v4 12/18] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration

Posted by Jan Beulich 4 months ago

On 29.09.2025 15:30, Oleksii Kurochko wrote:
> 
> On 9/22/25 6:28 PM, Jan Beulich wrote:
>> On 17.09.2025 23:55, Oleksii Kurochko wrote:
>>> @@ -318,11 +331,87 @@ static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
>>>       p2m_write_pte(p, pte, clean_pte);
>>>   }
>>>   
>>> -static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
>>> +static void p2m_set_permission(pte_t *e, p2m_type_t t)
>>>   {
>>> -    panic("%s: hasn't been implemented yet\n", __func__);
>>> +    e->pte &= ~PTE_ACCESS_MASK;
>>> +
>>> +    e->pte |= PTE_USER;
>>> +
>>> +    /*
>>> +     * Two schemes to manage the A and D bits are defined:
>>> +     *   • The Svade extension: when a virtual page is accessed and the A bit
>>> +     *     is clear, or is written and the D bit is clear, a page-fault
>>> +     *     exception is raised.
>>> +     *   • When the Svade extension is not implemented, the following scheme
>>> +     *     applies.
>>> +     *     When a virtual page is accessed and the A bit is clear, the PTE is
>>> +     *     updated to set the A bit. When the virtual page is written and the
>>> +     *     D bit is clear, the PTE is updated to set the D bit. When G-stage
>>> +     *     address translation is in use and is not Bare, the G-stage virtual
>>> +     *     pages may be accessed or written by implicit accesses to VS-level
>>> +     *     memory management data structures, such as page tables.
>>> +     * Thereby to avoid a page-fault in case of Svade is available, it is
>>> +     * necesssary to set A and D bits.
>>> +     */
>>> +    if ( riscv_isa_extension_available(NULL, RISCV_ISA_EXT_svade) )
>>> +        e->pte |= PTE_ACCESSED | PTE_DIRTY;
>> All of this depending on menvcfg.ADUE anyway, is this really needed? Isn't
>> machine mode software responsible for dealing with this kind of page faults
>> (just like the hypervisor is reponsible for dealing with ones resulting
>> from henvcfg.ADUE being clear)?
> 
> In general, I think you are right.
> 
> In this case, though, I just wanted to avoid unnecessary page faults for now.
> My understanding is that having such faults handled by the hypervisor can indeed
> be useful, for example to track which pages are being accessed. However, since we
> currently don’t track page usage, handling these traps would only result in
> setting the A and D bits and then returning control to the guest.

Yet that still be be machine-mode software aiui. By always setting the bits we'd
undermine whatever purpose _they_ have enabled the extension for, wouldn't we?

> To avoid this overhead, I chose to set the bits up front.

Irrespective to the answer to the question above, if you mean to do so, I think
all of this needs explaining better in the comment.

Jan

Re: [PATCH v4 12/18] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration

Posted by Oleksii Kurochko 4 months ago

On 10/7/25 3:09 PM, Jan Beulich wrote:
> On 29.09.2025 15:30, Oleksii Kurochko wrote:
>> On 9/22/25 6:28 PM, Jan Beulich wrote:
>>> On 17.09.2025 23:55, Oleksii Kurochko wrote:
>>>> @@ -318,11 +331,87 @@ static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
>>>>        p2m_write_pte(p, pte, clean_pte);
>>>>    }
>>>>    
>>>> -static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
>>>> +static void p2m_set_permission(pte_t *e, p2m_type_t t)
>>>>    {
>>>> -    panic("%s: hasn't been implemented yet\n", __func__);
>>>> +    e->pte &= ~PTE_ACCESS_MASK;
>>>> +
>>>> +    e->pte |= PTE_USER;
>>>> +
>>>> +    /*
>>>> +     * Two schemes to manage the A and D bits are defined:
>>>> +     *   • The Svade extension: when a virtual page is accessed and the A bit
>>>> +     *     is clear, or is written and the D bit is clear, a page-fault
>>>> +     *     exception is raised.
>>>> +     *   • When the Svade extension is not implemented, the following scheme
>>>> +     *     applies.
>>>> +     *     When a virtual page is accessed and the A bit is clear, the PTE is
>>>> +     *     updated to set the A bit. When the virtual page is written and the
>>>> +     *     D bit is clear, the PTE is updated to set the D bit. When G-stage
>>>> +     *     address translation is in use and is not Bare, the G-stage virtual
>>>> +     *     pages may be accessed or written by implicit accesses to VS-level
>>>> +     *     memory management data structures, such as page tables.
>>>> +     * Thereby to avoid a page-fault in case of Svade is available, it is
>>>> +     * necesssary to set A and D bits.
>>>> +     */
>>>> +    if ( riscv_isa_extension_available(NULL, RISCV_ISA_EXT_svade) )
>>>> +        e->pte |= PTE_ACCESSED | PTE_DIRTY;
>>> All of this depending on menvcfg.ADUE anyway, is this really needed? Isn't
>>> machine mode software responsible for dealing with this kind of page faults
>>> (just like the hypervisor is reponsible for dealing with ones resulting
>>> from henvcfg.ADUE being clear)?
>> In general, I think you are right.
>>
>> In this case, though, I just wanted to avoid unnecessary page faults for now.
>> My understanding is that having such faults handled by the hypervisor can indeed
>> be useful, for example to track which pages are being accessed. However, since we
>> currently don’t track page usage, handling these traps would only result in
>> setting the A and D bits and then returning control to the guest.
> Yet that still be be machine-mode software aiui. By always setting the bits we'd
> undermine whatever purpose _they_ have enabled the extension for, wouldn't we?

It’s a good point, and from an architectural perspective, it’s possible that
machine-mode software might want to handle page faults.
However, looking at OpenSBI, it delegates (otherwise all traps/interrupts by
default are going to machine-mode) page faults [1] to lower modes, and I expect
that other machine-mode software does the same (but of course there is no such
guarantee).

Therefore, considering that OpenSBI delegates page faults to lower modes and
does not set the A and D bits for p2m (guest) PTEs, this will result in a page
fault being handled by the hypervisor. As a result, we don’t affect the behavior
of machine-mode software at all.

If we want to avoid depending on how OpenSBI or other machine-mode software is
implemented, we might instead want to have our own page fault handler in Xen,
and then set the A and D bits within this handler.
Do you think it would be better to do in this way from the start? If yes, then
we also want drop setting of A and D bits for Xen's PTEs [3] to allow M-mode to
handle S/HS-mode page faults.

Interestingly, OpenSBI doesn’t allow hypervisor mode to decide whether to
support Svade or not [2]. By doing so, we can’t set|henvcfg.adue = 1| to disable
it as menvcfg.adue=0 has more power, which is not very flexible.

[1]https://github.com/riscv-software-src/opensbi/blob/master/lib/sbi/sbi_hart.c#L209
[2]https://github.com/riscv-software-src/opensbi/blob/master/lib/sbi/sbi_hart.c#L168
[3]https://gitlab.com/xen-project/xen/-/blob/staging/xen/arch/riscv/pt.c?ref_type=heads#L343

>> To avoid this overhead, I chose to set the bits up front.
> Irrespective to the answer to the question above, if you mean to do so, I think
> all of this needs explaining better in the comment.

Sure, I will add the comment if the current one approach of setting A and D bits
will be chosen.

~ Oleksii

Re: [PATCH v4 12/18] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration

Posted by Jan Beulich 4 months ago

On 09.10.2025 11:21, Oleksii Kurochko wrote:
> 
> On 10/7/25 3:09 PM, Jan Beulich wrote:
>> On 29.09.2025 15:30, Oleksii Kurochko wrote:
>>> On 9/22/25 6:28 PM, Jan Beulich wrote:
>>>> On 17.09.2025 23:55, Oleksii Kurochko wrote:
>>>>> @@ -318,11 +331,87 @@ static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
>>>>>        p2m_write_pte(p, pte, clean_pte);
>>>>>    }
>>>>>    
>>>>> -static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
>>>>> +static void p2m_set_permission(pte_t *e, p2m_type_t t)
>>>>>    {
>>>>> -    panic("%s: hasn't been implemented yet\n", __func__);
>>>>> +    e->pte &= ~PTE_ACCESS_MASK;
>>>>> +
>>>>> +    e->pte |= PTE_USER;
>>>>> +
>>>>> +    /*
>>>>> +     * Two schemes to manage the A and D bits are defined:
>>>>> +     *   • The Svade extension: when a virtual page is accessed and the A bit
>>>>> +     *     is clear, or is written and the D bit is clear, a page-fault
>>>>> +     *     exception is raised.
>>>>> +     *   • When the Svade extension is not implemented, the following scheme
>>>>> +     *     applies.
>>>>> +     *     When a virtual page is accessed and the A bit is clear, the PTE is
>>>>> +     *     updated to set the A bit. When the virtual page is written and the
>>>>> +     *     D bit is clear, the PTE is updated to set the D bit. When G-stage
>>>>> +     *     address translation is in use and is not Bare, the G-stage virtual
>>>>> +     *     pages may be accessed or written by implicit accesses to VS-level
>>>>> +     *     memory management data structures, such as page tables.
>>>>> +     * Thereby to avoid a page-fault in case of Svade is available, it is
>>>>> +     * necesssary to set A and D bits.
>>>>> +     */
>>>>> +    if ( riscv_isa_extension_available(NULL, RISCV_ISA_EXT_svade) )
>>>>> +        e->pte |= PTE_ACCESSED | PTE_DIRTY;
>>>> All of this depending on menvcfg.ADUE anyway, is this really needed? Isn't
>>>> machine mode software responsible for dealing with this kind of page faults
>>>> (just like the hypervisor is reponsible for dealing with ones resulting
>>>> from henvcfg.ADUE being clear)?
>>> In general, I think you are right.
>>>
>>> In this case, though, I just wanted to avoid unnecessary page faults for now.
>>> My understanding is that having such faults handled by the hypervisor can indeed
>>> be useful, for example to track which pages are being accessed. However, since we
>>> currently don’t track page usage, handling these traps would only result in
>>> setting the A and D bits and then returning control to the guest.
>> Yet that still be be machine-mode software aiui. By always setting the bits we'd
>> undermine whatever purpose _they_ have enabled the extension for, wouldn't we?
> 
> It’s a good point, and from an architectural perspective, it’s possible that
> machine-mode software might want to handle page faults.
> However, looking at OpenSBI, it delegates (otherwise all traps/interrupts by
> default are going to machine-mode) page faults [1] to lower modes, and I expect
> that other machine-mode software does the same (but of course there is no such
> guarantee).
> 
> Therefore, considering that OpenSBI delegates page faults to lower modes and
> does not set the A and D bits for p2m (guest) PTEs, this will result in a page
> fault being handled by the hypervisor. As a result, we don’t affect the behavior
> of machine-mode software at all.
> 
> If we want to avoid depending on how OpenSBI or other machine-mode software is
> implemented, we might instead want to have our own page fault handler in Xen,
> and then set the A and D bits within this handler.

Won't Xen need its own page fault handler anyway?

> Do you think it would be better to do in this way from the start? If yes, then
> we also want drop setting of A and D bits for Xen's PTEs [3] to allow M-mode to
> handle S/HS-mode page faults.

What I don't really understand is what the intended use of that extension is.
Surely every entity should be responsible for its own A/D bits, with lower
layers coming into play only when certain things need e.g. emulating. This
lack of understanding on my part extends to ...

> Interestingly, OpenSBI doesn’t allow hypervisor mode to decide whether to
> support Svade or not [2]. By doing so, we can’t set|henvcfg.adue = 1| to disable
> it as menvcfg.adue=0 has more power, which is not very flexible.

... this point, which I was also wondering about before.

Jan

Re: [PATCH v4 12/18] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration

Posted by Oleksii Kurochko 4 months ago

On 10/9/25 2:06 PM, Jan Beulich wrote:
> On 09.10.2025 11:21, Oleksii Kurochko wrote:
>> On 10/7/25 3:09 PM, Jan Beulich wrote:
>>> On 29.09.2025 15:30, Oleksii Kurochko wrote:
>>>> On 9/22/25 6:28 PM, Jan Beulich wrote:
>>>>> On 17.09.2025 23:55, Oleksii Kurochko wrote:
>>>>>> @@ -318,11 +331,87 @@ static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
>>>>>>         p2m_write_pte(p, pte, clean_pte);
>>>>>>     }
>>>>>>     
>>>>>> -static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
>>>>>> +static void p2m_set_permission(pte_t *e, p2m_type_t t)
>>>>>>     {
>>>>>> -    panic("%s: hasn't been implemented yet\n", __func__);
>>>>>> +    e->pte &= ~PTE_ACCESS_MASK;
>>>>>> +
>>>>>> +    e->pte |= PTE_USER;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * Two schemes to manage the A and D bits are defined:
>>>>>> +     *   • The Svade extension: when a virtual page is accessed and the A bit
>>>>>> +     *     is clear, or is written and the D bit is clear, a page-fault
>>>>>> +     *     exception is raised.
>>>>>> +     *   • When the Svade extension is not implemented, the following scheme
>>>>>> +     *     applies.
>>>>>> +     *     When a virtual page is accessed and the A bit is clear, the PTE is
>>>>>> +     *     updated to set the A bit. When the virtual page is written and the
>>>>>> +     *     D bit is clear, the PTE is updated to set the D bit. When G-stage
>>>>>> +     *     address translation is in use and is not Bare, the G-stage virtual
>>>>>> +     *     pages may be accessed or written by implicit accesses to VS-level
>>>>>> +     *     memory management data structures, such as page tables.
>>>>>> +     * Thereby to avoid a page-fault in case of Svade is available, it is
>>>>>> +     * necesssary to set A and D bits.
>>>>>> +     */
>>>>>> +    if ( riscv_isa_extension_available(NULL, RISCV_ISA_EXT_svade) )
>>>>>> +        e->pte |= PTE_ACCESSED | PTE_DIRTY;
>>>>> All of this depending on menvcfg.ADUE anyway, is this really needed? Isn't
>>>>> machine mode software responsible for dealing with this kind of page faults
>>>>> (just like the hypervisor is reponsible for dealing with ones resulting
>>>>> from henvcfg.ADUE being clear)?
>>>> In general, I think you are right.
>>>>
>>>> In this case, though, I just wanted to avoid unnecessary page faults for now.
>>>> My understanding is that having such faults handled by the hypervisor can indeed
>>>> be useful, for example to track which pages are being accessed. However, since we
>>>> currently don’t track page usage, handling these traps would only result in
>>>> setting the A and D bits and then returning control to the guest.
>>> Yet that still be be machine-mode software aiui. By always setting the bits we'd
>>> undermine whatever purpose _they_ have enabled the extension for, wouldn't we?
>> It’s a good point, and from an architectural perspective, it’s possible that
>> machine-mode software might want to handle page faults.
>> However, looking at OpenSBI, it delegates (otherwise all traps/interrupts by
>> default are going to machine-mode) page faults [1] to lower modes, and I expect
>> that other machine-mode software does the same (but of course there is no such
>> guarantee).
>>
>> Therefore, considering that OpenSBI delegates page faults to lower modes and
>> does not set the A and D bits for p2m (guest) PTEs, this will result in a page
>> fault being handled by the hypervisor. As a result, we don’t affect the behavior
>> of machine-mode software at all.
>>
>> If we want to avoid depending on how OpenSBI or other machine-mode software is
>> implemented, we might instead want to have our own page fault handler in Xen,
>> and then set the A and D bits within this handler.
> Won't Xen need its own page fault handler anyway?

Of course, it will.
I just meant that it won’t need it solely for the purpose of setting the A and
D bits.

Considering that Svade is mandatory for RVAxx profiles, and that at some point
we may want to implement certain optimizations (mentioned below), it would make
sense to handle the A/D bits in the page fault handler.
However, for now, for the sake of simplicity and given that none of the
optimizations mentioned below are currently implemented and OpenSBI delegates
page fault handling to hypervisor so OpenSBI isn't planning to deal with A/D
bits, I think we can set the A/D bits during PTE creation with a comment
explaining why it’s done this way, as suggested before.
Later, when additional optimizations that rely on A/D bits are needed, we can
remove this initial setting and add proper A/D handling in the page fault
handler.

>
>> Do you think it would be better to do in this way from the start? If yes, then
>> we also want drop setting of A and D bits for Xen's PTEs [3] to allow M-mode to
>> handle S/HS-mode page faults.
> What I don't really understand is what the intended use of that extension is.

I think this is mainly for software-managed PTE A/D bit updates, which could be
useful for several use cases such as demand paging, cache flushing optimizations,
and memory access tracking.

Also, from a hardware perspective, it’s probably simpler to let software manage
the PTE A/D bits (using the Svade extension) rather than implementing the
Svadu extension for hardware-managed updates.


~ Oleksii

> Surely every entity should be responsible for its own A/D bits, with lower
> layers coming into play only when certain things need e.g. emulating. This
> lack of understanding on my part extends to ...
>
>> Interestingly, OpenSBI doesn’t allow hypervisor mode to decide whether to
>> support Svade or not [2]. By doing so, we can’t set|henvcfg.adue = 1| to disable
>> it as menvcfg.adue=0 has more power, which is not very flexible.
> ... this point, which I was also wondering about before.