[for 4.22 v5 10/18] xen/riscv: implement p2m_set_range()

Oleksii Kurochko posted 18 patches 3 months, 3 weeks ago
There is a newer version of this series
[for 4.22 v5 10/18] xen/riscv: implement p2m_set_range()
Posted by Oleksii Kurochko 3 months, 3 weeks ago
This patch introduces p2m_set_range() and its core helper p2m_set_entry() for
RISC-V, based loosely on the Arm implementation, with several RISC-V-specific
modifications.

The main changes are:
- Simplification of the Break-Before-Make (BBM) approach, since according to
  the RISC-V spec:
    It is permitted for multiple address-translation cache entries to co-exist
    for the same address. This represents the fact that in a conventional
    TLB hierarchy, it is possible for multiple entries to match a single
    address if, for example, a page is upgraded to a superpage without first
    clearing the original non-leaf PTE’s valid bit and executing an SFENCE.VMA
    with rs1=x0, or if multiple TLBs exist in parallel at a given level of the
    hierarchy. In this case, just as if an SFENCE.VMA is not executed between
    a write to the memory-management tables and subsequent implicit read of the
    same address: it is unpredictable whether the old non-leaf PTE or the new
    leaf PTE is used, but the behavior is otherwise well defined.
  In contrast to the Arm architecture, where BBM is mandatory and failing to
  use it in some cases can lead to CPU instability, RISC-V guarantees
  stability, and the behavior remains safe, though unpredictable in terms of
  which translation will be used (see the sketch after this list).
- Unlike Arm, the valid bit is not repurposed for other uses in this
  implementation. Instead, entry validity is determined based solely on the
  P2M PTE's valid bit.

The main functionality is in p2m_set_entry(), which handles mappings aligned
to page table block entries (e.g., 1GB, 2MB, or 4KB with 4KB granularity).

p2m_set_range() breaks a region down into block-aligned mappings and calls
p2m_set_entry() accordingly.
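
As an illustration (numbers chosen only for this example, assuming RV64 with
4KB granularity): mapping nr = 2^18 + 2^9 + 1 pages at a 1GB-aligned
sgfn/smfn results in three p2m_set_entry() calls, one per block size:

    p2m_mapping_order(sgfn, smfn, nr) -> 18   /* one 1GB entry */
    p2m_mapping_order(...)            ->  9   /* one 2MB entry */
    p2m_mapping_order(...)            ->  0   /* one 4KB entry */

p2m_mapping_order() returns the largest order for which both the sgfn/smfn
alignment and the remaining number of pages allow a block mapping.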

Stub implementations (to be completed later) include:
- p2m_free_subtree()
- p2m_next_level()
- p2m_pte_from_mfn()

Note: Support for shattering block entries is not implemented in this
patch and will be added separately.

Additionally, some straightforward helper functions are now implemented:
- p2m_write_pte()
- p2m_remove_pte()
- p2m_get_root_pointer()

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
 - Update the comment above p2m_get_root_pointer().
 - Fix the indentation of p2m_set_entry()'s arguments.
 - Update the comment in p2m_set_entry() where lookup is happening.
 - Drop part of the comment above p2m_set_entry() as it is not really
   needed anymore.
 - Introduce P2M_DECLARE_OFFSETS() to use instead of DECLARE_OFFSETS(), as
   the latter could have an issue with P2M code.
 - Update p2m_get_root_pointer() to work only with P2M root properties.
 - Update the comment inside p2m_set_entry() for the case when
   p2m_next_level() returns P2M_TABLE_MAP_{NONE,NOMEM}.
 - Slightly simplify the condition around p2m_free_subtree() by removing the
   case where removing && mfn(0) were checked explicitly.
---
Changes in V4:
 - Introduce gstage_root_level and use it for the definition of
   P2M_ROOT_LEVEL.
 - Introduce the P2M_LEVEL_ORDER() and P2M_PAGETABLE_ENTRIES() macros.
 - Add a TODO comment in p2m_write_pte() about a possible performance
   optimization.
 - Use compound literal for `pte` variable inside p2m_clean_pte().
 - Fix the comment above p2m_next_level().
 - Update the ASSERT() inside p2m_set_entry() and leave only a check of the
   target, as p2m_mapping_order() ensures that page_order is correctly
   aligned.
 - Update the comment above the declaration of `removing_mapping` in
   p2m_set_entry().
 - Stray blanks.
 - Handle possible overflow of the number of unmapped GFNs in case of a
   failure in p2m_set_range().
 - Handle the case when the MFN is 0 and removal of such an MFN is requested
   in p2m_set_entry().
 - Fix p2m_get_root_pointer() to return the correct pointer to the root page
   table.
---
Changes in V3:
 - Drop p2m_access_t related stuff as it isn't going to be used, at least
   for now.
 - Move the definition of P2M_ROOT_ORDER and P2M_ROOT_PAGES to earlier
   patches.
 - Update the comment above lowest_mapped_gfn declaration.
 - Update the comment above p2m_get_root_pointer(): s/"...offset of the root
   table"/"...offset into root table".
 - s/p2m_remove_pte/p2m_clean_pte.
 - Use plain 0 instead of 0x00 in p2m_clean_pte().
 - s/p2m_entry_from_mfn/p2m_pte_from_mfn.
 - s/GUEST_TABLE_*/P2M_TABLE_*.
 - Update the comment above p2m_next_level(): "GFN entry" -> "the entry
   corresponding to the GFN".
 - s/__p2m_set_entry/_p2m_set_entry.
 - drop "s" for sgfn and smfn prefixes of _p2m_set_entry()'s arguments
   as this function work only with one GFN and one MFN.
 - Return correct return code when p2m_next_level() faild in _p2m_set_entry(),
   also drop "else" and just handle case (rc != P2M_TABLE_NORMAL) separately.
 - Code style fixes.
 - Use unsigned int for "order" in p2m_set_entry().
 - s/p2m_set_entry/p2m_free_subtree.
 - Update the ASSERT() in __p2m_set_entry() to check that page_order is
   properly aligned.
 - Return -EACCES instead of -ENOMEM in the case when the domain is dying
   and p2m_set_entry() is called.
 - s/p2m_set_entry/p2m_set_range.
 - s/__p2m_set_entry/p2m_set_entry
 - s/p2me_is_valid/p2m_is_valid()
 - Return the number of successfully mapped GFNs in case not all of them
   were mapped in p2m_set_range().
 - Use BIT(order, UL) instead of 1 << order.
 - Drop IOMMU flushing code from p2m_set_entry().
 - Set p2m->need_flush = true when an entry is changed in p2m_set_entry().
 - Introduce p2m_mapping_order() to support superpages.
 - Drop p2m_is_valid() and use pte_is_valid() instead, as there are no
   tricks with copying the valid bit anymore.
 - Update p2m_pte_from_mfn() prototype: drop p2m argument.
---
Changes in V2:
 - New patch. It was part of a big patch "xen/riscv: implement p2m mapping
   functionality" which was split into smaller ones.
 - Update the way the P2M TLB is flushed:
 - RISC-V doesn't require BBM, so there is no need to remove the PTE before
   writing the new one; drop 'if /* pte_is_valid(orig_pte) */' and remove the
   PTE only when removal has been requested.
 - Drop p2m->need_flush |= !!pte_is_valid(orig_pte); for the case when a PTE
   is removed, as RISC-V may cache invalid PTEs and therefore requires a
   flush each time, regardless of whether the PTE is valid at the moment of
   removal.
 - Drop the check of whether the PTE is valid in case the PTE is modified:
   as mentioned above, BBM isn't required, so TLB flushing can be deferred
   and there is no need to do it before modifying the PTE.
 - Drop p2m->need_flush as it seems like it will always be true.
 - Drop foreign mapping things as it isn't necessary for RISC-V right now.
 - s/p2m_is_valid/p2me_is_valid.
 - Move the definition and initialization of
   p2m->{max_mapped_gfn,lowest_mapped_gfn} to this patch.
---
 xen/arch/riscv/include/asm/p2m.h |  43 ++++
 xen/arch/riscv/p2m.c             | 331 ++++++++++++++++++++++++++++++-
 2 files changed, 373 insertions(+), 1 deletion(-)

diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index 4fafb26e1e..ce8bcb944f 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -8,12 +8,45 @@
 #include <xen/rwlock.h>
 #include <xen/types.h>
 
+#include <asm/page.h>
 #include <asm/page-bits.h>
 
 extern unsigned char gstage_mode;
+extern unsigned int gstage_root_level;
 
 #define P2M_ROOT_ORDER  (ilog2(GSTAGE_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT)
 #define P2M_ROOT_PAGES  BIT(P2M_ROOT_ORDER, U)
+#define P2M_ROOT_LEVEL  gstage_root_level
+
+/*
+ * According to the RISC-V spec:
+ *   When hgatp.MODE specifies a translation scheme of Sv32x4, Sv39x4, Sv48x4,
+ *   or Sv57x4, G-stage address translation is a variation on the usual
+ *   page-based virtual address translation scheme of Sv32, Sv39, Sv48, or
+ *   Sv57, respectively. In each case, the size of the incoming address is
+ *   widened by 2 bits (to 34, 41, 50, or 59 bits).
+ *
+ * P2M_LEVEL_ORDER(lvl) defines the bit position in the GFN from which
+ * the index for this level of the P2M page table starts. The extra 2
+ * bits added by the "x4" schemes only affect the root page table width.
+ *
+ * Therefore, this macro can safely reuse XEN_PT_LEVEL_ORDER() for all
+ * levels: the extra 2 bits do not change the indices of lower levels.
+ *
+ * The extra 2 bits are only relevant if one tried to address beyond the
+ * root level (i.e., P2M_LEVEL_ORDER(P2M_ROOT_LEVEL + 1)), which is
+ * invalid.
+ */
+#define P2M_LEVEL_ORDER(lvl) XEN_PT_LEVEL_ORDER(lvl)
+
+#define P2M_ROOT_EXTRA_BITS(lvl) (2 * ((lvl) == P2M_ROOT_LEVEL))
+
+#define P2M_PAGETABLE_ENTRIES(lvl) \
+    (BIT(PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl), UL))
+
+#define GFN_MASK(lvl) (P2M_PAGETABLE_ENTRIES(lvl) - 1UL)
+
+#define P2M_LEVEL_SHIFT(lvl) (P2M_LEVEL_ORDER(lvl) + PAGE_SHIFT)
 
 #define paddr_bits PADDR_BITS
 
@@ -52,6 +85,16 @@ struct p2m_domain {
      * when a page is needed to be fully cleared and cleaned.
      */
     bool clean_dcache;
+
+    /* Highest guest frame that's ever been mapped in the p2m */
+    gfn_t max_mapped_gfn;
+
+    /*
+     * Lowest mapped gfn in the p2m. When releasing mapped gfn's in a
+     * preemptible manner this is updated to track where to resume
+     * the search. Apart from during teardown this can only decrease.
+     */
+    gfn_t lowest_mapped_gfn;
 };
 
 /*
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index e571257022..f13458712a 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -9,6 +9,7 @@
 #include <xen/rwlock.h>
 #include <xen/sched.h>
 #include <xen/sections.h>
+#include <xen/xvmalloc.h>
 
 #include <asm/csr.h>
 #include <asm/flushtlb.h>
@@ -17,6 +18,43 @@
 #include <asm/vmid.h>
 
 unsigned char __ro_after_init gstage_mode;
+unsigned int __ro_after_init gstage_root_level;
+
+/*
+ * The P2M root page table is extended by 2 bits, making its size 16KB
+ * (instead of 4KB for non-root page tables). Therefore, P2M root page
+ * is allocated as four consecutive 4KB pages (since alloc_domheap_pages()
+ * only allocates 4KB pages).
+ */
+#define ENTRIES_PER_ROOT_PAGE \
+    (P2M_PAGETABLE_ENTRIES(P2M_ROOT_LEVEL) / P2M_ROOT_ORDER)
+
+static inline unsigned int calc_offset(unsigned int lvl, vaddr_t va)
+{
+    unsigned int offset = (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
+
+    /*
+     * For P2M_ROOT_LEVEL, `offset` ranges from 0 to 2047, since the root
+     * page table spans 4 consecutive 4KB pages.
+     * We want to return an index within one of these 4 pages.
+     * The specific page to use is determined by `p2m_get_root_pointer()`.
+     *
+     * Example: if `offset == 512`:
+     *  - A single 4KB page holds 512 entries.
+     *  - Therefore, entry 512 corresponds to index 0 of the second page.
+     *
+     * At all other levels, only one page is allocated, and `offset` is
+     * always in the range 0 to 511, since the VPN is 9 bits long.
+     */
+    return offset % ENTRIES_PER_ROOT_PAGE;
+}
+
+#define P2M_MAX_ROOT_LEVEL 4
+
+#define P2M_DECLARE_OFFSETS(var, addr) \
+    unsigned int var[P2M_MAX_ROOT_LEVEL] = {-1};\
+    for ( unsigned int i = 0; i <= gstage_root_level; i++ ) \
+        var[i] = calc_offset(i, addr);
 
 static void __init gstage_mode_detect(void)
 {
@@ -54,6 +92,14 @@ static void __init gstage_mode_detect(void)
         if ( MASK_EXTR(csr_read(CSR_HGATP), HGATP_MODE_MASK) == mode )
         {
             gstage_mode = mode;
+            gstage_root_level = modes[mode_idx].paging_levels - 1;
+            /*
+             * The highest supported mode at the moment is Sv57, where L4
+             * is the root page table.
+             * If this changes in the future, P2M_MAX_ROOT_LEVEL must be
+             * updated accordingly.
+             */
+            ASSERT(gstage_root_level <= P2M_MAX_ROOT_LEVEL);
             break;
         }
     }
@@ -218,6 +264,9 @@ int p2m_init(struct domain *d)
     rwlock_init(&p2m->lock);
     INIT_PAGE_LIST_HEAD(&p2m->pages);
 
+    p2m->max_mapped_gfn = _gfn(0);
+    p2m->lowest_mapped_gfn = _gfn(ULONG_MAX);
+
     /*
      * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
      * is not ready for RISC-V support.
@@ -259,13 +308,293 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
     return rc;
 }
 
+/*
+ * Map one of the four root pages of the P2M root page table.
+ *
+ * The P2M root page table is larger than normal (16KB instead of 4KB),
+ * so it is allocated as four consecutive 4KB pages. This function selects
+ * the appropriate 4KB page based on the given GFN and returns a mapping
+ * to it.
+ *
+ * The caller is responsible for unmapping the page after use.
+ *
+ * Returns NULL if the calculated offset into the root table is invalid.
+ */
+static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
+{
+    unsigned long root_table_indx;
+
+    root_table_indx = gfn_x(gfn) >> P2M_LEVEL_ORDER(P2M_ROOT_LEVEL);
+    if ( root_table_indx >= P2M_ROOT_PAGES )
+        return NULL;
+
+    /*
+     * The P2M root page table is extended by 2 bits, making its size 16KB
+     * (instead of 4KB for non-root page tables). Therefore, p2m->root is
+     * allocated as four consecutive 4KB pages (since alloc_domheap_pages()
+     * only allocates 4KB pages).
+     *
+     * Initially, `root_table_indx` is derived directly from `va`.
+     * To locate the correct entry within a single 4KB page,
+     * we rescale the offset so it falls within one of the 4 pages.
+     *
+     * Example: if `root_table_indx == 512`
+     * - A 4KB page holds 512 entries.
+     * - Thus, entry 512 corresponds to index 0 of the second page.
+     */
+    root_table_indx /= ENTRIES_PER_ROOT_PAGE;
+
+    return __map_domain_page(p2m->root + root_table_indx);
+}
+
+static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
+{
+    write_pte(p, pte);
+
+    /*
+     * TODO: if multiple adjacent PTEs are written without releasing
+     *       the lock, the resulting redundant cache flushing can be a
+     *       performance issue.
+     */
+    if ( clean_pte )
+        clean_dcache_va_range(p, sizeof(*p));
+}
+
+static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
+{
+    pte_t pte = { .pte = 0 };
+
+    p2m_write_pte(p, pte, clean_pte);
+}
+
+static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
+{
+    panic("%s: hasn't been implemented yet\n", __func__);
+
+    return (pte_t) { .pte = 0 };
+}
+
+#define P2M_TABLE_MAP_NONE 0
+#define P2M_TABLE_MAP_NOMEM 1
+#define P2M_TABLE_SUPER_PAGE 2
+#define P2M_TABLE_NORMAL 3
+
+/*
+ * Take the currently mapped table, find the entry corresponding to the GFN,
+ * and map the next-level table if available. The previous table will be
+ * unmapped if the next level was mapped (e.g., when P2M_TABLE_NORMAL is
+ * returned).
+ *
+ * `alloc_tbl` parameter indicates whether intermediate tables should
+ * be allocated when not present.
+ *
+ * Return values:
+ *  P2M_TABLE_MAP_NONE: a table allocation isn't permitted.
+ *  P2M_TABLE_MAP_NOMEM: allocating a new page failed.
+ *  P2M_TABLE_SUPER_PAGE: the entry maps a superpage (no next level).
+ *  P2M_TABLE_NORMAL: the next-level table was mapped normally.
+ */
+static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
+                          unsigned int level, pte_t **table,
+                          unsigned int offset)
+{
+    panic("%s: hasn't been implemented yet\n", __func__);
+
+    return P2M_TABLE_MAP_NONE;
+}
+
+/* Free pte sub-tree behind an entry */
+static void p2m_free_subtree(struct p2m_domain *p2m,
+                             pte_t entry, unsigned int level)
+{
+    panic("%s: hasn't been implemented yet\n", __func__);
+}
+
+/* Insert an entry in the p2m */
+static int p2m_set_entry(struct p2m_domain *p2m,
+                         gfn_t gfn,
+                         unsigned long page_order,
+                         mfn_t mfn,
+                         p2m_type_t t)
+{
+    unsigned int level;
+    unsigned int target = page_order / PAGETABLE_ORDER;
+    pte_t *entry, *table, orig_pte;
+    int rc;
+    /*
+     * A mapping is removed only if the MFN is explicitly set to INVALID_MFN.
+     * Other MFNs that are considered invalid by mfn_valid() (e.g., MMIO)
+     * are still allowed.
+     */
+    bool removing_mapping = mfn_eq(mfn, INVALID_MFN);
+    P2M_DECLARE_OFFSETS(offsets, gfn_to_gaddr(gfn));
+
+    ASSERT(p2m_is_write_locked(p2m));
+
+    /*
+     * Check if the level target is valid: we only support
+     * 4K - 2M - 1G mapping.
+     */
+    ASSERT(target <= 2);
+
+    table = p2m_get_root_pointer(p2m, gfn);
+    if ( !table )
+        return -EINVAL;
+
+    for ( level = P2M_ROOT_LEVEL; level > target; level-- )
+    {
+        /*
+         * Don't try to allocate intermediate page table if the mapping
+         * is about to be removed.
+         */
+        rc = p2m_next_level(p2m, !removing_mapping,
+                            level, &table, offsets[level]);
+        if ( (rc == P2M_TABLE_MAP_NONE) || (rc == P2M_TABLE_MAP_NOMEM) )
+        {
+            rc = (rc == P2M_TABLE_MAP_NONE) ? -ENOENT : -ENOMEM;
+            /*
+             * We are here because p2m_next_level has failed to map
+             * the intermediate page table (e.g the table does not exist
+             * and none should be allocated). It is a valid case
+             * when removing a mapping as it may not exist in the
+             * page table. In this case, just ignore lookup failure.
+             */
+            rc = removing_mapping ? 0 : rc;
+            goto out;
+        }
+
+        if ( rc != P2M_TABLE_NORMAL )
+            break;
+    }
+
+    entry = table + offsets[level];
+
+    /*
+     * If we are here with level > target, we must be at a leaf node,
+     * and we need to break up the superpage.
+     */
+    if ( level > target )
+    {
+        panic("Shattering isn't implemented\n");
+    }
+
+    /*
+     * We should always be there with the correct level because all the
+     * intermediate tables have been installed if necessary.
+     */
+    ASSERT(level == target);
+
+    orig_pte = *entry;
+
+    if ( removing_mapping )
+        p2m_clean_pte(entry, p2m->clean_dcache);
+    else
+    {
+        pte_t pte = p2m_pte_from_mfn(mfn, t);
+
+        p2m_write_pte(entry, pte, p2m->clean_dcache);
+
+        p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
+                                      gfn_add(gfn, BIT(page_order, UL) - 1));
+        p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, gfn);
+    }
+
+    p2m->need_flush = true;
+
+    /*
+     * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
+     * is not ready for RISC-V support.
+     *
+     * When CONFIG_HAS_PASSTHROUGH=y, iommu_iotlb_flush() should be done
+     * here.
+     */
+#ifdef CONFIG_HAS_PASSTHROUGH
+#   error "add code to flush IOMMU TLB"
+#endif
+
+    rc = 0;
+
+    /*
+     * In case of a VALID -> INVALID transition, the original PTE should
+     * always be freed.
+     *
+     * In case of a VALID -> VALID transition, the original PTE should be
+     * freed only if the MFNs are different. If the MFNs are the same
+     * (i.e., only permissions differ), there is no need to free the
+     * original PTE.
+     */
+    if ( pte_is_valid(orig_pte) &&
+         (!pte_is_valid(*entry) ||
+         !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte))) )
+        p2m_free_subtree(p2m, orig_pte, level);
+
+ out:
+    unmap_domain_page(table);
+
+    return rc;
+}
+
+/* Return mapping order for given gfn, mfn and nr */
+static unsigned long p2m_mapping_order(gfn_t gfn, mfn_t mfn, unsigned long nr)
+{
+    unsigned long mask;
+    /* 1gb, 2mb, 4k mappings are supported */
+    unsigned int level = min(P2M_ROOT_LEVEL, _AC(2, U));
+    unsigned long order = 0;
+
+    mask = !mfn_eq(mfn, INVALID_MFN) ? mfn_x(mfn) : 0;
+    mask |= gfn_x(gfn);
+
+    for ( ; level != 0; level-- )
+    {
+        if ( !(mask & (BIT(P2M_LEVEL_ORDER(level), UL) - 1)) &&
+             (nr >= BIT(P2M_LEVEL_ORDER(level), UL)) )
+        {
+                order = P2M_LEVEL_ORDER(level);
+                break;
+        }
+    }
+
+    return order;
+}
+
 static int p2m_set_range(struct p2m_domain *p2m,
                          gfn_t sgfn,
                          unsigned long nr,
                          mfn_t smfn,
                          p2m_type_t t)
 {
-    return -EOPNOTSUPP;
+    int rc = 0;
+    unsigned long left = nr;
+
+    /*
+     * Any reference taken by the P2M mappings (e.g. foreign mapping) will
+     * be dropped in relinquish_p2m_mapping(). As the P2M will still
+     * be accessible after, we need to prevent mapping to be added when the
+     * domain is dying.
+     */
+    if ( unlikely(p2m->domain->is_dying) )
+        return -EACCES;
+
+    while ( left )
+    {
+        unsigned long order = p2m_mapping_order(sgfn, smfn, left);
+
+        rc = p2m_set_entry(p2m, sgfn, order, smfn, t);
+        if ( rc )
+            break;
+
+        sgfn = gfn_add(sgfn, BIT(order, UL));
+        if ( !mfn_eq(smfn, INVALID_MFN) )
+           smfn = mfn_add(smfn, BIT(order, UL));
+
+        left -= BIT(order, UL);
+    }
+
+    if ( left > INT_MAX )
+        rc = -EOVERFLOW;
+
+    return !left ? rc : left;
 }
 
 int map_regions_p2mt(struct domain *d,
-- 
2.51.0


Re: [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range()
Posted by Jan Beulich 3 months ago
On 20.10.2025 17:57, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -8,12 +8,45 @@
>  #include <xen/rwlock.h>
>  #include <xen/types.h>
>  
> +#include <asm/page.h>
>  #include <asm/page-bits.h>
>  
>  extern unsigned char gstage_mode;
> +extern unsigned int gstage_root_level;
>  
>  #define P2M_ROOT_ORDER  (ilog2(GSTAGE_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT)
>  #define P2M_ROOT_PAGES  BIT(P2M_ROOT_ORDER, U)
> +#define P2M_ROOT_LEVEL  gstage_root_level
> +
> +/*
> + * According to the RISC-V spec:
> + *   When hgatp.MODE specifies a translation scheme of Sv32x4, Sv39x4, Sv48x4,
> + *   or Sv57x4, G-stage address translation is a variation on the usual
> + *   page-based virtual address translation scheme of Sv32, Sv39, Sv48, or
> + *   Sv57, respectively. In each case, the size of the incoming address is
> + *   widened by 2 bits (to 34, 41, 50, or 59 bits).
> + *
> + * P2M_LEVEL_ORDER(lvl) defines the bit position in the GFN from which
> + * the index for this level of the P2M page table starts. The extra 2
> + * bits added by the "x4" schemes only affect the root page table width.
> + *
> + * Therefore, this macro can safely reuse XEN_PT_LEVEL_ORDER() for all
> + * levels: the extra 2 bits do not change the indices of lower levels.
> + *
> + * The extra 2 bits are only relevant if one tried to address beyond the
> + * root level (i.e., P2M_LEVEL_ORDER(P2M_ROOT_LEVEL + 1)), which is
> + * invalid.
> + */
> +#define P2M_LEVEL_ORDER(lvl) XEN_PT_LEVEL_ORDER(lvl)

Is the last paragraph of the comment really needed? It talks about something
absurd / impossible only.

> +#define P2M_ROOT_EXTRA_BITS(lvl) (2 * ((lvl) == P2M_ROOT_LEVEL))
> +
> +#define P2M_PAGETABLE_ENTRIES(lvl) \
> +    (BIT(PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl), UL))
> +
> +#define GFN_MASK(lvl) (P2M_PAGETABLE_ENTRIES(lvl) - 1UL)

If I'm not mistaken, this is a mask with the low 10 or 12 bits set.
That's not really something you can apply to a GFN, unlike the name
suggests.

> +#define P2M_LEVEL_SHIFT(lvl) (P2M_LEVEL_ORDER(lvl) + PAGE_SHIFT)

Whereas here the macro name doesn't make clear what is shifted: An
address or a GFN. (It's the former, aiui.)

> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -9,6 +9,7 @@
>  #include <xen/rwlock.h>
>  #include <xen/sched.h>
>  #include <xen/sections.h>
> +#include <xen/xvmalloc.h>
>  
>  #include <asm/csr.h>
>  #include <asm/flushtlb.h>
> @@ -17,6 +18,43 @@
>  #include <asm/vmid.h>
>  
>  unsigned char __ro_after_init gstage_mode;
> +unsigned int __ro_after_init gstage_root_level;

Like for mode, I'm unconvinced of this being a global (and not per-P2M /
per-domain).

> +/*
> + * The P2M root page table is extended by 2 bits, making its size 16KB
> + * (instead of 4KB for non-root page tables). Therefore, P2M root page
> + * is allocated as four consecutive 4KB pages (since alloc_domheap_pages()
> + * only allocates 4KB pages).
> + */
> +#define ENTRIES_PER_ROOT_PAGE \
> +    (P2M_PAGETABLE_ENTRIES(P2M_ROOT_LEVEL) / P2M_ROOT_ORDER)
> +
> +static inline unsigned int calc_offset(unsigned int lvl, vaddr_t va)

Where would a vaddr_t come from here? Your input are guest-physical addresses,
if I'm not mistaken.

> +{
> +    unsigned int offset = (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
> +
> +    /*
> +     * For P2M_ROOT_LEVEL, `offset` ranges from 0 to 2047, since the root
> +     * page table spans 4 consecutive 4KB pages.
> +     * We want to return an index within one of these 4 pages.
> +     * The specific page to use is determined by `p2m_get_root_pointer()`.
> +     *
> +     * Example: if `offset == 512`:
> +     *  - A single 4KB page holds 512 entries.
> +     *  - Therefore, entry 512 corresponds to index 0 of the second page.
> +     *
> +     * At all other levels, only one page is allocated, and `offset` is
> +     * always in the range 0 to 511, since the VPN is 9 bits long.
> +     */
> +    return offset % ENTRIES_PER_ROOT_PAGE;

Seeing something "root" used here (when this is for all levels) is pretty odd,
despite all the commentary. Given all the commentary, why not simply

    return offset & ((1U << PAGETABLE_ORDER) - 1);

?

> +}
> +
> +#define P2M_MAX_ROOT_LEVEL 4
> +
> +#define P2M_DECLARE_OFFSETS(var, addr) \
> +    unsigned int var[P2M_MAX_ROOT_LEVEL] = {-1};\
> +    for ( unsigned int i = 0; i <= gstage_root_level; i++ ) \
> +        var[i] = calc_offset(i, addr);

This surely is more than just "declare", and it's dealing with all levels no
matter whether you actually will use all offsets.

> @@ -259,13 +308,293 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>      return rc;
>  }
>  
> +/*
> + * Map one of the four root pages of the P2M root page table.
> + *
> + * The P2M root page table is larger than normal (16KB instead of 4KB),
> + * so it is allocated as four consecutive 4KB pages. This function selects
> + * the appropriate 4KB page based on the given GFN and returns a mapping
> + * to it.
> + *
> + * The caller is responsible for unmapping the page after use.
> + *
> + * Returns NULL if the calculated offset into the root table is invalid.
> + */
> +static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
> +{
> +    unsigned long root_table_indx;
> +
> +    root_table_indx = gfn_x(gfn) >> P2M_LEVEL_ORDER(P2M_ROOT_LEVEL);

With the variable name shortened (to e.g. idx) this could be its initializer
without ending up with too long a line. The root_table_ prefix isn't really
adding much value in the context of this function.

> +    if ( root_table_indx >= P2M_ROOT_PAGES )
> +        return NULL;
> +
> +    /*
> +     * The P2M root page table is extended by 2 bits, making its size 16KB
> +     * (instead of 4KB for non-root page tables). Therefore, p2m->root is
> +     * allocated as four consecutive 4KB pages (since alloc_domheap_pages()
> +     * only allocates 4KB pages).
> +     *
> +     * Initially, `root_table_indx` is derived directly from `va`.

There's no 'va' here.

> +static inline void p2m_clean_pte(pte_t *p, bool clean_pte)

"clean_pte" as a parameter name of a function of this name is, well, odd.

> +/* Insert an entry in the p2m */
> +static int p2m_set_entry(struct p2m_domain *p2m,
> +                         gfn_t gfn,
> +                         unsigned long page_order,
> +                         mfn_t mfn,
> +                         p2m_type_t t)
> +{
> +    unsigned int level;
> +    unsigned int target = page_order / PAGETABLE_ORDER;
> +    pte_t *entry, *table, orig_pte;
> +    int rc;
> +    /*
> +     * A mapping is removed only if the MFN is explicitly set to INVALID_MFN.
> +     * Other MFNs that are considered invalid by mfn_valid() (e.g., MMIO)
> +     * are still allowed.
> +     */
> +    bool removing_mapping = mfn_eq(mfn, INVALID_MFN);
> +    P2M_DECLARE_OFFSETS(offsets, gfn_to_gaddr(gfn));
> +
> +    ASSERT(p2m_is_write_locked(p2m));
> +
> +    /*
> +     * Check if the level target is valid: we only support
> +     * 4K - 2M - 1G mapping.
> +     */
> +    ASSERT(target <= 2);
> +
> +    table = p2m_get_root_pointer(p2m, gfn);
> +    if ( !table )
> +        return -EINVAL;
> +
> +    for ( level = P2M_ROOT_LEVEL; level > target; level-- )
> +    {
> +        /*
> +         * Don't try to allocate intermediate page table if the mapping
> +         * is about to be removed.
> +         */
> +        rc = p2m_next_level(p2m, !removing_mapping,
> +                            level, &table, offsets[level]);
> +        if ( (rc == P2M_TABLE_MAP_NONE) || (rc == P2M_TABLE_MAP_NOMEM) )
> +        {
> +            rc = (rc == P2M_TABLE_MAP_NONE) ? -ENOENT : -ENOMEM;
> +            /*
> +             * We are here because p2m_next_level has failed to map
> +             * the intermediate page table (e.g the table does not exist
> +             * and none should be allocated). It is a valid case
> +             * when removing a mapping as it may not exist in the
> +             * page table. In this case, just ignore lookup failure.
> +             */
> +            rc = removing_mapping ? 0 : rc;
> +            goto out;
> +        }
> +
> +        if ( rc != P2M_TABLE_NORMAL )
> +            break;
> +    }
> +
> +    entry = table + offsets[level];
> +
> +    /*
> +     * If we are here with level > target, we must be at a leaf node,
> +     * and we need to break up the superpage.
> +     */
> +    if ( level > target )
> +    {
> +        panic("Shattering isn't implemented\n");
> +    }
> +
> +    /*
> +     * We should always be there with the correct level because all the
> +     * intermediate tables have been installed if necessary.
> +     */
> +    ASSERT(level == target);
> +
> +    orig_pte = *entry;
> +
> +    if ( removing_mapping )
> +        p2m_clean_pte(entry, p2m->clean_dcache);
> +    else
> +    {
> +        pte_t pte = p2m_pte_from_mfn(mfn, t);
> +
> +        p2m_write_pte(entry, pte, p2m->clean_dcache);
> +
> +        p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
> +                                      gfn_add(gfn, BIT(page_order, UL) - 1));
> +        p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, gfn);
> +    }
> +
> +    p2m->need_flush = true;
> +
> +    /*
> +     * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
> +     * is not ready for RISC-V support.
> +     *
> +     * When CONFIG_HAS_PASSTHROUGH=y, iommu_iotlb_flush() should be done
> +     * here.
> +     */
> +#ifdef CONFIG_HAS_PASSTHROUGH
> +#   error "add code to flush IOMMU TLB"
> +#endif
> +
> +    rc = 0;
> +
> +    /*
> +     * In case of a VALID -> INVALID transition, the original PTE should
> +     * always be freed.
> +     *
> +     * In case of a VALID -> VALID transition, the original PTE should be
> +     * freed only if the MFNs are different. If the MFNs are the same
> +     * (i.e., only permissions differ), there is no need to free the
> +     * original PTE.
> +     */
> +    if ( pte_is_valid(orig_pte) &&
> +         (!pte_is_valid(*entry) ||
> +         !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte))) )

Besides my continued impression of this condition being more complex than it
ought to be expected, indentation is off by one on the last of the three lines.
(Since, otoh, I can't suggest any simpler expression (for now), this isn't a
request to further change it.)

> +/* Return mapping order for given gfn, mfn and nr */
> +static unsigned long p2m_mapping_order(gfn_t gfn, mfn_t mfn, unsigned long nr)
> +{
> +    unsigned long mask;
> +    /* 1gb, 2mb, 4k mappings are supported */
> +    unsigned int level = min(P2M_ROOT_LEVEL, _AC(2, U));

Further up you have such a literal 2 already - please make a constant, so all
instances can easily be associated with one another.

> +    unsigned long order = 0;
> +
> +    mask = !mfn_eq(mfn, INVALID_MFN) ? mfn_x(mfn) : 0;
> +    mask |= gfn_x(gfn);
> +
> +    for ( ; level != 0; level-- )
> +    {
> +        if ( !(mask & (BIT(P2M_LEVEL_ORDER(level), UL) - 1)) &&
> +             (nr >= BIT(P2M_LEVEL_ORDER(level), UL)) )
> +        {
> +                order = P2M_LEVEL_ORDER(level);
> +                break;

I'm pretty sure I did complain about the too deep indentation here already.

> +        }
> +    }
> +
> +    return order;
> +}
> +
>  static int p2m_set_range(struct p2m_domain *p2m,
>                           gfn_t sgfn,
>                           unsigned long nr,
>                           mfn_t smfn,
>                           p2m_type_t t)
>  {
> -    return -EOPNOTSUPP;
> +    int rc = 0;
> +    unsigned long left = nr;
> +
> +    /*
> +     * Any reference taken by the P2M mappings (e.g. foreign mapping) will
> +     * be dropped in relinquish_p2m_mapping(). As the P2M will still
> +     * be accessible after, we need to prevent mapping to be added when the
> +     * domain is dying.
> +     */
> +    if ( unlikely(p2m->domain->is_dying) )
> +        return -EACCES;
> +
> +    while ( left )
> +    {
> +        unsigned long order = p2m_mapping_order(sgfn, smfn, left);
> +
> +        rc = p2m_set_entry(p2m, sgfn, order, smfn, t);
> +        if ( rc )
> +            break;
> +
> +        sgfn = gfn_add(sgfn, BIT(order, UL));
> +        if ( !mfn_eq(smfn, INVALID_MFN) )
> +           smfn = mfn_add(smfn, BIT(order, UL));

Off-by-1 indentation again.

Jan
Re: [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range()
Posted by Oleksii Kurochko 2 months, 3 weeks ago
On 11/10/25 3:53 PM, Jan Beulich wrote:
> On 20.10.2025 17:57, Oleksii Kurochko wrote:
>> --- a/xen/arch/riscv/include/asm/p2m.h
>> +++ b/xen/arch/riscv/include/asm/p2m.h
>> @@ -8,12 +8,45 @@
>>   #include <xen/rwlock.h>
>>   #include <xen/types.h>
>>   
>> +#include <asm/page.h>
>>   #include <asm/page-bits.h>
>>   
>>   extern unsigned char gstage_mode;
>> +extern unsigned int gstage_root_level;
>>   
>>   #define P2M_ROOT_ORDER  (ilog2(GSTAGE_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT)
>>   #define P2M_ROOT_PAGES  BIT(P2M_ROOT_ORDER, U)
>> +#define P2M_ROOT_LEVEL  gstage_root_level
>> +
>> +/*
>> + * According to the RISC-V spec:
>> + *   When hgatp.MODE specifies a translation scheme of Sv32x4, Sv39x4, Sv48x4,
>> + *   or Sv57x4, G-stage address translation is a variation on the usual
>> + *   page-based virtual address translation scheme of Sv32, Sv39, Sv48, or
>> + *   Sv57, respectively. In each case, the size of the incoming address is
>> + *   widened by 2 bits (to 34, 41, 50, or 59 bits).
>> + *
>> + * P2M_LEVEL_ORDER(lvl) defines the bit position in the GFN from which
>> + * the index for this level of the P2M page table starts. The extra 2
>> + * bits added by the "x4" schemes only affect the root page table width.
>> + *
>> + * Therefore, this macro can safely reuse XEN_PT_LEVEL_ORDER() for all
>> + * levels: the extra 2 bits do not change the indices of lower levels.
>> + *
>> + * The extra 2 bits are only relevant if one tried to address beyond the
>> + * root level (i.e., P2M_LEVEL_ORDER(P2M_ROOT_LEVEL + 1)), which is
>> + * invalid.
>> + */
>> +#define P2M_LEVEL_ORDER(lvl) XEN_PT_LEVEL_ORDER(lvl)
> Is the last paragraph of the comment really needed? It talks about something
> absurd / impossible only.

Agree, it isn't really needed; let's drop it.

>
>> +#define P2M_ROOT_EXTRA_BITS(lvl) (2 * ((lvl) == P2M_ROOT_LEVEL))
>> +
>> +#define P2M_PAGETABLE_ENTRIES(lvl) \
>> +    (BIT(PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl), UL))
>> +
>> +#define GFN_MASK(lvl) (P2M_PAGETABLE_ENTRIES(lvl) - 1UL)
> If I'm not mistaken, this is a mask with the low 10 or 12 bits set.

I'm not sure I fully understand you here. With the current implementation,
it returns a bitmask that corresponds to the number of index bits used
at each level. So, if P2M_ROOT_LEVEL = 2, then:
   GFN_MASK(0) = 0x1ff (9-bit GFN for the level 0)
   GFN_MASK(1) = 0x1ff (9-bit GFN width for level 1)
   GFN_MASK(2) = 0x7ff (11-bit GFN width for level 2)

Or do you mean that GFN_MASK(lvl) should return something like this:
   GFN_MASK_(0) = 0x1FF000      (0x1ff << 0xc)
   GFN_MASK_(1) = 0x3FE00000    (GFN_MASK_(0) << 9)
   GFN_MASK_(2) = 0x1FFC0000000 (GFN_MASK_(1) << 9 + extra 2 bits)
And then here ...

> That's not really something you can apply to a GFN, unlike the name
> suggests.

That is why virtual address should be properly shifted before, something
like it is done in calc_offset():
   (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);

...
  (va & GFN_MASK_(lvl)) >> P2M_LEVEL_SHIFT(lvl) ?
In this option more shifts will be needed.
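
For reference, a sketch of the concrete per-level values both variants reduce
to for Sv39x4 (P2M_ROOT_LEVEL == 2), assuming the definitions in this patch:

   /* lvl:                   0        1        2     */
   /* P2M_LEVEL_SHIFT(lvl):  12       21       30    */
   /* GFN_MASK(lvl):         0x1ff    0x1ff    0x7ff */
   index = (gpa >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
   /* the pre-shifted masks would be 0x1ff000, 0x3fe00000, 0x1ffc0000000 */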

Would it be better to just rename GFN_MASK() to P2M_PT_INDEX_MASK()? Or,
maybe, even just P2M_INDEX_MASK().

>
>> +#define P2M_LEVEL_SHIFT(lvl) (P2M_LEVEL_ORDER(lvl) + PAGE_SHIFT)
> Whereas here the macro name doesn't make clear what is shifted: An
> address or a GFN. (It's the former, aiui.)

Yes, it is expected to be used to shift gfn.

Similar to the above, would it be better to rename P2M_LEVEL_SHIFT() to
P2M_GFN_LEVEL_SHIFT()?

>
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -9,6 +9,7 @@
>>   #include <xen/rwlock.h>
>>   #include <xen/sched.h>
>>   #include <xen/sections.h>
>> +#include <xen/xvmalloc.h>
>>   
>>   #include <asm/csr.h>
>>   #include <asm/flushtlb.h>
>> @@ -17,6 +18,43 @@
>>   #include <asm/vmid.h>
>>   
>>   unsigned char __ro_after_init gstage_mode;
>> +unsigned int __ro_after_init gstage_root_level;
> Like for mode, I'm unconvinced of this being a global (and not per-P2M /
> per-domain).

The question is then if we really will (or want to) have cases when gstage
mode will be different per-domain/per-p2m?

>
>> +/*
>> + * The P2M root page table is extended by 2 bits, making its size 16KB
>> + * (instead of 4KB for non-root page tables). Therefore, P2M root page
>> + * is allocated as four consecutive 4KB pages (since alloc_domheap_pages()
>> + * only allocates 4KB pages).
>> + */
>> +#define ENTRIES_PER_ROOT_PAGE \
>> +    (P2M_PAGETABLE_ENTRIES(P2M_ROOT_LEVEL) / P2M_ROOT_ORDER)
>> +
>> +static inline unsigned int calc_offset(unsigned int lvl, vaddr_t va)
> Where would a vaddr_t come from here? Your input are guest-physical addresses,
> if I'm not mistaken.

You are right. Would it be right to use 'paddr_t gpa' here? Or is paddr_t
supposed to be used only with machine physical addresses?


>
>> +{
>> +    unsigned int offset = (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
>> +
>> +    /*
>> +     * For P2M_ROOT_LEVEL, `offset` ranges from 0 to 2047, since the root
>> +     * page table spans 4 consecutive 4KB pages.
>> +     * We want to return an index within one of these 4 pages.
>> +     * The specific page to use is determined by `p2m_get_root_pointer()`.
>> +     *
>> +     * Example: if `offset == 512`:
>> +     *  - A single 4KB page holds 512 entries.
>> +     *  - Therefore, entry 512 corresponds to index 0 of the second page.
>> +     *
>> +     * At all other levels, only one page is allocated, and `offset` is
>> +     * always in the range 0 to 511, since the VPN is 9 bits long.
>> +     */
>> +    return offset % ENTRIES_PER_ROOT_PAGE;
> Seeing something "root" used here (when this is for all levels) is pretty odd,
> despite all the commentary. Given all the commentary, why not simply
>
>      return offset & ((1U << PAGETABLE_ORDER) - 1);
>
> ?

It works for all levels where lvl < P2M_ROOT_LEVEL, because in those cases the GFN
bit length is equal to PAGETABLE_ORDER. However, at the root level the GFN bit length
is 2 bits larger. So something like the following is needed:
   offset & ((1U << (PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl))) - 1);
This still returns an offset within a single 16 KB page, but in the case of the P2M
root we actually have four consecutive 4 KB pages, so the intention was to return
an offset inside one of those four 4 KB pages.

While writing the above, I started thinking whether calc_offset() could be implemented
much more simply. Since the root page table consists of four consecutive pages, it seems
acceptable to have the offset in the range [0, 2^11) instead of doing all the extra
manipulation to determine which of the four pages is used and the offset within that
specific page:

   static inline unsigned int calc_offset(unsigned int lvl, paddr_t gpa)
   {
      return (gpa >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
   }

   static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
   {
     return __map_domain_page(p2m->root);
   }

It probably still makes sense for p2m_get_root_pointer() to check that the root GFN
index is not larger than 2^11.

Am I missing something?

>
>> +}
>> +
>> +#define P2M_MAX_ROOT_LEVEL 4
>> +
>> +#define P2M_DECLARE_OFFSETS(var, addr) \
>> +    unsigned int var[P2M_MAX_ROOT_LEVEL] = {-1};\
>> +    for ( unsigned int i = 0; i <= gstage_root_level; i++ ) \
>> +        var[i] = calc_offset(i, addr);
> This surely is more than just "declare", and it's dealing with all levels no
> matter whether you actually will use all offsets.

I will rename P2M_DECLARE_OFFSETS to P2M_BUILD_LEVEL_OFFSETS().

But how can I know which offset I will actually need to use?
If we take the following loop as an example:
   for ( level = P2M_ROOT_LEVEL; level > target; level-- )
   {
       /*
        * Don't try to allocate intermediate page tables if the mapping
        * is about to be removed.
        */
       rc = p2m_next_level(p2m, !removing_mapping,
                           level, &table, offsets[level]);
       ...
   }

It walks from P2M_ROOT_LEVEL down to target, where target is determined at runtime.

If you mean that, for example, when the G-stage mode is Sv39, there is no need to allocate
an array with 4 entries (or 5 entries if we consider Sv57, so P2M_MAX_ROOT_LEVEL should be
updated), because Sv39 only uses 3 page table levels — then yes, in theory it could be
smaller. But I don't think it is a real issue if the offsets[] array on the stack has a
few extra unused entries.

If preferred, I could allocate the array dynamically based on gstage_root_level.
Would that be better?

>
>> @@ -259,13 +308,293 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>>       return rc;
>>   }
>>   
>> +/*
>> + * Map one of the four root pages of the P2M root page table.
>> + *
>> + * The P2M root page table is larger than normal (16KB instead of 4KB),
>> + * so it is allocated as four consecutive 4KB pages. This function selects
>> + * the appropriate 4KB page based on the given GFN and returns a mapping
>> + * to it.
>> + *
>> + * The caller is responsible for unmapping the page after use.
>> + *
>> + * Returns NULL if the calculated offset into the root table is invalid.
>> + */
>> +static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>> +{
>> +    unsigned long root_table_indx;
>> +
>> +    root_table_indx = gfn_x(gfn) >> P2M_LEVEL_ORDER(P2M_ROOT_LEVEL);
> With the variable name shortened (to e.g. idx) this could be its initializer
> without ending up with too long a line. The root_table_ prefix isn't really
> adding much value in the context of this function.
>
>> +    if ( root_table_indx >= P2M_ROOT_PAGES )
>> +        return NULL;
>> +
>> +    /*
>> +     * The P2M root page table is extended by 2 bits, making its size 16KB
>> +     * (instead of 4KB for non-root page tables). Therefore, p2m->root is
>> +     * allocated as four consecutive 4KB pages (since alloc_domheap_pages()
>> +     * only allocates 4KB pages).
>> +     *
>> +     * Initially, `root_table_indx` is derived directly from `va`.
> There's no 'va' here.

Should be gfn.

>
>> +static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
> "clean_pte" as a parameter name of a function of this name is, well, odd.

clean_cache should be better:
   p2m_clean_pte(pte_t *p, bool clean_cache)

I suppose it would be nice to rename clean_pte to clean_cache everywhere.

Thanks.

~ Oleksii
Re: [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range()
Posted by Jan Beulich 2 months, 3 weeks ago
On 14.11.2025 18:04, Oleksii Kurochko wrote:
> On 11/10/25 3:53 PM, Jan Beulich wrote:
>> On 20.10.2025 17:57, Oleksii Kurochko wrote:
>>> +#define GFN_MASK(lvl) (P2M_PAGETABLE_ENTRIES(lvl) - 1UL)
>> If I'm not mistaken, this is a mask with the low 10 or 12 bits set.
> 
> I'm not sure I fully understand you here. With the current implementation,
> it returns a bitmask that corresponds to the number of index bits used
> at each level. So, if P2M_ROOT_LEVEL = 2, then:
>    GFN_MASK(0) = 0x1ff (9-bit GFN width for level 0)
>    GFN_MASK(1) = 0x1ff (9-bit GFN width for level 1)
>    GFN_MASK(2) = 0x7ff (11-bit GFN width for level 2)

Oh, sorry, 9 and 11 bits is what I meant.

> Or do you mean that GFN_MASK(lvl) should return something like this:
>    GFN_MASK_(0) = 0x1FF000       (0x1ff << 0xc)
>    GFN_MASK_(1) = 0x3FE00000     (GFN_MASK_(0) << 9)
>    GFN_MASK_(2) = 0x1FFC0000000  (GFN_MASK_(1) << 9, plus the 2 extra bits)

Yes.

> And then here ...
> 
>> That's not really something you can apply to a GFN, unlike the name
>> suggests.
> 
> That is why virtual address should be properly shifted before, something
> like it is done in calc_offset():

Please can we stop calling guest physical addresses "virtual address"?

>    (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
> 
> ...
>   (va & GFN_MASK_(lvl)) >> P2M_LEVEL_SHIFT(lvl) ?
> In this option more shifts will be needed.

It's okay to try to limit the number of shifts needed, but the macros need
naming accordingly.

> Would it be better to just rename GFN_MASK() to P2M_PT_INDEX_MASK()? Or,
> maybe, even just P2M_INDEX_MASK().

Perhaps. I would recommend though that you take a look at other ports'
naming. In x86, for example, we have l<N>_table_offset().
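
To illustrate, something along these lines (a sketch only; the names are merely
suggestions, the arithmetic is what calc_offset() already does):

    /* Masks a page-table index, not a GFN, hence the suggested name. */
    #define P2M_INDEX_MASK(lvl) (P2M_PAGETABLE_ENTRIES(lvl) - 1UL)

    /* x86-style naming: index into the level-'lvl' table for address 'gpa'. */
    static inline unsigned int p2m_table_offset(unsigned int lvl, paddr_t gpa)
    {
        return (gpa >> P2M_LEVEL_SHIFT(lvl)) & P2M_INDEX_MASK(lvl);
    }
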

>>> --- a/xen/arch/riscv/p2m.c
>>> +++ b/xen/arch/riscv/p2m.c
>>> @@ -9,6 +9,7 @@
>>>   #include <xen/rwlock.h>
>>>   #include <xen/sched.h>
>>>   #include <xen/sections.h>
>>> +#include <xen/xvmalloc.h>
>>>   
>>>   #include <asm/csr.h>
>>>   #include <asm/flushtlb.h>
>>> @@ -17,6 +18,43 @@
>>>   #include <asm/vmid.h>
>>>   
>>>   unsigned char __ro_after_init gstage_mode;
>>> +unsigned int __ro_after_init gstage_root_level;
>> Like for mode, I'm unconvinced of this being a global (and not per-P2M /
>> per-domain).
> 
> The question then is whether we really will (or want to) have cases where the gstage
> mode differs per-domain/per-p2m?

Can you explain to me why you think we wouldn't want that, sooner or later?

>>> +/*
>>> + * The P2M root page table is extended by 2 bits, making its size 16KB
>>> + * (instead of 4KB for non-root page tables). Therefore, P2M root page
>>> + * is allocated as four consecutive 4KB pages (since alloc_domheap_pages()
>>> + * only allocates 4KB pages).
>>> + */
>>> +#define ENTRIES_PER_ROOT_PAGE \
>>> +    (P2M_PAGETABLE_ENTRIES(P2M_ROOT_LEVEL) / P2M_ROOT_ORDER)
>>> +
>>> +static inline unsigned int calc_offset(unsigned int lvl, vaddr_t va)
>> Where would a vaddr_t come from here? Your input are guest-physical addresses,
>> if I'm not mistaken.
> 
> You are right. Would it be right to use 'paddr_t gpa' here? Or is paddr_t supposed to be
> used only with machine physical addresses?

In x86 we use paddr_t in such cases. Arm iirc additionally has gaddr_t.

>>> +#define P2M_MAX_ROOT_LEVEL 4
>>> +
>>> +#define P2M_DECLARE_OFFSETS(var, addr) \
>>> +    unsigned int var[P2M_MAX_ROOT_LEVEL] = {-1};\
>>> +    for ( unsigned int i = 0; i <= gstage_root_level; i++ ) \
>>> +        var[i] = calc_offset(i, addr);
>> This surely is more than just "declare", and it's dealing with all levels no
>> matter whether you actually will use all offsets.
> 
> I will rename P2M_DECLARE_OFFSETS() to P2M_BUILD_LEVEL_OFFSETS().
> 
> But how can I know which offset I will actually need to use?
> If we take the following loop as an example:
>    for ( level = P2M_ROOT_LEVEL; level > target; level-- )
>    {
>        /*
>         * Don't try to allocate intermediate page tables if the mapping
>         * is about to be removed.
>         */
>        rc = p2m_next_level(p2m, !removing_mapping,
>                            level, &table, offsets[level]);
>        ...
>    }
> 
> It walks from P2M_ROOT_LEVEL down to target, where target is determined at runtime.
> 
> If you mean that, for example, when the G-stage mode is Sv39, there is no need to allocate
> an array with 4 entries (or 5 entries if we consider Sv57, so P2M_MAX_ROOT_LEVEL should be
> updated), because Sv39 only uses 3 page table levels — then yes, in theory it could be
> smaller. But I don't think it is a real issue if the offsets[] array on the stack has a
> few extra unused entries.
> 
> If preferred, I could allocate the array dynamically based on gstage_root_level.
> Would that be better?

Having a few unused entries isn't a big deal imo. What I'm not happy with here is
that you may _initialize_ more entries than actually needed. I have no good
suggestion within the conceptual framework you use for page walking (the same
issue iirc exists in host page table walks, just that the calculations there are
cheaper).

Jan

Re: [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range()
Posted by Oleksii Kurochko 2 months, 3 weeks ago
On 11/17/25 9:56 AM, Jan Beulich wrote:
>>>> +#define P2M_MAX_ROOT_LEVEL 4
>>>> +
>>>> +#define P2M_DECLARE_OFFSETS(var, addr) \
>>>> +    unsigned int var[P2M_MAX_ROOT_LEVEL] = {-1};\
>>>> +    for ( unsigned int i = 0; i <= gstage_root_level; i++ ) \
>>>> +        var[i] = calc_offset(i, addr);
>>> This surely is more than just "declare", and it's dealing with all levels no
>>> matter whether you actually will use all offsets.
>> I will rename P2M_DECLARE_OFFSETS() to P2M_BUILD_LEVEL_OFFSETS().
>>
>> But how can I know which offset I will actually need to use?
>> If we take the following loop as an example:
>>     for ( level = P2M_ROOT_LEVEL; level > target; level-- )
>>     {
>>         /*
>>          * Don't try to allocate intermediate page tables if the mapping
>>          * is about to be removed.
>>          */
>>         rc = p2m_next_level(p2m, !removing_mapping,
>>                             level, &table, offsets[level]);
>>         ...
>>     }
>>
>> It walks from P2M_ROOT_LEVEL down to target, where target is determined at runtime.
>>
>> If you mean that, for example, when the G-stage mode is Sv39, there is no need to allocate
>> an array with 4 entries (or 5 entries if we consider Sv57, so P2M_MAX_ROOT_LEVEL should be
>> updated), because Sv39 only uses 3 page table levels — then yes, in theory it could be
>> smaller. But I don't think it is a real issue if the offsets[] array on the stack has a
>> few extra unused entries.
>>
>> If preferred, I could allocate the array dynamically based on gstage_root_level.
>> Would that be better?
> Having a few unused entries isn't a big deal imo. What I'm not happy with here is
> that you may _initialize_ more entries than actually needed. I have no good
> suggestion within the conceptual framework you use for page walking (the same
> issue iirc exists in host page table walks, just that the calculations there are
> cheaper).

The loop inside P2M_DECLARE_OFFSETS() uses gstage_root_level, so only the entries that
are actually going to be used are initialized.

You probably mean that it’s possible we don’t need to walk all the tables, because for
some GFNs a leaf page-table entry appears earlier than the L0 page table? IMO, it’s not
really a big deal: at worst we just spend some time calculating something that isn’t
actually needed, and since that is at most two extra calc_offset() calls (for a 1G
mapping, the L1 and L0 offsets are calculated for no reason), it won't affect performance
too much.
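
That said, if it ever turns out to matter, the macro could take the lowest level the walk
will actually reach (just a sketch; it assumes the target level is known before the
offsets are built):

    #define P2M_BUILD_LEVEL_OFFSETS(var, addr, lowest) \
        unsigned int var[P2M_MAX_ROOT_LEVEL] = { -1 }; \
        for ( unsigned int i_ = (lowest); i_ <= gstage_root_level; i_++ ) \
            var[i_] = calc_offset(i_, addr);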

~ Oleksii
Re: [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range()
Posted by Jan Beulich 2 months, 3 weeks ago
On 18.11.2025 16:28, Oleksii Kurochko wrote:
> 
> On 11/17/25 9:56 AM, Jan Beulich wrote:
>>>>> +#define P2M_MAX_ROOT_LEVEL 4
>>>>> +
>>>>> +#define P2M_DECLARE_OFFSETS(var, addr) \
>>>>> +    unsigned int var[P2M_MAX_ROOT_LEVEL] = {-1};\
>>>>> +    for ( unsigned int i = 0; i <= gstage_root_level; i++ ) \
>>>>> +        var[i] = calc_offset(i, addr);
>>>> This surely is more than just "declare", and it's dealing with all levels no
>>>> matter whether you actually will use all offsets.
>>> I will rename P2M_DECLARE_OFFSETS() to P2M_BUILD_LEVEL_OFFSETS().
>>>
>>> But how can I know which offset I will actually need to use?
>>> If we take the following loop as an example:
>>>     for ( level = P2M_ROOT_LEVEL; level > target; level-- )
>>>     {
>>>         /*
>>>          * Don't try to allocate intermediate page tables if the mapping
>>>          * is about to be removed.
>>>          */
>>>         rc = p2m_next_level(p2m, !removing_mapping,
>>>                             level, &table, offsets[level]);
>>>         ...
>>>     }
>>>
>>> It walks from P2M_ROOT_LEVEL down to target, where target is determined at runtime.
>>>
>>> If you mean that, for example, when the G-stage mode is Sv39, there is no need to allocate
>>> an array with 4 entries (or 5 entries if we consider Sv57, so P2M_MAX_ROOT_LEVEL should be
>>> updated), because Sv39 only uses 3 page table levels — then yes, in theory it could be
>>> smaller. But I don't think it is a real issue if the offsets[] array on the stack has a
>>> few extra unused entries.
>>>
>>> If preferred, I could allocate the array dynamically based on gstage_root_level.
>>> Would that be better?
>> Having a few unused entries isn't a big deal imo. What I'm not happy with here is
>> that you may _initialize_ more entries than actually needed. I have no good
>> suggestion within the conceptual framework you use for page walking (the same
>> issue iirc exists in host page table walks, just that the calculations there are
>> cheaper).
> 
> The loop inside P2M_DECLARE_OFFSETS() uses gstage_root_level, so only the entries that
> are actually going to be used are initialized.
> 
> You probably mean that it’s possible we don’t need to walk all the tables, because for
> some GFNs a leaf page-table entry appears earlier than the L0 page table?

Yes.

> IMO, it’s not
> really a big deal: at worst we just spend some time calculating something that isn’t
> actually needed, and since that is at most two extra calc_offset() calls (for a 1G
> mapping, the L1 and L0 offsets are calculated for no reason), it won't affect performance
> too much.

Well, it's your call in the end.

Jan