[v3] xen/riscv: introduce p2m functionality

[PATCH v3 12/20] xen/riscv: implement p2m_set_range()

Posted by Oleksii Kurochko 3 months ago

This patch introduces p2m_set_range() and its core helper p2m_set_entry() for
RISC-V, based loosely on the Arm implementation, with several RISC-V-specific
modifications.

The main changes are:
- Simplification of Break-Before-Make (BBM) approach as according to RISC-V
  spec:
    It is permitted for multiple address-translation cache entries to co-exist
    for the same address. This represents the fact that in a conventional
    TLB hierarchy, it is possible for multiple entries to match a single
    address if, for example, a page is upgraded to a superpage without first
    clearing the original non-leaf PTE’s valid bit and executing an SFENCE.VMA
    with rs1=x0, or if multiple TLBs exist in parallel at a given level of the
    hierarchy. In this case, just as if an SFENCE.VMA is not executed between
    a write to the memory-management tables and subsequent implicit read of the
    same address: it is unpredictable whether the old non-leaf PTE or the new
    leaf PTE is used, but the behavior is otherwise well defined.
  In contrast to the Arm architecture, where BBM is mandatory and failing to
  use it in some cases can lead to CPU instability, RISC-V guarantees
  stability, and the behavior remains safe — though unpredictable in terms of
  which translation will be used.
- Unlike Arm, the valid bit is not repurposed for other uses in this
  implementation. Instead, entry validity is determined based solely on P2M
  PTE's valid bit.

The main functionality is in p2m_set_entry(), which handles mappings aligned
to page table block entries (e.g., 1GB, 2MB, or 4KB with 4KB granularity).

p2m_set_range() breaks a region down into block-aligned mappings and calls
p2m_set_entry() accordingly.

Stub implementations (to be completed later) include:
- p2m_free_subtree()
- p2m_next_level()
- p2m_pte_from_mfn()

Note: Support for shattering block entries is not implemented in this
patch and will be added separately.

Additionally, some straightforward helper functions are now implemented:
- p2m_write_pte()
- p2m_remove_pte()
- p2m_get_root_pointer()

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V3:
 - Drop p2m_access_t connected stuff as it isn't going to be used, at least
   now.
 - Move defintion of P2M_ROOT_ORDER and P2M_ROOT_PAGES to earlier patches.
 - Update the comment above lowest_mapped_gfn declaration.
 - Update the comment above p2m_get_root_pointer(): s/"...ofset of the root
   table"/"...ofset into root table".
 - s/p2m_remove_pte/p2m_clean_pte.
 - Use plain 0 instead of 0x00 in p2m_clean_pte().
 - s/p2m_entry_from_mfn/p2m_pte_from_mfn.
 - s/GUEST_TABLE_*/P2M_TABLE_*.
 - Update the comment above p2m_next_level(): "GFN entry" -> "corresponding
   the entry corresponding to the GFN".
 - s/__p2m_set_entry/_p2m_set_entry.
 - drop "s" for sgfn and smfn prefixes of _p2m_set_entry()'s arguments
   as this function work only with one GFN and one MFN.
 - Return correct return code when p2m_next_level() faild in _p2m_set_entry(),
   also drop "else" and just handle case (rc != P2M_TABLE_NORMAL) separately.
 - Code style fixes.
 - Use unsigned int for "order" in p2m_set_entry().
 - s/p2m_set_entry/p2m_free_subtree.
 - Update ASSERT() in __p2m_set_enty() to check that page_order is propertly
   aligned.
 - Return -EACCES instead of -ENOMEM in the chase when domain is dying and
   someone called p2m_set_entry.
 - s/p2m_set_entry/p2m_set_range.
 - s/__p2m_set_entry/p2m_set_entry
 - s/p2me_is_valid/p2m_is_valid()
 - Return a number of successfully mapped GFNs in case if not all were mapped
   in p2m_set_range().
 - Use BIT(order, UL) instead of 1 << order.
 - Drop IOMMU flushing code from p2m_set_entry().
 - set p2m->need_flush=true when entry in p2m_set_entry() is changed.
 - Introduce p2m_mapping_order() to support superpages.
 - Drop p2m_is_valid() and use pte_is_valid() instead as there is no tricks
   with copying of valid bit anymore.
 - Update p2m_pte_from_mfn() prototype: drop p2m argument.
---
Changes in V2:
 - New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
   functionality" which was splitted to smaller.
 - Update the way when p2m TLB is flushed:
 - RISC-V does't require BBM so there is no need to remove PTE before making
   new so drop 'if /*pte_is_valid(orig_pte) */' and remove PTE only removing
   has been requested.
 - Drop p2m->need_flush |= !!pte_is_valid(orig_pte); for the case when
   PTE's removing is happening as RISC-V could cache invalid PTE and thereby
   it requires to do a flush each time and it doesn't matter if PTE is valid
   or not at the moment when PTE removing is happening.
 - Drop a check if PTE is valid in case of PTE is modified as it was mentioned
   above as BBM isn't required so TLB flushing could be defered and there is
   no need to do it before modifying of PTE.
 - Drop p2m->need_flush as it seems like it will be always true.
 - Drop foreign mapping things as it isn't necessary for RISC-V right now.
 - s/p2m_is_valid/p2me_is_valid.
 - Move definition and initalization of p2m->{max_mapped_gfn,lowest_mapped_gfn}
   to this patch.
---
 xen/arch/riscv/include/asm/p2m.h |  12 ++
 xen/arch/riscv/p2m.c             | 250 ++++++++++++++++++++++++++++++-
 2 files changed, 261 insertions(+), 1 deletion(-)

diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index ada3c398b4..26ad87b8df 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -7,11 +7,13 @@
 #include <xen/rwlock.h>
 #include <xen/types.h>
 
+#include <asm/page.h>
 #include <asm/page-bits.h>
 
 extern unsigned int p2m_root_order;
 #define P2M_ROOT_ORDER  p2m_root_order
 #define P2M_ROOT_PAGES  BIT(P2M_ROOT_ORDER, U)
+#define P2M_ROOT_LEVEL  HYP_PT_ROOT_LEVEL
 
 #define paddr_bits PADDR_BITS
 
@@ -50,6 +52,16 @@ struct p2m_domain {
      * shattered), call p2m_tlb_flush_sync().
      */
     bool need_flush;
+
+    /* Highest guest frame that's ever been mapped in the p2m */
+    gfn_t max_mapped_gfn;
+
+    /*
+     * Lowest mapped gfn in the p2m. When releasing mapped gfn's in a
+     * preemptible manner this is updated to track where to resume
+     * the search. Apart from during teardown this can only decrease.
+     */
+    gfn_t lowest_mapped_gfn;
 };
 
 /*
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 7cfcf76f24..6c99719c66 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -3,6 +3,7 @@
 #include <xen/rwlock.h>
 #include <xen/sched.h>
 
+#include <asm/page.h>
 #include <asm/paging.h>
 #include <asm/p2m.h>
 #include <asm/riscv_encoding.h>
@@ -132,6 +133,9 @@ int p2m_init(struct domain *d)
     rwlock_init(&p2m->lock);
     INIT_PAGE_LIST_HEAD(&p2m->pages);
 
+    p2m->max_mapped_gfn = _gfn(0);
+    p2m->lowest_mapped_gfn = _gfn(ULONG_MAX);
+
     /*
      * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
      * is not ready for RISC-V support.
@@ -175,13 +179,257 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
     return 0;
 }
 
+/*
+ * Find and map the root page table. The caller is responsible for
+ * unmapping the table.
+ *
+ * The function will return NULL if the offset into the root table is
+ * invalid.
+ */
+static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
+{
+    unsigned long root_table_indx;
+
+    root_table_indx = gfn_x(gfn) >> XEN_PT_LEVEL_ORDER(P2M_ROOT_LEVEL);
+    if ( root_table_indx >= P2M_ROOT_PAGES )
+        return NULL;
+
+    return __map_domain_page(p2m->root + root_table_indx);
+}
+
+static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
+{
+    write_pte(p, pte);
+    if ( clean_pte )
+        clean_dcache_va_range(p, sizeof(*p));
+}
+
+static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
+{
+    pte_t pte;
+
+    memset(&pte, 0, sizeof(pte));
+    p2m_write_pte(p, pte, clean_pte);
+}
+
+static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
+{
+    panic("%s: hasn't been implemented yet\n", __func__);
+
+    return (pte_t) { .pte = 0 };
+}
+
+#define P2M_TABLE_MAP_NONE 0
+#define P2M_TABLE_MAP_NOMEM 1
+#define P2M_TABLE_SUPER_PAGE 2
+#define P2M_TABLE_NORMAL 3
+
+/*
+ * Take the currently mapped table, find the corresponding the entry
+ * corresponding to the GFN, and map the next table, if available.
+ * The previous table will be unmapped if the next level was mapped
+ * (e.g P2M_TABLE_NORMAL returned).
+ *
+ * `alloc_tbl` parameter indicates whether intermediate tables should
+ * be allocated when not present.
+ *
+ * Return values:
+ *  P2M_TABLE_MAP_NONE: a table allocation isn't permitted.
+ *  P2M_TABLE_MAP_NOMEM: allocating a new page failed.
+ *  P2M_TABLE_SUPER_PAGE: next level or leaf mapped normally.
+ *  P2M_TABLE_NORMAL: The next entry points to a superpage.
+ */
+static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
+                          unsigned int level, pte_t **table,
+                          unsigned int offset)
+{
+    panic("%s: hasn't been implemented yet\n", __func__);
+
+    return P2M_TABLE_MAP_NONE;
+}
+
+/* Free pte sub-tree behind an entry */
+static void p2m_free_subtree(struct p2m_domain *p2m,
+                             pte_t entry, unsigned int level)
+{
+    panic("%s: hasn't been implemented yet\n", __func__);
+}
+
+/*
+ * Insert an entry in the p2m. This should be called with a mapping
+ * equal to a page/superpage.
+ */
+static int p2m_set_entry(struct p2m_domain *p2m,
+                           gfn_t gfn,
+                           unsigned long page_order,
+                           mfn_t mfn,
+                           p2m_type_t t)
+{
+    unsigned int level;
+    unsigned int target = page_order / PAGETABLE_ORDER;
+    pte_t *entry, *table, orig_pte;
+    int rc;
+    /* A mapping is removed if the MFN is invalid. */
+    bool removing_mapping = mfn_eq(mfn, INVALID_MFN);
+    DECLARE_OFFSETS(offsets, gfn_to_gaddr(gfn));
+
+    ASSERT(p2m_is_write_locked(p2m));
+
+    /*
+     * Check if the level target is valid: we only support
+     * 4K - 2M - 1G mapping.
+     */
+    ASSERT((target <= 2) && !(page_order % PAGETABLE_ORDER));
+
+    table = p2m_get_root_pointer(p2m, gfn);
+    if ( !table )
+        return -EINVAL;
+
+    for ( level = P2M_ROOT_LEVEL; level > target; level-- )
+    {
+        /*
+         * Don't try to allocate intermediate page table if the mapping
+         * is about to be removed.
+         */
+        rc = p2m_next_level(p2m, !removing_mapping,
+                            level, &table, offsets[level]);
+        if ( (rc == P2M_TABLE_MAP_NONE) || (rc == P2M_TABLE_MAP_NOMEM) )
+        {
+            rc = (rc == P2M_TABLE_MAP_NONE) ? -ENOENT : -ENOMEM;
+            /*
+             * We are here because p2m_next_level has failed to map
+             * the intermediate page table (e.g the table does not exist
+             * and they p2m tree is read-only). It is a valid case
+             * when removing a mapping as it may not exist in the
+             * page table. In this case, just ignore it.
+             */
+            rc = removing_mapping ?  0 : rc;
+            goto out;
+        }
+
+        if ( rc != P2M_TABLE_NORMAL )
+            break;
+    }
+
+    entry = table + offsets[level];
+
+    /*
+     * If we are here with level > target, we must be at a leaf node,
+     * and we need to break up the superpage.
+     */
+    if ( level > target )
+    {
+        panic("Shattering isn't implemented\n");
+    }
+
+    /*
+     * We should always be there with the correct level because all the
+     * intermediate tables have been installed if necessary.
+     */
+    ASSERT(level == target);
+
+    orig_pte = *entry;
+
+    if ( removing_mapping )
+        p2m_clean_pte(entry, p2m->clean_pte);
+    else
+    {
+        pte_t pte = p2m_pte_from_mfn(mfn, t);
+
+        p2m_write_pte(entry, pte, p2m->clean_pte);
+
+        p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
+                                      gfn_add(gfn, BIT(page_order, UL) - 1));
+        p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, gfn);
+    }
+
+    p2m->need_flush = true;
+
+    /*
+     * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
+     * is not ready for RISC-V support.
+     *
+     * When CONFIG_HAS_PASSTHROUGH=y, iommu_iotlb_flush() should be done
+     * here.
+     */
+#ifdef CONFIG_HAS_PASSTHROUGH
+#   error "add code to flush IOMMU TLB"
+#endif
+
+    rc = 0;
+
+    /*
+     * Free the entry only if the original pte was valid and the base
+     * is different (to avoid freeing when permission is changed).
+     */
+    if ( pte_is_valid(orig_pte) &&
+         !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
+        p2m_free_subtree(p2m, orig_pte, level);
+
+ out:
+    unmap_domain_page(table);
+
+    return rc;
+}
+
+/* Return mapping order for given gfn, mfn and nr */
+static unsigned long p2m_mapping_order(gfn_t gfn, mfn_t mfn, unsigned long nr)
+{
+    unsigned long mask;
+    /* 1gb, 2mb, 4k mappings are supported */
+    unsigned int level = min(P2M_ROOT_LEVEL, 2);
+    unsigned long order = 0;
+
+    mask = !mfn_eq(mfn, INVALID_MFN) ? mfn_x(mfn) : 0;
+    mask |= gfn_x(gfn);
+
+    for ( ; level != 0; level-- )
+    {
+        if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(level), UL) - 1)) &&
+             (nr >= BIT(XEN_PT_LEVEL_ORDER(level), UL)) )
+        {
+                order = XEN_PT_LEVEL_ORDER(level);
+                break;
+        }
+    }
+
+    return order;
+}
+
 static int p2m_set_range(struct p2m_domain *p2m,
                          gfn_t sgfn,
                          unsigned long nr,
                          mfn_t smfn,
                          p2m_type_t t)
 {
-    return -EOPNOTSUPP;
+    int rc = 0;
+    unsigned long left = nr;
+
+    /*
+     * Any reference taken by the P2M mappings (e.g. foreign mapping) will
+     * be dropped in relinquish_p2m_mapping(). As the P2M will still
+     * be accessible after, we need to prevent mapping to be added when the
+     * domain is dying.
+     */
+    if ( unlikely(p2m->domain->is_dying) )
+        return -EACCES;
+
+    while ( left )
+    {
+        unsigned long order = p2m_mapping_order(sgfn, smfn, left);
+
+        rc = p2m_set_entry(p2m, sgfn, order, smfn, t);
+        if ( rc )
+            break;
+
+        sgfn = gfn_add(sgfn, BIT(order, UL));
+        if ( !mfn_eq(smfn, INVALID_MFN) )
+           smfn = mfn_add(smfn, BIT(order, UL));
+
+        left -= BIT(order, UL);
+    }
+
+    return !left ? 0 : left == nr ? rc : (nr - left);
 }
 
 static int p2m_insert_mapping(struct p2m_domain *p2m, gfn_t start_gfn,
-- 
2.50.1

Re: [PATCH v3 12/20] xen/riscv: implement p2m_set_range()

Posted by Jan Beulich 2 months, 3 weeks ago

On 31.07.2025 17:58, Oleksii Kurochko wrote:
> This patch introduces p2m_set_range() and its core helper p2m_set_entry() for

Nit: This patch doesn't introduce p2m_set_range(); it merely fleshes it out.

> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -7,11 +7,13 @@
>  #include <xen/rwlock.h>
>  #include <xen/types.h>
>  
> +#include <asm/page.h>
>  #include <asm/page-bits.h>
>  
>  extern unsigned int p2m_root_order;
>  #define P2M_ROOT_ORDER  p2m_root_order
>  #define P2M_ROOT_PAGES  BIT(P2M_ROOT_ORDER, U)
> +#define P2M_ROOT_LEVEL  HYP_PT_ROOT_LEVEL

I think I commented on this before, and I would have hoped for at least a remark
in the description to appear (perhaps even a comment here): It's okay(ish) to tie
these together for now, but in the longer run I don't expect this is going to be
wanted. If e.g. we ran Xen in Sv57 mode, there would be no reason at all to force
all P2Ms to use 5 levels of page tables.

> @@ -175,13 +179,257 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>      return 0;
>  }
>  
> +/*
> + * Find and map the root page table. The caller is responsible for
> + * unmapping the table.
> + *
> + * The function will return NULL if the offset into the root table is
> + * invalid.
> + */
> +static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
> +{
> +    unsigned long root_table_indx;
> +
> +    root_table_indx = gfn_x(gfn) >> XEN_PT_LEVEL_ORDER(P2M_ROOT_LEVEL);

Right now page table layouts / arrangements are indeed similar enough to
share accessor constructs. Nevertheless I find it problematic (doc-wise
at the very least) that a Xen page table construct is used to access a
P2M page table. If and when these needed to be decoupled, it would likely
help of the distinction was already made, by - for now - simply
introducing aliases (here e.g. P2M_LEVEL_ORDER(), expanding to
XEN_PT_LEVEL_ORDER() for the time being).

> +    if ( root_table_indx >= P2M_ROOT_PAGES )
> +        return NULL;
> +
> +    return __map_domain_page(p2m->root + root_table_indx);
> +}
> +
> +static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
> +{
> +    write_pte(p, pte);
> +    if ( clean_pte )
> +        clean_dcache_va_range(p, sizeof(*p));

Not necessarily for right away, but if multiple adjacent PTEs are
written without releasing the lock, this then redundant cache flushing
can be a performance issue.

> +}
> +
> +static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
> +{
> +    pte_t pte;
> +
> +    memset(&pte, 0, sizeof(pte));

Why memset()? Why not simply give the variable an appropriate initializer?
Or use ...

> +    p2m_write_pte(p, pte, clean_pte);

... a compound literal here, like you do ...

> +}
> +
> +static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
> +{
> +    panic("%s: hasn't been implemented yet\n", __func__);
> +
> +    return (pte_t) { .pte = 0 };

... here? (Just {} would also do, if I'm not mistaken.)

> +}
> +
> +#define P2M_TABLE_MAP_NONE 0
> +#define P2M_TABLE_MAP_NOMEM 1
> +#define P2M_TABLE_SUPER_PAGE 2
> +#define P2M_TABLE_NORMAL 3
> +
> +/*
> + * Take the currently mapped table, find the corresponding the entry
> + * corresponding to the GFN, and map the next table, if available.

Nit: Double "corresponding".

> + * The previous table will be unmapped if the next level was mapped
> + * (e.g P2M_TABLE_NORMAL returned).
> + *
> + * `alloc_tbl` parameter indicates whether intermediate tables should
> + * be allocated when not present.
> + *
> + * Return values:
> + *  P2M_TABLE_MAP_NONE: a table allocation isn't permitted.
> + *  P2M_TABLE_MAP_NOMEM: allocating a new page failed.
> + *  P2M_TABLE_SUPER_PAGE: next level or leaf mapped normally.
> + *  P2M_TABLE_NORMAL: The next entry points to a superpage.
> + */
> +static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
> +                          unsigned int level, pte_t **table,
> +                          unsigned int offset)
> +{
> +    panic("%s: hasn't been implemented yet\n", __func__);
> +
> +    return P2M_TABLE_MAP_NONE;
> +}
> +
> +/* Free pte sub-tree behind an entry */
> +static void p2m_free_subtree(struct p2m_domain *p2m,
> +                             pte_t entry, unsigned int level)
> +{
> +    panic("%s: hasn't been implemented yet\n", __func__);
> +}
> +
> +/*
> + * Insert an entry in the p2m. This should be called with a mapping
> + * equal to a page/superpage.
> + */
> +static int p2m_set_entry(struct p2m_domain *p2m,
> +                           gfn_t gfn,
> +                           unsigned long page_order,
> +                           mfn_t mfn,
> +                           p2m_type_t t)
> +{
> +    unsigned int level;
> +    unsigned int target = page_order / PAGETABLE_ORDER;
> +    pte_t *entry, *table, orig_pte;
> +    int rc;
> +    /* A mapping is removed if the MFN is invalid. */
> +    bool removing_mapping = mfn_eq(mfn, INVALID_MFN);

Comment and code don't fit together. Many MFNs are invalid (any for which
mfn_valid() returns false), yet you only check for INVALID_MFN here.

> +    DECLARE_OFFSETS(offsets, gfn_to_gaddr(gfn));
> +
> +    ASSERT(p2m_is_write_locked(p2m));
> +
> +    /*
> +     * Check if the level target is valid: we only support
> +     * 4K - 2M - 1G mapping.
> +     */
> +    ASSERT((target <= 2) && !(page_order % PAGETABLE_ORDER));

If you think you need to check this, don't you also want to check that
GFN and MFN (the latter if it isn't INVALID_MFN) fit the requested order?

> +    table = p2m_get_root_pointer(p2m, gfn);
> +    if ( !table )
> +        return -EINVAL;
> +
> +    for ( level = P2M_ROOT_LEVEL; level > target; level-- )
> +    {
> +        /*
> +         * Don't try to allocate intermediate page table if the mapping
> +         * is about to be removed.
> +         */
> +        rc = p2m_next_level(p2m, !removing_mapping,
> +                            level, &table, offsets[level]);
> +        if ( (rc == P2M_TABLE_MAP_NONE) || (rc == P2M_TABLE_MAP_NOMEM) )
> +        {
> +            rc = (rc == P2M_TABLE_MAP_NONE) ? -ENOENT : -ENOMEM;
> +            /*
> +             * We are here because p2m_next_level has failed to map
> +             * the intermediate page table (e.g the table does not exist
> +             * and they p2m tree is read-only). It is a valid case
> +             * when removing a mapping as it may not exist in the
> +             * page table. In this case, just ignore it.
> +             */
> +            rc = removing_mapping ?  0 : rc;

Nit: Stray blank.

> +            goto out;
> +        }
> +
> +        if ( rc != P2M_TABLE_NORMAL )
> +            break;
> +    }
> +
> +    entry = table + offsets[level];
> +
> +    /*
> +     * If we are here with level > target, we must be at a leaf node,
> +     * and we need to break up the superpage.
> +     */
> +    if ( level > target )
> +    {
> +        panic("Shattering isn't implemented\n");
> +    }
> +
> +    /*
> +     * We should always be there with the correct level because all the
> +     * intermediate tables have been installed if necessary.
> +     */
> +    ASSERT(level == target);
> +
> +    orig_pte = *entry;
> +
> +    if ( removing_mapping )
> +        p2m_clean_pte(entry, p2m->clean_pte);
> +    else
> +    {
> +        pte_t pte = p2m_pte_from_mfn(mfn, t);
> +
> +        p2m_write_pte(entry, pte, p2m->clean_pte);
> +
> +        p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
> +                                      gfn_add(gfn, BIT(page_order, UL) - 1));
> +        p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, gfn);
> +    }
> +
> +    p2m->need_flush = true;
> +
> +    /*
> +     * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
> +     * is not ready for RISC-V support.
> +     *
> +     * When CONFIG_HAS_PASSTHROUGH=y, iommu_iotlb_flush() should be done
> +     * here.
> +     */
> +#ifdef CONFIG_HAS_PASSTHROUGH
> +#   error "add code to flush IOMMU TLB"
> +#endif
> +
> +    rc = 0;
> +
> +    /*
> +     * Free the entry only if the original pte was valid and the base
> +     * is different (to avoid freeing when permission is changed).
> +     */
> +    if ( pte_is_valid(orig_pte) &&
> +         !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )

I'm puzzled by this 2nd check: A permission change would - I expect - only
occur to a leaf entry. If the new entry is a super-page one and the old
wasn't, don't you still need to free the sub-tree, no matter whether the
MFNs are the same? Plus consider the special case of MFN 0: If you clear
an entry using MFN 0, you will find old and new PTEs' both having the same
MFN.

>  static int p2m_set_range(struct p2m_domain *p2m,
>                           gfn_t sgfn,
>                           unsigned long nr,
>                           mfn_t smfn,
>                           p2m_type_t t)
>  {
> -    return -EOPNOTSUPP;
> +    int rc = 0;
> +    unsigned long left = nr;
> +
> +    /*
> +     * Any reference taken by the P2M mappings (e.g. foreign mapping) will
> +     * be dropped in relinquish_p2m_mapping(). As the P2M will still
> +     * be accessible after, we need to prevent mapping to be added when the
> +     * domain is dying.
> +     */
> +    if ( unlikely(p2m->domain->is_dying) )
> +        return -EACCES;
> +
> +    while ( left )
> +    {
> +        unsigned long order = p2m_mapping_order(sgfn, smfn, left);
> +
> +        rc = p2m_set_entry(p2m, sgfn, order, smfn, t);
> +        if ( rc )
> +            break;
> +
> +        sgfn = gfn_add(sgfn, BIT(order, UL));
> +        if ( !mfn_eq(smfn, INVALID_MFN) )
> +           smfn = mfn_add(smfn, BIT(order, UL));
> +
> +        left -= BIT(order, UL);
> +    }
> +
> +    return !left ? 0 : left == nr ? rc : (nr - left);

The function returning "int", you may be truncating the return value here.
In the worst case indicating success (0) or an error (negative) when some
of the upper bits were set.

Also looks like you could get away with a single conditional operator here:

    return !left || left == nr ? rc : (nr - left);

Jan

Re: [PATCH v3 12/20] xen/riscv: implement p2m_set_range()

Posted by Oleksii Kurochko 2 months, 2 weeks ago

On 8/5/25 6:04 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> This patch introduces p2m_set_range() and its core helper p2m_set_entry() for
> Nit: This patch doesn't introduce p2m_set_range(); it merely fleshes it out.
>
>> --- a/xen/arch/riscv/include/asm/p2m.h
>> +++ b/xen/arch/riscv/include/asm/p2m.h
>> @@ -7,11 +7,13 @@
>>   #include <xen/rwlock.h>
>>   #include <xen/types.h>
>>   
>> +#include <asm/page.h>
>>   #include <asm/page-bits.h>
>>   
>>   extern unsigned int p2m_root_order;
>>   #define P2M_ROOT_ORDER  p2m_root_order
>>   #define P2M_ROOT_PAGES  BIT(P2M_ROOT_ORDER, U)
>> +#define P2M_ROOT_LEVEL  HYP_PT_ROOT_LEVEL
> I think I commented on this before, and I would have hoped for at least a remark
> in the description to appear (perhaps even a comment here): It's okay(ish) to tie
> these together for now, but in the longer run I don't expect this is going to be
> wanted. If e.g. we ran Xen in Sv57 mode, there would be no reason at all to force
> all P2Ms to use 5 levels of page tables.

Do you mean that for G-stage it could be chosen any SvXX mode to limit an amount
of page tables necessary for G-stage? If yes, then, at least, I agree that a
comment should be added or, probably, "#warning optimize an amount of p2m root level
for MMU mode > Sv48" (or maybe >=).

Or do you mean if we set hgatp.mode=Sv57 then it is possible to limit an amount of
page table's levels to use? In this case I think hardware still will expect to see
5 levels of page tables.


>> @@ -175,13 +179,257 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>>       return 0;
>>   }
>>   
>> +/*
>> + * Find and map the root page table. The caller is responsible for
>> + * unmapping the table.
>> + *
>> + * The function will return NULL if the offset into the root table is
>> + * invalid.
>> + */
>> +static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>> +{
>> +    unsigned long root_table_indx;
>> +
>> +    root_table_indx = gfn_x(gfn) >> XEN_PT_LEVEL_ORDER(P2M_ROOT_LEVEL);
> Right now page table layouts / arrangements are indeed similar enough to
> share accessor constructs. Nevertheless I find it problematic (doc-wise
> at the very least) that a Xen page table construct is used to access a
> P2M page table. If and when these needed to be decoupled, it would likely
> help of the distinction was already made, by - for now - simply
> introducing aliases (here e.g. P2M_LEVEL_ORDER(), expanding to
> XEN_PT_LEVEL_ORDER() for the time being).

I think it's better to define this correctly now, as I initially missed
that only non-root page tables and all page table entries (PTEs) share
the same format as their corresponding Sv39, Sv48, or Sv57 modes
(i.e., corresponding to SvXXx4).

However, in this case, we're dealing with the*root level*, which is extended
by 2 extra bits. Therefore, using|XEN_PT_LEVEL_ORDER()| would be incorrect
if those two extra bits are actually used.

We're just fortunate that no one currently uses these extra 2 bits.

>> +    if ( root_table_indx >= P2M_ROOT_PAGES )
>> +        return NULL;
>> +
>> +    return __map_domain_page(p2m->root + root_table_indx);
>> +}
>> +
>> +static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
>> +{
>> +    write_pte(p, pte);
>> +    if ( clean_pte )
>> +        clean_dcache_va_range(p, sizeof(*p));
> Not necessarily for right away, but if multiple adjacent PTEs are
> written without releasing the lock, this then redundant cache flushing
> can be a performance issue.

Can't it be resolved on a caller side? Something like:
   p2m_write_pte(p1, pte1, false);
   p2m_write_pte(p2, pte2, false);
   p2m_write_pte(p3, pte3, false);
   p2m_write_pte(p4, pte4, true);
where p1-p4 are adjacent.

  

>> +}
>> +
>> +static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
>> +{
>> +    pte_t pte;
>> +
>> +    memset(&pte, 0, sizeof(pte));
> Why memset()? Why not simply give the variable an appropriate initializer?

Good point. It would be much better just to use an appropriate initializer.

> Or use ...
>
>> +    p2m_write_pte(p, pte, clean_pte);
> ... a compound literal here, like you do ...
>
>> +}
>> +
>> +static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
>> +{
>> +    panic("%s: hasn't been implemented yet\n", __func__);
>> +
>> +    return (pte_t) { .pte = 0 };
> ... here? (Just {} would also do, if I'm not mistaken.)

According to C99 it will be enough {}, but {.pte = 0} is more obvious and
clearer, IMO.

>> +}
>> +
>> +#define P2M_TABLE_MAP_NONE 0
>> +#define P2M_TABLE_MAP_NOMEM 1
>> +#define P2M_TABLE_SUPER_PAGE 2
>> +#define P2M_TABLE_NORMAL 3
>> +
>> +/*
>> + * Take the currently mapped table, find the corresponding the entry
>> + * corresponding to the GFN, and map the next table, if available.
> Nit: Double "corresponding".
>
>> + * The previous table will be unmapped if the next level was mapped
>> + * (e.g P2M_TABLE_NORMAL returned).
>> + *
>> + * `alloc_tbl` parameter indicates whether intermediate tables should
>> + * be allocated when not present.
>> + *
>> + * Return values:
>> + *  P2M_TABLE_MAP_NONE: a table allocation isn't permitted.
>> + *  P2M_TABLE_MAP_NOMEM: allocating a new page failed.
>> + *  P2M_TABLE_SUPER_PAGE: next level or leaf mapped normally.
>> + *  P2M_TABLE_NORMAL: The next entry points to a superpage.
>> + */
>> +static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>> +                          unsigned int level, pte_t **table,
>> +                          unsigned int offset)
>> +{
>> +    panic("%s: hasn't been implemented yet\n", __func__);
>> +
>> +    return P2M_TABLE_MAP_NONE;
>> +}
>> +
>> +/* Free pte sub-tree behind an entry */
>> +static void p2m_free_subtree(struct p2m_domain *p2m,
>> +                             pte_t entry, unsigned int level)
>> +{
>> +    panic("%s: hasn't been implemented yet\n", __func__);
>> +}
>> +
>> +/*
>> + * Insert an entry in the p2m. This should be called with a mapping
>> + * equal to a page/superpage.
>> + */
>> +static int p2m_set_entry(struct p2m_domain *p2m,
>> +                           gfn_t gfn,
>> +                           unsigned long page_order,
>> +                           mfn_t mfn,
>> +                           p2m_type_t t)
>> +{
>> +    unsigned int level;
>> +    unsigned int target = page_order / PAGETABLE_ORDER;
>> +    pte_t *entry, *table, orig_pte;
>> +    int rc;
>> +    /* A mapping is removed if the MFN is invalid. */
>> +    bool removing_mapping = mfn_eq(mfn, INVALID_MFN);
> Comment and code don't fit together. Many MFNs are invalid (any for which
> mfn_valid() returns false), yet you only check for INVALID_MFN here.

Probably, it makes sense to add an|ASSERT()| here for the case when
|mfn_valid(mfn)| is false, but the MFN is not explicitly equal to|INVALID_MFN|.
This would indicate that someone attempted to perform a mapping with an
incorrect MFN, which, IMO, is entirely wrong.
In the case where the MFN is explicitly set to|INVALID_MFN|, I believe it's
enough to indicate that a mapping is being removed.

>> +    DECLARE_OFFSETS(offsets, gfn_to_gaddr(gfn));
>> +
>> +    ASSERT(p2m_is_write_locked(p2m));
>> +
>> +    /*
>> +     * Check if the level target is valid: we only support
>> +     * 4K - 2M - 1G mapping.
>> +     */
>> +    ASSERT((target <= 2) && !(page_order % PAGETABLE_ORDER));
> If you think you need to check this, don't you also want to check that
> GFN and MFN (the latter if it isn't INVALID_MFN) fit the requested order?

I'm not 100% sure it makes sense to check|page_order|, as in IMO, proper
alignment is already guaranteed by|p2m_mapping_order(sgfn, smfn, left)|,
and thus both GFN and MFN are aligned accordingly.
So, I think this part of the check could be dropped.

>> +    table = p2m_get_root_pointer(p2m, gfn);
>> +    if ( !table )
>> +        return -EINVAL;
>> +
>> +    for ( level = P2M_ROOT_LEVEL; level > target; level-- )
>> +    {
>> +        /*
>> +         * Don't try to allocate intermediate page table if the mapping
>> +         * is about to be removed.
>> +         */
>> +        rc = p2m_next_level(p2m, !removing_mapping,
>> +                            level, &table, offsets[level]);
>> +        if ( (rc == P2M_TABLE_MAP_NONE) || (rc == P2M_TABLE_MAP_NOMEM) )
>> +        {
>> +            rc = (rc == P2M_TABLE_MAP_NONE) ? -ENOENT : -ENOMEM;
>> +            /*
>> +             * We are here because p2m_next_level has failed to map
>> +             * the intermediate page table (e.g the table does not exist
>> +             * and they p2m tree is read-only). It is a valid case
>> +             * when removing a mapping as it may not exist in the
>> +             * page table. In this case, just ignore it.
>> +             */
>> +            rc = removing_mapping ?  0 : rc;
> Nit: Stray blank.
>
>> +            goto out;
>> +        }
>> +
>> +        if ( rc != P2M_TABLE_NORMAL )
>> +            break;
>> +    }
>> +
>> +    entry = table + offsets[level];
>> +
>> +    /*
>> +     * If we are here with level > target, we must be at a leaf node,
>> +     * and we need to break up the superpage.
>> +     */
>> +    if ( level > target )
>> +    {
>> +        panic("Shattering isn't implemented\n");
>> +    }
>> +
>> +    /*
>> +     * We should always be there with the correct level because all the
>> +     * intermediate tables have been installed if necessary.
>> +     */
>> +    ASSERT(level == target);
>> +
>> +    orig_pte = *entry;
>> +
>> +    if ( removing_mapping )
>> +        p2m_clean_pte(entry, p2m->clean_pte);
>> +    else
>> +    {
>> +        pte_t pte = p2m_pte_from_mfn(mfn, t);
>> +
>> +        p2m_write_pte(entry, pte, p2m->clean_pte);
>> +
>> +        p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
>> +                                      gfn_add(gfn, BIT(page_order, UL) - 1));
>> +        p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, gfn);
>> +    }
>> +
>> +    p2m->need_flush = true;
>> +
>> +    /*
>> +     * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
>> +     * is not ready for RISC-V support.
>> +     *
>> +     * When CONFIG_HAS_PASSTHROUGH=y, iommu_iotlb_flush() should be done
>> +     * here.
>> +     */
>> +#ifdef CONFIG_HAS_PASSTHROUGH
>> +#   error "add code to flush IOMMU TLB"
>> +#endif
>> +
>> +    rc = 0;
>> +
>> +    /*
>> +     * Free the entry only if the original pte was valid and the base
>> +     * is different (to avoid freeing when permission is changed).
>> +     */
>> +    if ( pte_is_valid(orig_pte) &&
>> +         !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
> I'm puzzled by this 2nd check: A permission change would - I expect - only
> occur to a leaf entry. If the new entry is a super-page one and the old
> wasn't, don't you still need to free the sub-tree, no matter whether the
> MFNs are the same?

I expect the MFNs to differ in this scenario, so the old sub-tree will be freed.

Based on your example (new entry is super-page and old entry isn't):
For old mapping (lets say, 4 KiB leaf) p2m_set_entry() walks all levels down
to L0, so we will have the following MMU page table walks:
   L2 PTE -> L1 PTE (MFN of L0 page table) -> L0 PTE -> RAM

When new mapping (lets say, 2 MiB superpage) will be requested, p2m_set_entry()
will stop at L1 (the superpage level):
  L2 PTE -> L1 PTE (at this moment, L1 PTE points to L0 page table, which
                    points to RAM)
Then the old L1 PTE will be saved in 'orig_pte', then writes 'entry' with
the RAM MFN for the 2 MiB mapping. The walk becomes:
   L2 PTE -> L1 PTE -> RAM

Therefore, 'entry' now holds an MFN pointing to RAM (superpage leaf). 'orig_pte'
still holds an MFN pointing to the L0 table (the old sub-tree). Since these MFNs
differ, the code calls p2m_free_subtree(p2m, orig_pte, …) and frees the old L0
sub-tree.

     

>   Plus consider the special case of MFN 0: If you clear
> an entry using MFN 0, you will find old and new PTEs' both having the same
> MFN.

Isn't this happen only when a mapping removal is explicitly requested?
In the case of a mapping removal it seems to me it is enough just to
clear PTE with all zeroes.

>   static int p2m_set_range(struct p2m_domain *p2m,
>                            gfn_t sgfn,
>                            unsigned long nr,
>                            mfn_t smfn,
>                            p2m_type_t t)
>   {
> -    return -EOPNOTSUPP;
> +    int rc = 0;
> +    unsigned long left = nr;
> +
> +    /*
> +     * Any reference taken by the P2M mappings (e.g. foreign mapping) will
> +     * be dropped in relinquish_p2m_mapping(). As the P2M will still
> +     * be accessible after, we need to prevent mapping to be added when the
> +     * domain is dying.
> +     */
> +    if ( unlikely(p2m->domain->is_dying) )
> +        return -EACCES;
> +
> +    while ( left )
> +    {
> +        unsigned long order = p2m_mapping_order(sgfn, smfn, left);
> +
> +        rc = p2m_set_entry(p2m, sgfn, order, smfn, t);
> +        if ( rc )
> +            break;
> +
> +        sgfn = gfn_add(sgfn, BIT(order, UL));
> +        if ( !mfn_eq(smfn, INVALID_MFN) )
> +           smfn = mfn_add(smfn, BIT(order, UL));
> +
> +        left -= BIT(order, UL);
> +    }
> +
> +    return !left ? 0 : left == nr ? rc : (nr - left);
> The function returning "int", you may be truncating the return value here.
> In the worst case indicating success (0) or an error (negative) when some
> of the upper bits were set.

I think what will be better:
or (1) Return long instead of int and the following check:
    long result = nr - left;
    if (result < 0 || result > LONG_MAX)
       return -ERANGE;
or (2) Just add new `left_to_map` (or just left) argument to p2m_set_range().

> Also looks like you could get away with a single conditional operator here:
>
>      return !left || left == nr ? rc : (nr - left);

Thanks.

~ Oleksii

Re: [PATCH v3 12/20] xen/riscv: implement p2m_set_range()

Posted by Jan Beulich 2 months, 2 weeks ago

On 15.08.2025 11:52, Oleksii Kurochko wrote:
> On 8/5/25 6:04 PM, Jan Beulich wrote:
>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>> This patch introduces p2m_set_range() and its core helper p2m_set_entry() for
>> Nit: This patch doesn't introduce p2m_set_range(); it merely fleshes it out.
>>
>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>> @@ -7,11 +7,13 @@
>>>   #include <xen/rwlock.h>
>>>   #include <xen/types.h>
>>>   
>>> +#include <asm/page.h>
>>>   #include <asm/page-bits.h>
>>>   
>>>   extern unsigned int p2m_root_order;
>>>   #define P2M_ROOT_ORDER  p2m_root_order
>>>   #define P2M_ROOT_PAGES  BIT(P2M_ROOT_ORDER, U)
>>> +#define P2M_ROOT_LEVEL  HYP_PT_ROOT_LEVEL
>> I think I commented on this before, and I would have hoped for at least a remark
>> in the description to appear (perhaps even a comment here): It's okay(ish) to tie
>> these together for now, but in the longer run I don't expect this is going to be
>> wanted. If e.g. we ran Xen in Sv57 mode, there would be no reason at all to force
>> all P2Ms to use 5 levels of page tables.
> 
> Do you mean that for G-stage it could be chosen any SvXX mode to limit an amount
> of page tables necessary for G-stage? If yes, then, at least, I agree that a
> comment should be added or, probably, "#warning optimize an amount of p2m root level
> for MMU mode > Sv48" (or maybe >=).

Yes.

> Or do you mean if we set hgatp.mode=Sv57 then it is possible to limit an amount of
> page table's levels to use? In this case I think hardware still will expect to see
> 5 levels of page tables.

No.

>>> +static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
>>> +{
>>> +    write_pte(p, pte);
>>> +    if ( clean_pte )
>>> +        clean_dcache_va_range(p, sizeof(*p));
>> Not necessarily for right away, but if multiple adjacent PTEs are
>> written without releasing the lock, this then redundant cache flushing
>> can be a performance issue.
> 
> Can't it be resolved on a caller side? Something like:
>    p2m_write_pte(p1, pte1, false);
>    p2m_write_pte(p2, pte2, false);
>    p2m_write_pte(p3, pte3, false);
>    p2m_write_pte(p4, pte4, true);
> where p1-p4 are adjacent.

No. You wouldn't know whether the last write flushes what the earlier
three have written. There may be a cacheline boundary in between. Plus
I didn't really think of back-to-back writes, but e.g. a loop doing
many of them, where a single wider flush may then be more efficient.

>>> +#define P2M_TABLE_MAP_NONE 0
>>> +#define P2M_TABLE_MAP_NOMEM 1
>>> +#define P2M_TABLE_SUPER_PAGE 2
>>> +#define P2M_TABLE_NORMAL 3
>>> +
>>> +/*
>>> + * Take the currently mapped table, find the corresponding the entry
>>> + * corresponding to the GFN, and map the next table, if available.
>> Nit: Double "corresponding".
>>
>>> + * The previous table will be unmapped if the next level was mapped
>>> + * (e.g P2M_TABLE_NORMAL returned).
>>> + *
>>> + * `alloc_tbl` parameter indicates whether intermediate tables should
>>> + * be allocated when not present.
>>> + *
>>> + * Return values:
>>> + *  P2M_TABLE_MAP_NONE: a table allocation isn't permitted.
>>> + *  P2M_TABLE_MAP_NOMEM: allocating a new page failed.
>>> + *  P2M_TABLE_SUPER_PAGE: next level or leaf mapped normally.
>>> + *  P2M_TABLE_NORMAL: The next entry points to a superpage.
>>> + */
>>> +static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>>> +                          unsigned int level, pte_t **table,
>>> +                          unsigned int offset)
>>> +{
>>> +    panic("%s: hasn't been implemented yet\n", __func__);
>>> +
>>> +    return P2M_TABLE_MAP_NONE;
>>> +}
>>> +
>>> +/* Free pte sub-tree behind an entry */
>>> +static void p2m_free_subtree(struct p2m_domain *p2m,
>>> +                             pte_t entry, unsigned int level)
>>> +{
>>> +    panic("%s: hasn't been implemented yet\n", __func__);
>>> +}
>>> +
>>> +/*
>>> + * Insert an entry in the p2m. This should be called with a mapping
>>> + * equal to a page/superpage.
>>> + */
>>> +static int p2m_set_entry(struct p2m_domain *p2m,
>>> +                           gfn_t gfn,
>>> +                           unsigned long page_order,
>>> +                           mfn_t mfn,
>>> +                           p2m_type_t t)
>>> +{
>>> +    unsigned int level;
>>> +    unsigned int target = page_order / PAGETABLE_ORDER;
>>> +    pte_t *entry, *table, orig_pte;
>>> +    int rc;
>>> +    /* A mapping is removed if the MFN is invalid. */
>>> +    bool removing_mapping = mfn_eq(mfn, INVALID_MFN);
>> Comment and code don't fit together. Many MFNs are invalid (any for which
>> mfn_valid() returns false), yet you only check for INVALID_MFN here.
> 
> Probably, it makes sense to add an|ASSERT()| here for the case when
> |mfn_valid(mfn)| is false, but the MFN is not explicitly equal to|INVALID_MFN|.
> This would indicate that someone attempted to perform a mapping with an
> incorrect MFN, which, IMO, is entirely wrong.

No, and we've been there before. MMIO can live anywhere, and mappings for
such still will need to be permitted. It is correct to check only for
INVALID_MFN here imo; it's just the comment which also needs to reflect
that.

>>> +    /*
>>> +     * If we are here with level > target, we must be at a leaf node,
>>> +     * and we need to break up the superpage.
>>> +     */
>>> +    if ( level > target )
>>> +    {
>>> +        panic("Shattering isn't implemented\n");
>>> +    }
>>> +
>>> +    /*
>>> +     * We should always be there with the correct level because all the
>>> +     * intermediate tables have been installed if necessary.
>>> +     */
>>> +    ASSERT(level == target);
>>> +
>>> +    orig_pte = *entry;
>>> +
>>> +    if ( removing_mapping )
>>> +        p2m_clean_pte(entry, p2m->clean_pte);
>>> +    else
>>> +    {
>>> +        pte_t pte = p2m_pte_from_mfn(mfn, t);
>>> +
>>> +        p2m_write_pte(entry, pte, p2m->clean_pte);
>>> +
>>> +        p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
>>> +                                      gfn_add(gfn, BIT(page_order, UL) - 1));
>>> +        p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, gfn);
>>> +    }
>>> +
>>> +    p2m->need_flush = true;
>>> +
>>> +    /*
>>> +     * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
>>> +     * is not ready for RISC-V support.
>>> +     *
>>> +     * When CONFIG_HAS_PASSTHROUGH=y, iommu_iotlb_flush() should be done
>>> +     * here.
>>> +     */
>>> +#ifdef CONFIG_HAS_PASSTHROUGH
>>> +#   error "add code to flush IOMMU TLB"
>>> +#endif
>>> +
>>> +    rc = 0;
>>> +
>>> +    /*
>>> +     * Free the entry only if the original pte was valid and the base
>>> +     * is different (to avoid freeing when permission is changed).
>>> +     */
>>> +    if ( pte_is_valid(orig_pte) &&
>>> +         !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
>> I'm puzzled by this 2nd check: A permission change would - I expect - only
>> occur to a leaf entry. If the new entry is a super-page one and the old
>> wasn't, don't you still need to free the sub-tree, no matter whether the
>> MFNs are the same?
> 
> I expect the MFNs to differ in this scenario, so the old sub-tree will be freed.

You expecting something isn't a good criteria. If it's possible, even if
unexpected (by you), it needs dealing with correctly.

> Based on your example (new entry is super-page and old entry isn't):
> For old mapping (lets say, 4 KiB leaf) p2m_set_entry() walks all levels down
> to L0, so we will have the following MMU page table walks:
>    L2 PTE -> L1 PTE (MFN of L0 page table) -> L0 PTE -> RAM
> 
> When new mapping (lets say, 2 MiB superpage) will be requested, p2m_set_entry()
> will stop at L1 (the superpage level):
>   L2 PTE -> L1 PTE (at this moment, L1 PTE points to L0 page table, which
>                     points to RAM)
> Then the old L1 PTE will be saved in 'orig_pte', then writes 'entry' with
> the RAM MFN for the 2 MiB mapping. The walk becomes:
>    L2 PTE -> L1 PTE -> RAM
> 
> Therefore, 'entry' now holds an MFN pointing to RAM (superpage leaf). 'orig_pte'
> still holds an MFN pointing to the L0 table (the old sub-tree). Since these MFNs
> differ, the code calls p2m_free_subtree(p2m, orig_pte, …) and frees the old L0
> sub-tree.

A particular example doesn't help. All possible cases need handling correctly.

>>   Plus consider the special case of MFN 0: If you clear
>> an entry using MFN 0, you will find old and new PTEs' both having the same
>> MFN.
> 
> Isn't this happen only when a mapping removal is explicitly requested?
> In the case of a mapping removal it seems to me it is enough just to
> clear PTE with all zeroes.

Correct. Which means original MFN (PPN) and new MFN (PPN) would match.

Jan

Re: [PATCH v3 12/20] xen/riscv: implement p2m_set_range()

Posted by Oleksii Kurochko 2 months, 1 week ago

On 8/15/25 2:50 PM, Jan Beulich wrote:
> On 15.08.2025 11:52, Oleksii Kurochko wrote:
>> On 8/5/25 6:04 PM, Jan Beulich wrote:
>>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>>> +static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
>>>> +{
>>>> +    write_pte(p, pte);
>>>> +    if ( clean_pte )
>>>> +        clean_dcache_va_range(p, sizeof(*p));
>>> Not necessarily for right away, but if multiple adjacent PTEs are
>>> written without releasing the lock, this then redundant cache flushing
>>> can be a performance issue.
>> Can't it be resolved on a caller side? Something like:
>>     p2m_write_pte(p1, pte1, false);
>>     p2m_write_pte(p2, pte2, false);
>>     p2m_write_pte(p3, pte3, false);
>>     p2m_write_pte(p4, pte4, true);
>> where p1-p4 are adjacent.
> No. You wouldn't know whether the last write flushes what the earlier
> three have written. There may be a cacheline boundary in between.

Oh, correct. It would be hard to detect, so agree that it will work
badly...

> Plus
> I didn't really think of back-to-back writes, but e.g. a loop doing
> many of them, where a single wider flush may then be more efficient.

... So IIUC you mean something like:
   for (i = 0; i < nr_entries; i++)
       p2m_write_pte(&pt[i], entries[i], false);  // no flush yet

   clean_dcache_va_range(pt, nr_entries * sizeof(pte_t));

>>>> +#define P2M_TABLE_MAP_NONE 0
>>>> +#define P2M_TABLE_MAP_NOMEM 1
>>>> +#define P2M_TABLE_SUPER_PAGE 2
>>>> +#define P2M_TABLE_NORMAL 3
>>>> +
>>>> +/*
>>>> + * Take the currently mapped table, find the corresponding the entry
>>>> + * corresponding to the GFN, and map the next table, if available.
>>> Nit: Double "corresponding".
>>>
>>>> + * The previous table will be unmapped if the next level was mapped
>>>> + * (e.g P2M_TABLE_NORMAL returned).
>>>> + *
>>>> + * `alloc_tbl` parameter indicates whether intermediate tables should
>>>> + * be allocated when not present.
>>>> + *
>>>> + * Return values:
>>>> + *  P2M_TABLE_MAP_NONE: a table allocation isn't permitted.
>>>> + *  P2M_TABLE_MAP_NOMEM: allocating a new page failed.
>>>> + *  P2M_TABLE_SUPER_PAGE: next level or leaf mapped normally.
>>>> + *  P2M_TABLE_NORMAL: The next entry points to a superpage.
>>>> + */
>>>> +static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>>>> +                          unsigned int level, pte_t **table,
>>>> +                          unsigned int offset)
>>>> +{
>>>> +    panic("%s: hasn't been implemented yet\n", __func__);
>>>> +
>>>> +    return P2M_TABLE_MAP_NONE;
>>>> +}
>>>> +
>>>> +/* Free pte sub-tree behind an entry */
>>>> +static void p2m_free_subtree(struct p2m_domain *p2m,
>>>> +                             pte_t entry, unsigned int level)
>>>> +{
>>>> +    panic("%s: hasn't been implemented yet\n", __func__);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Insert an entry in the p2m. This should be called with a mapping
>>>> + * equal to a page/superpage.
>>>> + */
>>>> +static int p2m_set_entry(struct p2m_domain *p2m,
>>>> +                           gfn_t gfn,
>>>> +                           unsigned long page_order,
>>>> +                           mfn_t mfn,
>>>> +                           p2m_type_t t)
>>>> +{
>>>> +    unsigned int level;
>>>> +    unsigned int target = page_order / PAGETABLE_ORDER;
>>>> +    pte_t *entry, *table, orig_pte;
>>>> +    int rc;
>>>> +    /* A mapping is removed if the MFN is invalid. */
>>>> +    bool removing_mapping = mfn_eq(mfn, INVALID_MFN);
>>> Comment and code don't fit together. Many MFNs are invalid (any for which
>>> mfn_valid() returns false), yet you only check for INVALID_MFN here.
>> Probably, it makes sense to add an|ASSERT()| here for the case when
>> |mfn_valid(mfn)| is false, but the MFN is not explicitly equal to|INVALID_MFN|.
>> This would indicate that someone attempted to perform a mapping with an
>> incorrect MFN, which, IMO, is entirely wrong.
> No, and we've been there before. MMIO can live anywhere, and mappings for
> such still will need to be permitted. It is correct to check only for
> INVALID_MFN here imo; it's just the comment which also needs to reflect
> that.

Got it now. The original one comment looked clear to me, but considering what
you wrote, I will update the comment then to:
   A mapping is removed only if the MFN is explicitly passed as INVALID_MFN.
Also, perhaps, it makes sense to add the following:
   Other MFNs that are not valid (e.g., MMIO) from mfn_valid() point of
   view are allowed.

Does it make more sense now?

>
>>>> +    /*
>>>> +     * If we are here with level > target, we must be at a leaf node,
>>>> +     * and we need to break up the superpage.
>>>> +     */
>>>> +    if ( level > target )
>>>> +    {
>>>> +        panic("Shattering isn't implemented\n");
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * We should always be there with the correct level because all the
>>>> +     * intermediate tables have been installed if necessary.
>>>> +     */
>>>> +    ASSERT(level == target);
>>>> +
>>>> +    orig_pte = *entry;
>>>> +
>>>> +    if ( removing_mapping )
>>>> +        p2m_clean_pte(entry, p2m->clean_pte);
>>>> +    else
>>>> +    {
>>>> +        pte_t pte = p2m_pte_from_mfn(mfn, t);
>>>> +
>>>> +        p2m_write_pte(entry, pte, p2m->clean_pte);
>>>> +
>>>> +        p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
>>>> +                                      gfn_add(gfn, BIT(page_order, UL) - 1));
>>>> +        p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, gfn);
>>>> +    }
>>>> +
>>>> +    p2m->need_flush = true;
>>>> +
>>>> +    /*
>>>> +     * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
>>>> +     * is not ready for RISC-V support.
>>>> +     *
>>>> +     * When CONFIG_HAS_PASSTHROUGH=y, iommu_iotlb_flush() should be done
>>>> +     * here.
>>>> +     */
>>>> +#ifdef CONFIG_HAS_PASSTHROUGH
>>>> +#   error "add code to flush IOMMU TLB"
>>>> +#endif
>>>> +
>>>> +    rc = 0;
>>>> +
>>>> +    /*
>>>> +     * Free the entry only if the original pte was valid and the base
>>>> +     * is different (to avoid freeing when permission is changed).
>>>> +     */
>>>> +    if ( pte_is_valid(orig_pte) &&
>>>> +         !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
>>> I'm puzzled by this 2nd check: A permission change would - I expect - only
>>> occur to a leaf entry. If the new entry is a super-page one and the old
>>> wasn't, don't you still need to free the sub-tree, no matter whether the
>>> MFNs are the same?
>> I expect the MFNs to differ in this scenario, so the old sub-tree will be freed.
> You expecting something isn't a good criteria. If it's possible, even if
> unexpected (by you), it needs dealing with correctly.
>
>> Based on your example (new entry is super-page and old entry isn't):
>> For old mapping (lets say, 4 KiB leaf) p2m_set_entry() walks all levels down
>> to L0, so we will have the following MMU page table walks:
>>     L2 PTE -> L1 PTE (MFN of L0 page table) -> L0 PTE -> RAM
>>
>> When new mapping (lets say, 2 MiB superpage) will be requested, p2m_set_entry()
>> will stop at L1 (the superpage level):
>>    L2 PTE -> L1 PTE (at this moment, L1 PTE points to L0 page table, which
>>                      points to RAM)
>> Then the old L1 PTE will be saved in 'orig_pte', then writes 'entry' with
>> the RAM MFN for the 2 MiB mapping. The walk becomes:
>>     L2 PTE -> L1 PTE -> RAM
>>
>> Therefore, 'entry' now holds an MFN pointing to RAM (superpage leaf). 'orig_pte'
>> still holds an MFN pointing to the L0 table (the old sub-tree). Since these MFNs
>> differ, the code calls p2m_free_subtree(p2m, orig_pte, …) and frees the old L0
>> sub-tree.
> A particular example doesn't help. All possible cases need handling correctly.

For sure, all possible cases need handling correctly, but I don't see any cases
except one you mentioned below where MFNs will be the same.

>
>>>    Plus consider the special case of MFN 0: If you clear
>>> an entry using MFN 0, you will find old and new PTEs' both having the same
>>> MFN.
>> Isn't this happen only when a mapping removal is explicitly requested?
>> In the case of a mapping removal it seems to me it is enough just to
>> clear PTE with all zeroes.
> Correct. Which means original MFN (PPN) and new MFN (PPN) would match.

Oh, I got it what is the issue here. If previously MFN 0 was mapped, then
it is going to be removed and considering that during removing MFN 0 is
used, we won't put MFN 0 page reference (mapped earlier) because
p2m_free_subtree() won't be called.

In this case, if-condidtion should be updated with:
   @@ -883,7 +890,8 @@ static int p2m_set_entry(struct p2m_domain *p2m,
         * is different (to avoid freeing when permission is changed).
         */
        if ( pte_is_valid(orig_pte) &&
   -         !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
   +         (!mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) ||
   +          (removing_mapping && mfn_eq(pte_get_mfn(*entry), mfn_t(0))) )

or call p2m_free_subentry() in remove_mapping handling:
   @@ -850,7 +852,12 @@ static int p2m_set_entry(struct p2m_domain *p2m,
        orig_pte = *entry;
  
        if ( removing_mapping )
   +    {
   +       if ( mfn_eq(pte_get_mfn(*entry), mfn_t(0) )
   +            p2m_free_subtree(p2m, orig_pte, level,  virt_to_page(table), offsets[level]);
   +
          p2m_clean_pte(entry, p2m->clean_pte);
   +    }
        else

~ Oleksii