This patch introduces a working implementation of p2m_free_subtree() for RISC-V
based on ARM's implementation of p2m_free_entry(), enabling proper cleanup
of page table entries in the P2M (physical-to-machine) mapping.
Only a few things are changed compared to Arm's version:
- Introduce and use p2m_get_type() to get the type of a p2m entry. RISC-V's
  PTE doesn't have enough spare bits to store all the necessary types, so a
  type may also have to be stored outside the PTE. At the moment, only the
  types which fit into the PTE's bits are handled (see the illustrative
  sketch at the end of this description).
Key additions include:
- p2m_free_subtree(): Recursively frees page table entries at all levels. It
handles both regular and superpage mappings and ensures that TLB entries
are flushed before freeing intermediate tables.
- p2m_put_page() and helpers:
- p2m_put_4k_page(): Drops the reference taken on a single 4K page; for now
  only foreign mappings are handled.
- p2m_put_2m_superpage(): Releases foreign page references in a 2MB
superpage.
- p2m_get_type(): Extracts the stored p2m_type from the PTE bits.
- p2m_free_page(): Returns a page to a domain's freelist.
- Introduce p2m_is_foreign() and the machinery connected to it.
Defines XEN_PT_ENTRIES in asm/page.h to simplify loops over page table
entries.
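For illustration only (not part of this patch), storing one of the "small"
types in PTE bits [9:8] could look roughly like the hypothetical counterpart
of p2m_get_type() sketched below; p2m_set_type() is not introduced here:

    /* Hypothetical sketch: store a type which fits into PTE bits [9:8]. */
    static void p2m_set_type(pte_t *pte, p2m_type_t t)
    {
        /* Types from p2m_ext_storage onwards would live outside the PTE. */
        ASSERT(t < p2m_first_external);

        pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
    }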
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V4:
- Stray blanks.
- Implement arch_flush_tlb_mask() to make the comment in p2m_put_foreign()
clear and explicit.
- Update the comment above p2m_is_ram() in p2m_put_4k_page() with an explanation
why p2m_is_ram() is used.
- Add a type check inside p2m_put_2m_superpage().
- Swap two conditions around in p2m_free_subtree():
if ( (level == 0) || pte_is_superpage(entry, level) )
- Add ASSERT() inside p2m_free_subtree() to check that level is <= 2; otherwise,
it could consume a lot of time and big memory usage because of recursion.
- Drop page_list_del() before p2m_free_page() as page_list_del() is called
inside p2m_free_page().
- Update p2m_freelist's total_pages when a page is added to p2m_freelist in
paging_free_page().
- Introduce P2M_SUPPORTED_LEVEL_MAPPING and use it in ASSERTs() which check
supported level.
- Use P2M_PAGETABLE_ENTRIES as XEN_PT_ENTRIES doesn't take into account
  that the G-stage root page table is extended by 2 bits (see the
  illustrative sketch after this changelog).
- Update prototype of p2m_put_page() to not have unnecessary changes later.
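For context, a purely hypothetical sketch of why XEN_PT_ENTRIES isn't
sufficient (the real P2M_PAGETABLE_ENTRIES definition lives in another patch
and all names below except XEN_PT_ENTRIES are made up): widening the G-stage
root page table by 2 bits gives the root level four times as many entries as
the other levels.

    /* Hypothetical illustration only. */
    #define GSTAGE_ROOT_EXTRA_BITS 2
    #define GSTAGE_PT_ENTRIES(lvl, root_lvl) \
        (XEN_PT_ENTRIES << (((lvl) == (root_lvl)) ? GSTAGE_ROOT_EXTRA_BITS : 0))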
---
Changes in V3:
- Use p2m_tlb_flush_sync(p2m) instead of p2m_force_tlb_flush_sync() in
p2m_free_subtree().
- Drop p2m_is_valid() implementation as pte_is_valid() is going to be used
instead.
- Drop p2m_is_superpage() and introduce pte_is_superpage() instead.
- s/p2m_free_entry/p2m_free_subtree.
- s/p2m_type_radix_get/p2m_get_type.
- Update the implementation of p2m_get_type() to get the type only from PTE
  bits; other cases will be covered in a separate patch. This requires the
  introduction of a new P2M_TYPE_PTE_BITS_MASK macro.
- Drop p2m argument of p2m_get_type() as it isn't needed anymore.
- Put cheapest checks first in p2m_is_superpage().
- Use switch() in p2m_put_page().
- Update the comment in p2m_put_foreign_page().
- Code style fixes.
- Move p2m_foreign stuff to this commit.
- Drop p2m argument of p2m_put_page() as it isn't used anymore.
---
Changes in V2:
- New patch. It was a part of the big patch "xen/riscv: implement p2m mapping
  functionality" which was split into smaller ones.
- s/p2m_is_superpage/p2me_is_superpage.
---
xen/arch/riscv/include/asm/flushtlb.h | 6 +-
xen/arch/riscv/include/asm/p2m.h | 18 ++-
xen/arch/riscv/include/asm/page.h | 6 +
xen/arch/riscv/include/asm/paging.h | 2 +
xen/arch/riscv/p2m.c | 152 +++++++++++++++++++++++++-
xen/arch/riscv/paging.c | 8 ++
xen/arch/riscv/stubs.c | 5 -
7 files changed, 187 insertions(+), 10 deletions(-)
diff --git a/xen/arch/riscv/include/asm/flushtlb.h b/xen/arch/riscv/include/asm/flushtlb.h
index e70badae0c..ab32311568 100644
--- a/xen/arch/riscv/include/asm/flushtlb.h
+++ b/xen/arch/riscv/include/asm/flushtlb.h
@@ -41,8 +41,10 @@ static inline void page_set_tlbflush_timestamp(struct page_info *page)
BUG_ON("unimplemented");
}
-/* Flush specified CPUs' TLBs */
-void arch_flush_tlb_mask(const cpumask_t *mask);
+static inline void arch_flush_tlb_mask(const cpumask_t *mask)
+{
+ sbi_remote_hfence_gvma(mask, 0, 0);
+}
#endif /* ASM__RISCV__FLUSHTLB_H */
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index 1a43736855..29685c7852 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -106,6 +106,8 @@ typedef enum {
p2m_mmio_direct_io, /* Read/write mapping of genuine Device MMIO area,
PTE_PBMT_IO will be used for such mappings */
p2m_ext_storage, /* Following types'll be stored outsude PTE bits: */
+ p2m_map_foreign_rw, /* Read/write RAM pages from foreign domain */
+ p2m_map_foreign_ro, /* Read-only RAM pages from foreign domain */
/* Sentinel — not a real type, just a marker for comparison */
p2m_first_external = p2m_ext_storage,
@@ -116,15 +118,29 @@ static inline p2m_type_t arch_dt_passthrough_p2m_type(void)
return p2m_mmio_direct_io;
}
+/*
+ * Bits 8 and 9 are reserved for use by supervisor software;
+ * the implementation shall ignore this field.
+ * We are going to use these bits to store frequently used types, to avoid
+ * getting/setting the type via the radix tree.
+ */
+#define P2M_TYPE_PTE_BITS_MASK 0x300
+
/* We use bitmaps and mask to handle groups of types */
#define p2m_to_mask(t) BIT(t, UL)
/* RAM types, which map to real machine frames */
#define P2M_RAM_TYPES (p2m_to_mask(p2m_ram_rw))
+/* Foreign mappings types */
+#define P2M_FOREIGN_TYPES (p2m_to_mask(p2m_map_foreign_rw) | \
+ p2m_to_mask(p2m_map_foreign_ro))
+
/* Useful predicates */
#define p2m_is_ram(t_) (p2m_to_mask(t_) & P2M_RAM_TYPES)
-#define p2m_is_any_ram(t_) (p2m_to_mask(t_) & P2M_RAM_TYPES)
+#define p2m_is_any_ram(t_) (p2m_to_mask(t_) & \
+ (P2M_RAM_TYPES | P2M_FOREIGN_TYPES))
+#define p2m_is_foreign(t) (p2m_to_mask(t) & P2M_FOREIGN_TYPES)
#include <xen/p2m-common.h>
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index 66cb192316..cb303af0c0 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -20,6 +20,7 @@
#define XEN_PT_LEVEL_SIZE(lvl) (_AT(paddr_t, 1) << XEN_PT_LEVEL_SHIFT(lvl))
#define XEN_PT_LEVEL_MAP_MASK(lvl) (~(XEN_PT_LEVEL_SIZE(lvl) - 1))
#define XEN_PT_LEVEL_MASK(lvl) (VPN_MASK << XEN_PT_LEVEL_SHIFT(lvl))
+#define XEN_PT_ENTRIES (_AT(unsigned int, 1) << PAGETABLE_ORDER)
/*
* PTE format:
@@ -182,6 +183,11 @@ static inline bool pte_is_mapping(pte_t p)
return (p.pte & PTE_VALID) && (p.pte & PTE_ACCESS_MASK);
}
+static inline bool pte_is_superpage(pte_t p, unsigned int level)
+{
+ return (level > 0) && pte_is_mapping(p);
+}
+
static inline int clean_and_invalidate_dcache_va_range(const void *p,
unsigned long size)
{
diff --git a/xen/arch/riscv/include/asm/paging.h b/xen/arch/riscv/include/asm/paging.h
index befad14f82..9712aa77c5 100644
--- a/xen/arch/riscv/include/asm/paging.h
+++ b/xen/arch/riscv/include/asm/paging.h
@@ -13,4 +13,6 @@ int paging_freelist_adjust(struct domain *d, unsigned long pages,
int paging_ret_pages_to_domheap(struct domain *d, unsigned int nr_pages);
int paging_ret_pages_to_freelist(struct domain *d, unsigned int nr_pages);
+void paging_free_page(struct domain *d, struct page_info *pg);
+
#endif /* ASM_RISCV_PAGING_H */
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index db9f7a77ff..10acfa0a9c 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -98,6 +98,8 @@ void __init gstage_mode_detect(void)
local_hfence_gvma_all();
}
+#define P2M_SUPPORTED_LEVEL_MAPPING 2
+
/*
* Force a synchronous P2M TLB flush.
*
@@ -286,6 +288,16 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
return __map_domain_page(p2m->root + root_table_indx);
}
+static p2m_type_t p2m_get_type(const pte_t pte)
+{
+ p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
+
+ if ( type == p2m_ext_storage )
+ panic("unimplemented\n");
+
+ return type;
+}
+
static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
{
write_pte(p, pte);
@@ -342,11 +354,147 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
return P2M_TABLE_MAP_NONE;
}
+static void p2m_put_foreign_page(struct page_info *pg)
+{
+ /*
+ * It’s safe to call put_page() here because arch_flush_tlb_mask()
+ * will be invoked if the page is reallocated before the end of
+ * this loop, which will trigger a flush of the guest TLBs.
+ */
+ put_page(pg);
+}
+
+/* Put any references on the single 4K page referenced by mfn. */
+static void p2m_put_4k_page(mfn_t mfn, p2m_type_t type)
+{
+ /* TODO: Handle other p2m types */
+
+ if ( p2m_is_foreign(type) )
+ {
+ ASSERT(mfn_valid(mfn));
+ p2m_put_foreign_page(mfn_to_page(mfn));
+ }
+}
+
+/* Put any references on the superpage referenced by mfn. */
+static void p2m_put_2m_superpage(mfn_t mfn, p2m_type_t type)
+{
+ struct page_info *pg;
+ unsigned int i;
+
+ /*
+ * TODO: Handle other p2m types, but be aware that any changes to handle
+ * different types should require an update on the relinquish code to
+ * handle preemption.
+ */
+ if ( !p2m_is_foreign(type) )
+ return;
+
+ ASSERT(mfn_valid(mfn));
+
+ pg = mfn_to_page(mfn);
+
+ for ( i = 0; i < P2M_PAGETABLE_ENTRIES(1); i++, pg++ )
+ p2m_put_foreign_page(pg);
+}
+
+/* Put any references on the page referenced by pte. */
+static void p2m_put_page(const pte_t pte, unsigned int level, p2m_type_t p2mt)
+{
+ mfn_t mfn = pte_get_mfn(pte);
+
+ ASSERT(pte_is_valid(pte));
+
+ /*
+ * TODO: Currently we don't handle level 2 super-page, Xen is not
+ * preemptible and therefore some work is needed to handle such
+ * superpages, for which at some point Xen might end up freeing memory
+ * and therefore for such a big mapping it could end up in a very long
+ * operation.
+ */
+ switch ( level )
+ {
+ case 1:
+ return p2m_put_2m_superpage(mfn, p2mt);
+
+ case 0:
+ return p2m_put_4k_page(mfn, p2mt);
+
+ default:
+ assert_failed("Unsupported level");
+ break;
+ }
+}
+
+static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg)
+{
+ page_list_del(pg, &p2m->pages);
+
+ paging_free_page(p2m->domain, pg);
+}
+
/* Free pte sub-tree behind an entry */
static void p2m_free_subtree(struct p2m_domain *p2m,
pte_t entry, unsigned int level)
{
- panic("%s: hasn't been implemented yet\n", __func__);
+ unsigned int i;
+ pte_t *table;
+ mfn_t mfn;
+ struct page_info *pg;
+
+ /*
+ * Check if the level is valid: only 4K - 2M - 1G mappings are supported.
+ * To support levels > 2, the implementation of p2m_free_subtree() would
+ * need to be updated, as the current recursive approach could consume
+ * excessive time and memory.
+ */
+ ASSERT(level <= P2M_SUPPORTED_LEVEL_MAPPING);
+
+ /* Nothing to do if the entry is invalid. */
+ if ( !pte_is_valid(entry) )
+ return;
+
+ if ( (level == 0) || pte_is_superpage(entry, level) )
+ {
+ p2m_type_t p2mt = p2m_get_type(entry);
+
+#ifdef CONFIG_IOREQ_SERVER
+ /*
+ * If this gets called then either the entry was replaced by an entry
+ * with a different base (valid case) or the shattering of a superpage
+ * has failed (error case).
+ * So, at worst, the spurious mapcache invalidation might be sent.
+ */
+ if ( p2m_is_ram(p2mt) &&
+ domain_has_ioreq_server(p2m->domain) )
+ ioreq_request_mapcache_invalidate(p2m->domain);
+#endif
+
+ p2m_put_page(entry, level, p2mt);
+
+ return;
+ }
+
+ table = map_domain_page(pte_get_mfn(entry));
+ for ( i = 0; i < P2M_PAGETABLE_ENTRIES(level); i++ )
+ p2m_free_subtree(p2m, table[i], level - 1);
+
+ unmap_domain_page(table);
+
+ /*
+ * Make sure all the references in the TLB have been removed before
+ * freeing the intermediate page table.
+ * XXX: Should we defer the free of the page table to avoid the
+ * flush?
+ */
+ p2m_tlb_flush_sync(p2m);
+
+ mfn = pte_get_mfn(entry);
+ ASSERT(mfn_valid(mfn));
+
+ pg = mfn_to_page(mfn);
+
+ p2m_free_page(p2m, pg);
}
/*
@@ -377,7 +525,7 @@ static int p2m_set_entry(struct p2m_domain *p2m,
* Check if the level target is valid: we only support
* 4K - 2M - 1G mapping.
*/
- ASSERT(target <= 2);
+ ASSERT(target <= P2M_SUPPORTED_LEVEL_MAPPING);
table = p2m_get_root_pointer(p2m, gfn);
if ( !table )
diff --git a/xen/arch/riscv/paging.c b/xen/arch/riscv/paging.c
index ed537fee07..049b850e03 100644
--- a/xen/arch/riscv/paging.c
+++ b/xen/arch/riscv/paging.c
@@ -107,6 +107,14 @@ int paging_ret_pages_to_domheap(struct domain *d, unsigned int nr_pages)
return 0;
}
+void paging_free_page(struct domain *d, struct page_info *pg)
+{
+ spin_lock(&d->arch.paging.lock);
+ page_list_add_tail(pg, &d->arch.paging.freelist);
+ ACCESS_ONCE(d->arch.paging.total_pages)++;
+ spin_unlock(&d->arch.paging.lock);
+}
+
/* Domain paging struct initialization. */
int paging_domain_init(struct domain *d)
{
diff --git a/xen/arch/riscv/stubs.c b/xen/arch/riscv/stubs.c
index 1a8c86cd8d..ad6fdbf501 100644
--- a/xen/arch/riscv/stubs.c
+++ b/xen/arch/riscv/stubs.c
@@ -65,11 +65,6 @@ int arch_monitor_domctl_event(struct domain *d,
/* smp.c */
-void arch_flush_tlb_mask(const cpumask_t *mask)
-{
- BUG_ON("unimplemented");
-}
-
void smp_send_event_check_mask(const cpumask_t *mask)
{
BUG_ON("unimplemented");
--
2.51.0
On 17.09.2025 23:55, Oleksii Kurochko wrote:
> @@ -342,11 +354,147 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
> return P2M_TABLE_MAP_NONE;
> }
>
> +static void p2m_put_foreign_page(struct page_info *pg)
> +{
> + /*
> + * It’s safe to call put_page() here because arch_flush_tlb_mask()
> + * will be invoked if the page is reallocated before the end of
> + * this loop, which will trigger a flush of the guest TLBs.
> + */
> + put_page(pg);
> +}
What is "this loop" referring to in the comment? There's no loop here.
> +/* Put any references on the page referenced by pte. */
> +static void p2m_put_page(const pte_t pte, unsigned int level, p2m_type_t p2mt)
> +{
> + mfn_t mfn = pte_get_mfn(pte);
> +
> + ASSERT(pte_is_valid(pte));
> +
> + /*
> + * TODO: Currently we don't handle level 2 super-page, Xen is not
> + * preemptible and therefore some work is needed to handle such
> + * superpages, for which at some point Xen might end up freeing memory
> + * and therefore for such a big mapping it could end up in a very long
> + * operation.
> + */
> + switch ( level )
> + {
> + case 1:
> + return p2m_put_2m_superpage(mfn, p2mt);
> +
> + case 0:
> + return p2m_put_4k_page(mfn, p2mt);
> +
> + default:
> + assert_failed("Unsupported level");
I don't think assert_failed() is supposed to be used directly. What's
wrong with using ASSERT_UNREACHABLE() here?
> --- a/xen/arch/riscv/paging.c
> +++ b/xen/arch/riscv/paging.c
> @@ -107,6 +107,14 @@ int paging_ret_pages_to_domheap(struct domain *d, unsigned int nr_pages)
> return 0;
> }
>
> +void paging_free_page(struct domain *d, struct page_info *pg)
> +{
> + spin_lock(&d->arch.paging.lock);
> + page_list_add_tail(pg, &d->arch.paging.freelist);
> + ACCESS_ONCE(d->arch.paging.total_pages)++;
More a question to other REST maintainers than to you: Is this kind of
use of ACCESS_ONCE() okay? By the wording, one might assume a single
memory access, yet only x86 can actually carry out an increment (or
alike) of an item in memory in a single insn.
Jan
On 9/20/25 1:57 AM, Jan Beulich wrote:
> On 17.09.2025 23:55, Oleksii Kurochko wrote:
>> @@ -342,11 +354,147 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>> return P2M_TABLE_MAP_NONE;
>> }
>>
>> +static void p2m_put_foreign_page(struct page_info *pg)
>> +{
>> + /*
>> + * It’s safe to call put_page() here because arch_flush_tlb_mask()
>> + * will be invoked if the page is reallocated before the end of
>> + * this loop, which will trigger a flush of the guest TLBs.
>> + */
>> + put_page(pg);
>> +}
> What is "this loop" referring to in the comment? There's no loop here.
The loop is inside the caller (p2m_put_2m_superpage()):
...
for ( i = 0; i < P2M_PAGETABLE_ENTRIES(1); i++, pg++ )
p2m_put_foreign_page(pg);
Agree, that comment is pretty confusing. I am not sure it is necessary to
mention a specific loop — the comment would still be correct without
referring to "this loop". So I will rewrite the comment as:
/*
* It’s safe to call put_page() here because arch_flush_tlb_mask()
* will be invoked if the page is reallocated, which will trigger a
* flush of the guest TLBs.
*/
>
>> +/* Put any references on the page referenced by pte. */
>> +static void p2m_put_page(const pte_t pte, unsigned int level, p2m_type_t p2mt)
>> +{
>> + mfn_t mfn = pte_get_mfn(pte);
>> +
>> + ASSERT(pte_is_valid(pte));
>> +
>> + /*
>> + * TODO: Currently we don't handle level 2 super-page, Xen is not
>> + * preemptible and therefore some work is needed to handle such
>> + * superpages, for which at some point Xen might end up freeing memory
>> + * and therefore for such a big mapping it could end up in a very long
>> + * operation.
>> + */
>> + switch ( level )
>> + {
>> + case 1:
>> + return p2m_put_2m_superpage(mfn, p2mt);
>> +
>> + case 0:
>> + return p2m_put_4k_page(mfn, p2mt);
>> +
>> + default:
>> + assert_failed("Unsupported level");
> I don't think assert_failed() is supposed to be used directly. What's
> wrong with using ASSERT_UNREACHABLE() here?
Nothing, I just wanted to have some custom message. I am okay with
ASSERT_UNREACHABLE(), anyway it will print where ASSERT occurred.
>
>> --- a/xen/arch/riscv/paging.c
>> +++ b/xen/arch/riscv/paging.c
>> @@ -107,6 +107,14 @@ int paging_ret_pages_to_domheap(struct domain *d, unsigned int nr_pages)
>> return 0;
>> }
>>
>> +void paging_free_page(struct domain *d, struct page_info *pg)
>> +{
>> + spin_lock(&d->arch.paging.lock);
>> + page_list_add_tail(pg, &d->arch.paging.freelist);
>> + ACCESS_ONCE(d->arch.paging.total_pages)++;
> More a question to other REST maintainers than to you: Is this kind of
> use of ACCESS_ONCE() okay? By the wording, one might assume a single
> memory access, yet only x86 can actually carry out an increment (or
> alike) of an item in memory in a single insn.
I thought that ACCESS_ONCE() is more about preventing compiler optimizations
than about ensuring atomicity.
In this specific case, I don’t think ACCESS_ONCE() is really needed since
a spin lock is already being used.
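Purely as a sketch of that point (not code from the patch):

    spin_lock(&d->arch.paging.lock);
    /*
     * Serialized by the paging lock, so a plain increment is already safe
     * against concurrent writers; ACCESS_ONCE() only constrains the compiler
     * to a single load and a single store, it doesn't make the
     * read-modify-write atomic anyway.
     */
    d->arch.paging.total_pages++;
    spin_unlock(&d->arch.paging.lock);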
~ Oleksii
On 26.09.2025 17:33, Oleksii Kurochko wrote:
> On 9/20/25 1:57 AM, Jan Beulich wrote:
>> On 17.09.2025 23:55, Oleksii Kurochko wrote:
>>> +/* Put any references on the page referenced by pte. */
>>> +static void p2m_put_page(const pte_t pte, unsigned int level, p2m_type_t p2mt)
>>> +{
>>> + mfn_t mfn = pte_get_mfn(pte);
>>> +
>>> + ASSERT(pte_is_valid(pte));
>>> +
>>> + /*
>>> + * TODO: Currently we don't handle level 2 super-page, Xen is not
>>> + * preemptible and therefore some work is needed to handle such
>>> + * superpages, for which at some point Xen might end up freeing memory
>>> + * and therefore for such a big mapping it could end up in a very long
>>> + * operation.
>>> + */
>>> + switch ( level )
>>> + {
>>> + case 1:
>>> + return p2m_put_2m_superpage(mfn, p2mt);
>>> +
>>> + case 0:
>>> + return p2m_put_4k_page(mfn, p2mt);
>>> +
>>> + default:
>>> + assert_failed("Unsupported level");
>> I don't think assert_failed() is supposed to be used directly. What's
>> wrong with using ASSERT_UNREACHABLE() here?
>
> Nothing, I just wanted to have some custom message. I am okay with
> ASSERT_UNREACHABLE(), anyway it will print where ASSERT occurred.
Just fyi, the (kind of) "canonical" way of having a custom message emitted
from an assertion is ASSERT(!"<message text>").
Jan