RISC-V's PTE has only two available bits that can be used to store the P2M
type. This is insufficient to represent all the current RISC-V P2M types.
Therefore, some P2M types must be stored outside the PTE bits.
To address this, a metadata table is introduced to store P2M types that
cannot fit in the PTE itself. Not all P2M types are stored in the
metadata table—only those that require it.
The metadata table is linked to the intermediate page table via the
`struct page_info`'s v.md.pg field of the corresponding intermediate
page.
Such pages are allocated with MEMF_no_owner, which allows us to use
the v field for the purpose of storing the metadata table.
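For illustration, looking up a type kept in external storage then is, roughly
(a simplified sketch of what this patch implements; `pt` stands for an
intermediate page-table page and `i` for the index of the PTE within it):

    const pte_t *md = __map_domain_page(pt->v.md.pg);
    p2m_type_t t = md[i].pte;

    unmap_domain_page(md);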
To simplify freeing an intermediate page table together with its linked
metadata page, a `p2m_free_table()` helper is implemented; allocation is
handled by `p2m_alloc_page()`.
These changes impact `p2m_split_superpage()`: when a superpage whose
PTE type bits are set to `p2m_ext_storage` is split, the metadata table
of the new intermediate page table must be updated, and the
corresponding entry in the metadata for the original superpage is
invalidated.
Also, update p2m_{get,set}_type() to work with P2M types which don't fit
into the PTE bits.
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
- Rename the metadata member of struct md inside struct page_info to pg.
- Drop a stray blank in the declaration of p2m_alloc_table().
- Use "<" instead of "<=" in ASSERT() in p2m_set_type().
- Move the check that ctx is provided to an earlier point in
p2m_set_type().
- Set `md_pg` after ASSERT() in p2m_set_type().
- Add BUG_ON() instead of ASSERT_UNREACHABLE() in p2m_set_type().
- Drop a check that metadata isn't NULL before unmap_domain_page() is
being called.
- Make const `md` variable in p2m_get_type().
- Unmap the correct domain page in p2m_get_type(): use `md` instead of
ctx->pt_page->v.md.pg.
- Add a description of how p2m and p2m_pte_ctx are expected to be used
in p2m_pte_from_mfn() and drop a comment from page_to_p2m_table().
- Drop the stale part of the comment above p2m_alloc_table().
- Drop ASSERT(tbl_pg->v.md.pg) from p2m_free_table() as tbl_pg->v.md.pg
is created conditionally now.
- Drop an introduction of p2m_alloc_table(), update p2m_alloc_page()
correspondingly and use it instead.
- Add missing blank in definition of level member for tmp_ctx variable
in p2m_free_subtree(). Also, add the comma at the end.
- Initialize old_type once before for-loop in p2m_split_superpage() as
old type will be used for all newly created PTEs.
- Properly initialize p2m_pte_ctx.level with next_level instead of
level when p2m_set_type() is going to be called for new PTEs.
- Fix indentations.
- Move ASSERT(p2m) on top of p2m_set_type() to be sure that NULL isn't
passed for p2m argument of p2m_set_type().
- s/virt_to_page(table)/mfn_to_page(domain_page_map_to_mfn(table))
to receive the correct page for a table which is mapped by map_domain_page().
- Add "return;" after domain_crash() in p2m_set_type() to avoid potential
NULL pointer dereference of md_pg.
---
Changes in V4:
- Add Suggested-by: Jan Beulich <jbeulich@suse.com>.
- Update the comment above the declaration of the md structure inside struct page_info to:
"Page is used as an intermediate P2M page table".
- Allocate metadata table on demand to save some memory. (1)
- Rework p2m_set_type():
- Add allocation of the metadata page only if needed.
- Move the check of what kind of type is being handled inside p2m_set_type().
- Move mapping of metadata page inside p2m_get_type() as it is needed only
when the PTE's type is equal to p2m_ext_storage.
- Add some description to p2m_get_type() function.
- Drop blank after return type of p2m_alloc_table().
- Drop allocation of the metadata page inside p2m_alloc_table() because of (1).
- Fix p2m_free_table() to free metadata page only if it was allocated.
---
Changes in V3:
- Add is_p2m_foreign() macro and connected stuff.
- Change struct domain *d argument of p2m_get_page_from_gfn() to
struct p2m_domain.
- Update the comment above p2m_get_entry().
- s/_t/p2mt for local variable in p2m_get_entry().
- Drop local variable addr in p2m_get_entry() and use gfn_to_gaddr(gfn)
to define offsets array.
- Code style fixes.
- Update a check of rc code from p2m_next_level() in p2m_get_entry()
and drop "else" case.
- Do not call p2m_get_type() if p2m_get_entry()'s t argument is NULL.
- Use struct p2m_domain instead of struct domain for p2m_lookup() and
p2m_get_page_from_gfn().
- Move definition of get_page() from "xen/riscv: implement mfn_valid() and page reference, ownership handling helpers"
---
Changes in V2:
- New patch.
---
xen/arch/riscv/include/asm/mm.h | 9 ++
xen/arch/riscv/p2m.c | 223 +++++++++++++++++++++++++++-----
2 files changed, 198 insertions(+), 34 deletions(-)
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index 1b16809749..b18892e4fc 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -149,6 +149,15 @@ struct page_info
/* Order-size of the free chunk this page is the head of. */
unsigned int order;
} free;
+
+ /* Page is used as an intermediate P2M page table */
+ struct {
+ /*
+ * Pointer to a page which stores metadata for an intermediate page
+ * table.
+ */
+ struct page_info *pg;
+ } md;
} v;
union {
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 785d11aaff..c8112faacb 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -20,6 +20,16 @@
#define P2M_SUPPORTED_LEVEL_MAPPING 2
+/*
+ * P2M PTE context is used only when a PTE's P2M type is p2m_ext_storage.
+ * In this case, the P2M type is stored separately in the metadata page.
+ */
+struct p2m_pte_ctx {
+ struct page_info *pt_page; /* Page table page containing the PTE. */
+ unsigned int index; /* Index of the PTE within that page. */
+ unsigned int level; /* Paging level at which the PTE resides. */
+};
+
unsigned char __ro_after_init gstage_mode;
unsigned int __ro_after_init gstage_root_level;
@@ -363,24 +373,89 @@ static struct page_info *p2m_alloc_page(struct p2m_domain *p2m)
return pg;
}
-static int p2m_set_type(pte_t *pte, p2m_type_t t)
+/*
+ * `pte` – PTE entry for which the type `t` will be stored.
+ *
+ * If `t` is `p2m_ext_storage`, both `ctx` and `p2m` must be provided;
+ * otherwise, only p2m may be NULL.
+ */
+static void p2m_set_type(pte_t *pte, const p2m_type_t t,
+ struct p2m_pte_ctx *ctx,
+ struct p2m_domain *p2m)
{
- int rc = 0;
+ struct page_info **md_pg;
+ pte_t *metadata = NULL;
- if ( t > p2m_first_external )
- panic("unimplemeted\n");
- else
+ ASSERT(p2m);
+
+ /* Be sure that an index corresponding to the page level is passed. */
+ ASSERT(ctx && ctx->index < P2M_PAGETABLE_ENTRIES(ctx->level));
+
+ /*
+ * For the root page table (16 KB in size), we need to select the correct
+ * metadata table, since allocations are 4 KB each. In total, there are
+ * 4 tables of 4 KB each.
+ * For a non-root page table, the index into ->pt_page[] will always be
+ * 0, as the index won't be higher than 511. The ASSERT() above verifies that.
+ */
+ md_pg = &ctx->pt_page[ctx->index / PAGETABLE_ENTRIES].v.md.pg;
+
+ if ( !*md_pg && (t >= p2m_first_external) )
+ {
+ BUG_ON(ctx->level > P2M_SUPPORTED_LEVEL_MAPPING);
+
+ if ( ctx->level <= P2M_SUPPORTED_LEVEL_MAPPING )
+ {
+ struct domain *d = p2m->domain;
+
+ *md_pg = p2m_alloc_page(p2m);
+ if ( !*md_pg )
+ {
+ printk("%s: can't allocate extra memory for dom%d\n",
+ __func__, d->domain_id);
+ domain_crash(d);
+
+ return;
+ }
+ }
+ }
+
+ if ( *md_pg )
+ metadata = __map_domain_page(*md_pg);
+
+ if ( t < p2m_first_external )
+ {
pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
- return rc;
+ if ( metadata )
+ metadata[ctx->index].pte = p2m_invalid;
+ }
+ else
+ {
+ pte->pte |= MASK_INSR(p2m_ext_storage, P2M_TYPE_PTE_BITS_MASK);
+
+ metadata[ctx->index].pte = t;
+ }
+
+ unmap_domain_page(metadata);
}
-static p2m_type_t p2m_get_type(const pte_t pte)
+/*
+ * `pte` -> PTE entry that stores the PTE's type.
+ *
+ * If the PTE's type is `p2m_ext_storage`, `ctx` should be provided;
+ * otherwise it could be NULL.
+ */
+static p2m_type_t p2m_get_type(const pte_t pte, const struct p2m_pte_ctx *ctx)
{
p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
if ( type == p2m_ext_storage )
- panic("unimplemented\n");
+ {
+ const pte_t *md = __map_domain_page(ctx->pt_page->v.md.pg);
+ type = md[ctx->index].pte;
+ unmap_domain_page(md);
+ }
return type;
}
@@ -470,7 +545,15 @@ static void p2m_set_permission(pte_t *e, p2m_type_t t)
}
}
-static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
+/*
+ * If p2m_pte_from_mfn() is called with p2m_pte_ctx = NULL and p2m = NULL,
+ * it means the function is working with a page table for which the `t`
+ * should not be applicable. Otherwise, the function is handling a leaf PTE
+ * for which `t` is applicable.
+ */
+static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t,
+ struct p2m_pte_ctx *p2m_pte_ctx,
+ struct p2m_domain *p2m)
{
pte_t e = (pte_t) { PTE_VALID };
@@ -478,7 +561,7 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK) || mfn_eq(mfn, INVALID_MFN));
- if ( !is_table )
+ if ( p2m_pte_ctx && p2m )
{
switch ( t )
{
@@ -491,7 +574,7 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
}
p2m_set_permission(&e, t);
- p2m_set_type(&e, t);
+ p2m_set_type(&e, t, p2m_pte_ctx, p2m);
}
else
/*
@@ -506,12 +589,19 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
/* Generate table entry with correct attributes. */
static pte_t page_to_p2m_table(const struct page_info *page)
{
- /*
- * p2m_invalid will be ignored inside p2m_pte_from_mfn() as is_table is
- * set to true and p2m_type_t shouldn't be applied for PTEs which
- * describe an intermidiate table.
- */
- return p2m_pte_from_mfn(page_to_mfn(page), p2m_invalid, true);
+ return p2m_pte_from_mfn(page_to_mfn(page), p2m_invalid, NULL, NULL);
+}
+
+static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg);
+
+/*
+ * Free a page table's page and the metadata page linked to it.
+ */
+static void p2m_free_table(struct p2m_domain *p2m, struct page_info *tbl_pg)
+{
+ if ( tbl_pg->v.md.pg )
+ p2m_free_page(p2m, tbl_pg->v.md.pg);
+ p2m_free_page(p2m, tbl_pg);
}
/* Allocate a new page table page and hook it in via the given entry. */
@@ -673,12 +763,14 @@ static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg)
/* Free pte sub-tree behind an entry */
static void p2m_free_subtree(struct p2m_domain *p2m,
- pte_t entry, unsigned int level)
+ pte_t entry,
+ const struct p2m_pte_ctx *p2m_pte_ctx)
{
unsigned int i;
pte_t *table;
mfn_t mfn;
struct page_info *pg;
+ unsigned int level = p2m_pte_ctx->level;
/*
* Check if the level is valid: only 4K - 2M - 1G mappings are supported.
@@ -694,7 +786,7 @@ static void p2m_free_subtree(struct p2m_domain *p2m,
if ( (level == 0) || pte_is_superpage(entry, level) )
{
- p2m_type_t p2mt = p2m_get_type(entry);
+ p2m_type_t p2mt = p2m_get_type(entry, p2m_pte_ctx);
#ifdef CONFIG_IOREQ_SERVER
/*
@@ -713,9 +805,21 @@ static void p2m_free_subtree(struct p2m_domain *p2m,
return;
}
- table = map_domain_page(pte_get_mfn(entry));
+ mfn = pte_get_mfn(entry);
+ ASSERT(mfn_valid(mfn));
+ table = map_domain_page(mfn);
+ pg = mfn_to_page(mfn);
+
for ( i = 0; i < P2M_PAGETABLE_ENTRIES(level); i++ )
- p2m_free_subtree(p2m, table[i], level - 1);
+ {
+ struct p2m_pte_ctx tmp_ctx = {
+ .pt_page = pg,
+ .index = i,
+ .level = level - 1,
+ };
+
+ p2m_free_subtree(p2m, table[i], &tmp_ctx);
+ }
unmap_domain_page(table);
@@ -727,17 +831,13 @@ static void p2m_free_subtree(struct p2m_domain *p2m,
*/
p2m_tlb_flush_sync(p2m);
- mfn = pte_get_mfn(entry);
- ASSERT(mfn_valid(mfn));
-
- pg = mfn_to_page(mfn);
-
- p2m_free_page(p2m, pg);
+ p2m_free_table(p2m, pg);
}
static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
unsigned int level, unsigned int target,
- const unsigned int *offsets)
+ const unsigned int *offsets,
+ struct page_info *tbl_pg)
{
struct page_info *page;
unsigned long i;
@@ -749,6 +849,10 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
unsigned int next_level = level - 1;
unsigned int level_order = P2M_LEVEL_ORDER(next_level);
+ struct p2m_pte_ctx p2m_pte_ctx;
+ /* Init with p2m_invalid just to make compiler happy. */
+ p2m_type_t old_type = p2m_invalid;
+
/*
* This should only be called with target != level and the entry is
* a superpage.
@@ -770,6 +874,19 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
table = __map_domain_page(page);
+ if ( MASK_EXTR(entry->pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage )
+ {
+ p2m_pte_ctx.pt_page = tbl_pg;
+ p2m_pte_ctx.index = offsets[level];
+ /*
+ * The value of level doesn't really matter here, as p2m_get_type()
+ * doesn't need it, so it is initialized just in case.
+ */
+ p2m_pte_ctx.level = level;
+
+ old_type = p2m_get_type(*entry, &p2m_pte_ctx);
+ }
+
for ( i = 0; i < P2M_PAGETABLE_ENTRIES(next_level); i++ )
{
pte_t *new_entry = table + i;
@@ -781,6 +898,15 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
pte = *entry;
pte_set_mfn(&pte, mfn_add(mfn, i << level_order));
+ if ( MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage )
+ {
+ p2m_pte_ctx.pt_page = page;
+ p2m_pte_ctx.index = i;
+ p2m_pte_ctx.level = next_level;
+
+ p2m_set_type(&pte, old_type, &p2m_pte_ctx, p2m);
+ }
+
write_pte(new_entry, pte);
}
@@ -792,7 +918,7 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
*/
if ( next_level != target )
rv = p2m_split_superpage(p2m, table + offsets[next_level],
- next_level, target, offsets);
+ next_level, target, offsets, page);
if ( p2m->clean_dcache )
clean_dcache_va_range(table, PAGE_SIZE);
@@ -883,13 +1009,21 @@ static int p2m_set_entry(struct p2m_domain *p2m,
{
/* We need to split the original page. */
pte_t split_pte = *entry;
+ struct page_info *tbl_pg = mfn_to_page(domain_page_map_to_mfn(table));
ASSERT(pte_is_superpage(*entry, level));
- if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
+ if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets,
+ tbl_pg) )
{
+ struct p2m_pte_ctx tmp_ctx = {
+ .pt_page = tbl_pg,
+ .index = offsets[level],
+ .level = level,
+ };
+
/* Free the allocated sub-tree */
- p2m_free_subtree(p2m, split_pte, level);
+ p2m_free_subtree(p2m, split_pte, &tmp_ctx);
rc = -ENOMEM;
goto out;
@@ -927,7 +1061,13 @@ static int p2m_set_entry(struct p2m_domain *p2m,
p2m_clean_pte(entry, p2m->clean_dcache);
else
{
- pte_t pte = p2m_pte_from_mfn(mfn, t, false);
+ struct p2m_pte_ctx tmp_ctx = {
+ .pt_page = mfn_to_page(domain_page_map_to_mfn(table)),
+ .index = offsets[level],
+ .level = level,
+ };
+
+ pte_t pte = p2m_pte_from_mfn(mfn, t, &tmp_ctx, p2m);
p2m_write_pte(entry, pte, p2m->clean_dcache);
@@ -963,7 +1103,15 @@ static int p2m_set_entry(struct p2m_domain *p2m,
if ( pte_is_valid(orig_pte) &&
(!pte_is_valid(*entry) ||
!mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte))) )
- p2m_free_subtree(p2m, orig_pte, level);
+ {
+ struct p2m_pte_ctx tmp_ctx = {
+ .pt_page = mfn_to_page(domain_page_map_to_mfn(table)),
+ .index = offsets[level],
+ .level = level,
+ };
+
+ p2m_free_subtree(p2m, orig_pte, &tmp_ctx);
+ }
out:
unmap_domain_page(table);
@@ -1153,7 +1301,14 @@ static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
if ( pte_is_valid(entry) )
{
if ( t )
- *t = p2m_get_type(entry);
+ {
+ struct p2m_pte_ctx p2m_pte_ctx = {
+ .pt_page = mfn_to_page(domain_page_map_to_mfn(table)),
+ .index = offsets[level],
+ };
+
+ *t = p2m_get_type(entry, &p2m_pte_ctx);
+ }
mfn = pte_get_mfn(entry);
--
2.51.0
On 20.10.2025 17:58, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -20,6 +20,16 @@
>
> #define P2M_SUPPORTED_LEVEL_MAPPING 2
>
> +/*
> + * P2M PTE context is used only when a PTE's P2M type is p2m_ext_storage.
> + * In this case, the P2M type is stored separately in the metadata page.
> + */
> +struct p2m_pte_ctx {
> + struct page_info *pt_page; /* Page table page containing the PTE. */
> + unsigned int index; /* Index of the PTE within that page. */
> + unsigned int level; /* Paging level at which the PTE resides. */
> +};
> +
> unsigned char __ro_after_init gstage_mode;
> unsigned int __ro_after_init gstage_root_level;
>
> @@ -363,24 +373,89 @@ static struct page_info *p2m_alloc_page(struct p2m_domain *p2m)
> return pg;
> }
>
> -static int p2m_set_type(pte_t *pte, p2m_type_t t)
> +/*
> + * `pte` – PTE entry for which the type `t` will be stored.
> + *
> + * If `t` is `p2m_ext_storage`, both `ctx` and `p2m` must be provided;
> + * otherwise, only p2m may be NULL.
> + */
> +static void p2m_set_type(pte_t *pte, const p2m_type_t t,
> + struct p2m_pte_ctx *ctx,
> + struct p2m_domain *p2m)
> {
> - int rc = 0;
> + struct page_info **md_pg;
> + pte_t *metadata = NULL;
I'm not convinced it is a good idea to re-use pte_t for this purpose. If you used
a separate type, and if then you defined that as a bitfield with only a few bits
dedicated to type, future changes (additions) may be quite a bit easier.
> - if ( t > p2m_first_external )
> - panic("unimplemeted\n");
> - else
> + ASSERT(p2m);
> +
> + /* Be sure that an index correspondent to page level is passed. */
> + ASSERT(ctx && ctx->index < P2M_PAGETABLE_ENTRIES(ctx->level));
> +
> + /*
> + * For the root page table (16 KB in size), we need to select the correct
> + * metadata table, since allocations are 4 KB each. In total, there are
> + * 4 tables of 4 KB each.
> + * For none-root page table index of ->pt_page[] will be always 0 as
> + * index won't be higher then 511. ASSERT() above verifies that.
> + */
> + md_pg = &ctx->pt_page[ctx->index / PAGETABLE_ENTRIES].v.md.pg;
> +
> + if ( !*md_pg && (t >= p2m_first_external) )
> + {
> + BUG_ON(ctx->level > P2M_SUPPORTED_LEVEL_MAPPING);
> +
> + if ( ctx->level <= P2M_SUPPORTED_LEVEL_MAPPING )
> + {
> + struct domain *d = p2m->domain;
This is (if at all) needed only ...
> + *md_pg = p2m_alloc_page(p2m);
> + if ( !*md_pg )
> + {
... in this more narrow scope.
> + printk("%s: can't allocate extra memory for dom%d\n",
> + __func__, d->domain_id);
The logging text isn't specific enough for my taste. For ordinary printk()s I'd
also recommend against use of __func__ (that's fine for dprintk()).
Also please use %pd in such cases.
> + domain_crash(d);
> +
> + return;
> + }
> + }
> + }
> +
> + if ( *md_pg )
> + metadata = __map_domain_page(*md_pg);
> +
> + if ( t < p2m_first_external )
> + {
> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>
> - return rc;
> + if ( metadata )
> + metadata[ctx->index].pte = p2m_invalid;
Shouldn't this be accompanied with a BUILD_BUG_ON(p2m_invalid), as otherwise
p2m_alloc_page()'s clearing of the page won't have the intended effect?
> + }
> + else
> + {
> + pte->pte |= MASK_INSR(p2m_ext_storage, P2M_TYPE_PTE_BITS_MASK);
> +
> + metadata[ctx->index].pte = t;
If you set t to p2m_ext_storage here, the pte->pte updating could be moved ...
> + }
... here, covering both cases. Overall this may then be easier as
if ( t >= p2m_first_external )
metadata[ctx->index].pte = t;
else if ( metadata )
metadata[ctx->index].pte = p2m_invalid;
pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
Then raising the question whether it couldn't still be the real type that's
stored in metadata[] even for t < p2m_first_external. That would further
reduce conditionals.
> + unmap_domain_page(metadata);
> }
>
> -static p2m_type_t p2m_get_type(const pte_t pte)
> +/*
> + * `pte` -> PTE entry that stores the PTE's type.
> + *
> + * If the PTE's type is `p2m_ext_storage`, `ctx` should be provided;
> + * otherwise it could be NULL.
> + */
> +static p2m_type_t p2m_get_type(const pte_t pte, const struct p2m_pte_ctx *ctx)
> {
> p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
>
> if ( type == p2m_ext_storage )
> - panic("unimplemented\n");
> + {
> + const pte_t *md = __map_domain_page(ctx->pt_page->v.md.pg);
> + type = md[ctx->index].pte;
> + unmap_domain_page(md);
Nit (style): Blank line please between declaration(s) and statement(s).
> @@ -470,7 +545,15 @@ static void p2m_set_permission(pte_t *e, p2m_type_t t)
> }
> }
>
> -static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
> +/*
> + * If p2m_pte_from_mfn() is called with p2m_pte_ctx = NULL and p2m = NULL,
> + * it means the function is working with a page table for which the `t`
> + * should not be applicable. Otherwise, the function is handling a leaf PTE
> + * for which `t` is applicable.
> + */
> +static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t,
> + struct p2m_pte_ctx *p2m_pte_ctx,
> + struct p2m_domain *p2m)
> {
> pte_t e = (pte_t) { PTE_VALID };
>
> @@ -478,7 +561,7 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
>
> ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK) || mfn_eq(mfn, INVALID_MFN));
>
> - if ( !is_table )
> + if ( p2m_pte_ctx && p2m )
> {
Maybe better
if ( p2m_pte_ctx )
{
ASSERT(p2m);
...
(if you really think the 2nd check is needed)?
> @@ -506,12 +589,19 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
> /* Generate table entry with correct attributes. */
> static pte_t page_to_p2m_table(const struct page_info *page)
> {
> - /*
> - * p2m_invalid will be ignored inside p2m_pte_from_mfn() as is_table is
> - * set to true and p2m_type_t shouldn't be applied for PTEs which
> - * describe an intermidiate table.
> - */
> - return p2m_pte_from_mfn(page_to_mfn(page), p2m_invalid, true);
> + return p2m_pte_from_mfn(page_to_mfn(page), p2m_invalid, NULL, NULL);
> +}
How come the comment is dropped? If you deem it unnecessary, why was it added
earlier in this same series?
> +static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg);
> +
> +/*
> + * Free page table's page and metadata page linked to page table's page.
> + */
> +static void p2m_free_table(struct p2m_domain *p2m, struct page_info *tbl_pg)
> +{
> + if ( tbl_pg->v.md.pg )
> + p2m_free_page(p2m, tbl_pg->v.md.pg);
To play safe, maybe better also clear tbl_pg->v.md.pg?
> @@ -749,6 +849,10 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
> unsigned int next_level = level - 1;
> unsigned int level_order = P2M_LEVEL_ORDER(next_level);
>
> + struct p2m_pte_ctx p2m_pte_ctx;
I think this would better be one variable instance per scope where it's needed,
and then using an initializer. Or else ...
> + /* Init with p2m_invalid just to make compiler happy. */
> + p2m_type_t old_type = p2m_invalid;
> +
> /*
> * This should only be called with target != level and the entry is
> * a superpage.
> @@ -770,6 +874,19 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>
> table = __map_domain_page(page);
>
> + if ( MASK_EXTR(entry->pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage )
> + {
> + p2m_pte_ctx.pt_page = tbl_pg;
> + p2m_pte_ctx.index = offsets[level];
> + /*
> + * It doesn't really matter what is a value for a level as
> + * p2m_get_type() doesn't need it, so it is initialized just in case.
> + */
> + p2m_pte_ctx.level = level;
> +
> + old_type = p2m_get_type(*entry, &p2m_pte_ctx);
> + }
> +
> for ( i = 0; i < P2M_PAGETABLE_ENTRIES(next_level); i++ )
> {
> pte_t *new_entry = table + i;
> @@ -781,6 +898,15 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
> pte = *entry;
> pte_set_mfn(&pte, mfn_add(mfn, i << level_order));
>
> + if ( MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage )
> + {
> + p2m_pte_ctx.pt_page = page;
> + p2m_pte_ctx.index = i;
> + p2m_pte_ctx.level = next_level;
... why are the loop-invariant fields not filled ahead of the loop here?
> @@ -927,7 +1061,13 @@ static int p2m_set_entry(struct p2m_domain *p2m,
> p2m_clean_pte(entry, p2m->clean_dcache);
> else
> {
> - pte_t pte = p2m_pte_from_mfn(mfn, t, false);
> + struct p2m_pte_ctx tmp_ctx = {
> + .pt_page = mfn_to_page(domain_page_map_to_mfn(table)),
> + .index = offsets[level],
Nit: Stray blank.
> @@ -1153,7 +1301,14 @@ static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
> if ( pte_is_valid(entry) )
> {
> if ( t )
> - *t = p2m_get_type(entry);
> + {
> + struct p2m_pte_ctx p2m_pte_ctx = {
> + .pt_page = mfn_to_page(domain_page_map_to_mfn(table)),
> + .index = offsets[level],
> + };
.level not being set here?
Jan
On 11/12/25 12:49 PM, Jan Beulich wrote:
> On 20.10.2025 17:58, Oleksii Kurochko wrote:
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -20,6 +20,16 @@
>>
>> #define P2M_SUPPORTED_LEVEL_MAPPING 2
>>
>> +/*
>> + * P2M PTE context is used only when a PTE's P2M type is p2m_ext_storage.
>> + * In this case, the P2M type is stored separately in the metadata page.
>> + */
>> +struct p2m_pte_ctx {
>> + struct page_info *pt_page; /* Page table page containing the PTE. */
>> + unsigned int index; /* Index of the PTE within that page. */
>> + unsigned int level; /* Paging level at which the PTE resides. */
>> +};
>> +
>> unsigned char __ro_after_init gstage_mode;
>> unsigned int __ro_after_init gstage_root_level;
>>
>> @@ -363,24 +373,89 @@ static struct page_info *p2m_alloc_page(struct p2m_domain *p2m)
>> return pg;
>> }
>>
>> -static int p2m_set_type(pte_t *pte, p2m_type_t t)
>> +/*
>> + * `pte` – PTE entry for which the type `t` will be stored.
>> + *
>> + * If `t` is `p2m_ext_storage`, both `ctx` and `p2m` must be provided;
>> + * otherwise, only p2m may be NULL.
>> + */
>> +static void p2m_set_type(pte_t *pte, const p2m_type_t t,
>> + struct p2m_pte_ctx *ctx,
>> + struct p2m_domain *p2m)
>> {
>> - int rc = 0;
>> + struct page_info **md_pg;
>> + pte_t *metadata = NULL;
> I'm not convinced it is a good idea to re-use pte_t for this purpose. If you used
> a separate type, and if then you defined that as a bitfield with only a few bits
> dedicated to type, future changes (additions) may be quite a bit easier.
Makes sense; then let's go with the following structure:
struct md_t {
/*
* Describes a type stored outside the PTE.
* Look at the comment above definition of enum p2m_type_t.
*/
p2m_type_t type : 4;
};
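The metadata accesses in p2m_{get,set}_type() would then become, roughly
(sketch, using the names from the patch):

    struct md_t *metadata = __map_domain_page(*md_pg);

    metadata[ctx->index].type = t;
    unmap_domain_page(metadata);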
>
>> - if ( t > p2m_first_external )
>> - panic("unimplemeted\n");
>> - else
>> + ASSERT(p2m);
>> +
>> + /* Be sure that an index correspondent to page level is passed. */
>> + ASSERT(ctx && ctx->index < P2M_PAGETABLE_ENTRIES(ctx->level));
>> +
>> + /*
>> + * For the root page table (16 KB in size), we need to select the correct
>> + * metadata table, since allocations are 4 KB each. In total, there are
>> + * 4 tables of 4 KB each.
>> + * For none-root page table index of ->pt_page[] will be always 0 as
>> + * index won't be higher then 511. ASSERT() above verifies that.
>> + */
>> + md_pg = &ctx->pt_page[ctx->index / PAGETABLE_ENTRIES].v.md.pg;
>> +
>> + if ( !*md_pg && (t >= p2m_first_external) )
>> + {
>> + BUG_ON(ctx->level > P2M_SUPPORTED_LEVEL_MAPPING);
>> +
>> + if ( ctx->level <= P2M_SUPPORTED_LEVEL_MAPPING )
>> + {
>> + struct domain *d = p2m->domain;
> This is (if at all) needed only ...
>
>> + *md_pg = p2m_alloc_page(p2m);
>> + if ( !*md_pg )
>> + {
> ... in this more narrow scope.
>
>> + printk("%s: can't allocate extra memory for dom%d\n",
>> + __func__, d->domain_id);
> The logging text isn't specific enough for my taste. For ordinary printk()s I'd
> also recommend against use of __func__ (that's fine for dprintk()).
I will update the message to:
printk("%pd: can't allocate metadata page\n", p2m->domain);
>
> Also please us %pd in such cases.
>
>> + domain_crash(d);
>> +
>> + return;
>> + }
>> + }
>> + }
>> +
>> + if ( *md_pg )
>> + metadata = __map_domain_page(*md_pg);
>> +
>> + if ( t < p2m_first_external )
>> + {
>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>
>> - return rc;
>> + if ( metadata )
>> + metadata[ctx->index].pte = p2m_invalid;
> Shouldn't this be accompanied with a BUILD_BUG_ON(p2m_invalid), as otherwise
> p2m_alloc_page()'s clearing of the page won't have the intended effect?
I think that, at least at the moment, we always explicitly set the p2m type and
do not rely on 0 == p2m_invalid by default.
Just to be safe, I will add after "if ( metadata )" suggested
BUILD_BUG_ON(p2m_invalid):
if ( metadata )
metadata[ctx->index].type = p2m_invalid;
/*
* metadata.type is expected to be p2m_invalid (0) after the page is
* allocated and zero-initialized in p2m_alloc_page().
*/
BUILD_BUG_ON(p2m_invalid);
...
>
>> + }
>> + else
>> + {
>> + pte->pte |= MASK_INSR(p2m_ext_storage, P2M_TYPE_PTE_BITS_MASK);
>> +
>> + metadata[ctx->index].pte = t;
> If you set t to p2m_ext_storage here, the pte->pte updating could be moved ...
't' shouldn't be passed as 'p2m_ext_storage'.
For example, in this case the metadata page would hold a type equal to
p2m_ext_storage and pte->pte would have its type set to p2m_ext_storage as
well, so we would end up with no real type stored anywhere.
Even more, metadata.pte shouldn't be used to store p2m_ext_storage, only
p2m_invalid and the types listed in enum p2m_type_t after p2m_ext_storage.
>
>> + }
> ... here, covering both cases. Overally this may then be easier as
>
> if ( t >= p2m_first_external )
> metadata[ctx->index].pte = t;
> else if ( metadata )
> metadata[ctx->index].pte = p2m_invalid;
>
> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>
> Then raising the question whether it couldn't still be the real type that's
> stored in metadata[] even for t < p2m_first_external. That woiuld further
> reduce conditionals.
It would be nice, but I think that at the moment we can’t do that. As I explained
above, 't' should not normally be passed as p2m_ext_storage. If we want to
handle this properly, I would need to update the code to:
if (!*md_pg && (t > p2m_first_external))
Alternatively, we could set p2m_first_external = p2m_map_foreign_rw instead of
p2m_ext_storage, since p2m_ext_storage is technically just a marker indicating
that the type is stored elsewhere.
We should also add a BUG_ON(t == p2m_ext_storage) before the if-condition
mentioned above.
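I.e. something like (sketch):

    BUG_ON(t == p2m_ext_storage);

    if ( !*md_pg && (t > p2m_first_external) )
    ...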
>
>> @@ -470,7 +545,15 @@ static void p2m_set_permission(pte_t *e, p2m_type_t t)
>> }
>> }
>>
>> -static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
>> +/*
>> + * If p2m_pte_from_mfn() is called with p2m_pte_ctx = NULL and p2m = NULL,
>> + * it means the function is working with a page table for which the `t`
>> + * should not be applicable. Otherwise, the function is handling a leaf PTE
>> + * for which `t` is applicable.
>> + */
>> +static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t,
>> + struct p2m_pte_ctx *p2m_pte_ctx,
>> + struct p2m_domain *p2m)
>> {
>> pte_t e = (pte_t) { PTE_VALID };
>>
>> @@ -478,7 +561,7 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
>>
>> ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK) || mfn_eq(mfn, INVALID_MFN));
>>
>> - if ( !is_table )
>> + if ( p2m_pte_ctx && p2m )
>> {
> Maybe better
>
> if ( p2m_pte_ctx )
> {
> ASSERT(p2m);
> ...
>
> (if you really think the 2nd check is needed)?
It seems like we don't really need it, as p2m_set_type() has the same ASSERT() at the start.
I will double-check why I added it and drop it if there was no specific reason.
>
>> @@ -506,12 +589,19 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
>> /* Generate table entry with correct attributes. */
>> static pte_t page_to_p2m_table(const struct page_info *page)
>> {
>> - /*
>> - * p2m_invalid will be ignored inside p2m_pte_from_mfn() as is_table is
>> - * set to true and p2m_type_t shouldn't be applied for PTEs which
>> - * describe an intermidiate table.
>> - */
>> - return p2m_pte_from_mfn(page_to_mfn(page), p2m_invalid, true);
>> + return p2m_pte_from_mfn(page_to_mfn(page), p2m_invalid, NULL, NULL);
>> +}
> How come the comment is dropped? If you deem it unecessary, why was it added
> earlier in this same series?
It is still relevant. Something went wrong during rebase and conflict resolution. Thanks for
finding that.
>
>> +static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg);
>> +
>> +/*
>> + * Free page table's page and metadata page linked to page table's page.
>> + */
>> +static void p2m_free_table(struct p2m_domain *p2m, struct page_info *tbl_pg)
>> +{
>> + if ( tbl_pg->v.md.pg )
>> + p2m_free_page(p2m, tbl_pg->v.md.pg);
> To play safe, maybe better also clear tbl_pg->v.md.pg?
I thought it would be enough to clear it during allocation in p2m_alloc_page(),
since I'm not sure it is critical if md.pg data were somehow leaked and read.
But to be safer, we can add this here:
clear_and_clean_page(tbl_pg->v.md.pg, p2m->clean_dcache);
>
>> @@ -749,6 +849,10 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>> unsigned int next_level = level - 1;
>> unsigned int level_order = P2M_LEVEL_ORDER(next_level);
>>
>> + struct p2m_pte_ctx p2m_pte_ctx;
> I think this would better be one variable instance per scope where it's needed,
> and then using an initzializer. Or else ...
>
>> + /* Init with p2m_invalid just to make compiler happy. */
>> + p2m_type_t old_type = p2m_invalid;
>> +
>> /*
>> * This should only be called with target != level and the entry is
>> * a superpage.
>> @@ -770,6 +874,19 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>>
>> table = __map_domain_page(page);
>>
>> + if ( MASK_EXTR(entry->pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage )
>> + {
>> + p2m_pte_ctx.pt_page = tbl_pg;
>> + p2m_pte_ctx.index = offsets[level];
>> + /*
>> + * It doesn't really matter what is a value for a level as
>> + * p2m_get_type() doesn't need it, so it is initialized just in case.
>> + */
>> + p2m_pte_ctx.level = level;
>> +
>> + old_type = p2m_get_type(*entry, &p2m_pte_ctx);
>> + }
>> +
>> for ( i = 0; i < P2M_PAGETABLE_ENTRIES(next_level); i++ )
>> {
>> pte_t *new_entry = table + i;
>> @@ -781,6 +898,15 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>> pte = *entry;
>> pte_set_mfn(&pte, mfn_add(mfn, i << level_order));
>>
>> + if ( MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage )
>> + {
>> + p2m_pte_ctx.pt_page = page;
>> + p2m_pte_ctx.index = i;
>> + p2m_pte_ctx.level = next_level;
> ... why are the loop-invariat fields not filled ahead of the loop here?
Actually, they could be filled before the loop. If I move the initialization of
p2m_pte_ctx.pt_page and p2m_pte_ctx.level ahead of the loop, does it still make
sense to have a separate variable inside
"if (MASK_EXTR(entry->pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage)"?
>
>> @@ -927,7 +1061,13 @@ static int p2m_set_entry(struct p2m_domain *p2m,
>> p2m_clean_pte(entry, p2m->clean_dcache);
>> else
>> {
>> - pte_t pte = p2m_pte_from_mfn(mfn, t, false);
>> + struct p2m_pte_ctx tmp_ctx = {
>> + .pt_page = mfn_to_page(domain_page_map_to_mfn(table)),
>> + .index = offsets[level],
> Nit: Stray blank.
>
>> @@ -1153,7 +1301,14 @@ static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
>> if ( pte_is_valid(entry) )
>> {
>> if ( t )
>> - *t = p2m_get_type(entry);
>> + {
>> + struct p2m_pte_ctx p2m_pte_ctx = {
>> + .pt_page = mfn_to_page(domain_page_map_to_mfn(table)),
>> + .index = offsets[level],
>> + };
> .level not being set here?
It isn't used in the case when p2m_get_type() is called, but just for consistency, and to
be sure that nothing will break if the implementation of p2m_get_type() ever changes,
I will add:
.level = level,
Thanks.
~ Oleksii
On 17.11.2025 20:51, Oleksii Kurochko wrote:
> On 11/12/25 12:49 PM, Jan Beulich wrote:
>> On 20.10.2025 17:58, Oleksii Kurochko wrote:
>>> + if ( *md_pg )
>>> + metadata = __map_domain_page(*md_pg);
>>> +
>>> + if ( t < p2m_first_external )
>>> + {
>>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>> - return rc;
>>> + if ( metadata )
>>> + metadata[ctx->index].pte = p2m_invalid;
>> Shouldn't this be accompanied with a BUILD_BUG_ON(p2m_invalid), as otherwise
>> p2m_alloc_page()'s clearing of the page won't have the intended effect?
>
> I think that, at least, at the moment we are always explicitly set p2m type and
> do not rely on that by default 0==p2m_invalid.
You don't, and ...
> Just to be safe, I will add after "if ( metadata )" suggested
> BUILD_BUG_ON(p2m_invalid):
> if ( metadata )
> metadata[ctx->index].type = p2m_invalid;
> /*
> * metadata.type is expected to be p2m_invalid (0) after the page is
> * allocated and zero-initialized in p2m_alloc_page().
> */
> BUILD_BUG_ON(p2m_invalid);
> ...
... this leaves me with the impression that you didn't read my reply correctly.
p2m_alloc_page() clears the page, thus _implicitly_ setting all entries to
p2m_invalid. That's where the BUILD_BUG_ON() would want to go (the call site,
ftaod).
>>> + }
>>> + else
>>> + {
>>> + pte->pte |= MASK_INSR(p2m_ext_storage, P2M_TYPE_PTE_BITS_MASK);
>>> +
>>> + metadata[ctx->index].pte = t;
>> If you set t to p2m_ext_storage here, the pte->pte updating could be moved ...
>
> 't' shouldn't be passed as 'p2m_ext_storage'.
Of course not. I said "set", not "pass". I suggested to set t to p2m_ext_storage
right after the assignment above. I notice though that I missed ...
> For example, in this case we will have that in metadata page we will have type
> equal to p2m_ext_storage and then in pte->pte will have the type set to
> p2m_ext_storage, and the we end that we don't have a real type stored somewhere.
> Even more, metadata.pte shouldn't be used to store p2m_ext_storage, only
> p2m_invalid and types mentioned in enum p2m_t after p2m_ext_storage.
>
>>
>>> + }
>> ... here, covering both cases. Overally this may then be easier as
>>
>> if ( t >= p2m_first_external )
>> metadata[ctx->index].pte = t;
... the respective line (and the figure braces which are then needed) here:
t = p2m_ext_storage;
>> else if ( metadata )
>> metadata[ctx->index].pte = p2m_invalid;
>>
>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>
>> Then raising the question whether it couldn't still be the real type that's
>> stored in metadata[] even for t < p2m_first_external. That woiuld further
>> reduce conditionals.
>
> It would be nice, but I think that at the moment we can’t do that. As I explained
> above, 't' should not normally be passed as p2m_ext_storage.
Of course not, but how's that relevant to storing the _real_ type in the
metadata page even when it's one which can also be stored in the PTE?
As said, for a frequently used path it may help to reduce the number of
conditionals here.
>>> +static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg);
>>> +
>>> +/*
>>> + * Free page table's page and metadata page linked to page table's page.
>>> + */
>>> +static void p2m_free_table(struct p2m_domain *p2m, struct page_info *tbl_pg)
>>> +{
>>> + if ( tbl_pg->v.md.pg )
>>> + p2m_free_page(p2m, tbl_pg->v.md.pg);
>> To play safe, maybe better also clear tbl_pg->v.md.pg?
>
> I thought it would be enough to clear it during allocation in p2m_alloc_page(),
> since I'm not sure it is critical if md.pg data were somehow leaked and read.
> But to be safer, we can add this here:
> clear_and_clean_page(tbl_pg->v.md.pg, p2m->clean_dcache);
I didn't say clear what tbl_pg->v.md.pg points to, though. I suggested to clear
the struct field itself.
>>> @@ -749,6 +849,10 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>>> unsigned int next_level = level - 1;
>>> unsigned int level_order = P2M_LEVEL_ORDER(next_level);
>>> + struct p2m_pte_ctx p2m_pte_ctx;
>> I think this would better be one variable instance per scope where it's needed,
>> and then using an initzializer. Or else ...
>>
>>> + /* Init with p2m_invalid just to make compiler happy. */
>>> + p2m_type_t old_type = p2m_invalid;
>>> +
>>> /*
>>> * This should only be called with target != level and the entry is
>>> * a superpage.
>>> @@ -770,6 +874,19 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>>> table = __map_domain_page(page);
>>> + if ( MASK_EXTR(entry->pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage )
>>> + {
>>> + p2m_pte_ctx.pt_page = tbl_pg;
>>> + p2m_pte_ctx.index = offsets[level];
>>> + /*
>>> + * It doesn't really matter what is a value for a level as
>>> + * p2m_get_type() doesn't need it, so it is initialized just in case.
>>> + */
>>> + p2m_pte_ctx.level = level;
>>> +
>>> + old_type = p2m_get_type(*entry, &p2m_pte_ctx);
>>> + }
>>> +
>>> for ( i = 0; i < P2M_PAGETABLE_ENTRIES(next_level); i++ )
>>> {
>>> pte_t *new_entry = table + i;
>>> @@ -781,6 +898,15 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>>> pte = *entry;
>>> pte_set_mfn(&pte, mfn_add(mfn, i << level_order));
>>> + if ( MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage )
>>> + {
>>> + p2m_pte_ctx.pt_page = page;
>>> + p2m_pte_ctx.index = i;
>>> + p2m_pte_ctx.level = next_level;
>> ... why are the loop-invariat fields not filled ahead of the loop here?
>
> Actually, they could be filled before the loop. If I move the initialization of
> p2m_pte_ctx.pt_page and p2m_pte_ctx.level ahead of the loop, does it still make
> sense to have a separate variable inside
> "if (MASK_EXTR(entry->pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage)"?
No, it's one of the two - scope limited variable within the loop, or wider-scope
variable with loop-invariant fields set ahead of the loop.
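E.g., for the latter (sketch, using the names from the patch):

    p2m_pte_ctx.pt_page = page;
    p2m_pte_ctx.level = next_level;

    for ( i = 0; i < P2M_PAGETABLE_ENTRIES(next_level); i++ )
    {
        ...
        p2m_pte_ctx.index = i;
        ...
    }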
Jan
On 11/18/25 7:58 AM, Jan Beulich wrote:
> On 17.11.2025 20:51, Oleksii Kurochko wrote:
>> On 11/12/25 12:49 PM, Jan Beulich wrote:
>>> On 20.10.2025 17:58, Oleksii Kurochko wrote:
>>>> + if ( *md_pg )
>>>> + metadata = __map_domain_page(*md_pg);
>>>> +
>>>> + if ( t < p2m_first_external )
>>>> + {
>>>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>>> - return rc;
>>>> + if ( metadata )
>>>> + metadata[ctx->index].pte = p2m_invalid;
>>> Shouldn't this be accompanied with a BUILD_BUG_ON(p2m_invalid), as otherwise
>>> p2m_alloc_page()'s clearing of the page won't have the intended effect?
>> I think that, at least, at the moment we are always explicitly set p2m type and
>> do not rely on that by default 0==p2m_invalid.
> You don't, and ...
>
>> Just to be safe, I will add after "if ( metadata )" suggested
>> BUILD_BUG_ON(p2m_invalid):
>> if ( metadata )
>> metadata[ctx->index].type = p2m_invalid;
>> /*
>> * metadata.type is expected to be p2m_invalid (0) after the page is
>> * allocated and zero-initialized in p2m_alloc_page().
>> */
>> BUILD_BUG_ON(p2m_invalid);
>> ...
> ... this leaves me with the impression that you didn't read my reply correctly.
> p2m_alloc_page() clear the page, thus _implicitly_ setting all entries to
> p2m_invalid. That's where the BUILD_BUG_ON() would want to go (the call site,
> ftaod).
I think I still don’t fully understand what the issue would be if p2m_invalid were
ever equal to 1 instead of 0 in the context of a metadata page.
Yes, if p2m_invalid were 1, there would be a problem if someone tried to read this
metadata page before it was assigned any type. They would find a value of 0, which
corresponds to a valid type rather than to p2m_invalid, as one might expect.
However, I’m not sure I currently see a scenario in which the metadata page would
be read before being initialized.
But just to be safe, should such a case ever occur, I am okay with putting
BUILD_BUG_ON(p2m_invalid) before p2m_alloc_page() in the p2m_set_type() function.
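I.e. at the call site (sketch):

    /*
     * p2m_alloc_page() returns a zeroed page, so all metadata entries
     * implicitly read back as p2m_invalid only if p2m_invalid is 0.
     */
    BUILD_BUG_ON(p2m_invalid);

    *md_pg = p2m_alloc_page(p2m);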
>
>>>> + }
>>>> + else
>>>> + {
>>>> + pte->pte |= MASK_INSR(p2m_ext_storage, P2M_TYPE_PTE_BITS_MASK);
>>>> +
>>>> + metadata[ctx->index].pte = t;
>>> If you set t to p2m_ext_storage here, the pte->pte updating could be moved ...
>> 't' shouldn't be passed as 'p2m_ext_storage'.
> Of course not. I said "set", not "pass". I suggested to set t to p2m_ext_storage
> right after the assignment above. I notice though that I missed ...
Now, I see then ...
>
>> For example, in this case we will have that in metadata page we will have type
>> equal to p2m_ext_storage and then in pte->pte will have the type set to
>> p2m_ext_storage, and the we end that we don't have a real type stored somewhere.
>> Even more, metadata.pte shouldn't be used to store p2m_ext_storage, only
>> p2m_invalid and types mentioned in enum p2m_t after p2m_ext_storage.
>>
>>>> + }
>>> ... here, covering both cases. Overally this may then be easier as
>>>
>>> if ( t >= p2m_first_external )
>>> metadata[ctx->index].pte = t;
> ... the respective line (and the figure braces which are the needed) here:
>
> t = p2m_ext_storage;
... (what was suggested above) will work.
>
>>> else if ( metadata )
>>> metadata[ctx->index].pte = p2m_invalid;
>>>
>>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>>
>>> Then raising the question whether it couldn't still be the real type that's
>>> stored in metadata[] even for t < p2m_first_external. That woiuld further
>>> reduce conditionals.
>> It would be nice, but I think that at the moment we can’t do that. As I explained
>> above, 't' should not normally be passed as p2m_ext_storage.
> Of course not, but how's that relevant to storing the _real_ type in the
> metadata page even when it's one which can also can be stored in the PTE?
> As said, for a frequently used path it may help to reduce the number of
> conditionals here.
IIUC, you are asking whether, if pte->pte stores a type < p2m_ext_storage,
it would still be possible for metadata[].pte to contain any real type?
If yes, then the answer is that it could be done, because in the p2m_get_type()
function the value stored in pte->pte is checked first. If it isn't p2m_ext_storage,
then metadata[].pte will not be checked at all. So technically, it could contain
whatever we want when the pte.pte's type != p2m_ext_storage.
But will it really reduce the number of conditionals? It seems like we still need one
condition to check whether metadata is mapped and one condition to set 't' to p2m_ext_storage:
if ( metadata )
metadata[ctx->index].pte = t;
if ( t >= p2m_first_external )
t = p2m_ext_storage;
pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
We can do:
if ( metadata )
{
metadata[ctx->index].pte = t;
if ( t >= p2m_first_external )
t = p2m_ext_storage;
}
pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
It will reduce the number of conditionals if metadata wasn't used/allocated, but I think
you have a different idea, don't you?
>
>>>> +static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg);
>>>> +
>>>> +/*
>>>> + * Free page table's page and metadata page linked to page table's page.
>>>> + */
>>>> +static void p2m_free_table(struct p2m_domain *p2m, struct page_info *tbl_pg)
>>>> +{
>>>> + if ( tbl_pg->v.md.pg )
>>>> + p2m_free_page(p2m, tbl_pg->v.md.pg);
>>> To play safe, maybe better also clear tbl_pg->v.md.pg?
>> I thought it would be enough to clear it during allocation in p2m_alloc_page(),
>> since I'm not sure it is critical if md.pg data were somehow leaked and read.
>> But to be safer, we can add this here:
>> clear_and_clean_page(tbl_pg->v.md.pg, p2m->clean_dcache);
> I didn't say clear what tbl_pg->v.md.pg points to, though. I suggested to clear
> the struct field itself.
Won't just tbl_pg->v.md.pg = NULL; be enough?
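I.e. (sketch):

    static void p2m_free_table(struct p2m_domain *p2m, struct page_info *tbl_pg)
    {
        if ( tbl_pg->v.md.pg )
        {
            p2m_free_page(p2m, tbl_pg->v.md.pg);
            tbl_pg->v.md.pg = NULL;
        }

        p2m_free_page(p2m, tbl_pg);
    }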
Thanks!
~ Oleksii
On 20.11.2025 16:38, Oleksii Kurochko wrote:
> On 11/18/25 7:58 AM, Jan Beulich wrote:
>> On 17.11.2025 20:51, Oleksii Kurochko wrote:
>>> On 11/12/25 12:49 PM, Jan Beulich wrote:
>>>> On 20.10.2025 17:58, Oleksii Kurochko wrote:
>>>>> + if ( *md_pg )
>>>>> + metadata = __map_domain_page(*md_pg);
>>>>> +
>>>>> + if ( t < p2m_first_external )
>>>>> + {
>>>>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>>>> - return rc;
>>>>> + if ( metadata )
>>>>> + metadata[ctx->index].pte = p2m_invalid;
>>>> Shouldn't this be accompanied with a BUILD_BUG_ON(p2m_invalid), as otherwise
>>>> p2m_alloc_page()'s clearing of the page won't have the intended effect?
>>> I think that, at least, at the moment we are always explicitly set p2m type and
>>> do not rely on that by default 0==p2m_invalid.
>> You don't, and ...
>>
>>> Just to be safe, I will add after "if ( metadata )" suggested
>>> BUILD_BUG_ON(p2m_invalid):
>>> if ( metadata )
>>> metadata[ctx->index].type = p2m_invalid;
>>> /*
>>> * metadata.type is expected to be p2m_invalid (0) after the page is
>>> * allocated and zero-initialized in p2m_alloc_page().
>>> */
>>> BUILD_BUG_ON(p2m_invalid);
>>> ...
>> ... this leaves me with the impression that you didn't read my reply correctly.
>> p2m_alloc_page() clear the page, thus _implicitly_ setting all entries to
>> p2m_invalid. That's where the BUILD_BUG_ON() would want to go (the call site,
>> ftaod).
>
> I think I still don’t fully understand what the issue would be if p2m_invalid were
> ever equal to 1 instead of 0 in the context of a metadata page.
>
> Yes, if p2m_invalid were 1, there would be a problem if someone tried to read this
> metadata page before it was assigned any type. They would find a value of 0, which
> corresponds to a valid type rather than to p2m_invalid, as one might expect.
> However, I’m not sure I currently see a scenario in which the metadata page would
> be read before being initialized.
Are you sure walks can only happen for GFNs that were set up? What you need to
do walks on is under guest control, after all.
> But just to be safe when such case will occur I am okay with putting
> BUILD_BUG_ON(p2m_invalid) before p2m_alloc_page() in p2m_set_type() function.
Thanks.
>>>>> +static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg);
>>>>> +
>>>>> +/*
>>>>> + * Free page table's page and metadata page linked to page table's page.
>>>>> + */
>>>>> +static void p2m_free_table(struct p2m_domain *p2m, struct page_info *tbl_pg)
>>>>> +{
>>>>> + if ( tbl_pg->v.md.pg )
>>>>> + p2m_free_page(p2m, tbl_pg->v.md.pg);
>>>> To play safe, maybe better also clear tbl_pg->v.md.pg?
>>> I thought it would be enough to clear it during allocation in p2m_alloc_page(),
>>> since I'm not sure it is critical if md.pg data were somehow leaked and read.
>>> But to be safer, we can add this here:
>>> clear_and_clean_page(tbl_pg->v.md.pg, p2m->clean_dcache);
>> I didn't say clear what tbl_pg->v.md.pg points to, though. I suggested to clear
>> the struct field itself.
>
> Won't be enough just tbl_pg->v.md.pg = NULL; ?
Exactly that, yes.
Jan
On 11/20/25 4:47 PM, Jan Beulich wrote:
> On 20.11.2025 16:38, Oleksii Kurochko wrote:
>> On 11/18/25 7:58 AM, Jan Beulich wrote:
>>> On 17.11.2025 20:51, Oleksii Kurochko wrote:
>>>> On 11/12/25 12:49 PM, Jan Beulich wrote:
>>>>> On 20.10.2025 17:58, Oleksii Kurochko wrote:
>>>>>> + if ( *md_pg )
>>>>>> + metadata = __map_domain_page(*md_pg);
>>>>>> +
>>>>>> + if ( t < p2m_first_external )
>>>>>> + {
>>>>>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>>>>> - return rc;
>>>>>> + if ( metadata )
>>>>>> + metadata[ctx->index].pte = p2m_invalid;
>>>>> Shouldn't this be accompanied with a BUILD_BUG_ON(p2m_invalid), as otherwise
>>>>> p2m_alloc_page()'s clearing of the page won't have the intended effect?
>>>> I think that, at least, at the moment we are always explicitly set p2m type and
>>>> do not rely on that by default 0==p2m_invalid.
>>> You don't, and ...
>>>
>>>> Just to be safe, I will add after "if ( metadata )" suggested
>>>> BUILD_BUG_ON(p2m_invalid):
>>>> if ( metadata )
>>>> metadata[ctx->index].type = p2m_invalid;
>>>> /*
>>>> * metadata.type is expected to be p2m_invalid (0) after the page is
>>>> * allocated and zero-initialized in p2m_alloc_page().
>>>> */
>>>> BUILD_BUG_ON(p2m_invalid);
>>>> ...
>>> ... this leaves me with the impression that you didn't read my reply correctly.
>>> p2m_alloc_page() clear the page, thus _implicitly_ setting all entries to
>>> p2m_invalid. That's where the BUILD_BUG_ON() would want to go (the call site,
>>> ftaod).
>> I think I still don’t fully understand what the issue would be if p2m_invalid were
>> ever equal to 1 instead of 0 in the context of a metadata page.
>>
>> Yes, if p2m_invalid were 1, there would be a problem if someone tried to read this
>> metadata page before it was assigned any type. They would find a value of 0, which
>> corresponds to a valid type rather than to p2m_invalid, as one might expect.
>> However, I’m not sure I currently see a scenario in which the metadata page would
>> be read before being initialized.
> Are you sure walks can only happen for GFNs that were set up? What you need to
> do walks on is under guest control, after all.
If a GFN lies within the range [p2m->lowest_mapped_gfn, p2m->max_mapped_gfn], then
p2m_set_entry() must already have been called for this GFN. This means that either
- a metadata page has been created and its entry filled with the appropriate type, or
- no metadata page was needed and the type was stored directly in pte->pte.
For a GFN outside the range (p2m->lowest_mapped_gfn, p2m->max_mapped_gfn),
p2m_get_entry() will not even attempt a walk because of the boundary checks:
static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
p2m_type_t *t,
unsigned int *page_order)
...
if ( check_outside_boundary(p2m, gfn, p2m->lowest_mapped_gfn, true,
&level) )
goto out;
if ( check_outside_boundary(p2m, gfn, p2m->max_mapped_gfn, false, &level) )
goto out;
If I am misunderstanding something and there are other cases where a walk can occur for
GFNs that were never set up, then such GFNs would have neither an allocated metadata
page nor a type stored in pte->pte, which looks like we are in trouble.
~ Oleksii
On 20.11.2025 17:52, Oleksii Kurochko wrote:
>
> On 11/20/25 4:47 PM, Jan Beulich wrote:
>> On 20.11.2025 16:38, Oleksii Kurochko wrote:
>>> On 11/18/25 7:58 AM, Jan Beulich wrote:
>>>> On 17.11.2025 20:51, Oleksii Kurochko wrote:
>>>>> On 11/12/25 12:49 PM, Jan Beulich wrote:
>>>>>> On 20.10.2025 17:58, Oleksii Kurochko wrote:
>>>>>>> + if ( *md_pg )
>>>>>>> + metadata = __map_domain_page(*md_pg);
>>>>>>> +
>>>>>>> + if ( t < p2m_first_external )
>>>>>>> + {
>>>>>>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>>>>>> - return rc;
>>>>>>> + if ( metadata )
>>>>>>> + metadata[ctx->index].pte = p2m_invalid;
>>>>>> Shouldn't this be accompanied with a BUILD_BUG_ON(p2m_invalid), as otherwise
>>>>>> p2m_alloc_page()'s clearing of the page won't have the intended effect?
>>>>> I think that, at least, at the moment we are always explicitly set p2m type and
>>>>> do not rely on that by default 0==p2m_invalid.
>>>> You don't, and ...
>>>>
>>>>> Just to be safe, I will add after "if ( metadata )" suggested
>>>>> BUILD_BUG_ON(p2m_invalid):
>>>>> if ( metadata )
>>>>> metadata[ctx->index].type = p2m_invalid;
>>>>> /*
>>>>> * metadata.type is expected to be p2m_invalid (0) after the page is
>>>>> * allocated and zero-initialized in p2m_alloc_page().
>>>>> */
>>>>> BUILD_BUG_ON(p2m_invalid);
>>>>> ...
>>>> ... this leaves me with the impression that you didn't read my reply correctly.
>>>> p2m_alloc_page() clear the page, thus _implicitly_ setting all entries to
>>>> p2m_invalid. That's where the BUILD_BUG_ON() would want to go (the call site,
>>>> ftaod).
>>> I think I still don’t fully understand what the issue would be if p2m_invalid were
>>> ever equal to 1 instead of 0 in the context of a metadata page.
>>>
>>> Yes, if p2m_invalid were 1, there would be a problem if someone tried to read this
>>> metadata page before it was assigned any type. They would find a value of 0, which
>>> corresponds to a valid type rather than to p2m_invalid, as one might expect.
>>> However, I’m not sure I currently see a scenario in which the metadata page would
>>> be read before being initialized.
>> Are you sure walks can only happen for GFNs that were set up? What you need to
>> do walks on is under guest control, after all.
>
> If a GFN lies within the range [p2m->lowest_mapped_gfn, p2m->max_mapped_gfn], then
> p2m_set_entry() must already have been called for this GFN.
No. All you know from the pre-condition is that p2m_set_entry() was called for the
lowest and highest GFNs in that range.
Jan
> This means that either
> - a metadata page has been created and its entry filled with the appropriate type, or
> - no metadata page was needed and the type was stored directly in pte->pte.
>
> For a GFN outside the range (p2m->lowest_mapped_gfn, p2m->max_mapped_gfn),
> p2m_get_entry() will not even attempt a walk because of the boundary checks:
> static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
> p2m_type_t *t,
> unsigned int *page_order)
> ...
> if ( check_outside_boundary(p2m, gfn, p2m->lowest_mapped_gfn, true,
> &level) )
> goto out;
>
> if ( check_outside_boundary(p2m, gfn, p2m->max_mapped_gfn, false, &level) )
> goto out;
>
> If I am misunderstanding something and there are other cases where a walk can occur for
> GFNs that were never set up, then such GFNs would have neither an allocated metadata
> page nor a type stored in pte->pte, which looks like we are in trouble.
>
> ~ Oleksii
>
On 11/21/25 8:35 AM, Jan Beulich wrote:
> On 20.11.2025 17:52, Oleksii Kurochko wrote:
>> On 11/20/25 4:47 PM, Jan Beulich wrote:
>>> On 20.11.2025 16:38, Oleksii Kurochko wrote:
>>>> On 11/18/25 7:58 AM, Jan Beulich wrote:
>>>>> On 17.11.2025 20:51, Oleksii Kurochko wrote:
>>>>>> On 11/12/25 12:49 PM, Jan Beulich wrote:
>>>>>>> On 20.10.2025 17:58, Oleksii Kurochko wrote:
>>>>>>>> + if ( *md_pg )
>>>>>>>> + metadata = __map_domain_page(*md_pg);
>>>>>>>> +
>>>>>>>> + if ( t < p2m_first_external )
>>>>>>>> + {
>>>>>>>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>>>>>>> - return rc;
>>>>>>>> + if ( metadata )
>>>>>>>> + metadata[ctx->index].pte = p2m_invalid;
>>>>>>> Shouldn't this be accompanied with a BUILD_BUG_ON(p2m_invalid), as otherwise
>>>>>>> p2m_alloc_page()'s clearing of the page won't have the intended effect?
>>>>>> I think that, at least, at the moment we are always explicitly set p2m type and
>>>>>> do not rely on that by default 0==p2m_invalid.
>>>>> You don't, and ...
>>>>>
>>>>>> Just to be safe, I will add after "if ( metadata )" suggested
>>>>>> BUILD_BUG_ON(p2m_invalid):
>>>>>> if ( metadata )
>>>>>> metadata[ctx->index].type = p2m_invalid;
>>>>>> /*
>>>>>> * metadata.type is expected to be p2m_invalid (0) after the page is
>>>>>> * allocated and zero-initialized in p2m_alloc_page().
>>>>>> */
>>>>>> BUILD_BUG_ON(p2m_invalid);
>>>>>> ...
>>>>> ... this leaves me with the impression that you didn't read my reply correctly.
>>>>> p2m_alloc_page() clear the page, thus _implicitly_ setting all entries to
>>>>> p2m_invalid. That's where the BUILD_BUG_ON() would want to go (the call site,
>>>>> ftaod).
>>>> I think I still don’t fully understand what the issue would be if p2m_invalid were
>>>> ever equal to 1 instead of 0 in the context of a metadata page.
>>>>
>>>> Yes, if p2m_invalid were 1, there would be a problem if someone tried to read this
>>>> metadata page before it was assigned any type. They would find a value of 0, which
>>>> corresponds to a valid type rather than to p2m_invalid, as one might expect.
>>>> However, I’m not sure I currently see a scenario in which the metadata page would
>>>> be read before being initialized.
>>> Are you sure walks can only happen for GFNs that were set up? What you need to
>>> do walks on is under guest control, after all.
>> If a GFN lies within the range [p2m->lowest_mapped_gfn, p2m->max_mapped_gfn], then
>> p2m_set_entry() must already have been called for this GFN.
> No. All you know from the pre-condition is that p2m_set_entry() was called for the
> lowest and highest GFNs in that range.
>
Oh, right. There could still be some GFNs inside the range for which p2m_set_entry() has
not yet been called. Then it probably makes sense to add a BUILD_BUG_ON here as well, before
"if ( type == p2m_ext_storage )":
static p2m_type_t p2m_get_type(const pte_t pte, const struct p2m_pte_ctx *ctx)
{
p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
if ( type == p2m_ext_storage )
{
const pte_t *md = __map_domain_page(ctx->pt_page->v.md.pg);
type = md[ctx->index].pte;
unmap_domain_page(md);
}
return type;
}
I would expect that if p2m_set_entry() has not been called for a GFN, then p2m_get_entry()
(p2m_get_type() will be called somewhere inside p2m_get_entry(), for example) should return
the p2m_invalid type. I think we want to have the same check (as the one before the call to
p2m_alloc_page()), placed before the condition:
    BUILD_BUG_ON(p2m_invalid);
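I.e. (sketch):

    p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);

    /* A zeroed, never-written metadata entry must read back as p2m_invalid. */
    BUILD_BUG_ON(p2m_invalid);

    if ( type == p2m_ext_storage )
    ...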
~ Oleksii
>
>> This means that either
>> - a metadata page has been created and its entry filled with the appropriate type, or
>> - no metadata page was needed and the type was stored directly in pte->pte.
>>
>> For a GFN outside the range (p2m->lowest_mapped_gfn, p2m->max_mapped_gfn),
>> p2m_get_entry() will not even attempt a walk because of the boundary checks:
>> static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
>> p2m_type_t *t,
>> unsigned int *page_order)
>> ...
>> if ( check_outside_boundary(p2m, gfn, p2m->lowest_mapped_gfn, true,
>> &level) )
>> goto out;
>>
>> if ( check_outside_boundary(p2m, gfn, p2m->max_mapped_gfn, false, &level) )
>> goto out;
>>
>> If I am misunderstanding something and there are other cases where a walk can occur for
>> GFNs that were never set up, then such GFNs would have neither an allocated metadata
>> page nor a type stored in pte->pte, which looks like we are in trouble.
>>
>> ~ Oleksii
>>