Implement map_pages_to_xen() which requires several
functions to manage page tables and entries:
- pt_update()
- pt_mapping_level()
- pt_update_entry()
- pt_next_level()
- pt_check_entry()
To support these operations, add functions for creating,
mapping, and unmapping Xen tables:
- create_table()
- map_table()
- unmap_table()
Introduce internal macros starting with PTE_* for convenience.
These macros closely resemble PTE bits, with the exception of
PTE_SMALL, which indicates that 4KB is needed.
In addition introduce flush_tlb_range_va() for TLB flushing across
CPUs after updating the PTE for the requested mapping.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
- s/xen_{un}map/{un}map
- introduce PTE_SMALL instead of PTE_BLOCK.
- update the comment above defintion of PTE_4K_PAGES.
- code style fixes.
- s/RV_STAGE1_MODE > SATP_MODE_SV48/RV_STAGE1_MODE > SATP_MODE_SV39 around
DECLARE_OFFSETS macros.
- change type of root_maddr from unsgined long to maddr_t.
- drop duplicated check ( if (rc) break ) in pt_update() inside while cycle.
- s/1U/1UL
- put 'spin_unlock(&xen_pt_lock);' ahead of TLB flush in pt_update().
- update the commit message.
- update the comment above ASSERT() in map_pages_to_xen() and also update
the check within ASSERT() to check that flags has PTE_VALID bit set.
- update the comment above pt_update() function.
- add the comment inside pt_check_entry().
- update the TLB flushing region in pt_update().
- s/alloc_only/alloc_tbl
---
Changes in V4:
- update the commit message.
- drop xen_ prefix for functions: xen_pt_update(), xen_pt_mapping_level(),
xen_pt_update_entry(), xen_pt_next_level(), xen_pt_check_entry().
- drop 'select GENERIC_PT' for CONFIG_RISCV. There is no GENERIC_PT anymore.
- update implementation of flush_xen_tlb_range_va and s/flush_xen_tlb_range_va/flush_tlb_range_va
- s/pte_get_mfn/mfn_from_pte. Others similar definitions I decided not to touch as
they were introduced before and this patter of naming such type of macros will be applied
for newly introduced macros.
- drop _PAGE_* definitions and use analogues of PTE_*.
- introduce PTE_{W,X,R}_MASK and drop PAGE_{XN,W,X}_MASK. Also drop _PAGE_{*}_BIT
- introduce PAGE_HYPERVISOR_RX.
- drop unused now l3_table_offset.
- drop struct pt_t as it was used only for one function. If it will be needed in the future
pt_t will be re-introduced.
- code styles fixes in pte_is_table(). drop level argument from t.
- update implementation and prototype of pte_is_mapping().
- drop level argument from pt_next_level().
- introduce definition of SATP_PPN_MASK.
- isolate PPN of CSR_SATP before shift by PAGE_SHIFT.
- drop set_permission() functions as it is not used more then once.
- update prototype of pt_check_entry(): drop level argument as it is not used.
- pt_check_entry():
- code style fixes
- update the sanity check when modifying an entry
- update the sanity check when when removing a mapping.
- s/read_only/alloc_only.
- code style fixes for pt_next_level().
- pt_update_entry() changes:
- drop arch_level variable inisde pt_update_entry()
- drop convertion near virt to paddr_t in DECLARE_OFFSETS(offsets, virt);
- pull out "goto out inside first 'for' cycle.
- drop braces for 'if' cases which has only one line.
- ident 'out' label with one blank.
- update the comment above alloc_only and also definition to take into
account that if pte population was requested or not.
- drop target variable and rename arch_target argument of the function to
target.
- pt_mapping_level() changes:
- move the check if PTE_BLOCK should be mapped on the top of the function.
- change int i to unsigned int and update 'for' cycle correspondingly.
- update prototye of pt_update():
- drop the comment above nr_mfns and drop const to be consistent with other
arguments.
- always flush TLB at the end of the function as non-present entries can be put
in the TLB.
- add fence before TLB flush to ensure that PTEs are all updated before flushing.
- s/XEN_TABLE_NORMAL_PAGE/XEN_TABLE_NORMAL
- add a check in map_pages_to_xen() the mfn is not INVALID_MFN.
- add the comment on top of pt_update() how mfn = INVALID_MFN is considered.
- s/_PAGE_BLOCK/PTE_BLOCK.
- add the comment with additional explanation for PTE_BLOCK.
- drop defintion of FIRST_SIZE as it isn't used.
---
Changes in V3:
- new patch. ( Technically it is reworked version of the generic approach
which I tried to suggest in the previous version )
---
xen/arch/riscv/Makefile | 1 +
xen/arch/riscv/include/asm/flushtlb.h | 12 +
xen/arch/riscv/include/asm/mm.h | 2 +
xen/arch/riscv/include/asm/page.h | 57 +++
xen/arch/riscv/include/asm/riscv_encoding.h | 1 +
xen/arch/riscv/mm.c | 9 -
xen/arch/riscv/pt.c | 420 ++++++++++++++++++++
7 files changed, 493 insertions(+), 9 deletions(-)
create mode 100644 xen/arch/riscv/pt.c
diff --git a/xen/arch/riscv/Makefile b/xen/arch/riscv/Makefile
index 334fd24547..d058ea4e95 100644
--- a/xen/arch/riscv/Makefile
+++ b/xen/arch/riscv/Makefile
@@ -1,6 +1,7 @@
obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
obj-y += entry.o
obj-y += mm.o
+obj-y += pt.o
obj-$(CONFIG_RISCV_64) += riscv64/
obj-y += sbi.o
obj-y += setup.o
diff --git a/xen/arch/riscv/include/asm/flushtlb.h b/xen/arch/riscv/include/asm/flushtlb.h
index f4a735fd6c..031d781aa2 100644
--- a/xen/arch/riscv/include/asm/flushtlb.h
+++ b/xen/arch/riscv/include/asm/flushtlb.h
@@ -5,12 +5,24 @@
#include <xen/bug.h>
#include <xen/cpumask.h>
+#include <asm/sbi.h>
+
/* Flush TLB of local processor for address va. */
static inline void flush_tlb_one_local(vaddr_t va)
{
asm volatile ( "sfence.vma %0" :: "r" (va) : "memory" );
}
+/*
+ * Flush a range of VA's hypervisor mappings from the TLB of all
+ * processors in the inner-shareable domain.
+ */
+static inline void flush_tlb_range_va(vaddr_t va, size_t size)
+{
+ BUG_ON(!sbi_has_rfence());
+ sbi_remote_sfence_vma(NULL, va, size);
+}
+
/*
* Filter the given set of CPUs, removing those that definitely flushed their
* TLB since @page_timestamp.
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index a0bdc2bc3a..ce1557bb27 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -42,6 +42,8 @@ static inline void *maddr_to_virt(paddr_t ma)
#define virt_to_mfn(va) __virt_to_mfn(va)
#define mfn_to_virt(mfn) __mfn_to_virt(mfn)
+#define mfn_from_pte(pte) maddr_to_mfn(pte_to_paddr(pte))
+
struct page_info
{
/* Each frame can be threaded onto a doubly-linked list. */
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index 55916eaa92..f148d82261 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -21,6 +21,11 @@
#define XEN_PT_LEVEL_MAP_MASK(lvl) (~(XEN_PT_LEVEL_SIZE(lvl) - 1))
#define XEN_PT_LEVEL_MASK(lvl) (VPN_MASK << XEN_PT_LEVEL_SHIFT(lvl))
+/*
+ * PTE format:
+ * | XLEN-1 10 | 9 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
+ * PFN reserved for SW D A G U X W R V
+ */
#define PTE_VALID BIT(0, UL)
#define PTE_READABLE BIT(1, UL)
#define PTE_WRITABLE BIT(2, UL)
@@ -34,15 +39,53 @@
#define PTE_LEAF_DEFAULT (PTE_VALID | PTE_READABLE | PTE_WRITABLE)
#define PTE_TABLE (PTE_VALID)
+#define PAGE_HYPERVISOR_RO (PTE_VALID | PTE_READABLE)
#define PAGE_HYPERVISOR_RW (PTE_VALID | PTE_READABLE | PTE_WRITABLE)
+#define PAGE_HYPERVISOR_RX (PTE_VALID | PTE_READABLE | PTE_EXECUTABLE)
#define PAGE_HYPERVISOR PAGE_HYPERVISOR_RW
+/*
+ * The PTE format does not contain the following bits within itself;
+ * they are created artificially to inform the Xen page table
+ * handling algorithm. These bits should not be explicitly written
+ * to the PTE entry.
+ */
+#define PTE_SMALL BIT(10, UL)
+#define PTE_POPULATE BIT(11, UL)
+
+#define PTE_R_MASK(x) ((x) & PTE_READABLE)
+#define PTE_W_MASK(x) ((x) & PTE_WRITABLE)
+#define PTE_X_MASK(x) ((x) & PTE_EXECUTABLE)
+
+#define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE))
+
/* Calculate the offsets into the pagetables for a given VA */
#define pt_linear_offset(lvl, va) ((va) >> XEN_PT_LEVEL_SHIFT(lvl))
#define pt_index(lvl, va) (pt_linear_offset((lvl), (va)) & VPN_MASK)
+#define PAGETABLE_ORDER_MASK ((_AC(1, U) << PAGETABLE_ORDER) - 1)
+#define TABLE_OFFSET(offs) (_AT(unsigned int, offs) & PAGETABLE_ORDER_MASK)
+
+#if RV_STAGE1_MODE > SATP_MODE_SV39
+#error "need to to update DECLARE_OFFSETS macros"
+#else
+
+#define l0_table_offset(va) TABLE_OFFSET(pt_linear_offset(0, va))
+#define l1_table_offset(va) TABLE_OFFSET(pt_linear_offset(1, va))
+#define l2_table_offset(va) TABLE_OFFSET(pt_linear_offset(2, va))
+
+/* Generate an array @var containing the offset for each level from @addr */
+#define DECLARE_OFFSETS(var, addr) \
+ const unsigned int var[] = { \
+ l0_table_offset(addr), \
+ l1_table_offset(addr), \
+ l2_table_offset(addr), \
+ }
+
+#endif
+
/* Page Table entry */
typedef struct {
#ifdef CONFIG_RISCV_64
@@ -68,6 +111,20 @@ static inline bool pte_is_valid(pte_t p)
return p.pte & PTE_VALID;
}
+inline bool pte_is_table(const pte_t p)
+{
+ return ((p.pte & (PTE_VALID |
+ PTE_READABLE |
+ PTE_WRITABLE |
+ PTE_EXECUTABLE)) == PTE_VALID);
+}
+
+static inline bool pte_is_mapping(const pte_t p)
+{
+ return (p.pte & PTE_VALID) &&
+ (p.pte & (PTE_WRITABLE | PTE_EXECUTABLE));
+}
+
static inline void invalidate_icache(void)
{
BUG_ON("unimplemented");
diff --git a/xen/arch/riscv/include/asm/riscv_encoding.h b/xen/arch/riscv/include/asm/riscv_encoding.h
index 58abe5eccc..d80cef0093 100644
--- a/xen/arch/riscv/include/asm/riscv_encoding.h
+++ b/xen/arch/riscv/include/asm/riscv_encoding.h
@@ -164,6 +164,7 @@
#define SSTATUS_SD SSTATUS64_SD
#define SATP_MODE SATP64_MODE
#define SATP_MODE_SHIFT SATP64_MODE_SHIFT
+#define SATP_PPN_MASK _UL(0x00000FFFFFFFFFFF)
#define HGATP_PPN HGATP64_PPN
#define HGATP_VMID_SHIFT HGATP64_VMID_SHIFT
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index b8ff91cf4e..e8430def14 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -369,12 +369,3 @@ int destroy_xen_mappings(unsigned long s, unsigned long e)
BUG_ON("unimplemented");
return -1;
}
-
-int map_pages_to_xen(unsigned long virt,
- mfn_t mfn,
- unsigned long nr_mfns,
- unsigned int flags)
-{
- BUG_ON("unimplemented");
- return -1;
-}
diff --git a/xen/arch/riscv/pt.c b/xen/arch/riscv/pt.c
new file mode 100644
index 0000000000..15eb02fe9e
--- /dev/null
+++ b/xen/arch/riscv/pt.c
@@ -0,0 +1,420 @@
+#include <xen/bug.h>
+#include <xen/domain_page.h>
+#include <xen/errno.h>
+#include <xen/mm.h>
+#include <xen/mm-frame.h>
+#include <xen/pmap.h>
+#include <xen/spinlock.h>
+
+#include <asm/flushtlb.h>
+#include <asm/page.h>
+
+static inline const mfn_t get_root_page(void)
+{
+ paddr_t root_maddr = (csr_read(CSR_SATP) & SATP_PPN_MASK) << PAGE_SHIFT;
+
+ return maddr_to_mfn(root_maddr);
+}
+
+/* Sanity check of the entry. */
+static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int flags)
+{
+ /*
+ * See the comment about the possible combination of (mfn, flags) in
+ * the comment above pt_update().
+ */
+
+ /* Sanity check when modifying an entry. */
+ if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
+ {
+ /* We don't allow modifying an invalid entry. */
+ if ( !pte_is_valid(entry) )
+ {
+ printk("Modifying invalid entry is not allowed.\n");
+ return false;
+ }
+
+ /* We don't allow modifying a table entry */
+ if ( pte_is_table(entry) )
+ {
+ printk("Modifying a table entry is not allowed.\n");
+ return false;
+ }
+ }
+ /* Sanity check when inserting a mapping */
+ else if ( flags & PTE_VALID )
+ {
+ /* We should be here with a valid MFN. */
+ ASSERT(!mfn_eq(mfn, INVALID_MFN));
+
+ /*
+ * We don't allow replacing any valid entry.
+ *
+ * Note that the function pt_update() relies on this
+ * assumption and will skip the TLB flush (when Svvptc
+ * extension will be ratified). The function will need
+ * to be updated if the check is relaxed.
+ */
+ if ( pte_is_valid(entry) )
+ {
+ if ( pte_is_mapping(entry) )
+ printk("Changing MFN for a valid entry is not allowed (%#"PRI_mfn" -> %#"PRI_mfn").\n",
+ mfn_x(mfn_from_pte(entry)), mfn_x(mfn));
+ else
+ printk("Trying to replace a table with a mapping.\n");
+ return false;
+ }
+ }
+ /* Sanity check when removing a mapping. */
+ else if ( (flags & (PTE_VALID | PTE_POPULATE)) == 0 )
+ {
+ /* We should be here with an invalid MFN. */
+ ASSERT(mfn_eq(mfn, INVALID_MFN));
+
+ /* We don't allow removing a table */
+ if ( pte_is_table(entry) )
+ {
+ printk("Removing a table is not allowed.\n");
+ return false;
+ }
+ }
+ /* Sanity check when populating the page-table. No check so far. */
+ else
+ {
+ ASSERT(flags & PTE_POPULATE);
+ /* We should be here with an invalid MFN */
+ ASSERT(mfn_eq(mfn, INVALID_MFN));
+ }
+
+ return true;
+}
+
+static pte_t *map_table(mfn_t mfn)
+{
+ /*
+ * During early boot, map_domain_page() may be unusable. Use the
+ * PMAP to map temporarily a page-table.
+ */
+ if ( system_state == SYS_STATE_early_boot )
+ return pmap_map(mfn);
+
+ return map_domain_page(mfn);
+}
+
+static void unmap_table(const pte_t *table)
+{
+ /*
+ * During early boot, map_table() will not use map_domain_page()
+ * but the PMAP.
+ */
+ if ( system_state == SYS_STATE_early_boot )
+ pmap_unmap(table);
+ else
+ unmap_domain_page(table);
+}
+
+static int create_table(pte_t *entry)
+{
+ mfn_t mfn;
+ void *p;
+ pte_t pte;
+
+ if ( system_state != SYS_STATE_early_boot )
+ {
+ struct page_info *pg = alloc_domheap_page(NULL, 0);
+
+ if ( pg == NULL )
+ return -ENOMEM;
+
+ mfn = page_to_mfn(pg);
+ }
+ else
+ mfn = alloc_boot_pages(1, 1);
+
+ p = map_table(mfn);
+ clear_page(p);
+ unmap_table(p);
+
+ pte = pte_from_mfn(mfn, PTE_TABLE);
+ write_pte(entry, pte);
+
+ return 0;
+}
+
+#define XEN_TABLE_MAP_FAILED 0
+#define XEN_TABLE_SUPER_PAGE 1
+#define XEN_TABLE_NORMAL 2
+
+/*
+ * Take the currently mapped table, find the corresponding entry,
+ * and map the next table, if available.
+ *
+ * The alloc_tbl parameters indicates whether intermediate tables should
+ * be allocated when not present.
+ *
+ * Return values:
+ * XEN_TABLE_MAP_FAILED: Either alloc_only was set and the entry
+ * was empty, or allocating a new page failed.
+ * XEN_TABLE_NORMAL: next level or leaf mapped normally
+ * XEN_TABLE_SUPER_PAGE: The next entry points to a superpage.
+ */
+static int pt_next_level(bool alloc_tbl, pte_t **table, unsigned int offset)
+{
+ pte_t *entry;
+ int ret;
+ mfn_t mfn;
+
+ entry = *table + offset;
+
+ if ( !pte_is_valid(*entry) )
+ {
+ if ( alloc_tbl )
+ return XEN_TABLE_MAP_FAILED;
+
+ ret = create_table(entry);
+ if ( ret )
+ return XEN_TABLE_MAP_FAILED;
+ }
+
+ if ( pte_is_mapping(*entry) )
+ return XEN_TABLE_SUPER_PAGE;
+
+ mfn = mfn_from_pte(*entry);
+
+ unmap_table(*table);
+ *table = map_table(mfn);
+
+ return XEN_TABLE_NORMAL;
+}
+
+/* Update an entry at the level @target. */
+static int pt_update_entry(mfn_t root, unsigned long virt,
+ mfn_t mfn, unsigned int target,
+ unsigned int flags)
+{
+ int rc;
+ unsigned int level = HYP_PT_ROOT_LEVEL;
+ pte_t *table;
+ /*
+ * The intermediate page table shouldn't be allocated when MFN isn't
+ * valid and we are not populating page table.
+ * This means we either modify permissions or remove an entry, or
+ * inserting brand new entry.
+ *
+ * See the comment above pt_update() for an additional explanation about
+ * combinations of (mfn, flags).
+ */
+ bool alloc_tbl = mfn_eq(mfn, INVALID_MFN) && !(flags & PTE_POPULATE);
+ pte_t pte, *entry;
+
+ /* convenience aliases */
+ DECLARE_OFFSETS(offsets, virt);
+
+ table = map_table(root);
+ for ( ; level > target; level-- )
+ {
+ rc = pt_next_level(alloc_tbl, &table, offsets[level]);
+ if ( rc == XEN_TABLE_MAP_FAILED )
+ {
+ rc = 0;
+
+ /*
+ * We are here because pt_next_level has failed to map
+ * the intermediate page table (e.g the table does not exist
+ * and the pt is read-only). It is a valid case when
+ * removing a mapping as it may not exist in the page table.
+ * In this case, just ignore it.
+ */
+ if ( flags & PTE_VALID )
+ {
+ printk("%s: Unable to map level %u\n", __func__, level);
+ rc = -ENOENT;
+ }
+
+ goto out;
+ }
+ else if ( rc != XEN_TABLE_NORMAL )
+ break;
+ }
+
+ if ( level != target )
+ {
+ printk("%s: Shattering superpage is not supported\n", __func__);
+ rc = -EOPNOTSUPP;
+ goto out;
+ }
+
+ entry = table + offsets[level];
+
+ rc = -EINVAL;
+ if ( !pt_check_entry(*entry, mfn, flags) )
+ goto out;
+
+ /* We are removing the page */
+ if ( !(flags & PTE_VALID) )
+ memset(&pte, 0x00, sizeof(pte));
+ else
+ {
+ /* We are inserting a mapping => Create new pte. */
+ if ( !mfn_eq(mfn, INVALID_MFN) )
+ pte = pte_from_mfn(mfn, PTE_VALID);
+ else /* We are updating the permission => Copy the current pte. */
+ pte = *entry;
+
+ /* update permission according to the flags */
+ pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | PTE_DIRTY;
+ }
+
+ write_pte(entry, pte);
+
+ rc = 0;
+
+ out:
+ unmap_table(table);
+
+ return rc;
+}
+
+/* Return the level where mapping should be done */
+static int pt_mapping_level(unsigned long vfn, mfn_t mfn, unsigned long nr,
+ unsigned int flags)
+{
+ unsigned int level = 0;
+ unsigned long mask;
+ unsigned int i;
+
+ /* Use blocking mapping unless the caller requests 4K mapping */
+ if ( unlikely(flags & PTE_SMALL) )
+ return level;
+
+ /*
+ * Don't take into account the MFN when removing mapping (i.e
+ * MFN_INVALID) to calculate the correct target order.
+ *
+ * `vfn` and `mfn` must be both superpage aligned.
+ * They are or-ed together and then checked against the size of
+ * each level.
+ *
+ * `left` is not included and checked separately to allow
+ * superpage mapping even if it is not properly aligned (the
+ * user may have asked to map 2MB + 4k).
+ */
+ mask = !mfn_eq(mfn, INVALID_MFN) ? mfn_x(mfn) : 0;
+ mask |= vfn;
+
+ for ( i = HYP_PT_ROOT_LEVEL; i != 0; i-- )
+ {
+ if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) &&
+ (nr >= BIT(XEN_PT_LEVEL_ORDER(i), UL)) )
+ {
+ level = i;
+ break;
+ }
+ }
+
+ return level;
+}
+
+static DEFINE_SPINLOCK(xen_pt_lock);
+
+/*
+ * If `mfn` equals `INVALID_MFN`, it indicates that the following page table
+ * update operation might be related to either populating the table (
+ * PTE_POPULATE will be set additionaly), destroying a mapping, or modifying
+ * an existing mapping.
+ *
+ * If `mfn` is valid and flags has PTE_VALID bit set then it means that
+ * inserting will be done.
+ */
+static int pt_update(unsigned long virt,
+ mfn_t mfn,
+ unsigned long nr_mfns,
+ unsigned int flags)
+{
+ int rc = 0;
+ unsigned long vfn = virt >> PAGE_SHIFT;
+ unsigned long left = nr_mfns;
+
+ const mfn_t root = get_root_page();
+
+ /*
+ * It is bad idea to have mapping both writeable and
+ * executable.
+ * When modifying/creating mapping (i.e PTE_VALID is set),
+ * prevent any update if this happen.
+ */
+ if ( (flags & PTE_VALID) && PTE_W_MASK(flags) && PTE_X_MASK(flags) )
+ {
+ printk("Mappings should not be both Writeable and Executable.\n");
+ return -EINVAL;
+ }
+
+ if ( !IS_ALIGNED(virt, PAGE_SIZE) )
+ {
+ printk("The virtual address is not aligned to the page-size.\n");
+ return -EINVAL;
+ }
+
+ spin_lock(&xen_pt_lock);
+
+ while ( left )
+ {
+ unsigned int order, level;
+
+ level = pt_mapping_level(vfn, mfn, left, flags);
+ order = XEN_PT_LEVEL_ORDER(level);
+
+ ASSERT(left >= BIT(order, UL));
+
+ rc = pt_update_entry(root, vfn << PAGE_SHIFT, mfn, level, flags);
+ if ( rc )
+ break;
+
+ vfn += 1UL << order;
+ if ( !mfn_eq(mfn, INVALID_MFN) )
+ mfn = mfn_add(mfn, 1UL << order);
+
+ left -= (1UL << order);
+ }
+
+ /* Ensure that PTEs are all updated before flushing */
+ RISCV_FENCE(rw, rw);
+
+ spin_unlock(&xen_pt_lock);
+
+ /*
+ * Always flush TLB at the end of the function as non-present entries
+ * can be put in the TLB.
+ *
+ * The remote fence operation applies to the entire address space if
+ * either:
+ * - start and size are both 0, or
+ * - size is equal to 2^XLEN-1.
+ *
+ * TODO: come up with something which will allow not to flash the entire
+ * address space.
+ */
+ flush_tlb_range_va(0, 0);
+
+ return rc;
+}
+
+int map_pages_to_xen(unsigned long virt,
+ mfn_t mfn,
+ unsigned long nr_mfns,
+ unsigned int flags)
+{
+ /*
+ * Ensure that flags has PTE_VALID bit as map_pages_to_xen() is supposed
+ * to create a mapping.
+ *
+ * Ensure that we have a valid MFN before proceeding.
+ *
+ * If the MFN is invalid, pt_update() might misinterpret the operation,
+ * treating it as either a population, a mapping destruction,
+ * or a mapping modification.
+ */
+ ASSERT(!mfn_eq(mfn, INVALID_MFN) || (flags & PTE_VALID));
+
+ return pt_update(virt, mfn, nr_mfns, flags);
+}
--
2.46.0
On 21.08.2024 18:06, Oleksii Kurochko wrote: > Implement map_pages_to_xen() which requires several > functions to manage page tables and entries: > - pt_update() > - pt_mapping_level() > - pt_update_entry() > - pt_next_level() > - pt_check_entry() > > To support these operations, add functions for creating, > mapping, and unmapping Xen tables: > - create_table() > - map_table() > - unmap_table() > > Introduce internal macros starting with PTE_* for convenience. > These macros closely resemble PTE bits, with the exception of > PTE_SMALL, which indicates that 4KB is needed. What macros are you talking about here? Is this partially stale, as only PTE_SMALL and PTE_POPULATE (and a couple of masks) are being added? > --- a/xen/arch/riscv/include/asm/flushtlb.h > +++ b/xen/arch/riscv/include/asm/flushtlb.h > @@ -5,12 +5,24 @@ > #include <xen/bug.h> > #include <xen/cpumask.h> > > +#include <asm/sbi.h> > + > /* Flush TLB of local processor for address va. */ > static inline void flush_tlb_one_local(vaddr_t va) > { > asm volatile ( "sfence.vma %0" :: "r" (va) : "memory" ); > } > > +/* > + * Flush a range of VA's hypervisor mappings from the TLB of all > + * processors in the inner-shareable domain. > + */ Isn't inner-sharable an Arm term? Don't you simply mean "all" here? > @@ -68,6 +111,20 @@ static inline bool pte_is_valid(pte_t p) > return p.pte & PTE_VALID; > } > > +inline bool pte_is_table(const pte_t p) > +{ > + return ((p.pte & (PTE_VALID | > + PTE_READABLE | > + PTE_WRITABLE | > + PTE_EXECUTABLE)) == PTE_VALID); > +} In how far is the READABLE check valid here? You (imo correctly) ... > +static inline bool pte_is_mapping(const pte_t p) > +{ > + return (p.pte & PTE_VALID) && > + (p.pte & (PTE_WRITABLE | PTE_EXECUTABLE)); > +} ... don't consider this bit here. > --- a/xen/arch/riscv/include/asm/riscv_encoding.h > +++ b/xen/arch/riscv/include/asm/riscv_encoding.h > @@ -164,6 +164,7 @@ > #define SSTATUS_SD SSTATUS64_SD > #define SATP_MODE SATP64_MODE > #define SATP_MODE_SHIFT SATP64_MODE_SHIFT > +#define SATP_PPN_MASK _UL(0x00000FFFFFFFFFFF) > > #define HGATP_PPN HGATP64_PPN > #define HGATP_VMID_SHIFT HGATP64_VMID_SHIFT This looks odd, padding-wise, but that's because hard tabs are being used here. Is that intentional? > --- /dev/null > +++ b/xen/arch/riscv/pt.c > @@ -0,0 +1,420 @@ > +#include <xen/bug.h> > +#include <xen/domain_page.h> > +#include <xen/errno.h> > +#include <xen/mm.h> > +#include <xen/mm-frame.h> > +#include <xen/pmap.h> > +#include <xen/spinlock.h> > + > +#include <asm/flushtlb.h> > +#include <asm/page.h> > + > +static inline const mfn_t get_root_page(void) > +{ > + paddr_t root_maddr = (csr_read(CSR_SATP) & SATP_PPN_MASK) << PAGE_SHIFT; > + > + return maddr_to_mfn(root_maddr); > +} > + > +/* Sanity check of the entry. */ > +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int flags) > +{ > + /* > + * See the comment about the possible combination of (mfn, flags) in > + * the comment above pt_update(). > + */ > + > + /* Sanity check when modifying an entry. */ > + if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) ) > + { > + /* We don't allow modifying an invalid entry. */ > + if ( !pte_is_valid(entry) ) > + { > + printk("Modifying invalid entry is not allowed.\n"); Perhaps all of these printk()s should be dprintk()? And not have a full stop? > + return false; > + } > + > + /* We don't allow modifying a table entry */ > + if ( pte_is_table(entry) ) > + { > + printk("Modifying a table entry is not allowed.\n"); > + return false; > + } > + } > + /* Sanity check when inserting a mapping */ > + else if ( flags & PTE_VALID ) > + { > + /* We should be here with a valid MFN. */ > + ASSERT(!mfn_eq(mfn, INVALID_MFN)); This is odd to have here, considering the if() further up. > + /* > + * We don't allow replacing any valid entry. > + * > + * Note that the function pt_update() relies on this > + * assumption and will skip the TLB flush (when Svvptc > + * extension will be ratified). The function will need > + * to be updated if the check is relaxed. > + */ > + if ( pte_is_valid(entry) ) > + { > + if ( pte_is_mapping(entry) ) > + printk("Changing MFN for a valid entry is not allowed (%#"PRI_mfn" -> %#"PRI_mfn").\n", > + mfn_x(mfn_from_pte(entry)), mfn_x(mfn)); > + else > + printk("Trying to replace a table with a mapping.\n"); > + return false; > + } > + } > + /* Sanity check when removing a mapping. */ > + else if ( (flags & (PTE_VALID | PTE_POPULATE)) == 0 ) The PTE_VALID part of the check is pointless considering the earlier if(). I guess you may want to have it for doc purposes ... Since further up you're using "else if ( flags & PTE_VALID )" imo here you want to use "else if ( !(flags & ...) )". > + { > + /* We should be here with an invalid MFN. */ > + ASSERT(mfn_eq(mfn, INVALID_MFN)); > + > + /* We don't allow removing a table */ > + if ( pte_is_table(entry) ) > + { > + printk("Removing a table is not allowed.\n"); > + return false; > + } Is this restriction temporary? > + } > + /* Sanity check when populating the page-table. No check so far. */ > + else > + { > + ASSERT(flags & PTE_POPULATE); This again is redundant with earlier if() conditions. > +#define XEN_TABLE_MAP_FAILED 0 > +#define XEN_TABLE_SUPER_PAGE 1 > +#define XEN_TABLE_NORMAL 2 > + > +/* > + * Take the currently mapped table, find the corresponding entry, > + * and map the next table, if available. > + * > + * The alloc_tbl parameters indicates whether intermediate tables should > + * be allocated when not present. > + * > + * Return values: > + * XEN_TABLE_MAP_FAILED: Either alloc_only was set and the entry > + * was empty, or allocating a new page failed. > + * XEN_TABLE_NORMAL: next level or leaf mapped normally > + * XEN_TABLE_SUPER_PAGE: The next entry points to a superpage. > + */ > +static int pt_next_level(bool alloc_tbl, pte_t **table, unsigned int offset) Having the boolean first is unusual, but well - it's your choice. > +{ > + pte_t *entry; > + int ret; > + mfn_t mfn; > + > + entry = *table + offset; > + > + if ( !pte_is_valid(*entry) ) > + { > + if ( alloc_tbl ) > + return XEN_TABLE_MAP_FAILED; Is this condition meant to be inverted? > + ret = create_table(entry); > + if ( ret ) > + return XEN_TABLE_MAP_FAILED; You don't really use "ret". Why not omit the local variable, even more so that it has too wide scope? > +/* Update an entry at the level @target. */ > +static int pt_update_entry(mfn_t root, unsigned long virt, > + mfn_t mfn, unsigned int target, > + unsigned int flags) > +{ > + int rc; > + unsigned int level = HYP_PT_ROOT_LEVEL; > + pte_t *table; > + /* > + * The intermediate page table shouldn't be allocated when MFN isn't > + * valid and we are not populating page table. > + * This means we either modify permissions or remove an entry, or > + * inserting brand new entry. > + * > + * See the comment above pt_update() for an additional explanation about > + * combinations of (mfn, flags). > + */ > + bool alloc_tbl = mfn_eq(mfn, INVALID_MFN) && !(flags & PTE_POPULATE); Is this meant to be inverted, too (to actually match variable name and comment)? > + pte_t pte, *entry; > + > + /* convenience aliases */ > + DECLARE_OFFSETS(offsets, virt); > + > + table = map_table(root); > + for ( ; level > target; level-- ) > + { > + rc = pt_next_level(alloc_tbl, &table, offsets[level]); > + if ( rc == XEN_TABLE_MAP_FAILED ) > + { > + rc = 0; > + > + /* > + * We are here because pt_next_level has failed to map > + * the intermediate page table (e.g the table does not exist > + * and the pt is read-only). It is a valid case when I'm sorry, but there's still a "read-only" left here. > + * removing a mapping as it may not exist in the page table. > + * In this case, just ignore it. > + */ > + if ( flags & PTE_VALID ) > + { > + printk("%s: Unable to map level %u\n", __func__, level); > + rc = -ENOENT; > + } > + > + goto out; > + } > + else if ( rc != XEN_TABLE_NORMAL ) No need for "else" when the earlier if() ends in "goto". > + break; > + } > + > + if ( level != target ) > + { > + printk("%s: Shattering superpage is not supported\n", __func__); > + rc = -EOPNOTSUPP; > + goto out; > + } > + > + entry = table + offsets[level]; > + > + rc = -EINVAL; > + if ( !pt_check_entry(*entry, mfn, flags) ) > + goto out; > + > + /* We are removing the page */ > + if ( !(flags & PTE_VALID) ) > + memset(&pte, 0x00, sizeof(pte)); > + else > + { > + /* We are inserting a mapping => Create new pte. */ > + if ( !mfn_eq(mfn, INVALID_MFN) ) > + pte = pte_from_mfn(mfn, PTE_VALID); > + else /* We are updating the permission => Copy the current pte. */ > + pte = *entry; > + > + /* update permission according to the flags */ > + pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | PTE_DIRTY; When updating an entry, don't you also need to clear (some of) the flags? > +/* Return the level where mapping should be done */ > +static int pt_mapping_level(unsigned long vfn, mfn_t mfn, unsigned long nr, > + unsigned int flags) > +{ > + unsigned int level = 0; > + unsigned long mask; > + unsigned int i; > + > + /* Use blocking mapping unless the caller requests 4K mapping */ > + if ( unlikely(flags & PTE_SMALL) ) > + return level; > + > + /* > + * Don't take into account the MFN when removing mapping (i.e > + * MFN_INVALID) to calculate the correct target order. > + * > + * `vfn` and `mfn` must be both superpage aligned. > + * They are or-ed together and then checked against the size of > + * each level. > + * > + * `left` is not included and checked separately to allow > + * superpage mapping even if it is not properly aligned (the > + * user may have asked to map 2MB + 4k). What is this about? There's nothing named "left" here. > + */ > + mask = !mfn_eq(mfn, INVALID_MFN) ? mfn_x(mfn) : 0; > + mask |= vfn; > + > + for ( i = HYP_PT_ROOT_LEVEL; i != 0; i-- ) > + { > + if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) && > + (nr >= BIT(XEN_PT_LEVEL_ORDER(i), UL)) ) > + { > + level = i; > + break; > + } > + } > + > + return level; > +} > + > +static DEFINE_SPINLOCK(xen_pt_lock); Another largely meaningless xen_ prefix? > +/* > + * If `mfn` equals `INVALID_MFN`, it indicates that the following page table > + * update operation might be related to either populating the table ( > + * PTE_POPULATE will be set additionaly), destroying a mapping, or modifying > + * an existing mapping. And the latter two are distinguished by? PTE_VALID? > + * If `mfn` is valid and flags has PTE_VALID bit set then it means that > + * inserting will be done. > + */ What about mfn != INVALID_MFN and PTE_VALID clear? Also note that "`mfn` is valid" isn't the same as "mfn != INVALID_MFN". You want to be precise here, to avoid confusion later on. (I say that knowing that we're still fighting especially shadow paging code on x86 not having those properly separated.) > +static int pt_update(unsigned long virt, > + mfn_t mfn, > + unsigned long nr_mfns, > + unsigned int flags) > +{ > + int rc = 0; > + unsigned long vfn = virt >> PAGE_SHIFT; Please don't open-code e.g PFN_DOWN(). > + unsigned long left = nr_mfns; > + > + const mfn_t root = get_root_page(); > + > + /* > + * It is bad idea to have mapping both writeable and > + * executable. > + * When modifying/creating mapping (i.e PTE_VALID is set), > + * prevent any update if this happen. > + */ > + if ( (flags & PTE_VALID) && PTE_W_MASK(flags) && PTE_X_MASK(flags) ) Seeing them in use, I wonder about the naming of those PTE_?_MASK() macros. Along with the lhs, why not simply (flags & PTE_...)? Jan
On Tue, 2024-08-27 at 17:00 +0200, Jan Beulich wrote: > On 21.08.2024 18:06, Oleksii Kurochko wrote: > > Implement map_pages_to_xen() which requires several > > functions to manage page tables and entries: > > - pt_update() > > - pt_mapping_level() > > - pt_update_entry() > > - pt_next_level() > > - pt_check_entry() > > > > To support these operations, add functions for creating, > > mapping, and unmapping Xen tables: > > - create_table() > > - map_table() > > - unmap_table() > > > > Introduce internal macros starting with PTE_* for convenience. > > These macros closely resemble PTE bits, with the exception of > > PTE_SMALL, which indicates that 4KB is needed. > > What macros are you talking about here? Is this partially stale, as > only PTE_SMALL and PTE_POPULATE (and a couple of masks) are being > added? I am speaking about macros connected to masks: #define PTE_R_MASK(x) ((x) & PTE_READABLE) #define PTE_W_MASK(x) ((x) & PTE_WRITABLE) #define PTE_X_MASK(x) ((x) & PTE_EXECUTABLE) #define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE)) > > > --- a/xen/arch/riscv/include/asm/flushtlb.h > > +++ b/xen/arch/riscv/include/asm/flushtlb.h > > @@ -5,12 +5,24 @@ > > #include <xen/bug.h> > > #include <xen/cpumask.h> > > > > +#include <asm/sbi.h> > > + > > /* Flush TLB of local processor for address va. */ > > static inline void flush_tlb_one_local(vaddr_t va) > > { > > asm volatile ( "sfence.vma %0" :: "r" (va) : "memory" ); > > } > > > > +/* > > + * Flush a range of VA's hypervisor mappings from the TLB of all > > + * processors in the inner-shareable domain. > > + */ > > Isn't inner-sharable an Arm term? Don't you simply mean "all" here? Yes, this is Arm term. It should used "all" instead. Thanks. > > > @@ -68,6 +111,20 @@ static inline bool pte_is_valid(pte_t p) > > return p.pte & PTE_VALID; > > } > > > > +inline bool pte_is_table(const pte_t p) > > +{ > > + return ((p.pte & (PTE_VALID | > > + PTE_READABLE | > > + PTE_WRITABLE | > > + PTE_EXECUTABLE)) == PTE_VALID); > > +} > > In how far is the READABLE check valid here? You (imo correctly) ... > > > +static inline bool pte_is_mapping(const pte_t p) > > +{ > > + return (p.pte & PTE_VALID) && > > + (p.pte & (PTE_WRITABLE | PTE_EXECUTABLE)); > > +} > > ... don't consider this bit here. pte_is_mapping() seems to me is correct as according to the RISC-V privileged spec: 4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to step 5. Otherwise, this PTE is a pointer to the next level of the page table. 5. A leaf PTE has been found. ... and regarding pte_is_table() READABLE check is valid as we have to check only that pte.r = pte.x = 0. WRITABLE check should be dropped. Or just use define pte_is_table() as: inline bool pte_is_table(const pte_t p) { return !pte_is_mapping(p); } > > > --- a/xen/arch/riscv/include/asm/riscv_encoding.h > > +++ b/xen/arch/riscv/include/asm/riscv_encoding.h > > @@ -164,6 +164,7 @@ > > #define SSTATUS_SD SSTATUS64_SD > > #define SATP_MODE SATP64_MODE > > #define SATP_MODE_SHIFT SATP64_MODE_SHIFT > > +#define SATP_PPN_MASK _UL(0x00000FFFFFFFFFFF) > > > > #define HGATP_PPN HGATP64_PPN > > #define HGATP_VMID_SHIFT HGATP64_VMID_SHIFT > > This looks odd, padding-wise, but that's because hard tabs are being > used here. Is that intentional? I use tabs here because riscv_encoding.h was copied from Linux kernel which uses hard tabs and definitions above use 3 tabs so I used 3 hard tabs too. > > > --- /dev/null > > +++ b/xen/arch/riscv/pt.c > > @@ -0,0 +1,420 @@ > > +#include <xen/bug.h> > > +#include <xen/domain_page.h> > > +#include <xen/errno.h> > > +#include <xen/mm.h> > > +#include <xen/mm-frame.h> > > +#include <xen/pmap.h> > > +#include <xen/spinlock.h> > > + > > +#include <asm/flushtlb.h> > > +#include <asm/page.h> > > + > > +static inline const mfn_t get_root_page(void) > > +{ > > + paddr_t root_maddr = (csr_read(CSR_SATP) & SATP_PPN_MASK) << > > PAGE_SHIFT; > > + > > + return maddr_to_mfn(root_maddr); > > +} > > + > > +/* Sanity check of the entry. */ > > +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int > > flags) > > +{ > > + /* > > + * See the comment about the possible combination of (mfn, > > flags) in > > + * the comment above pt_update(). > > + */ > > + > > + /* Sanity check when modifying an entry. */ > > + if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) ) > > + { > > + /* We don't allow modifying an invalid entry. */ > > + if ( !pte_is_valid(entry) ) > > + { > > + printk("Modifying invalid entry is not allowed.\n"); > > Perhaps all of these printk()s should be dprintk()? It could be dprintk() but at the same time I don't see any issue if it will be printed once. > And not have a full > stop? By "full stop," do you mean something like panic() or BUG_ON()? The error is propagated up to the caller, which then calls panic(). Anexample of this is: if ( (offset + size) > MB(2) ) { rc = map_pages_to_xen(BOOT_FDT_VIRT_START + MB(2), maddr_to_mfn(base_paddr + MB(2)), MB(2) >> PAGE_SHIFT, PAGE_HYPERVISOR_RO); if ( rc ) panic("Unable to map the device-tree\n"); } If it would be better for some reason to call panic() or BUG_ON() as soon as pt_check_entry() returns false, I can do it that way as well. > > > + return false; > > + } > > + > > + /* We don't allow modifying a table entry */ > > + if ( pte_is_table(entry) ) > > + { > > + printk("Modifying a table entry is not allowed.\n"); > > + return false; > > + } > > + } > > + /* Sanity check when inserting a mapping */ > > + else if ( flags & PTE_VALID ) > > + { > > + /* We should be here with a valid MFN. */ > > + ASSERT(!mfn_eq(mfn, INVALID_MFN)); > > This is odd to have here, considering the if() further up. Agree, ASSERT() could be drop. > > > + /* > > + * We don't allow replacing any valid entry. > > + * > > + * Note that the function pt_update() relies on this > > + * assumption and will skip the TLB flush (when Svvptc > > + * extension will be ratified). The function will need > > + * to be updated if the check is relaxed. > > + */ > > + if ( pte_is_valid(entry) ) > > + { > > + if ( pte_is_mapping(entry) ) > > + printk("Changing MFN for a valid entry is not > > allowed (%#"PRI_mfn" -> %#"PRI_mfn").\n", > > + mfn_x(mfn_from_pte(entry)), mfn_x(mfn)); > > + else > > + printk("Trying to replace a table with a > > mapping.\n"); > > + return false; > > + } > > + } > > + /* Sanity check when removing a mapping. */ > > + else if ( (flags & (PTE_VALID | PTE_POPULATE)) == 0 ) > > The PTE_VALID part of the check is pointless considering the earlier > if(). I guess you may want to have it for doc purposes ... Yes, it just helps to read the code and understand "confusing" if's() above. > > Since further up you're using "else if ( flags & PTE_VALID )" imo > here you want to use "else if ( !(flags & ...) )". > > > + { > > + /* We should be here with an invalid MFN. */ > > + ASSERT(mfn_eq(mfn, INVALID_MFN)); > > + > > + /* We don't allow removing a table */ > > + if ( pte_is_table(entry) ) > > + { > > + printk("Removing a table is not allowed.\n"); > > + return false; > > + } > > Is this restriction temporary? Yes. > > > + } > > + /* Sanity check when populating the page-table. No check so > > far. */ > > + else > > + { > > + ASSERT(flags & PTE_POPULATE); > > This again is redundant with earlier if() conditions. > > > +#define XEN_TABLE_MAP_FAILED 0 > > +#define XEN_TABLE_SUPER_PAGE 1 > > +#define XEN_TABLE_NORMAL 2 > > + > > +/* > > + * Take the currently mapped table, find the corresponding entry, > > + * and map the next table, if available. > > + * > > + * The alloc_tbl parameters indicates whether intermediate tables > > should > > + * be allocated when not present. > > + * > > + * Return values: > > + * XEN_TABLE_MAP_FAILED: Either alloc_only was set and the entry > > + * was empty, or allocating a new page failed. > > + * XEN_TABLE_NORMAL: next level or leaf mapped normally > > + * XEN_TABLE_SUPER_PAGE: The next entry points to a superpage. > > + */ > > +static int pt_next_level(bool alloc_tbl, pte_t **table, unsigned > > int offset) > > Having the boolean first is unusual, but well - it's your choice. > > > +{ > > + pte_t *entry; > > + int ret; > > + mfn_t mfn; > > + > > + entry = *table + offset; > > + > > + if ( !pte_is_valid(*entry) ) > > + { > > + if ( alloc_tbl ) > > + return XEN_TABLE_MAP_FAILED; > > Is this condition meant to be inverted? if alloc_tbl = true we shouldn't allocatetable as: * The intermediate page table shouldn't be allocated when MFN isn't * valid and we are not populating page table. ... */ bool alloc_tbl = mfn_eq(mfn, INVALID_MFN) && !(flags & PTE_POPULATE); So if mfn = INVALID_MFN and flags.PTE_POPULATE=0 it means that this table shouldn't be allocated and thereby pt_next_level() should return XEN_TABLE_MAP_FAILED. Or to invert if ( alloc_tbl )it will be needed to invert defintion of alloc_tbl: bool alloc_tbl = !mfn_eq(mfn, INVALID_MFN) || (flags & PTE_POPULATE); > > > + ret = create_table(entry); > > + if ( ret ) > > + return XEN_TABLE_MAP_FAILED; > > You don't really use "ret". Why not omit the local variable, even > more so that it has too wide scope? I'll omit that, it is really useless. > > > +/* Update an entry at the level @target. */ > > +static int pt_update_entry(mfn_t root, unsigned long virt, > > + mfn_t mfn, unsigned int target, > > + unsigned int flags) > > +{ > > + int rc; > > + unsigned int level = HYP_PT_ROOT_LEVEL; > > + pte_t *table; > > + /* > > + * The intermediate page table shouldn't be allocated when MFN > > isn't > > + * valid and we are not populating page table. > > + * This means we either modify permissions or remove an entry, > > or > > + * inserting brand new entry. > > + * > > + * See the comment above pt_update() for an additional > > explanation about > > + * combinations of (mfn, flags). > > + */ > > + bool alloc_tbl = mfn_eq(mfn, INVALID_MFN) && !(flags & > > PTE_POPULATE); > > Is this meant to be inverted, too (to actually match variable name > and > comment)? Oh, you mentioned that too. I wrote the similar above. I think it would be better to invert if we want to use alloc_tbl variable name. > > > + break; > > + } > > + > > + if ( level != target ) > > + { > > + printk("%s: Shattering superpage is not supported\n", > > __func__); > > + rc = -EOPNOTSUPP; > > + goto out; > > + } > > + > > + entry = table + offsets[level]; > > + > > + rc = -EINVAL; > > + if ( !pt_check_entry(*entry, mfn, flags) ) > > + goto out; > > + > > + /* We are removing the page */ > > + if ( !(flags & PTE_VALID) ) > > + memset(&pte, 0x00, sizeof(pte)); > > + else > > + { > > + /* We are inserting a mapping => Create new pte. */ > > + if ( !mfn_eq(mfn, INVALID_MFN) ) > > + pte = pte_from_mfn(mfn, PTE_VALID); > > + else /* We are updating the permission => Copy the current > > pte. */ > > + pte = *entry; > > + > > + /* update permission according to the flags */ > > + pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | PTE_DIRTY; > > When updating an entry, don't you also need to clear (some of) the > flags? I am not sure why some flags should be cleared. Here we are taking only necessary for pte flags such as R, W, X or other bits in flags are ignored. > > > +/* Return the level where mapping should be done */ > > +static int pt_mapping_level(unsigned long vfn, mfn_t mfn, unsigned > > long nr, > > + unsigned int flags) > > +{ > > + unsigned int level = 0; > > + unsigned long mask; > > + unsigned int i; > > + > > + /* Use blocking mapping unless the caller requests 4K mapping > > */ > > + if ( unlikely(flags & PTE_SMALL) ) > > + return level; > > + > > + /* > > + * Don't take into account the MFN when removing mapping (i.e > > + * MFN_INVALID) to calculate the correct target order. > > + * > > + * `vfn` and `mfn` must be both superpage aligned. > > + * They are or-ed together and then checked against the size > > of > > + * each level. > > + * > > + * `left` is not included and checked separately to allow > > + * superpage mapping even if it is not properly aligned (the > > + * user may have asked to map 2MB + 4k). > > What is this about? There's nothing named "left" here. It refer to "remaining" pages or "leftover" space after trying to align a mapping to a superpage boundary. > > > + */ > > + mask = !mfn_eq(mfn, INVALID_MFN) ? mfn_x(mfn) : 0; > > + mask |= vfn; > > + > > + for ( i = HYP_PT_ROOT_LEVEL; i != 0; i-- ) > > + { > > + if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) && > > + (nr >= BIT(XEN_PT_LEVEL_ORDER(i), UL)) ) > > + { > > + level = i; > > + break; > > + } > > + } > > + > > + return level; > > +} > > + > > +static DEFINE_SPINLOCK(xen_pt_lock); > > Another largely meaningless xen_ prefix? Thanks. I'll drop it. > > > +/* > > + * If `mfn` equals `INVALID_MFN`, it indicates that the following > > page table > > + * update operation might be related to either populating the > > table ( > > + * PTE_POPULATE will be set additionaly), destroying a mapping, or > > modifying > > + * an existing mapping. > > And the latter two are distinguished by? PTE_VALID? inserting -> (PTE_VALID=1 + (mfn=something valid)) destroying-> ( PTE_VALID=0 ) > > > + * If `mfn` is valid and flags has PTE_VALID bit set then it means > > that > > + * inserting will be done. > > + */ > > What about mfn != INVALID_MFN and PTE_VALID clear? PTE_VALID=0 will be always considered as destroying and no matter what is mfn value as in this case the removing is done in the way where mfn isn't used: memset(&pte, 0x00, sizeof(pte)); > Also note that "`mfn` is > valid" isn't the same as "mfn != INVALID_MFN". You want to be precise > here, > to avoid confusion later on. (I say that knowing that we're still > fighting > especially shadow paging code on x86 not having those properly > separated.) If it is needed to be precise and mfn is valid isn't the same as "mfn != INVALID_MFN" only for the case of shadow paging? > > > > > + unsigned long left = nr_mfns; > > + > > + const mfn_t root = get_root_page(); > > + > > + /* > > + * It is bad idea to have mapping both writeable and > > + * executable. > > + * When modifying/creating mapping (i.e PTE_VALID is set), > > + * prevent any update if this happen. > > + */ > > + if ( (flags & PTE_VALID) && PTE_W_MASK(flags) && > > PTE_X_MASK(flags) ) > > Seeing them in use, I wonder about the naming of those PTE_?_MASK() > macros. Along with the lhs, why not simply (flags & PTE_...)? Hmm, good point. They can be really dropped with simplification of the mentioned if(...). Thanks. ~ Oleksii
On 28.08.2024 18:11, oleksii.kurochko@gmail.com wrote: > On Tue, 2024-08-27 at 17:00 +0200, Jan Beulich wrote: >> On 21.08.2024 18:06, Oleksii Kurochko wrote: >>> Implement map_pages_to_xen() which requires several >>> functions to manage page tables and entries: >>> - pt_update() >>> - pt_mapping_level() >>> - pt_update_entry() >>> - pt_next_level() >>> - pt_check_entry() >>> >>> To support these operations, add functions for creating, >>> mapping, and unmapping Xen tables: >>> - create_table() >>> - map_table() >>> - unmap_table() >>> >>> Introduce internal macros starting with PTE_* for convenience. >>> These macros closely resemble PTE bits, with the exception of >>> PTE_SMALL, which indicates that 4KB is needed. >> >> What macros are you talking about here? Is this partially stale, as >> only PTE_SMALL and PTE_POPULATE (and a couple of masks) are being >> added? > I am speaking about macros connected to masks: > #define PTE_R_MASK(x) ((x) & PTE_READABLE) > #define PTE_W_MASK(x) ((x) & PTE_WRITABLE) > #define PTE_X_MASK(x) ((x) & PTE_EXECUTABLE) > > #define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE | > PTE_EXECUTABLE)) Some of which is did question further down in my reply. But what's worse - by saying "closely resemble PTE bits, with the exception of PTE_SMALL" you pretty clearly _do not_ refer to the macros above, but to PTE_VALID etc. >>> @@ -68,6 +111,20 @@ static inline bool pte_is_valid(pte_t p) >>> return p.pte & PTE_VALID; >>> } >>> >>> +inline bool pte_is_table(const pte_t p) >>> +{ >>> + return ((p.pte & (PTE_VALID | >>> + PTE_READABLE | >>> + PTE_WRITABLE | >>> + PTE_EXECUTABLE)) == PTE_VALID); >>> +} >> >> In how far is the READABLE check valid here? You (imo correctly) ... Oh, I wrongly picked on READABLE when it should have been the WRITABLE bit. >>> +static inline bool pte_is_mapping(const pte_t p) >>> +{ >>> + return (p.pte & PTE_VALID) && >>> + (p.pte & (PTE_WRITABLE | PTE_EXECUTABLE)); >>> +} >> >> ... don't consider this bit here. > pte_is_mapping() seems to me is correct as according to the RISC-V > privileged spec: > 4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to step > 5. Otherwise, this PTE is a pointer to the next level of the page > table. > 5. A leaf PTE has been found. ... Right. And then why do you check all three of r, x, and w, when the doc mentions only r and x? There may be reasons, but such reasons then need clearly stating in a code comment, for people to understand why the code is not following the spec. > and regarding pte_is_table() READABLE check is valid as we have to > check only that pte.r = pte.x = 0. WRITABLE check should be dropped. Or > just use define pte_is_table() as: > inline bool pte_is_table(const pte_t p) > { > return !pte_is_mapping(p); > } You had it like this earlier on, didn't you? That's wrong, because for a PTE to describe another page table level PTE_VALID needs to be set. >>> --- /dev/null >>> +++ b/xen/arch/riscv/pt.c >>> @@ -0,0 +1,420 @@ >>> +#include <xen/bug.h> >>> +#include <xen/domain_page.h> >>> +#include <xen/errno.h> >>> +#include <xen/mm.h> >>> +#include <xen/mm-frame.h> >>> +#include <xen/pmap.h> >>> +#include <xen/spinlock.h> >>> + >>> +#include <asm/flushtlb.h> >>> +#include <asm/page.h> >>> + >>> +static inline const mfn_t get_root_page(void) >>> +{ >>> + paddr_t root_maddr = (csr_read(CSR_SATP) & SATP_PPN_MASK) << >>> PAGE_SHIFT; >>> + >>> + return maddr_to_mfn(root_maddr); >>> +} >>> + >>> +/* Sanity check of the entry. */ >>> +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int >>> flags) >>> +{ >>> + /* >>> + * See the comment about the possible combination of (mfn, >>> flags) in >>> + * the comment above pt_update(). >>> + */ >>> + >>> + /* Sanity check when modifying an entry. */ >>> + if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) ) >>> + { >>> + /* We don't allow modifying an invalid entry. */ >>> + if ( !pte_is_valid(entry) ) >>> + { >>> + printk("Modifying invalid entry is not allowed.\n"); >> >> Perhaps all of these printk()s should be dprintk()? > It could be dprintk() but at the same time I don't see any issue if it > will be printed once. What guarantees that it wouldn't be logged over and over? It's simply bad practice to accompany all error returns with log messages, even in release builds. Even if right now you're only in the bring-up phase, you still want to have security in mind. If any such log message ended up reachable from a guest-invoked path, an XSA would be needed. >> And not have a full >> stop? > By "full stop," do you mean something like panic() or BUG_ON()? No. "Full stop" is the period at the end of a sentence (which shouldn't normally be there at the end of log messages). >>> + /* >>> + * We don't allow replacing any valid entry. >>> + * >>> + * Note that the function pt_update() relies on this >>> + * assumption and will skip the TLB flush (when Svvptc >>> + * extension will be ratified). The function will need >>> + * to be updated if the check is relaxed. >>> + */ >>> + if ( pte_is_valid(entry) ) >>> + { >>> + if ( pte_is_mapping(entry) ) >>> + printk("Changing MFN for a valid entry is not >>> allowed (%#"PRI_mfn" -> %#"PRI_mfn").\n", >>> + mfn_x(mfn_from_pte(entry)), mfn_x(mfn)); >>> + else >>> + printk("Trying to replace a table with a >>> mapping.\n"); >>> + return false; >>> + } >>> + } >>> + /* Sanity check when removing a mapping. */ >>> + else if ( (flags & (PTE_VALID | PTE_POPULATE)) == 0 ) >> >> The PTE_VALID part of the check is pointless considering the earlier >> if(). I guess you may want to have it for doc purposes ... > Yes, it just helps to read the code and understand "confusing" if's() > above. Well, since you mention "confusing": I for one consider such redundant checks confusing. It raises the question whether this check is wrong or the earlier one is. Therefore, if you want to keep the redundancy, it may help if you extend the comment to mention it's actually redundant (e.g. by saying "for completeness" or some such). >>> +#define XEN_TABLE_MAP_FAILED 0 >>> +#define XEN_TABLE_SUPER_PAGE 1 >>> +#define XEN_TABLE_NORMAL 2 >>> + >>> +/* >>> + * Take the currently mapped table, find the corresponding entry, >>> + * and map the next table, if available. >>> + * >>> + * The alloc_tbl parameters indicates whether intermediate tables >>> should >>> + * be allocated when not present. >>> + * >>> + * Return values: >>> + * XEN_TABLE_MAP_FAILED: Either alloc_only was set and the entry >>> + * was empty, or allocating a new page failed. >>> + * XEN_TABLE_NORMAL: next level or leaf mapped normally >>> + * XEN_TABLE_SUPER_PAGE: The next entry points to a superpage. >>> + */ >>> +static int pt_next_level(bool alloc_tbl, pte_t **table, unsigned >>> int offset) >> >> Having the boolean first is unusual, but well - it's your choice. >> >>> +{ >>> + pte_t *entry; >>> + int ret; >>> + mfn_t mfn; >>> + >>> + entry = *table + offset; >>> + >>> + if ( !pte_is_valid(*entry) ) >>> + { >>> + if ( alloc_tbl ) >>> + return XEN_TABLE_MAP_FAILED; >> >> Is this condition meant to be inverted? > if alloc_tbl = true we shouldn't allocatetable as: > * The intermediate page table shouldn't be allocated when MFN > isn't > * valid and we are not populating page table. > ... > */ Well, no. The variable name really shouldn't be the opposite of what is meant. "alloc_tbl" can only possibly mean "allocate a table if none is there". I can't think of a sensible interpretation in the inverted sense. I'm curious how you mean to interpret that variable name. > bool alloc_tbl = mfn_eq(mfn, INVALID_MFN) && !(flags & > PTE_POPULATE); > > So if mfn = INVALID_MFN and flags.PTE_POPULATE=0 it means that this > table shouldn't be allocated and thereby pt_next_level() should return > XEN_TABLE_MAP_FAILED. > > Or to invert if ( alloc_tbl )it will be needed to invert defintion of > alloc_tbl: > bool alloc_tbl = !mfn_eq(mfn, INVALID_MFN) || (flags & PTE_POPULATE); Yes, as I did comment further down. >>> + if ( level != target ) >>> + { >>> + printk("%s: Shattering superpage is not supported\n", >>> __func__); >>> + rc = -EOPNOTSUPP; >>> + goto out; >>> + } >>> + >>> + entry = table + offsets[level]; >>> + >>> + rc = -EINVAL; >>> + if ( !pt_check_entry(*entry, mfn, flags) ) >>> + goto out; >>> + >>> + /* We are removing the page */ >>> + if ( !(flags & PTE_VALID) ) >>> + memset(&pte, 0x00, sizeof(pte)); >>> + else >>> + { >>> + /* We are inserting a mapping => Create new pte. */ >>> + if ( !mfn_eq(mfn, INVALID_MFN) ) >>> + pte = pte_from_mfn(mfn, PTE_VALID); >>> + else /* We are updating the permission => Copy the current >>> pte. */ >>> + pte = *entry; >>> + >>> + /* update permission according to the flags */ >>> + pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | PTE_DIRTY; >> >> When updating an entry, don't you also need to clear (some of) the >> flags? > I am not sure why some flags should be cleared. Here we are taking only > necessary for pte flags such as R, W, X or other bits in flags are > ignored. Consider what happens to a PTE with R and X set when a request comes in to change to R/W. You'll end up with R, X, and W all set if you don't first clear the bits that are meant to be changeable in a "modify" operation. >>> +/* Return the level where mapping should be done */ >>> +static int pt_mapping_level(unsigned long vfn, mfn_t mfn, unsigned >>> long nr, >>> + unsigned int flags) >>> +{ >>> + unsigned int level = 0; >>> + unsigned long mask; >>> + unsigned int i; >>> + >>> + /* Use blocking mapping unless the caller requests 4K mapping >>> */ >>> + if ( unlikely(flags & PTE_SMALL) ) >>> + return level; >>> + >>> + /* >>> + * Don't take into account the MFN when removing mapping (i.e >>> + * MFN_INVALID) to calculate the correct target order. >>> + * >>> + * `vfn` and `mfn` must be both superpage aligned. >>> + * They are or-ed together and then checked against the size >>> of >>> + * each level. >>> + * >>> + * `left` is not included and checked separately to allow >>> + * superpage mapping even if it is not properly aligned (the >>> + * user may have asked to map 2MB + 4k). >> >> What is this about? There's nothing named "left" here. > It refer to "remaining" pages or "leftover" space after trying to align > a mapping to a superpage boundary. What what is the quoted "left" here? Such a variable appears to exist in the caller, but using the name here is lacking context. >>> +/* >>> + * If `mfn` equals `INVALID_MFN`, it indicates that the following >>> page table >>> + * update operation might be related to either populating the >>> table ( >>> + * PTE_POPULATE will be set additionaly), destroying a mapping, or >>> modifying >>> + * an existing mapping. >> >> And the latter two are distinguished by? PTE_VALID? > inserting -> (PTE_VALID=1 + (mfn=something valid)) > destroying-> ( PTE_VALID=0 ) Which then needs saying in the comment. >>> + * If `mfn` is valid and flags has PTE_VALID bit set then it means >>> that >>> + * inserting will be done. >>> + */ >> >> What about mfn != INVALID_MFN and PTE_VALID clear? > PTE_VALID=0 will be always considered as destroying and no matter what > is mfn value as in this case the removing is done in the way where mfn > isn't used: Right, yet elsewhere you're restrictive as to MFN values valid to use. Not requiring INVALID_MFN here looks inconsistent to me. > memset(&pte, 0x00, sizeof(pte)); Just to mention it: I don't think memset() is a very good way of clearing a PTE, even if right here it's not a live one. >> Also note that "`mfn` is >> valid" isn't the same as "mfn != INVALID_MFN". You want to be precise >> here, >> to avoid confusion later on. (I say that knowing that we're still >> fighting >> especially shadow paging code on x86 not having those properly >> separated.) > If it is needed to be precise and mfn is valid isn't the same as "mfn > != INVALID_MFN" only for the case of shadow paging? No, I used shadow paging only as an example of where we have similar issues. I'd like to avoid that a new port starts out with introducing more instances of that. You want to properly separate INVALID_MFN from "invalid MFN", where the latter means any MFN where either nothing exists at all, or (see mfn_valid()) where no struct page_info exists. Jan
On Thu, 2024-08-29 at 09:01 +0200, Jan Beulich wrote: > On 28.08.2024 18:11, oleksii.kurochko@gmail.com wrote: > > On Tue, 2024-08-27 at 17:00 +0200, Jan Beulich wrote: > > > On 21.08.2024 18:06, Oleksii Kurochko wrote: > > > > Implement map_pages_to_xen() which requires several > > > > functions to manage page tables and entries: > > > > - pt_update() > > > > - pt_mapping_level() > > > > - pt_update_entry() > > > > - pt_next_level() > > > > - pt_check_entry() > > > > > > > > To support these operations, add functions for creating, > > > > mapping, and unmapping Xen tables: > > > > - create_table() > > > > - map_table() > > > > - unmap_table() > > > > > > > > Introduce internal macros starting with PTE_* for convenience. > > > > These macros closely resemble PTE bits, with the exception of > > > > PTE_SMALL, which indicates that 4KB is needed. > > > > > > What macros are you talking about here? Is this partially stale, > > > as > > > only PTE_SMALL and PTE_POPULATE (and a couple of masks) are being > > > added? > > I am speaking about macros connected to masks: > > #define PTE_R_MASK(x) ((x) & PTE_READABLE) > > #define PTE_W_MASK(x) ((x) & PTE_WRITABLE) > > #define PTE_X_MASK(x) ((x) & PTE_EXECUTABLE) > > > > #define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE | > > PTE_EXECUTABLE)) > > Some of which is did question further down in my reply. But what's > worse - by saying "closely resemble PTE bits, with the exception of > PTE_SMALL" you pretty clearly _do not_ refer to the macros above, but > to PTE_VALID etc. Agree, it should be corrected. > > > > > @@ -68,6 +111,20 @@ static inline bool pte_is_valid(pte_t p) > > > > return p.pte & PTE_VALID; > > > > } > > > > > > > > +inline bool pte_is_table(const pte_t p) > > > > +{ > > > > + return ((p.pte & (PTE_VALID | > > > > + PTE_READABLE | > > > > + PTE_WRITABLE | > > > > + PTE_EXECUTABLE)) == PTE_VALID); > > > > +} > > > > > > In how far is the READABLE check valid here? You (imo correctly) > > > ... > > Oh, I wrongly picked on READABLE when it should have been the > WRITABLE > bit. > > > > > +static inline bool pte_is_mapping(const pte_t p) > > > > +{ > > > > + return (p.pte & PTE_VALID) && > > > > + (p.pte & (PTE_WRITABLE | PTE_EXECUTABLE)); > > > > +} > > > > > > ... don't consider this bit here. > > pte_is_mapping() seems to me is correct as according to the RISC-V > > privileged spec: > > 4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to > > step > > 5. Otherwise, this PTE is a pointer to the next level of the > > page > > table. > > 5. A leaf PTE has been found. ... > > Right. And then why do you check all three of r, x, and w, when the > doc > mentions only r and x? There may be reasons, but such reasons then > need > clearly stating in a code comment, for people to understand why the > code > is not following the spec. So I remembered why R, W, and X are checked. There is contradictory information about these bits (https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1317C64-L1321C10 ): ``` The permission bits, R, W, and X, indicate whether the page is readable, writable, and executable, respectively. When all three are zero, the PTE is a pointer to the next level of the page table; otherwise, it is a leaf PTE. ``` However, it is also written here (https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1539 ) that only pte.r and pte.x should be checked. I can assume that the interpretation that R=W=X=0 indicates a pointer to the next level of the page table could come from this statement (https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1538 ): ``` If _pte_._v_ = 0, or if _pte_._r_ = 0 and _pte_._w_ = 1, or if any bits or encodings that are reserved for future standard use are set within _pte_, stop and raise a page-fault exception corresponding to the original access type. ``` From this, I can assume that when pte.r = 0, pte.w should also always be zero; otherwise, a page-fault exception will be raised. ( but it is no obviously connected to if the PTE is a pointer to the next page table or not... ). > > > and regarding pte_is_table() READABLE check is valid as we have to > > check only that pte.r = pte.x = 0. WRITABLE check should be > > dropped. Or > > just use define pte_is_table() as: > > inline bool pte_is_table(const pte_t p) > > { > > return !pte_is_mapping(p); > > } > > You had it like this earlier on, didn't you? That's wrong, because > for a > PTE to describe another page table level PTE_VALID needs to be set. Agree, it's wrong, missed that. > > > > +#define XEN_TABLE_MAP_FAILED 0 > > > > +#define XEN_TABLE_SUPER_PAGE 1 > > > > +#define XEN_TABLE_NORMAL 2 > > > > + > > > > +/* > > > > + * Take the currently mapped table, find the corresponding > > > > entry, > > > > + * and map the next table, if available. > > > > + * > > > > + * The alloc_tbl parameters indicates whether intermediate > > > > tables > > > > should > > > > + * be allocated when not present. > > > > + * > > > > + * Return values: > > > > + * XEN_TABLE_MAP_FAILED: Either alloc_only was set and the > > > > entry > > > > + * was empty, or allocating a new page failed. > > > > + * XEN_TABLE_NORMAL: next level or leaf mapped normally > > > > + * XEN_TABLE_SUPER_PAGE: The next entry points to a > > > > superpage. > > > > + */ > > > > +static int pt_next_level(bool alloc_tbl, pte_t **table, > > > > unsigned > > > > int offset) > > > > > > Having the boolean first is unusual, but well - it's your choice. > > > > > > > +{ > > > > + pte_t *entry; > > > > + int ret; > > > > + mfn_t mfn; > > > > + > > > > + entry = *table + offset; > > > > + > > > > + if ( !pte_is_valid(*entry) ) > > > > + { > > > > + if ( alloc_tbl ) > > > > + return XEN_TABLE_MAP_FAILED; > > > > > > Is this condition meant to be inverted? > > if alloc_tbl = true we shouldn't allocatetable as: > > * The intermediate page table shouldn't be allocated when MFN > > isn't > > * valid and we are not populating page table. > > ... > > */ > > Well, no. The variable name really shouldn't be the opposite of what > is > meant. "alloc_tbl" can only possibly mean "allocate a table if none > is > there". I can't think of a sensible interpretation in the inverted > sense. > I'm curious how you mean to interpret that variable name. My interpretation was that alloc_tbl = true means that algorithm is trying to allocate the table what is forbidden at the moment but I agree that your interpretation sounds more understandable based on the variable name. > > > bool alloc_tbl = mfn_eq(mfn, INVALID_MFN) && !(flags & > > PTE_POPULATE); > > > > So if mfn = INVALID_MFN and flags.PTE_POPULATE=0 it means that this > > table shouldn't be allocated and thereby pt_next_level() should > > return > > XEN_TABLE_MAP_FAILED. > > > > Or to invert if ( alloc_tbl )it will be needed to invert defintion > > of > > alloc_tbl: > > bool alloc_tbl = !mfn_eq(mfn, INVALID_MFN) || (flags & > > PTE_POPULATE); > > Yes, as I did comment further down. > > > > > + if ( level != target ) > > > > + { > > > > + printk("%s: Shattering superpage is not supported\n", > > > > __func__); > > > > + rc = -EOPNOTSUPP; > > > > + goto out; > > > > + } > > > > + > > > > + entry = table + offsets[level]; > > > > + > > > > + rc = -EINVAL; > > > > + if ( !pt_check_entry(*entry, mfn, flags) ) > > > > + goto out; > > > > + > > > > + /* We are removing the page */ > > > > + if ( !(flags & PTE_VALID) ) > > > > + memset(&pte, 0x00, sizeof(pte)); > > > > + else > > > > + { > > > > + /* We are inserting a mapping => Create new pte. */ > > > > + if ( !mfn_eq(mfn, INVALID_MFN) ) > > > > + pte = pte_from_mfn(mfn, PTE_VALID); > > > > + else /* We are updating the permission => Copy the > > > > current > > > > pte. */ > > > > + pte = *entry; > > > > + > > > > + /* update permission according to the flags */ > > > > + pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | > > > > PTE_DIRTY; > > > > > > When updating an entry, don't you also need to clear (some of) > > > the > > > flags? > > I am not sure why some flags should be cleared. Here we are taking > > only > > necessary for pte flags such as R, W, X or other bits in flags are > > ignored. > > Consider what happens to a PTE with R and X set when a request comes > in > to change to R/W. You'll end up with R, X, and W all set if you don't > first clear the bits that are meant to be changeable in a "modify" > operation. That's definitely going to be a problem. I'll update the code then. > > > > > +/* Return the level where mapping should be done */ > > > > +static int pt_mapping_level(unsigned long vfn, mfn_t mfn, > > > > unsigned > > > > long nr, > > > > + unsigned int flags) > > > > +{ > > > > + unsigned int level = 0; > > > > + unsigned long mask; > > > > + unsigned int i; > > > > + > > > > + /* Use blocking mapping unless the caller requests 4K > > > > mapping > > > > */ > > > > + if ( unlikely(flags & PTE_SMALL) ) > > > > + return level; > > > > + > > > > + /* > > > > + * Don't take into account the MFN when removing mapping > > > > (i.e > > > > + * MFN_INVALID) to calculate the correct target order. > > > > + * > > > > + * `vfn` and `mfn` must be both superpage aligned. > > > > + * They are or-ed together and then checked against the > > > > size > > > > of > > > > + * each level. > > > > + * > > > > + * `left` is not included and checked separately to allow > > > > + * superpage mapping even if it is not properly aligned > > > > (the > > > > + * user may have asked to map 2MB + 4k). > > > > > > What is this about? There's nothing named "left" here. > > It refer to "remaining" pages or "leftover" space after trying to > > align > > a mapping to a superpage boundary. > > What what is the quoted "left" here? Such a variable appears to exist > in > the caller, but using the name here is lacking context. Then I will update the comment and tell from where 'left' is coming. > > > > > > + * If `mfn` is valid and flags has PTE_VALID bit set then it > > > > means > > > > that > > > > + * inserting will be done. > > > > + */ > > > > > > What about mfn != INVALID_MFN and PTE_VALID clear? > > PTE_VALID=0 will be always considered as destroying and no matter > > what > > is mfn value as in this case the removing is done in the way where > > mfn > > isn't used: > > Right, yet elsewhere you're restrictive as to MFN values valid to > use. > Not requiring INVALID_MFN here looks inconsistent to me. but actually if we will leave ASSERT in pt_check_entry() we will be sure that we are here with mfn = INVALID_MFN: /* Sanity check when removing a mapping. */ else if ( (flags & (PTE_VALID | PTE_POPULATE)) == 0 ) { /* We should be here with an invalid MFN. */ ASSERT(mfn_eq(mfn, INVALID_MFN)); > > > memset(&pte, 0x00, sizeof(pte)); > > Just to mention it: I don't think memset() is a very good way of > clearing > a PTE, even if right here it's not a live one. Just direct assigning would be better? > > > > Also note that "`mfn` is > > > valid" isn't the same as "mfn != INVALID_MFN". You want to be > > > precise > > > here, > > > to avoid confusion later on. (I say that knowing that we're still > > > fighting > > > especially shadow paging code on x86 not having those properly > > > separated.) > > If it is needed to be precise and mfn is valid isn't the same as > > "mfn > > != INVALID_MFN" only for the case of shadow paging? > > No, I used shadow paging only as an example of where we have similar > issues. I'd like to avoid that a new port starts out with introducing > more instances of that. You want to properly separate INVALID_MFN > from > "invalid MFN", where the latter means any MFN where either nothing > exists at all, or (see mfn_valid()) where no struct page_info exists. Well, now I think I understand the difference between "INVALID_MFN" and "invalid MFN." Referring back to your original reply, I need to update the comment above pt_update(): ``` ... * If `mfn` is valid ( exist ) and flags has PTE_VALID bit set then it means that inserting will be done. ``` Would this be correct and more precise? Based on the code for mfn_valid(), the separation is currently done using the max_page value, which can't be initialized at the moment as it requires reading the device tree file to obtain the RAM end. We could use a placeholder for the RAM end (for example, a very high value like -1UL) and then add __mfn_valid() within pt_update(). However, I'm not sure if this approach aligns with what you consider by proper separation between INVALID_MFN and "invalid MFN." ~ Oleksii
On 29.08.2024 14:04, oleksii.kurochko@gmail.com wrote: > On Thu, 2024-08-29 at 09:01 +0200, Jan Beulich wrote: >> On 28.08.2024 18:11, oleksii.kurochko@gmail.com wrote: >>> On Tue, 2024-08-27 at 17:00 +0200, Jan Beulich wrote: >>>> On 21.08.2024 18:06, Oleksii Kurochko wrote: >>>>> @@ -68,6 +111,20 @@ static inline bool pte_is_valid(pte_t p) >>>>> return p.pte & PTE_VALID; >>>>> } >>>>> >>>>> +inline bool pte_is_table(const pte_t p) >>>>> +{ >>>>> + return ((p.pte & (PTE_VALID | >>>>> + PTE_READABLE | >>>>> + PTE_WRITABLE | >>>>> + PTE_EXECUTABLE)) == PTE_VALID); >>>>> +} >>>> >>>> In how far is the READABLE check valid here? You (imo correctly) >>>> ... >> >> Oh, I wrongly picked on READABLE when it should have been the >> WRITABLE >> bit. >> >>>>> +static inline bool pte_is_mapping(const pte_t p) >>>>> +{ >>>>> + return (p.pte & PTE_VALID) && >>>>> + (p.pte & (PTE_WRITABLE | PTE_EXECUTABLE)); >>>>> +} >>>> >>>> ... don't consider this bit here. >>> pte_is_mapping() seems to me is correct as according to the RISC-V >>> privileged spec: >>> 4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to >>> step >>> 5. Otherwise, this PTE is a pointer to the next level of the >>> page >>> table. >>> 5. A leaf PTE has been found. ... >> >> Right. And then why do you check all three of r, x, and w, when the >> doc >> mentions only r and x? There may be reasons, but such reasons then >> need >> clearly stating in a code comment, for people to understand why the >> code >> is not following the spec. > So I remembered why R, W, and X are checked. There is contradictory > information about these bits > (https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1317C64-L1321C10 > ): > ``` > The permission bits, R, W, and X, indicate whether the page is > readable, writable, and executable, respectively. When all three are > zero, the PTE is a pointer to the next level of the page table; > otherwise, it is a leaf PTE. > ``` > > However, it is also written here > (https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1539 > ) that only pte.r and pte.x should be checked. > > I can assume that the interpretation that R=W=X=0 indicates a pointer > to the next level of the page table could come from this statement > (https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1538 > ): > ``` > If _pte_._v_ = 0, or if _pte_._r_ = 0 and _pte_._w_ = 1, or if any bits > or encodings that are reserved for future standard use are set within > _pte_, stop and raise a page-fault exception corresponding to the > original access type. > ``` > From this, I can assume that when pte.r = 0, pte.w should also always > be zero; otherwise, a page-fault exception will be raised. ( but it is > no obviously connected to if the PTE is a pointer to the next page > table or not... ). I don't view the information provided as contradictory, especially when further taking the "Encoding of PTE R/W/X fields" table into account: W set but the other two clear is "Reserved for future use." >>>>> + * If `mfn` is valid and flags has PTE_VALID bit set then it >>>>> means >>>>> that >>>>> + * inserting will be done. >>>>> + */ >>>> >>>> What about mfn != INVALID_MFN and PTE_VALID clear? >>> PTE_VALID=0 will be always considered as destroying and no matter >>> what >>> is mfn value as in this case the removing is done in the way where >>> mfn >>> isn't used: >> >> Right, yet elsewhere you're restrictive as to MFN values valid to >> use. >> Not requiring INVALID_MFN here looks inconsistent to me. > but actually if we will leave ASSERT in pt_check_entry() we will be > sure that we are here with mfn = INVALID_MFN: > /* Sanity check when removing a mapping. */ > else if ( (flags & (PTE_VALID | PTE_POPULATE)) == 0 ) > { > /* We should be here with an invalid MFN. */ > ASSERT(mfn_eq(mfn, INVALID_MFN)); Having such an assertion there is fine, but doesn't save you from getting comments correct / complete. >>> memset(&pte, 0x00, sizeof(pte)); >> >> Just to mention it: I don't think memset() is a very good way of >> clearing >> a PTE, even if right here it's not a live one. > Just direct assigning would be better? Imo yes. >>>> Also note that "`mfn` is >>>> valid" isn't the same as "mfn != INVALID_MFN". You want to be >>>> precise >>>> here, >>>> to avoid confusion later on. (I say that knowing that we're still >>>> fighting >>>> especially shadow paging code on x86 not having those properly >>>> separated.) >>> If it is needed to be precise and mfn is valid isn't the same as >>> "mfn >>> != INVALID_MFN" only for the case of shadow paging? >> >> No, I used shadow paging only as an example of where we have similar >> issues. I'd like to avoid that a new port starts out with introducing >> more instances of that. You want to properly separate INVALID_MFN >> from >> "invalid MFN", where the latter means any MFN where either nothing >> exists at all, or (see mfn_valid()) where no struct page_info exists. > Well, now I think I understand the difference between "INVALID_MFN" and > "invalid MFN." > > Referring back to your original reply, I need to update the comment > above pt_update(): > ``` > ... > * If `mfn` is valid ( exist ) and flags has PTE_VALID bit set then it > means that inserting will be done. > ``` > Would this be correct and more precise? That depends on whether it correctly describes what the code does. If the code continues to check against INVALID_MFN, such a description wouldn't be correct. Also, just to re-iterate, ... > Based on the code for mfn_valid(), the separation is currently done > using the max_page value, which can't be initialized at the moment as > it requires reading the device tree file to obtain the RAM end. ... mfn_valid() may return false for MMIO pages, for which it may still be legitimate to create mappings. IMO ... > We could use a placeholder for the RAM end (for example, a very high > value like -1UL) and then add __mfn_valid() within pt_update(). > However, I'm not sure if this approach aligns with what you consider by > proper separation between INVALID_MFN and "invalid MFN." ... throughout the code here you mean INVALID_MFN and never "invalid MFN". Populating page tables is lower a layer than where you want to be concerned with that distinction; the callers of these low level functions will need to make the distinction where necessary. Jan
On Thu, 2024-08-29 at 14:14 +0200, Jan Beulich wrote: > > > > > Also note that "`mfn` is > > > > > valid" isn't the same as "mfn != INVALID_MFN". You want to be > > > > > precise > > > > > here, > > > > > to avoid confusion later on. (I say that knowing that we're > > > > > still > > > > > fighting > > > > > especially shadow paging code on x86 not having those > > > > > properly > > > > > separated.) > > > > If it is needed to be precise and mfn is valid isn't the same > > > > as > > > > "mfn > > > > != INVALID_MFN" only for the case of shadow paging? > > > > > > No, I used shadow paging only as an example of where we have > > > similar > > > issues. I'd like to avoid that a new port starts out with > > > introducing > > > more instances of that. You want to properly separate INVALID_MFN > > > from > > > "invalid MFN", where the latter means any MFN where either > > > nothing > > > exists at all, or (see mfn_valid()) where no struct page_info > > > exists. > > Well, now I think I understand the difference between "INVALID_MFN" > > and > > "invalid MFN." > > > > Referring back to your original reply, I need to update the comment > > above pt_update(): > > ``` > > ... > > * If `mfn` is valid ( exist ) and flags has PTE_VALID bit set > > then it > > means that inserting will be done. > > ``` > > Would this be correct and more precise? > > That depends on whether it correctly describes what the code does. If > the code continues to check against INVALID_MFN, such a description > wouldn't be correct. Also, just to re-iterate, ... > > > Based on the code for mfn_valid(), the separation is currently done > > using the max_page value, which can't be initialized at the moment > > as > > it requires reading the device tree file to obtain the RAM end. > > ... mfn_valid() may return false for MMIO pages, for which it may > still > be legitimate to create mappings. IMO ... > > > We could use a placeholder for the RAM end (for example, a very > > high > > value like -1UL) and then add __mfn_valid() within pt_update(). > > However, I'm not sure if this approach aligns with what you > > consider by > > proper separation between INVALID_MFN and "invalid MFN." > > ... throughout the code here you mean INVALID_MFN and never "invalid > MFN". IIC INVALID_MFN should mean that mfn exist ( correspond to some usable memory range of memory map ) but hasn't been mapped yet. Then for me what I have in the comment seems correct to me: ``` if `mfn` isn't equal to INVALID_MFN ( so it is valid/exist in terms that there is real memory range in memory map to which this mfn correspond ) and flags PTE_VALID bit set ... ``` > Populating page tables is lower a layer than where you want to be > concerned with that distinction; the callers of these low level > functions > will need to make the distinction where necessary. Then the question now is just in a proper wording of the pt_update() arguments values? ~ Oleksii
On 29.08.2024 16:42, oleksii.kurochko@gmail.com wrote: > On Thu, 2024-08-29 at 14:14 +0200, Jan Beulich wrote: >>>>>> Also note that "`mfn` is >>>>>> valid" isn't the same as "mfn != INVALID_MFN". You want to be >>>>>> precise >>>>>> here, >>>>>> to avoid confusion later on. (I say that knowing that we're >>>>>> still >>>>>> fighting >>>>>> especially shadow paging code on x86 not having those >>>>>> properly >>>>>> separated.) >>>>> If it is needed to be precise and mfn is valid isn't the same >>>>> as >>>>> "mfn >>>>> != INVALID_MFN" only for the case of shadow paging? >>>> >>>> No, I used shadow paging only as an example of where we have >>>> similar >>>> issues. I'd like to avoid that a new port starts out with >>>> introducing >>>> more instances of that. You want to properly separate INVALID_MFN >>>> from >>>> "invalid MFN", where the latter means any MFN where either >>>> nothing >>>> exists at all, or (see mfn_valid()) where no struct page_info >>>> exists. >>> Well, now I think I understand the difference between "INVALID_MFN" >>> and >>> "invalid MFN." >>> >>> Referring back to your original reply, I need to update the comment >>> above pt_update(): >>> ``` >>> ... >>> * If `mfn` is valid ( exist ) and flags has PTE_VALID bit set >>> then it >>> means that inserting will be done. >>> ``` >>> Would this be correct and more precise? >> >> That depends on whether it correctly describes what the code does. If >> the code continues to check against INVALID_MFN, such a description >> wouldn't be correct. Also, just to re-iterate, ... >> >>> Based on the code for mfn_valid(), the separation is currently done >>> using the max_page value, which can't be initialized at the moment >>> as >>> it requires reading the device tree file to obtain the RAM end. >> >> ... mfn_valid() may return false for MMIO pages, for which it may >> still >> be legitimate to create mappings. IMO ... >> >>> We could use a placeholder for the RAM end (for example, a very >>> high >>> value like -1UL) and then add __mfn_valid() within pt_update(). >>> However, I'm not sure if this approach aligns with what you >>> consider by >>> proper separation between INVALID_MFN and "invalid MFN." >> >> ... throughout the code here you mean INVALID_MFN and never "invalid >> MFN". > IIC INVALID_MFN should mean that mfn exist ( correspond to some usable > memory range of memory map ) but hasn't been mapped yet. Then for me > what I have in the comment seems correct to me: > ``` > if `mfn` isn't equal to INVALID_MFN ( so it is valid/exist in terms > that there is real memory range in memory map to which this mfn > correspond ) and flags PTE_VALID bit set ... > ``` Not really, no, as said ... >> Populating page tables is lower a layer than where you want to be >> concerned with that distinction; the callers of these low level >> functions >> will need to make the distinction where necessary. ... here. At this level I think you want to consider only INVALID_MFN, and for anything else you're simply not concerned what the MFN provided points at[1]. Specifically, said said before, it may point at an MMIO range which may not be in the memory map (a PCI device BAR for example). Jan [1] One thing that could be checked at this layer is the number of significant MFN bits, in case there were hardware setups in which you know that not the full width is permitted that the PTE has room for. No idea whether such exists in the RISC-V world.
© 2016 - 2024 Red Hat, Inc.