[PATCH v3 1/3] xen/riscv: introduce setup_mm()

Oleksii Kurochko posted 3 patches 2 weeks, 6 days ago
There is a newer version of this series
[PATCH v3 1/3] xen/riscv: introduce setup_mm()
Posted by Oleksii Kurochko 2 weeks, 6 days ago
Introduce the implementation of setup_mm(), which includes:
1. Adding all free regions to the boot allocator, as memory is needed
   to allocate page tables used for frame table mapping.
2. Calculating RAM size and the RAM end address.
3. Setting up direct map mappings from each RAM bank and initialize
   directmap_virt_start to keep simple VA <-> PA translation and map
   DIRECTMAP_VIRT_START with some offset ( if RAM start isn't properly
   aligned ) to RAM start to be properly aligned with RAM
   start to use more superpages to reduce pressure on the TLB
4. Setting up frame table mappings from physical address 0 to ram_end
   to simplify mfn_to_page() and page_to_mfn() conversions.
5. Setting up total_pages and max_page.

Update virt_to_maddr() to use introduced directmap_virt_start.

Implement maddr_to_virt() function to convert a machine address
to a virtual address. This function is specifically designed to be used
only for the DIRECTMAP region, so a check has been added to ensure that
the address does not exceed DIRECTMAP_SIZE.

After the introduction of maddr_to_virt() the following linkage error starts
to occur and to avoid it share_xen_page_with_guest() stub is added:
  riscv64-linux-gnu-ld: prelink.o: in function `tasklet_kill':
  /build/xen/common/tasklet.c:176: undefined reference to
     `share_xen_page_with_guest'
  riscv64-linux-gnu-ld: ./.xen-syms.0: hidden symbol `share_xen_page_with_guest'
    isn't defined riscv64-linux-gnu-ld: final link failed: bad value

Despite the linkger fingering tasklet.c, it's trace.o which has the undefined
refenrece:
  $ find . -name \*.o | while read F; do nm $F | grep share_xen_page_with_guest &&
    echo $F; done
                     U share_xen_page_with_guest
    ./xen/common/built_in.o
                     U share_xen_page_with_guest
    ./xen/common/trace.o
                     U share_xen_page_with_guest
    ./xen/prelink.o

Looking at trace.i, there is call of share_xen_page_with_guest() but in case of
when maddr_to_virt() is defined as stub ("BUG_ON(); return NULL;") DCE happens and
the code is just eliminated.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V3:
  - Update the comment the part where DCE should be mentioned and directmap-related
    things are touched.
  - Move ROUNDDOWN to <xen/macros.h>.
  - s/sizeof(struct page_info)/sizeof(*frame_table) in setup_frametable_mapping().
  - Updates in setup_frametable_mapping():
    - align phys_start (ps) up to a page boundary and align phys_end (pe) down
      to a page boundary.
    - Update panic message.
    - Add the comment about alignment of frametable_size and base_mfn.
    - invalidate all frame_table entries and then just init with 0 only valid ones.
  - Double blank lines removing.
  - Initialize directmap_virt_start with DIRECTMAP_VIRT_START.
  - Updates in setup_directmap_mapping():
    - Drop local variable rc as it is used only once.
    - Move directmap_mfn_start definition to setup_directmap_mapping() and
      change __ro_after_init to __initdata.
    - Update the commit message about alignment of directmap_virt_start.
    - Move down directmap_virt_start for (base_addr & ~XEN_PT_LEVEL_SIZE(HYP_PT_ROOT_LEVEL))
      to not waste a lot of directmap space.
    - Map DIRECTMAP_VIRT_START + (base_addr & XEN_PT_LEVEL_SIZE(HYP_PT_ROOT_LEVEL))
      to _mfn(base_mfn).
    - Add log of the range in case directmap mapping failure.
  - Drop XENHEAP_VIRT_START and use directmap_virt_start instead.
  - Update the comment above setup_mm().
  - Update the calculation of bank_start and bank_end in setup_mm() to cover
    the case when a bank doesn't cover full pages.
  - Move share_xen_page_with_guest() to riscv/mm.c instead of riscv/stub.c.
  - Drop inclusion of <xen/pdx.h> in riscv/mm.c as thery is nothing used
    anymore in riscv/mm.c.
  - Move variable directmap_virt_start and setup_mm() outside
    #ifndef CONFIG_RISCV_32 ... #endif as they are likely to be common.
---
 Changes in V2:
  - merge patch 2 ( xen/riscv: implement maddr_to_virt() ) to the current one
    as maddr_to_virt() started to use the thing which are introduced in the
    current patch.
  - merge with patch 1 ( xen/riscv: add stub for share_xen_page_with_guest() )
    as this linkage issue happens during introduction of maddr_to_virt().
  - use mathematical range expressions for log messages.
  - calculate properly amount of mfns in setup_frametable_mapping() taking into
    account that ps and pe can be not properly aligned.
  - drop full stop at the end of debug message.
  - use PFN_DOWN(framsetable_size) instead of frametable_size >> PAGE_SHIFT.
  - round down ram_size when it is being accumulated in setup_mm() to guarantee
    that banks can never have partial pages at their start/end.
  - call setup_directmap_mappings() only for ram bank regions instead of
    mapping [0, ram_end] region.
  - drop directmap_virt_end for now as it isn't used at the moment.
  - update the commit message.
---
 xen/arch/riscv/include/asm/mm.h    |  13 ++-
 xen/arch/riscv/include/asm/setup.h |   2 +
 xen/arch/riscv/mm.c                | 141 +++++++++++++++++++++++++++++
 xen/arch/riscv/setup.c             |   3 +
 xen/include/xen/macros.h           |   1 +
 5 files changed, 156 insertions(+), 4 deletions(-)

diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index ebb142502e..50d2ac3830 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -12,6 +12,8 @@
 
 #include <asm/page-bits.h>
 
+extern vaddr_t directmap_virt_start;
+
 #define pfn_to_paddr(pfn) ((paddr_t)(pfn) << PAGE_SHIFT)
 #define paddr_to_pfn(pa)  ((unsigned long)((pa) >> PAGE_SHIFT))
 
@@ -25,8 +27,11 @@
 
 static inline void *maddr_to_virt(paddr_t ma)
 {
-    BUG_ON("unimplemented");
-    return NULL;
+    unsigned long va_offset = maddr_to_directmapoff(ma);
+
+    ASSERT(va_offset < DIRECTMAP_SIZE);
+
+    return (void *)(directmap_virt_start + va_offset);
 }
 
 /*
@@ -37,9 +42,9 @@ static inline void *maddr_to_virt(paddr_t ma)
  */
 static inline unsigned long virt_to_maddr(unsigned long va)
 {
-    if ((va >= DIRECTMAP_VIRT_START) &&
+    if ((va >= directmap_virt_start) &&
         (va < (DIRECTMAP_VIRT_START + DIRECTMAP_SIZE)))
-        return directmapoff_to_maddr(va - DIRECTMAP_VIRT_START);
+        return directmapoff_to_maddr(va - directmap_virt_start);
 
     BUILD_BUG_ON(XEN_VIRT_SIZE != MB(2));
     ASSERT((va >> (PAGETABLE_ORDER + PAGE_SHIFT)) ==
diff --git a/xen/arch/riscv/include/asm/setup.h b/xen/arch/riscv/include/asm/setup.h
index c0214a9bf2..844a2f0ef1 100644
--- a/xen/arch/riscv/include/asm/setup.h
+++ b/xen/arch/riscv/include/asm/setup.h
@@ -5,6 +5,8 @@
 
 #define max_init_domid (0)
 
+void setup_mm(void);
+
 #endif /* ASM__RISCV__SETUP_H */
 
 /*
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index 27026d803b..185008f4c6 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -372,6 +372,12 @@ int destroy_xen_mappings(unsigned long s, unsigned long e)
     return -1;
 }
 
+void share_xen_page_with_guest(struct page_info *page, struct domain *d,
+                               enum XENSHARE_flags flags)
+{
+    BUG_ON("unimplemented");
+}
+
 void * __init early_fdt_map(paddr_t fdt_paddr)
 {
     /* We are using 2MB superpage for mapping the FDT */
@@ -423,3 +429,138 @@ void * __init early_fdt_map(paddr_t fdt_paddr)
 
     return fdt_virt;
 }
+
+vaddr_t __ro_after_init directmap_virt_start = DIRECTMAP_VIRT_START;
+
+#ifndef CONFIG_RISCV_32
+
+/* Map a frame table to cover physical addresses ps through pe */
+static void __init setup_frametable_mappings(paddr_t ps, paddr_t pe)
+{
+    paddr_t aligned_ps = ROUNDUP(ps, PAGE_SIZE);
+    paddr_t aligned_pe = ROUNDDOWN(pe, PAGE_SIZE);
+    unsigned long nr_mfns = PFN_DOWN(aligned_pe - aligned_ps);
+    unsigned long frametable_size = nr_mfns * sizeof(*frame_table);
+    mfn_t base_mfn;
+
+    if ( frametable_size > FRAMETABLE_SIZE )
+        panic("The frametable cannot cover [%#"PRIpaddr", %#"PRIpaddr")\n",
+              ps, pe);
+
+    /*
+     * align base_mfn and frametable_size to MB(2) to have superpage mapping
+     * in map_pages_to_xen()
+     */
+    frametable_size = ROUNDUP(frametable_size, MB(2));
+    base_mfn = alloc_boot_pages(frametable_size >> PAGE_SHIFT, PFN_DOWN(MB(2)));
+
+    if ( map_pages_to_xen(FRAMETABLE_VIRT_START, base_mfn,
+                          PFN_DOWN(frametable_size),
+                          PAGE_HYPERVISOR_RW) )
+        panic("frametable mappings failed: %#lx -> %#lx\n",
+              FRAMETABLE_VIRT_START, mfn_x(base_mfn));
+
+    memset(&frame_table[0], -1, frametable_size);
+    memset(&frame_table[PFN_DOWN(aligned_ps)],
+           0, nr_mfns * sizeof(*frame_table));
+}
+
+/* Map the region in the directmap area. */
+static void __init setup_directmap_mappings(unsigned long base_mfn,
+                                            unsigned long nr_mfns)
+{
+    static mfn_t __initdata directmap_mfn_start = INVALID_MFN_INITIALIZER;
+
+    unsigned long base_addr = mfn_to_maddr(_mfn(base_mfn));
+    unsigned long high_bits_mask = XEN_PT_LEVEL_MAP_MASK(HYP_PT_ROOT_LEVEL);
+
+    /* First call sets the directmap physical and virtual offset. */
+    if ( mfn_eq(directmap_mfn_start, INVALID_MFN) )
+    {
+        directmap_mfn_start = _mfn(base_mfn);
+
+        directmap_virt_start -= (base_addr & high_bits_mask);
+    }
+
+    if ( base_mfn < mfn_x(directmap_mfn_start) )
+        panic("cannot add directmap mapping at %#lx below heap start %#lx\n",
+              base_mfn, mfn_x(directmap_mfn_start));
+
+    /*
+     * The base address may not be aligned to the second level
+     * size in case of Sv39 (e.g. 1GB when using 4KB pages).
+     * This would prevent superpage mappings for all the regions
+     * because the virtual address and machine address should
+     * both be suitably aligned.
+     *
+     * Prevent that by offsetting the start of the directmap virtual
+     * address.
+     */
+    if ( map_pages_to_xen(DIRECTMAP_VIRT_START + (base_addr & ~high_bits_mask),
+                          _mfn(base_mfn), nr_mfns,
+                          PAGE_HYPERVISOR_RW) )
+        panic("Directmap mappings for [%#"PRIpaddr", %#"PRIpaddr") failed\n",
+              mfn_to_maddr(_mfn(base_mfn)),
+              mfn_to_maddr(_mfn(base_mfn + nr_mfns)));
+}
+
+#else /* CONFIG_RISCV_32 */
+#error setup_{directmap,frametable}_mapping() should be implemented for RV_32
+#endif
+
+/*
+ * Setup memory management
+ *
+ * RISC-V 64 has a large virtual address space (the minimum supported
+ * MMU mode is Sv39, which provides GBs of VA space).
+ *
+ * The directmap_virt_start is shifted lower in the VA space to
+ * (DIRECTMAP_VIRT_START - masked_low_bits_of_ram_start_address) to avoid
+ * wasting a large portion of the directmap space, this also allows for simple
+ * VA <-> PA translations. Also aligns DIRECTMAP_VIRT_START to a GB boundary
+ * (for Sv39; for other MMU mode boundaries will be bigger ) by masking the
+ * higher bits of the RAM start address to enable the use of superpages in
+ * map_pages_to_xen().
+ *
+ * The frametable is mapped starting from physical address 0, minimizing
+ * wasted VA space and simplifying page_to_mfn() and mfn_to_page()
+ * translations.
+ */
+void __init setup_mm(void)
+{
+    const struct membanks *banks = bootinfo_get_mem();
+    paddr_t ram_start = INVALID_PADDR;
+    paddr_t ram_end = 0;
+    paddr_t ram_size = 0;
+    unsigned int i;
+
+    /*
+     * We need some memory to allocate the page-tables used for the directmap
+     * mappings. But some regions may contain memory already allocated
+     * for other uses (e.g. modules, reserved-memory...).
+     *
+     * For simplicity, add all the free regions in the boot allocator.
+     */
+    populate_boot_allocator();
+
+    total_pages = 0;
+
+    for ( i = 0; i < banks->nr_banks; i++ )
+    {
+        const struct membank *bank = &banks->bank[i];
+        paddr_t bank_start = ROUNDUP(bank->start, PAGE_SIZE);
+        paddr_t bank_end = ROUNDDOWN(bank->start + bank->size, PAGE_SIZE);
+        unsigned long bank_size = bank_end - bank_start;
+
+        ram_size += bank_size;
+        ram_start = min(ram_start, bank_start);
+        ram_end = max(ram_end, bank_end);
+
+        setup_directmap_mappings(PFN_DOWN(bank_start), PFN_DOWN(bank_size));
+    }
+
+    total_pages = PFN_DOWN(ram_size);
+
+    setup_frametable_mappings(0, ram_end);
+    max_page = PFN_DOWN(ram_end);
+}
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index e29bd75d7c..2887a18c0c 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -12,6 +12,7 @@
 
 #include <asm/early_printk.h>
 #include <asm/sbi.h>
+#include <asm/setup.h>
 #include <asm/smp.h>
 #include <asm/traps.h>
 
@@ -59,6 +60,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
     printk("Command line: %s\n", cmdline);
     cmdline_parse(cmdline);
 
+    setup_mm();
+
     printk("All set up\n");
 
     machine_halt();
diff --git a/xen/include/xen/macros.h b/xen/include/xen/macros.h
index 19caaa8026..cd528fbdb1 100644
--- a/xen/include/xen/macros.h
+++ b/xen/include/xen/macros.h
@@ -2,6 +2,7 @@
 #define __MACROS_H__
 
 #define ROUNDUP(x, a) (((x) + (a) - 1) & ~((a) - 1))
+#define ROUNDDOWN(x, a) ((x) & ~((a) - 1))
 
 #define IS_ALIGNED(val, align) (!((val) & ((align) - 1)))
 
-- 
2.47.0
Re: [PATCH v3 1/3] xen/riscv: introduce setup_mm()
Posted by Jan Beulich 2 weeks, 2 days ago
On 01.11.2024 14:16, Oleksii Kurochko wrote:> @@ -423,3 +429,138 @@ void * __init early_fdt_map(paddr_t fdt_paddr)
>  
>      return fdt_virt;
>  }
> +
> +vaddr_t __ro_after_init directmap_virt_start = DIRECTMAP_VIRT_START;
> +
> +#ifndef CONFIG_RISCV_32
> +
> +/* Map a frame table to cover physical addresses ps through pe */
> +static void __init setup_frametable_mappings(paddr_t ps, paddr_t pe)
> +{
> +    paddr_t aligned_ps = ROUNDUP(ps, PAGE_SIZE);
> +    paddr_t aligned_pe = ROUNDDOWN(pe, PAGE_SIZE);
> +    unsigned long nr_mfns = PFN_DOWN(aligned_pe - aligned_ps);
> +    unsigned long frametable_size = nr_mfns * sizeof(*frame_table);
> +    mfn_t base_mfn;
> +
> +    if ( frametable_size > FRAMETABLE_SIZE )
> +        panic("The frametable cannot cover [%#"PRIpaddr", %#"PRIpaddr")\n",
> +              ps, pe);
> +
> +    /*
> +     * align base_mfn and frametable_size to MB(2) to have superpage mapping
> +     * in map_pages_to_xen()
> +     */
> +    frametable_size = ROUNDUP(frametable_size, MB(2));
> +    base_mfn = alloc_boot_pages(frametable_size >> PAGE_SHIFT, PFN_DOWN(MB(2)));
> +
> +    if ( map_pages_to_xen(FRAMETABLE_VIRT_START, base_mfn,
> +                          PFN_DOWN(frametable_size),
> +                          PAGE_HYPERVISOR_RW) )
> +        panic("frametable mappings failed: %#lx -> %#lx\n",
> +              FRAMETABLE_VIRT_START, mfn_x(base_mfn));
> +
> +    memset(&frame_table[0], -1, frametable_size);
> +    memset(&frame_table[PFN_DOWN(aligned_ps)],
> +           0, nr_mfns * sizeof(*frame_table));

Interesting - now you write out a huge amount of -1s, just to then overwrite
most of them with zeroes. I'm not going to insist that you change this yet
another time, but the performance hit from this is going to bite you/us as
soon as Xen is run on bigger-memory systems.

Plus, unless I'm mistaken, the function continues to rely on ps == 0 as
input. Just that the dependency is now better hidden. Specifically if you
calculate nr_mfns from the difference, and then use that for allocation,
then you need to offset the start of the mapping you create accordingly. At
which point you may need to apply extra care to cover the case where
sizeof(*frame_table) is not a power of two, and hence e.g. the first valid
page might have a struct instance straddling a page boundary.

> +/* Map the region in the directmap area. */
> +static void __init setup_directmap_mappings(unsigned long base_mfn,
> +                                            unsigned long nr_mfns)
> +{
> +    static mfn_t __initdata directmap_mfn_start = INVALID_MFN_INITIALIZER;
> +
> +    unsigned long base_addr = mfn_to_maddr(_mfn(base_mfn));
> +    unsigned long high_bits_mask = XEN_PT_LEVEL_MAP_MASK(HYP_PT_ROOT_LEVEL);
> +
> +    /* First call sets the directmap physical and virtual offset. */
> +    if ( mfn_eq(directmap_mfn_start, INVALID_MFN) )
> +    {
> +        directmap_mfn_start = _mfn(base_mfn);
> +
> +        directmap_virt_start -= (base_addr & high_bits_mask);
> +    }
> +
> +    if ( base_mfn < mfn_x(directmap_mfn_start) )
> +        panic("cannot add directmap mapping at %#lx below heap start %#lx\n",
> +              base_mfn, mfn_x(directmap_mfn_start));

Nit: Leftover use of "heap"?

> +    /*
> +     * The base address may not be aligned to the second level
> +     * size in case of Sv39 (e.g. 1GB when using 4KB pages).
> +     * This would prevent superpage mappings for all the regions
> +     * because the virtual address and machine address should
> +     * both be suitably aligned.
> +     *
> +     * Prevent that by offsetting the start of the directmap virtual
> +     * address.
> +     */
> +    if ( map_pages_to_xen(DIRECTMAP_VIRT_START + (base_addr & ~high_bits_mask),

I'm afraid this is correct only for the first invocation of the function.
For any further invocation you'd likely (attempt to) replace previously
established mappings. I think that here you need to use directmap_virt_start
instead.

> +/*
> + * Setup memory management
> + *
> + * RISC-V 64 has a large virtual address space (the minimum supported
> + * MMU mode is Sv39, which provides GBs of VA space).
> + *
> + * The directmap_virt_start is shifted lower in the VA space to
> + * (DIRECTMAP_VIRT_START - masked_low_bits_of_ram_start_address) to avoid
> + * wasting a large portion of the directmap space, this also allows for simple
> + * VA <-> PA translations. Also aligns DIRECTMAP_VIRT_START to a GB boundary
> + * (for Sv39; for other MMU mode boundaries will be bigger ) by masking the
> + * higher bits of the RAM start address to enable the use of superpages in
> + * map_pages_to_xen().
> + *
> + * The frametable is mapped starting from physical address 0, minimizing
> + * wasted VA space and simplifying page_to_mfn() and mfn_to_page()
> + * translations.
> + */
> +void __init setup_mm(void)
> +{
> +    const struct membanks *banks = bootinfo_get_mem();
> +    paddr_t ram_start = INVALID_PADDR;
> +    paddr_t ram_end = 0;
> +    paddr_t ram_size = 0;
> +    unsigned int i;
> +
> +    /*
> +     * We need some memory to allocate the page-tables used for the directmap
> +     * mappings. But some regions may contain memory already allocated
> +     * for other uses (e.g. modules, reserved-memory...).
> +     *
> +     * For simplicity, add all the free regions in the boot allocator.
> +     */
> +    populate_boot_allocator();
> +
> +    total_pages = 0;

Nit: Is this actually necessary?

> +    for ( i = 0; i < banks->nr_banks; i++ )
> +    {
> +        const struct membank *bank = &banks->bank[i];
> +        paddr_t bank_start = ROUNDUP(bank->start, PAGE_SIZE);
> +        paddr_t bank_end = ROUNDDOWN(bank->start + bank->size, PAGE_SIZE);
> +        unsigned long bank_size = bank_end - bank_start;
> +
> +        ram_size += bank_size;
> +        ram_start = min(ram_start, bank_start);
> +        ram_end = max(ram_end, bank_end);
> +
> +        setup_directmap_mappings(PFN_DOWN(bank_start), PFN_DOWN(bank_size));
> +    }

You maintain ram_start in the loop, just to then ...

> +    total_pages = PFN_DOWN(ram_size);
> +
> +    setup_frametable_mappings(0, ram_end);
> +    max_page = PFN_DOWN(ram_end);
> +}

... not use it at all - why?

Jan
Re: [PATCH v3 1/3] xen/riscv: introduce setup_mm()
Posted by oleksii.kurochko@gmail.com 2 weeks, 1 day ago
On Tue, 2024-11-05 at 16:20 +0100, Jan Beulich wrote:
> On 01.11.2024 14:16, Oleksii Kurochko wrote:> @@ -423,3 +429,138 @@
> void * __init early_fdt_map(paddr_t fdt_paddr)
> >  
> >      return fdt_virt;
> >  }
> > +
> > +vaddr_t __ro_after_init directmap_virt_start =
> > DIRECTMAP_VIRT_START;
> > +
> > +#ifndef CONFIG_RISCV_32
> > +
> > +/* Map a frame table to cover physical addresses ps through pe */
> > +static void __init setup_frametable_mappings(paddr_t ps, paddr_t
> > pe)
> > +{
> > +    paddr_t aligned_ps = ROUNDUP(ps, PAGE_SIZE);
> > +    paddr_t aligned_pe = ROUNDDOWN(pe, PAGE_SIZE);
> > +    unsigned long nr_mfns = PFN_DOWN(aligned_pe - aligned_ps);
> > +    unsigned long frametable_size = nr_mfns *
> > sizeof(*frame_table);
> > +    mfn_t base_mfn;
> > +
> > +    if ( frametable_size > FRAMETABLE_SIZE )
> > +        panic("The frametable cannot cover [%#"PRIpaddr",
> > %#"PRIpaddr")\n",
> > +              ps, pe);
> > +
> > +    /*
> > +     * align base_mfn and frametable_size to MB(2) to have
> > superpage mapping
> > +     * in map_pages_to_xen()
> > +     */
> > +    frametable_size = ROUNDUP(frametable_size, MB(2));
> > +    base_mfn = alloc_boot_pages(frametable_size >> PAGE_SHIFT,
> > PFN_DOWN(MB(2)));
> > +
> > +    if ( map_pages_to_xen(FRAMETABLE_VIRT_START, base_mfn,
> > +                          PFN_DOWN(frametable_size),
> > +                          PAGE_HYPERVISOR_RW) )
> > +        panic("frametable mappings failed: %#lx -> %#lx\n",
> > +              FRAMETABLE_VIRT_START, mfn_x(base_mfn));
> > +
> > +    memset(&frame_table[0], -1, frametable_size);
> > +    memset(&frame_table[PFN_DOWN(aligned_ps)],
> > +           0, nr_mfns * sizeof(*frame_table));
> 
> Interesting - now you write out a huge amount of -1s, just to then
> overwrite
> most of them with zeroes. I'm not going to insist that you change
> this yet
> another time, but the performance hit from this is going to bite
> you/us as
> soon as Xen is run on bigger-memory systems.
I agree that validating or invalidating frames in a single pass would
be preferable to nearly two passes. I’m considering whether there’s a
way to ensure that frame_table is set to -1 at compile time. It seems
the best approach (and only one?) might be to initialize frame_table in
one pass as follows:
1) [0, aligned_ps) = -1
2) [aligned_ps, nr_mfns * sizeof(*frame_table)) = 0
3) [nr_mfns * sizeof(*frame_table), frame_table_size) = -1
Does this approach seem optimal?

> 
> Plus, unless I'm mistaken, the function continues to rely on ps == 0
> as
> input. Just that the dependency is now better hidden. Specifically if
> you
> calculate nr_mfns from the difference, and then use that for
> allocation,
> then you need to offset the start of the mapping you create
> accordingly.
I'm not quite understanding why the method of calculating nr_mfns
affects how the frame_table is mapped. Isn’t it only necessary to
calculate the correct size of the frame_table that needs to be
allocated?

I assume by the offset you mean something similar to what was done for
directmap mapping?


>  At
> which point you may need to apply extra care to cover the case where
> sizeof(*frame_table) is not a power of two, and hence e.g. the first
> valid
> page might have a struct instance straddling a page boundary.
The first valid page is aligned_ps ( which is aligned on a page
boundary ) so assuming that sizeof(*frame_table) < PAGE_SIZE we can't
straddle a page boundary, can we?

> 
> > +    /*
> > +     * The base address may not be aligned to the second level
> > +     * size in case of Sv39 (e.g. 1GB when using 4KB pages).
> > +     * This would prevent superpage mappings for all the regions
> > +     * because the virtual address and machine address should
> > +     * both be suitably aligned.
> > +     *
> > +     * Prevent that by offsetting the start of the directmap
> > virtual
> > +     * address.
> > +     */
> > +    if ( map_pages_to_xen(DIRECTMAP_VIRT_START + (base_addr &
> > ~high_bits_mask),
> 
> I'm afraid this is correct only for the first invocation of the
> function.
> For any further invocation you'd likely (attempt to) replace
> previously
> established mappings. I think that here you need to use
> directmap_virt_start
> instead.
Banks are sorted by bank start address ( common/device-
tree/bootfdt.c:623 ):
       /*
        * On Arm64 setup_directmap_mappings() expects to be called with
   the lowest
        * bank in memory first. There is no requirement that the DT will
   provide
        * the banks sorted in ascending order. So sort them through.
        */
       sort(mem->bank, mem->nr_banks, sizeof(struct membank),
            cmp_memory_node, swap_memory_node);
( btw, comment is needed to be updated ... )

Thereby no replacement should happen if I don't miss something.

I am thinking wouldn't it be more correct to use mfn_to_virt(base_mfn):
       if ( map_pages_to_xen(__mfn_to_virt(base_mfn) + (base_addr &
   ~high_bits_mask),
                             _mfn(base_mfn), nr_mfns,
                             PAGE_HYPERVISOR_RW) )

> 
> > +/*
> > + * Setup memory management
> > + *
> > + * RISC-V 64 has a large virtual address space (the minimum
> > supported
> > + * MMU mode is Sv39, which provides GBs of VA space).
> > + *
> > + * The directmap_virt_start is shifted lower in the VA space to
> > + * (DIRECTMAP_VIRT_START - masked_low_bits_of_ram_start_address)
> > to avoid
> > + * wasting a large portion of the directmap space, this also
> > allows for simple
> > + * VA <-> PA translations. Also aligns DIRECTMAP_VIRT_START to a
> > GB boundary
> > + * (for Sv39; for other MMU mode boundaries will be bigger ) by
> > masking the
> > + * higher bits of the RAM start address to enable the use of
> > superpages in
> > + * map_pages_to_xen().
> > + *
> > + * The frametable is mapped starting from physical address 0,
> > minimizing
> > + * wasted VA space and simplifying page_to_mfn() and mfn_to_page()
> > + * translations.
> > + */
> > +void __init setup_mm(void)
> > +{
> > +    const struct membanks *banks = bootinfo_get_mem();
> > +    paddr_t ram_start = INVALID_PADDR;
> > +    paddr_t ram_end = 0;
> > +    paddr_t ram_size = 0;
> > +    unsigned int i;
> > +
> > +    /*
> > +     * We need some memory to allocate the page-tables used for
> > the directmap
> > +     * mappings. But some regions may contain memory already
> > allocated
> > +     * for other uses (e.g. modules, reserved-memory...).
> > +     *
> > +     * For simplicity, add all the free regions in the boot
> > allocator.
> > +     */
> > +    populate_boot_allocator();
> > +
> > +    total_pages = 0;
> 
> Nit: Is this actually necessary?
Agree, there is no need for total_pages. It should be dropped.

> 
> > +    for ( i = 0; i < banks->nr_banks; i++ )
> > +    {
> > +        const struct membank *bank = &banks->bank[i];
> > +        paddr_t bank_start = ROUNDUP(bank->start, PAGE_SIZE);
> > +        paddr_t bank_end = ROUNDDOWN(bank->start + bank->size,
> > PAGE_SIZE);
> > +        unsigned long bank_size = bank_end - bank_start;
> > +
> > +        ram_size += bank_size;
> > +        ram_start = min(ram_start, bank_start);
> > +        ram_end = max(ram_end, bank_end);
> > +
> > +        setup_directmap_mappings(PFN_DOWN(bank_start),
> > PFN_DOWN(bank_size));
> > +    }
> 
> You maintain ram_start in the loop, just to then ...
> 
> > +    total_pages = PFN_DOWN(ram_size);
> > +
> > +    setup_frametable_mappings(0, ram_end);
> > +    max_page = PFN_DOWN(ram_end);
> > +}
> 
> ... not use it at all - why?
ram_start was needed for the case when setup_frametable_mappings() used
it as the first argument. Now it isn't true anymore so should be
dropped.

Thanks.

~ Oleksii
Re: [PATCH v3 1/3] xen/riscv: introduce setup_mm()
Posted by Jan Beulich 2 weeks ago
On 06.11.2024 13:44, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-11-05 at 16:20 +0100, Jan Beulich wrote:
>> On 01.11.2024 14:16, Oleksii Kurochko wrote:> @@ -423,3 +429,138 @@
>> void * __init early_fdt_map(paddr_t fdt_paddr)
>>>  
>>>      return fdt_virt;
>>>  }
>>> +
>>> +vaddr_t __ro_after_init directmap_virt_start =
>>> DIRECTMAP_VIRT_START;
>>> +
>>> +#ifndef CONFIG_RISCV_32
>>> +
>>> +/* Map a frame table to cover physical addresses ps through pe */
>>> +static void __init setup_frametable_mappings(paddr_t ps, paddr_t
>>> pe)
>>> +{
>>> +    paddr_t aligned_ps = ROUNDUP(ps, PAGE_SIZE);
>>> +    paddr_t aligned_pe = ROUNDDOWN(pe, PAGE_SIZE);
>>> +    unsigned long nr_mfns = PFN_DOWN(aligned_pe - aligned_ps);
>>> +    unsigned long frametable_size = nr_mfns *
>>> sizeof(*frame_table);
>>> +    mfn_t base_mfn;
>>> +
>>> +    if ( frametable_size > FRAMETABLE_SIZE )
>>> +        panic("The frametable cannot cover [%#"PRIpaddr",
>>> %#"PRIpaddr")\n",
>>> +              ps, pe);
>>> +
>>> +    /*
>>> +     * align base_mfn and frametable_size to MB(2) to have
>>> superpage mapping
>>> +     * in map_pages_to_xen()
>>> +     */
>>> +    frametable_size = ROUNDUP(frametable_size, MB(2));
>>> +    base_mfn = alloc_boot_pages(frametable_size >> PAGE_SHIFT,
>>> PFN_DOWN(MB(2)));
>>> +
>>> +    if ( map_pages_to_xen(FRAMETABLE_VIRT_START, base_mfn,
>>> +                          PFN_DOWN(frametable_size),
>>> +                          PAGE_HYPERVISOR_RW) )
>>> +        panic("frametable mappings failed: %#lx -> %#lx\n",
>>> +              FRAMETABLE_VIRT_START, mfn_x(base_mfn));
>>> +
>>> +    memset(&frame_table[0], -1, frametable_size);
>>> +    memset(&frame_table[PFN_DOWN(aligned_ps)],
>>> +           0, nr_mfns * sizeof(*frame_table));
>>
>> Interesting - now you write out a huge amount of -1s, just to then
>> overwrite
>> most of them with zeroes. I'm not going to insist that you change
>> this yet
>> another time, but the performance hit from this is going to bite
>> you/us as
>> soon as Xen is run on bigger-memory systems.
> I agree that validating or invalidating frames in a single pass would
> be preferable to nearly two passes. I’m considering whether there’s a
> way to ensure that frame_table is set to -1 at compile time.

How would that work, if the entire frame table is allocated dynamically?

> It seems
> the best approach (and only one?) might be to initialize frame_table in
> one pass as follows:
> 1) [0, aligned_ps) = -1
> 2) [aligned_ps, nr_mfns * sizeof(*frame_table)) = 0
> 3) [nr_mfns * sizeof(*frame_table), frame_table_size) = -1
> Does this approach seem optimal?

That's what I would have expected, yes.

>> Plus, unless I'm mistaken, the function continues to rely on ps == 0
>> as
>> input. Just that the dependency is now better hidden. Specifically if
>> you
>> calculate nr_mfns from the difference, and then use that for
>> allocation,
>> then you need to offset the start of the mapping you create
>> accordingly.
> I'm not quite understanding why the method of calculating nr_mfns
> affects how the frame_table is mapped. Isn’t it only necessary to
> calculate the correct size of the frame_table that needs to be
> allocated?

Assume there's 4G of memory actually starting at 16G. Afaict you'll
allocate a frame table for those 4G, but you'll map it right at
FRAMETABLE_VIRT_START. Hence it'll cover the first 4G of physical
addresses, but _none_ of the actual memory you've got.

> I assume by the offset you mean something similar to what was done for
> directmap mapping?

Kind of, yes.

>>  At
>> which point you may need to apply extra care to cover the case where
>> sizeof(*frame_table) is not a power of two, and hence e.g. the first
>> valid
>> page might have a struct instance straddling a page boundary.
> The first valid page is aligned_ps ( which is aligned on a page
> boundary ) so assuming that sizeof(*frame_table) < PAGE_SIZE we can't
> straddle a page boundary, can we?

But sizeof(*frame_table) < PAGE_SIZE means nothing as to the alignment
of an individual struct instance in memory. Iirc sizeof(*frame_table)
is 48 for RISC-V, so the common alignment across struct instances isn't
going to be better than 8, and there _will_ be instances straddling
page boundaries.

>>> +    /*
>>> +     * The base address may not be aligned to the second level
>>> +     * size in case of Sv39 (e.g. 1GB when using 4KB pages).
>>> +     * This would prevent superpage mappings for all the regions
>>> +     * because the virtual address and machine address should
>>> +     * both be suitably aligned.
>>> +     *
>>> +     * Prevent that by offsetting the start of the directmap
>>> virtual
>>> +     * address.
>>> +     */
>>> +    if ( map_pages_to_xen(DIRECTMAP_VIRT_START + (base_addr &
>>> ~high_bits_mask),
>>
>> I'm afraid this is correct only for the first invocation of the
>> function.
>> For any further invocation you'd likely (attempt to) replace
>> previously
>> established mappings. I think that here you need to use
>> directmap_virt_start
>> instead.
> Banks are sorted by bank start address ( common/device-
> tree/bootfdt.c:623 ):
>        /*
>         * On Arm64 setup_directmap_mappings() expects to be called with
>    the lowest
>         * bank in memory first. There is no requirement that the DT will
>    provide
>         * the banks sorted in ascending order. So sort them through.
>         */
>        sort(mem->bank, mem->nr_banks, sizeof(struct membank),
>             cmp_memory_node, swap_memory_node);
> ( btw, comment is needed to be updated ... )
> 
> Thereby no replacement should happen if I don't miss something.

I don't see how banks being sorted matters here. On the 2nd invocation
you'll start mapping pages again from DIRECTMAP_VIRT_START plus an at
most 1G (for SV39) offset. If both banks have 2G, the resulting mappings
will necessarily overlap, if I'm not mistaken.

>>> +    for ( i = 0; i < banks->nr_banks; i++ )
>>> +    {
>>> +        const struct membank *bank = &banks->bank[i];
>>> +        paddr_t bank_start = ROUNDUP(bank->start, PAGE_SIZE);
>>> +        paddr_t bank_end = ROUNDDOWN(bank->start + bank->size,
>>> PAGE_SIZE);
>>> +        unsigned long bank_size = bank_end - bank_start;
>>> +
>>> +        ram_size += bank_size;
>>> +        ram_start = min(ram_start, bank_start);
>>> +        ram_end = max(ram_end, bank_end);
>>> +
>>> +        setup_directmap_mappings(PFN_DOWN(bank_start),
>>> PFN_DOWN(bank_size));
>>> +    }
>>
>> You maintain ram_start in the loop, just to then ...
>>
>>> +    total_pages = PFN_DOWN(ram_size);
>>> +
>>> +    setup_frametable_mappings(0, ram_end);
>>> +    max_page = PFN_DOWN(ram_end);
>>> +}
>>
>> ... not use it at all - why?
> ram_start was needed for the case when setup_frametable_mappings() used
> it as the first argument. Now it isn't true anymore so should be
> dropped.

As per above it may actually be necessary (or at least desirable) to pass
it into there again, if nothing else then to know how much of the initial
part of the (mapped) frame table to invalidate.

Jan

Re: [PATCH v3 1/3] xen/riscv: introduce setup_mm()
Posted by oleksii.kurochko@gmail.com 2 weeks ago
On Thu, 2024-11-07 at 10:19 +0100, Jan Beulich wrote:
> On 06.11.2024 13:44, oleksii.kurochko@gmail.com wrote:
> > On Tue, 2024-11-05 at 16:20 +0100, Jan Beulich wrote:
> > > On 01.11.2024 14:16, Oleksii Kurochko wrote:> @@ -423,3 +429,138
> > > @@
> > > void * __init early_fdt_map(paddr_t fdt_paddr)
> > > >  
> > > >      return fdt_virt;
> > > >  }
> > > > +
> > > > +vaddr_t __ro_after_init directmap_virt_start =
> > > > DIRECTMAP_VIRT_START;
> > > > +
> > > > +#ifndef CONFIG_RISCV_32
> > > > +
> > > > +/* Map a frame table to cover physical addresses ps through pe
> > > > */
> > > > +static void __init setup_frametable_mappings(paddr_t ps,
> > > > paddr_t
> > > > pe)
> > > > +{
> > > > +    paddr_t aligned_ps = ROUNDUP(ps, PAGE_SIZE);
> > > > +    paddr_t aligned_pe = ROUNDDOWN(pe, PAGE_SIZE);
> > > > +    unsigned long nr_mfns = PFN_DOWN(aligned_pe - aligned_ps);
> > > > +    unsigned long frametable_size = nr_mfns *
> > > > sizeof(*frame_table);
> > > > +    mfn_t base_mfn;
> > > > +
> > > > +    if ( frametable_size > FRAMETABLE_SIZE )
> > > > +        panic("The frametable cannot cover [%#"PRIpaddr",
> > > > %#"PRIpaddr")\n",
> > > > +              ps, pe);
> > > > +
> > > > +    /*
> > > > +     * align base_mfn and frametable_size to MB(2) to have
> > > > superpage mapping
> > > > +     * in map_pages_to_xen()
> > > > +     */
> > > > +    frametable_size = ROUNDUP(frametable_size, MB(2));
> > > > +    base_mfn = alloc_boot_pages(frametable_size >> PAGE_SHIFT,
> > > > PFN_DOWN(MB(2)));
> > > > +
> > > > +    if ( map_pages_to_xen(FRAMETABLE_VIRT_START, base_mfn,
> > > > +                          PFN_DOWN(frametable_size),
> > > > +                          PAGE_HYPERVISOR_RW) )
> > > > +        panic("frametable mappings failed: %#lx -> %#lx\n",
> > > > +              FRAMETABLE_VIRT_START, mfn_x(base_mfn));
> > > > +
> > > > +    memset(&frame_table[0], -1, frametable_size);
> > > > +    memset(&frame_table[PFN_DOWN(aligned_ps)],
> > > > +           0, nr_mfns * sizeof(*frame_table));
> > > 
> > > Interesting - now you write out a huge amount of -1s, just to
> > > then
> > > overwrite
> > > most of them with zeroes. I'm not going to insist that you change
> > > this yet
> > > another time, but the performance hit from this is going to bite
> > > you/us as
> > > soon as Xen is run on bigger-memory systems.
> > I agree that validating or invalidating frames in a single pass
> > would
> > be preferable to nearly two passes. I’m considering whether there’s
> > a
> > way to ensure that frame_table is set to -1 at compile time.
> 
> How would that work, if the entire frame table is allocated
> dynamically?
Yeah, there is no way to work in dynamically allocated way. Then let's
stick to ...
> 
> > It seems
> > the best approach (and only one?) might be to initialize
> > frame_table in
> > one pass as follows:
> > 1) [0, aligned_ps) = -1
> > 2) [aligned_ps, nr_mfns * sizeof(*frame_table)) = 0
> > 3) [nr_mfns * sizeof(*frame_table), frame_table_size) = -1
> > Does this approach seem optimal?
> 
...
> That's what I would have expected, yes.

> 
> > > Plus, unless I'm mistaken, the function continues to rely on ps
> > > == 0
> > > as
> > > input. Just that the dependency is now better hidden.
> > > Specifically if
> > > you
> > > calculate nr_mfns from the difference, and then use that for
> > > allocation,
> > > then you need to offset the start of the mapping you create
> > > accordingly.
> > I'm not quite understanding why the method of calculating nr_mfns
> > affects how the frame_table is mapped. Isn’t it only necessary to
> > calculate the correct size of the frame_table that needs to be
> > allocated?
> 
> Assume there's 4G of memory actually starting at 16G. Afaict you'll
> allocate a frame table for those 4G, but you'll map it right at
> FRAMETABLE_VIRT_START. Hence it'll cover the first 4G of physical
> addresses, but _none_ of the actual memory you've got.
I need to clarify some basics about the frame table.

Does Xen expect that frame_table[0] = 0 (PA), frame_table[1] = 4k (PA),
..., frame_table[x] = RAM_START_PA, frame_table[x+1] = RAM_START_PA +
4k, and so on?

My understanding is that it could be done as follows: frame_table[0] =
RAM_START_PA, frame_table[1] = RAM_START_PA + 4k, and so on, taking
into account mfn_to_page() and page_to_mfn() logic. (And yes, in that
case, the current implementations of mfn_to_page() and page_to_mfn()
aren't correct and should be updated as suggested here:
https://lore.kernel.org/xen-devel/cover.1730465154.git.oleksii.kurochko@gmail.com/T/#me2fc410f3d4758b71a9974d0be18a36f50d683b1as
as these implementations are based on that ps == 0). With this setup,
mapping FRAMETABLE_VIRT_START to base_mfn should be correct, shouldn’t
it?

With the current implementations of mfn_to_page() and page_to_mfn(), we
either need to allocate a larger frame_table to cover the [0, ram_end)
range (in which case mapping FRAMETABLE_VIRT_START to base_mfn would
work), or change the mapping to frame_table=( FRAMETABLE_VIRT_START +
PFN_DOWN(ram_start) ) -> base_mfn. Or to not loose virtual addrees
space of FRAMETABLE ( so the mapping will still be
FRAMETABLE_VIRT_START -> base_mfn ) to do the similar to directmap
mapping ( or what the changes suggested in the link above). Is my
understanding correct?

> 
> > >  At
> > > which point you may need to apply extra care to cover the case
> > > where
> > > sizeof(*frame_table) is not a power of two, and hence e.g. the
> > > first
> > > valid
> > > page might have a struct instance straddling a page boundary.
> > The first valid page is aligned_ps ( which is aligned on a page
> > boundary ) so assuming that sizeof(*frame_table) < PAGE_SIZE we
> > can't
> > straddle a page boundary, can we?
> 
> But sizeof(*frame_table) < PAGE_SIZE means nothing as to the
> alignment
> of an individual struct instance in memory. Iirc sizeof(*frame_table)
> is 48 for RISC-V, so the common alignment across struct instances
> isn't
> going to be better than 8, and there _will_ be instances straddling
> page boundaries.
If we speak about the alignment of an individual struct instance in
memory, what is the issue with that, except that it could be less
efficient when accessing this memory? This inefficiency could
potentially be addressed by adding some padding to the struct page_info
but then we will have bigger frame table size.
Another issue I can see is that the size of the frame table could be
calculated incorrectly. It may require an additional page to cover the
case when the frame table size isn't aligned to PAGE_SIZE, but this is
accounted for by rounding up the frame table size to 2MB
(frametable_size = ROUNDUP(frametable_size, MB(2));) before allocating
the frame table (base_mfn = alloc_boot_pages(frametable_size >>
PAGE_SHIFT, PFN_DOWN(MB(2)));).

Something else should be considered?

> 
> > > > +    /*
> > > > +     * The base address may not be aligned to the second level
> > > > +     * size in case of Sv39 (e.g. 1GB when using 4KB pages).
> > > > +     * This would prevent superpage mappings for all the
> > > > regions
> > > > +     * because the virtual address and machine address should
> > > > +     * both be suitably aligned.
> > > > +     *
> > > > +     * Prevent that by offsetting the start of the directmap
> > > > virtual
> > > > +     * address.
> > > > +     */
> > > > +    if ( map_pages_to_xen(DIRECTMAP_VIRT_START + (base_addr &
> > > > ~high_bits_mask),
> > > 
> > > I'm afraid this is correct only for the first invocation of the
> > > function.
> > > For any further invocation you'd likely (attempt to) replace
> > > previously
> > > established mappings. I think that here you need to use
> > > directmap_virt_start
> > > instead.
> > Banks are sorted by bank start address ( common/device-
> > tree/bootfdt.c:623 ):
> >        /*
> >         * On Arm64 setup_directmap_mappings() expects to be called
> > with
> >    the lowest
> >         * bank in memory first. There is no requirement that the DT
> > will
> >    provide
> >         * the banks sorted in ascending order. So sort them
> > through.
> >         */
> >        sort(mem->bank, mem->nr_banks, sizeof(struct membank),
> >             cmp_memory_node, swap_memory_node);
> > ( btw, comment is needed to be updated ... )
> > 
> > Thereby no replacement should happen if I don't miss something.
> 
> I don't see how banks being sorted matters here. On the 2nd
> invocation
> you'll start mapping pages again from DIRECTMAP_VIRT_START plus an at
> most 1G (for SV39) offset. If both banks have 2G, the resulting
> mappings
> will necessarily overlap, if I'm not mistaken.
Yes, you are write, overlapping is really possible.

Then __mfn_to_virt(base_mfn) to calculate VA of a bank start should be
used:
   @@ -473,24 +483,24 @@ static void __init
   setup_directmap_mappings(unsigned long base_mfn,
        {
            directmap_mfn_start = _mfn(base_mfn);
    
   -        directmap_virt_start -= (base_addr & high_bits_mask);
   +       /*
   +        * The base address may not be aligned to the second level
   +        * size in case of Sv39 (e.g. 1GB when using 4KB pages).
   +        * This would prevent superpage mappings for all the regions
   +        * because the virtual address and machine address should
   +        * both be suitably aligned.
   +        *
   +        * Prevent that by offsetting the start of the directmap
   virtual
   +        * address.
   +        */
   +        directmap_virt_start -= (base_addr & high_bits_mask) +
   (base_addr & ~high_bits_mask);
        }
    
        if ( base_mfn < mfn_x(directmap_mfn_start) )
            panic("cannot add directmap mapping at %#lx below heap
   start %#lx\n",
                  base_mfn, mfn_x(directmap_mfn_start));
    
   -    /*
   -     * The base address may not be aligned to the second level
   -     * size in case of Sv39 (e.g. 1GB when using 4KB pages).
   -     * This would prevent superpage mappings for all the regions
   -     * because the virtual address and machine address should
   -     * both be suitably aligned.
   -     *
   -     * Prevent that by offsetting the start of the directmap
   virtual
   -     * address.
   -     */
   -    if ( map_pages_to_xen(DIRECTMAP_VIRT_START + (base_addr &
   ~high_bits_mask),
   +    if ( map_pages_to_xen((vaddr_t)mfn_to_virt(base_mfn),
                              _mfn(base_mfn), nr_mfns,
                              PAGE_HYPERVISOR_RW) )

Thanks!

~ Oleksii
Re: [PATCH v3 1/3] xen/riscv: introduce setup_mm()
Posted by Jan Beulich 1 week, 6 days ago
On 07.11.2024 13:32, oleksii.kurochko@gmail.com wrote:
> On Thu, 2024-11-07 at 10:19 +0100, Jan Beulich wrote:
>> On 06.11.2024 13:44, oleksii.kurochko@gmail.com wrote:
>>> On Tue, 2024-11-05 at 16:20 +0100, Jan Beulich wrote:
>>>> On 01.11.2024 14:16, Oleksii Kurochko wrote:> @@ -423,3 +429,138
>>>> Plus, unless I'm mistaken, the function continues to rely on ps
>>>> == 0
>>>> as
>>>> input. Just that the dependency is now better hidden.
>>>> Specifically if
>>>> you
>>>> calculate nr_mfns from the difference, and then use that for
>>>> allocation,
>>>> then you need to offset the start of the mapping you create
>>>> accordingly.
>>> I'm not quite understanding why the method of calculating nr_mfns
>>> affects how the frame_table is mapped. Isn’t it only necessary to
>>> calculate the correct size of the frame_table that needs to be
>>> allocated?
>>
>> Assume there's 4G of memory actually starting at 16G. Afaict you'll
>> allocate a frame table for those 4G, but you'll map it right at
>> FRAMETABLE_VIRT_START. Hence it'll cover the first 4G of physical
>> addresses, but _none_ of the actual memory you've got.
> I need to clarify some basics about the frame table.
> 
> Does Xen expect that frame_table[0] = 0 (PA), frame_table[1] = 4k (PA),
> ..., frame_table[x] = RAM_START_PA, frame_table[x+1] = RAM_START_PA +
> 4k, and so on?

Not strictly, no. You'll find that common code has very few uses of
frame_table, so many aspects are fully up to the arch. However, there
is the assumption you mention above in PDX space. When PDX == PFN
clearly that assumption would then also hold for PFNs.

> My understanding is that it could be done as follows: frame_table[0] =
> RAM_START_PA, frame_table[1] = RAM_START_PA + 4k, and so on, taking
> into account mfn_to_page() and page_to_mfn() logic. (And yes, in that
> case, the current implementations of mfn_to_page() and page_to_mfn()
> aren't correct and should be updated as suggested here:
> https://lore.kernel.org/xen-devel/cover.1730465154.git.oleksii.kurochko@gmail.com/T/#me2fc410f3d4758b71a9974d0be18a36f50d683b1as
> as these implementations are based on that ps == 0). With this setup,
> mapping FRAMETABLE_VIRT_START to base_mfn should be correct, shouldn’t
> it?

Yes.

> With the current implementations of mfn_to_page() and page_to_mfn(), we
> either need to allocate a larger frame_table to cover the [0, ram_end)
> range (in which case mapping FRAMETABLE_VIRT_START to base_mfn would
> work), or change the mapping to frame_table=( FRAMETABLE_VIRT_START +
> PFN_DOWN(ram_start) ) -> base_mfn. Or to not loose virtual addrees
> space of FRAMETABLE ( so the mapping will still be
> FRAMETABLE_VIRT_START -> base_mfn ) to do the similar to directmap
> mapping ( or what the changes suggested in the link above). Is my
> understanding correct?

Largely yes. There's one more aspect to consider: Even when frame_table[0]
maps MFN 0, the initial part of frame_table[] doesn't necessarily need
mapping to any memory when RAM starts at a much higher address. Valid
struct page_info instances only need to exist for any MFN for which
mfn_valid() returns true.

>>>>  At
>>>> which point you may need to apply extra care to cover the case
>>>> where
>>>> sizeof(*frame_table) is not a power of two, and hence e.g. the
>>>> first
>>>> valid
>>>> page might have a struct instance straddling a page boundary.
>>> The first valid page is aligned_ps ( which is aligned on a page
>>> boundary ) so assuming that sizeof(*frame_table) < PAGE_SIZE we
>>> can't
>>> straddle a page boundary, can we?
>>
>> But sizeof(*frame_table) < PAGE_SIZE means nothing as to the
>> alignment
>> of an individual struct instance in memory. Iirc sizeof(*frame_table)
>> is 48 for RISC-V, so the common alignment across struct instances
>> isn't
>> going to be better than 8, and there _will_ be instances straddling
>> page boundaries.
> If we speak about the alignment of an individual struct instance in
> memory, what is the issue with that, except that it could be less
> efficient when accessing this memory? This inefficiency could
> potentially be addressed by adding some padding to the struct page_info
> but then we will have bigger frame table size.
> Another issue I can see is that the size of the frame table could be
> calculated incorrectly. It may require an additional page to cover the
> case when the frame table size isn't aligned to PAGE_SIZE, but this is
> accounted for by rounding up the frame table size to 2MB
> (frametable_size = ROUNDUP(frametable_size, MB(2));) before allocating
> the frame table (base_mfn = alloc_boot_pages(frametable_size >>
> PAGE_SHIFT, PFN_DOWN(MB(2)));).
> 
> Something else should be considered?

Much depends on what your decision is as to the earlier aspect. If
you map [0,end), then what you have ought to be fine (except for
being wasteful).

Jan

Re: [PATCH v3 1/3] xen/riscv: introduce setup_mm()
Posted by oleksii.kurochko@gmail.com 2 weeks, 1 day ago
On Wed, 2024-11-06 at 13:44 +0100, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-11-05 at 16:20 +0100, Jan Beulich wrote:
> > On 01.11.2024 14:16, Oleksii Kurochko wrote:> @@ -423,3 +429,138 @@
> > void * __init early_fdt_map(paddr_t fdt_paddr)
> > >  
> > >      return fdt_virt;
> > >  }
> > > +
> > > +vaddr_t __ro_after_init directmap_virt_start =
> > > DIRECTMAP_VIRT_START;
> > > +
> > > +#ifndef CONFIG_RISCV_32
> > > +
> > > +/* Map a frame table to cover physical addresses ps through pe
> > > */
> > > +static void __init setup_frametable_mappings(paddr_t ps, paddr_t
> > > pe)
> > > +{
> > > +    paddr_t aligned_ps = ROUNDUP(ps, PAGE_SIZE);
> > > +    paddr_t aligned_pe = ROUNDDOWN(pe, PAGE_SIZE);
> > > +    unsigned long nr_mfns = PFN_DOWN(aligned_pe - aligned_ps);
> > > +    unsigned long frametable_size = nr_mfns *
> > > sizeof(*frame_table);
> > > +    mfn_t base_mfn;
> > > +
> > > +    if ( frametable_size > FRAMETABLE_SIZE )
> > > +        panic("The frametable cannot cover [%#"PRIpaddr",
> > > %#"PRIpaddr")\n",
> > > +              ps, pe);
> > > +
> > > +    /*
> > > +     * align base_mfn and frametable_size to MB(2) to have
> > > superpage mapping
> > > +     * in map_pages_to_xen()
> > > +     */
> > > +    frametable_size = ROUNDUP(frametable_size, MB(2));
> > > +    base_mfn = alloc_boot_pages(frametable_size >> PAGE_SHIFT,
> > > PFN_DOWN(MB(2)));
> > > +
> > > +    if ( map_pages_to_xen(FRAMETABLE_VIRT_START, base_mfn,
> > > +                          PFN_DOWN(frametable_size),
> > > +                          PAGE_HYPERVISOR_RW) )
> > > +        panic("frametable mappings failed: %#lx -> %#lx\n",
> > > +              FRAMETABLE_VIRT_START, mfn_x(base_mfn));
> > > +
> > > +    memset(&frame_table[0], -1, frametable_size);
> > > +    memset(&frame_table[PFN_DOWN(aligned_ps)],
> > > +           0, nr_mfns * sizeof(*frame_table));
> > 
> > Interesting - now you write out a huge amount of -1s, just to then
> > overwrite
> > most of them with zeroes. I'm not going to insist that you change
> > this yet
> > another time, but the performance hit from this is going to bite
> > you/us as
> > soon as Xen is run on bigger-memory systems.
> I agree that validating or invalidating frames in a single pass would
> be preferable to nearly two passes. I’m considering whether there’s a
> way to ensure that frame_table is set to -1 at compile time. It seems
> the best approach (and only one?) might be to initialize frame_table
> in
> one pass as follows:
> 1) [0, aligned_ps) = -1
> 2) [aligned_ps, nr_mfns * sizeof(*frame_table)) = 0
> 3) [nr_mfns * sizeof(*frame_table), frame_table_size) = -1
> Does this approach seem optimal?
> 
> > 
> > Plus, unless I'm mistaken, the function continues to rely on ps ==
> > 0
> > as
> > input. Just that the dependency is now better hidden. Specifically
> > if
> > you
> > calculate nr_mfns from the difference, and then use that for
> > allocation,
> > then you need to offset the start of the mapping you create
> > accordingly.
> I'm not quite understanding why the method of calculating nr_mfns
> affects how the frame_table is mapped. Isn’t it only necessary to
> calculate the correct size of the frame_table that needs to be
> allocated?
> 
> I assume by the offset you mean something similar to what was done
> for
> directmap mapping?
Here is the small diff for simplicity:
   --- a/xen/arch/riscv/include/asm/mm.h
   +++ b/xen/arch/riscv/include/asm/mm.h
   @@ -132,11 +132,13 @@ struct page_info
        };
    };
    
   +extern struct page_info *frame_table_virt_start;
   +
    #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START)
    
    /* Convert between machine frame numbers and page-info structures.
   */
   -#define mfn_to_page(mfn)    (frame_table + mfn_x(mfn))
   -#define page_to_mfn(pg)     _mfn((pg) - frame_table)
   +#define mfn_to_page(mfn)    (frame_table_virt_start + mfn_x(mfn))
   +#define page_to_mfn(pg)     _mfn((pg) - frame_table_virt_start)
    
    static inline void *page_to_virt(const struct page_info *pg)
    {
   diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
   index db75aa1d4f..9bd15597a9 100644
   --- a/xen/arch/riscv/mm.c
   +++ b/xen/arch/riscv/mm.c
   @@ -426,6 +426,8 @@ void * __init early_fdt_map(paddr_t fdt_paddr)
    
    vaddr_t __ro_after_init directmap_virt_start =
   DIRECTMAP_VIRT_START;
    
   +struct page_info *frame_table_virt_start;
   +
    #ifndef CONFIG_RISCV_32
    
    /* Map a frame table to cover physical addresses ps through pe */
   @@ -437,6 +439,11 @@ static void __init
   setup_frametable_mappings(paddr_t ps, paddr_t pe)
        unsigned long frametable_size = nr_mfns * sizeof(*frame_table);
        mfn_t base_mfn;
    
   +    if ( !frame_table_virt_start )
   +    {
   +        frame_table_virt_start = frame_table -
   paddr_to_pfn(aligned_ps);
   +    }
   +
        if ( frametable_size > FRAMETABLE_SIZE )
            panic("The frametable cannot cover [%#"PRIpaddr",
   %#"PRIpaddr")\n",
                  ps, pe);
   @@ -454,9 +461,12 @@ static void __init
   setup_frametable_mappings(paddr_t ps, paddr_t pe)
            panic("frametable mappings failed: %#lx -> %#lx\n",
                  FRAMETABLE_VIRT_START, mfn_x(base_mfn));
    
   -    memset(&frame_table[0], -1, frametable_size);
   -    memset(&frame_table[PFN_DOWN(aligned_ps)],
   -           0, nr_mfns * sizeof(*frame_table));
   +    memset(&frame_table[0], -1,
   +           PFN_DOWN(aligned_ps) * sizeof(*frame_table));
   +    memset(&frame_table[PFN_DOWN(aligned_ps)], 0,
   +           nr_mfns * sizeof(*frame_table));
   +    memset(&frame_table[PFN_DOWN(aligned_pe)], -1,
   +           frametable_size - (nr_mfns * sizeof(*frame_table)));
    }

~ Oleksii 

> 
> 
> >  At
> > which point you may need to apply extra care to cover the case
> > where
> > sizeof(*frame_table) is not a power of two, and hence e.g. the
> > first
> > valid
> > page might have a struct instance straddling a page boundary.
> The first valid page is aligned_ps ( which is aligned on a page
> boundary ) so assuming that sizeof(*frame_table) < PAGE_SIZE we can't
> straddle a page boundary, can we?
> 
> > 
> > > +    /*
> > > +     * The base address may not be aligned to the second level
> > > +     * size in case of Sv39 (e.g. 1GB when using 4KB pages).
> > > +     * This would prevent superpage mappings for all the regions
> > > +     * because the virtual address and machine address should
> > > +     * both be suitably aligned.
> > > +     *
> > > +     * Prevent that by offsetting the start of the directmap
> > > virtual
> > > +     * address.
> > > +     */
> > > +    if ( map_pages_to_xen(DIRECTMAP_VIRT_START + (base_addr &
> > > ~high_bits_mask),
> > 
> > I'm afraid this is correct only for the first invocation of the
> > function.
> > For any further invocation you'd likely (attempt to) replace
> > previously
> > established mappings. I think that here you need to use
> > directmap_virt_start
> > instead.
> Banks are sorted by bank start address ( common/device-
> tree/bootfdt.c:623 ):
>        /*
>         * On Arm64 setup_directmap_mappings() expects to be called
> with
>    the lowest
>         * bank in memory first. There is no requirement that the DT
> will
>    provide
>         * the banks sorted in ascending order. So sort them through.
>         */
>        sort(mem->bank, mem->nr_banks, sizeof(struct membank),
>             cmp_memory_node, swap_memory_node);
> ( btw, comment is needed to be updated ... )
> 
> Thereby no replacement should happen if I don't miss something.
> 
> I am thinking wouldn't it be more correct to use
> mfn_to_virt(base_mfn):
>        if ( map_pages_to_xen(__mfn_to_virt(base_mfn) + (base_addr &
>    ~high_bits_mask),
>                              _mfn(base_mfn), nr_mfns,
>                              PAGE_HYPERVISOR_RW) )
> 
> > 
> > > +/*
> > > + * Setup memory management
> > > + *
> > > + * RISC-V 64 has a large virtual address space (the minimum
> > > supported
> > > + * MMU mode is Sv39, which provides GBs of VA space).
> > > + *
> > > + * The directmap_virt_start is shifted lower in the VA space to
> > > + * (DIRECTMAP_VIRT_START - masked_low_bits_of_ram_start_address)
> > > to avoid
> > > + * wasting a large portion of the directmap space, this also
> > > allows for simple
> > > + * VA <-> PA translations. Also aligns DIRECTMAP_VIRT_START to a
> > > GB boundary
> > > + * (for Sv39; for other MMU mode boundaries will be bigger ) by
> > > masking the
> > > + * higher bits of the RAM start address to enable the use of
> > > superpages in
> > > + * map_pages_to_xen().
> > > + *
> > > + * The frametable is mapped starting from physical address 0,
> > > minimizing
> > > + * wasted VA space and simplifying page_to_mfn() and
> > > mfn_to_page()
> > > + * translations.
> > > + */
> > > +void __init setup_mm(void)
> > > +{
> > > +    const struct membanks *banks = bootinfo_get_mem();
> > > +    paddr_t ram_start = INVALID_PADDR;
> > > +    paddr_t ram_end = 0;
> > > +    paddr_t ram_size = 0;
> > > +    unsigned int i;
> > > +
> > > +    /*
> > > +     * We need some memory to allocate the page-tables used for
> > > the directmap
> > > +     * mappings. But some regions may contain memory already
> > > allocated
> > > +     * for other uses (e.g. modules, reserved-memory...).
> > > +     *
> > > +     * For simplicity, add all the free regions in the boot
> > > allocator.
> > > +     */
> > > +    populate_boot_allocator();
> > > +
> > > +    total_pages = 0;
> > 
> > Nit: Is this actually necessary?
> Agree, there is no need for total_pages. It should be dropped.
> 
> > 
> > > +    for ( i = 0; i < banks->nr_banks; i++ )
> > > +    {
> > > +        const struct membank *bank = &banks->bank[i];
> > > +        paddr_t bank_start = ROUNDUP(bank->start, PAGE_SIZE);
> > > +        paddr_t bank_end = ROUNDDOWN(bank->start + bank->size,
> > > PAGE_SIZE);
> > > +        unsigned long bank_size = bank_end - bank_start;
> > > +
> > > +        ram_size += bank_size;
> > > +        ram_start = min(ram_start, bank_start);
> > > +        ram_end = max(ram_end, bank_end);
> > > +
> > > +        setup_directmap_mappings(PFN_DOWN(bank_start),
> > > PFN_DOWN(bank_size));
> > > +    }
> > 
> > You maintain ram_start in the loop, just to then ...
> > 
> > > +    total_pages = PFN_DOWN(ram_size);
> > > +
> > > +    setup_frametable_mappings(0, ram_end);
> > > +    max_page = PFN_DOWN(ram_end);
> > > +}
> > 
> > ... not use it at all - why?
> ram_start was needed for the case when setup_frametable_mappings()
> used
> it as the first argument. Now it isn't true anymore so should be
> dropped.
> 
> Thanks.
> 
> ~ Oleksii