[PATCH v2-ish] x86/boot: Factor move_xen() out of __start_xen()

Andrew Cooper posted 1 patch 1 year, 4 months ago
Test gitlab-ci failed
Patches applied successfully (tree, apply log)
git fetch https://gitlab.com/xen-project/patchew/xen tags/patchew/20221216201749.32164-1-andrew.cooper3@citrix.com
There is a newer version of this series
xen/arch/x86/setup.c | 178 +++++++++++++++++++++++++++++++--------------------
1 file changed, 108 insertions(+), 70 deletions(-)
[PATCH v2-ish] x86/boot: Factor move_xen() out of __start_xen()
Posted by Andrew Cooper 1 year, 4 months ago
Partly for clarity because there is a lot of subtle magic at work here.
Expand the commentary of what's going on.

Also, because there is no need to double copy the stack (32kB) to retrieve
local values spilled to the stack under the old alias, when all of the
aforementioned local variables are going out of scope anyway.

There is also a logic change when walking l2_xenmap[].  The old logic would
skip recursing into 4k mappings; this would corrupt Xen if it were used.
There are no 4k mappings in l2_xenmap[] this early on boot, so assert the
property instead of leaving a latent breakage.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>

We probably want to drop PGE from XEN_MINIMAL_CR4.  It will simplify several
boot paths which don't need to care about global pages, and PGE is conditional
anyway now with the PCID support.

v2-ish:
 * Split out previous series.  Was previously "x86/boot: Don't double-copy the
   stack during physical relocation" reworked to remove the risk of losing
   local stack updated we might care about.

Note that putting a printk() in the critical region wouldn't have worked even
with the old logic, and is not a reduction in functionality caused by this
patch avoiding the second stack copy.
---
 xen/arch/x86/setup.c | 178 +++++++++++++++++++++++++++++++--------------------
 1 file changed, 108 insertions(+), 70 deletions(-)

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 4102aae76dde..830502f2d0d9 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -467,6 +467,113 @@ static void __init move_memory(
     }
 }
 
+static void __init noinline move_xen(void)
+{
+    l4_pgentry_t *pl4e;
+    l3_pgentry_t *pl3e;
+    l2_pgentry_t *pl2e;
+    unsigned long tmp;
+    int i, j, k;
+
+    /*
+     * The caller has selected xen_phys_start, ensuring that the old and new
+     * locations do not overlap.  Make it so.
+     *
+     * Prevent the compiler from reordering anything across this point.  Such
+     * things will end badly.
+     */
+    barrier();
+
+    /*
+     * Copy out of the current alias, into the directmap at the new location.
+     * This includes a snapshot of the current stack.
+     */
+    memcpy(__va(__pa(_start)), _start, _end - _start);
+
+    /*
+     * We are now in a critical region.  Any write (modifying a global,
+     * spilling a local to the stack, etc) via the current alias will get lost
+     * when we switch to the new pagetables.  This even means no printk()'s
+     * debugging purposes.
+     *
+     * Walk the soon-to-be-used pagetables in the new location, relocating all
+     * intermediate (non-leaf) entries to point to their new-location
+     * equivalents.  All writes are via the directmap alias.
+     */
+    pl4e = __va(__pa(idle_pg_table));
+    for ( i = 0 ; i < L4_PAGETABLE_ENTRIES; i++, pl4e++ )
+    {
+        if ( !(l4e_get_flags(*pl4e) & _PAGE_PRESENT) )
+            continue;
+
+        *pl4e = l4e_from_intpte(l4e_get_intpte(*pl4e) + xen_phys_start);
+        pl3e = __va(l4e_get_paddr(*pl4e));
+        for ( j = 0; j < L3_PAGETABLE_ENTRIES; j++, pl3e++ )
+        {
+            if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) ||
+                 (l3e_get_flags(*pl3e) & _PAGE_PSE) )
+                continue;
+
+            *pl3e = l3e_from_intpte(l3e_get_intpte(*pl3e) + xen_phys_start);
+            pl2e = __va(l3e_get_paddr(*pl3e));
+            for ( k = 0; k < L2_PAGETABLE_ENTRIES; k++, pl2e++ )
+            {
+                if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) ||
+                     (l2e_get_flags(*pl2e) & _PAGE_PSE) )
+                    continue;
+
+                *pl2e = l2e_from_intpte(l2e_get_intpte(*pl2e) + xen_phys_start);
+            }
+        }
+    }
+
+    /*
+     * Walk the soon-to-be-used l2_xenmap[], relocating all the leaf mappings
+     * so text/data/bss etc refer to the new location in memory.
+     */
+    pl2e = __va(__pa(l2_xenmap));
+    for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++, pl2e++ )
+    {
+        if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) )
+            continue;
+
+        /*
+         * We don't have 4k mappings in l2_xenmap[] at this point in boot, nor
+         * is this anticipated to change.  Simply assert the fact, rather than
+         * introducing dead logic to decend into l1 tables.
+         */
+        ASSERT(l2e_get_flags(*pl2e) & _PAGE_PSE);
+
+        *pl2e = l2e_from_intpte(l2e_get_intpte(*pl2e) + xen_phys_start);
+    }
+
+    /*
+     * Switch to relocated pagetables, shooting down global mappings as we go.
+     */
+    asm volatile (
+        "mov    %%cr4, %[cr4]\n\t"
+        "andb   $~%c[pge], %b[cr4]\n\t"
+        "mov    %[cr4], %%cr4\n\t"     /* CR4.PGE = 0 */
+        "mov    %[cr3], %%cr3\n\t"     /* CR3 = new pagetables */
+        "orb    $%c[pge], %b[cr4]\n\t"
+        "mov    %[cr4], %%cr4\n\t"     /* CR4.PGE = 1 */
+        : [cr4] "=&a" (tmp) /* Could be "r", but "a" makes better asm */
+        : [cr3] "r" (__pa(idle_pg_table)),
+          [pge] "i" (X86_CR4_PGE)
+        : "memory" );
+
+    /*
+     * End of the critical region.  Updates to locals and globals now work as
+     * expected.
+     *
+     * Updates to local variables which have been spilled to the stack since
+     * the memcpy() have been lost.  But we don't care, because they're all
+     * going out of scope imminently.
+     */
+
+    printk("New Xen image base address: %#lx\n", xen_phys_start);
+}
+
 #undef BOOTSTRAP_MAP_LIMIT
 
 static uint64_t __init consider_modules(
@@ -1255,81 +1362,12 @@ void __init noreturn __start_xen(unsigned long mbi_p)
          */
         if ( (end > s) && (end - reloc_size + XEN_IMG_OFFSET >= __pa(_end)) )
         {
-            l4_pgentry_t *pl4e;
-            l3_pgentry_t *pl3e;
-            l2_pgentry_t *pl2e;
-            int i, j, k;
-
             /* Select relocation address. */
             xen_phys_start = end - reloc_size;
             e = xen_phys_start + XEN_IMG_OFFSET;
             bootsym(trampoline_xen_phys_start) = xen_phys_start;
 
-            /*
-             * Perform relocation to new physical address.
-             * Before doing so we must sync static/global data with main memory
-             * with a barrier(). After this we must *not* modify static/global
-             * data until after we have switched to the relocated pagetables!
-             */
-            barrier();
-            memcpy(__va(__pa(_start)), _start, _end - _start);
-
-            /* Walk idle_pg_table, relocating non-leaf entries. */
-            pl4e = __va(__pa(idle_pg_table));
-            for ( i = 0 ; i < L4_PAGETABLE_ENTRIES; i++, pl4e++ )
-            {
-                if ( !(l4e_get_flags(*pl4e) & _PAGE_PRESENT) )
-                    continue;
-                *pl4e = l4e_from_intpte(l4e_get_intpte(*pl4e) +
-                                        xen_phys_start);
-                pl3e = __va(l4e_get_paddr(*pl4e));
-                for ( j = 0; j < L3_PAGETABLE_ENTRIES; j++, pl3e++ )
-                {
-                    /* Not present, 1GB mapping, or already relocated? */
-                    if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) ||
-                         (l3e_get_flags(*pl3e) & _PAGE_PSE) )
-                        continue;
-                    *pl3e = l3e_from_intpte(l3e_get_intpte(*pl3e) +
-                                            xen_phys_start);
-                    pl2e = __va(l3e_get_paddr(*pl3e));
-                    for ( k = 0; k < L2_PAGETABLE_ENTRIES; k++, pl2e++ )
-                    {
-                        /* Not present, PSE, or already relocated? */
-                        if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) ||
-                             (l2e_get_flags(*pl2e) & _PAGE_PSE) )
-                            continue;
-                        *pl2e = l2e_from_intpte(l2e_get_intpte(*pl2e) +
-                                                xen_phys_start);
-                    }
-                }
-            }
-
-            /* Walk l2_xenmap[], relocating 2M superpage leaves. */
-            pl2e = __va(__pa(l2_xenmap));
-            for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++, pl2e++ )
-            {
-                if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) ||
-                     !(l2e_get_flags(*pl2e) & _PAGE_PSE) )
-                    continue;
-
-                *pl2e = l2e_from_intpte(l2e_get_intpte(*pl2e) + xen_phys_start);
-            }
-
-            /* Re-sync the stack and then switch to relocated pagetables. */
-            asm volatile (
-                "rep movsq        ; " /* re-sync the stack */
-                "movq %%cr4,%%rsi ; "
-                "andb $0x7f,%%sil ; "
-                "movq %%rsi,%%cr4 ; " /* CR4.PGE == 0 */
-                "movq %[pg],%%cr3 ; " /* CR3 == new pagetables */
-                "orb $0x80,%%sil  ; "
-                "movq %%rsi,%%cr4   " /* CR4.PGE == 1 */
-                : "=&S" (i), "=&D" (i), "=&c" (i) /* All outputs discarded. */
-                :  [pg] "r" (__pa(idle_pg_table)), "0" (cpu0_stack),
-                   "1" (__va(__pa(cpu0_stack))), "2" (STACK_SIZE / 8)
-                : "memory" );
-
-            printk("New Xen image base address: %#lx\n", xen_phys_start);
+            move_xen();
         }
 
         /* Is the region suitable for relocating the multiboot modules? */
-- 
2.11.0


Re: [PATCH v2-ish] x86/boot: Factor move_xen() out of __start_xen()
Posted by Jan Beulich 1 year, 4 months ago
On 16.12.2022 21:17, Andrew Cooper wrote:
> Partly for clarity because there is a lot of subtle magic at work here.
> Expand the commentary of what's going on.
> 
> Also, because there is no need to double copy the stack (32kB) to retrieve
> local values spilled to the stack under the old alias, when all of the
> aforementioned local variables are going out of scope anyway.
> 
> There is also a logic change when walking l2_xenmap[].  The old logic would
> skip recursing into 4k mappings; this would corrupt Xen if it were used.
> There are no 4k mappings in l2_xenmap[] this early on boot, so assert the
> property instead of leaving a latent breakage.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Wei Liu <wl@xen.org>
> 
> We probably want to drop PGE from XEN_MINIMAL_CR4.  It will simplify several
> boot paths which don't need to care about global pages, and PGE is conditional
> anyway now with the PCID support.

Perhaps, especially if - as you say - this allows for simplification.

> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -467,6 +467,113 @@ static void __init move_memory(
>      }
>  }
>  
> +static void __init noinline move_xen(void)
> +{
> +    l4_pgentry_t *pl4e;
> +    l3_pgentry_t *pl3e;
> +    l2_pgentry_t *pl2e;
> +    unsigned long tmp;
> +    int i, j, k;

Can these become "unsigned int" please at this occasion? (Perhaps as
another reason to make the change, mention that the old instances of i and
j did shadow outer scope variables in the same function?)

> +    /*
> +     * The caller has selected xen_phys_start, ensuring that the old and new
> +     * locations do not overlap.  Make it so.

As a non-native speaker I'm struggling with what "Make it so" is supposed
to tell me here.

> +     * Prevent the compiler from reordering anything across this point.  Such
> +     * things will end badly.
> +     */
> +    barrier();
> +
> +    /*
> +     * Copy out of the current alias, into the directmap at the new location.
> +     * This includes a snapshot of the current stack.
> +     */
> +    memcpy(__va(__pa(_start)), _start, _end - _start);
> +
> +    /*
> +     * We are now in a critical region.  Any write (modifying a global,
> +     * spilling a local to the stack, etc) via the current alias will get lost
> +     * when we switch to the new pagetables.  This even means no printk()'s
> +     * debugging purposes.

Nit: Missing "for"?

> +     * Walk the soon-to-be-used pagetables in the new location, relocating all
> +     * intermediate (non-leaf) entries to point to their new-location
> +     * equivalents.  All writes are via the directmap alias.
> +     */
> +    pl4e = __va(__pa(idle_pg_table));
> +    for ( i = 0 ; i < L4_PAGETABLE_ENTRIES; i++, pl4e++ )
> +    {
> +        if ( !(l4e_get_flags(*pl4e) & _PAGE_PRESENT) )
> +            continue;
> +
> +        *pl4e = l4e_from_intpte(l4e_get_intpte(*pl4e) + xen_phys_start);
> +        pl3e = __va(l4e_get_paddr(*pl4e));
> +        for ( j = 0; j < L3_PAGETABLE_ENTRIES; j++, pl3e++ )
> +        {
> +            if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) ||
> +                 (l3e_get_flags(*pl3e) & _PAGE_PSE) )
> +                continue;
> +
> +            *pl3e = l3e_from_intpte(l3e_get_intpte(*pl3e) + xen_phys_start);
> +            pl2e = __va(l3e_get_paddr(*pl3e));
> +            for ( k = 0; k < L2_PAGETABLE_ENTRIES; k++, pl2e++ )
> +            {
> +                if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) ||
> +                     (l2e_get_flags(*pl2e) & _PAGE_PSE) )
> +                    continue;
> +
> +                *pl2e = l2e_from_intpte(l2e_get_intpte(*pl2e) + xen_phys_start);
> +            }
> +        }
> +    }
> +
> +    /*
> +     * Walk the soon-to-be-used l2_xenmap[], relocating all the leaf mappings
> +     * so text/data/bss etc refer to the new location in memory.
> +     */
> +    pl2e = __va(__pa(l2_xenmap));
> +    for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++, pl2e++ )
> +    {
> +        if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) )
> +            continue;
> +
> +        /*
> +         * We don't have 4k mappings in l2_xenmap[] at this point in boot, nor
> +         * is this anticipated to change.  Simply assert the fact, rather than
> +         * introducing dead logic to decend into l1 tables.

Nit: "descend"?

> +         */
> +        ASSERT(l2e_get_flags(*pl2e) & _PAGE_PSE);

The warning about the use of printk() around here could make one think
that using ASSERT() (or BUG()) is similarly bad. However, aiui using
printk() isn't by itself a problem, just that the log message would be
lost as far as the circular buffer goes. The message would be observable
on the serial con sole though, at least with "sync_console". It is on
this basis that using ASSERT() here makes sense. Perhaps the printk()
related comment could be slightly adjusted to express this?

> +        *pl2e = l2e_from_intpte(l2e_get_intpte(*pl2e) + xen_phys_start);
> +    }
> +
> +    /*
> +     * Switch to relocated pagetables, shooting down global mappings as we go.
> +     */
> +    asm volatile (
> +        "mov    %%cr4, %[cr4]\n\t"
> +        "andb   $~%c[pge], %b[cr4]\n\t"
> +        "mov    %[cr4], %%cr4\n\t"     /* CR4.PGE = 0 */
> +        "mov    %[cr3], %%cr3\n\t"     /* CR3 = new pagetables */
> +        "orb    $%c[pge], %b[cr4]\n\t"

I understand you need %c to apply the ~ in the operand of ANDB above.
But could I talk you into keeping things as simple as possible here
by using plain %[pge]?

> +        "mov    %[cr4], %%cr4\n\t"     /* CR4.PGE = 1 */
> +        : [cr4] "=&a" (tmp) /* Could be "r", but "a" makes better asm */
> +        : [cr3] "r" (__pa(idle_pg_table)),
> +          [pge] "i" (X86_CR4_PGE)
> +        : "memory" );

The removed stack copying worries me, to be honest. Yes, for local
variables of ours it doesn't matter because they are about to go out
of scope. But what if the compiler instantiates any for its own use,
e.g. ...

> +    /*
> +     * End of the critical region.  Updates to locals and globals now work as
> +     * expected.
> +     *
> +     * Updates to local variables which have been spilled to the stack since
> +     * the memcpy() have been lost.  But we don't care, because they're all
> +     * going out of scope imminently.
> +     */
> +
> +    printk("New Xen image base address: %#lx\n", xen_phys_start);

... the result of the address calculation for the string literal
here? Such auxiliary calculations can happen at any point in the
function, and won't necessarily (hence the example chosen) become
impossible for the compiler to do with the memory clobber in the
asm(). And while the string literal's address is likely cheap
enough to calculate right in the function invocation sequence (and
an option would also be to retain the printk() in the caller),
other instrumentation options could be undermined by this as well.

Jan

Re: [PATCH v2-ish] x86/boot: Factor move_xen() out of __start_xen()
Posted by Andrew Cooper 1 year, 4 months ago
On 20/12/2022 1:51 pm, Jan Beulich wrote:
> On 16.12.2022 21:17, Andrew Cooper wrote:
>> Partly for clarity because there is a lot of subtle magic at work here.
>> Expand the commentary of what's going on.
>>
>> Also, because there is no need to double copy the stack (32kB) to retrieve
>> local values spilled to the stack under the old alias, when all of the
>> aforementioned local variables are going out of scope anyway.
>>
>> There is also a logic change when walking l2_xenmap[].  The old logic would
>> skip recursing into 4k mappings; this would corrupt Xen if it were used.
>> There are no 4k mappings in l2_xenmap[] this early on boot, so assert the
>> property instead of leaving a latent breakage.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>> CC: Wei Liu <wl@xen.org>
>>
>> We probably want to drop PGE from XEN_MINIMAL_CR4.  It will simplify several
>> boot paths which don't need to care about global pages, and PGE is conditional
>> anyway now with the PCID support.
> Perhaps, especially if - as you say - this allows for simplification.
>
>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -467,6 +467,113 @@ static void __init move_memory(
>>      }
>>  }
>>  
>> +static void __init noinline move_xen(void)
>> +{
>> +    l4_pgentry_t *pl4e;
>> +    l3_pgentry_t *pl3e;
>> +    l2_pgentry_t *pl2e;
>> +    unsigned long tmp;
>> +    int i, j, k;
> Can these become "unsigned int" please at this occasion?

Fixed.

> (Perhaps as
> another reason to make the change, mention that the old instances of i and
> j did shadow outer scope variables in the same function?)

Oh, so it does.  I'm slightly surprised that we haven't seen a compiler
objection from that.

>
>> +    /*
>> +     * The caller has selected xen_phys_start, ensuring that the old and new
>> +     * locations do not overlap.  Make it so.
> As a non-native speaker I'm struggling with what "Make it so" is supposed
> to tell me here.

"Actually do it"

I'll drop it.  It's not overly important to the understanding here.

>
>> +     * Prevent the compiler from reordering anything across this point.  Such
>> +     * things will end badly.
>> +     */
>> +    barrier();
>> +
>> +    /*
>> +     * Copy out of the current alias, into the directmap at the new location.
>> +     * This includes a snapshot of the current stack.
>> +     */
>> +    memcpy(__va(__pa(_start)), _start, _end - _start);
>> +
>> +    /*
>> +     * We are now in a critical region.  Any write (modifying a global,
>> +     * spilling a local to the stack, etc) via the current alias will get lost
>> +     * when we switch to the new pagetables.  This even means no printk()'s
>> +     * debugging purposes.
> Nit: Missing "for"?

Fixed.

>
>> +     * Walk the soon-to-be-used pagetables in the new location, relocating all
>> +     * intermediate (non-leaf) entries to point to their new-location
>> +     * equivalents.  All writes are via the directmap alias.
>> +     */
>> +    pl4e = __va(__pa(idle_pg_table));
>> +    for ( i = 0 ; i < L4_PAGETABLE_ENTRIES; i++, pl4e++ )
>> +    {
>> +        if ( !(l4e_get_flags(*pl4e) & _PAGE_PRESENT) )
>> +            continue;
>> +
>> +        *pl4e = l4e_from_intpte(l4e_get_intpte(*pl4e) + xen_phys_start);
>> +        pl3e = __va(l4e_get_paddr(*pl4e));
>> +        for ( j = 0; j < L3_PAGETABLE_ENTRIES; j++, pl3e++ )
>> +        {
>> +            if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) ||
>> +                 (l3e_get_flags(*pl3e) & _PAGE_PSE) )
>> +                continue;
>> +
>> +            *pl3e = l3e_from_intpte(l3e_get_intpte(*pl3e) + xen_phys_start);
>> +            pl2e = __va(l3e_get_paddr(*pl3e));
>> +            for ( k = 0; k < L2_PAGETABLE_ENTRIES; k++, pl2e++ )
>> +            {
>> +                if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) ||
>> +                     (l2e_get_flags(*pl2e) & _PAGE_PSE) )
>> +                    continue;
>> +
>> +                *pl2e = l2e_from_intpte(l2e_get_intpte(*pl2e) + xen_phys_start);
>> +            }
>> +        }
>> +    }
>> +
>> +    /*
>> +     * Walk the soon-to-be-used l2_xenmap[], relocating all the leaf mappings
>> +     * so text/data/bss etc refer to the new location in memory.
>> +     */
>> +    pl2e = __va(__pa(l2_xenmap));
>> +    for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++, pl2e++ )
>> +    {
>> +        if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) )
>> +            continue;
>> +
>> +        /*
>> +         * We don't have 4k mappings in l2_xenmap[] at this point in boot, nor
>> +         * is this anticipated to change.  Simply assert the fact, rather than
>> +         * introducing dead logic to decend into l1 tables.
> Nit: "descend"?

Fixed.

>
>> +         */
>> +        ASSERT(l2e_get_flags(*pl2e) & _PAGE_PSE);
> The warning about the use of printk() around here could make one think
> that using ASSERT() (or BUG()) is similarly bad. However, aiui using
> printk() isn't by itself a problem, just that the log message would be
> lost as far as the circular buffer goes. The message would be observable
> on the serial con sole though, at least with "sync_console". It is on
> this basis that using ASSERT() here makes sense. Perhaps the printk()
> related comment could be slightly adjusted to express this?

I did try to describe it, and it gets complicated, so I opted for the
simple approach.

Before the pagetable switch, it will operate "properly" and emit text
onto the screen, serial, etc.

After the pagetable switch, all Xen data will be lost, which includes
the main console buffer, but also things like the local state for Xen's
UART driver.  I gave up trying to reason what would happen at that point.


The ASSERT() is very much a best-effort attempt.  Anyone liable to trip
the assert is already working on the boot pagetable and ought to be aware.

>
>> +        *pl2e = l2e_from_intpte(l2e_get_intpte(*pl2e) + xen_phys_start);
>> +    }
>> +
>> +    /*
>> +     * Switch to relocated pagetables, shooting down global mappings as we go.
>> +     */
>> +    asm volatile (
>> +        "mov    %%cr4, %[cr4]\n\t"
>> +        "andb   $~%c[pge], %b[cr4]\n\t"
>> +        "mov    %[cr4], %%cr4\n\t"     /* CR4.PGE = 0 */
>> +        "mov    %[cr3], %%cr3\n\t"     /* CR3 = new pagetables */
>> +        "orb    $%c[pge], %b[cr4]\n\t"
> I understand you need %c to apply the ~ in the operand of ANDB above.
> But could I talk you into keeping things as simple as possible here
> by using plain %[pge]?

The $ is still useful in the second case, for consistency if nothing else.

Both of these instances disappear if PGE gets cleaned up on boot.

>
>> +        "mov    %[cr4], %%cr4\n\t"     /* CR4.PGE = 1 */
>> +        : [cr4] "=&a" (tmp) /* Could be "r", but "a" makes better asm */
>> +        : [cr3] "r" (__pa(idle_pg_table)),
>> +          [pge] "i" (X86_CR4_PGE)
>> +        : "memory" );
> The removed stack copying worries me, to be honest. Yes, for local
> variables of ours it doesn't matter because they are about to go out
> of scope. But what if the compiler instantiates any for its own use,
> e.g. ...
>
>> +    /*
>> +     * End of the critical region.  Updates to locals and globals now work as
>> +     * expected.
>> +     *
>> +     * Updates to local variables which have been spilled to the stack since
>> +     * the memcpy() have been lost.  But we don't care, because they're all
>> +     * going out of scope imminently.
>> +     */
>> +
>> +    printk("New Xen image base address: %#lx\n", xen_phys_start);
> ... the result of the address calculation for the string literal
> here? Such auxiliary calculations can happen at any point in the
> function, and won't necessarily (hence the example chosen) become
> impossible for the compiler to do with the memory clobber in the
> asm(). And while the string literal's address is likely cheap
> enough to calculate right in the function invocation sequence (and
> an option would also be to retain the printk() in the caller),
> other instrumentation options could be undermined by this as well.

Right now, the compiler is free to calculate the address of the string
literal in a register, and move it the other side of the TLB flush. 
This will work just fine.

However, the compiler cannot now, or ever in the future, spill such a
calculation to the stack.

Whether it's spelt "memory", or something else in the future, OSes
really do genuinely need a way of telling the compiler "you literally
cannot move anything the other side of this asm()", because otherwise
malfunctions will occur.

It's not the locals which worry me - we really do lose things like
coverage data in here.  Short of writing it fully in asm (which would be
irritating to maintain), I don't see any other option.

~Andrew
Re: [PATCH v2-ish] x86/boot: Factor move_xen() out of __start_xen()
Posted by Jan Beulich 1 year, 3 months ago
On 20.12.2022 21:56, Andrew Cooper wrote:
> On 20/12/2022 1:51 pm, Jan Beulich wrote:
>> On 16.12.2022 21:17, Andrew Cooper wrote:
>>> +        "mov    %[cr4], %%cr4\n\t"     /* CR4.PGE = 1 */
>>> +        : [cr4] "=&a" (tmp) /* Could be "r", but "a" makes better asm */
>>> +        : [cr3] "r" (__pa(idle_pg_table)),
>>> +          [pge] "i" (X86_CR4_PGE)
>>> +        : "memory" );
>> The removed stack copying worries me, to be honest. Yes, for local
>> variables of ours it doesn't matter because they are about to go out
>> of scope. But what if the compiler instantiates any for its own use,
>> e.g. ...
>>
>>> +    /*
>>> +     * End of the critical region.  Updates to locals and globals now work as
>>> +     * expected.
>>> +     *
>>> +     * Updates to local variables which have been spilled to the stack since
>>> +     * the memcpy() have been lost.  But we don't care, because they're all
>>> +     * going out of scope imminently.
>>> +     */
>>> +
>>> +    printk("New Xen image base address: %#lx\n", xen_phys_start);
>> ... the result of the address calculation for the string literal
>> here? Such auxiliary calculations can happen at any point in the
>> function, and won't necessarily (hence the example chosen) become
>> impossible for the compiler to do with the memory clobber in the
>> asm(). And while the string literal's address is likely cheap
>> enough to calculate right in the function invocation sequence (and
>> an option would also be to retain the printk() in the caller),
>> other instrumentation options could be undermined by this as well.
> 
> Right now, the compiler is free to calculate the address of the string
> literal in a register, and move it the other side of the TLB flush. 
> This will work just fine.
> 
> However, the compiler cannot now, or ever in the future, spill such a
> calculation to the stack.

I'm afraid the compiler's view at things is different: Whatever it puts
on the stack is viewed as virtual registers, unaffected by a memory
clobber (of course there can be effects resulting from uses of named
variables). Look at -O3 output of gcc12 (which is what I happened to
play with; I don't think it's overly version dependent) for this
(clearly contrived) piece of code:

int __attribute__((const)) calc(int);

int test(int i) {
	int j = calc(i);

	asm("nopl %0" : "+m" (j));
	asm("nopq %%rsp" ::: "memory", "ax", "cx", "dx", "bx", "bp", "si", "di",
	                     "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15");
	j = calc(i);
	asm("nopl %0" :: "m" (j));

	return j;
}

It instantiates a local on the stack for the result of calc(); it does
not re-invoke calc() a 2nd time. Which means the memory clobber in the
middle asm() does not affect that (and by implication: any) stack slot.

Using godbolt I can also see that clang15 agrees with gcc12 in this
regard. I didn't bother trying other versions.

> Whether it's spelt "memory", or something else in the future, OSes
> really do genuinely need a way of telling the compiler "you literally
> cannot move anything the other side of this asm()", because otherwise
> malfunctions will occur.

Something like this may indeed be desirable in situations like this one,
but I don't think such a construct exists.

Jan

Re: [PATCH v2-ish] x86/boot: Factor move_xen() out of __start_xen()
Posted by Andrew Cooper 1 year, 3 months ago
On 21/12/2022 7:40 am, Jan Beulich wrote:
> On 20.12.2022 21:56, Andrew Cooper wrote:
>> On 20/12/2022 1:51 pm, Jan Beulich wrote:
>>> On 16.12.2022 21:17, Andrew Cooper wrote:
>>>> +        "mov    %[cr4], %%cr4\n\t"     /* CR4.PGE = 1 */
>>>> +        : [cr4] "=&a" (tmp) /* Could be "r", but "a" makes better asm */
>>>> +        : [cr3] "r" (__pa(idle_pg_table)),
>>>> +          [pge] "i" (X86_CR4_PGE)
>>>> +        : "memory" );
>>> The removed stack copying worries me, to be honest. Yes, for local
>>> variables of ours it doesn't matter because they are about to go out
>>> of scope. But what if the compiler instantiates any for its own use,
>>> e.g. ...
>>>
>>>> +    /*
>>>> +     * End of the critical region.  Updates to locals and globals now work as
>>>> +     * expected.
>>>> +     *
>>>> +     * Updates to local variables which have been spilled to the stack since
>>>> +     * the memcpy() have been lost.  But we don't care, because they're all
>>>> +     * going out of scope imminently.
>>>> +     */
>>>> +
>>>> +    printk("New Xen image base address: %#lx\n", xen_phys_start);
>>> ... the result of the address calculation for the string literal
>>> here? Such auxiliary calculations can happen at any point in the
>>> function, and won't necessarily (hence the example chosen) become
>>> impossible for the compiler to do with the memory clobber in the
>>> asm(). And while the string literal's address is likely cheap
>>> enough to calculate right in the function invocation sequence (and
>>> an option would also be to retain the printk() in the caller),
>>> other instrumentation options could be undermined by this as well.
>> Right now, the compiler is free to calculate the address of the string
>> literal in a register, and move it the other side of the TLB flush. 
>> This will work just fine.
>>
>> However, the compiler cannot now, or ever in the future, spill such a
>> calculation to the stack.
> I'm afraid the compiler's view at things is different: Whatever it puts
> on the stack is viewed as virtual registers, unaffected by a memory
> clobber (of course there can be effects resulting from uses of named
> variables). Look at -O3 output of gcc12 (which is what I happened to
> play with; I don't think it's overly version dependent) for this
> (clearly contrived) piece of code:
>
> int __attribute__((const)) calc(int);
>
> int test(int i) {
> 	int j = calc(i);
>
> 	asm("nopl %0" : "+m" (j));
> 	asm("nopq %%rsp" ::: "memory", "ax", "cx", "dx", "bx", "bp", "si", "di",
> 	                     "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15");
> 	j = calc(i);
> 	asm("nopl %0" :: "m" (j));
>
> 	return j;
> }
>
> It instantiates a local on the stack for the result of calc(); it does
> not re-invoke calc() a 2nd time. Which means the memory clobber in the
> middle asm() does not affect that (and by implication: any) stack slot.
>
> Using godbolt I can also see that clang15 agrees with gcc12 in this
> regard. I didn't bother trying other versions.

Well this is problematic, because it contradicts what we depend on
asm("":::"memory") doing...

https://godbolt.org/z/xeGMc3sM9

But I don't fully agree with the conclusions drawn by this example.

It only instantiates a local on the stack because you force a memory
operand to satisfy the "m" constraints, not to satisfy the "memory"
constraint.

By declaring calc as const, you are permitting the compiler to make an
explicit transformation to delete one of the calls, irrespective of
anything else in the function.

It is weird that 'j' ends up taking two stack slots when would be
absolutely fine for it to only have 1, and indeed this is what happens
when you remove the first and third asm()'s.  It is these which force
'j' to be on the stack, not the memory clobber in the middle.

Observe that after commenting those two out, Clang transforms things to
spill 'i' onto the stack, rather than 'j', and then tail-call calc() on
the way out.  This is actually deleting the first calc() call, rather
than the second.

~Andrew
Re: [PATCH v2-ish] x86/boot: Factor move_xen() out of __start_xen()
Posted by Jan Beulich 1 year, 3 months ago
On 04.01.2023 21:04, Andrew Cooper wrote:
> On 21/12/2022 7:40 am, Jan Beulich wrote:
>> On 20.12.2022 21:56, Andrew Cooper wrote:
>>> On 20/12/2022 1:51 pm, Jan Beulich wrote:
>>>> On 16.12.2022 21:17, Andrew Cooper wrote:
>>>>> +        "mov    %[cr4], %%cr4\n\t"     /* CR4.PGE = 1 */
>>>>> +        : [cr4] "=&a" (tmp) /* Could be "r", but "a" makes better asm */
>>>>> +        : [cr3] "r" (__pa(idle_pg_table)),
>>>>> +          [pge] "i" (X86_CR4_PGE)
>>>>> +        : "memory" );
>>>> The removed stack copying worries me, to be honest. Yes, for local
>>>> variables of ours it doesn't matter because they are about to go out
>>>> of scope. But what if the compiler instantiates any for its own use,
>>>> e.g. ...
>>>>
>>>>> +    /*
>>>>> +     * End of the critical region.  Updates to locals and globals now work as
>>>>> +     * expected.
>>>>> +     *
>>>>> +     * Updates to local variables which have been spilled to the stack since
>>>>> +     * the memcpy() have been lost.  But we don't care, because they're all
>>>>> +     * going out of scope imminently.
>>>>> +     */
>>>>> +
>>>>> +    printk("New Xen image base address: %#lx\n", xen_phys_start);
>>>> ... the result of the address calculation for the string literal
>>>> here? Such auxiliary calculations can happen at any point in the
>>>> function, and won't necessarily (hence the example chosen) become
>>>> impossible for the compiler to do with the memory clobber in the
>>>> asm(). And while the string literal's address is likely cheap
>>>> enough to calculate right in the function invocation sequence (and
>>>> an option would also be to retain the printk() in the caller),
>>>> other instrumentation options could be undermined by this as well.
>>> Right now, the compiler is free to calculate the address of the string
>>> literal in a register, and move it the other side of the TLB flush. 
>>> This will work just fine.
>>>
>>> However, the compiler cannot now, or ever in the future, spill such a
>>> calculation to the stack.
>> I'm afraid the compiler's view at things is different: Whatever it puts
>> on the stack is viewed as virtual registers, unaffected by a memory
>> clobber (of course there can be effects resulting from uses of named
>> variables). Look at -O3 output of gcc12 (which is what I happened to
>> play with; I don't think it's overly version dependent) for this
>> (clearly contrived) piece of code:
>>
>> int __attribute__((const)) calc(int);
>>
>> int test(int i) {
>> 	int j = calc(i);
>>
>> 	asm("nopl %0" : "+m" (j));
>> 	asm("nopq %%rsp" ::: "memory", "ax", "cx", "dx", "bx", "bp", "si", "di",
>> 	                     "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15");
>> 	j = calc(i);
>> 	asm("nopl %0" :: "m" (j));
>>
>> 	return j;
>> }
>>
>> It instantiates a local on the stack for the result of calc(); it does
>> not re-invoke calc() a 2nd time. Which means the memory clobber in the
>> middle asm() does not affect that (and by implication: any) stack slot.
>>
>> Using godbolt I can also see that clang15 agrees with gcc12 in this
>> regard. I didn't bother trying other versions.
> 
> Well this is problematic, because it contradicts what we depend on
> asm("":::"memory") doing...

Does it? I'm unaware of instances where we use "memory" to deal with
local variables.

> https://godbolt.org/z/xeGMc3sM9
> 
> But I don't fully agree with the conclusions drawn by this example.
> 
> It only instantiates a local on the stack because you force a memory
> operand to satisfy the "m" constraints, not to satisfy the "memory"
> constraint.

Sure, all I wanted to show is that the compiler may make such
transformations. As said, the example is clearly contrived, but I
also didn't want to spend too much time on finding a yet better one.

> By declaring calc as const, you are permitting the compiler to make an
> explicit transformation to delete one of the calls, irrespective of
> anything else in the function.
> 
> It is weird that 'j' ends up taking two stack slots when would be
> absolutely fine for it to only have 1, and indeed this is what happens
> when you remove the first and third asm()'s.  It is these which force
> 'j' to be on the stack, not the memory clobber in the middle.
> 
> Observe that after commenting those two out, Clang transforms things to
> spill 'i' onto the stack, rather than 'j', and then tail-call calc() on
> the way out.  This is actually deleting the first calc() call, rather
> than the second.

Which would still demonstrate what I wanted to demonstrate - we can't
assume that "memory" guards against the use of stack locals by the
compiler.

Jan