[PATCH] x86/PV: assert page state in mark_pv_pt_pages_rdonly()

Jan Beulich posted 1 patch 2 years, 8 months ago
Test gitlab-ci failed
Patches applied successfully (tree, apply log)
git fetch https://gitlab.com/xen-project/patchew/xen tags/patchew/cbe46564-74ab-af38-7e31-2f0a3f46ab41@suse.com
There is a newer version of this series
[PATCH] x86/PV: assert page state in mark_pv_pt_pages_rdonly()
Posted by Jan Beulich 2 years, 8 months ago
About every time I look at dom0_construct_pv()'s "calculation" of
nr_pt_pages I question (myself) whether the result is precise or merely
an upper bound. I think it is meant to be precise, but I think we would
be better off having some checking in place. Hence add ASSERT()s to
verify that
- all pages have a valid L1...Ln (currently L4) page table type and
- no other bits are set, in particular the type refcount is still zero.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
There are (at least) two factors supporting my uncertainty about the
"calculation" being precise: The loop starting from 2 (which clearly is
too small for a possible result) and an apparently wrong comment stating
that not only v_end but also v_start would be superpage aligned (in fact
v_end is 4MiB aligned, which is the superpage size only on long
abandoned [by us] non-PAE x86-32).

--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -59,6 +59,10 @@ static __init void mark_pv_pt_pages_rdon
         l1e_remove_flags(*pl1e, _PAGE_RW);
         page = mfn_to_page(l1e_get_mfn(*pl1e));
 
+        ASSERT(page->u.inuse.type_info & PGT_type_mask);
+        ASSERT((page->u.inuse.type_info & PGT_type_mask) <= PGT_root_page_table);
+        ASSERT(!(page->u.inuse.type_info & ~PGT_type_mask));
+
         /* Read-only mapping + PGC_allocated + page-table page. */
         page->count_info         = PGC_allocated | 3;
         page->u.inuse.type_info |= PGT_validated | 1;


Re: [PATCH] x86/PV: assert page state in mark_pv_pt_pages_rdonly()
Posted by Andrew Cooper 2 years, 8 months ago
On 16/08/2021 16:29, Jan Beulich wrote:
> About every time I look at dom0_construct_pv()'s "calculation" of
> nr_pt_pages I question (myself) whether the result is precise or merely
> an upper bound. I think it is meant to be precise, but I think we would
> be better off having some checking in place. Hence add ASSERT()s to
> verify that
> - all pages have a valid L1...Ln (currently L4) page table type and
> - no other bits are set, in particular the type refcount is still zero.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> There are (at least) two factors supporting my uncertainty about the
> "calculation" being precise: The loop starting from 2 (which clearly is
> too small for a possible result)

2 was the correct absolute minimum for 2-level guests.

XTF kernels don't exceed the 2M boundary (at least, not currently), so
they can be mapped with only 3 or 4 pagetables, except:

* 3-level guests are created with 4 L2's for no obvious reason.  This is
nothing to do with legacy PAE paging, nor with how a typical Linux/BSD
kernel works.  The requirement to make 3-level guests work (and even
then, only under 32bit Xen) is to create a PGT_pae_xen_l2 if not already
covered by the other mappings.  Any non-toy kernel discards these
pagetables in favour of its own idea of pagetables.

* v_end is rounded up to 4MB.

Most XTF guests will operate entirely happily in a few hundred kb of
space, and the same will be true of other microservices.  The rounding
up of memory might be helpful for the traditional big VMs case, but it
isn't correct or useful for other usecases.

> and an apparently wrong comment stating
> that not only v_end but also v_start would be superpage aligned

Which comment?  The only one I see about 4M has nothing to do with
superpages.

>  (in fact
> v_end is 4MiB aligned, which is the superpage size only on long
> abandoned [by us] non-PAE x86-32).

Tangentially, that code needs some serious work to use ROUNDUP/DOWN
macros for clarity.

>
> --- a/xen/arch/x86/pv/dom0_build.c
> +++ b/xen/arch/x86/pv/dom0_build.c
> @@ -59,6 +59,10 @@ static __init void mark_pv_pt_pages_rdon
>          l1e_remove_flags(*pl1e, _PAGE_RW);
>          page = mfn_to_page(l1e_get_mfn(*pl1e));
>  
> +        ASSERT(page->u.inuse.type_info & PGT_type_mask);
> +        ASSERT((page->u.inuse.type_info & PGT_type_mask) <= PGT_root_page_table);

This is an obfuscated

ASSERT((page->u.inuse.type_info & PGT_type_mask) >= PGT_l1_page_table &&
       (page->u.inuse.type_info & PGT_type_mask) <= PGT_root_page_table);

and

> +        ASSERT(!(page->u.inuse.type_info & ~PGT_type_mask));

this has no context.

At a bare minimum, you need a comment stating what properties we're
looking for, so anyone suffering an assertion failure has some clue as
to what may have gone wrong.

~Andrew


Re: [PATCH] x86/PV: assert page state in mark_pv_pt_pages_rdonly()
Posted by Jan Beulich 2 years, 8 months ago
On 16.08.2021 21:25, Andrew Cooper wrote:
> On 16/08/2021 16:29, Jan Beulich wrote:
>> About every time I look at dom0_construct_pv()'s "calculation" of
>> nr_pt_pages I question (myself) whether the result is precise or merely
>> an upper bound. I think it is meant to be precise, but I think we would
>> be better off having some checking in place. Hence add ASSERT()s to
>> verify that
>> - all pages have a valid L1...Ln (currently L4) page table type and
>> - no other bits are set, in particular the type refcount is still zero.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> There are (at least) two factors supporting my uncertainty about the
>> "calculation" being precise: The loop starting from 2 (which clearly is
>> too small for a possible result)
> 
> 2 was the correct absolute minimum for 2-level guests.

Which has been history for how many years? The minimum for the
current implementation is 4 afaict, and ...

> XTF kernels don't exceed the 2M boundary (at least, not currently), so
> they can be mapped with only 3 or 4 pagetables, except:
> 
> * 3-level guests are created with 4 L2's for no obvious reason.  This is
> nothing to do with legacy PAE paging, nor with how a typical Linux/BSD
> kernel works.  The requirement to make 3-level guests work (and even
> then, only under 32bit Xen) is to create a PGT_pae_xen_l2 if not already
> covered by the other mappings.  Any non-toy kernel discards these
> pagetables in favour of its own idea of pagetables.

... could be 3 for 32-bit Dom0.

> * v_end is rounded up to 4MB.
> 
> Most XTF guests will operate entirely happily in a few hundred kb of
> space, and the same will be true of other microservices.  The rounding
> up of memory might be helpful for the traditional big VMs case, but it
> isn't correct or useful for other usecases.
> 
>> and an apparently wrong comment stating
>> that not only v_end but also v_start would be superpage aligned
> 
> Which comment?  The only one I see about 4M has nothing to do with
> superpages.

The one immediately ahead of the related variable declarations:

    /*
     * This fully describes the memory layout of the initial domain. All
     * *_start address are page-aligned, except v_start (and v_end) which are
     * superpage-aligned.
     */

I see nothing forcing v_start to be superpage-aligned, while I
do suspect that the "calculation" of the number of page tables
will be wrong when it isn't.

>>  (in fact
>> v_end is 4MiB aligned, which is the superpage size only on long
>> abandoned [by us] non-PAE x86-32).
> 
> Tangentially, that code needs some serious work to use ROUNDUP/DOWN
> macros for clarity.

Agreed.

>> --- a/xen/arch/x86/pv/dom0_build.c
>> +++ b/xen/arch/x86/pv/dom0_build.c
>> @@ -59,6 +59,10 @@ static __init void mark_pv_pt_pages_rdon
>>          l1e_remove_flags(*pl1e, _PAGE_RW);
>>          page = mfn_to_page(l1e_get_mfn(*pl1e));
>>  
>> +        ASSERT(page->u.inuse.type_info & PGT_type_mask);
>> +        ASSERT((page->u.inuse.type_info & PGT_type_mask) <= PGT_root_page_table);
> 
> This is an obfuscated
> 
> ASSERT((page->u.inuse.type_info & PGT_type_mask) >= PGT_l1_page_table &&
>        (page->u.inuse.type_info & PGT_type_mask) <= PGT_root_page_table);

I can certainly switch to this yet longer piece of code, and ...

> and
> 
>> +        ASSERT(!(page->u.inuse.type_info & ~PGT_type_mask));
> 
> this has no context.
> 
> At a bare minimum, you need a comment stating what properties we're
> looking for, so anyone suffering an assertion failure has some clue as
> to what may have gone wrong.

... I can certainly transform the respective parts of the
description into a code comment.

Jan


Re: [PATCH] x86/PV: assert page state in mark_pv_pt_pages_rdonly()
Posted by Andrew Cooper 2 years, 8 months ago
On 17/08/2021 09:54, Jan Beulich wrote:
> On 16.08.2021 21:25, Andrew Cooper wrote:
>> On 16/08/2021 16:29, Jan Beulich wrote:
>>> About every time I look at dom0_construct_pv()'s "calculation" of
>>> nr_pt_pages I question (myself) whether the result is precise or merely
>>> an upper bound. I think it is meant to be precise, but I think we would
>>> be better off having some checking in place. Hence add ASSERT()s to
>>> verify that
>>> - all pages have a valid L1...Ln (currently L4) page table type and
>>> - no other bits are set, in particular the type refcount is still zero.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> ---
>>> There are (at least) two factors supporting my uncertainty about the
>>> "calculation" being precise: The loop starting from 2 (which clearly is
>>> too small for a possible result)
>> 2 was the correct absolute minimum for 2-level guests.
> Which has been history for how many years?

Yeah, but I'd expect to phrase that as "a remnant of 32bit non-PAE guests".

> The minimum for the current implementation is 4 afaict,

I'm not sure the monitor table for PV32 is intended to count.

>  and ...
>
>> XTF kernels don't exceed the 2M boundary (at least, not currently), so
>> they can be mapped with only 3 or 4 pagetables, except:
>>
>> * 3-level guests are created with 4 L2's for no obvious reason.  This is
>> nothing to do with legacy PAE paging, nor with how a typical Linux/BSD
>> kernel works.  The requirement to make 3-level guests work (and even
>> then, only under 32bit Xen) is to create a PGT_pae_xen_l2 if not already
>> covered by the other mappings.  Any non-toy kernel discards these
>> pagetables in favour of its own idea of pagetables.
> ... could be 3 for 32-bit Dom0.

Right, and starting from (compat ? 6 : 4) is probably a good move, along
with an explanation of these totally magic numbers.


And now I've thought about this some more, I'm pretty certain we do
better than an "start at 2, inc 1 and retry" loop.  We can calculate the
pagetables needed for the virtual layout statically, possibly even
including the default 6/4, and use that for a lower bound.  At most, we
need to loop once per possibly-unpopulated pagetable level (so 1 for
32bit, 3 for 64bit) to cover the cases of extra pagetables tipping over
a page size boundary.

>
>> * v_end is rounded up to 4MB.
>>
>> Most XTF guests will operate entirely happily in a few hundred kb of
>> space, and the same will be true of other microservices.  The rounding
>> up of memory might be helpful for the traditional big VMs case, but it
>> isn't correct or useful for other usecases.
>>
>>> and an apparently wrong comment stating
>>> that not only v_end but also v_start would be superpage aligned
>> Which comment?  The only one I see about 4M has nothing to do with
>> superpages.
> The one immediately ahead of the related variable declarations:
>
>     /*
>      * This fully describes the memory layout of the initial domain. All
>      * *_start address are page-aligned, except v_start (and v_end) which are
>      * superpage-aligned.
>      */
>
> I see nothing forcing v_start to be superpage-aligned, while I
> do suspect that the "calculation" of the number of page tables
> will be wrong when it isn't.

This is an XTF test booting as dom0

(d2) (XEN) *** Building a PV Dom0 ***
(d2) (XEN) ELF: phdr: paddr=0x100000 memsz=0x12000
(d2) (XEN) ELF: memory: 0x100000 -> 0x112000
(d2) (XEN) ELF: note: GUEST_OS = "XTF"
(d2) (XEN) ELF: note: GUEST_VERSION = "0"
(d2) (XEN) ELF: note: LOADER = "generic"
(d2) (XEN) ELF: note: HYPERCALL_PAGE = 0x106000
(d2) (XEN) ELF: note: XEN_VERSION = "xen-3.0"
(d2) (XEN) ELF: note: FEATURES = "!writable_page_tables|pae_pgdir_above_4gb"
(d2) (XEN) ELF: note: PAE_MODE = "yes"
(d2) (XEN) ELF: using notes from SHT_NOTE section
(d2) (XEN) ELF: VIRT_BASE unset, using 0
(d2) (XEN) ELF_PADDR_OFFSET unset, using 0
(d2) (XEN) ELF: addresses:
(d2) (XEN)     virt_base        = 0x0
(d2) (XEN)     elf_paddr_offset = 0x0
(d2) (XEN)     virt_offset      = 0x0
(d2) (XEN)     virt_kstart      = 0x100000
(d2) (XEN)     virt_kend        = 0x112000
(d2) (XEN)     virt_entry       = 0x100000
(d2) (XEN)     p2m_base         = 0xffffffffffffffff
(d2) (XEN)  Xen  kernel: 64-bit, lsb, compat32
(d2) (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x100000 -> 0x112000
(d2) (XEN) PHYSICAL MEMORY ARRANGEMENT:
(d2) (XEN)  Dom0 alloc.:   000000003e800000->000000003ec00000 (240731
pages to be allocated)
(d2) (XEN) VIRTUAL MEMORY ARRANGEMENT:
(d2) (XEN)  Loaded kernel: 0000000000100000->0000000000112000
(d2) (XEN)  Init. ramdisk: 0000000000112000->0000000000112000
(d2) (XEN)  Phys-Mach map: 0000000000112000->00000000002ea2d8
(d2) (XEN)  Start info:    00000000002eb000->00000000002eb4b8
(d2) (XEN)  Xenstore ring: 0000000000000000->0000000000000000
(d2) (XEN)  Console ring:  0000000000000000->0000000000000000
(d2) (XEN)  Page tables:   00000000002ec000->00000000002f1000
(d2) (XEN)  Boot stack:    00000000002f1000->00000000002f2000
(d2) (XEN)  TOTAL:         0000000000000000->0000000000400000
(d2) (XEN)  ENTRY ADDRESS: 0000000000100000

It would appear that v_start comes directly and unmodified from the
VIRT_BASE ELF note.

Other observations:  The ramdisk start/end aren't zero despite one being
absent, and the M2P and start info ends aren't aligned.

>>>  (in fact
>>> v_end is 4MiB aligned, which is the superpage size only on long
>>> abandoned [by us] non-PAE x86-32).
>> Tangentially, that code needs some serious work to use ROUNDUP/DOWN
>> macros for clarity.
> Agreed.
>
>>> --- a/xen/arch/x86/pv/dom0_build.c
>>> +++ b/xen/arch/x86/pv/dom0_build.c
>>> @@ -59,6 +59,10 @@ static __init void mark_pv_pt_pages_rdon
>>>          l1e_remove_flags(*pl1e, _PAGE_RW);
>>>          page = mfn_to_page(l1e_get_mfn(*pl1e));
>>>  
>>> +        ASSERT(page->u.inuse.type_info & PGT_type_mask);
>>> +        ASSERT((page->u.inuse.type_info & PGT_type_mask) <= PGT_root_page_table);
>> This is an obfuscated
>>
>> ASSERT((page->u.inuse.type_info & PGT_type_mask) >= PGT_l1_page_table &&
>>        (page->u.inuse.type_info & PGT_type_mask) <= PGT_root_page_table);
> I can certainly switch to this yet longer piece of code,

Improved clarity is substantially more important than conciseness.

>  and ...
>
>> and
>>
>>> +        ASSERT(!(page->u.inuse.type_info & ~PGT_type_mask));
>> this has no context.
>>
>> At a bare minimum, you need a comment stating what properties we're
>> looking for, so anyone suffering an assertion failure has some clue as
>> to what may have gone wrong.
> ... I can certainly transform the respective parts of the
> description into a code comment.

Thanks.  It doesn't need to be much, but it does need to be something.

~Andrew


Re: [PATCH] x86/PV: assert page state in mark_pv_pt_pages_rdonly()
Posted by Jan Beulich 2 years, 8 months ago
On 17.08.2021 12:30, Andrew Cooper wrote:
> On 17/08/2021 09:54, Jan Beulich wrote:
>> On 16.08.2021 21:25, Andrew Cooper wrote:
>>> On 16/08/2021 16:29, Jan Beulich wrote:
>>>> and an apparently wrong comment stating
>>>> that not only v_end but also v_start would be superpage aligned
>>> Which comment?  The only one I see about 4M has nothing to do with
>>> superpages.
>> The one immediately ahead of the related variable declarations:
>>
>>     /*
>>      * This fully describes the memory layout of the initial domain. All
>>      * *_start address are page-aligned, except v_start (and v_end) which are
>>      * superpage-aligned.
>>      */
>>
>> I see nothing forcing v_start to be superpage-aligned, while I
>> do suspect that the "calculation" of the number of page tables
>> will be wrong when it isn't.
> 
> This is an XTF test booting as dom0
> 
> (d2) (XEN) *** Building a PV Dom0 ***
> (d2) (XEN) ELF: phdr: paddr=0x100000 memsz=0x12000
> (d2) (XEN) ELF: memory: 0x100000 -> 0x112000
> (d2) (XEN) ELF: note: GUEST_OS = "XTF"
> (d2) (XEN) ELF: note: GUEST_VERSION = "0"
> (d2) (XEN) ELF: note: LOADER = "generic"
> (d2) (XEN) ELF: note: HYPERCALL_PAGE = 0x106000
> (d2) (XEN) ELF: note: XEN_VERSION = "xen-3.0"
> (d2) (XEN) ELF: note: FEATURES = "!writable_page_tables|pae_pgdir_above_4gb"
> (d2) (XEN) ELF: note: PAE_MODE = "yes"
> (d2) (XEN) ELF: using notes from SHT_NOTE section
> (d2) (XEN) ELF: VIRT_BASE unset, using 0
> (d2) (XEN) ELF_PADDR_OFFSET unset, using 0
> (d2) (XEN) ELF: addresses:
> (d2) (XEN)     virt_base        = 0x0
> (d2) (XEN)     elf_paddr_offset = 0x0
> (d2) (XEN)     virt_offset      = 0x0
> (d2) (XEN)     virt_kstart      = 0x100000
> (d2) (XEN)     virt_kend        = 0x112000
> (d2) (XEN)     virt_entry       = 0x100000
> (d2) (XEN)     p2m_base         = 0xffffffffffffffff
> (d2) (XEN)  Xen  kernel: 64-bit, lsb, compat32
> (d2) (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x100000 -> 0x112000
> (d2) (XEN) PHYSICAL MEMORY ARRANGEMENT:
> (d2) (XEN)  Dom0 alloc.:   000000003e800000->000000003ec00000 (240731
> pages to be allocated)
> (d2) (XEN) VIRTUAL MEMORY ARRANGEMENT:
> (d2) (XEN)  Loaded kernel: 0000000000100000->0000000000112000
> (d2) (XEN)  Init. ramdisk: 0000000000112000->0000000000112000
> (d2) (XEN)  Phys-Mach map: 0000000000112000->00000000002ea2d8
> (d2) (XEN)  Start info:    00000000002eb000->00000000002eb4b8
> (d2) (XEN)  Xenstore ring: 0000000000000000->0000000000000000
> (d2) (XEN)  Console ring:  0000000000000000->0000000000000000
> (d2) (XEN)  Page tables:   00000000002ec000->00000000002f1000
> (d2) (XEN)  Boot stack:    00000000002f1000->00000000002f2000
> (d2) (XEN)  TOTAL:         0000000000000000->0000000000400000
> (d2) (XEN)  ENTRY ADDRESS: 0000000000100000
> 
> It would appear that v_start comes directly and unmodified from the
> VIRT_BASE ELF note.
> 
> Other observations:  The ramdisk start/end aren't zero despite one being
> absent, and the M2P and start info ends aren't aligned.

I've already dealt with this ramdisk aspect in the v2 patch altering
the printing. For *_end there's nothing wrong to be misaligned (as
per the comment in question) - only v_end has a statement.

Jan