When doing a live update, Xen needs to be very careful not to scribble on pages which contain guest memory or state information for the domains which are being preserved.

The information about which pages are in use is contained in the live update state passed from the previous Xen — which is mostly just a guest-transparent live migration data stream, except that it points to the page tables in place in memory while traditional live migration obviously copies the pages separately.

Our initial implementation actually prepended a list of 'in-use' ranges to the live update state, and made the boot allocator treat them the same as 'bad pages'. That worked well enough for initial development but wouldn't scale to a live production system, mainly because the boot allocator has a limit of 512 memory ranges that it can keep track of, and a real system would end up more fragmented than that.

My other concern with that approach is that it required two passes over the domain-owned pages. We have to do a later pass *anyway*, as we set up ownership in the frametable for each page — and that has to happen after we've managed to allocate a 'struct domain' for each page_info to point to. If we want to keep the pause time due to a live update down to a bare minimum, doing two passes over the full set of domain pages isn't my favourite strategy.

So we've settled on a simpler approach — reserve a contiguous region of physical memory which *won't* be used for domain pages. Let the boot allocator see *only* that region of memory, and plug the rest of the memory in later only after doing a full pass of the live update state.

This means that we have to ensure the reserved region is large enough, but ultimately we had that problem either way — even if we were processing the actual free ranges, if the page_info grew and we didn't have enough contiguous space for the new frametable we were hosed anyway.

So the straw man patch ends up being really simple, as a seed for bikeshedding.
Just take a 'liveupdate=' region on the command line, which kexec(8) can find from the running Xen. The initial Xen needs to ensure that it *won't* allocate any pages from that range which will subsequently need to be preserved across live update, which isn't done yet. We just need to make sure that any page which might be given to share_xen_page_with_guest() is allocated appropriately.

The part which actually hands over the live update state isn't included yet, so this really does just *defer* the addition of the memory until a little bit later in __start_xen(). Actually taking ranges out of it will come later.

David Woodhouse (3):
  x86/setup: Don't skip 2MiB underneath relocated Xen image
  x86/boot: Reserve live update boot memory
  Add KEXEC_RANGE_MA_LIVEUPDATE

 xen/arch/x86/machine_kexec.c |  15 ++++--
 xen/arch/x86/setup.c         | 122 +++++++++++++++++++++++++++++++++++++++----
 xen/include/public/kexec.h   |   1 +
 3 files changed, 124 insertions(+), 14 deletions(-)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
On Wed, 2020-01-08 at 17:24 +0000, David Woodhouse wrote:
> When doing a live update, Xen needs to be very careful not to scribble
> on pages which contain guest memory or state information for the
> domains which are being preserved.
>
> The information about which pages are in use is contained in the live
> update state passed from the previous Xen — which is mostly just a
> guest-transparent live migration data stream, except that it points to
> the page tables in place in memory while traditional live migration
> obviously copies the pages separately.
>
> Our initial implementation actually prepended a list of 'in-use' ranges
> to the live update state, and made the boot allocator treat them the
> same as 'bad pages'. That worked well enough for initial development
> but wouldn't scale to a live production system, mainly because the boot
> allocator has a limit of 512 memory ranges that it can keep track of,
> and a real system would end up more fragmented than that.
>
> My other concern with that approach is that it required two passes over
> the domain-owned pages. We have to do a later pass *anyway*, as we set
> up ownership in the frametable for each page — and that has to happen
> after we've managed to allocate a 'struct domain' for each page_info to
> point to. If we want to keep the pause time due to a live update down
> to a bare minimum, doing two passes over the full set of domain pages
> isn't my favourite strategy.
>
> So we've settled on a simpler approach — reserve a contiguous region
> of physical memory which *won't* be used for domain pages. Let the boot
> allocator see *only* that region of memory, and plug the rest of the
> memory in later only after doing a full pass of the live update state.
>
> This means that we have to ensure the reserved region is large enough,
> but ultimately we had that problem either way — even if we were
> processing the actual free ranges, if the page_info grew and we didn't
> have enough contiguous space for the new frametable we were hosed
> anyway.
>
> So the straw man patch ends up being really simple, as a seed for
> bikeshedding. Just take a 'liveupdate=' region on the command line,
> which kexec(8) can find from the running Xen. The initial Xen needs to
> ensure that it *won't* allocate any pages from that range which will
> subsequently need to be preserved across live update, which isn't done
> yet. We just need to make sure that any page which might be given to
> share_xen_page_with_guest() is allocated appropriately.
>
> The part which actually hands over the live update state isn't included
> yet, so this really does just *defer* the addition of the memory until
> a little bit later in __start_xen(). Actually taking ranges out of it
> will come later.

What isn't addressed in this series is actually *honouring* the promise not to put pages into the reserved LU bootmem region that need to be preserved over live update. As things stand, we just add them to the heap anyway in end_boot_allocator().

It isn't even sufficient to use these pages for xenheap allocations and not domheap, since there are cases where we allocate from the xenheap and then share pages to a domain.

Hongyan's patches to kill the directmap have already started addressing a bunch of the places that do that, so what I'm inclined to do in the short term is just *not* use the remaining space in the reserved LU bootmem region. Use it for boot time allocations (including the frametable) only, and *not* insert the rest of those pages into the heap allocator in end_boot_allocator() for now. If sized appropriately, there shouldn't be much wastage anyway.
We can refine it and ensure that we can use those pages but *not* for domain allocations, once the dust has settled on the directmap removal.
Hi David,

On 13/01/2020 11:54, David Woodhouse wrote:
> On Wed, 2020-01-08 at 17:24 +0000, David Woodhouse wrote:
>> When doing a live update, Xen needs to be very careful not to scribble
>> on pages which contain guest memory or state information for the
>> domains which are being preserved.
>>
>> The information about which pages are in use is contained in the live
>> update state passed from the previous Xen — which is mostly just a
>> guest-transparent live migration data stream, except that it points to
>> the page tables in place in memory while traditional live migration
>> obviously copies the pages separately.
>>
>> Our initial implementation actually prepended a list of 'in-use' ranges
>> to the live update state, and made the boot allocator treat them the
>> same as 'bad pages'. That worked well enough for initial development
>> but wouldn't scale to a live production system, mainly because the boot
>> allocator has a limit of 512 memory ranges that it can keep track of,
>> and a real system would end up more fragmented than that.
>>
>> My other concern with that approach is that it required two passes over
>> the domain-owned pages. We have to do a later pass *anyway*, as we set
>> up ownership in the frametable for each page — and that has to happen
>> after we've managed to allocate a 'struct domain' for each page_info to
>> point to. If we want to keep the pause time due to a live update down
>> to a bare minimum, doing two passes over the full set of domain pages
>> isn't my favourite strategy.

We actually need one more pass for PV domains (at least). The pass is used to allocate the page types (e.g. L4, L1, ...). This can't be done earlier because the pages need to belong to the guest before we go through its page-tables.

>>
>> So we've settled on a simpler approach — reserve a contiguous region
>> of physical memory which *won't* be used for domain pages.
>> Let the boot
>> allocator see *only* that region of memory, and plug the rest of the
>> memory in later only after doing a full pass of the live update state.

It is a bit unclear what the region will be used for. If you plan to put the state of the VMs in it, then you can't possibly use it for boot allocation (e.g. the frametable), otherwise it may be overwritten when doing the live update. The problem would arise in the first Xen, but also in the second Xen if you plan to live update another time.

Did I miss anything?

Cheers,

--
Julien Grall
On Tue, 2020-01-14 at 14:15 +0000, Julien Grall wrote:
> Hi David,
>
> On 13/01/2020 11:54, David Woodhouse wrote:
> > On Wed, 2020-01-08 at 17:24 +0000, David Woodhouse wrote:
> > > When doing a live update, Xen needs to be very careful not to scribble
> > > on pages which contain guest memory or state information for the
> > > domains which are being preserved.
> > >
> > > The information about which pages are in use is contained in the live
> > > update state passed from the previous Xen — which is mostly just a
> > > guest-transparent live migration data stream, except that it points to
> > > the page tables in place in memory while traditional live migration
> > > obviously copies the pages separately.
> > >
> > > Our initial implementation actually prepended a list of 'in-use' ranges
> > > to the live update state, and made the boot allocator treat them the
> > > same as 'bad pages'. That worked well enough for initial development
> > > but wouldn't scale to a live production system, mainly because the boot
> > > allocator has a limit of 512 memory ranges that it can keep track of,
> > > and a real system would end up more fragmented than that.
> > >
> > > My other concern with that approach is that it required two passes over
> > > the domain-owned pages. We have to do a later pass *anyway*, as we set
> > > up ownership in the frametable for each page — and that has to happen
> > > after we've managed to allocate a 'struct domain' for each page_info to
> > > point to. If we want to keep the pause time due to a live update down
> > > to a bare minimum, doing two passes over the full set of domain pages
> > > isn't my favourite strategy.
>
> We actually need one more pass for PV domains (at least). The pass is
> used to allocate the page types (e.g. L4, L1, ...). This can't be done
> earlier because the pages need to belong to the guest before we go
> through its page-tables.
All the more reason why I don't want to do an *additional* pass just for the allocator.

>
> > >
> > > So we've settled on a simpler approach — reserve a contiguous region
> > > of physical memory which *won't* be used for domain pages. Let the boot
> > > allocator see *only* that region of memory, and plug the rest of the
> > > memory in later only after doing a full pass of the live update state.
>
> It is a bit unclear what the region will be used for. If you plan to put
> the state of the VMs in it, then you can't possibly use it for boot
> allocation (e.g. the frametable), otherwise it may be overwritten when
> doing the live update.

Right. This is only for boot time allocations by Xen#2, before it's processed the LU data and knows which parts of the rest of memory it can use. It allocates its frame table from there, as well as anything else it needs to allocate before/while processing the LU data.

As an implementation detail, I anticipate that we'll be using the boot allocator for that early part from the reserved region, and that the switch to using the full available memory (less those pages already in-use) will *coincide* with switching to the real heap allocator.

The reserved region *isn't* for the LU data itself. That can be allocated from arbitrary pages *outside* the reserved area, in Xen#1. Xen#2 can vmap those pages, and needs to avoid stomping on them just like it needs to avoid stomping on actual domain-owned pages.

The plan is that Xen#1 allocates arbitrary pages to store the actual LU data. Then another page (or higher order allocation if we need >2MiB of actual LU data) containing the MFNs of all those data pages. Then we need to somehow pass the address of that MFN-list to Xen#2.

My current plan is to put *that* in the first 64 bits of the reserved LU bootmem region, and load it from there early in the Xen#2 boot process.
I'm looking at adding an IND_WRITE64 primitive to the kimage processing, to allow it to be trivially appended for kexec_reloc() to obey.
On 14/01/2020 14:48, David Woodhouse wrote:
> On Tue, 2020-01-14 at 14:15 +0000, Julien Grall wrote:
>> Hi David,
>>
>> On 13/01/2020 11:54, David Woodhouse wrote:
>>> On Wed, 2020-01-08 at 17:24 +0000, David Woodhouse wrote:
>>>> So we've settled on a simpler approach — reserve a contiguous region
>>>> of physical memory which *won't* be used for domain pages. Let the boot
>>>> allocator see *only* that region of memory, and plug the rest of the
>>>> memory in later only after doing a full pass of the live update state.
>>
>> It is a bit unclear what the region will be used for. If you plan to put
>> the state of the VMs in it, then you can't possibly use it for boot
>> allocation (e.g. the frametable), otherwise it may be overwritten when
>> doing the live update.
>
> Right. This is only for boot time allocations by Xen#2, before it's
> processed the LU data and knows which parts of the rest of memory it
> can use. It allocates its frame table from there, as well as anything
> else it needs to allocate before/while processing the LU data.

It would be worth documenting what the expectations for the buffer are. Maybe in xen-command-line along with the rest of the new option you introduced? Or in a separate document.

> As an implementation detail, I anticipate that we'll be using the boot
> allocator for that early part from the reserved region, and that the
> switch to using the full available memory (less those pages already in-
> use) will *coincide* with switching to the real heap allocator.
>
> The reserved region *isn't* for the LU data itself. That can be
> allocated from arbitrary pages *outside* the reserved area, in Xen#1.
> Xen#2 can vmap those pages, and needs to avoid stomping on them just
> like it needs to avoid stomping on actual domain-owned pages.
>
> The plan is that Xen#1 allocates arbitrary pages to store the actual LU
> data.
> Then another page (or higher order allocation if we need >2MiB of
> actual LU data) containing the MFNs of all those data pages. Then we
> need to somehow pass the address of that MFN-list to Xen#2.
>
> My current plan is to put *that* in the first 64 bits of the reserved
> LU bootmem region, and load it from there early in the Xen#2 boot
> process. I'm looking at adding an IND_WRITE64 primitive to the kimage
> processing, to allow it to be trivially appended for kexec_reloc() to
> obey.

Wouldn't it be better to reserve the first 4K page of the LU bootmem region?

Otherwise, you may run into the same trouble as described above (to a lesser extent) if the 64-bit value overwrites anything useful for the current Xen. But I guess you could delay the writing until just before you jump to xen#2.

Cheers,

--
Julien Grall
On Tue, 2020-01-14 at 15:00 +0000, Julien Grall wrote:
>
> On 14/01/2020 14:48, David Woodhouse wrote:
> > On Tue, 2020-01-14 at 14:15 +0000, Julien Grall wrote:
> > > Hi David,
> > >
> > > On 13/01/2020 11:54, David Woodhouse wrote:
> > > > On Wed, 2020-01-08 at 17:24 +0000, David Woodhouse wrote:
> > > > > So we've settled on a simpler approach — reserve a contiguous region
> > > > > of physical memory which *won't* be used for domain pages. Let the boot
> > > > > allocator see *only* that region of memory, and plug the rest of the
> > > > > memory in later only after doing a full pass of the live update state.
> > >
> > > It is a bit unclear what the region will be used for. If you plan to put
> > > the state of the VMs in it, then you can't possibly use it for boot
> > > allocation (e.g. the frametable), otherwise it may be overwritten when
> > > doing the live update.
> >
> > Right. This is only for boot time allocations by Xen#2, before it's
> > processed the LU data and knows which parts of the rest of memory it
> > can use. It allocates its frame table from there, as well as anything
> > else it needs to allocate before/while processing the LU data.
>
> It would be worth documenting what the expectations for the buffer are.
> Maybe in xen-command-line along with the rest of the new option you
> introduced? Or in a separate document.

Kind of need to implement that part too, and then we can document what it finally looks like :)

> > As an implementation detail, I anticipate that we'll be using the boot
> > allocator for that early part from the reserved region, and that the
> > switch to using the full available memory (less those pages already in-
> > use) will *coincide* with switching to the real heap allocator.
> >
> > The reserved region *isn't* for the LU data itself. That can be
> > allocated from arbitrary pages *outside* the reserved area, in Xen#1.
> > Xen#2 can vmap those pages, and needs to avoid stomping on them just
> > like it needs to avoid stomping on actual domain-owned pages.
> >
> > The plan is that Xen#1 allocates arbitrary pages to store the actual LU
> > data. Then another page (or higher order allocation if we need >2MiB of
> > actual LU data) containing the MFNs of all those data pages. Then we
> > need to somehow pass the address of that MFN-list to Xen#2.
> >
> > My current plan is to put *that* in the first 64 bits of the reserved
> > LU bootmem region, and load it from there early in the Xen#2 boot
> > process. I'm looking at adding an IND_WRITE64 primitive to the kimage
> > processing, to allow it to be trivially appended for kexec_reloc() to
> > obey.
>
> Wouldn't it be better to reserve the first 4K page of the LU bootmem region?
>
> Otherwise, you may run into the same trouble as described above (to a
> lesser extent) if the 64-bit value overwrites anything useful for the
> current Xen. But I guess you could delay the writing until just before
> you jump to xen#2.

That's the point in appending an IND_WRITE64 operation to the kimage stream. The actual write is done in the last gasp of kexec_reloc() after Xen#1 is quiescent, on the way into purgatory.

So when Xen#1 has created the LU data stream (for which the pointer to the root of that data structure is page-aligned), it just calls

    kimage_add_entry(image, IND_WRITE64 | lu_data_address);

--- a/xen/arch/x86/x86_64/kexec_reloc.S
+++ b/xen/arch/x86/x86_64/kexec_reloc.S
@@ -131,11 +131,18 @@ is_source:
         jmp     next_entry
 is_zero:
         testb   $IND_ZERO, %cl
-        jz      next_entry
+        jz      is_write64
         movl    $(PAGE_SIZE / 8), %ecx  /* Zero the destination page. */
         xorl    %eax, %eax
         rep stosq
         jmp     next_entry
+is_write64:
+        testb   $IND_WRITE64, %cl
+        jz      next_entry
+        andq    $PAGE_MASK, %rcx
+        movq    %rcx, %rax
+        stosq
+        jmp     next_entry
 done:
         popq    %rbx
Hi David,

On 14/01/2020 15:20, David Woodhouse wrote:
> On Tue, 2020-01-14 at 15:00 +0000, Julien Grall wrote:
>>
>> On 14/01/2020 14:48, David Woodhouse wrote:
>>> On Tue, 2020-01-14 at 14:15 +0000, Julien Grall wrote:
>>>> Hi David,
>>>>
>>>> On 13/01/2020 11:54, David Woodhouse wrote:
>>>>> On Wed, 2020-01-08 at 17:24 +0000, David Woodhouse wrote:
>>>>>> So we've settled on a simpler approach — reserve a contiguous region
>>>>>> of physical memory which *won't* be used for domain pages. Let the boot
>>>>>> allocator see *only* that region of memory, and plug the rest of the
>>>>>> memory in later only after doing a full pass of the live update state.
>>>>
>>>> It is a bit unclear what the region will be used for. If you plan to put
>>>> the state of the VMs in it, then you can't possibly use it for boot
>>>> allocation (e.g. the frametable), otherwise it may be overwritten when
>>>> doing the live update.
>>>
>>> Right. This is only for boot time allocations by Xen#2, before it's
>>> processed the LU data and knows which parts of the rest of memory it
>>> can use. It allocates its frame table from there, as well as anything
>>> else it needs to allocate before/while processing the LU data.
>>
>> It would be worth documenting what the expectations for the buffer are.
>> Maybe in xen-command-line along with the rest of the new option you
>> introduced? Or in a separate document.
>
> Kind of need to implement that part too, and then we can document what
> it finally looks like :)
>
>>> As an implementation detail, I anticipate that we'll be using the boot
>>> allocator for that early part from the reserved region, and that the
>>> switch to using the full available memory (less those pages already in-
>>> use) will *coincide* with switching to the real heap allocator.
>>>
>>> The reserved region *isn't* for the LU data itself. That can be
>>> allocated from arbitrary pages *outside* the reserved area, in Xen#1.
>>> Xen#2 can vmap those pages, and needs to avoid stomping on them just
>>> like it needs to avoid stomping on actual domain-owned pages.
>>>
>>> The plan is that Xen#1 allocates arbitrary pages to store the actual LU
>>> data. Then another page (or higher order allocation if we need >2MiB of
>>> actual LU data) containing the MFNs of all those data pages. Then we
>>> need to somehow pass the address of that MFN-list to Xen#2.
>>>
>>> My current plan is to put *that* in the first 64 bits of the reserved
>>> LU bootmem region, and load it from there early in the Xen#2 boot
>>> process. I'm looking at adding an IND_WRITE64 primitive to the kimage
>>> processing, to allow it to be trivially appended for kexec_reloc() to
>>> obey.
>>
>> Wouldn't it be better to reserve the first 4K page of the LU bootmem region?
>>
>> Otherwise, you may run into the same trouble as described above (to a
>> lesser extent) if the 64-bit value overwrites anything useful for the
>> current Xen. But I guess you could delay the writing until just before
>> you jump to xen#2.
>
> That's the point in appending an IND_WRITE64 operation to the kimage
> stream. The actual write is done in the last gasp of kexec_reloc()
> after Xen#1 is quiescent, on the way into purgatory.

I was not sure what you meant by IND_WRITE64. Maybe I should have asked first :). Thank you for the explanation!

Cheers,

--
Julien Grall
On Tue, 2020-01-14 at 16:29 +0000, Julien Grall wrote:
> > That's the point in appending an IND_WRITE64 operation to the kimage
> > stream. The actual write is done in the last gasp of kexec_reloc()
> > after Xen#1 is quiescent, on the way into purgatory.
>
> I was not sure what you meant by IND_WRITE64. Maybe I should have asked
> first :). Thank you for the explanation!

Don't you often find an email is made easier to understand by the addition of a few lines of unified diff of assembler code...?
On 15/01/2020 07:40, David Woodhouse wrote:
> On Tue, 2020-01-14 at 16:29 +0000, Julien Grall wrote:
>>> That's the point in appending an IND_WRITE64 operation to the kimage
>>> stream. The actual write is done in the last gasp of kexec_reloc()
>>> after Xen#1 is quiescent, on the way into purgatory.
>>
>> I was not sure what you meant by IND_WRITE64. Maybe I should have asked
>> first :). Thank you for the explanation!
>
> Don't you often find an email is made easier to understand by the
> addition of a few lines of unified diff of assembler code...?

It definitely helps. I tend to prefer a diff over a long paragraph trying to explain the same thing :).

Cheers,

--
Julien Grall