The kexec destination addresses (including those for purgatory, the new
kernel, boot params/cmdline, and initrd) are searched from the free area of
memblock or RAM resources. Since they are not allocated by the currently
running kernel, it is not guaranteed that they are accepted before
relocating the new kernel.
Accept the destination addresses for the new kernel, as the new kernel may
not be able to or may not accept them by itself.
Place the "accept" code immediately after the destination addresses pass
sanity checks, so the code can be shared by both users of the kexec_load
and kexec_file_load system calls.
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
kernel/kexec_core.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index c0caa14880c3..d97376eafc1a 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -210,6 +210,16 @@ int sanity_check_segment_list(struct kimage *image)
 	}
 #endif
 
+	/*
+	 * The destination addresses are searched from free memory ranges rather
+	 * than being allocated from the current kernel, so they are not
+	 * guaranteed to be accepted by the current kernel.
+	 * Accept those initial pages for the new kernel since it may not be
+	 * able to accept them by itself.
+	 */
+	for (i = 0; i < nr_segments; i++)
+		accept_memory(image->segment[i].mem, image->segment[i].memsz);
+
 	return 0;
 }
 
base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
--
2.43.2
Yan Zhao <yan.y.zhao@intel.com> writes:

> The kexec destination addresses (including those for purgatory, the new
> kernel, boot params/cmdline, and initrd) are searched from the free area of
> memblock or RAM resources. Since they are not allocated by the currently
> running kernel, it is not guaranteed that they are accepted before
> relocating the new kernel.
>
> Accept the destination addresses for the new kernel, as the new kernel may
> not be able to or may not accept them by itself.
>
> Place the "accept" code immediately after the destination addresses pass
> sanity checks, so the code can be shared by both users of the kexec_load
> and kexec_file_load system calls.

I am not at all certain this is sufficient, and I am a bit flummoxed
about the need to ever ``accept'' memory lazily.

In a past life I wrote bootup firmware, and part of that was the code
to initialize the contents of memory. When properly tuned and set up, it
would never take more than a second to just blast initial values into
memory. That is because the ratio of memory per memory controller to
memory bandwidth stayed roughly constant while I was paying attention.
I expect that ratio to continue staying roughly constant, or systems
will quickly start developing unacceptable boot times.

As I recall, Intel TDX is where the contents of memory are encrypted per
virtual machine. Which implies that you have the same challenge as
bootup initializing memory, and that is what ``accepting'' memory is.

I am concerned that an unfiltered accept_memory may result in memory
that has already been ``accepted'' being accepted again. This has
the potential to be wasteful in the best case, and the potential to
cause memory that is in use to be reinitialized, losing the values
that are currently stored there.

I am concerned that the target kernel won't know about accepting
memory, or might not perform the work early enough and try to use memory
without accepting it first.

I would much prefer it if getting into kexec_load forced the memory
acceptance out of lazy mode (or possibly did not even work in lazy
mode). That keeps things simple for now. Once enough people have
machines requiring the use of accept_memory, we can worry about
optimizing things and pushing the accept_memory call down into
kexec_load.

Ugh. I just noticed another issue. Unless the memory we are talking
about is the memory reserved for kexec-on-panic kernels, the memory
needs struct pages and everything set up so it can be allocated from
anyway. Which is to say, I think this has the potential to conflict
with the accounting in try_to_accept_memory.

Please just make memory acceptance ``eager'' (non-lazy) when using
kexec. Unless someone has messed up their implementation badly, it won't
be a significant amount of time in human terms, and it makes the code
so much easier to understand and think about.

Eric
On Mon, Oct 21, 2024 at 09:33:17AM -0500, Eric W. Biederman wrote:
> I am not at all certain this is sufficient, and I am a bit flummoxed
> about the need to ever ``accept'' memory lazily.
[...]
> I am concerned that an unfiltered accept_memory may result in memory
> that has already been ``accepted'' being accepted again.

It is not unfiltered. We check it against a bitmap that maintains the
acceptance status of each memory block.

> This has the potential to be wasteful in the best case, and the
> potential to cause memory that is in use to be reinitialized losing
> the values that are currently stored there.
>
> I am concerned that the target kernel won't know about accepting
> memory, or might not perform the work early enough and try to use
> memory without accepting it first.

The bitmap I mentioned above is passed between the two kernels via an
EFI config table. This mechanism predates kexec enablement on systems
with unaccepted memory support, so there should not be a problem.

> I would much prefer if getting into kexec_load would force the memory
> acceptance out of lazy mode (or possibly not even work in lazy mode).
> That keeps things simple for now.

You can always force this behaviour with accept_memory=eager, but it is
waaay slower for larger VMs. It is an especially bad idea if kexec is
used as the initial bootloader and most of the memory is not yet
accepted by the time kexec is triggered.

> Once enough people have machines requiring the use of accept_memory
> we can worry about optimizing things and pushing the accept_memory
> call down into kexec_load.

It is already here and it works, despite some bugs that need to be
addressed.

> Ugh. I just noticed another issue. Unless the memory we are talking
> about is the memory reserved for kexec on panic kernels the memory
> needs struct pages and everything setup so it can be allocated from
> anyway.

I am not sure I follow. Could you please elaborate?

> Which is to say I think this has the potential to conflict with
> the accounting in try_to_accept_memory.
>
> Please just make memory acceptance ``eager'' non-lazy when using
> kexec. Unless someone has messed their implementation badly it won't
> be a significant amount of time in human terms, and it makes the code
> so much easier to understand and think about.

Waiting minutes to get a VM booted to a shell is not feasible for most
deployments. Lazy is a sane default to me.

--
Kiryl Shutsemau / Kirill A. Shutemov
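[The bitmap filter described above can be sketched in plain C. This is a
hypothetical userspace model, not the actual
drivers/firmware/efi/unaccepted_memory.c code: one bit per 4 KiB page,
set while the page is still unaccepted and cleared at most once, so the
expensive accept operation never runs twice for the same page.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT	12
#define NPAGES		1024
#define BITS_PER_WORD	(8 * sizeof(unsigned long))

/* One bit per page; a set bit means "still unaccepted". */
static unsigned long unaccepted[NPAGES / (8 * sizeof(unsigned long))];

/* Clear the page's "unaccepted" bit; report whether it was still set. */
static bool test_and_clear_unaccepted(uint64_t pfn)
{
	unsigned long *word = &unaccepted[pfn / BITS_PER_WORD];
	unsigned long bit = 1UL << (pfn % BITS_PER_WORD);
	bool was_unaccepted = (*word & bit) != 0;

	*word &= ~bit;
	return was_unaccepted;
}

/*
 * Accept only those pages of [start, start + size) whose bitmap bit is
 * still set; return how many pages actually needed accepting.
 */
static unsigned int accept_memory_filtered(uint64_t start, uint64_t size)
{
	unsigned int accepted = 0;

	for (uint64_t pfn = start >> PAGE_SHIFT;
	     pfn < (start + size) >> PAGE_SHIFT; pfn++) {
		if (test_and_clear_unaccepted(pfn))
			accepted++; /* real code would issue the accept call here */
	}
	return accepted;
}
```

[Calling this twice on the same range accepts the pages only the first
time, which is the property the reply relies on: re-running accept over
an already-accepted kexec segment is a no-op, not a reinitialization.]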
"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> Waiting minutes to get VM booted to shell is not feasible for most
> deployments. Lazy is sane default to me.

Huh?

Unless my guesses about what is happening are wrong, lazy is hiding
a serious implementation deficiency. From all the hardware I have seen,
taking minutes is absolutely ridiculous.

Does writing to all of memory at full speed take minutes? How can such
a system be functional?

If you don't actually have to write to the pages and it is just some
accounting function, it is even more ridiculous.

I had previously thought that accept_memory was the firmware call.
Now that I see that it is just a wrapper for some hardware-specific
calls, I am even more perplexed.

Quite honestly, what this looks like to me is that someone failed to
enable write-combining or write-back caching when writing to memory
while initializing the protected memory. With the result that
everything is moving dog slow, and people are introducing complexity
left and right to avoid that bad implementation.

Can someone please explain to me why this accept_memory stuff has to be
slow, why it has to take minutes to do its job?

I would much rather spend my time figuring out how to make
accept_memory run at a reasonable speed than litter the kernel with
more of this nonsense.

Eric
On Wed, Oct 23, 2024 at 10:44:11AM -0500, Eric W. Biederman wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
>
> > Waiting minutes to get VM booted to shell is not feasible for most
> > deployments. Lazy is sane default to me.
>
> Huh?
>
> Unless my guesses about what is happening are wrong lazy is hiding
> a serious implementation deficiency. From all hardware I have seen
> taking minutes is absolutely ridiculous.
>
> Does writing to all of memory at full speed take minutes? How can such
> a system be functional?

It is not only the memory write (to encrypt the memory), but also a
TDCALL, which is a TD-exit, on every page. That is costly in the TDX
case.

On a single vCPU it takes about a minute to accept 90GiB of memory.

It improves a bit with the number of vCPUs. It is 40 seconds with 4
vCPUs, but it doesn't scale past that in my setup.

But it is all rather pathological: the VMM doesn't support huge pages
yet, so all memory is accepted in 4K chunks. Bringing in 2M support
would cut the number of TDCALLs by 512.

Once memory is accepted, memory access cost is comparable to bare
metal, minus the usual virtualisation tax on page walks.

I don't know what the picture looks like in the AMD case.

> If you don't actually have to write to the pages and it is just some
> accounting function it is even more ridiculous.
>
> I had previously thought that accept_memory was the firmware call.
> Now that I see that it is just a wrapper for some hardware specific
> calls I am even more perplexed.

It is basically a hypercall. The feature is only used in guests so far.

--
Kiryl Shutsemau / Kirill A. Shutemov
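[A back-of-the-envelope check of the 512x figure above. `accept_calls`
is a hypothetical helper, just arithmetic: 90 GiB accepted in 4 KiB
chunks needs roughly 23.6 million per-page accept operations, while
2 MiB granularity divides that count by exactly 512.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of accept operations (TDCALLs, in the TDX case) needed to
 * cover `bytes` of memory at a given acceptance granularity.
 * Assumes bytes is a multiple of chunk.
 */
static uint64_t accept_calls(uint64_t bytes, uint64_t chunk)
{
	return bytes / chunk;
}
```

[With each operation costing a TD-exit, cutting the count by 512 is
where the bulk of the per-boot acceptance time would go.]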
On Fri, Oct 25, 2024 at 04:56:41PM +0300, Kirill A. Shutemov wrote:
> On Wed, Oct 23, 2024 at 10:44:11AM -0500, Eric W. Biederman wrote:
> > "Kirill A. Shutemov" <kirill@shutemov.name> writes:
[...]
> It is basically a hypercall. The feature is only used in guests so
> far.

Eric, can we get the patch applied? It fixes a crash.

--
Kiryl Shutsemau / Kirill A. Shutemov
On Mon, Nov 04, 2024 at 10:35:53AM +0200, Kirill A. Shutemov wrote:
> On Fri, Oct 25, 2024 at 04:56:41PM +0300, Kirill A. Shutemov wrote:
[...]
> Eric, can we get the patch applied? It fixes a crash.

Ping?

--
Kiryl Shutsemau / Kirill A. Shutemov
On Wed, Oct 23, 2024 at 10:44:11AM -0500, Eric W. Biederman wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
>
> > Waiting minutes to get VM booted to shell is not feasible for most
> > deployments. Lazy is sane default to me.
>
> Huh?
[...]
> Can someone please explain to me why this accept_memory stuff has to
> be slow, why it has to take minutes to do its job.

This kexec patch is a fix for a guest (TD) kexec failure.

For a Linux guest, accept_memory() happens before the guest accesses a
page. It will (if the guest is a TD):
(1) trigger the host to allocate the physical page on the host to map
    the accessed guest page, which might be slow with wait and sleep
    involved, depending on the memory pressure on the host;
(2) initialize the protected page.

Actually, most of the guest memory is not accessed by the guest during
the guest life cycle. accept_memory() may cause the host to commit a
never-to-be-used page, with the host physical page not even being able
to get swapped out.

That's why we need lazy accept, which does not call accept_memory()
until after a page is allocated by the kernel (in alloc_page(s)).

> I would much rather spend my time figuring out how to make
> accept_memory run at a reasonable speed than to litter the kernel
> with more of this nonsense.
>
> Eric
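[The lazy scheme described above can be modeled with a toy allocator.
All names here are hypothetical, nothing is taken from the real mm
code: the expensive accept operation sits behind the allocator, so a
page the guest never allocates is never accepted, and the host never
has to commit backing memory for it.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

static bool page_free[NPAGES];
static bool page_accepted[NPAGES];
static unsigned int naccepts; /* counts the costly accept operations */

static void accept_page(size_t pfn)
{
	if (!page_accepted[pfn]) { /* already-accepted pages are skipped */
		page_accepted[pfn] = true;
		naccepts++; /* stands in for the TDCALL/hypercall */
	}
}

/* Lazy model: defer the accept until alloc_page() hands the page out. */
static long alloc_page_lazy(void)
{
	for (size_t pfn = 0; pfn < NPAGES; pfn++) {
		if (page_free[pfn]) {
			page_free[pfn] = false;
			accept_page(pfn);
			return (long)pfn;
		}
	}
	return -1; /* out of pages */
}
```

[Under this model the number of accept operations tracks the number of
pages actually allocated, not the size of the VM, which is the whole
argument for lazy being the default.]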
On Thu, Oct 24, 2024 at 08:15:13AM +0800, Yan Zhao wrote:
> On Wed, Oct 23, 2024 at 10:44:11AM -0500, Eric W. Biederman wrote:
[...]
> For a linux guest, the accept_memory() happens before the guest
> accesses a page. It will (if the guest is a TD)
> (1) trigger the host to allocate the physical page on host to map the
>     accessed guest page, which might be slow with wait and sleep
>     involved, depending on the memory pressure on host.

s/accessed/specified

> (2) initializing the protected page.
On Mon, Oct 21, 2024 at 09:33:17AM -0500, Eric W. Biederman wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
[...]
> As I recall Intel TDX is where the contents of memory are encrypted
> per virtual machine. Which implies that you have the same challenge
> as bootup initializing memory, and that is what ``accepting'' memory
> is.

Yes, the kernel actually will accept the initial memory used by itself
in extract_kernel(), as in arch/x86/boot/compressed/misc.c.

But the target kernel may not be able to accept memory for purgatory.
And it currently does not accept memory for boot params/cmdline and
initrd.

> I am concerned that an unfiltered accept_memory may result in memory
> that has already been ``accepted'' being accepted again. This has
> the potential to be wasteful in the best case, and the potential to
> cause memory that is in use to be reinitialized losing the values
> that are currently stored there.

accept_memory() will not accept memory that has already been accepted.
An unaccepted-memory bitmap is maintained and queried before accepting.
(This is at least the implementation in
drivers/firmware/efi/unaccepted_memory.c.)

If it's still a concern to you, is it better to add a check like this?

	if (range_contains_unaccepted_memory(mstart, size))
		accept_memory(mstart, size);

> I am concerned that the target kernel won't know about accepting
> memory, or might not perform the work early enough and try to use
> memory without accepting it first.

The target kernel does accept memory before using it, but that does not
include the memory in kexec segments for purgatory, boot
params/cmdline, and initrd.

> I would much prefer if getting into kexec_load would force the memory
> acceptance out of lazy mode (or possibly not even work in lazy mode).
> That keeps things simple for now.
>
> Once enough people have machines requiring the use of accept_memory
> we can worry about optimizing things and pushing the accept_memory
> call down into kexec_load.
>
> Ugh. I just noticed another issue. Unless the memory we are talking
> about is the memory reserved for kexec on panic kernels the memory
> needs struct pages and everything setup so it can be allocated from
> anyway.
>
> Which is to say I think this has the potential to conflict with
> the accounting in try_to_accept_memory.

Then could we put the accept into machine_kexec(), given that
accept_memory() will not fail?

> Please just make memory acceptance ``eager'' non-lazy when using
> kexec. Unless someone has messed their implementation badly it won't
> be a significant amount of time in human terms, and it makes the code
> so much easier to understand and think about.

Yes, that's also an approach, if the above cannot convince you.