From: David Woodhouse <dwmw@amazon.co.uk>
The kernel switches to a new set of page tables during kexec. The global
mappings (_PAGE_GLOBAL==1) can remain in the TLB after this switch. This
is generally not a problem because the new page tables use a different
portion of the virtual address space than the normal kernel mappings.
The critical exception to that generalisation (and the only mapping
which isn't an identity mapping) is the kexec control page itself —
which was ROX in the original kernel mapping, but should be RWX in the
new page tables. If there is a global TLB entry for that in its prior
read-only state, it definitely needs to be flushed before attempting to
write through that virtual mapping.
It would be possible to just avoid writing to the virtual address of the
page and defer all writes until they can be done through the identity
mapping. But there's no good reason to keep the old TLB entries around,
as they can cause nothing but trouble.
Clear the PGE bit in %cr4 early, before storing data in the control page.
Fixes: 5a82223e0743 ("x86/kexec: Mark relocate_kernel page as ROX instead of RWX")
Co-authored-by: Dave Hansen <dave.hansen@linux.intel.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: "Ning, Hongyu" <hongyu.ning@linux.intel.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219592
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: "Ning, Hongyu" <hongyu.ning@linux.intel.com>
---
arch/x86/kernel/relocate_kernel_64.S | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 8bc86a1e056a..9bd601dd8659 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -70,14 +70,20 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
 	movq	kexec_pa_table_page(%rip), %r9
 	movq	%r9, %cr3
 
+	/* Leave CR4 in %r13 to enable the right paging mode later. */
+	movq	%cr4, %r13
+
+	/* Disable global pages immediately to ensure this mapping is RWX */
+	movq	%r13, %r12
+	andq	$~(X86_CR4_PGE), %r12
+	movq	%r12, %cr4
+
 	/* Save %rsp and CRs. */
+	movq	%r13, saved_cr4(%rip)
 	movq	%rsp, saved_rsp(%rip)
 	movq	%rax, saved_cr3(%rip)
 	movq	%cr0, %rax
 	movq	%rax, saved_cr0(%rip)
-	/* Leave CR4 in %r13 to enable the right paging mode later. */
-	movq	%cr4, %r13
-	movq	%r13, saved_cr4(%rip)
 
 	/* save indirection list for jumping back */
 	movq	%rdi, pa_backup_pages_map(%rip)
base-commit: 35aafa1d41cee0d3d50164561bca34befc1d9ce3
--
2.47.0
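
For readers less familiar with the register shuffling in the hunk, here is
a minimal C sketch of what the movq/andq/movq sequence computes. It assumes
only that X86_CR4_PGE is bit 7 of CR4; the helper name cr4_without_pge() is
made up for illustration and is not a kernel function.

```c
#include <stdint.h>

/* CR4.PGE (Page Global Enable) is bit 7 of CR4. */
#define X86_CR4_PGE (UINT64_C(1) << 7)

/*
 * Hypothetical helper mirroring:
 *     movq %r13, %r12
 *     andq $~(X86_CR4_PGE), %r12
 * It returns the CR4 value with only the PGE bit cleared.
 */
static uint64_t cr4_without_pge(uint64_t cr4)
{
	return cr4 & ~X86_CR4_PGE;
}
```

Note that the patch keeps the unmodified value in %r13 (it is stored in
saved_cr4 and used later to pick the paging mode) and writes only the masked
copy in %r12 to CR4.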
On Mon, Dec 16, 2024 at 11:24:08PM +0000, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> The kernel switches to a new set of page tables during kexec. The global
> mappings (_PAGE_GLOBAL==1) can remain in the TLB after this switch. This
> is generally not a problem because the new page tables use a different
> portion of the virtual address space than the normal kernel mappings.
>
> The critical exception to that generalisation (and the only mapping
> which isn't an identity mapping) is the kexec control page itself,
> which was ROX in the original kernel mapping, but should be RWX in the
> new page tables. If there is a global TLB entry for that in its prior
> read-only state, it definitely needs to be flushed before attempting to
> write through that virtual mapping.
>
> It would be possible to just avoid writing to the virtual address of the
> page and defer all writes until they can be done through the identity
> mapping. But there's no good reason to keep the old TLB entries around,
> as they can cause nothing but trouble.
>
> Clear the PGE bit in %cr4 early, before storing data in the control page.

It's worth noting that flipping CR4.PGE triggers a TLB flush. I was not sure
if a CR3 write is required to make it happen.

--
Kiryl Shutsemau / Kirill A. Shutemov

On 12/17/24 04:25, Kirill A. Shutemov wrote:
>> Clear the PGE bit in %cr4 early, before storing data in the control page.
> It's worth noting that flipping CR4.PGE triggers a TLB flush. I was not sure
> if a CR3 write is required to make it happen.

I thought about removing the CR3 write. But I decided against it because
CR4.PGE needs to actually change value, unlike CR3 writes where any
write can flush the TLB (modulo globals, PCID and bit 63 of course).

X86_FEATURE_PGE itself is required but I couldn't actually remember if
there are any cases where CR4.PGE==0. If there were, the CR3 write would
still be needed. I don't _think_ there are any ways for x86_64 to end up
with CR4.PGE==0, but I also wouldn't rule out the possibility that some
silly issue pops up making us play stupid games and win stupid prizes.

Anyway, I think we can leave the belt-and-suspenders programming in this
case. A comment wouldn't hurt I guess.

On Tue, 2024-12-17 at 06:51 -0800, Dave Hansen wrote:
> On 12/17/24 04:25, Kirill A. Shutemov wrote:
> > > Clear the PGE bit in %cr4 early, before storing data in the control page.
> > It's worth noting that flipping CR4.PGE triggers a TLB flush. I was not sure
> > if a CR3 write is required to make it happen.
>
> I thought about removing the CR3 write. But I decided against it because
> CR4.PGE needs to actually change value, unlike CR3 writes where any
> write can flush the TLB (modulo globals, PCID and bit 63 of course).
>
> X86_FEATURE_PGE itself is required but I couldn't actually remember if
> there are any cases where CR4.PGE==0. If there were, the CR3 write would
> still be needed. I don't _think_ there are any ways for x86_64 to end up
> with CR4.PGE==0, but I also wouldn't rule out the possibility that some
> silly issue pops up making us play stupid games and win stupid prizes.
>
> Anyway, I think we can leave the belt-and-suspenders programming in this
> case. A comment wouldn't hurt I guess.

I'm a little lost. In this case I don't see belt-and-suspenders
programming. We're not loading CR3 after clearing CR4.PGE just to be
paranoid about making really really sure the TLB is flushed.

We're loading CR3 because we're switching from the kernel's page tables
to the new identity mapping set up for the relocate_kernel environment.

On 12/17/24 06:56, David Woodhouse wrote:
>> Anyway, I think we can leave the belt-and-suspenders programming in this
>> case. A comment wouldn't hurt I guess.
> I'm a little lost. In this case I don't see belt-and-suspenders
> programming. We're not loading CR3 after clearing CR4.PGE just to be
> paranoid about making really really sure the TLB is flushed.
>
> We're loading CR3 because we're switching from the kernel's page tables
> to the new identity mapping set up for the relocate_kernel environment.

Yes, agreed, that's another reason the CR3 write must stay. I hadn't
even considered that part yet honestly.

On 17 December 2024 13:25:48 CET, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
>On Mon, Dec 16, 2024 at 11:24:08PM +0000, David Woodhouse wrote:
>> From: David Woodhouse <dwmw@amazon.co.uk>
>>
>> The kernel switches to a new set of page tables during kexec. The global
>> mappings (_PAGE_GLOBAL==1) can remain in the TLB after this switch. This
>> is generally not a problem because the new page tables use a different
>> portion of the virtual address space than the normal kernel mappings.
>>
>> The critical exception to that generalisation (and the only mapping
>> which isn't an identity mapping) is the kexec control page itself,
>> which was ROX in the original kernel mapping, but should be RWX in the
>> new page tables. If there is a global TLB entry for that in its prior
>> read-only state, it definitely needs to be flushed before attempting to
>> write through that virtual mapping.
>>
>> It would be possible to just avoid writing to the virtual address of the
>> page and defer all writes until they can be done through the identity
>> mapping. But there's no good reason to keep the old TLB entries around,
>> as they can cause nothing but trouble.
>>
>> Clear the PGE bit in %cr4 early, before storing data in the control page.
>
>It's worth noting that flipping CR4.PGE triggers a TLB flush. I was not sure
>if a CR3 write is required to make it happen.

Well, until we flip to the new CR3 the read-only PTE can just get
reloaded. But after CR4.PGE is cleared, of course they won't be global
any more. So they will get flushed (again) when CR3 is reloaded.

Maybe it could run a tiny bit faster if we change CR3 before CR4? I
don't know that we care about microbenchmarking kexec to that degree,
but I may take a look...
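
The flush rules discussed in this thread can be illustrated with a toy
model in C. This is only a sketch under simplifying assumptions, not how
the hardware or the kernel implements anything: the struct and function
names (toy_cpu, toy_tlb_fill, toy_write_cr3, toy_write_cr4,
toy_valid_entries) are hypothetical, the TLB is a tiny fixed array, and
PCID, the CR3 bit 63 "no flush" hint and page-table walks are all ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define X86_CR4_PGE (UINT64_C(1) << 7)	/* CR4 bit 7: Page Global Enable */
#define TLB_SLOTS 4			/* arbitrary size for the toy model */

struct tlb_entry {
	bool valid;
	bool global;	/* cached from a PTE with _PAGE_GLOBAL while PGE was set */
	bool writable;
	uint64_t vaddr;
};

struct toy_cpu {
	uint64_t cr3;
	uint64_t cr4;
	struct tlb_entry tlb[TLB_SLOTS];
};

/* Cache a translation; it is only treated as global if CR4.PGE is set. */
static void toy_tlb_fill(struct toy_cpu *cpu, size_t slot, uint64_t vaddr,
			 bool writable, bool pte_global)
{
	cpu->tlb[slot].valid = true;
	cpu->tlb[slot].global = pte_global && (cpu->cr4 & X86_CR4_PGE) != 0;
	cpu->tlb[slot].writable = writable;
	cpu->tlb[slot].vaddr = vaddr;
}

/* Any CR3 write drops the non-global entries; global entries survive. */
static void toy_write_cr3(struct toy_cpu *cpu, uint64_t val)
{
	cpu->cr3 = val;
	for (size_t i = 0; i < TLB_SLOTS; i++)
		if (!cpu->tlb[i].global)
			cpu->tlb[i].valid = false;
}

/* A CR4 write drops everything, but only if PGE actually changes value. */
static void toy_write_cr4(struct toy_cpu *cpu, uint64_t val)
{
	bool pge_changed = ((cpu->cr4 ^ val) & X86_CR4_PGE) != 0;

	cpu->cr4 = val;
	if (!pge_changed)
		return;
	for (size_t i = 0; i < TLB_SLOTS; i++) {
		cpu->tlb[i].valid = false;
		cpu->tlb[i].global = false;
	}
}

/* Count the translations that are still cached. */
static size_t toy_valid_entries(const struct toy_cpu *cpu)
{
	size_t n = 0;

	for (size_t i = 0; i < TLB_SLOTS; i++)
		if (cpu->tlb[i].valid)
			n++;
	return n;
}
```

In this model, a global read-only entry filled while CR4.PGE is set
survives toy_write_cr3(), is left alone by a toy_write_cr4() that rewrites
the same value, and goes away only once toy_write_cr4() is called with the
PGE bit cleared. That matches the point made above that CR4.PGE has to
actually change value, while the CR3 write is there to switch page tables.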