From: Fangyu Yu <fangyu.yu@linux.alibaba.com>

In a RISC-V kernel, both kexec and crashdump need to hand off execution
to the next kernel after tearing down the current kernel address space.
However, under virtualization the guest uses two-stage address
translation, and the PC does not jump to stvec after satp is set to
zero, so the legacy single-step "csrw satp,0 + stvec redirect" sequence
fails with "kvm run failed Operation not supported" and the VCPU dies.

This patch set introduces a dedicated kexec trampoline text section and
builds a minimal trampoline page table for it. Both handoffs are then
reworked into a two-pass trampoline:

  1. First, enter via the kernel VA, install the trampoline page table,
     and jump to the trampoline VA (= PA) of the entry stub;
  2. Then, with the PC already on a PA, drop SATP with csrw satp,0
     (now safe because the PC no longer needs to be re-anchored), and
     jump directly to the target -- either the crash kernel entry
     (crash path) or the per-image control_code_buffer that runs
     the relocate body with SATP=0 throughout (normal path).

With this, both kexec and crashdump work reliably in RISC-V guests
running under two-stage translation.
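
To illustrate what a minimal trampoline page table has to provide, here
is a userspace sketch that maps one page at both its kernel VA and its
VA == PA address, then checks both with a software walk. It is not the
series' map_tramp_page(): it assumes Sv39 only, gives each mapping its
own lower-level tables, and stores a pool index rather than a real PPN
in non-leaf entries. The names pool, alloc_table(), map_page(),
translate() and build_tramp_table() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT    12
#define PAGE_MASK     ((1ULL << PAGE_SHIFT) - 1)
#define LEVELS        3             /* Sv39 (assumption of this sketch) */
#define PTRS          512           /* 9 index bits per level */
#define PTE_V         (1ULL << 0)
#define PTE_R         (1ULL << 1)
#define PTE_X         (1ULL << 3)
#define PTE_PPN_SHIFT 10
#define MAX_TABLES    5             /* root + 2 tables per mapping */

typedef uint64_t pte_t;

/*
 * Hypothetical pool standing in for statically reserved table pages.
 * In this model a non-leaf PTE stores a pool index, not a real PPN.
 */
static pte_t pool[MAX_TABLES][PTRS];
static int pool_used;

static int alloc_table(void)
{
	return pool_used < MAX_TABLES ? pool_used++ : -1;
}

static unsigned int vpn(uint64_t va, int level)
{
	return (va >> (PAGE_SHIFT + 9 * level)) & (PTRS - 1);
}

/* Map one 4 KiB page va -> pa as read+execute; returns 0 or -1. */
static int map_page(int root, uint64_t va, uint64_t pa)
{
	int table = root;

	assert(!(va & PAGE_MASK) && !(pa & PAGE_MASK));
	for (int level = LEVELS - 1; level > 0; level--) {
		pte_t *pte = &pool[table][vpn(va, level)];

		if (!(*pte & PTE_V)) {
			int next = alloc_table();

			if (next < 0)
				return -1;
			*pte = ((pte_t)next << PTE_PPN_SHIFT) | PTE_V;
		}
		table = (int)(*pte >> PTE_PPN_SHIFT);
	}
	pool[table][vpn(va, 0)] =
		((pa >> PAGE_SHIFT) << PTE_PPN_SHIFT) | PTE_V | PTE_R | PTE_X;
	return 0;
}

/* Software walk; returns 0 and fills *pa on success, -1 on a fault. */
static int translate(int root, uint64_t va, uint64_t *pa)
{
	int table = root;

	for (int level = LEVELS - 1; level > 0; level--) {
		pte_t pte = pool[table][vpn(va, level)];

		if (!(pte & PTE_V))
			return -1;
		table = (int)(pte >> PTE_PPN_SHIFT);
	}
	pte_t leaf = pool[table][vpn(va, 0)];

	if (!(leaf & PTE_V))
		return -1;
	*pa = ((leaf >> PTE_PPN_SHIFT) << PAGE_SHIFT) | (va & PAGE_MASK);
	return 0;
}

/* Build the two mappings the handoff needs: kernel VA and VA == PA. */
static int build_tramp_table(uint64_t tramp_va, uint64_t tramp_pa)
{
	int root;

	memset(pool, 0, sizeof(pool));
	pool_used = 0;
	root = alloc_table();
	if (root < 0 || map_page(root, tramp_va, tramp_pa) ||
	    map_page(root, tramp_pa, tramp_pa))
		return -1;
	return root;
}
```

With both mappings in place, the first pass can run from the kernel VA,
switch to this table, and jump to the same code at its VA == PA address;
after that jump, clearing satp does not change where the PC points.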

Tested on QEMU virt under two configurations:

  * HS-mode bare (QEMU TCG) -- regression check
      - normal kexec: kexec -l/-e succeeds, second kernel boots and
                      prints the userspace SECOND BOOT marker.
      - crash kdump:  panic triggers crash kernel boot, /proc/vmcore
                      opens cleanly in crash(1) and shows the panic
                      backtrace.

  * VS-mode (L0 x86 + QEMU TCG -> L1 riscv64 + KVM -> L2)
      Before this series, both paths die with
          kvm run failed Operation not supported
      and an all-zero M-mode register dump on the SATP transition.
      After this series, both paths succeed end-to-end and the
      vmcore opens cleanly in crash(1).

---
Changes in v3 (Sashiko AI review):
  - Add two new Fixes: patches at the start of the series:
    #1: machine_kexec_cleanup() was empty, so the set_memory_x()
        call in prepare() leaked an executable direct-map page back
        to the buddy allocator on kexec -u (W^X bypass). Fix: add
        set_memory_nx() in cleanup.
    #2: machine_kexec_prepare() FDT search checked
        memsz <= sizeof(fdt) but read sizeof(fdt) bytes from
        segment[i].buf, which can be smaller than memsz. Fix: check
        bufsz instead.
  - Inline the .kexec.tramp.text section definition directly into
    vmlinux.lds.S instead of using a macro in image-vars.h.
  - Rewrite map_tramp_page() to share a single set of lower-level
    page tables between the VA and PA mappings (5 BSS pages instead
    of 9), with a collision-safe walker that only populates entries
    still zero. Add Sv32 support.
  - Link to v2:
    https://lore.kernel.org/linux-riscv/20260526125009.2404-1-fangyu.yu@linux.alibaba.com/
  - Link to v1:
    https://lore.kernel.org/linux-riscv/20260324114527.91494-1-fangyu.yu@linux.alibaba.com/
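
The bound in fix #2 above can be sketched in userspace as follows. This
is an illustration under assumptions, not the patch itself: struct seg
is a hypothetical stand-in for the kexec segment descriptor,
find_fdt_segment() and read_be32() are made-up names, and the exact
comparison used in the series is not shown in this cover letter. The
point is only that the size check must cover the buffer actually being
read (bufsz), not the destination size (memsz).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FDT_MAGIC       0xd00dfeedu
#define FDT_HEADER_SIZE 40      /* ten big-endian 32-bit header fields */

/* Hypothetical userspace stand-in for a kexec segment descriptor. */
struct seg {
	const void *buf;        /* source buffer */
	size_t bufsz;           /* bytes valid in buf */
	size_t memsz;           /* destination size, may exceed bufsz */
};

static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | p[3];
}

/*
 * Return the index of the first segment whose source buffer starts
 * with the FDT magic, or -1 if none does.
 */
static int find_fdt_segment(const struct seg *segs, int nr)
{
	for (int i = 0; i < nr; i++) {
		unsigned char hdr[FDT_HEADER_SIZE];

		/* Bound the read by the source size, not by memsz. */
		if (segs[i].bufsz < sizeof(hdr))
			continue;

		memcpy(hdr, segs[i].buf, sizeof(hdr));
		if (read_be32(hdr) == FDT_MAGIC)
			return i;
	}
	return -1;
}
```

A segment with a large memsz but a short buf is skipped here rather than
read past its end, which is the failure mode the memsz-based check
allowed.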

Fangyu Yu (9):
  riscv: kexec: Reset executable bit on the control code page in cleanup
  riscv: kexec: Bound FDT search by source buffer size, not destination
  riscv: Add kexec trampoline text section to vmlinux.lds.S
  riscv: kexec: Place norelocate trampoline into .kexec.tramp.text
  riscv: kexec: Build trampoline page tables for crash kernel entry
  riscv: kexec: Switch to trampoline page table before norelocate
  riscv: kexec: Always build the trampoline page table
  riscv: kexec: Add the relocate-trampoline wrapper
  riscv: kexec: Route normal kexec through the trampoline page table

 arch/riscv/include/asm/kexec.h     |   5 +
 arch/riscv/kernel/kexec_relocate.S |  92 +++++++++++-----
 arch/riscv/kernel/machine_kexec.c  | 171 +++++++++++++++++++++++++++--
 arch/riscv/kernel/vmlinux.lds.S    |  10 ++
 4 files changed, 241 insertions(+), 37 deletions(-)

--
2.50.1