[RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions

Jork Loeser posted 20 patches 1 week, 4 days ago
arch/arm64/hyperv/hv_core.c        |   6 +-
arch/x86/hyperv/hv_init.c          |   4 +-
drivers/hv/Kconfig                 |   3 +
drivers/hv/Makefile                |   2 +-
drivers/hv/hv_common.c             |   5 +-
drivers/hv/hv_proc.c               |  32 +-
drivers/hv/mshv_debugfs.c          |  99 +++++
drivers/hv/mshv_page_preserve.c    | 557 ++++++++++++++++++++++++++
drivers/hv/mshv_page_preserve.h    |  21 +
drivers/hv/mshv_root.h             |   5 +
drivers/hv/mshv_root_hv_call.c     |  12 +-
drivers/hv/mshv_root_main.c        | 341 ++++++++++++++--
include/linux/kexec_handover.h     |   1 +
include/linux/kho_radix_tree.h     |  90 ++++-
include/linux/memblock.h           |  14 +
kernel/kexec_core.c                |   1 +
kernel/liveupdate/kexec_handover.c | 605 +++++++++++++++++++++++------
mm/hugetlb.c                       |  19 +-
mm/memblock.c                      | 177 +++++++--
mm/mm_init.c                       |   1 +
20 files changed, 1767 insertions(+), 228 deletions(-)
create mode 100644 drivers/hv/mshv_page_preserve.c
create mode 100644 drivers/hv/mshv_page_preserve.h
Posted by Jork Loeser 1 week, 4 days ago
When Linux runs as an L1 Virtual Host (L1VH) under Hyper-V, the MSHV
root partition driver deposits pages to the hypervisor and creates
partitions for guest VMs. Prior patches enabled kexec for L1VH, but
only when no partitions had been created and no memory had been donated.

This series lifts that limitation. It uses KHO (Kexec Handover) to:

 - Track all pages deposited to the hypervisor in a KHO radix tree
   and preserve them across kexec so the new kernel knows which pages
   are owned by the hypervisor.

 - Freeze running partitions before kexec, record their IDs in the
   KHO FDT, and vacuum (tear down + reclaim memory) stale partitions
   after kexec.

 - In case of a crash, exclude hypervisor-owned pages from crash
   dump collection by passing the radix tree root PA via Hyper-V
   crash MSR P2 to the crash kernel.

Dependency on Pratyush's KHO series
===================================

Patches 1-12 are cherry-picked from Pratyush Yadav's v1 series
"kho: make boot time huge page allocation work nicely with KHO" [1],
which is still under discussion. This series uses functionality from
those patches -- specifically the meta-data page enumeration via table
callbacks and the restructured radix tree API. It also extends the
KHO radix tree with:

 - A freeze mechanism to lock the tree before serializing for kexec
   (patch 13).

 - A crash-kernel-safe variant that memremaps radix nodes for use
   outside the direct map (patch 14).

Patch overview
==============

Patches 1-12:  KHO radix tree and memblock changes (from [1])
Patch 13:      Radix tree freeze and del_key() error reporting
Patch 14:      Crash-kernel-safe radix tree presence check
Patch 15:      Page tracker using KHO radix tree for deposited pages
Patch 16:      Debugfs interface for page tracker
Patches 17-18: Crash MSR reshuffling + crash dump page exclusion
Patch 19:      Export kexec_in_progress for modules
Patch 20:      Freeze and vacuum partitions across kexec

Feedback
========

This is an RFC. I am looking for feedback on the overall approach as
well as the KHO changes (patches 13-14).

[1] https://lore.kernel.org/linux-mm/20260429133928.850721-1-pratyush@kernel.org/

Based-on: linux-next/master (next-20260527)

Jork Loeser (8):
  kho: add radix tree freeze and del_key() error reporting
  kho: Add crash-kernel-safe radix tree presence check
  mshv: Use page tracker to manage MSHV-owned pages and preserve with
    KHO
  mshv: Add debugfs interface to page tracker
  hyperv: Reserve crash MSR P2 for page preservation root PA
  mshv: Exclude Hyper-V donated pages from crash dump collection
  kexec: export kexec_in_progress for modules
  mshv: freeze and vacuum partitions across kexec

Pratyush Yadav (Google) (12):
  kho: generalize radix tree APIs
  kho: store incoming radix tree in kho_in
  kho: add a struct for radix callbacks
  kho: add callback for table pages
  kho: add data argument to radix walk callback
  kho: allow early-boot usage of the KHO radix tree
  kho: allow destroying KHO radix tree
  kho: add kho_radix_init_tree()
  memblock: introduce MEMBLOCK_KHO_SCRATCH_EXT
  kho: extended scratch
  kho: return virtual address of mem_map
  mm/hugetlb: make bootmem allocation work with KHO


--
2.43.0
Re: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
Posted by Mike Rapoport 1 week ago
Hi Jork,

Only had time to skim through the patches.
I have a couple of high level questions for now.

On Wed, May 27, 2026 at 05:41:42PM -0700, Jork Loeser wrote:
> When Linux runs as an L1 Virtual Host (L1VH) under Hyper-V, the MSHV
> root partition driver deposits pages to the hypervisor and creates
> partitions for guest VMs. Prior patches enabled kexec for L1VH, but
> only when no partitions had been created and no memory had been donated.
> 
> This series lifts that limitation. It uses KHO (Kexec Handover) to:
> 
>  - Track all pages deposited to the hypervisor in a KHO radix tree
>    and preserve them across kexec so the new kernel knows which pages
>    are owned by the hypervisor.
> 
>  - Freeze running partitions before kexec, record their IDs in the
>    KHO FDT, and vacuum (tear down + reclaim memory) stale partitions
>    after kexec.
> 
>  - In case of a crash, exclude hypervisor-owned pages from crash
>    dump collection by passing the radix tree root PA via Hyper-V
>    crash MSR P2 to the crash kernel.
> 
> Dependency on Pratyush's KHO series
> ===================================
> 
> Patches 1-12 are cherry-picked from Pratyush Yadav's v1 series
> "kho: make boot time huge page allocation work nicely with KHO" [1],
> which is still under discussion. This series uses functionality from
> those patches -- specifically the meta-data page enumeration via table
> callbacks and the restructured radix tree API. It also extends the
> KHO radix tree with:
> 
>  - A freeze mechanism to lock the tree before serializing for kexec
>    (patch 13).

There was a lot of effort to make KHO stateless and drop the requirement
for finalization/freeze.

Why is it necessary to add a freeze mechanism to kho_radix_tree?
If it's a hard requirement of mshv, maybe the freeze part should be handled
there?
 
>  - A crash-kernel-safe variant that memremaps radix nodes for use
>    outside the direct map (patch 14).
> 
> Patch overview
> ==============
> 
> Patches 1-12:  KHO radix tree and memblock changes (from [1])
> Patch 13:      Radix tree freeze and del_key() error reporting

del_key() error reporting sounds like something we'd want to avoid.
del_key() is called on "freeing" path and during error handling, it would
be hard if at all possible to deal with errors from del_key().

> Patch 14:      Crash-kernel-safe radix tree presence check
> Patch 15:      Page tracker using KHO radix tree for deposited pages
> Patch 16:      Debugfs interface for page tracker
> Patches 17-18: Crash MSR reshuffling + crash dump page exclusion
> Patch 19:      Export kexec_in_progress for modules

Isn't there another way to differentiate kexec reboot?

> Patch 20:      Freeze and vacuum partitions across kexec
> 
> Feedback
> ========
> 
> This is an RFC. I am looking for feedback on the overall approach as
> well as the KHO changes (patches 13-14).
> 
> [1] https://lore.kernel.org/linux-mm/20260429133928.850721-1-pratyush@kernel.org/
> 
> Based-on: linux-next/master (next-20260527)

-- 
Sincerely yours,
Mike.
Re: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
Posted by Jork Loeser 6 days, 19 hours ago

On Sun, 31 May 2026, Mike Rapoport wrote:

> Hi Jork,

>>  - A freeze mechanism to lock the tree before serializing for kexec
>>    (patch 13).
>
> There was a lot of effort to make KHO stateless and drop the requirement
> for finalization/freeze.
>
> Why is it necessary to add a freeze mechanism to kho_radix_tree?
> If it's a hard requirement of mshv, maybe the freeze part should be handled
> there?

Good feedback. It's a safety net so that we do not accidentally donate pages
without being able to track them. I thought it might be a good generic
feature. Let me keep it in the MSHV driver.
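
To illustrate the safety net described above, here is a minimal userspace
sketch of a driver-side freeze gate. It is only an illustration under stated
assumptions: `struct page_tracker`, `tracker_add()`, `tracker_freeze()` and
`tracker_contains()` are hypothetical names, and a small fixed array stands in
for the KHO radix tree that the series actually uses.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TRACKER_CAP 16	/* arbitrary capacity, for the sketch only */

/* Hypothetical tracker; a flat array replaces the radix tree here. */
struct page_tracker {
	uint64_t pfns[TRACKER_CAP];
	size_t count;
	bool frozen;	/* set once the state is being serialized for kexec */
};

/* Record a page before donating it; refuse once the tracker is frozen. */
static int tracker_add(struct page_tracker *t, uint64_t pfn)
{
	if (t->frozen)
		return -EBUSY;	/* donation must not proceed untracked */
	if (t->count == TRACKER_CAP)
		return -ENOMEM;
	t->pfns[t->count++] = pfn;
	return 0;
}

/* Lock the tracker; later tracker_add() calls fail with -EBUSY. */
static void tracker_freeze(struct page_tracker *t)
{
	t->frozen = true;
}

/* Presence check, e.g. to exclude hypervisor-owned pages after kexec. */
static bool tracker_contains(const struct page_tracker *t, uint64_t pfn)
{
	for (size_t i = 0; i < t->count; i++)
		if (t->pfns[i] == pfn)
			return true;
	return false;
}
```

In this sketch the caller would only deposit a page when `tracker_add()`
returns 0, so a page can never reach the hypervisor without being recorded.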

>> Patch 13:      Radix tree freeze and del_key() error reporting
>
> del_key() error reporting sounds like something we'd want to avoid.
> del_key() is called on "freeing" path and during error handling, it would
> be hard if at all possible to deal with errors from del_key().

I hear you. Stating "yeah, it can only really fail if the key isn't there,
or it's frozen, but not due to other things, so don't bother to check the
return code if you are sure" is an odd contract. With the freeze logic
moving into MSHV, I will revert del_key() to not returning an error.

>> Patch 19:      Export kexec_in_progress for modules
>
> Isn't there another way to differentiate kexec reboot?

I could not find one, unfortunately.

> Sincerely yours,
> Mike.

Best,
Jork
Re: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
Posted by Mike Rapoport 5 days, 6 hours ago
On Mon, Jun 01, 2026 at 01:09:41PM -0700, Jork Loeser wrote:
> On Sun, 31 May 2026, Mike Rapoport wrote:
> 
> > > Patch 19:      Export kexec_in_progress for modules
> > 
> > Isn't there another way to differentiate kexec reboot?

There's that "kexec reboot" string passed as the cmd to the reboot
notifier.
Maybe we can turn it into a better-defined API and use it?
 
> I could not find one, unfortunately.
> 
> > Sincerely yours,
> > Mike.
> 
> Best,
> Jork

-- 
Sincerely yours,
Mike.
Re: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
Posted by Jork Loeser 4 days, 22 hours ago

On Wed, 3 Jun 2026, Mike Rapoport wrote:

> On Mon, Jun 01, 2026 at 01:09:41PM -0700, Jork Loeser wrote:
>> On Sun, 31 May 2026, Mike Rapoport wrote:
>>
>>>> Patch 19:      Export kexec_in_progress for modules
>>>
>>> Isn't there another way to differentiate kexec reboot?
>
> There's that "kexec reboot" string passed as the cmd to the reboot
> notifier.
> Maybe we can turn it into a better-defined API and use it?

A string? Dear me, the compiler won't flag it on an API change then, which
is clearly not ideal. What's wrong with exporting kexec_in_progress?

Best,
Jork
Re: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
Posted by Mike Rapoport 4 days, 3 hours ago
On Wed, Jun 03, 2026 at 10:25:58AM -0700, Jork Loeser wrote:
> 
> 
> On Wed, 3 Jun 2026, Mike Rapoport wrote:
> 
> > On Mon, Jun 01, 2026 at 01:09:41PM -0700, Jork Loeser wrote:
> > > On Sun, 31 May 2026, Mike Rapoport wrote:
> > > 
> > > > > Patch 19:      Export kexec_in_progress for modules
> > > > 
> > > > Isn't there another way to differentiate kexec reboot?
> > 
> > There's that "kexec reboot" string passed as the cmd to the reboot
> > notifier.
> > Maybe we can turn it into a better-defined API and use it?
> 
> A string? Dear me, the compiler won't flag it on an API change then, which
> is clearly not ideal. What's wrong with exporting kexec_in_progress?

The general policy is to avoid exports unless strictly necessary.
A string can be declared as const char *KEXEC_REBOOT = "kexec reboot" and
used in both kexec and mshv. Not ideal, but still better.

No strong feelings from my side; EXPORT_SYMBOL there just felt a bit off.
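
A rough userspace sketch of the shared-constant idea follows. Everything in it
is an assumption made for illustration: `KEXEC_REBOOT_CMD`,
`is_kexec_reboot()` and `mshv_reboot_notify()` are hypothetical names, the
`MOCK_*` values are local stand-ins for the kernel's notifier constants, and
the callback drops the `struct notifier_block *` argument that a real reboot
notifier takes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical shared constant, defined once and used by both sides. */
static const char KEXEC_REBOOT_CMD[] = "kexec reboot";

/* Userspace stand-ins for the kernel's notifier values (mock definitions). */
#define MOCK_SYS_RESTART 0x0001UL
#define MOCK_NOTIFY_DONE 0

/* Set when the callback has seen a kexec reboot (for the sketch only). */
static bool kexec_seen;

/* True when the notifier was invoked with the kexec reboot command. */
static bool is_kexec_reboot(unsigned long action, const void *cmd)
{
	return action == MOCK_SYS_RESTART && cmd != NULL &&
	       strcmp((const char *)cmd, KEXEC_REBOOT_CMD) == 0;
}

/* Hypothetical callback shaped like a reboot notifier: (action, cmd). */
static int mshv_reboot_notify(unsigned long action, void *cmd)
{
	if (is_kexec_reboot(action, cmd))
		kexec_seen = true;	/* the driver would freeze partitions here */
	return MOCK_NOTIFY_DONE;
}
```

Because both sides would reference the same symbol, renaming or removing the
constant becomes a compile-time error rather than a silent string mismatch.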
 
> Best,
> Jork

-- 
Sincerely yours,
Mike.
Re: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
Posted by Pasha Tatashin 1 week ago
On 05-31 20:10, Mike Rapoport wrote:
> Hi Jork,
> 
> Only had time to skim through the patches.
> I have a couple of high level questions for now.
> 
> On Wed, May 27, 2026 at 05:41:42PM -0700, Jork Loeser wrote:
> > When Linux runs as an L1 Virtual Host (L1VH) under Hyper-V, the MSHV
> > root partition driver deposits pages to the hypervisor and creates
> > partitions for guest VMs. Prior patches enabled kexec for L1VH, but
> > only when no partitions had been created and no memory had been donated.
> > 
> > This series lifts that limitation. It uses KHO (Kexec Handover) to:
> > 
> >  - Track all pages deposited to the hypervisor in a KHO radix tree
> >    and preserve them across kexec so the new kernel knows which pages
> >    are owned by the hypervisor.
> > 
> >  - Freeze running partitions before kexec, record their IDs in the
> >    KHO FDT, and vacuum (tear down + reclaim memory) stale partitions
> >    after kexec.
> > 
> >  - In case of a crash, exclude hypervisor-owned pages from crash
> >    dump collection by passing the radix tree root PA via Hyper-V
> >    crash MSR P2 to the crash kernel.
> > 
> > Dependency on Pratyush's KHO series
> > ===================================
> > 
> > Patches 1-12 are cherry-picked from Pratyush Yadav's v1 series
> > "kho: make boot time huge page allocation work nicely with KHO" [1],
> > which is still under discussion. This series uses functionality from
> > those patches -- specifically the meta-data page enumeration via table
> > callbacks and the restructured radix tree API. It also extends the
> > KHO radix tree with:
> > 
> >  - A freeze mechanism to lock the tree before serializing for kexec
> >    (patch 13).
> 
> There was a lot of effort to make KHO stateless and drop the requirement
> for finalization/freeze.

Yes, using KHO directly here is incorrect. The state machine is provided 
by LUO, so we should use LUO here. MSHV should provide a file that 
userspace adds to LUO, and all state machine management would be the 
same as for all other clients participating in LU.

> 
> Why is it necessary to add a freeze mechanism to kho_radix_tree?
> If it's a hard requirement of mshv, maybe the freeze part should be handled
> there?
> >  - A crash-kernel-safe variant that memremaps radix nodes for use
> >    outside the direct map (patch 14).
> > 
> > Patch overview
> > ==============
> > 
> > Patches 1-12:  KHO radix tree and memblock changes (from [1])
> > Patch 13:      Radix tree freeze and del_key() error reporting
> 
> del_key() error reporting sounds like something we'd want to avoid.
> del_key() is called on "freeing" path and during error handling, it would
> be hard if at all possible to deal with errors from del_key().
> 
> > Patch 14:      Crash-kernel-safe radix tree presence check
> > Patch 15:      Page tracker using KHO radix tree for deposited pages
> > Patch 16:      Debugfs interface for page tracker
> > Patches 17-18: Crash MSR reshuffling + crash dump page exclusion
> > Patch 19:      Export kexec_in_progress for modules
> 
> Isn't there another way to differentiate kexec reboot?
> 
> > Patch 20:      Freeze and vacuum partitions across kexec
> > 
> > Feedback
> > ========
> > 
> > This is an RFC. I am looking for feedback on the overall approach as
> > well as the KHO changes (patches 13-14).
> > 
> > [1] https://lore.kernel.org/linux-mm/20260429133928.850721-1-pratyush@kernel.org/
> > 
> > Based-on: linux-next/master (next-20260527)
> 
> -- 
> Sincerely yours,
> Mike.
Re: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
Posted by Jork Loeser 6 days, 19 hours ago

On Mon, 1 Jun 2026, Pasha Tatashin wrote:

> On 05-31 20:10, Mike Rapoport wrote:

>>>  - A freeze mechanism to lock the tree before serializing for kexec
>>>    (patch 13).
>>
>> There was a lot of effort to make KHO stateless and drop the requirement
>> for finalization/freeze.
>
> Yes, using KHO directly here is incorrect. The state machine is provided
> by LUO, so we should use LUO here. MSHV should provide a file that
> userspace adds to LUO, and all state machine management would be the
> same as for all other clients participating in LU.

The thing is, there is no file handle to rely on. Even once all partitions
have been removed, Hyper-V might hang onto pages (and won't return them even
if asked). However, these pages must be excluded from Linux post-kexec, or
the system will crash. We cannot rely on user mode (UM) to ensure the
integrity of memory management.

Contrast that with standard LUO use: if you drop individual file handles, or
even skip the LUO phase entirely, the worst that will happen is that the
objects will be gone post-kexec. The MM itself will still be consistent.
For MSHV and page donation, this is different.

(And yes, partition preservation will very much tie into LUO)

Best,
Jork