[PATCH v7 00/12] Direct Map Removal Support for guest_memfd
Posted by Patrick Roy 4 months, 2 weeks ago
From: Patrick Roy <roypat@amazon.co.uk>

[ based on kvm/next ]

Unmapping virtual machine guest memory from the host kernel's direct map is a
successful mitigation against Spectre-style transient execution issues: If the
kernel page tables do not contain entries pointing to guest memory, then any
attempted speculative read through the direct map will necessarily be blocked
by the MMU before any observable microarchitectural side-effects happen. This
means that Spectre gadgets and similar cannot be used to target virtual machine
memory. Roughly 60% of speculative execution issues fall into this category [1,
Table 1].

This patch series extends guest_memfd with the ability to remove its memory
from the host kernel's direct map, so that the above protection can be
attained for KVM guests whose memory is backed by guest_memfd.
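
As an illustration, userspace would opt in roughly like this (a sketch
assuming the GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag added by this series;
error handling omitted):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* vm_fd: VM file descriptor obtained via KVM_CREATE_VM. */
static int create_no_direct_map_gmem(int vm_fd, __u64 size)
{
	struct kvm_create_guest_memfd args = {
		.size  = size,
		.flags = GUEST_MEMFD_FLAG_NO_DIRECT_MAP,
	};

	/* On success, returns a guest_memfd whose folios get removed
	 * from the host's direct map when mapped into the guest. */
	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
}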

Additionally, a Firecracker branch with support for these VMs can be found on
GitHub [2].

For more details, please refer to the v5 cover letter [v5]. No
substantial changes in design have taken place since.

=== Changes Since v6 ===

- Drop patch for passing struct address_space to ->free_folio(), due to
  possible races with freeing of the address_space. (Hugh)
- Stop using PG_uptodate / gmem preparedness tracking to keep track of
  direct map state.  Instead, use the lowest bit of folio->private; a
  sketch of this follows the list. (Mike, David)
- Do direct map removal when establishing mapping of gmem folio instead
  of at allocation time, due to impossibility of handling direct map
  removal errors in kvm_gmem_populate(). (Patrick)
- Do TLB flushes after direct map removal, and provide a module
  parameter to opt out from them, and a new patch to export
  flush_tlb_kernel_range() to KVM. (Will)
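
A rough sketch of how the folio->private bit and the TLB flush fit
together (illustrative only, not the actual patch; the helper name and
bit name are made up):

#define GMEM_FOLIO_NO_DIRECT_MAP	BIT(0)

static int gmem_example_remove_direct_map(struct folio *folio)
{
	unsigned long addr = (unsigned long)folio_address(folio);
	int r;

	/* Invalidate the direct map PTEs covering this folio. */
	r = set_direct_map_valid_noflush(folio_page(folio, 0),
					 folio_nr_pages(folio), false);
	if (r)
		return r;

	/* Record "not in direct map" in the lowest bit of ->private. */
	folio->private = (void *)((unsigned long)folio->private |
				  GMEM_FOLIO_NO_DIRECT_MAP);

	/* Flush stale direct-map TLB entries (the module parameter
	 * mentioned above allows opting out of this). */
	flush_tlb_kernel_range(addr, addr + folio_size(folio));
	return 0;
}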

[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
[2]: https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding
[RFCv1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/
[RFCv2]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/
[RFCv3]: https://lore.kernel.org/kvm/20241030134912.515725-1-roypat@amazon.co.uk/
[v4]: https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/
[v5]: https://lore.kernel.org/kvm/20250828093902.2719-1-roypat@amazon.co.uk/
[v6]: https://lore.kernel.org/kvm/20250912091708.17502-1-roypat@amazon.co.uk/


Patrick Roy (12):
  arch: export set_direct_map_valid_noflush to KVM module
  x86/tlb: export flush_tlb_kernel_range to KVM module
  mm: introduce AS_NO_DIRECT_MAP
  KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate
  KVM: guest_memfd: Add flag to remove from direct map
  KVM: guest_memfd: add module param for disabling TLB flushing
  KVM: selftests: load elf via bounce buffer
  KVM: selftests: set KVM_MEM_GUEST_MEMFD in vm_mem_add() if guest_memfd
    != -1
  KVM: selftests: Add guest_memfd based vm_mem_backing_src_types
  KVM: selftests: cover GUEST_MEMFD_FLAG_NO_DIRECT_MAP in existing
    selftests
  KVM: selftests: stuff vm_mem_backing_src_type into vm_shape
  KVM: selftests: Test guest execution from direct map removed gmem
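
(For orientation, the AS_NO_DIRECT_MAP patch presumably adds pagemap
helpers along these lines; this is a guess modelled on the existing
mapping_* bit helpers, not the actual patch:)

static inline void mapping_set_no_direct_map(struct address_space *mapping)
{
	set_bit(AS_NO_DIRECT_MAP, &mapping->flags);
}

static inline bool mapping_no_direct_map(struct address_space *mapping)
{
	return test_bit(AS_NO_DIRECT_MAP, &mapping->flags);
}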

 Documentation/virt/kvm/api.rst                |  5 ++
 arch/arm64/include/asm/kvm_host.h             | 12 ++++
 arch/arm64/mm/pageattr.c                      |  1 +
 arch/loongarch/mm/pageattr.c                  |  1 +
 arch/riscv/mm/pageattr.c                      |  1 +
 arch/s390/mm/pageattr.c                       |  1 +
 arch/x86/include/asm/tlbflush.h               |  3 +-
 arch/x86/mm/pat/set_memory.c                  |  1 +
 arch/x86/mm/tlb.c                             |  1 +
 include/linux/kvm_host.h                      |  9 +++
 include/linux/pagemap.h                       | 16 +++++
 include/linux/secretmem.h                     | 18 -----
 include/uapi/linux/kvm.h                      |  2 +
 lib/buildid.c                                 |  4 +-
 mm/gup.c                                      | 19 ++----
 mm/mlock.c                                    |  2 +-
 mm/secretmem.c                                |  8 +--
 .../testing/selftests/kvm/guest_memfd_test.c  |  2 +
 .../testing/selftests/kvm/include/kvm_util.h  | 37 ++++++++---
 .../testing/selftests/kvm/include/test_util.h |  8 +++
 tools/testing/selftests/kvm/lib/elf.c         |  8 +--
 tools/testing/selftests/kvm/lib/io.c          | 23 +++++++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 61 +++++++++--------
 tools/testing/selftests/kvm/lib/test_util.c   |  8 +++
 tools/testing/selftests/kvm/lib/x86/sev.c     |  1 +
 .../selftests/kvm/pre_fault_memory_test.c     |  1 +
 .../selftests/kvm/set_memory_region_test.c    | 50 ++++++++++++--
 .../kvm/x86/private_mem_conversions_test.c    |  7 +-
 virt/kvm/guest_memfd.c                        | 66 +++++++++++++++++--
 virt/kvm/kvm_main.c                           |  8 +++
 30 files changed, 290 insertions(+), 94 deletions(-)


base-commit: a6ad54137af92535cfe32e19e5f3bc1bb7dbd383
-- 
2.51.0
Re: [PATCH v7 00/12] Direct Map Removal Support for guest_memfd
Posted by Brendan Jackman 3 months ago
On Wed Sep 24, 2025 at 3:10 PM UTC, Patrick Roy wrote:
> [...]
>
> [2]: https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding

I just got around to trying this out: I checked out this patchset at
its base-commit and grabbed the Firecracker branch. Things seem OK until
I set the secrets_free flag in the Firecracker config, which IIUC makes
it set GUEST_MEMFD_FLAG_NO_DIRECT_MAP.

If I set it, I find the guest doesn't show anything on the console.
Running it in a VM and attaching GDB suggests that it's entering the
guest repeatedly; it doesn't seem like the vCPU thread is stuck or
anything. I'm a bit clueless about how to debug that (so far, whenever
I've broken KVM, things always exploded very dramatically).

Anyway, if I then kill the firecracker process, the host sometimes
crashes. I think this is the most suggestive splat I've seen:

[   99.673420][    T2] BUG: unable to handle page fault for address: ffff888012804000
[   99.676216][    T2] #PF: supervisor write access in kernel mode
[   99.678381][    T2] #PF: error_code(0x0002) - not-present page
[   99.680499][    T2] PGD 2e01067 P4D 2e01067 PUD 2e02067 PMD 12801063 PTE 800fffffed7fb020
[   99.683374][    T2] Oops: Oops: 0002 [#1] SMP
[   99.685004][    T2] CPU: 0 UID: 0 PID: 2 Comm: kthreadd Not tainted 6.17.0-rc7-00366-g473c46a3cb2a #106 NONE 
[   99.688514][    T2] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.1 11/11/2019
[   99.691547][    T2] RIP: 0010:clear_page_erms+0x7/0x10
[   99.693440][    T2] Code: 48 89 47 18 48 89 47 20 48 89 47 28 48 89 47 30 48 89 47 38 48 8d 7f 40 75 d9 90 c3 0f 1f 80 00 00 00 00 b9 00 10 00 00 31 c0 <f3> aa c3 66 0f 1f 44 00 00 48 83 f9 40 73 2a 83 f9 08 73 0f 85 c9
[   99.700188][    T2] RSP: 0018:ffff88800318fc10 EFLAGS: 00010246
[   99.702321][    T2] RAX: 0000000000000000 RBX: 0000000000400dc0 RCX: 0000000000001000
[   99.705100][    T2] RDX: ffffea00004a0100 RSI: ffffea00004a0200 RDI: ffff888012804000
[   99.707861][    T2] RBP: 0000000000000801 R08: 0000000000000000 R09: 0000000000000000
[   99.710648][    T2] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
[   99.713412][    T2] R13: 0000000000000801 R14: ffffea00004a0100 R15: ffffffff81f4df80
[   99.716191][    T2] FS:  0000000000000000(0000) GS:ffff8880bbf28000(0000) knlGS:0000000000000000
[   99.719316][    T2] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   99.721648][    T2] CR2: ffff888012804000 CR3: 0000000007583001 CR4: 0000000000372eb0
[   99.724421][    T2] Call Trace:
[   99.725608][    T2]  <TASK>
[   99.726646][    T2]  get_page_from_freelist+0x6fe/0x14b0
[   99.728583][    T2]  ? fs_reclaim_acquire+0x43/0xe0
[   99.730325][    T2]  ? find_held_lock+0x2b/0x80
[   99.731965][    T2]  __alloc_frozen_pages_noprof+0x147/0x2d0
[   99.734003][    T2]  __alloc_pages_noprof+0x5/0x50
[   99.735766][    T2]  copy_process+0x1b1/0x1b30
[   99.737398][    T2]  ? lock_is_held_type+0x89/0x100
[   99.739157][    T2]  ? kthreadd+0x25/0x190
[   99.740664][    T2]  kernel_clone+0x59/0x390
[   99.742213][    T2]  ? kthreadd+0x25/0x190
[   99.743728][    T2]  kernel_thread+0x55/0x70
[   99.745310][    T2]  ? kthread_complete_and_exit+0x20/0x20
[   99.747265][    T2]  kthreadd+0x117/0x190
[   99.748748][    T2]  ? kthread_is_per_cpu+0x30/0x30
[   99.750509][    T2]  ret_from_fork+0x16b/0x1e0
[   99.752193][    T2]  ? kthread_is_per_cpu+0x30/0x30
[   99.753992][    T2]  ret_from_fork_asm+0x11/0x20
[   99.755717][    T2]  </TASK>
[   99.756861][    T2] CR2: ffff888012804000
[   99.758353][    T2] ---[ end trace 0000000000000000 ]---
[   99.760319][    T2] RIP: 0010:clear_page_erms+0x7/0x10
[   99.762209][    T2] Code: 48 89 47 18 48 89 47 20 48 89 47 28 48 89 47 30 48 89 47 38 48 8d 7f 40 75 d9 90 c3 0f 1f 80 00 00 00 00 b9 00 10 00 00 31 c0 <f3> aa c3 66 0f 1f 44 00 00 48 83 f9 40 73 2a 83 f9 08 73 0f 85 c9
[   99.769129][    T2] RSP: 0018:ffff88800318fc10 EFLAGS: 00010246
[   99.771297][    T2] RAX: 0000000000000000 RBX: 0000000000400dc0 RCX: 0000000000001000
[   99.774126][    T2] RDX: ffffea00004a0100 RSI: ffffea00004a0200 RDI: ffff888012804000
[   99.777013][    T2] RBP: 0000000000000801 R08: 0000000000000000 R09: 0000000000000000
[   99.779827][    T2] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
[   99.782641][    T2] R13: 0000000000000801 R14: ffffea00004a0100 R15: ffffffff81f4df80
[   99.785487][    T2] FS:  0000000000000000(0000) GS:ffff8880bbf28000(0000) knlGS:0000000000000000
[   99.788671][    T2] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   99.791012][    T2] CR2: ffff888012804000 CR3: 0000000007583001 CR4: 0000000000372eb0
[   99.793863][    T2] Kernel panic - not syncing: Fatal exception
[   99.796760][    T2] Kernel Offset: disabled
[   99.798296][    T2] ---[ end Kernel panic - not syncing: Fatal exception ]---

This makes me suspect the kvm_gmem_folio_restore_direct_map() path isn't
working or isn't getting called.

If anyone wants help trying to reproduce this let me know.
Re: [PATCH v7 00/12] Direct Map Removal Support for guest_memfd
Posted by Brendan Jackman 3 months ago
On Fri Nov 7, 2025 at 3:54 PM UTC, Brendan Jackman wrote:
> On Wed Sep 24, 2025 at 3:10 PM UTC, Patrick Roy wrote:
> [...]
>
> If I set it, I find the guest doesn't show anything on the console.
> Running it in a VM and attaching GDB suggests that it's entering the
> guest repeatedly; it doesn't seem like the vCPU thread is stuck or
> anything. I'm a bit clueless about how to debug that (so far, whenever
> I've broken KVM, things always exploded very dramatically).

I discovered that Firecracker has a GDB stub, so I can just attach to
that and see what the guest is up to.

The issue is that the pvclock_vcpu_time_info in kvmclock is all zero:

(gdb) backtrace
#0  pvclock_tsc_khz (src=0xffffffff83a03000 <hv_clock_boot>) at ../arch/x86/kernel/pvclock.c:28
#1  0xffffffff8109d137 in kvm_get_tsc_khz () at ../arch/x86/include/asm/kvmclock.h:11
#2  0xffffffff835c1842 in kvm_get_preset_lpj () at ../arch/x86/kernel/kvmclock.c:128
#3  kvmclock_init () at ../arch/x86/kernel/kvmclock.c:332
#4  0xffffffff835c1487 in kvm_init_platform () at ../arch/x86/kernel/kvm.c:982
#5  0xffffffff835a83df in setup_arch (cmdline_p=cmdline_p@entry=0xffffffff82e03f00) at ../arch/x86/kernel/setup.c:916
#6  0xffffffff83595a22 in start_kernel () at ../init/main.c:925
#7  0xffffffff835a7354 in x86_64_start_reservations (
    real_mode_data=real_mode_data@entry=0x36326c0 <error: Cannot access memory at address 0x36326c0>) at ../arch/x86/kernel/head64.c:507
#8  0xffffffff835a7466 in x86_64_start_kernel (real_mode_data=0x36326c0 <error: Cannot access memory at address 0x36326c0>)
    at ../arch/x86/kernel/head64.c:488
#9  0xffffffff8103e7fd in secondary_startup_64 () at ../arch/x86/kernel/head_64.S:413
#10 0x0000000000000000 in ?? ()
(gdb) p *src
$3 = {version = 0, pad0 = 0, tsc_timestamp = 0, system_time = 0, tsc_to_system_mul = 0, tsc_shift = 0 '\000', flags = 0 '\000', 
  pad = "\000"}

This causes a divide by zero in kvm_get_tsc_khz().
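
For reference, pvclock_tsc_khz() in arch/x86/kernel/pvclock.c divides by
tsc_to_system_mul, roughly like this (quoted from memory, may differ in
detail):

unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
{
	u64 pv_tsc_khz = 1000000ULL << 32;

	/* src->tsc_to_system_mul == 0 here => division by zero. */
	do_div(pv_tsc_khz, src->tsc_to_system_mul);
	if (src->tsc_shift < 0)
		pv_tsc_khz <<= -src->tsc_shift;
	else
		pv_tsc_khz >>= src->tsc_shift;
	return pv_tsc_khz;
}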

Probably the only reason I didn't see any console output is that I
forgot to set earlyprintk, oops...
Re: [PATCH v7 00/12] Direct Map Removal Support for guest_memfd
Posted by Nikita Kalyazin 3 months ago

On 07/11/2025 15:54, Brendan Jackman wrote:
> On Wed Sep 24, 2025 at 3:10 PM UTC, Patrick Roy wrote:
>> [...]
>
> I just got around to trying this out: I checked out this patchset at
> its base-commit and grabbed the Firecracker branch. Things seem OK until
> I set the secrets_free flag in the Firecracker config, which IIUC makes
> it set GUEST_MEMFD_FLAG_NO_DIRECT_MAP.
>
> [...]
> 
> This makes me suspect the kvm_gmem_folio_restore_direct_map() path isn't
> working or isn't getting called.
> 
> If anyone wants help trying to reproduce this let me know.

Hi Brendan,

Thanks for trying to run it!

Just as a sanity check: the configuration we know to work is with all
patches from [1] applied.  For VMs that are booted (as opposed to
restored from a snapshot), apart from v6 of the direct map removal
series, the only additional patch is a fix for kvmclock on x86 [2].
Please let me know if you see the same issue with that patch applied too.

Nikita

[1] 
https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding/resources/hiding_ci/linux_patches
[2] 
https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding/resources/hiding_ci/linux_patches/11-kvm-clock
Re: [PATCH v7 00/12] Direct Map Removal Support for guest_memfd
Posted by Brendan Jackman 3 months ago
On Fri Nov 7, 2025 at 5:23 PM UTC, Nikita Kalyazin wrote:
>> [...]
>
> Hi Brendan,
>
> Thanks for trying to run it!
>
> Just as a sanity check: the configuration we know to work is with all
> patches from [1] applied.  For VMs that are booted (as opposed to
> restored from a snapshot), apart from v6 of the direct map removal
> series, the only additional patch is a fix for kvmclock on x86 [2].
> Please let me know if you see the same issue with that patch applied too.
>
> Nikita
>
> [1] 
> https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding/resources/hiding_ci/linux_patches
> [2] 
> https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding/resources/hiding_ci/linux_patches/11-kvm-clock

Ah, thanks! Seems I should have checked my inbox before sending my other
mail. With the kvmclock fix applied to my host kernel, I start seeing
the other crash immediately when the VM boots. If I comment out the
actual unmapping of memory, it boots (before, it wouldn't boot even with
that commented out).

For the other linux_patches, I couldn't apply them on top of this
series, do you have a branch I can use as a reference?

Anyway, the solution I'm hoping to present for your problem gets rid of
that explicit unmapping code (the allocator will do it for you), so in
the meantime I have something I can work with.
Re: [PATCH v7 00/12] Direct Map Removal Support for guest_memfd
Posted by Nikita Kalyazin 3 months ago

On 07/11/2025 18:04, Brendan Jackman wrote:
> On Fri Nov 7, 2025 at 5:23 PM UTC, Nikita Kalyazin wrote:
>> [...]
> 
> Ah, thanks! Seems I should have checked my inbox before sending my other
> mail. With the kvmclock fix applied to my host kernel, I start seeing
> the other crash immediately when the VM boots. If I comment out the
> actual unmapping of memory, it boots (before, it wouldn't boot even with
> that commented out).
> 
> For the other linux_patches, I couldn't apply them on top of this
> series, do you have a branch I can use as a reference?

Instead of having an explicit branch, we apply all the patches on top of 
[1].  There is a script that performs fetch/build/install end-to-end: [2].

[1] 
https://github.com/firecracker-microvm/firecracker/blob/feature/secret-hiding/resources/hiding_ci/kernel_commit_hash
[2] 
https://github.com/firecracker-microvm/firecracker/blob/feature/secret-hiding/resources/hiding_ci/build_and_install_kernel.sh

> 
> Anyway, the solution I'm hoping to present for your problem gets rid of
> that explicit unmapping code (the allocator will do it for you), so in
> the meantime I have something I can work with.
Re: [PATCH v7 00/12] Direct Map Removal Support for guest_memfd
Posted by Brendan Jackman 2 months, 4 weeks ago
On Fri Nov 7, 2025 at 6:11 PM UTC, Nikita Kalyazin wrote:
>> [...]
>> 
>> For the other linux_patches, I couldn't apply them on top of this
>> series, do you have a branch I can use as a reference?
>
> Instead of having an explicit branch, we apply all the patches on top of 
> [1].  There is a script that performs fetch/build/install end-to-end: [2].
>
> [1] 
> https://github.com/firecracker-microvm/firecracker/blob/feature/secret-hiding/resources/hiding_ci/kernel_commit_hash
> [2] 
> https://github.com/firecracker-microvm/firecracker/blob/feature/secret-hiding/resources/hiding_ci/build_and_install_kernel.sh

Thanks, I was able to construct a branch and confirm the crashes go
away. I guess this should block merging the feature though, right? Do
you know which of the patches in particular are the likely relevant ones
here?
Re: [PATCH v7 00/12] Direct Map Removal Support for guest_memfd
Posted by Roy, Patrick 4 months, 2 weeks ago
_sigh_

I tried to submit this iteration from a personal email, because amazon's mail
server was scrambling the "From" header and I couldn't figure out why (and also
because I am leaving Amazon next month and wanted replies to go into an inbox
to which I'll continue to have access). And then after posting the first 4
emails I hit "daily mail quota exceeded", and had to submit the rest of the
patch series from the amazon email anyway. Sorry about the resulting mess (I
think the threading got slightly messed up as a result of this). I'll figure
something else out for the next iteration.

Re: [PATCH v7 00/12] Direct Map Removal Support for guest_memfd
Posted by David Hildenbrand 4 months, 2 weeks ago
On 24.09.25 17:29, Roy, Patrick wrote:
> _sigh_

Happens to the best of us :)

> 
> I tried to submit this iteration from a personal email, because amazon's mail
> server was scrambling the "From" header and I couldn't figure out why (and also
> because I am leaving Amazon next month and wanted replies to go into an inbox
> to which I'll continue to have access). And then after posting the first 4
> emails I hit "daily mail quota exceeded", and had to submit the rest of the
> patch series from the amazon email anyway. Sorry about the resulting mess (I
> think the threading got slightly messed up as a result of this). I'll figure
> something else out for the next iteration.

I had luck recovering from temporary mail server issues in the past by
sending the remainder with "--in-reply-to=" set to the message-id of the
cover letter, and using "--no-thread", IIRC.
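
Something like this, I believe (the message-id and patch file names here
are placeholders):

  git send-email --in-reply-to='<cover-letter-message-id@example.com>' \
                 --no-thread v7-00*.patch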

-- 
Cheers

David / dhildenb