[PATCH v11 0/8] KVM: allow mapping non-refcounted pages

David Stevens posted 8 patches 1 year, 9 months ago
[PATCH v11 0/8] KVM: allow mapping non-refcounted pages
Posted by David Stevens 1 year, 9 months ago
From: David Stevens <stevensd@chromium.org>

This patch series adds support for mapping VM_IO and VM_PFNMAP memory
that is backed by struct pages that aren't currently being refcounted
(e.g. tail pages of non-compound higher order allocations) into the
guest.

Our use case is virtio-gpu blob resources [1], which directly map host
graphics buffers into the guest as "vram" for the virtio-gpu device.
This feature currently does not work on systems using the amdgpu driver,
as that driver allocates non-compound higher order pages via
ttm_pool_alloc_page().

First, this series replaces the gfn_to_pfn_memslot() API with a more
extensible kvm_follow_pfn() API. The updated API rearranges
gfn_to_pfn_memslot()'s args into a struct and where possible packs the
bool arguments into a FOLL_ flags argument. The refactoring changes do
not change any behavior.
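
As a rough illustration of the shape of the new API (a sketch only: the
field names below are approximations inferred from this cover letter and
the changelog, not the exact code in the series):

  struct kvm_follow_pfn {
          /* inputs */
          const struct kvm_memory_slot *slot;  /* memslot containing the gfn */
          gfn_t gfn;                           /* guest frame to translate */
          unsigned int flags;                  /* FOLL_WRITE, FOLL_NOWAIT, ... */
          bool atomic;                         /* fail rather than sleep */
          /* outputs */
          bool *map_writable;                  /* may the pfn be mapped writable? */
  };

  kvm_pfn_t kvm_follow_pfn(struct kvm_follow_pfn *kfp);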

From there, this series extends the kvm_follow_pfn() API so that
non-refcounted pages can be handled safely. This involves adding an
input parameter to indicate whether the caller can safely use
non-refcounted pfns and an output parameter to tell the caller whether
or not the returned page is refcounted. This is a breaking change:
non-refcounted pfn mappings are now disallowed by default, since such
mappings are unsafe. To allow affected systems to continue to function,
an opt-in module parameter is added that re-enables the unsafe behavior.
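
Concretely, the extension looks something like the sketch below. The names
refcounted_page, allow_non_refcounted_struct_page and allow_unsafe_mappings
are taken from the changelog further down; how they fit together here is an
approximation of the series, not the literal code:

  struct kvm_follow_pfn {
          /* fields from the refactoring above, plus: */

          /* in: caller can cope with a pfn whose struct page is not refcounted */
          bool allow_non_refcounted_struct_page;
          /* out: page KVM took a reference on, or NULL if no reference was taken */
          struct page *refcounted_page;
  };

  /* Opt-in escape hatch that re-enables the old, unsafe behavior. */
  static bool allow_unsafe_mappings;
  module_param(allow_unsafe_mappings, bool, 0444);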

This series only adds support for non-refcounted pages to x86. Other
MMUs can likely be updated without too much difficulty, but it is not
needed at this point. Updating other parts of KVM (e.g. pfncache) is not
straightforward [2].

[1]
https://patchwork.kernel.org/project/dri-devel/cover/20200814024000.2485-1-gurchetansingh@chromium.org/
[2] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com/

v10 -> v11:
 - Switch to u64 __read_mostly shadow_refcounted_mask.
 - Update comments about allow_non_refcounted_struct_page.
v9 -> v10:
 - Re-add FOLL_GET changes.
 - Split x86/mmu spte+non-refcount-page patch into two patches.
 - Rename 'foll' variables to 'kfp'.
 - Properly gate usage of refcount spte bit when it's not available.
 - Replace kvm_follow_pfn's is_refcounted_page output parameter with
   a struct page *refcounted_page pointing to the page in question.
 - Add patch downgrading BUG_ON to WARN_ON_ONCE.
v8 -> v9:
 - Make paying attention to is_refcounted_page mandatory. This means
   that FOLL_GET is no longer necessary. For compatibility with
   un-migrated callers, add a temporary parameter to sidestep
   ref-counting issues.
 - Add allow_unsafe_mappings, which is a breaking change.
 - Migrate kvm_vcpu_map and other callsites used by x86 to the new API.
 - Drop arm and ppc changes.
v7 -> v8:
 - Set access bits before releasing mmu_lock.
 - Pass FOLL_GET on 32-bit x86 or !tdp_enabled.
 - Refactor FOLL_GET handling, add kvm_follow_refcounted_pfn helper.
 - Set refcounted bit on >4k pages.
 - Add comments and apply formatting suggestions.
 - rebase on kvm next branch.
v6 -> v7:
 - Replace __gfn_to_pfn_memslot with a more flexible __kvm_faultin_pfn,
   and extend that API to support non-refcounted pages (complete
   rewrite).

David Stevens (7):
  KVM: Relax BUG_ON argument validation
  KVM: mmu: Introduce kvm_follow_pfn()
  KVM: mmu: Improve handling of non-refcounted pfns
  KVM: Migrate kvm_vcpu_map() to kvm_follow_pfn()
  KVM: x86: Migrate to kvm_follow_pfn()
  KVM: x86/mmu: Track if sptes refer to refcounted pages
  KVM: x86/mmu: Handle non-refcounted pages

Sean Christopherson (1):
  KVM: Assert that a page's refcount is elevated when marking
    accessed/dirty

 arch/x86/kvm/mmu/mmu.c          | 108 +++++++---
 arch/x86/kvm/mmu/mmu_internal.h |   2 +
 arch/x86/kvm/mmu/paging_tmpl.h  |   7 +-
 arch/x86/kvm/mmu/spte.c         |   5 +-
 arch/x86/kvm/mmu/spte.h         |  16 +-
 arch/x86/kvm/mmu/tdp_mmu.c      |  22 +-
 arch/x86/kvm/x86.c              |  11 +-
 include/linux/kvm_host.h        |  58 +++++-
 virt/kvm/guest_memfd.c          |   8 +-
 virt/kvm/kvm_main.c             | 345 +++++++++++++++++++-------------
 virt/kvm/kvm_mm.h               |   3 +-
 virt/kvm/pfncache.c             |  11 +-
 12 files changed, 399 insertions(+), 197 deletions(-)


base-commit: 54be6c6c5ae8e0d93a6c4641cb7528eb0b6ba478
-- 
2.44.0.rc1.240.g4c46232300-goog
Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages
Posted by Sean Christopherson 1 year, 5 months ago
On Thu, Feb 29, 2024, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> This patch series adds support for mapping VM_IO and VM_PFNMAP memory
> that is backed by struct pages that aren't currently being refcounted
> (e.g. tail pages of non-compound higher order allocations) into the
> guest.
> 
> Our use case is virtio-gpu blob resources [1], which directly map host
> graphics buffers into the guest as "vram" for the virtio-gpu device.
> This feature currently does not work on systems using the amdgpu driver,
> as that driver allocates non-compound higher order pages via
> ttm_pool_alloc_page().
> 
> First, this series replaces the gfn_to_pfn_memslot() API with a more
> extensible kvm_follow_pfn() API. The updated API rearranges
> gfn_to_pfn_memslot()'s args into a struct and where possible packs the
> bool arguments into a FOLL_ flags argument. The refactoring changes do
> not change any behavior.
> 
> From there, this series extends the kvm_follow_pfn() API so that
> non-refcounted pages can be handled safely. This involves adding an
> input parameter to indicate whether the caller can safely use
> non-refcounted pfns and an output parameter to tell the caller whether
> or not the returned page is refcounted. This is a breaking change:
> non-refcounted pfn mappings are now disallowed by default, since such
> mappings are unsafe. To allow affected systems to continue to function,
> an opt-in module parameter is added that re-enables the unsafe behavior.
> 
> This series only adds support for non-refcounted pages to x86. Other
> MMUs can likely be updated without too much difficulty, but it is not
> needed at this point. Updating other parts of KVM (e.g. pfncache) is not
> straightforward [2].

FYI, on the off chance that someone else is eyeballing this, I am working on
revamping this series.  It's still a ways out, but I'm optimistic that we'll be
able to address the concerns raised by Christoph and Christian, and maybe even
get KVM out of the weeds straightaway (PPC looks thorny :-/).
Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages
Posted by Alex Bennée 1 year, 4 months ago
Sean Christopherson <seanjc@google.com> writes:

> On Thu, Feb 29, 2024, David Stevens wrote:
>> From: David Stevens <stevensd@chromium.org>
>> 
>> This patch series adds support for mapping VM_IO and VM_PFNMAP memory
>> that is backed by struct pages that aren't currently being refcounted
>> (e.g. tail pages of non-compound higher order allocations) into the
>> guest.
>> 
>> Our use case is virtio-gpu blob resources [1], which directly map host
>> graphics buffers into the guest as "vram" for the virtio-gpu device.
>> This feature currently does not work on systems using the amdgpu driver,
>> as that driver allocates non-compound higher order pages via
>> ttm_pool_alloc_page().
>> 
>> First, this series replaces the gfn_to_pfn_memslot() API with a more
>> extensible kvm_follow_pfn() API. The updated API rearranges
>> gfn_to_pfn_memslot()'s args into a struct and where possible packs the
>> bool arguments into a FOLL_ flags argument. The refactoring changes do
>> not change any behavior.
>> 
>> From there, this series extends the kvm_follow_pfn() API so that
>> non-refcounted pages can be handled safely. This involves adding an
>> input parameter to indicate whether the caller can safely use
>> non-refcounted pfns and an output parameter to tell the caller whether
>> or not the returned page is refcounted. This is a breaking change:
>> non-refcounted pfn mappings are now disallowed by default, since such
>> mappings are unsafe. To allow affected systems to continue to function,
>> an opt-in module parameter is added that re-enables the unsafe behavior.
>> 
>> This series only adds support for non-refcounted pages to x86. Other
>> MMUs can likely be updated without too much difficulty, but it is not
>> needed at this point. Updating other parts of KVM (e.g. pfncache) is not
>> straightforward [2].
>
> FYI, on the off chance that someone else is eyeballing this, I am working on
> revamping this series.  It's still a ways out, but I'm optimistic that we'll be
> able to address the concerns raised by Christoph and Christian, and maybe even
> get KVM out of the weeds straightaway (PPC looks thorny :-/).

I've applied this series to the latest 6.9.x while attempting to
diagnose some of the virtio-gpu problems it may or may not address.
However launching KVM guests keeps triggering a bunch of BUGs that
eventually leave a hung guest:

  12:16:54 [root@draig:~] # dmesg -c                                                                                                                                           
  [252080.141629] RAX: ffffffffffffffda RBX: 0000560a64915500 RCX: 00007faa23e81c5b                                                                                            
  [252080.141629] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000017                                                                                            
  [252080.141630] RBP: 000000000000ae80 R08: 0000000000000000 R09: 0000000000000000                                                                                            
  [252080.141630] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000                                                                                            
  [252080.141631] R13: 0000000000000001 R14: 00000000000000b2 R15: 0000000000000002                                                                                            
  [252080.141632]  </TASK>                                                                                                                                                     
  [252080.141632] BUG: Bad page state in process CPU 0/KVM  pfn:fb1665                                                                                                         
  [252080.141633] page: refcount:0 mapcount:1 mapping:0000000000000000 index:0x7fa8117c3 pfn:0xfb1665                                                                          
  [252080.141633] flags: 0x17ffffc00a000c(referenced|uptodate|mappedtodisk|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)                                                       
  [252080.141634] page_type: 0x0()                                                                                                                                             
  [252080.141635] raw: 0017ffffc00a000c dead000000000100 dead000000000122 0000000000000000                                                                                     
  [252080.141635] raw: 00000007fa8117c3 0000000000000000 0000000000000000 0000000000000000                                                                                     
  [252080.141635] page dumped because: nonzero mapcount                                                                                                                        
  [252080.141636] Modules linked in: vhost_net vhost vhost_iotlb tap tun uas usb_storage veth cfg80211 nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter nft_ma
  sq wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 curve25519_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel rfcomm snd_seq_dummy snd_hrtimer s
  nd_seq xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetl
  ink bridge stp llc qrtr overlay cmac algif_hash algif_skcipher af_alg bnep binfmt_misc squashfs snd_hda_codec_hdmi intel_uncore_frequency snd_ctl_led intel_uncore_frequency_
  common ledtrig_audio x86_pkg_temp_thermal intel_powerclamp coretemp snd_sof_pci_intel_tgl snd_sof_intel_hda_common kvm_intel soundwire_intel soundwire_generic_allocation btu
  sb snd_sof_intel_hda_mlink sd_mod soundwire_cadence btrtl snd_hda_codec_realtek kvm sg snd_sof_intel_hda btintel snd_sof_pci btbcm snd_hda_codec_generic btmtk               
  [252080.141656]  snd_sof_xtensa_dsp crc32_pclmul bluetooth snd_hda_scodec_component ghash_clmulni_intel snd_sof sha256_ssse3 sha1_ssse3 snd_sof_utils snd_soc_hdac_hda snd_hd
  a_ext_core snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core snd_compress soundwire_bus sha3_generic jitterentropy_rng aesni_intel snd_hda_intel snd_intel_dspcfg crypto_sim
  d sha512_ssse3 snd_intel_sdw_acpi cryptd sha512_generic uvcvideo snd_hda_codec snd_usb_audio videobuf2_vmalloc uvc ctr videobuf2_memops snd_hda_core snd_usbmidi_lib videobuf
  2_v4l2 snd_rawmidi drbg snd_hwdep dell_wmi snd_seq_device nls_ascii ahci ansi_cprng iTCO_wdt processor_thermal_device_pci videodev nls_cp437 snd_pcm intel_pmc_bxt dell_smbio
  s libahci processor_thermal_device rapl rtsx_pci_sdmmc iTCO_vendor_support ecdh_generic mmc_core mei_hdcp watchdog libata intel_rapl_msr videobuf2_common rfkill vfat process
  or_thermal_wt_hint pl2303 snd_timer dcdbas dell_wmi_ddv dell_wmi_sysman processor_thermal_rfim ucsi_acpi fat intel_cstate usbserial intel_uncore cdc_acm mc battery ecc      
  [252080.141670]  firmware_attributes_class dell_wmi_descriptor wmi_bmof dell_smm_hwmon processor_thermal_rapl pcspkr scsi_mod mei_me intel_lpss_pci snd typec_ucsi igc e1000e
   i2c_i801 rtsx_pci intel_rapl_common intel_lpss roles mei soundcore processor_thermal_wt_req i2c_smbus idma64 scsi_common processor_thermal_power_floor typec processor_therm
  al_mbox button intel_pmc_core int3403_thermal int340x_thermal_zone intel_vsec pmt_telemetry intel_hid int3400_thermal pmt_class sparse_keymap acpi_tad acpi_pad acpi_thermal_
  rel msr parport_pc ppdev lp parport fuse loop efi_pstore configfs efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 hid_microsoft joydev ff_memless hid_generic usb
  hid hid btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq evdev dm_mod i915 i2c_algo_bit drm_buddy ttm drm_display_helper xhci_pci xhci_hcd drm_kms_helper nvme nvm
  e_core drm t10_pi usbcore video crc64_rocksoft crc64 crc_t10dif cec crct10dif_generic crct10dif_pclmul crc32c_intel rc_core usb_common crct10dif_common wmi                  
  [252080.141686]  pinctrl_alderlake                                                                                                                                           
  [252080.141686] CPU: 8 PID: 1819169 Comm: CPU 0/KVM Tainted: G    B   W          6.9.12-ajb-00008-gfcd4b7efbad0 #17                                                          
  [252080.141687] Hardware name: Dell Inc. Precision 3660/0PRR48, BIOS 2.8.1 08/14/2023                                                                                        
  [252080.141688] Call Trace:                                                                                                                                                  
  [252080.141688]  <TASK>                                                                                                                                                      
  [252080.141688]  dump_stack_lvl+0x60/0x80                                                                                                                                    
  [252080.141689]  bad_page+0x70/0x100                                                                                                                                         
  [252080.141690]  free_unref_page_prepare+0x22a/0x370                                                                                                                         
  [252080.141692]  free_unref_folios+0xe5/0x340                                                                                                                                
  [252080.141693]  ? __mem_cgroup_uncharge_folios+0x7a/0xa0                                                                                                                    
  [252080.141694]  folios_put_refs+0x147/0x1e0                                                                                                                                 
  [252080.141696]  ? __pfx_lru_add_fn+0x10/0x10                                                                                                                                
  [252080.141697]  folio_batch_move_lru+0xc8/0x140                                                                                                                             
  [252080.141699]  folio_add_lru+0x51/0xa0                                                                                                                                     
  [252080.141700]  do_wp_page+0x4dd/0xb60                                                                                                                                      
  [252080.141701]  __handle_mm_fault+0xb2a/0xe30          
  [252080.141703]  handle_mm_fault+0x18c/0x320                                                                                                                                 
  [252080.141704]  __get_user_pages+0x164/0x6f0                                                                                                                                
  [252080.141705]  get_user_pages_unlocked+0xe2/0x370                                                                                                                          
  [252080.141706]  hva_to_pfn+0xa0/0x740 [kvm]                                                                                                                                 
  [252080.141724]  kvm_faultin_pfn+0xf3/0x5f0 [kvm]                                                                                                                            
  [252080.141750]  kvm_tdp_page_fault+0x100/0x150 [kvm]                                                                                                                        
  [252080.141774]  kvm_mmu_page_fault+0x27e/0x7f0 [kvm]                                                                                                                        
  [252080.141798]  ? em_rsm+0xad/0x170 [kvm]                                                                                                                                   
  [252080.141823]  ? writeback_registers+0x44/0x80 [kvm]                                                                                                                       
  [252080.141848]  ? vmx_set_cr0+0xc7/0x1320 [kvm_intel]                                                                                                                       
  [252080.141853]  ? x86_emulate_insn+0x484/0xe60 [kvm]                                                                                                                        
  [252080.141877]  ? vmx_vmexit+0x6e/0xd0 [kvm_intel]                                                                                                                          
  [252080.141882]  ? vmx_vmexit+0x99/0xd0 [kvm_intel]                                                                                                                          
  [252080.141887]  vmx_handle_exit+0x129/0x930 [kvm_intel]                                                                                                                     
  [252080.141892]  kvm_arch_vcpu_ioctl_run+0x682/0x15b0 [kvm]                                                                                                                  
  [252080.141918]  kvm_vcpu_ioctl+0x23d/0x6f0 [kvm]                                                                                                                            
  [252080.141936]  ? __seccomp_filter+0x32f/0x500                                                                                                                              
  [252080.141937]  ? kvm_io_bus_read+0x42/0xd0 [kvm]                                                                                                                           
  [252080.141956]  __x64_sys_ioctl+0x90/0xd0                                                                                                                                   
  [252080.141957]  do_syscall_64+0x80/0x190                                                                                                                                    
  [252080.141958]  ? kvm_arch_vcpu_put+0x126/0x160 [kvm]                                                                                                                       
  [252080.141982]  ? vcpu_put+0x1e/0x50 [kvm]                                                                                                                                  
  [252080.141999]  ? kvm_arch_vcpu_ioctl_run+0x757/0x15b0 [kvm]                                                                                                                
  [252080.142023]  ? kvm_vcpu_ioctl+0x29e/0x6f0 [kvm]                                                                                                                          
  [252080.142040]  ? __seccomp_filter+0x32f/0x500                                                                                                                              
  [252080.142042]  ? kvm_on_user_return+0x60/0x90 [kvm]                                                                                                                        
  [252080.142065]  ? fire_user_return_notifiers+0x30/0x60                                                                                                                      
  [252080.142066]  ? syscall_exit_to_user_mode+0x73/0x200                                                                                                                      
  [252080.142067]  ? do_syscall_64+0x8c/0x190                                                                                                                                  
  [252080.142068]  ? kvm_on_user_return+0x60/0x90 [kvm]                                                                                                                        
  [252080.142090]  ? fire_user_return_notifiers+0x30/0x60                                                                                                                      
  [252080.142091]  ? syscall_exit_to_user_mode+0x73/0x200                                                                                                                      
  [252080.142092]  ? do_syscall_64+0x8c/0x190                                                                                                                                  
  [252080.142093]  ? do_syscall_64+0x8c/0x190                                                                                                                                  
  [252080.142094]  ? do_syscall_64+0x8c/0x190                                                                                                                                  
  [252080.142095]  ? exc_page_fault+0x72/0x170                                                                                                                                 
  [252080.142096]  entry_SYSCALL_64_after_hwframe+0x76/0x7e                                                                                                                    

This backtrace repeats for a large chunk of pfns.

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro
Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages
Posted by David Stevens 1 year, 4 months ago
On Wed, Jul 31, 2024 at 8:41 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Sean Christopherson <seanjc@google.com> writes:
>
> > On Thu, Feb 29, 2024, David Stevens wrote:
> >> From: David Stevens <stevensd@chromium.org>
> >>
> >> This patch series adds support for mapping VM_IO and VM_PFNMAP memory
> >> that is backed by struct pages that aren't currently being refcounted
> >> (e.g. tail pages of non-compound higher order allocations) into the
> >> guest.
> >>
> >> Our use case is virtio-gpu blob resources [1], which directly map host
> >> graphics buffers into the guest as "vram" for the virtio-gpu device.
> >> This feature currently does not work on systems using the amdgpu driver,
> >> as that driver allocates non-compound higher order pages via
> >> ttm_pool_alloc_page().
> >>
> >> First, this series replaces the gfn_to_pfn_memslot() API with a more
> >> extensible kvm_follow_pfn() API. The updated API rearranges
> >> gfn_to_pfn_memslot()'s args into a struct and where possible packs the
> >> bool arguments into a FOLL_ flags argument. The refactoring changes do
> >> not change any behavior.
> >>
> >> From there, this series extends the kvm_follow_pfn() API so that
> >> non-refcounted pages can be handled safely. This involves adding an
> >> input parameter to indicate whether the caller can safely use
> >> non-refcounted pfns and an output parameter to tell the caller whether
> >> or not the returned page is refcounted. This is a breaking change:
> >> non-refcounted pfn mappings are now disallowed by default, since such
> >> mappings are unsafe. To allow affected systems to continue to function,
> >> an opt-in module parameter is added that re-enables the unsafe behavior.
> >>
> >> This series only adds support for non-refcounted pages to x86. Other
> >> MMUs can likely be updated without too much difficulty, but it is not
> >> needed at this point. Updating other parts of KVM (e.g. pfncache) is not
> >> straightforward [2].
> >
> > FYI, on the off chance that someone else is eyeballing this, I am working on
> > revamping this series.  It's still a ways out, but I'm optimistic that we'll be
> > able to address the concerns raised by Christoph and Christian, and maybe even
> > get KVM out of the weeds straightaway (PPC looks thorny :-/).
>
> I've applied this series to the latest 6.9.x while attempting to
> diagnose some of the virtio-gpu problems it may or may not address.
> However launching KVM guests keeps triggering a bunch of BUGs that
> eventually leave a hung guest:
>

Likely the same issue as [1]. Commit d02c357e5bfa added another call
to kvm_release_pfn_clean() in kvm_faultin_pfn(), which ends up
releasing a reference that is no longer being taken. If you replace
that with kvm_set_page_accessed() instead, then things should work
again. I didn't send out a rebased version of the series, since Sean's
work supersedes my series.
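
A rough sketch of that fixup (untested, against this series applied on top
of 6.9; the exact argument depends on the series' bookkeeping, so treat the
hunk below as illustrative rather than a real patch):

  	/* in kvm_faultin_pfn()'s retry path added by d02c357e5bfa */
  -	kvm_release_pfn_clean(fault->pfn);
  +	kvm_set_page_accessed(pfn_to_page(fault->pfn));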

-David

[1] https://lore.kernel.org/lkml/15865985-4688-4b7e-9f2d-89803adb8f5b@collabora.com/
Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages
Posted by Sean Christopherson 1 year, 4 months ago
On Wed, Jul 31, 2024, Alex Bennée wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Thu, Feb 29, 2024, David Stevens wrote:
> >> From: David Stevens <stevensd@chromium.org>
> >> 
> >> This patch series adds support for mapping VM_IO and VM_PFNMAP memory
> >> that is backed by struct pages that aren't currently being refcounted
> >> (e.g. tail pages of non-compound higher order allocations) into the
> >> guest.
> >> 
> >> Our use case is virtio-gpu blob resources [1], which directly map host
> >> graphics buffers into the guest as "vram" for the virtio-gpu device.
> >> This feature currently does not work on systems using the amdgpu driver,
> >> as that driver allocates non-compound higher order pages via
> >> ttm_pool_alloc_page().
> >> 
> >> First, this series replaces the gfn_to_pfn_memslot() API with a more
> >> extensible kvm_follow_pfn() API. The updated API rearranges
> >> gfn_to_pfn_memslot()'s args into a struct and where possible packs the
> >> bool arguments into a FOLL_ flags argument. The refactoring changes do
> >> not change any behavior.
> >> 
> >> From there, this series extends the kvm_follow_pfn() API so that
> >> non-refcounted pages can be handled safely. This involves adding an
> >> input parameter to indicate whether the caller can safely use
> >> non-refcounted pfns and an output parameter to tell the caller whether
> >> or not the returned page is refcounted. This is a breaking change:
> >> non-refcounted pfn mappings are now disallowed by default, since such
> >> mappings are unsafe. To allow affected systems to continue to function,
> >> an opt-in module parameter is added that re-enables the unsafe behavior.
> >> 
> >> This series only adds support for non-refcounted pages to x86. Other
> >> MMUs can likely be updated without too much difficulty, but it is not
> >> needed at this point. Updating other parts of KVM (e.g. pfncache) is not
> >> straightforward [2].
> >
> > FYI, on the off chance that someone else is eyeballing this, I am working on
> > revamping this series.  It's still a ways out, but I'm optimistic that we'll be
> > able to address the concerns raised by Christoph and Christian, and maybe even
> > get KVM out of the weeds straightaway (PPC looks thorny :-/).
> 
> I've applied this series to the latest 6.9.x while attempting to
> diagnose some of the virtio-gpu problems it may or may not address.
> However launching KVM guests keeps triggering a bunch of BUGs that
> eventually leave a hung guest:

Can you give v12 (which is comically large) a spin?  I still need to do more
testing, but if it too is buggy, I definitely want to know sooner than later.

Thanks!

https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com
Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages
Posted by Christoph Hellwig 1 year, 9 months ago
On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
> Our use case is virtio-gpu blob resources [1], which directly map host
> graphics buffers into the guest as "vram" for the virtio-gpu device.
> This feature currently does not work on systems using the amdgpu driver,
> as that driver allocates non-compound higher order pages via
> ttm_pool_alloc_page().

.. and just as last time around that is still the problem that needs
to be fixed instead of creating a monster like this to map
non-refcounted pages.
Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages
Posted by David Stevens 1 year, 9 months ago
On Thu, Feb 29, 2024 at 10:36 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
> > Our use case is virtio-gpu blob resources [1], which directly map host
> > graphics buffers into the guest as "vram" for the virtio-gpu device.
> > This feature currently does not work on systems using the amdgpu driver,
> > as that driver allocates non-compound higher order pages via
> > ttm_pool_alloc_page().
>
> .. and just as last time around that is still the problem that needs
> to be fixed instead of creating a monster like this to map
> non-refcounted pages.
>

Patches to amdgpu have been NAKed [1] with the justification that
using non-refcounted pages is working as intended and KVM is in the
wrong for wanting to take references to pages mapped with VM_PFNMAP
[2].

The existence of VM_PFNMAP implies that non-refcounted pages are
working as designed. We can argue about
whether or not VM_PFNMAP should exist, but until VM_PFNMAP is removed,
KVM should be able to handle it. Also note that this is not adding a
new source of non-refcounted pages, so it doesn't make removing
non-refcounted pages more difficult, if the kernel does decide to go
in that direction.

-David

[1] https://lore.kernel.org/lkml/8230a356-be38-f228-4a8e-95124e8e8db6@amd.com/
[2] https://lore.kernel.org/lkml/594f1013-b925-3c75-be61-2d649f5ca54e@amd.com/
Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages
Posted by Christoph Hellwig 1 year, 9 months ago
On Wed, Mar 13, 2024 at 01:55:20PM +0900, David Stevens wrote:
> On Thu, Feb 29, 2024 at 10:36 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
> > > Our use case is virtio-gpu blob resources [1], which directly map host
> > > graphics buffers into the guest as "vram" for the virtio-gpu device.
> > > This feature currently does not work on systems using the amdgpu driver,
> > > as that driver allocates non-compound higher order pages via
> > > ttm_pool_alloc_page().
> >
> > .. and just as last time around that is still the problem that needs
> > to be fixed instead of creating a monster like this to map
> > non-refcounted pages.
> >
> 
> Patches to amdgpu have been NAKed [1] with the justification that
> using non-refcounted pages is working as intended and KVM is in the
> wrong for wanting to take references to pages mapped with VM_PFNMAP
> [2].

So make them not work for KVM as they should to kick the amdgpu
maintainers a***es.

Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages
Posted by Christian König 1 year, 9 months ago
Am 13.03.24 um 05:55 schrieb David Stevens:
> On Thu, Feb 29, 2024 at 10:36 PM Christoph Hellwig <hch@infradead.org> wrote:
>> On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
>>> Our use case is virtio-gpu blob resources [1], which directly map host
>>> graphics buffers into the guest as "vram" for the virtio-gpu device.
>>> This feature currently does not work on systems using the amdgpu driver,
>>> as that driver allocates non-compound higher order pages via
>>> ttm_pool_alloc_page().
>> .. and just as last time around that is still the problem that needs
>> to be fixed instead of creating a monster like this to map
>> non-refcounted pages.
>>
> Patches to amdgpu have been NAKed [1] with the justification that
> using non-refcounted pages is working as intended and KVM is in the
> wrong for wanting to take references to pages mapped with VM_PFNMAP
> [2].
>
> The existence of VM_PFNMAP implies that non-refcounted pages are
> working as designed. We can argue about
> whether or not VM_PFNMAP should exist, but until VM_PFNMAP is removed,
> KVM should be able to handle it. Also note that this is not adding a
> new source of non-refcounted pages, so it doesn't make removing
> non-refcounted pages more difficult, if the kernel does decide to go
> in that direction.

Well, the meaning of VM_PFNMAP is that you should not touch the 
underlying struct page the PTE is pointing to. As far as I can see this 
includes grabbing a reference count.

But that isn't really the problem here. The issue is rather that KVM
assumes that by grabbing a reference count to the page, the driver
won't change the PTE to point somewhere else. And that is simply not true.

So what KVM needs to do is to either have an MMU notifier installed so 
that updates to the PTEs on the host side are reflected immediately to 
the PTEs on the guest side.

Or (even better) you use hardware functionality like nested page tables 
so that we don't actually need to update the guest PTEs when the host 
PTEs change.

And when you have either of those two functionalities the requirement to 
add a long term reference to the struct page goes away completely. So 
when this is done right you don't need to grab a reference in the first 
place.

Regards,
Christian.

>
> -David
>
> [1] https://lore.kernel.org/lkml/8230a356-be38-f228-4a8e-95124e8e8db6@amd.com/
> [2] https://lore.kernel.org/lkml/594f1013-b925-3c75-be61-2d649f5ca54e@amd.com/

Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages
Posted by Sean Christopherson 1 year, 9 months ago
On Wed, Mar 13, 2024, Christian König wrote:
> Am 13.03.24 um 05:55 schrieb David Stevens:
> > On Thu, Feb 29, 2024 at 10:36 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
> > > > Our use case is virtio-gpu blob resources [1], which directly map host
> > > > graphics buffers into the guest as "vram" for the virtio-gpu device.
> > > > This feature currently does not work on systems using the amdgpu driver,
> > > > as that driver allocates non-compound higher order pages via
> > > > ttm_pool_alloc_page().
> > > .. and just as last time around that is still the problem that needs
> > > to be fixed instead of creating a monster like this to map
> > > non-refcounted pages.
> > > 
> > Patches to amdgpu have been NAKed [1] with the justification that
> > using non-refcounted pages is working as intended and KVM is in the
> > wrong for wanting to take references to pages mapped with VM_PFNMAP
> > [2].
> > 
> > The existence of VM_PFNMAP implies that non-refcounted pages are
> > working as designed. We can argue about
> > whether or not VM_PFNMAP should exist, but until VM_PFNMAP is removed,
> > KVM should be able to handle it. Also note that this is not adding a
> > new source of non-refcounted pages, so it doesn't make removing
> > non-refcounted pages more difficult, if the kernel does decide to go
> > in that direction.
> 
> Well, the meaning of VM_PFNMAP is that you should not touch the underlying
> struct page the PTE is pointing to. As far as I can see this includes
> grabbing a reference count.
> 
> But that isn't really the problem here. The issue is rather that KVM assumes
> that by grabbing a reference count to the page, the driver won't change
> the PTE to point somewhere else. And that is simply not true.

No, KVM doesn't assume that.

> So what KVM needs to do is to either have an MMU notifier installed so that
> updates to the PTEs on the host side are reflected immediately to the PTEs
> on the guest side.

KVM already has an MMU notifier and reacts accordingly.

> Or (even better) you use hardware functionality like nested page tables so
> that we don't actually need to update the guest PTEs when the host PTEs
> change.

That's not how stage-2 page tables work. 

> And when you have either of those two functionalities the requirement to add
> a long term reference to the struct page goes away completely. So when this
> is done right you don't need to grab a reference in the first place.

The KVM issue that this series is solving isn't that KVM grabs a reference, it's
that KVM assumes that any non-reserved pfn that is backed by "struct page" is
refcounted.
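
To spell that assumption out, the legacy logic is roughly the following
(a paraphrase for illustration; kvm_pfn_is_refcounted() is a made-up name,
not the actual helper in kvm_main.c):

  static bool kvm_pfn_is_refcounted(kvm_pfn_t pfn)
  {
          if (!pfn_valid(pfn))
                  return false;   /* pure MMIO, no struct page at all */

          /* any non-reserved pfn with a struct page is assumed refcounted */
          return !PageReserved(pfn_to_page(pfn));
  }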

What Christoph is objecting to is that, in this series, KVM is explicitly adding
support for mapping non-compound (huge)pages into KVM guests.  David is arguing
that Christoph's objection to _KVM_ adding support is unfair, because the real
problem is that the kernel already maps such pages into host userspace.  I.e. if
the userspace mapping ceases to exist, then there are no mappings for KVM to follow
and propagate to KVM's stage-2 page tables.
Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages
Posted by Christian König 1 year, 9 months ago
Am 13.03.24 um 14:34 schrieb Sean Christopherson:
> On Wed, Mar 13, 2024, Christian König wrote:
>> Am 13.03.24 um 05:55 schrieb David Stevens:
>>> On Thu, Feb 29, 2024 at 10:36 PM Christoph Hellwig <hch@infradead.org> wrote:
>>>> On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
>>>>> Our use case is virtio-gpu blob resources [1], which directly map host
>>>>> graphics buffers into the guest as "vram" for the virtio-gpu device.
>>>>> This feature currently does not work on systems using the amdgpu driver,
>>>>> as that driver allocates non-compound higher order pages via
>>>>> ttm_pool_alloc_page().
>>>> .. and just as last time around that is still the problem that needs
>>>> to be fixed instead of creating a monster like this to map
>>>> non-refcounted pages.
>>>>
>>> Patches to amdgpu have been NAKed [1] with the justification that
>>> using non-refcounted pages is working as intended and KVM is in the
>>> wrong for wanting to take references to pages mapped with VM_PFNMAP
>>> [2].
>>>
>>> The existence of VM_PFNMAP implies that non-refcounted pages are
>>> working as designed. We can argue about
>>> whether or not VM_PFNMAP should exist, but until VM_PFNMAP is removed,
>>> KVM should be able to handle it. Also note that this is not adding a
>>> new source of non-refcounted pages, so it doesn't make removing
>>> non-refcounted pages more difficult, if the kernel does decide to go
>>> in that direction.
>> Well, the meaning of VM_PFNMAP is that you should not touch the underlying
>> struct page the PTE is pointing to. As far as I can see this includes
>> grabbing a reference count.
>>
>> But that isn't really the problem here. The issue is rather that KVM assumes
>> that by grabbing a reference count to the page, the driver won't change
>> the PTE to point somewhere else. And that is simply not true.
> No, KVM doesn't assume that.
>
>> So what KVM needs to do is to either have an MMU notifier installed so that
>> updates to the PTEs on the host side are reflected immediately to the PTEs
>> on the guest side.
> KVM already has an MMU notifier and reacts accordingly.
>
>> Or (even better) you use hardware functionality like nested page tables so
>> that we don't actually need to update the guest PTEs when the host PTEs
>> change.
> That's not how stage-2 page tables work.
>
>> And when you have either of those two functionalities the requirement to add
>> a long term reference to the struct page goes away completely. So when this
>> is done right you don't need to grab a reference in the first place.
> The KVM issue that this series is solving isn't that KVM grabs a reference, it's
> that KVM assumes that any non-reserved pfn that is backed by "struct page" is
> refcounted.

Well why does it assume that? When you have an MMU notifier that seems
unnecessary.

> What Christoph is objecting to is that, in this series, KVM is explicitly adding
> support for mapping non-compound (huge)pages into KVM guests.  David is arguing
> that Christoph's objection to _KVM_ adding support is unfair, because the real
> problem is that the kernel already maps such pages into host userspace.  I.e. if
> the userspace mapping ceases to exist, then there are no mappings for KVM to follow
> and propagate to KVM's stage-2 page tables.

And I have to agree with Christoph that this doesn't make much sense. 
KVM should *never* map (huge) pages from VMAs marked with VM_PFNMAP into 
KVM guests in the first place.

What it should do instead is to mirror the PFN from the host page tables 
into the guest page tables. Whether there is a page behind that or not *must*
be completely irrelevant to KVM.

The background here is that drivers are modifying the page table on the
fly to point to either MMIO or real memory; this also includes switching
the caching attributes.

The real question is why KVM is trying to grab a page reference when
there is an MMU notifier installed.

Regards,
Christian.

Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages
Posted by Sean Christopherson 1 year, 9 months ago
On Wed, Mar 13, 2024, Christian König wrote:
> Am 13.03.24 um 14:34 schrieb Sean Christopherson:
> > On Wed, Mar 13, 2024, Christian König wrote:
> > > And when you have either of those two functionalities the requirement to add
> > > a long term reference to the struct page goes away completely. So when this
> > > is done right you don't need to grab a reference in the first place.
> > The KVM issue that this series is solving isn't that KVM grabs a reference, it's
> > that KVM assumes that any non-reserved pfn that is backed by "struct page" is
> > refcounted.
> 
> Well why does it assume that? When you have an MMU notifier that seems
> unnecessary.

Indeed, it's legacy code that we're trying to clean up.  It's the bulk of this
series.

> > What Christoph is objecting to is that, in this series, KVM is explicitly adding
> > support for mapping non-compound (huge)pages into KVM guests.  David is arguing
> > that Christoph's objection to _KVM_ adding support is unfair, because the real
> > problem is that the kernel already maps such pages into host userspace.  I.e. if
> > the userspace mapping ceases to exist, then there are no mappings for KVM to follow
> > and propagate to KVM's stage-2 page tables.
> 
> And I have to agree with Christoph that this doesn't make much sense. KVM
> should *never* map (huge) pages from VMAs marked with VM_PFNMAP into KVM
> guests in the first place.
> 
> What it should do instead is to mirror the PFN from the host page tables
> into the guest page tables.

That's exactly what this series does.  Christoph is objecting to KVM playing nice
with non-compound hugepages, as he feels that such mappings should not exist
*anywhere*.

I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
we should instead fix the TTM allocations.  And David pointed out that that was
tried and got NAK'd.