[PATCH v5 0/8] Support virtio-gpu DRM native context
Posted by Dmitry Osipenko 2 months, 2 weeks ago
This patchset adds DRM native context support to VirtIO-GPU in QEMU.

Contrary to the Virgl and Venus contexts, which mediate high-level GFX
APIs, a DRM native context [1] mediates the lower-level kernel driver
UAPI, which results in lower CPU overhead and less/simpler code needed
to support it. A DRM context consists of host and guest parts that have
to be implemented for each GPU driver. On the guest side, a DRM context
presents the virtual GPU as a real/native host GPU device to GL/VK
applications.

[1] https://www.youtube.com/watch?v=9sFP_yddLLQ
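
To illustrate the guest side, here is a minimal sketch (illustrative
only, not code from this series) of how a guest user-space driver asks
the virtio-gpu kernel driver for a DRM native context, using the UAPI
from include/uapi/drm/virtgpu_drm.h; capset id 6 is the DRM capset, and
the device path is an assumption:

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <virtgpu_drm.h> /* via pkg-config --cflags libdrm */

  int main(void)
  {
      int fd = open("/dev/dri/renderD128", O_RDWR); /* path may differ */
      struct drm_virtgpu_context_set_param param = {
          .param = VIRTGPU_CONTEXT_PARAM_CAPSET_ID,
          .value = 6, /* the DRM native-context capset id */
      };
      struct drm_virtgpu_context_init init = {
          .num_params = 1,
          .ctx_set_params = (uintptr_t)&param,
      };

      if (fd < 0 || ioctl(fd, DRM_IOCTL_VIRTGPU_CONTEXT_INIT, &init)) {
          perror("DRM context init");
          return 1;
      }
      printf("DRM native context created\n");
      return 0;
  }

In practice Mesa's native-context drivers issue this ioctl for you; the
sketch only shows which kernel interface is involved.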

Today there are four known DRM native context drivers in the wild:

  - Freedreno (Qualcomm SoC GPUs), completely upstreamed
  - AMDGPU, mostly merged upstream
  - Intel (i915), merge requests are open
  - Asahi (Apple SoC GPUs), WIP status


# How to try out DRM context:

1. DRM context uses host blobs and requires the latest development
version of the Linux kernel [2], which has the necessary KVM fixes.

[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/

2. Use the latest libvirglrenderer from upstream git/main for the
Freedreno and AMDGPU native contexts. For Intel, use patches [3].

[3] https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/1384
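
For reference, a possible virglrenderer build sketch; the
-Ddrm-renderers option and its values are my assumption here, so check
meson_options.txt in your virglrenderer checkout:

  git clone https://gitlab.freedesktop.org/virgl/virglrenderer.git
  cd virglrenderer
  meson setup build -Ddrm-renderers=msm,amdgpu-experimental
  ninja -C build install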

3. In the guest, use the latest Mesa version for Freedreno. For AMDGPU
use Mesa patches [4], for Intel patches [5].

[4] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21658
[5] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29870
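
For example, a guest Mesa build for the AMDGPU native context could be
configured along these lines (the -D flags are the ones used in the
test report further below; everything else is illustrative):

  meson setup build -Dvulkan-drivers=virtio,amd \
      -Dgallium-drivers=virgl,radeonsi -Damdgpu-virtio=true
  ninja -C build install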

4. In the guest, use the latest Linux kernel v6.6+. Apply patch [6] if
you're running Xorg in the guest.

[6] https://lore.kernel.org/dri-devel/20241020224725.179937-1-dmitry.osipenko@collabora.com/

Example QEMU command line that enables DRM context:

  qemu-system-x86_64 -device virtio-vga-gl,hostmem=4G,blob=on,drm_native_context=on \
      -machine q35,accel=kvm,memory-backend=mem1 \
      -object memory-backend-memfd,id=mem1,size=8G -m 8G


# Note about a known performance problem in QEMU:

DRM contexts map host blobs extensively, and these mapping operations
are slow in QEMU; the exact reason is unknown. Mappings are fast in
crosvm. For DRM contexts this problem is more visible than for
Venus/Virgl.

Changelog:

v5: - Added r-bs from Akihiko Odaki.

    - Added acks from Michael Tsirkin.

    - Fixed a compilation warning with older virglrenderer versions,
      reported by Alex Bennée. Noticed that I need to keep the old
      virgl_write_fence() code around for older virglrenderer in the
      "Support asynchronous fencing" patch, so added it back and
      verified that old virglrenderer works properly.

    - Added a new patch from Alex Bennée that adds more virtio-gpu
      documentation, with a couple of corrections and additions from me.

    - Rebased patches on top of the latest staging tree.

v4: - Improved the SDL2/dmabuf patch by reusing the existing Meson X11
      config option, handling EGL errors better, and extending the
      comment explaining that it's safe to enable the SDL2 EGL
      preference hint. As suggested by Akihiko Odaki.

    - Replaced another QSLIST_FOREACH_SAFE with QSLIST_EMPTY+FIRST in
      the async-fencing patch for more code consistency. As suggested
      by Akihiko Odaki.

    - Added missing braces around an if-statement, spotted by
      Alex Bennée.

    - Renamed the 'drm=on' option of the virtio-gpu-gl device to
      'drm_native_context=on' for more clarity, as suggested by
      Alex Bennée. Haven't added the new context-type option that
      was also proposed by Alex; might do it in a separate patch.
      This context-type option would duplicate and deprecate existing
      options, but in the longer run it will likely be worthwhile to
      add it.

    - Dropped the Linux headers-update patch, as the headers have been
      updated in the staging tree.

v3: - Improved the EGL presence-check code on X11 systems for the SDL2
      hint that prefers EGL over GLX, by using better ifdefs and
      checking Xlib presence at build time to avoid a build failure if
      libSDL2 and the system are configured with X11 support disabled.
      Also added a clarifying comment noting that the X11 hint doesn't
      affect Wayland systems. Suggested by Akihiko Odaki.

    - Corrected strerror(err) usage that passed a negative error code
      where it should be positive and vice versa, caught by Akihiko
      Odaki. Added a clarifying comment for the case where we get a
      positive error code from virglrenderer, which differs from other
      virglrenderer API functions; see the sketch below.
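
      A minimal sketch of the convention in question (an illustrative
      helper, not the actual patch code): most virglrenderer functions
      return negative errno codes, while strerror() expects a positive
      value:

        #include "qemu/osdep.h"        /* strerror() via <string.h> */
        #include "qemu/error-report.h" /* error_report() */

        static void report_virgl_error(const char *what, int ret)
        {
            /* ret is a negative errno code; negate it for strerror() */
            error_report("%s: %s", what, strerror(-ret));
        }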

    - Improved QSLIST usage by dropping the mutex protecting the async
      fence list and using the atomic variants of the QSLIST helpers
      instead. Switched away from the FOREACH helper to improve
      readability of the code, making it clear that we don't process
      the list in a suboptimal way; a sketch of the resulting pattern
      is shown below. As suggested by Akihiko Odaki.
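
      A minimal sketch of the resulting lock-free pattern, assuming the
      atomic QSLIST helpers from QEMU's include/qemu/queue.h (the names
      are illustrative, not the actual patch code):

        #include "qemu/osdep.h"
        #include "qemu/queue.h"

        typedef struct Fence {
            uint64_t fence_id;
            QSLIST_ENTRY(Fence) next;
        } Fence;

        static QSLIST_HEAD(, Fence) async_fenceq;

        /* producer: fence-retire callback, may run on any thread */
        static void fence_retired(uint64_t fence_id)
        {
            Fence *f = g_new0(Fence, 1);
            f->fence_id = fence_id;
            QSLIST_INSERT_HEAD_ATOMIC(&async_fenceq, f, next);
        }

        /* consumer: steal the whole list atomically, then drain it
         * with QSLIST_EMPTY+FIRST instead of a FOREACH helper */
        static void process_fences(void)
        {
            QSLIST_HEAD(, Fence) fenceq = QSLIST_HEAD_INITIALIZER(fenceq);

            QSLIST_MOVE_ATOMIC(&fenceq, &async_fenceq);
            while (!QSLIST_EMPTY(&fenceq)) {
                Fence *f = QSLIST_FIRST(&fenceq);
                QSLIST_REMOVE_HEAD(&fenceq, next);
                /* ... signal the fence to the guest ... */
                g_free(f);
            }
        }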

    - Updated patchset base to Venus v18.

v2: - Updated the SDL2-dmabuf patch by making use of error_report() and
      checking the presence of X11+EGL in the system before making SDL2
      prefer the EGL backend over GLX, suggested by Akihiko Odaki.

    - Improved SDL2's dmabuf-presence check, which wasn't done properly
      in v1, where EGL was set up only after the first console was
      fully initialized, and thus SDL's display .has_dmabuf callback
      didn't work for the first console. Now the dmabuf support status
      is pre-checked before the console is registered.

    - Updated the commit description of the patch that fixes SDL2's
      context-switching logic with a more detailed explanation of the
      problem. Suggested by Akihiko Odaki.

    - Corrected a rebase typo in the async-fencing patch and switched
      async-fencing to use a singly-linked list instead of a
      doubly-linked one, as suggested by Akihiko Odaki.

    - Replaced "=true" with "=on" in the DRM native context documentation
      example and made virtio_gpu_virgl_init() to fail with a error message
      if DRM context can't be initialized instead of giving a warning
      message, as was suggested by Akihiko Odaki.

    - Added patchew's dependency tag to the cover letter, as suggested
      by Akihiko Odaki.


Alex Bennée (1):
  docs/system: Expand the virtio-gpu documentation

Dmitry Osipenko (6):
  ui/sdl2: Restore original context after new context creation
  virtio-gpu: Handle virgl fence creation errors
  virtio-gpu: Support asynchronous fencing
  virtio-gpu: Support DRM native context
  ui/sdl2: Don't disable scanout when display is refreshed
  ui/gtk: Don't disable scanout when display is refreshed

Pierre-Eric Pelloux-Prayer (1):
  ui/sdl2: Implement dpy dmabuf functions

 docs/system/devices/virtio-gpu.rst | 105 +++++++++++++++++--
 hw/display/virtio-gpu-gl.c         |   5 +
 hw/display/virtio-gpu-virgl.c      | 159 ++++++++++++++++++++++++++++-
 hw/display/virtio-gpu.c            |  15 +++
 include/hw/virtio/virtio-gpu.h     |  16 +++
 include/ui/sdl2.h                  |   7 ++
 meson.build                        |   6 +-
 ui/gtk-egl.c                       |   1 -
 ui/gtk-gl-area.c                   |   1 -
 ui/sdl2-gl.c                       |  68 +++++++++++-
 ui/sdl2.c                          |  42 ++++++++
 11 files changed, 411 insertions(+), 14 deletions(-)

-- 
2.47.1


Re: [PATCH v5 0/8] Support virtio-gpu DRM native context
Posted by Alex Bennée 2 months, 1 week ago
Dmitry Osipenko <dmitry.osipenko@collabora.com> writes:

> This patchset adds DRM native context support to VirtIO-GPU in QEMU.
<snip>
> Today there are four known DRM native context drivers in the wild:
>
>   - Freedreno (Qualcomm SoC GPUs), completely upstreamed
>   - AMDGPU, mostly merged upstream

I tried my AMD system today with:

Host:
  Aarch64 AVA system
  Trixie
  virglrenderer @ v1.1.0/99557f5aa130930d11f04ffeb07f3a9aa5963182
  -display sdl,gl=on (gtk,gl=on also came up but handled window resizing
  poorly)
  
KVM Guest

  Aarch64
  Trixie
  mesa @ main/d27748a76f7dd9236bfcf9ef172dc13b8c0e170f
  -Dvulkan-drivers=virtio,amd -Dgallium-drivers=virgl,radeonsi -Damdgpu-virtio=true

However, when I ran vulkaninfo --summary, KVM faulted with:

  debian-trixie login: error: kvm run failed Bad address
   PC=0000ffffb9aa1eb0 X00=0000ffffba0450a4 X01=0000aaaaf7f32400
  X02=000000000000013c X03=0000ffffba045098 X04=0000aaaaf7f3253c
  X05=0000ffffba0451d4 X06=00000000c0016900 X07=000000000000000e
  X08=0000000000000014 X09=00000000000000ff X10=0000aaaaf7f32500
  X11=0000aaaaf7e4d028 X12=0000aaaaf7edbcb0 X13=0000000000000001
  X14=000000000000000c X15=0000000000007718 X16=0000ffffb93601f0
  X17=0000ffffb9aa1dc0 X18=00000000000076f0 X19=0000aaaaf7f31330
  X20=0000aaaaf7f323f0 X21=0000aaaaf7f235e0 X22=000000000000004c
  X23=0000aaaaf7f2b5e0 X24=0000aaaaf7ee0cb0 X25=00000000000000ff
  X26=0000000000000076 X27=0000ffffcd2b18a8 X28=0000aaaaf7ee0cb0
  X29=0000ffffcd2b0bd0 X30=0000ffffb86c8b98  SP=0000ffffcd2b0bd0
  PSTATE=20001000 --C- EL0t
  QEMU 9.2.50 monitor - type 'help' for more information
  (qemu) quit

Which looks very much like the PFN locking failure. However booting up
with venus=on instead works. Could there be any differences in the way
device memory is mapped in the two cases?

>   - Intel (i915), merge requests are opened
>   - Asahi (Apple SoC GPUs), WIP status
>
<snip>

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro
Re: [PATCH v5 0/8] Support virtio-gpu DRM native context
Posted by Dmitry Osipenko 2 months, 1 week ago
On 1/22/25 20:00, Alex Bennée wrote:
> Dmitry Osipenko <dmitry.osipenko@collabora.com> writes:
> 
>> This patchset adds DRM native context support to VirtIO-GPU in QEMU.
<snip>
> I tried my AMD system today with:
<snip>
> However, when I ran vulkaninfo --summary, KVM faulted with:
<snip>
> 
> Which looks very much like the PFN locking failure. However booting up
> with venus=on instead works. Could there be any differences in the way
> device memory is mapped in the two cases?

Memory mapping works exactly the same for nctx and venus. Are you on
6.13 host kernel?

-- 
Best regards,
Dmitry

Re: [PATCH v5 0/8] Support virtio-gpu DRM native context
Posted by Alex Bennée 2 months, 1 week ago
Dmitry Osipenko <dmitry.osipenko@collabora.com> writes:

> On 1/22/25 20:00, Alex Bennée wrote:
<snip>
>> Which looks very much like the PFN locking failure. However booting up
>> with venus=on instead works. Could there be any differences in the way
>> device memory is mapped in the two cases?
>
> Memory mapping works exactly the same for nctx and venus. Are you on
> 6.13 host kernel?

Yes - with the Altra PCI workaround patches on both host and guest
kernel.

Is there any way to trace the sharing of device memory on the host so I
can verify it's an attempt at device access? The PC looks like it's in
user-space, but once this fails the guest is suspended, so I can't poke
around in its environment.

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro
Re: [PATCH v5 0/8] Support virtio-gpu DRM native context
Posted by Dmitry Osipenko 2 months, 1 week ago
On 1/23/25 14:58, Alex Bennée wrote:
> Dmitry Osipenko <dmitry.osipenko@collabora.com> writes:
<snip>
>> Yes - with the Altra PCI workaround patches on both host and guest
>> kernel.
>> 
>> Is there any way to trace the sharing of device memory on the host so I
>> can verify it's an attempt at device access? The PC looks like it's in
>> user-space, but once this fails the guest is suspended, so I can't poke
>> around in its environment.

I'm adding printk's to the kernel in such cases. Likely there is no
better way to find out why it fails.

Do your ARM VM and host both use a 4K page size?

Well, if it's a page refcounting bug on ARM/KVM, then applying [1] to
the host driver will make it work and we will know where the problem is.
Please try.

[1]
https://patchwork.kernel.org/project/kvm/patch/20220815095423.11131-1-dmitry.osipenko@collabora.com/

-- 
Best regards,
Dmitry

Re: [PATCH v5 0/8] Support virtio-gpu DRM native context
Posted by Alex Bennée 2 months, 1 week ago
Dmitry Osipenko <dmitry.osipenko@collabora.com> writes:

> On 1/23/25 14:58, Alex Bennée wrote:
<snip>
> Well, if it's a page refcounting bug on ARM/KVM, then applying [1] to
> the host driver will make it work and we will know where the problem is.
> Please try.
>
> [1]
> https://patchwork.kernel.org/project/kvm/patch/20220815095423.11131-1-dmitry.osipenko@collabora.com/

That makes no difference.

AFAICT the fault is triggered in userspace:

  error: kvm run failed Bad address
   PC=0000ffffb1911eb0 X00=0000ffffb1eb60a4 X01=0000aaaaeb1f5400
  X02=000000000000013c X03=0000ffffb1eb6098 X04=0000aaaaeb1f553c
  X05=0000ffffb1eb61d4 X06=00000000c0016900 X07=000000000000000e
  X08=0000000000000014 X09=00000000000000ff X10=0000aaaaeb1f5500
  X11=0000aaaaeb110028 X12=0000aaaaeb19ecb0 X13=0000000000000001
  X14=000000000000000c X15=0000000000007718 X16=0000ffffb11d01f0
  X17=0000ffffb1911dc0 X18=00000000000076f0 X19=0000aaaaeb1f4330
  X20=0000aaaaeb1f53f0 X21=0000aaaaeb1e65e0 X22=000000000000004c
  X23=0000aaaaeb1ee5e0 X24=0000aaaaeb1a3cb0 X25=00000000000000ff
  X26=0000000000000076 X27=0000ffffc7db4e58 X28=0000aaaaeb1a3cb0
  X29=0000ffffc7db4180 X30=0000ffffb0538b98  SP=0000ffffc7db4180
  PSTATE=20001000 --C- EL0t
  QEMU 9.2.50 monitor - type 'help' for more information
  (qemu) quit

  Thread 4 received signal SIGABRT, Aborted.
  [Switching to Thread 1.4]
  cpu_do_idle () at /home/alex/lsrc/linux.git/arch/arm64/kernel/idle.c:32
  32              arm_cpuidle_restore_irq_context(&context);
  (gdb) bt
  #0  cpu_do_idle () at /home/alex/lsrc/linux.git/arch/arm64/kernel/idle.c:32
  #1  0xffff800081962180 in arch_cpu_idle () at /home/alex/lsrc/linux.git/arch/arm64/kernel/idle.c:44
  #2  0xffff8000819622c4 in default_idle_call () at /home/alex/lsrc/linux.git/kernel/sched/idle.c:117
  #3  0xffff80008013af8c in cpuidle_idle_call () at /home/alex/lsrc/linux.git/kernel/sched/idle.c:185
  #4  do_idle () at /home/alex/lsrc/linux.git/kernel/sched/idle.c:325
  #5  0xffff80008013b208 in cpu_startup_entry (state=state@entry=CPUHP_AP_ONLINE_IDLE) at /home/alex/lsrc/linux.git/kernel/sched/idle.c:423
  #6  0xffff800080043668 in secondary_start_kernel () at /home/alex/lsrc/linux.git/arch/arm64/kernel/smp.c:279
  #7  0xffff800080051f78 in __secondary_switched () at /home/alex/lsrc/linux.git/arch/arm64/kernel/head.S:420
  Backtrace stopped: previous frame identical to this frame (corrupt stack?)
  (gdb) info threads
    Id   Target Id                    Frame 
    1    Thread 1.1 (CPU#0 [running]) cpu_do_idle () at /home/alex/lsrc/linux.git/arch/arm64/kernel/idle.c:32
    2    Thread 1.2 (CPU#1 [halted ]) 0x0000ffffb1911eb0 in ?? ()
    3    Thread 1.3 (CPU#2 [halted ]) cpu_do_idle () at /home/alex/lsrc/linux.git/arch/arm64/kernel/idle.c:32
  * 4    Thread 1.4 (CPU#3 [halted ]) cpu_do_idle () at /home/alex/lsrc/linux.git/arch/arm64/kernel/idle.c:32
  (gdb) thread 2
  [Switching to thread 2 (Thread 1.2)]
  #0  0x0000ffffb1911eb0 in ?? ()
  (gdb) bt
  #0  0x0000ffffb1911eb0 in ?? ()
  #1  0x0000aaaaeb1ea5e0 in ?? ()
  Backtrace stopped: previous frame inner to this frame (corrupt stack?)
  (gdb) frame 0
  #0  0x0000ffffb1911eb0 in ?? ()
  (gdb) x/5i $pc
  => 0xffffb1911eb0:      str     q3, [x0]
     0xffffb1911eb4:      ldp     q2, q3, [x1, #48]
     0xffffb1911eb8:      subs    x2, x2, #0x90
     0xffffb1911ebc:      b.ls    0xffffb1911ee0  // b.plast
     0xffffb1911ec0:      stp     q0, q1, [x3, #16]
  (gdb) p/x $x0
  $1 = 0xffffb1eb60a4

I suspect that is memcpy again, but I'll try to track it down. The only
other note is:

[  411.509647] kvm [7713]: Unsupported FSC: EC=0x24 xFSC=0x21 ESR_EL2=0x92000061

Which is:

  EC 0x24 - Data Abort from lower EL
  DFSC 0x21 - Alignment fault
  WnR 1 - Caused by write
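
For reference, those fields decode mechanically from the ESR value; a
minimal stand-alone sketch (field offsets per the Arm ARM):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      uint64_t esr = 0x92000061;          /* ESR_EL2 from the log above */
      unsigned ec   = (esr >> 26) & 0x3f; /* exception class */
      unsigned wnr  = (esr >> 6) & 0x1;   /* write-not-read */
      unsigned dfsc = esr & 0x3f;         /* data fault status code */

      /* prints: EC=0x24 DFSC=0x21 WnR=1 */
      printf("EC=0x%x DFSC=0x%x WnR=%u\n", ec, dfsc, wnr);
      return 0;
  }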
  
-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro