[PATCH v4 0/5] virtio-gpu: Add userptr support for compute workloads

Honglei Huang posted 5 patches 3 weeks, 3 days ago
There is a newer version of this series
Posted by Honglei Huang 3 weeks, 3 days ago
From: Honglei Huang <honghuan@amd.com>

Hello,

This series adds virtio-gpu userptr support to enable ROCm native
context for compute workloads. The userptr feature allows the host to
directly access guest userspace memory without memcpy overhead, which is
essential for GPU compute performance.

The userptr implementation provides buffer-based zero-copy memory access. 
This approach pins guest userspace pages and exposes them to the host
via scatter-gather tables, enabling efficient compute operations.

Key features:
- Zero-copy memory access between guest userspace and host GPU
- Read-only and read-write userptr support
- Runtime feature detection via VIRTGPU_PARAM_RESOURCE_USERPTR
- ROCm capset support for ROCm stack integration
- Proper page lifecycle management with FOLL_LONGTERM pinning
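
A minimal usage sketch from the guest userspace side (illustrative only: it
assumes the new UAPI added by this series, i.e. VIRTGPU_PARAM_RESOURCE_USERPTR,
VIRTGPU_BLOB_FLAG_USE_USERPTR and the userptr field added to
struct drm_virtgpu_resource_create_blob; the blob_mem value and error handling
are simplified guesses):

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/virtgpu_drm.h>   /* kernel UAPI header with this series applied */

static int create_userptr_bo(int drm_fd, void *buf, uint64_t size,
                             uint32_t *bo_handle)
{
        int has_userptr = 0;
        struct drm_virtgpu_getparam gp = {
                .param = VIRTGPU_PARAM_RESOURCE_USERPTR,     /* added by this series */
                .value = (uint64_t)(uintptr_t)&has_userptr,
        };
        struct drm_virtgpu_resource_create_blob blob = {
                .blob_mem   = VIRTGPU_BLOB_MEM_GUEST,        /* guest-backed blob (guess) */
                .blob_flags = VIRTGPU_BLOB_FLAG_USE_USERPTR, /* added by this series */
                .size       = size,
                .userptr    = (uint64_t)(uintptr_t)buf,      /* page-aligned malloc/mmap */
        };

        /* Runtime feature detection, then create the userptr-backed blob BO. */
        if (ioctl(drm_fd, DRM_IOCTL_VIRTGPU_GETPARAM, &gp) || !has_userptr)
                return -1;
        if (ioctl(drm_fd, DRM_IOCTL_VIRTGPU_RESOURCE_CREATE_BLOB, &blob))
                return -1;

        *bo_handle = blob.bo_handle;   /* pages stay pinned for the BO's lifetime */
        return 0;
}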

Patches overview:
1. Add VIRTIO_GPU_CAPSET_ROCM capability for compute workloads
2. Add virtio-gpu API definitions for userptr blob resources
3. Extend DRM UAPI with comprehensive userptr support
4. Implement core userptr functionality with page management
5. Integrate userptr into blob resource creation and advertise to userspace

Performance: In popular compute benchmarks, this implementation achieves
approximately 70% of bare-metal OpenCL performance on AMD V2000 hardware
and approximately 92% on AMD W7900 hardware.

Testing: Verified with the ROCm stack and OpenCL applications in virtio-based
virtualized environments.
- The full OpenCL CTS passes on ROCm 5.7.0 on the V2000 platform.
- Nearly 70% of OpenCL CTS tests pass on ROCm 7.0 on the W7900 platform.
- Most HIP catch tests pass on ROCm 7.0 on the W7900 platform.
- Several AI applications are enabled on ROCm 7.0 on the W7900 platform.

V4 changes:
    - Renamed VIRTIO_GPU_CAPSET_HSAKMT to VIRTIO_GPU_CAPSET_ROCM
    - Removed userptr feature probing since it can reuse the guest
      blob resource code path, reducing the patch count from 6 to 5
    - Updated the corresponding commit messages
    - Consolidated userptr feature detection in the final patch
    - Updated the corresponding cover letter content

V3 changes:
    - Split into focused patches for easier review
    - Removed complex interval tree userptr management 
    - Simplified resource creation without deduplication
    - Added VIRTGPU_PARAM_RESOURCE_USERPTR for feature detection
    - Improved UAPI documentation and error handling
    - Enhanced code quality with proper cleanup paths
    - Removed MMU notifier dependencies for simplicity
    - Fixed resource lifecycle management issues

V2: - Split adding the HSAKMT context and the blob userptr resource into
      two patches.
    - Removed the MMU notifier related patches, since using non-movable
      userspace memory with an MMU notifier is not a good idea.
    - Removed the HSAKMT context check at context creation, letting every
      context support the userptr feature.
    - Removed the MMU notifier related content from the cover letter.
    - Added more comments for patch 6 in the cover letter.

Honglei Huang (5):
  drm/virtio-gpu: Add VIRTIO_GPU_CAPSET_ROCM capability
  virtio-gpu api: add blob userptr resource
  drm/virtgpu api: add blob userptr resource
  drm/virtio: implement userptr support for zero-copy memory access
  drm/virtio: advertise base userptr feature to userspace

 drivers/gpu/drm/virtio/Makefile          |   3 +-
 drivers/gpu/drm/virtio/virtgpu_drv.h     |  33 ++++
 drivers/gpu/drm/virtio/virtgpu_ioctl.c   |   9 +-
 drivers/gpu/drm/virtio/virtgpu_object.c  |   6 +
 drivers/gpu/drm/virtio/virtgpu_userptr.c | 231 +++++++++++++++++++++++
 include/uapi/drm/virtgpu_drm.h           |   9 +
 include/uapi/linux/virtio_gpu.h          |   7 +
 7 files changed, 295 insertions(+), 3 deletions(-)
 create mode 100644 drivers/gpu/drm/virtio/virtgpu_userptr.c

-- 
2.34.1
Re: [PATCH v4 0/5] virtio-gpu: Add userptr support for compute workloads
Posted by Akihiko Odaki 3 weeks, 3 days ago
On 2026/01/15 16:58, Honglei Huang wrote:
> From: Honglei Huang <honghuan@amd.com>
> 
> Hello,
> 
> This series adds virtio-gpu userptr support to enable ROCm native
> context for compute workloads. The userptr feature allows the host to
> directly access guest userspace memory without memcpy overhead, which is
> essential for GPU compute performance.
> 
> The userptr implementation provides buffer-based zero-copy memory access.
> This approach pins guest userspace pages and exposes them to the host
> via scatter-gather tables, enabling efficient compute operations.

This description looks identical to what
VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST does, so there should be some
explanation of how it differs.

I have already pointed this out when reviewing the QEMU patches[1], but
I note it here too, since QEMU is just a middleman and this matter is
better discussed by Linux and virglrenderer developers.

[1] 
https://lore.kernel.org/qemu-devel/35a8add7-da49-4833-9e69-d213f52c771a@amd.com/

> 
> Key features:
> - Zero-copy memory access between guest userspace and host GPU
> - Read-only and read-write userptr support
> - Runtime feature detection via VIRTGPU_PARAM_RESOURCE_USERPTR
> - ROCm capset support for ROCm stack integration
> - Proper page lifecycle management with FOLL_LONGTERM pinning
> 
> Patches overview:
> 1. Add VIRTIO_GPU_CAPSET_ROCM capability for compute workloads
> 2. Add virtio-gpu API definitions for userptr blob resources
> 3. Extend DRM UAPI with comprehensive userptr support
> 4. Implement core userptr functionality with page management
> 5. Integrate userptr into blob resource creation and advertise to userspace
> 
> Performance: In popular compute benchmarks, this implementation achieves
> approximately 70% efficiency compared to bare metal OpenCL performance on
> AMD V2000 hardware, achieves 92% efficiency on AMD W7900 hardware.
> 
> Testing: Verified with ROCm stack and OpenCL applications in VIRTIO virtualized
> environments.
> - Full OPENCL CTS tests passed on ROCm 5.7.0 in V2000 platform.
> - Near 70% percentage of OPENCL CTS tests passed on ROCm 7.0 W7900 platform.
> - most HIP catch tests passed on ROCm 7.0 W7900 platform.
> - Some AI applications enabled on ROCm 7.0 W7900 platform.
> 
> V4 changes:
>      - Renamed VIRTIO_GPU_CAPSET_HSAKMT to VIRTIO_GPU_CAPSET_ROCM
>      - Remove userptr feature probing cause it can reuse the guest
>        blob resource code path, reduce patch count from 6 to 5
>      - Updated corresponding commit messages
>      - Consolidated userptr feature detection in final patch
>      - Update corresponding cover letter content
> 
> V3 changes:
>      - Split into focused patches for easier review
>      - Removed complex interval tree userptr management
>      - Simplified resource creation without deduplication
>      - Added VIRTGPU_PARAM_RESOURCE_USERPTR for feature detection
>      - Improved UAPI documentation and error handling
>      - Enhanced code quality with proper cleanup paths
>      - Removed MMU notifier dependencies for simplicity
>      - Fixed resource lifecycle management issues
> 
> V2: - Split add HSAKMT context and blob userptr resource to two patches.
>      - Remove MMU notifier related patches, cause use not moveable user space
>        memory with MMU notifier is not a good idea.
>      - Remove HSAKMT context check when create context, let all the context
>        support the userptr feature.
>      - Remove MMU notifier related content in cover letter.
>      - Add more comments  for patch 6 in cover letter.
> 
> Honglei Huang (5):
>    drm/virtio-gpu: Add VIRTIO_GPU_CAPSET_ROCM capability
>    virtio-gpu api: add blob userptr resource
>    drm/virtgpu api: add blob userptr resource
>    drm/virtio: implement userptr support for zero-copy memory access
>    drm/virtio: advertise base userptr feature to userspace
> 
>   drivers/gpu/drm/virtio/Makefile          |   3 +-
>   drivers/gpu/drm/virtio/virtgpu_drv.h     |  33 ++++
>   drivers/gpu/drm/virtio/virtgpu_ioctl.c   |   9 +-
>   drivers/gpu/drm/virtio/virtgpu_object.c  |   6 +
>   drivers/gpu/drm/virtio/virtgpu_userptr.c | 231 +++++++++++++++++++++++
>   include/uapi/drm/virtgpu_drm.h           |   9 +
>   include/uapi/linux/virtio_gpu.h          |   7 +
>   7 files changed, 295 insertions(+), 3 deletions(-)
>   create mode 100644 drivers/gpu/drm/virtio/virtgpu_userptr.c
>
Re: [PATCH v4 0/5] virtio-gpu: Add userptr support for compute workloads
Posted by Honglei Huang 3 weeks, 2 days ago

On 2026/1/15 17:20, Akihiko Odaki wrote:
> On 2026/01/15 16:58, Honglei Huang wrote:
>> From: Honglei Huang <honghuan@amd.com>
>>
>> Hello,
>>
>> This series adds virtio-gpu userptr support to enable ROCm native
>> context for compute workloads. The userptr feature allows the host to
>> directly access guest userspace memory without memcpy overhead, which is
>> essential for GPU compute performance.
>>
>> The userptr implementation provides buffer-based zero-copy memory access.
>> This approach pins guest userspace pages and exposes them to the host
>> via scatter-gather tables, enabling efficient compute operations.
> 
> This description looks identical with what 
> VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST does so there should be some 
> explanation how it makes difference.
> 
> I have already pointed out this when reviewing the QEMU patches[1], but 
> I note that here too, since QEMU is just a middleman and this matter is 
> better discussed by Linux and virglrenderer developers.
> 
> [1] https://lore.kernel.org/qemu-devel/35a8add7-da49-4833-9e69- 
> d213f52c771a@amd.com/
> 

Thanks for raising this important point about the distinction between
VIRTGPU_BLOB_FLAG_USE_USERPTR and VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST.
I might not have explained it clearly previously.

The key difference is memory ownership and lifecycle:

BLOB_MEM_HOST3D_GUEST:
   - Kernel allocates memory (drm_gem_shmem_create)
   - Userspace accesses via mmap(GEM_BO)
   - Use case: Graphics resources (Vulkan/OpenGL)

BLOB_FLAG_USE_USERPTR:
   - Userspace pre-allocates memory (malloc/mmap)
   - Kernel only pins the existing pages
   - Use case: compute workloads (ROCm/CUDA) with large datasets. For
example, when the GPU needs to load a 10GB+ model file, the UMD mmaps the
file and hands the resulting pointer to the driver, so no extra copy is
needed. If shmem is used instead, userspace has to copy the file data into
the shmem mapping, which adds copy overhead.

Userptr:

file -> open/mmap -> userspace ptr -> driver

shmem:

user alloc shmem ──→ mmap shmem ──→ shmem userspace ptr -> driver
                                               ↑
                                               │ copy
                                               │
file ──→ open/mmap ──→ file userptr ──────────┘


For compute workloads, this matters significantly:
   Without userptr: malloc(8GB) → alloc GEM BO → memcpy 8GB → compute → memcpy 8GB back
   With userptr:    malloc(8GB) → create userptr BO → compute (zero-copy)

The explicit flag serves three purposes:

1. Although both paths send scatter-gather entries to the host, the flag
makes the intent unambiguous.

2. It ensures consistency between the flag and the userptr address field.

3. Future HMM support: there is a plan to upgrade the userptr implementation
to use Heterogeneous Memory Management for better GPU coherency and dynamic
page migration. The flag provides a clean path to that future upgrade.

I understand the concern about API complexity. I'll defer to the 
virtio-gpu maintainers for the final decision on whether this design is 
acceptable or if they prefer an alternative approach.

Regards,
Honglei Huang

>>
>> Key features:
>> - Zero-copy memory access between guest userspace and host GPU
>> - Read-only and read-write userptr support
>> - Runtime feature detection via VIRTGPU_PARAM_RESOURCE_USERPTR
>> - ROCm capset support for ROCm stack integration
>> - Proper page lifecycle management with FOLL_LONGTERM pinning
>>
>> Patches overview:
>> 1. Add VIRTIO_GPU_CAPSET_ROCM capability for compute workloads
>> 2. Add virtio-gpu API definitions for userptr blob resources
>> 3. Extend DRM UAPI with comprehensive userptr support
>> 4. Implement core userptr functionality with page management
>> 5. Integrate userptr into blob resource creation and advertise to 
>> userspace
>>
>> Performance: In popular compute benchmarks, this implementation achieves
>> approximately 70% efficiency compared to bare metal OpenCL performance on
>> AMD V2000 hardware, achieves 92% efficiency on AMD W7900 hardware.
>>
>> Testing: Verified with ROCm stack and OpenCL applications in VIRTIO 
>> virtualized
>> environments.
>> - Full OPENCL CTS tests passed on ROCm 5.7.0 in V2000 platform.
>> - Near 70% percentage of OPENCL CTS tests passed on ROCm 7.0 W7900 
>> platform.
>> - most HIP catch tests passed on ROCm 7.0 W7900 platform.
>> - Some AI applications enabled on ROCm 7.0 W7900 platform.
>>
>> V4 changes:
>>      - Renamed VIRTIO_GPU_CAPSET_HSAKMT to VIRTIO_GPU_CAPSET_ROCM
>>      - Remove userptr feature probing cause it can reuse the guest
>>        blob resource code path, reduce patch count from 6 to 5
>>      - Updated corresponding commit messages
>>      - Consolidated userptr feature detection in final patch
>>      - Update corresponding cover letter content
>>
>> V3 changes:
>>      - Split into focused patches for easier review
>>      - Removed complex interval tree userptr management
>>      - Simplified resource creation without deduplication
>>      - Added VIRTGPU_PARAM_RESOURCE_USERPTR for feature detection
>>      - Improved UAPI documentation and error handling
>>      - Enhanced code quality with proper cleanup paths
>>      - Removed MMU notifier dependencies for simplicity
>>      - Fixed resource lifecycle management issues
>>
>> V2: - Split add HSAKMT context and blob userptr resource to two patches.
>>      - Remove MMU notifier related patches, cause use not moveable 
>> user space
>>        memory with MMU notifier is not a good idea.
>>      - Remove HSAKMT context check when create context, let all the 
>> context
>>        support the userptr feature.
>>      - Remove MMU notifier related content in cover letter.
>>      - Add more comments  for patch 6 in cover letter.
>>
>> Honglei Huang (5):
>>    drm/virtio-gpu: Add VIRTIO_GPU_CAPSET_ROCM capability
>>    virtio-gpu api: add blob userptr resource
>>    drm/virtgpu api: add blob userptr resource
>>    drm/virtio: implement userptr support for zero-copy memory access
>>    drm/virtio: advertise base userptr feature to userspace
>>
>>   drivers/gpu/drm/virtio/Makefile          |   3 +-
>>   drivers/gpu/drm/virtio/virtgpu_drv.h     |  33 ++++
>>   drivers/gpu/drm/virtio/virtgpu_ioctl.c   |   9 +-
>>   drivers/gpu/drm/virtio/virtgpu_object.c  |   6 +
>>   drivers/gpu/drm/virtio/virtgpu_userptr.c | 231 +++++++++++++++++++++++
>>   include/uapi/drm/virtgpu_drm.h           |   9 +
>>   include/uapi/linux/virtio_gpu.h          |   7 +
>>   7 files changed, 295 insertions(+), 3 deletions(-)
>>   create mode 100644 drivers/gpu/drm/virtio/virtgpu_userptr.c
>>
> 

Re: [PATCH v4 0/5] virtio-gpu: Add userptr support for compute workloads
Posted by Akihiko Odaki 3 weeks, 2 days ago
On 2026/01/16 16:20, Honglei Huang wrote:
> 
> 
> On 2026/1/15 17:20, Akihiko Odaki wrote:
>> On 2026/01/15 16:58, Honglei Huang wrote:
>>> From: Honglei Huang <honghuan@amd.com>
>>>
>>> Hello,
>>>
>>> This series adds virtio-gpu userptr support to enable ROCm native
>>> context for compute workloads. The userptr feature allows the host to
>>> directly access guest userspace memory without memcpy overhead, which is
>>> essential for GPU compute performance.
>>>
>>> The userptr implementation provides buffer-based zero-copy memory 
>>> access.
>>> This approach pins guest userspace pages and exposes them to the host
>>> via scatter-gather tables, enabling efficient compute operations.
>>
>> This description looks identical with what 
>> VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST does so there should be some 
>> explanation how it makes difference.
>>
>> I have already pointed out this when reviewing the QEMU patches[1], 
>> but I note that here too, since QEMU is just a middleman and this 
>> matter is better discussed by Linux and virglrenderer developers.
>>
>> [1] https://lore.kernel.org/qemu-devel/35a8add7-da49-4833-9e69- 
>> d213f52c771a@amd.com/
>>
> 
> Thanks for raising this important point about the distinction between
> VIRTGPU_BLOB_FLAG_USE_USERPTR and VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST.
> I might not have explained it clearly previously.
> 
> The key difference is memory ownership and lifecycle:
> 
> BLOB_MEM_HOST3D_GUEST:
>    - Kernel allocates memory (drm_gem_shmem_create)
>    - Userspace accesses via mmap(GEM_BO)
>    - Use case: Graphics resources (Vulkan/OpenGL)
> 
> BLOB_FLAG_USE_USERPTR:
>    - Userspace pre-allocates memory (malloc/mmap)

"Kernel allocates memory" and "userspace pre-allocates memory" is a bit 
ambiguous phrasing. Either way, the userspace requests the kernel to map 
memory with a system call, brk() or mmap().

>    - Kernel only get existing pages
>    - Use case: Compute workloads (ROCm/CUDA) with large datasets, like
> GPU needs load a big model file 10G+, UMD mmap the fd file, then give 
> the mmap ptr into userspace then driver do not need a another copy.
> But if the shmem is used, the userspace needs copy the file data into a 
> shmem mmap ptr there is a copy overhead.
> 
> Userptr:
> 
> file -> open/mmap -> userspace ptr -> driver
> 
> shmem:
> 
> user alloc shmem ──→ mmap shmem ──→ shmem userspace ptr -> driver
>                                                ↑
>                                                │ copy
>                                                │
> file ──→ open/mmap ──→ file userptr ──────────┘
> 
> 
> For compute workloads, this matters significantly:
>    Without userptr: malloc(8GB) → alloc GEM BO → memcpy 8GB → compute → 
> memcpy 8GB back
>    With userptr:    malloc(8GB) → create userptr BO → compute (zero-copy)

Why don't you allocate a GEM BO first and read the file into it?

> 
> The explicit flag serves three purposes:
> 
> 1. Although both send scatter-gather entries to host. The flag makes the 
> intent unambiguous.

Why will the host care?

> 
> 2. Ensures consistency between flag and userptr address field.

Addresses are represented with the nr_entries and following struct 
virtio_gpu_mem_entry entries, whenever 
VIRTIO_GPU_CMD_RESOURCE_CREATE_BLOB or 
VIRTIO_GPU_CMD_RESOURCE_ATTACH_BACKING is used. Having a special flag 
introduces inconsistency.

> 
> 3. Future HMM support: There is a plan to upgrade userptr implementation 
> to use Heterogeneous Memory Management for better GPU coherency and 
> dynamic page migration. The flag provides a clean path to future upgrade.

What will the upgrade path with the flag and the one without the flag look
like, and in what respect is the upgrade path with the flag "cleaner"?

> 
> I understand the concern about API complexity. I'll defer to the virtio- 
> gpu maintainers for the final decision on whether this design is 
> acceptable or if they prefer an alternative approach.

It is fine to have API complexity. The problem here is the lack of clear 
motivation and documentation.

Another way to put this is: how will you explain the flag in the virtio 
specification? It should say "the driver MAY/SHOULD/MUST do something" 
and/or "the device MAY/SHOULD/MUST do something", and then Linux and 
virglrenderer can implement the flag accordingly.

Regards,
Akihiko Odaki
Re: [PATCH v4 0/5] virtio-gpu: Add userptr support for compute workloads
Posted by Honglei Huang 3 weeks, 2 days ago

On 2026/1/16 16:54, Akihiko Odaki wrote:
> On 2026/01/16 16:20, Honglei Huang wrote:
>>
>>
>> On 2026/1/15 17:20, Akihiko Odaki wrote:
>>> On 2026/01/15 16:58, Honglei Huang wrote:
>>>> From: Honglei Huang <honghuan@amd.com>
>>>>
>>>> Hello,
>>>>
>>>> This series adds virtio-gpu userptr support to enable ROCm native
>>>> context for compute workloads. The userptr feature allows the host to
>>>> directly access guest userspace memory without memcpy overhead, 
>>>> which is
>>>> essential for GPU compute performance.
>>>>
>>>> The userptr implementation provides buffer-based zero-copy memory 
>>>> access.
>>>> This approach pins guest userspace pages and exposes them to the host
>>>> via scatter-gather tables, enabling efficient compute operations.
>>>
>>> This description looks identical with what 
>>> VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST does so there should be some 
>>> explanation how it makes difference.
>>>
>>> I have already pointed out this when reviewing the QEMU patches[1], 
>>> but I note that here too, since QEMU is just a middleman and this 
>>> matter is better discussed by Linux and virglrenderer developers.
>>>
>>> [1] https://lore.kernel.org/qemu-devel/35a8add7-da49-4833-9e69- 
>>> d213f52c771a@amd.com/
>>>
>>
>> Thanks for raising this important point about the distinction between
>> VIRTGPU_BLOB_FLAG_USE_USERPTR and VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST.
>> I might not have explained it clearly previously.
>>
>> The key difference is memory ownership and lifecycle:
>>
>> BLOB_MEM_HOST3D_GUEST:
>>    - Kernel allocates memory (drm_gem_shmem_create)
>>    - Userspace accesses via mmap(GEM_BO)
>>    - Use case: Graphics resources (Vulkan/OpenGL)
>>
>> BLOB_FLAG_USE_USERPTR:
>>    - Userspace pre-allocates memory (malloc/mmap)
> 
> "Kernel allocates memory" and "userspace pre-allocates memory" is a bit 
> ambiguous phrasing. Either way, the userspace requests the kernel to map 
> memory with a system call, brk() or mmap().

They are different:
BLOB_MEM_HOST3D_GUEST (kernel-managed pages):
   - Allocated via drm_gem_shmem_create() as GFP_KERNEL pages
   - Kernel guarantees pages won't swap or migrate while GEM object exists
   - Physical addresses remain stable → safe for DMA

BLOB_FLAG_USE_USERPTR (userspace pages):
   - From regular malloc/mmap - subject to MM policies
   - Can be swapped, migrated, or compacted by kernel
   - Requires FOLL_LONGTERM pinning to make DMA-safe

The device must treat them differently. Kernel-managed pages have stable
physical addresses. Userspace pages need explicit pinning and the device
must be prepared for potential invalidation.

This is why all compute drivers (amdgpu, i915, nouveau) implement userptr:
to make arbitrary userspace allocations DMA-accessible while respecting
their different page mobility characteristics.
And DRM already has a better framework for this, SVM; this version is a
greatly simplified one:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/drm_gpusvm.c#:~:text=*%20GPU%20Shared%20Virtual%20Memory%20(GPU%20SVM)%20layer%20for%20the%20Direct%20Rendering%20Manager%20(DRM)
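
Roughly, the guest kernel side of that pinning looks like the following
(a simplified sketch only, not the actual virtgpu_userptr.c code; names and
error handling are illustrative):

/* Sketch: long-term pin a userspace range and build an sg_table for it.
 * Needs <linux/mm.h>, <linux/scatterlist.h>, <linux/slab.h>. */
static int userptr_pin_range(u64 userptr, u64 size, bool writable,
                             struct page ***pages_out, struct sg_table **sgt_out)
{
        unsigned long first = userptr >> PAGE_SHIFT;
        unsigned long last = (userptr + size - 1) >> PAGE_SHIFT;
        unsigned long npages = last - first + 1;
        unsigned int gup_flags = FOLL_LONGTERM | (writable ? FOLL_WRITE : 0);
        struct sg_table *sgt;
        struct page **pages;
        long pinned;
        int ret;

        pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return -ENOMEM;

        /* FOLL_LONGTERM migrates pages out of movable/CMA regions before pinning */
        pinned = pin_user_pages_fast(userptr & PAGE_MASK, npages, gup_flags, pages);
        if (pinned != npages) {
                ret = pinned < 0 ? pinned : -EFAULT;
                goto err_unpin;
        }

        sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
        if (!sgt) {
                ret = -ENOMEM;
                goto err_unpin;
        }

        ret = sg_alloc_table_from_pages(sgt, pages, npages, offset_in_page(userptr),
                                        size, GFP_KERNEL);
        if (ret) {
                kfree(sgt);
                goto err_unpin;
        }

        *pages_out = pages;
        *sgt_out = sgt;
        return 0;

err_unpin:
        if (pinned > 0)
                unpin_user_pages(pages, pinned);
        kvfree(pages);
        return ret;
}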


> 
>>    - Kernel only get existing pages
>>    - Use case: Compute workloads (ROCm/CUDA) with large datasets, like
>> GPU needs load a big model file 10G+, UMD mmap the fd file, then give 
>> the mmap ptr into userspace then driver do not need a another copy.
>> But if the shmem is used, the userspace needs copy the file data into 
>> a shmem mmap ptr there is a copy overhead.
>>
>> Userptr:
>>
>> file -> open/mmap -> userspace ptr -> driver
>>
>> shmem:
>>
>> user alloc shmem ──→ mmap shmem ──→ shmem userspace ptr -> driver
>>                                                ↑
>>                                                │ copy
>>                                                │
>> file ──→ open/mmap ──→ file userptr ──────────┘
>>
>>
>> For compute workloads, this matters significantly:
>>    Without userptr: malloc(8GB) → alloc GEM BO → memcpy 8GB → compute 
>> → memcpy 8GB back
>>    With userptr:    malloc(8GB) → create userptr BO → compute (zero-copy)
> 
> Why don't you alloc GEM BO first and read the file into there?

Because that defeats the purpose of zero-copy.

With GEM-BO-first (what you suggest):

void *gembo = virtgpu_gem_create(10GB);     // Allocate GEM buffer
void *model = mmap(..., model_file_fd, 0);  // Map model file
memcpy(gembo, model, 10GB);                 // Copy 10GB - NOT zero-copy
munmap(model, 10GB);
gpu_compute(gembo);

Result: 10GB copy overhead + double memory usage during copy.

With userptr (zero-copy):

void *model = mmap(..., model_file_fd, 0);  // Map model file
hsa_memory_register(model, 10GB);           // Pin pages, create userptr BO
gpu_compute(model);                         // GPU reads directly from file pages


> 
>>
>> The explicit flag serves three purposes:
>>
>> 1. Although both send scatter-gather entries to host. The flag makes 
>> the intent unambiguous.
> 
> Why will the host care?

The flag tells the host this is a userptr resource; the host side needs to
handle it specially.


> 
>>
>> 2. Ensures consistency between flag and userptr address field.
> 
> Addresses are represented with the nr_entries and following struct 
> virtio_gpu_mem_entry entries, whenever 
> VIRTIO_GPU_CMD_RESOURCE_CREATE_BLOB or 
> VIRTIO_GPU_CMD_RESOURCE_ATTACH_BACKING is used. Having a special flag 
> introduces inconsistency.

For this part I am talking about the virtio-gpu guest UMD side: in the blob
create ioctl we need this flag to validate the userptr address and the
read-only attribute:
	if (rc_blob->blob_flags & VIRTGPU_BLOB_FLAG_USE_USERPTR) {
		if (!rc_blob->userptr)
			return -EINVAL;
	} else {
		if (rc_blob->userptr)
			return -EINVAL;

		if (rc_blob->blob_flags & VIRTGPU_BLOB_FLAG_USERPTR_RDONLY)
			return -EINVAL;
	}

> 
>>
>> 3. Future HMM support: There is a plan to upgrade userptr 
>> implementation to use Heterogeneous Memory Management for better GPU 
>> coherency and dynamic page migration. The flag provides a clean path 
>> to future upgrade.
> 
> How will the upgrade path with the flag and the one without the flag 
> look like, and in what aspect the upgrade path with the flag is "cleaner"?

As I mentioned above, userptr handling is different from shmem/GEM BO
handling.

> 
>>
>> I understand the concern about API complexity. I'll defer to the 
>> virtio- gpu maintainers for the final decision on whether this design 
>> is acceptable or if they prefer an alternative approach.
> 
> It is fine to have API complexity. The problem here is the lack of clear 
> motivation and documentation.
> 
> Another way to put this is: how will you explain the flag in the virtio 
> specification? It should say "the driver MAY/SHOULD/MUST do something" 
> and/or "the device MAY/SHOULD/MUST do something", and then Linux and 
> virglrenderer can implement the flag accordingly.

You're absolutely right that the specification should
be written in proper virtio spec language. A draft could be:

VIRTIO_GPU_BLOB_FLAG_USE_USERPTR:

Linux virtio driver requirements:
- MUST set userptr to a valid guest userspace VA in
drm_virtgpu_resource_create_blob
- SHOULD keep the VA mapping valid until resource destruction
- MUST pin pages or use HMM at blob creation time

Virglrenderer requirements:
- MUST use the corresponding API for userptr resources


> 
> Regards,
> Akihiko Odaki

Re: [PATCH v4 0/5] virtio-gpu: Add userptr support for compute workloads
Posted by Akihiko Odaki 3 weeks, 2 days ago
On 2026/01/16 18:39, Honglei Huang wrote:
> 
> 
> On 2026/1/16 16:54, Akihiko Odaki wrote:
>> On 2026/01/16 16:20, Honglei Huang wrote:
>>>
>>>
>>> On 2026/1/15 17:20, Akihiko Odaki wrote:
>>>> On 2026/01/15 16:58, Honglei Huang wrote:
>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>
>>>>> Hello,
>>>>>
>>>>> This series adds virtio-gpu userptr support to enable ROCm native
>>>>> context for compute workloads. The userptr feature allows the host to
>>>>> directly access guest userspace memory without memcpy overhead, 
>>>>> which is
>>>>> essential for GPU compute performance.
>>>>>
>>>>> The userptr implementation provides buffer-based zero-copy memory 
>>>>> access.
>>>>> This approach pins guest userspace pages and exposes them to the host
>>>>> via scatter-gather tables, enabling efficient compute operations.
>>>>
>>>> This description looks identical with what 
>>>> VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST does so there should be some 
>>>> explanation how it makes difference.
>>>>
>>>> I have already pointed out this when reviewing the QEMU patches[1], 
>>>> but I note that here too, since QEMU is just a middleman and this 
>>>> matter is better discussed by Linux and virglrenderer developers.
>>>>
>>>> [1] https://lore.kernel.org/qemu-devel/35a8add7-da49-4833-9e69- 
>>>> d213f52c771a@amd.com/
>>>>
>>>
>>> Thanks for raising this important point about the distinction between
>>> VIRTGPU_BLOB_FLAG_USE_USERPTR and VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST.
>>> I might not have explained it clearly previously.
>>>
>>> The key difference is memory ownership and lifecycle:
>>>
>>> BLOB_MEM_HOST3D_GUEST:
>>>    - Kernel allocates memory (drm_gem_shmem_create)
>>>    - Userspace accesses via mmap(GEM_BO)
>>>    - Use case: Graphics resources (Vulkan/OpenGL)
>>>
>>> BLOB_FLAG_USE_USERPTR:
>>>    - Userspace pre-allocates memory (malloc/mmap)
>>
>> "Kernel allocates memory" and "userspace pre-allocates memory" is a 
>> bit ambiguous phrasing. Either way, the userspace requests the kernel 
>> to map memory with a system call, brk() or mmap().
> 
> They are different:
> BLOB_MEM_HOST3D_GUEST (kernel-managed pages):
>    - Allocated via drm_gem_shmem_create() as GFP_KERNEL pages
>    - Kernel guarantees pages won't swap or migrate while GEM object exists
>    - Physical addresses remain stable → safe for DMA
> 
> BLOB_FLAG_USE_USERPTR (userspace pages):
>    - From regular malloc/mmap - subject to MM policies
>    - Can be swapped, migrated, or compacted by kernel
>    - Requires FOLL_LONGTERM pinning to make DMA-safe
> 
> The device must treat them differently. Kernel-managed pages have stable 
> physical
> addresses. Userspace pages need explicit pinning and the device must be 
> prepared
> for potential invalidation.
> 
> This is why all compute drivers (amdgpu, i915, nouveau) implement 
> userptr - to
> make arbitrary userspace allocations DMA-accessible while respecting 
> their different
> page mobility characteristics.
> And the drm already has a better frame work for it: SVM, and this 
> verions is a super simplified verion.
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/ 
> drivers/gpu/drm/ 
> drm_gpusvm.c#:~:text=*%20GPU%20Shared%20Virtual%20Memory%20(GPU%20SVM)%20layer%20for%20the%20Direct%20Rendering%20Manager%20(DRM)

I referred to the phrasing "kernel allocates" vs "userspace allocates".
Using GFP_KERNEL, swapping, migrating, or pinning is all something the
kernel does.

> 
> 
>>
>>>    - Kernel only get existing pages
>>>    - Use case: Compute workloads (ROCm/CUDA) with large datasets, like
>>> GPU needs load a big model file 10G+, UMD mmap the fd file, then give 
>>> the mmap ptr into userspace then driver do not need a another copy.
>>> But if the shmem is used, the userspace needs copy the file data into 
>>> a shmem mmap ptr there is a copy overhead.
>>>
>>> Userptr:
>>>
>>> file -> open/mmap -> userspace ptr -> driver
>>>
>>> shmem:
>>>
>>> user alloc shmem ──→ mmap shmem ──→ shmem userspace ptr -> driver
>>>                                                ↑
>>>                                                │ copy
>>>                                                │
>>> file ──→ open/mmap ──→ file userptr ──────────┘
>>>
>>>
>>> For compute workloads, this matters significantly:
>>>    Without userptr: malloc(8GB) → alloc GEM BO → memcpy 8GB → compute 
>>> → memcpy 8GB back
>>>    With userptr:    malloc(8GB) → create userptr BO → compute (zero- 
>>> copy)
>>
>> Why don't you alloc GEM BO first and read the file into there?
> 
> Because that defeats the purpose of zero-copy.
> 
> With GEM-BO-first (what you suggest):
> 
> void *gembo = virtgpu_gem_create(10GB);     // Allocate GEM buffer
> void *model = mmap(..., model_file_fd, 0);  // Map model file
> memcpy(gembo, model, 10GB);                 // Copy 10GB - NOT zero-copy
> munmap(model, 10GB);
> gpu_compute(gembo);
> 
> Result: 10GB copy overhead + double memory usage during copy.

How about:

void *gembo = virtgpu_gem_create(10GB);
read(model_file_fd, gembo, 10GB);

Result: zero-copy + simpler code.

> 
> With userptr (zero-copy):
> 
> void *model = mmap(..., model_file_fd, 0);  // Map model file
> hsa_memory_register(model, 10GB);           // Pin pages, create userptr BO
> gpu_compute(model);                         // GPU reads directly from 
> file pages
> 
> 
>>
>>>
>>> The explicit flag serves three purposes:
>>>
>>> 1. Although both send scatter-gather entries to host. The flag makes 
>>> the intent unambiguous.
>>
>> Why will the host care?
> 
> The flag tells host this is a userptr, host side need handle it specially.

Please provide the concrete requirement. What is the special handling 
the host side needs to perform?

> 
> 
>>
>>>
>>> 2. Ensures consistency between flag and userptr address field.
>>
>> Addresses are represented with the nr_entries and following struct 
>> virtio_gpu_mem_entry entries, whenever 
>> VIRTIO_GPU_CMD_RESOURCE_CREATE_BLOB or 
>> VIRTIO_GPU_CMD_RESOURCE_ATTACH_BACKING is used. Having a special flag 
>> introduces inconsistency.
> 
> For this part I am talking about the virito gpu guest UMD side, in blob 
> create io ctrl we need this flag to
> check the userptr address and is it a read-only attribute:
>      if (rc_blob->blob_flags & VIRTGPU_BLOB_FLAG_USE_USERPTR) {
>          if (!rc_blob->userptr)
>              return -EINVAL;
>      } else {
>          if (rc_blob->userptr)
>              return -EINVAL;
> 
>          if (rc_blob->blob_flags & VIRTGPU_BLOB_FLAG_USERPTR_RDONLY)
>              return -EINVAL;
>      }

I see. That shows VIRTGPU_BLOB_FLAG_USE_USERPTR is necessary for the ioctl.

> 
>>
>>>
>>> 3. Future HMM support: There is a plan to upgrade userptr 
>>> implementation to use Heterogeneous Memory Management for better GPU 
>>> coherency and dynamic page migration. The flag provides a clean path 
>>> to future upgrade.
>>
>> How will the upgrade path with the flag and the one without the flag 
>> look like, and in what aspect the upgrade path with the flag is 
>> "cleaner"?
> 
> As I mentioned above the userptr handling is different with shmem/GEM BO.

All the above describes guest-internal behavior. What about the
interaction between the guest and the host? How will having
VIRTIO_GPU_BLOB_FLAG_USE_USERPTR in virtio, the guest-host interface, ease
a future upgrade?

> 
>>
>>>
>>> I understand the concern about API complexity. I'll defer to the 
>>> virtio- gpu maintainers for the final decision on whether this design 
>>> is acceptable or if they prefer an alternative approach.
>>
>> It is fine to have API complexity. The problem here is the lack of 
>> clear motivation and documentation.
>>
>> Another way to put this is: how will you explain the flag in the 
>> virtio specification? It should say "the driver MAY/SHOULD/MUST do 
>> something" and/or "the device MAY/SHOULD/MUST do something", and then 
>> Linux and virglrenderer can implement the flag accordingly.
> 
> you're absolutely right that the specification should
> be written in proper virtio spec language. The draft should be:
> 
> VIRTIO_GPU_BLOB_FLAG_USE_USERPTR:
> 
> Linux virtio driver requirements:
> - MUST set userptr to valid guest userspace VA in 
> drm_virtgpu_resource_create_blob
> - SHOULD keep VA mapping valid until resource destruction
> - MUST pin pages or use HMM at blob creation time

These descriptions are not for the virtio specification. The virtio 
specification describes the interaction between the driver and device. 
These statements describe the interaction between the guest userspace 
and the guest kernel.

> 
> Virglrenderer requirements:
> - must use correspoonding API for userptr resource

What is the "corresponding API"?

Regards,
Akihiko Odaki
Re: [PATCH v4 0/5] virtio-gpu: Add userptr support for compute workloads
Posted by Honglei Huang 3 weeks, 2 days ago

On 2026/1/16 18:01, Akihiko Odaki wrote:
> On 2026/01/16 18:39, Honglei Huang wrote:
>>
>>
>> On 2026/1/16 16:54, Akihiko Odaki wrote:
>>> On 2026/01/16 16:20, Honglei Huang wrote:
>>>>
>>>>
>>>> On 2026/1/15 17:20, Akihiko Odaki wrote:
>>>>> On 2026/01/15 16:58, Honglei Huang wrote:
>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> This series adds virtio-gpu userptr support to enable ROCm native
>>>>>> context for compute workloads. The userptr feature allows the host to
>>>>>> directly access guest userspace memory without memcpy overhead, 
>>>>>> which is
>>>>>> essential for GPU compute performance.
>>>>>>
>>>>>> The userptr implementation provides buffer-based zero-copy memory 
>>>>>> access.
>>>>>> This approach pins guest userspace pages and exposes them to the host
>>>>>> via scatter-gather tables, enabling efficient compute operations.
>>>>>
>>>>> This description looks identical with what 
>>>>> VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST does so there should be some 
>>>>> explanation how it makes difference.
>>>>>
>>>>> I have already pointed out this when reviewing the QEMU patches[1], 
>>>>> but I note that here too, since QEMU is just a middleman and this 
>>>>> matter is better discussed by Linux and virglrenderer developers.
>>>>>
>>>>> [1] https://lore.kernel.org/qemu-devel/35a8add7-da49-4833-9e69- 
>>>>> d213f52c771a@amd.com/
>>>>>
>>>>
>>>> Thanks for raising this important point about the distinction between
>>>> VIRTGPU_BLOB_FLAG_USE_USERPTR and VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST.
>>>> I might not have explained it clearly previously.
>>>>
>>>> The key difference is memory ownership and lifecycle:
>>>>
>>>> BLOB_MEM_HOST3D_GUEST:
>>>>    - Kernel allocates memory (drm_gem_shmem_create)
>>>>    - Userspace accesses via mmap(GEM_BO)
>>>>    - Use case: Graphics resources (Vulkan/OpenGL)
>>>>
>>>> BLOB_FLAG_USE_USERPTR:
>>>>    - Userspace pre-allocates memory (malloc/mmap)
>>>
>>> "Kernel allocates memory" and "userspace pre-allocates memory" is a 
>>> bit ambiguous phrasing. Either way, the userspace requests the kernel 
>>> to map memory with a system call, brk() or mmap().
>>
>> They are different:
>> BLOB_MEM_HOST3D_GUEST (kernel-managed pages):
>>    - Allocated via drm_gem_shmem_create() as GFP_KERNEL pages
>>    - Kernel guarantees pages won't swap or migrate while GEM object 
>> exists
>>    - Physical addresses remain stable → safe for DMA
>>
>> BLOB_FLAG_USE_USERPTR (userspace pages):
>>    - From regular malloc/mmap - subject to MM policies
>>    - Can be swapped, migrated, or compacted by kernel
>>    - Requires FOLL_LONGTERM pinning to make DMA-safe
>>
>> The device must treat them differently. Kernel-managed pages have 
>> stable physical
>> addresses. Userspace pages need explicit pinning and the device must 
>> be prepared
>> for potential invalidation.
>>
>> This is why all compute drivers (amdgpu, i915, nouveau) implement 
>> userptr - to
>> make arbitrary userspace allocations DMA-accessible while respecting 
>> their different
>> page mobility characteristics.
>> And the drm already has a better frame work for it: SVM, and this 
>> verions is a super simplified verion.
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/ 
>> tree/ drivers/gpu/drm/ 
>> drm_gpusvm.c#:~:text=*%20GPU%20Shared%20Virtual%20Memory%20(GPU%20SVM)%20layer%20for%20the%20Direct%20Rendering%20Manager%20(DRM)
> 
> I referred to phrasing "kernel allocates" vs "userspace allocates". 
> Using GFP_KERNEL, swapping, migrating, or pinning is all what the kernel 
> does.

I am talking about the virtio-gpu driver side: the virtio-gpu driver needs
to handle those two types of memory differently.

> 
>>
>>
>>>
>>>>    - Kernel only get existing pages
>>>>    - Use case: Compute workloads (ROCm/CUDA) with large datasets, like
>>>> GPU needs load a big model file 10G+, UMD mmap the fd file, then 
>>>> give the mmap ptr into userspace then driver do not need a another 
>>>> copy.
>>>> But if the shmem is used, the userspace needs copy the file data 
>>>> into a shmem mmap ptr there is a copy overhead.
>>>>
>>>> Userptr:
>>>>
>>>> file -> open/mmap -> userspace ptr -> driver
>>>>
>>>> shmem:
>>>>
>>>> user alloc shmem ──→ mmap shmem ──→ shmem userspace ptr -> driver
>>>>                                                ↑
>>>>                                                │ copy
>>>>                                                │
>>>> file ──→ open/mmap ──→ file userptr ──────────┘
>>>>
>>>>
>>>> For compute workloads, this matters significantly:
>>>>    Without userptr: malloc(8GB) → alloc GEM BO → memcpy 8GB → 
>>>> compute → memcpy 8GB back
>>>>    With userptr:    malloc(8GB) → create userptr BO → compute (zero- 
>>>> copy)
>>>
>>> Why don't you alloc GEM BO first and read the file into there?
>>
>> Because that defeats the purpose of zero-copy.
>>
>> With GEM-BO-first (what you suggest):
>>
>> void *gembo = virtgpu_gem_create(10GB);     // Allocate GEM buffer
>> void *model = mmap(..., model_file_fd, 0);  // Map model file
>> memcpy(gembo, model, 10GB);                 // Copy 10GB - NOT zero-copy
>> munmap(model, 10GB);
>> gpu_compute(gembo);
>>
>> Result: 10GB copy overhead + double memory usage during copy.
> 
> How about:
> 
> void *gembo = virtgpu_gem_create(10GB);
> read(model_file_fd, gembo, 10GB);

I believe there is still a memory copy in the read operation from
model_file_fd to gembo, since they have different physical pages, while the
userptr/SVM feature will access the model_file_fd physical pages directly.


> 
> Result: zero-copy + simpler code.
> 
>>
>> With userptr (zero-copy):
>>
>> void *model = mmap(..., model_file_fd, 0);  // Map model file
>> hsa_memory_register(model, 10GB);           // Pin pages, create 
>> userptr BO
>> gpu_compute(model);                         // GPU reads directly from 
>> file pages
>>
>>
>>>
>>>>
>>>> The explicit flag serves three purposes:
>>>>
>>>> 1. Although both send scatter-gather entries to host. The flag makes 
>>>> the intent unambiguous.
>>>
>>> Why will the host care?
>>
>> The flag tells host this is a userptr, host side need handle it 
>> specially.
> 
> Please provide the concrete requirement. What is the special handling 
> the host side needs to perform?

Every piece of hardware has its own API for handling userptr; for amdgpu
ROCm it is hsaKmtRegisterMemoryWithFlags.

> 
>>
>>
>>>
>>>>
>>>> 2. Ensures consistency between flag and userptr address field.
>>>
>>> Addresses are represented with the nr_entries and following struct 
>>> virtio_gpu_mem_entry entries, whenever 
>>> VIRTIO_GPU_CMD_RESOURCE_CREATE_BLOB or 
>>> VIRTIO_GPU_CMD_RESOURCE_ATTACH_BACKING is used. Having a special flag 
>>> introduces inconsistency.
>>
>> For this part I am talking about the virito gpu guest UMD side, in 
>> blob create io ctrl we need this flag to
>> check the userptr address and is it a read-only attribute:
>>      if (rc_blob->blob_flags & VIRTGPU_BLOB_FLAG_USE_USERPTR) {
>>          if (!rc_blob->userptr)
>>              return -EINVAL;
>>      } else {
>>          if (rc_blob->userptr)
>>              return -EINVAL;
>>
>>          if (rc_blob->blob_flags & VIRTGPU_BLOB_FLAG_USERPTR_RDONLY)
>>              return -EINVAL;
>>      }
> 
> I see. That shows VIRTGPU_BLOB_FLAG_USE_USERPTR is necessary for the ioctl.
> 
>>
>>>
>>>>
>>>> 3. Future HMM support: There is a plan to upgrade userptr 
>>>> implementation to use Heterogeneous Memory Management for better GPU 
>>>> coherency and dynamic page migration. The flag provides a clean path 
>>>> to future upgrade.
>>>
>>> How will the upgrade path with the flag and the one without the flag 
>>> look like, and in what aspect the upgrade path with the flag is 
>>> "cleaner"?
>>
>> As I mentioned above the userptr handling is different with shmem/GEM BO.
> 
> All the above describes the guest-internal behavior. What about the 
> interaction between the guest and host? How will virtio as a guest-host 
> interface having VIRTIO_GPU_BLOB_FLAG_USE_USERPTR ease future upgrade?

It depends on how we implement it; the current version is the simplest
implementation, similar to the one in Intel's i915.
If the virtio side needs HMM to implement an SVM-type userptr feature,
I think VIRTIO_GPU_BLOB_FLAG_USE_USERPTR is definitely needed: the stack
needs to know whether it is a userptr resource in order to perform advanced
operations such as updating page tables, splitting BOs, etc.

> 
>>
>>>
>>>>
>>>> I understand the concern about API complexity. I'll defer to the 
>>>> virtio- gpu maintainers for the final decision on whether this 
>>>> design is acceptable or if they prefer an alternative approach.
>>>
>>> It is fine to have API complexity. The problem here is the lack of 
>>> clear motivation and documentation.
>>>
>>> Another way to put this is: how will you explain the flag in the 
>>> virtio specification? It should say "the driver MAY/SHOULD/MUST do 
>>> something" and/or "the device MAY/SHOULD/MUST do something", and then 
>>> Linux and virglrenderer can implement the flag accordingly.
>>
>> you're absolutely right that the specification should
>> be written in proper virtio spec language. The draft should be:
>>
>> VIRTIO_GPU_BLOB_FLAG_USE_USERPTR:
>>
>> Linux virtio driver requirements:
>> - MUST set userptr to valid guest userspace VA in 
>> drm_virtgpu_resource_create_blob
>> - SHOULD keep VA mapping valid until resource destruction
>> - MUST pin pages or use HMM at blob creation time
> 
> These descriptions are not for the virtio specification. The virtio 
> specification describes the interaction between the driver and device. 
> These statements describe the interaction between the guest userspace 
> and the guest kernel.
> 
>>
>> Virglrenderer requirements:
>> - must use correspoonding API for userptr resource
> 
> What is the "corresponding API"?

It could be:
**VIRTIO_GPU_BLOB_FLAG_USE_USERPTR specification:**

Driver requirements:
- MUST populate mem_entry[] with valid guest physical addresses of
pinned userspace pages
- MUST set blob_mem to VIRTIO_GPU_BLOB_FLAG_USE_USERPTR when using this flag
- SHOULD keep pages pinned until VIRTIO_GPU_CMD_RESOURCE_UNREF

Device requirements:
- MUST establish IOMMU mappings using the provided iovec array with a
device-specific API (hsaKmtRegisterMemoryWithFlags for ROCm)



Really, thanks for your comments; I believe we need some input from the
virtio-gpu maintainers.

The VIRTIO_GPU_BLOB_FLAG_USE_USERPTR flag describes how the resource is
used, and it doesn't conflict with VIRTGPU_BLOB_MEM_HOST3D_GUEST, just as a
resource used with VIRTGPU_BLOB_FLAG_USE_SHAREABLE can still be either a
guest resource or a host resource.

If we don't have the VIRTIO_GPU_BLOB_FLAG_USE_USERPTR flag, we may get
resource conflicts on the host side. The guest kernel can use the 'userptr'
param to identify the resource, but on the host side the 'userptr' param is
lost and we only know that it is just a guest resource.


> 
> Regards,
> Akihiko Odaki

Re: [PATCH v4 0/5] virtio-gpu: Add userptr support for compute workloads
Posted by Akihiko Odaki 3 weeks, 2 days ago
On 2026/01/16 19:32, Honglei Huang wrote:
> 
> 
> On 2026/1/16 18:01, Akihiko Odaki wrote:
>> On 2026/01/16 18:39, Honglei Huang wrote:
>>>
>>>
>>> On 2026/1/16 16:54, Akihiko Odaki wrote:
>>>> On 2026/01/16 16:20, Honglei Huang wrote:
>>>>>
>>>>>
>>>>> On 2026/1/15 17:20, Akihiko Odaki wrote:
>>>>>> On 2026/01/15 16:58, Honglei Huang wrote:
>>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> This series adds virtio-gpu userptr support to enable ROCm native
>>>>>>> context for compute workloads. The userptr feature allows the 
>>>>>>> host to
>>>>>>> directly access guest userspace memory without memcpy overhead, 
>>>>>>> which is
>>>>>>> essential for GPU compute performance.
>>>>>>>
>>>>>>> The userptr implementation provides buffer-based zero-copy memory 
>>>>>>> access.
>>>>>>> This approach pins guest userspace pages and exposes them to the 
>>>>>>> host
>>>>>>> via scatter-gather tables, enabling efficient compute operations.
>>>>>>
>>>>>> This description looks identical with what 
>>>>>> VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST does so there should be some 
>>>>>> explanation how it makes difference.
>>>>>>
>>>>>> I have already pointed out this when reviewing the QEMU 
>>>>>> patches[1], but I note that here too, since QEMU is just a 
>>>>>> middleman and this matter is better discussed by Linux and 
>>>>>> virglrenderer developers.
>>>>>>
>>>>>> [1] https://lore.kernel.org/qemu-devel/35a8add7-da49-4833-9e69- 
>>>>>> d213f52c771a@amd.com/
>>>>>>
>>>>>
>>>>> Thanks for raising this important point about the distinction between
>>>>> VIRTGPU_BLOB_FLAG_USE_USERPTR and VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST.
>>>>> I might not have explained it clearly previously.
>>>>>
>>>>> The key difference is memory ownership and lifecycle:
>>>>>
>>>>> BLOB_MEM_HOST3D_GUEST:
>>>>>    - Kernel allocates memory (drm_gem_shmem_create)
>>>>>    - Userspace accesses via mmap(GEM_BO)
>>>>>    - Use case: Graphics resources (Vulkan/OpenGL)
>>>>>
>>>>> BLOB_FLAG_USE_USERPTR:
>>>>>    - Userspace pre-allocates memory (malloc/mmap)
>>>>
>>>> "Kernel allocates memory" and "userspace pre-allocates memory" is a 
>>>> bit ambiguous phrasing. Either way, the userspace requests the 
>>>> kernel to map memory with a system call, brk() or mmap().
>>>
>>> They are different:
>>> BLOB_MEM_HOST3D_GUEST (kernel-managed pages):
>>>    - Allocated via drm_gem_shmem_create() as GFP_KERNEL pages
>>>    - Kernel guarantees pages won't swap or migrate while GEM object 
>>> exists
>>>    - Physical addresses remain stable → safe for DMA
>>>
>>> BLOB_FLAG_USE_USERPTR (userspace pages):
>>>    - From regular malloc/mmap - subject to MM policies
>>>    - Can be swapped, migrated, or compacted by kernel
>>>    - Requires FOLL_LONGTERM pinning to make DMA-safe
>>>
>>> The device must treat them differently. Kernel-managed pages have 
>>> stable physical
>>> addresses. Userspace pages need explicit pinning and the device must 
>>> be prepared
>>> for potential invalidation.
>>>
>>> This is why all compute drivers (amdgpu, i915, nouveau) implement 
>>> userptr - to
>>> make arbitrary userspace allocations DMA-accessible while respecting 
>>> their different
>>> page mobility characteristics.
>>> And the drm already has a better frame work for it: SVM, and this 
>>> verions is a super simplified verion.
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/ 
>>> tree/ drivers/gpu/drm/ 
>>> drm_gpusvm.c#:~:text=*%20GPU%20Shared%20Virtual%20Memory%20(GPU%20SVM)%20layer%20for%20the%20Direct%20Rendering%20Manager%20(DRM)
>>
>> I referred to phrasing "kernel allocates" vs "userspace allocates". 
>> Using GFP_KERNEL, swapping, migrating, or pinning is all what the 
>> kernel does.
> 
> I am talking about the virtio gpu driver side, the virtio gpu driver 
> need handle those two type memory differently.
> 
>>
>>>
>>>
>>>>
>>>>>    - Kernel only get existing pages
>>>>>    - Use case: Compute workloads (ROCm/CUDA) with large datasets, like
>>>>> GPU needs load a big model file 10G+, UMD mmap the fd file, then 
>>>>> give the mmap ptr into userspace then driver do not need a another 
>>>>> copy.
>>>>> But if the shmem is used, the userspace needs copy the file data 
>>>>> into a shmem mmap ptr there is a copy overhead.
>>>>>
>>>>> Userptr:
>>>>>
>>>>> file -> open/mmap -> userspace ptr -> driver
>>>>>
>>>>> shmem:
>>>>>
>>>>> user alloc shmem ──→ mmap shmem ──→ shmem userspace ptr -> driver
>>>>>                                                ↑
>>>>>                                                │ copy
>>>>>                                                │
>>>>> file ──→ open/mmap ──→ file userptr ──────────┘
>>>>>
>>>>>
>>>>> For compute workloads, this matters significantly:
>>>>>    Without userptr: malloc(8GB) → alloc GEM BO → memcpy 8GB → 
>>>>> compute → memcpy 8GB back
>>>>>    With userptr:    malloc(8GB) → create userptr BO → compute 
>>>>> (zero- copy)
>>>>
>>>> Why don't you alloc GEM BO first and read the file into there?
>>>
>>> Because that defeats the purpose of zero-copy.
>>>
>>> With GEM-BO-first (what you suggest):
>>>
>>> void *gembo = virtgpu_gem_create(10GB);     // Allocate GEM buffer
>>> void *model = mmap(..., model_file_fd, 0);  // Map model file
>>> memcpy(gembo, model, 10GB);                 // Copy 10GB - NOT zero-copy
>>> munmap(model, 10GB);
>>> gpu_compute(gembo);
>>>
>>> Result: 10GB copy overhead + double memory usage during copy.
>>
>> How about:
>>
>> void *gembo = virtgpu_gem_create(10GB);
>> read(model_file_fd, gembo, 10GB);
> 
> I believe there is still memory copy in read operation
> model_file_fd -> gembo, they have different physical pages,
> but the userptr/SVM feature will access the model_file_fd physical pages 
> directly.

You can use O_DIRECT if you want.
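
For example, something like this (sketch only; map_gem_bo() is a placeholder
for however the application maps the blob BO, and O_DIRECT requires
block-aligned size and offset):

int file_fd = open("model.bin", O_RDONLY | O_DIRECT); /* bypass the page cache */
void *gembo = map_gem_bo(drm_fd, bo_handle, 10GB);    /* placeholder: mmap of the GEM BO */
read(file_fd, gembo, 10GB);                           /* DMA lands directly in the BO pages */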

> 
> 
>>
>> Result: zero-copy + simpler code.
>>
>>>
>>> With userptr (zero-copy):
>>>
>>> void *model = mmap(..., model_file_fd, 0);  // Map model file
>>> hsa_memory_register(model, 10GB);           // Pin pages, create 
>>> userptr BO
>>> gpu_compute(model);                         // GPU reads directly 
>>> from file pages
>>>
>>>
>>>>
>>>>>
>>>>> The explicit flag serves three purposes:
>>>>>
>>>>> 1. Although both send scatter-gather entries to host. The flag 
>>>>> makes the intent unambiguous.
>>>>
>>>> Why will the host care?
>>>
>>> The flag tells host this is a userptr, host side need handle it 
>>> specially.
>>
>> Please provide the concrete requirement. What is the special handling 
>> the host side needs to perform?
> 
> Every hardware has it own special API to handle userptr, for amdgpu ROCm
> it is hsaKmtRegisterMemoryWithFlags.

On the host side, BLOB_MEM_HOST3D_GUEST will always result in a 
userspace pointer. Below is how the address is translated:

1) (with the ioctl you are adding)
    Guest kernel translates guest userspace pointer to guest PA.
2) (with IOMMU)
    Guest kernel translates guest PA to device VA
3) The host VMM translates device VA to host userspace pointer
4) virglrenderer passes userspace pointer to the GPU API (ROCm)

BLOB_FLAG_USE_USERPTR tells the device that 1) happened, but the
succeeding steps are not affected by that.

> 
>>
>>>
>>>
>>>>
>>>>>
>>>>> 2. Ensures consistency between flag and userptr address field.
>>>>
>>>> Addresses are represented with the nr_entries and following struct 
>>>> virtio_gpu_mem_entry entries, whenever 
>>>> VIRTIO_GPU_CMD_RESOURCE_CREATE_BLOB or 
>>>> VIRTIO_GPU_CMD_RESOURCE_ATTACH_BACKING is used. Having a special 
>>>> flag introduces inconsistency.
>>>
>>> For this part I am talking about the virito gpu guest UMD side, in 
>>> blob create io ctrl we need this flag to
>>> check the userptr address and is it a read-only attribute:
>>>      if (rc_blob->blob_flags & VIRTGPU_BLOB_FLAG_USE_USERPTR) {
>>>          if (!rc_blob->userptr)
>>>              return -EINVAL;
>>>      } else {
>>>          if (rc_blob->userptr)
>>>              return -EINVAL;
>>>
>>>          if (rc_blob->blob_flags & VIRTGPU_BLOB_FLAG_USERPTR_RDONLY)
>>>              return -EINVAL;
>>>      }
>>
>> I see. That shows VIRTGPU_BLOB_FLAG_USE_USERPTR is necessary for the 
>> ioctl.
>>
>>>
>>>>
>>>>>
>>>>> 3. Future HMM support: There is a plan to upgrade userptr 
>>>>> implementation to use Heterogeneous Memory Management for better 
>>>>> GPU coherency and dynamic page migration. The flag provides a clean 
>>>>> path to future upgrade.
>>>>
>>>> How will the upgrade path with the flag and the one without the flag 
>>>> look like, and in what aspect the upgrade path with the flag is 
>>>> "cleaner"?
>>>
>>> As I mentioned above the userptr handling is different with shmem/GEM 
>>> BO.
>>
>> All the above describes the guest-internal behavior. What about the 
>> interaction between the guest and host? How will virtio as a guest- 
>> host interface having VIRTIO_GPU_BLOB_FLAG_USE_USERPTR ease future 
>> upgrade?
> 
> It depends on how we implement it; the current version is the simplest
> implementation, similar to the one in Intel's i915.
> If the virtio side needs HMM to implement an SVM-type userptr feature,
> I think VIRTIO_GPU_BLOB_FLAG_USE_USERPTR is required: the stack needs to
> know whether it is a userptr resource in order to perform advanced
> operations such as updating page tables, splitting BOs, etc.

Why does the device need to know whether it is a userptr resource to
perform those operations when the device always gets device VAs?

> 
>>
>>>
>>>>
>>>>>
>>>>> I understand the concern about API complexity. I'll defer to the 
>>>>> virtio- gpu maintainers for the final decision on whether this 
>>>>> design is acceptable or if they prefer an alternative approach.
>>>>
>>>> It is fine to have API complexity. The problem here is the lack of 
>>>> clear motivation and documentation.
>>>>
>>>> Another way to put this is: how will you explain the flag in the 
>>>> virtio specification? It should say "the driver MAY/SHOULD/MUST do 
>>>> something" and/or "the device MAY/SHOULD/MUST do something", and 
>>>> then Linux and virglrenderer can implement the flag accordingly.
>>>
>>> You're absolutely right that the specification should be written in
>>> proper virtio spec language. The draft could be:
>>>
>>> VIRTIO_GPU_BLOB_FLAG_USE_USERPTR:
>>>
>>> Linux virtio driver requirements:
>>> - MUST set userptr to valid guest userspace VA in 
>>> drm_virtgpu_resource_create_blob
>>> - SHOULD keep VA mapping valid until resource destruction
>>> - MUST pin pages or use HMM at blob creation time
>>
>> These descriptions are not for the virtio specification. The virtio 
>> specification describes the interaction between the driver and device. 
>> These statements describe the interaction between the guest userspace 
>> and the guest kernel.
>>
>>>
>>> Virglrenderer requirements:
>>> - MUST use the corresponding API for userptr resources
>>
>> What is the "corresponding API"?
> 
> It could be:
> **VIRTIO_GPU_BLOB_FLAG_USE_USERPTR specification:**
> 
> Driver requirements:
> - MUST populate mem_entry[] with valid guest physical addresses of 
> pinned userspace pages

"Userspace" is a guest-internal concept and is irrelevant to the
interaction between the driver and the device.

> - MUST set VIRTIO_GPU_BLOB_FLAG_USE_USERPTR in blob_flags when using this
> feature

When should the driver use the flag?
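
(For reference, the guest-side usage the series appears to target looks
roughly like the snippet below. The userptr field, the blob_mem choice,
and the surrounding variables such as drm_fd and len are inferred from the
validation code quoted earlier and from the cover letter, not from a
settled UAPI, so every name here is tentative.)

/* Hypothetical usage of the proposed extension; field names may differ. */
void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

struct drm_virtgpu_resource_create_blob rc = {
	.blob_mem   = VIRTGPU_BLOB_MEM_GUEST,         /* assumption */
	.blob_flags = VIRTGPU_BLOB_FLAG_USE_USERPTR,  /* proposed flag */
	.size       = len,
	.userptr    = (uintptr_t)buf,                 /* proposed field */
};

if (ioctl(drm_fd, DRM_IOCTL_VIRTGPU_RESOURCE_CREATE_BLOB, &rc) == 0)
	submit_compute_job(rc.bo_handle);  /* placeholder for the real work */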

> - SHOULD keep pages pinned until VIRTIO_GPU_CMD_RESOURCE_UNREF

It is not a new requirement. The page must stay at the same position 
whether VIRTIO_GPU_BLOB_FLAG_USE_USERPTR is used or not.

> 
> Device requirements:
> - MUST establish IOMMU mappings from the provided iovec array with the
> hardware-specific API (hsaKmtRegisterMemoryWithFlags for ROCm)

This should also be true even when VIRTIO_GPU_BLOB_FLAG_USE_USERPTR is
not set.

> 
> 
> 
> Thanks a lot for your comments; I believe we need some input from the
> virtio-gpu maintainers.
> 
> VIRTIO_GPU_BLOB_FLAG_USE_USERPTR describes how the resource is used, and
> it doesn't conflict with VIRTGPU_BLOB_MEM_HOST3D_GUEST, just like a
> resource can carry VIRTGPU_BLOB_FLAG_USE_SHAREABLE and still be either a
> guest resource or a host resource.
> 
> If we don't have the VIRTIO_GPU_BLOB_FLAG_USE_USERPTR flag, we may get
> resource conflicts on the host side. The guest kernel can use the
> 'userptr' param to identify the resource, but on the host side the
> 'userptr' param is lost and we only know it is a guest resource.

I still don't see why knowing it is a guest resource is insufficient for 
the host.

Regards,
Akihiko Odaki
Re: [PATCH v4 0/5] virtio-gpu: Add userptr support for compute workloads
Posted by Honglei Huang 3 weeks, 2 days ago

On 2026/1/16 19:03, Akihiko Odaki wrote:
> On 2026/01/16 19:32, Honglei Huang wrote:
>>
>>
>> On 2026/1/16 18:01, Akihiko Odaki wrote:
>>> On 2026/01/16 18:39, Honglei Huang wrote:
>>>>
>>>>
>>>> On 2026/1/16 16:54, Akihiko Odaki wrote:
>>>>> On 2026/01/16 16:20, Honglei Huang wrote:
>>>>>>
>>>>>>
>>>>>> On 2026/1/15 17:20, Akihiko Odaki wrote:
>>>>>>> On 2026/01/15 16:58, Honglei Huang wrote:
>>>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> This series adds virtio-gpu userptr support to enable ROCm native
>>>>>>>> context for compute workloads. The userptr feature allows the 
>>>>>>>> host to
>>>>>>>> directly access guest userspace memory without memcpy overhead, 
>>>>>>>> which is
>>>>>>>> essential for GPU compute performance.
>>>>>>>>
>>>>>>>> The userptr implementation provides buffer-based zero-copy 
>>>>>>>> memory access.
>>>>>>>> This approach pins guest userspace pages and exposes them to the 
>>>>>>>> host
>>>>>>>> via scatter-gather tables, enabling efficient compute operations.
>>>>>>>
>>>>>>> This description looks identical with what 
>>>>>>> VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST does so there should be some 
>>>>>>> explanation how it makes difference.
>>>>>>>
>>>>>>> I have already pointed out this when reviewing the QEMU 
>>>>>>> patches[1], but I note that here too, since QEMU is just a 
>>>>>>> middleman and this matter is better discussed by Linux and 
>>>>>>> virglrenderer developers.
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/qemu-devel/35a8add7-da49-4833-9e69- 
>>>>>>> d213f52c771a@amd.com/
>>>>>>>
>>>>>>
>>>>>> Thanks for raising this important point about the distinction between
>>>>>> VIRTGPU_BLOB_FLAG_USE_USERPTR and VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST.
>>>>>> I might not have explained it clearly previously.
>>>>>>
>>>>>> The key difference is memory ownership and lifecycle:
>>>>>>
>>>>>> BLOB_MEM_HOST3D_GUEST:
>>>>>>    - Kernel allocates memory (drm_gem_shmem_create)
>>>>>>    - Userspace accesses via mmap(GEM_BO)
>>>>>>    - Use case: Graphics resources (Vulkan/OpenGL)
>>>>>>
>>>>>> BLOB_FLAG_USE_USERPTR:
>>>>>>    - Userspace pre-allocates memory (malloc/mmap)
>>>>>
>>>>> "Kernel allocates memory" and "userspace pre-allocates memory" is a 
>>>>> bit ambiguous phrasing. Either way, the userspace requests the 
>>>>> kernel to map memory with a system call, brk() or mmap().
>>>>
>>>> They are different:
>>>> BLOB_MEM_HOST3D_GUEST (kernel-managed pages):
>>>>    - Allocated via drm_gem_shmem_create() as GFP_KERNEL pages
>>>>    - Kernel guarantees pages won't swap or migrate while GEM object 
>>>> exists
>>>>    - Physical addresses remain stable → safe for DMA
>>>>
>>>> BLOB_FLAG_USE_USERPTR (userspace pages):
>>>>    - From regular malloc/mmap - subject to MM policies
>>>>    - Can be swapped, migrated, or compacted by kernel
>>>>    - Requires FOLL_LONGTERM pinning to make DMA-safe
>>>>
>>>> The device must treat them differently. Kernel-managed pages have 
>>>> stable physical
>>>> addresses. Userspace pages need explicit pinning and the device must 
>>>> be prepared
>>>> for potential invalidation.
>>>>
>>>> This is why all compute drivers (amdgpu, i915, nouveau) implement 
>>>> userptr - to
>>>> make arbitrary userspace allocations DMA-accessible while respecting 
>>>> their different
>>>> page mobility characteristics.
>>>> And the drm already has a better frame work for it: SVM, and this 
>>>> verions is a super simplified verion.
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/ 
>>>> tree/ drivers/gpu/drm/ 
>>>> drm_gpusvm.c#:~:text=*%20GPU%20Shared%20Virtual%20Memory%20(GPU%20SVM)%20layer%20for%20the%20Direct%20Rendering%20Manager%20(DRM)
>>>
>>> I referred to phrasing "kernel allocates" vs "userspace allocates". 
>>> Using GFP_KERNEL, swapping, migrating, or pinning is all what the 
>>> kernel does.
>>
>> I am talking about the virtio gpu driver side, the virtio gpu driver 
>> need handle those two type memory differently.
>>
>>>
>>>>
>>>>
>>>>>
>>>>>>    - Kernel only get existing pages
>>>>>>    - Use case: Compute workloads (ROCm/CUDA) with large datasets, 
>>>>>> like
>>>>>> GPU needs load a big model file 10G+, UMD mmap the fd file, then 
>>>>>> give the mmap ptr into userspace then driver do not need a another 
>>>>>> copy.
>>>>>> But if the shmem is used, the userspace needs copy the file data 
>>>>>> into a shmem mmap ptr there is a copy overhead.
>>>>>>
>>>>>> Userptr:
>>>>>>
>>>>>> file -> open/mmap -> userspace ptr -> driver
>>>>>>
>>>>>> shmem:
>>>>>>
>>>>>> user alloc shmem ──→ mmap shmem ──→ shmem userspace ptr -> driver
>>>>>>                                                ↑
>>>>>>                                                │ copy
>>>>>>                                                │
>>>>>> file ──→ open/mmap ──→ file userptr ──────────┘
>>>>>>
>>>>>>
>>>>>> For compute workloads, this matters significantly:
>>>>>>    Without userptr: malloc(8GB) → alloc GEM BO → memcpy 8GB → 
>>>>>> compute → memcpy 8GB back
>>>>>>    With userptr:    malloc(8GB) → create userptr BO → compute 
>>>>>> (zero- copy)
>>>>>
>>>>> Why don't you alloc GEM BO first and read the file into there?
>>>>
>>>> Because that defeats the purpose of zero-copy.
>>>>
>>>> With GEM-BO-first (what you suggest):
>>>>
>>>> void *gembo = virtgpu_gem_create(10GB);     // Allocate GEM buffer
>>>> void *model = mmap(..., model_file_fd, 0);  // Map model file
>>>> memcpy(gembo, model, 10GB);                 // Copy 10GB - NOT zero- 
>>>> copy
>>>> munmap(model, 10GB);
>>>> gpu_compute(gembo);
>>>>
>>>> Result: 10GB copy overhead + double memory usage during copy.
>>>
>>> How about:
>>>
>>> void *gembo = virtgpu_gem_create(10GB);
>>> read(model_file_fd, gembo, 10GB);
>>
>> I believe there is still memory copy in read operation
>> model_file_fd -> gembo, they have different physical pages,
>> but the userptr/SVM feature will access the model_file_fd physical 
>> pages directly.
> 
> You can use O_DIRECT if you want.
> 
>>
>>
>>>
>>> Result: zero-copy + simpler code.
>>>
>>>>
>>>> With userptr (zero-copy):
>>>>
>>>> void *model = mmap(..., model_file_fd, 0);  // Map model file
>>>> hsa_memory_register(model, 10GB);           // Pin pages, create 
>>>> userptr BO
>>>> gpu_compute(model);                         // GPU reads directly 
>>>> from file pages
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> The explicit flag serves three purposes:
>>>>>>
>>>>>> 1. Although both send scatter-gather entries to host. The flag 
>>>>>> makes the intent unambiguous.
>>>>>
>>>>> Why will the host care?
>>>>
>>>> The flag tells host this is a userptr, host side need handle it 
>>>> specially.
>>>
>>> Please provide the concrete requirement. What is the special handling 
>>> the host side needs to perform?
>>
>> Every hardware has it own special API to handle userptr, for amdgpu ROCm
>> it is hsaKmtRegisterMemoryWithFlags.
> 
> On the host side, BLOB_MEM_HOST3D_GUEST will always result in a 
> userspace pointer. Below is how the address is translated:
> 
> 1) (with the ioctl you are adding)
>     Guest kernel translates guest userspace pointer to guest PA.
> 2) (with IOMMU)
>     Guest kernel translates guest PA to device VA
> 3) The host VMM translates device VA to host userspace pointer
> 4) virglrenderer passes userspace pointer to the GPU API (ROCm)
> 
> BLOB_FLAG_USE_USERPTR tells 1) happened. But the succeeding process is 
> not affected by that.
> 
>>
>>>
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> 2. Ensures consistency between flag and userptr address field.
>>>>>
>>>>> Addresses are represented with the nr_entries and following struct 
>>>>> virtio_gpu_mem_entry entries, whenever 
>>>>> VIRTIO_GPU_CMD_RESOURCE_CREATE_BLOB or 
>>>>> VIRTIO_GPU_CMD_RESOURCE_ATTACH_BACKING is used. Having a special 
>>>>> flag introduces inconsistency.
>>>>
>>>> For this part I am talking about the virito gpu guest UMD side, in 
>>>> blob create io ctrl we need this flag to
>>>> check the userptr address and is it a read-only attribute:
>>>>      if (rc_blob->blob_flags & VIRTGPU_BLOB_FLAG_USE_USERPTR) {
>>>>          if (!rc_blob->userptr)
>>>>              return -EINVAL;
>>>>      } else {
>>>>          if (rc_blob->userptr)
>>>>              return -EINVAL;
>>>>
>>>>          if (rc_blob->blob_flags & VIRTGPU_BLOB_FLAG_USERPTR_RDONLY)
>>>>              return -EINVAL;
>>>>      }
>>>
>>> I see. That shows VIRTGPU_BLOB_FLAG_USE_USERPTR is necessary for the 
>>> ioctl.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> 3. Future HMM support: There is a plan to upgrade userptr 
>>>>>> implementation to use Heterogeneous Memory Management for better 
>>>>>> GPU coherency and dynamic page migration. The flag provides a 
>>>>>> clean path to future upgrade.
>>>>>
>>>>> How will the upgrade path with the flag and the one without the 
>>>>> flag look like, and in what aspect the upgrade path with the flag 
>>>>> is "cleaner"?
>>>>
>>>> As I mentioned above the userptr handling is different with shmem/ 
>>>> GEM BO.
>>>
>>> All the above describes the guest-internal behavior. What about the 
>>> interaction between the guest and host? How will virtio as a guest- 
>>> host interface having VIRTIO_GPU_BLOB_FLAG_USE_USERPTR ease future 
>>> upgrade?
>>
>> It depends on how we implement it, the current version is the simplest 
>> implementation, similar to the implementation in Intel's i915.
>> If virtio side needs HMM to implement a SVM type userptr feature
>> I think VIRTIO_GPU_BLOB_FLAG_USE_USERPTR is must needed, stack needs 
>> to know if it is a userptr resource, and to perform advanced 
>> operations such as updating page tables, splitting BOs, etc.
> 
> Why do the device need to know if it is a userptr resource to perform 
> operations when the device always get device VAs?
> 
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>> I understand the concern about API complexity. I'll defer to the 
>>>>>> virtio- gpu maintainers for the final decision on whether this 
>>>>>> design is acceptable or if they prefer an alternative approach.
>>>>>
>>>>> It is fine to have API complexity. The problem here is the lack of 
>>>>> clear motivation and documentation.
>>>>>
>>>>> Another way to put this is: how will you explain the flag in the 
>>>>> virtio specification? It should say "the driver MAY/SHOULD/MUST do 
>>>>> something" and/or "the device MAY/SHOULD/MUST do something", and 
>>>>> then Linux and virglrenderer can implement the flag accordingly.
>>>>
>>>> you're absolutely right that the specification should
>>>> be written in proper virtio spec language. The draft should be:
>>>>
>>>> VIRTIO_GPU_BLOB_FLAG_USE_USERPTR:
>>>>
>>>> Linux virtio driver requirements:
>>>> - MUST set userptr to valid guest userspace VA in 
>>>> drm_virtgpu_resource_create_blob
>>>> - SHOULD keep VA mapping valid until resource destruction
>>>> - MUST pin pages or use HMM at blob creation time
>>>
>>> These descriptions are not for the virtio specification. The virtio 
>>> specification describes the interaction between the driver and 
>>> device. These statements describe the interaction between the guest 
>>> userspace and the guest kernel.
>>>
>>>>
>>>> Virglrenderer requirements:
>>>> - must use correspoonding API for userptr resource
>>>
>>> What is the "corresponding API"?
>>
>> It may can be:
>> **VIRTIO_GPU_BLOB_FLAG_USE_USERPTR specification:**
>>
>> Driver requirements:
>> - MUST populate mem_entry[] with valid guest physical addresses of 
>> pinned userspace pages
> 
> "Userspace" is a the guest-internal concepts and irrelevant with the 
> interaction between the driver and device.
> 
>> - MUST set blob_mem to VIRTIO_GPU_BLOB_FLAG_USE_USERPTR when using 
>> this flag
> 
> When should the driver use the flag?
> 
>> - SHOULD keep pages pinned until VIRTIO_GPU_CMD_RESOURCE_UNREF
> 
> It is not a new requirement. The page must stay at the same position 
> whether VIRTIO_GPU_BLOB_FLAG_USE_USERPTR is used or not.
> 
>>
>> Device requirements:
>> - MUST establish IOMMU mappings using the provided iovec array with 
>> specific API.(hsaKmtRegisterMemoryWithFlags for ROCm)
> 
> This should be also true even when VIRTIO_GPU_BLOB_FLAG_USE_USERPTR is 
> not set.
> 
>>
>>
>>
>> Really thanks for your comments, and I believe we need some input of
>> virito gpu maintainers.
>>
>> VIRTIO_GPU_BLOB_FLAG_USE_USERPTR flag is a flag for how to use, and it 
>> doen't conflict with VIRTGPU_BLOB_MEM_HOST3D_GUEST. Just like a 
>> resource is used for VIRTGPU_BLOB_FLAG_USE_SHAREABLE but it can be a 
>> guest resource or a host resource.
>>
>> If we don't have VIRTIO_GPU_BLOB_FLAG_USE_USERPTR flag, we may have some
>> resource conflict in host side, guest kernel can use 'userptr' param 
>> to identify. But in host side the 'userptr' param is lost, we only 
>> know it is just a guest flag resource.
> 
> I still don't see why knowing it is a guest resource is insufficient for 
> the host.

All right, I totally agree with you.

And maybe it is better to let the virtio-gpu/DRM maintainers decide how to
design the flag/params.


I believe the core gap between you and me is the concept of userptr/SVM.
What is userptr/SVM used for? It lets the GPU and CPU share the userspace
virtual address. Perhaps my description is not accurate enough.


> 
> Regards,
> AKihiko Odaki

Re: [PATCH v4 0/5] virtio-gpu: Add userptr support for compute workloads
Posted by Akihiko Odaki 3 weeks, 2 days ago
On 2026/01/16 21:34, Honglei Huang wrote:
> 
> 
> On 2026/1/16 19:03, Akihiko Odaki wrote:
>> On 2026/01/16 19:32, Honglei Huang wrote:
>>>
>>>
>>> On 2026/1/16 18:01, Akihiko Odaki wrote:
>>>> On 2026/01/16 18:39, Honglei Huang wrote:
>>>>>
>>>>>
>>>>> On 2026/1/16 16:54, Akihiko Odaki wrote:
>>>>>> On 2026/01/16 16:20, Honglei Huang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2026/1/15 17:20, Akihiko Odaki wrote:
>>>>>>>> On 2026/01/15 16:58, Honglei Huang wrote:
>>>>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> This series adds virtio-gpu userptr support to enable ROCm native
>>>>>>>>> context for compute workloads. The userptr feature allows the 
>>>>>>>>> host to
>>>>>>>>> directly access guest userspace memory without memcpy overhead, 
>>>>>>>>> which is
>>>>>>>>> essential for GPU compute performance.
>>>>>>>>>
>>>>>>>>> The userptr implementation provides buffer-based zero-copy 
>>>>>>>>> memory access.
>>>>>>>>> This approach pins guest userspace pages and exposes them to 
>>>>>>>>> the host
>>>>>>>>> via scatter-gather tables, enabling efficient compute operations.
>>>>>>>>
>>>>>>>> This description looks identical with what 
>>>>>>>> VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST does so there should be some 
>>>>>>>> explanation how it makes difference.
>>>>>>>>
>>>>>>>> I have already pointed out this when reviewing the QEMU 
>>>>>>>> patches[1], but I note that here too, since QEMU is just a 
>>>>>>>> middleman and this matter is better discussed by Linux and 
>>>>>>>> virglrenderer developers.
>>>>>>>>
>>>>>>>> [1] https://lore.kernel.org/qemu-devel/35a8add7-da49-4833-9e69- 
>>>>>>>> d213f52c771a@amd.com/
>>>>>>>>
>>>>>>>
>>>>>>> Thanks for raising this important point about the distinction 
>>>>>>> between
>>>>>>> VIRTGPU_BLOB_FLAG_USE_USERPTR and VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST.
>>>>>>> I might not have explained it clearly previously.
>>>>>>>
>>>>>>> The key difference is memory ownership and lifecycle:
>>>>>>>
>>>>>>> BLOB_MEM_HOST3D_GUEST:
>>>>>>>    - Kernel allocates memory (drm_gem_shmem_create)
>>>>>>>    - Userspace accesses via mmap(GEM_BO)
>>>>>>>    - Use case: Graphics resources (Vulkan/OpenGL)
>>>>>>>
>>>>>>> BLOB_FLAG_USE_USERPTR:
>>>>>>>    - Userspace pre-allocates memory (malloc/mmap)
>>>>>>
>>>>>> "Kernel allocates memory" and "userspace pre-allocates memory" is 
>>>>>> a bit ambiguous phrasing. Either way, the userspace requests the 
>>>>>> kernel to map memory with a system call, brk() or mmap().
>>>>>
>>>>> They are different:
>>>>> BLOB_MEM_HOST3D_GUEST (kernel-managed pages):
>>>>>    - Allocated via drm_gem_shmem_create() as GFP_KERNEL pages
>>>>>    - Kernel guarantees pages won't swap or migrate while GEM object 
>>>>> exists
>>>>>    - Physical addresses remain stable → safe for DMA
>>>>>
>>>>> BLOB_FLAG_USE_USERPTR (userspace pages):
>>>>>    - From regular malloc/mmap - subject to MM policies
>>>>>    - Can be swapped, migrated, or compacted by kernel
>>>>>    - Requires FOLL_LONGTERM pinning to make DMA-safe
>>>>>
>>>>> The device must treat them differently. Kernel-managed pages have 
>>>>> stable physical
>>>>> addresses. Userspace pages need explicit pinning and the device 
>>>>> must be prepared
>>>>> for potential invalidation.
>>>>>
>>>>> This is why all compute drivers (amdgpu, i915, nouveau) implement 
>>>>> userptr - to
>>>>> make arbitrary userspace allocations DMA-accessible while 
>>>>> respecting their different
>>>>> page mobility characteristics.
>>>>> And the drm already has a better frame work for it: SVM, and this 
>>>>> verions is a super simplified verion.
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/ 
>>>>> tree/ drivers/gpu/drm/ 
>>>>> drm_gpusvm.c#:~:text=*%20GPU%20Shared%20Virtual%20Memory%20(GPU%20SVM)%20layer%20for%20the%20Direct%20Rendering%20Manager%20(DRM)
>>>>
>>>> I referred to phrasing "kernel allocates" vs "userspace allocates". 
>>>> Using GFP_KERNEL, swapping, migrating, or pinning is all what the 
>>>> kernel does.
>>>
>>> I am talking about the virtio gpu driver side, the virtio gpu driver 
>>> need handle those two type memory differently.
>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>    - Kernel only get existing pages
>>>>>>>    - Use case: Compute workloads (ROCm/CUDA) with large datasets, 
>>>>>>> like
>>>>>>> GPU needs load a big model file 10G+, UMD mmap the fd file, then 
>>>>>>> give the mmap ptr into userspace then driver do not need a 
>>>>>>> another copy.
>>>>>>> But if the shmem is used, the userspace needs copy the file data 
>>>>>>> into a shmem mmap ptr there is a copy overhead.
>>>>>>>
>>>>>>> Userptr:
>>>>>>>
>>>>>>> file -> open/mmap -> userspace ptr -> driver
>>>>>>>
>>>>>>> shmem:
>>>>>>>
>>>>>>> user alloc shmem ──→ mmap shmem ──→ shmem userspace ptr -> driver
>>>>>>>                                                ↑
>>>>>>>                                                │ copy
>>>>>>>                                                │
>>>>>>> file ──→ open/mmap ──→ file userptr ──────────┘
>>>>>>>
>>>>>>>
>>>>>>> For compute workloads, this matters significantly:
>>>>>>>    Without userptr: malloc(8GB) → alloc GEM BO → memcpy 8GB → 
>>>>>>> compute → memcpy 8GB back
>>>>>>>    With userptr:    malloc(8GB) → create userptr BO → compute 
>>>>>>> (zero- copy)
>>>>>>
>>>>>> Why don't you alloc GEM BO first and read the file into there?
>>>>>
>>>>> Because that defeats the purpose of zero-copy.
>>>>>
>>>>> With GEM-BO-first (what you suggest):
>>>>>
>>>>> void *gembo = virtgpu_gem_create(10GB);     // Allocate GEM buffer
>>>>> void *model = mmap(..., model_file_fd, 0);  // Map model file
>>>>> memcpy(gembo, model, 10GB);                 // Copy 10GB - NOT 
>>>>> zero- copy
>>>>> munmap(model, 10GB);
>>>>> gpu_compute(gembo);
>>>>>
>>>>> Result: 10GB copy overhead + double memory usage during copy.
>>>>
>>>> How about:
>>>>
>>>> void *gembo = virtgpu_gem_create(10GB);
>>>> read(model_file_fd, gembo, 10GB);
>>>
>>> I believe there is still memory copy in read operation
>>> model_file_fd -> gembo, they have different physical pages,
>>> but the userptr/SVM feature will access the model_file_fd physical 
>>> pages directly.
>>
>> You can use O_DIRECT if you want.
>>
>>>
>>>
>>>>
>>>> Result: zero-copy + simpler code.
>>>>
>>>>>
>>>>> With userptr (zero-copy):
>>>>>
>>>>> void *model = mmap(..., model_file_fd, 0);  // Map model file
>>>>> hsa_memory_register(model, 10GB);           // Pin pages, create 
>>>>> userptr BO
>>>>> gpu_compute(model);                         // GPU reads directly 
>>>>> from file pages
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> The explicit flag serves three purposes:
>>>>>>>
>>>>>>> 1. Although both send scatter-gather entries to host. The flag 
>>>>>>> makes the intent unambiguous.
>>>>>>
>>>>>> Why will the host care?
>>>>>
>>>>> The flag tells host this is a userptr, host side need handle it 
>>>>> specially.
>>>>
>>>> Please provide the concrete requirement. What is the special 
>>>> handling the host side needs to perform?
>>>
>>> Every hardware has it own special API to handle userptr, for amdgpu ROCm
>>> it is hsaKmtRegisterMemoryWithFlags.
>>
>> On the host side, BLOB_MEM_HOST3D_GUEST will always result in a 
>> userspace pointer. Below is how the address is translated:
>>
>> 1) (with the ioctl you are adding)
>>     Guest kernel translates guest userspace pointer to guest PA.
>> 2) (with IOMMU)
>>     Guest kernel translates guest PA to device VA
>> 3) The host VMM translates device VA to host userspace pointer
>> 4) virglrenderer passes userspace pointer to the GPU API (ROCm)
>>
>> BLOB_FLAG_USE_USERPTR tells 1) happened. But the succeeding process is 
>> not affected by that.
>>
>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> 2. Ensures consistency between flag and userptr address field.
>>>>>>
>>>>>> Addresses are represented with the nr_entries and following struct 
>>>>>> virtio_gpu_mem_entry entries, whenever 
>>>>>> VIRTIO_GPU_CMD_RESOURCE_CREATE_BLOB or 
>>>>>> VIRTIO_GPU_CMD_RESOURCE_ATTACH_BACKING is used. Having a special 
>>>>>> flag introduces inconsistency.
>>>>>
>>>>> For this part I am talking about the virito gpu guest UMD side, in 
>>>>> blob create io ctrl we need this flag to
>>>>> check the userptr address and is it a read-only attribute:
>>>>>      if (rc_blob->blob_flags & VIRTGPU_BLOB_FLAG_USE_USERPTR) {
>>>>>          if (!rc_blob->userptr)
>>>>>              return -EINVAL;
>>>>>      } else {
>>>>>          if (rc_blob->userptr)
>>>>>              return -EINVAL;
>>>>>
>>>>>          if (rc_blob->blob_flags & VIRTGPU_BLOB_FLAG_USERPTR_RDONLY)
>>>>>              return -EINVAL;
>>>>>      }
>>>>
>>>> I see. That shows VIRTGPU_BLOB_FLAG_USE_USERPTR is necessary for the 
>>>> ioctl.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> 3. Future HMM support: There is a plan to upgrade userptr 
>>>>>>> implementation to use Heterogeneous Memory Management for better 
>>>>>>> GPU coherency and dynamic page migration. The flag provides a 
>>>>>>> clean path to future upgrade.
>>>>>>
>>>>>> How will the upgrade path with the flag and the one without the 
>>>>>> flag look like, and in what aspect the upgrade path with the flag 
>>>>>> is "cleaner"?
>>>>>
>>>>> As I mentioned above the userptr handling is different with shmem/ 
>>>>> GEM BO.
>>>>
>>>> All the above describes the guest-internal behavior. What about the 
>>>> interaction between the guest and host? How will virtio as a guest- 
>>>> host interface having VIRTIO_GPU_BLOB_FLAG_USE_USERPTR ease future 
>>>> upgrade?
>>>
>>> It depends on how we implement it, the current version is the 
>>> simplest implementation, similar to the implementation in Intel's i915.
>>> If virtio side needs HMM to implement a SVM type userptr feature
>>> I think VIRTIO_GPU_BLOB_FLAG_USE_USERPTR is must needed, stack needs 
>>> to know if it is a userptr resource, and to perform advanced 
>>> operations such as updating page tables, splitting BOs, etc.
>>
>> Why do the device need to know if it is a userptr resource to perform 
>> operations when the device always get device VAs?
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> I understand the concern about API complexity. I'll defer to the 
>>>>>>> virtio- gpu maintainers for the final decision on whether this 
>>>>>>> design is acceptable or if they prefer an alternative approach.
>>>>>>
>>>>>> It is fine to have API complexity. The problem here is the lack of 
>>>>>> clear motivation and documentation.
>>>>>>
>>>>>> Another way to put this is: how will you explain the flag in the 
>>>>>> virtio specification? It should say "the driver MAY/SHOULD/MUST do 
>>>>>> something" and/or "the device MAY/SHOULD/MUST do something", and 
>>>>>> then Linux and virglrenderer can implement the flag accordingly.
>>>>>
>>>>> you're absolutely right that the specification should
>>>>> be written in proper virtio spec language. The draft should be:
>>>>>
>>>>> VIRTIO_GPU_BLOB_FLAG_USE_USERPTR:
>>>>>
>>>>> Linux virtio driver requirements:
>>>>> - MUST set userptr to valid guest userspace VA in 
>>>>> drm_virtgpu_resource_create_blob
>>>>> - SHOULD keep VA mapping valid until resource destruction
>>>>> - MUST pin pages or use HMM at blob creation time
>>>>
>>>> These descriptions are not for the virtio specification. The virtio 
>>>> specification describes the interaction between the driver and 
>>>> device. These statements describe the interaction between the guest 
>>>> userspace and the guest kernel.
>>>>
>>>>>
>>>>> Virglrenderer requirements:
>>>>> - must use correspoonding API for userptr resource
>>>>
>>>> What is the "corresponding API"?
>>>
>>> It may can be:
>>> **VIRTIO_GPU_BLOB_FLAG_USE_USERPTR specification:**
>>>
>>> Driver requirements:
>>> - MUST populate mem_entry[] with valid guest physical addresses of 
>>> pinned userspace pages
>>
>> "Userspace" is a the guest-internal concepts and irrelevant with the 
>> interaction between the driver and device.
>>
>>> - MUST set blob_mem to VIRTIO_GPU_BLOB_FLAG_USE_USERPTR when using 
>>> this flag
>>
>> When should the driver use the flag?
>>
>>> - SHOULD keep pages pinned until VIRTIO_GPU_CMD_RESOURCE_UNREF
>>
>> It is not a new requirement. The page must stay at the same position 
>> whether VIRTIO_GPU_BLOB_FLAG_USE_USERPTR is used or not.
>>
>>>
>>> Device requirements:
>>> - MUST establish IOMMU mappings using the provided iovec array with 
>>> specific API.(hsaKmtRegisterMemoryWithFlags for ROCm)
>>
>> This should be also true even when VIRTIO_GPU_BLOB_FLAG_USE_USERPTR is 
>> not set.
>>
>>>
>>>
>>>
>>> Really thanks for your comments, and I believe we need some input of
>>> virito gpu maintainers.
>>>
>>> VIRTIO_GPU_BLOB_FLAG_USE_USERPTR flag is a flag for how to use, and 
>>> it doen't conflict with VIRTGPU_BLOB_MEM_HOST3D_GUEST. Just like a 
>>> resource is used for VIRTGPU_BLOB_FLAG_USE_SHAREABLE but it can be a 
>>> guest resource or a host resource.
>>>
>>> If we don't have VIRTIO_GPU_BLOB_FLAG_USE_USERPTR flag, we may have some
>>> resource conflict in host side, guest kernel can use 'userptr' param 
>>> to identify. But in host side the 'userptr' param is lost, we only 
>>> know it is just a guest flag resource.
>>
>> I still don't see why knowing it is a guest resource is insufficient 
>> for the host.
> 
> All right, I totally agreed with you.
> 
> And let virtio gpu maintainer/drm decide how to design the flag/params 
> maybe is better.
> 
> 
> I believe the core gap between you and me is the concept of userptr/SVM.
> What does userptr/SVM used for, it let GPU and CPU share the userspace 
> virtual address. Perhaps my description is not accurate enough.

That is not what your QEMU patch series does; QEMU sees an address space 
bound to the virtio-gpu device which is not the guest userspace virtual 
address space.

Below are my points in the discussion:

- Zero copy is not a new thing; virtio already has features for that:
   VIRTIO_GPU_BLOB_MEM_GUEST and VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST.

- You *always* need hsaKmtRegisterMemoryWithFlags() or similar when
   implementing VIRTIO_GPU_BLOB_MEM_GUEST and/or
   VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST, so having another flag does not make
   any difference.

- The guest userspace virtual address is never exposed to the host in
   your QEMU patch series, contrary to your description.

Regards,
Akihiko Odaki