[PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support

Honglei Huang posted 8 patches 2 days, 2 hours ago
From: Honglei Huang <honghuan@amd.com>

Hi all,

This is v3 of the patch series to support allocating multiple non-contiguous
CPU virtual address ranges that map to a single contiguous GPU virtual address range.

v3:
1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
   - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
   - When flag is set, mmap_offset field points to range array
   - Minimal API surface change

2. Improved MMU notifier handling:
   - Single mmu_interval_notifier covering the VA span [va_min, va_max]
   - Interval tree for efficient lookup of affected ranges during invalidation
   - Avoids per-range notifier overhead mentioned in v2 review

3. Better code organization: Split into 8 focused patches for easier review

v2:
   - Each CPU VA range gets its own mmu_interval_notifier for invalidation
   - All ranges validated together and mapped to contiguous GPU VA
   - Single kgd_mem object with array of user_range_info structures
   - Unified eviction/restore path for all ranges in a batch

Current Implementation Approach
===============================

This series implements a practical solution within existing kernel constraints:

1. Single MMU notifier for VA span: Register one notifier covering the
   entire range from lowest to highest address in the batch

2. Interval tree filtering: Use interval tree to efficiently identify
   which specific ranges are affected during invalidation callbacks,
   avoiding unnecessary processing for unrelated address changes

3. Unified eviction/restore: All ranges in a batch share eviction and
   restore paths, maintaining consistency with existing userptr handling

Patch Series Overview
=====================

Patch 1/8: Add userptr batch allocation UAPI structures
    - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
    - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures

Patch 2/8: Add user_range_info infrastructure to kgd_mem
    - user_range_info structure for per-range tracking
    - Fields for batch allocation in kgd_mem

Patch 3/8: Implement interval tree for userptr ranges
    - Interval tree for efficient range lookup during invalidation
    - mark_invalid_ranges() function

Patch 4/8: Add batch MMU notifier support
    - Single notifier for entire VA span
    - Invalidation callback using interval tree filtering

Patch 5/8: Implement batch userptr page management
    - get_user_pages_batch() and set_user_pages_batch()
    - Per-range page array management

Patch 6/8: Add batch allocation function and export API
    - init_user_pages_batch() main initialization
    - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point

Patch 7/8: Unify userptr cleanup and update paths
    - Shared eviction/restore handling for batch allocations
    - Integration with existing userptr validation flows

Patch 8/8: Wire up batch allocation in ioctl handler
    - Input validation and range array parsing
    - Integration with existing alloc_memory_of_gpu path

Testing
=======

- Multiple scattered malloc() allocations (2 to 4000+ ranges)
- Various allocation sizes (4 KB to 1 GB+ per range)
- Memory pressure scenarios and eviction/restore cycles
- OpenCL CTS and HIP Catch2 tests in a KVM guest environment
- AI workloads: Stable Diffusion and ComfyUI in virtualized environments
- Small LLM inference (3B-7B models)
- Benchmark score: 160,000-190,000 (80-95% of bare metal)
- Performance improvement: 2x-2.4x faster than the userspace approach

Thank you for your review and feedback.

Best regards,
Honglei Huang

Honglei Huang (8):
  drm/amdkfd: Add userptr batch allocation UAPI structures
  drm/amdkfd: Add user_range_info infrastructure to kgd_mem
  drm/amdkfd: Implement interval tree for userptr ranges
  drm/amdkfd: Add batch MMU notifier support
  drm/amdkfd: Implement batch userptr page management
  drm/amdkfd: Add batch allocation function and export API
  drm/amdkfd: Unify userptr cleanup and update paths
  drm/amdkfd: Wire up batch allocation in ioctl handler

 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  23 +
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 539 +++++++++++++++++-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 128 ++++-
 include/uapi/linux/kfd_ioctl.h                |  31 +-
 4 files changed, 697 insertions(+), 24 deletions(-)

-- 
2.34.1
Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support
Posted by Christian König 1 day, 18 hours ago
On 2/6/26 07:25, Honglei Huang wrote:
> From: Honglei Huang <honghuan@amd.com>
> 
> Hi all,
> 
> This is v3 of the patch series to support allocating multiple non-contiguous
> CPU virtual address ranges that map to a single contiguous GPU virtual address range.
> 
> v3:
> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>    - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH

That is most likely not the best approach, but Felix or Philip need to comment here since I don't know such IOCTLs well either.

>    - When flag is set, mmap_offset field points to range array
>    - Minimal API surface change

Why a range of VA space for each entry?

> 2. Improved MMU notifier handling:
>    - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>    - Interval tree for efficient lookup of affected ranges during invalidation
>    - Avoids per-range notifier overhead mentioned in v2 review

That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.

The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.

What might work is doing an XOR or CRC over all hmm_range.notifier_seq you have, but that is a bit flaky.

Regards,
Christian.

> 
> 3. Better code organization: Split into 8 focused patches for easier review
> 
> v2:
>    - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>    - All ranges validated together and mapped to contiguous GPU VA
>    - Single kgd_mem object with array of user_range_info structures
>    - Unified eviction/restore path for all ranges in a batch
> 
> Current Implementation Approach
> ===============================
> 
> This series implements a practical solution within existing kernel constraints:
> 
> 1. Single MMU notifier for VA span: Register one notifier covering the
>    entire range from lowest to highest address in the batch
> 
> 2. Interval tree filtering: Use interval tree to efficiently identify
>    which specific ranges are affected during invalidation callbacks,
>    avoiding unnecessary processing for unrelated address changes
> 
> 3. Unified eviction/restore: All ranges in a batch share eviction and
>    restore paths, maintaining consistency with existing userptr handling
> 
> Patch Series Overview
> =====================
> 
> Patch 1/8: Add userptr batch allocation UAPI structures
>     - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>     - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
> 
> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>     - user_range_info structure for per-range tracking
>     - Fields for batch allocation in kgd_mem
> 
> Patch 3/8: Implement interval tree for userptr ranges
>     - Interval tree for efficient range lookup during invalidation
>     - mark_invalid_ranges() function
> 
> Patch 4/8: Add batch MMU notifier support
>     - Single notifier for entire VA span
>     - Invalidation callback using interval tree filtering
> 
> Patch 5/8: Implement batch userptr page management
>     - get_user_pages_batch() and set_user_pages_batch()
>     - Per-range page array management
> 
> Patch 6/8: Add batch allocation function and export API
>     - init_user_pages_batch() main initialization
>     - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
> 
> Patch 7/8: Unify userptr cleanup and update paths
>     - Shared eviction/restore handling for batch allocations
>     - Integration with existing userptr validation flows
> 
> Patch 8/8: Wire up batch allocation in ioctl handler
>     - Input validation and range array parsing
>     - Integration with existing alloc_memory_of_gpu path
> 
> Testing
> =======
> 
> - Multiple scattered malloc() allocations (2 to 4000+ ranges)
> - Various allocation sizes (4 KB to 1 GB+ per range)
> - Memory pressure scenarios and eviction/restore cycles
> - OpenCL CTS and HIP Catch2 tests in a KVM guest environment
> - AI workloads: Stable Diffusion and ComfyUI in virtualized environments
> - Small LLM inference (3B-7B models)
> - Benchmark score: 160,000-190,000 (80-95% of bare metal)
> - Performance improvement: 2x-2.4x faster than the userspace approach
> 
> Thank you for your review and feedback.
> 
> Best regards,
> Honglei Huang
> 
> Honglei Huang (8):
>   drm/amdkfd: Add userptr batch allocation UAPI structures
>   drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>   drm/amdkfd: Implement interval tree for userptr ranges
>   drm/amdkfd: Add batch MMU notifier support
>   drm/amdkfd: Implement batch userptr page management
>   drm/amdkfd: Add batch allocation function and export API
>   drm/amdkfd: Unify userptr cleanup and update paths
>   drm/amdkfd: Wire up batch allocation in ioctl handler
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  23 +
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 539 +++++++++++++++++-
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 128 ++++-
>  include/uapi/linux/kfd_ioctl.h                |  31 +-
>  4 files changed, 697 insertions(+), 24 deletions(-)
>