[PATCH v9 5/6] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
Posted by ankita@nvidia.com 3 months, 2 weeks ago
From: Ankit Agrawal <ankita@nvidia.com>

Today KVM forces the memory to either NORMAL or DEVICE_nGnRE based on
pfn_is_map_memory() (which tracks whether the memory is covered by the
kernel direct map) and ignores the per-VMA flags that indicate the
memory attributes. The KVM code is thus restrictive and only allows
memory that has been added to the kernel to be marked as cacheable.

Device memory, such as that on the Grace Hopper/Blackwell systems, is
interchangeable with DDR memory and retains properties such as
cacheability, unaligned accesses, atomics and handling of executable
faults. This requires the device memory to be mapped as NORMAL in
stage 2.

Given that the GPU device memory is not added to the kernel (but is
instead VMA-mapped through remap_pfn_range() in the nvgrace-gpu module,
which sets VM_PFNMAP), pfn_is_map_memory() is false and KVM thus
prevents such memory from being mapped Normal cacheable. This patch
aims to solve that use case.
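
For reference, a driver establishes such a mapping along these lines
(a minimal sketch with a hypothetical example_mmap()/dev_mem_pfn, not
the actual nvgrace-gpu code):

	static int example_mmap(struct file *file, struct vm_area_struct *vma)
	{
		/*
		 * Cacheable device memory: keep the default (Normal)
		 * vm_page_prot rather than applying pgprot_noncached()
		 * or pgprot_writecombine(). remap_pfn_range() marks the
		 * VMA with VM_IO | VM_PFNMAP.
		 */
		return remap_pfn_range(vma, vma->vm_start, dev_mem_pfn,
				       vma->vm_end - vma->vm_start,
				       vma->vm_page_prot);
	}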

Note that when FWB is not enabled, the kernel expects to trivially do
cache management by flushing the memory, linearly converting a kvm_pte
to a phys_addr and then to a KVA; see kvm_flush_dcache_to_poc(). Cache
management thus relies on the memory being mapped in the kernel.
Moreover, the ARM64_HAS_CACHE_DIC CPU capability allows KVM to avoid
flushing the icache and turns icache_inval_pou() into a NOP. These two
capabilities are thus a requirement for the cacheable PFNMAP feature.
Make use of kvm_arch_supports_cacheable_pfnmap() to check for them.
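
As a sketch, the helper amounts to checking the two CPU capabilities
named above (its exact form is defined earlier in this series):

	static inline bool kvm_arch_supports_cacheable_pfnmap(void)
	{
		/*
		 * S2FWB forces the memory type and removes the need for
		 * CMOs through a kernel alias of the PFN; CACHE DIC
		 * turns icache_inval_pou() into a NOP.
		 */
		return cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) &&
		       cpus_have_final_cap(ARM64_HAS_CACHE_DIC);
	}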

A cacheability check is made by consulting the VMA pgprot value. If
the pgprot mapping type is cacheable, it is safe to map the memory as
S2 cacheable, as the KVM S2 will then have the same Normal memory type
as the VMA has in the S1 and KVM has no additional responsibility for
safety. Checking that the pgprot is NORMAL is thus a KVM sanity check.
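
The check amounts to reading the MAIR attribute index out of the VMA's
vm_page_prot; a sketch of the likely shape of the kvm_vma_is_cacheable()
helper introduced earlier in this series:

	static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
	{
		/* Device and Normal-NC attribute indices are not cacheable. */
		switch (FIELD_GET(PTE_ATTRINDX_MASK,
				  pgprot_val(vma->vm_page_prot))) {
		case MT_DEVICE_nGnRnE:
		case MT_DEVICE_nGnRE:
		case MT_NORMAL_NC:
			return false;
		default:
			return true;
		}
	}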

It is possible to have a COW VM_PFNMAP when doing a MAP_PRIVATE
/dev/mem mapping on systems that allow such mappings. Add a check for
COW VM_PFNMAP and refuse such mappings.
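
For reference, is_cow_mapping() (include/linux/mm.h) identifies such
private mappings:

	static inline bool is_cow_mapping(vm_flags_t flags)
	{
		/*
		 * MAP_PRIVATE: writable only via COW, i.e. VM_MAYWRITE
		 * set without VM_SHARED.
		 */
		return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
	}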

No additional checks for MTE are needed as kvm_arch_prepare_memory_region()
already tests it at an early stage during memslot creation. There would
not even be a fault if the memslot is not created.

CC: Oliver Upton <oliver.upton@linux.dev>
CC: Sean Christopherson <seanjc@google.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 arch/arm64/kvm/mmu.c | 61 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 50 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d8d2eb8a409e..ddcbbcc8a9ae 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1681,18 +1681,53 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (is_error_noslot_pfn(pfn))
 		return -EFAULT;
 
+	/*
+	 * Check if this is non-struct page memory PFN, and cannot support
+	 * CMOs. It could potentially be unsafe to access as cachable.
+	 */
 	if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(pfn)) {
 		/*
-		 * If the page was identified as device early by looking at
-		 * the VMA flags, vma_pagesize is already representing the
-		 * largest quantity we can map.  If instead it was mapped
-		 * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
-		 * and must not be upgraded.
-		 *
-		 * In both cases, we don't let transparent_hugepage_adjust()
-		 * change things at the last minute.
+		 * COW VM_PFNMAP is possible when doing a MAP_PRIVATE
+		 * /dev/mem mapping on systems that allow such mapping.
+		 * Reject such case.
 		 */
-		s2_force_noncacheable = true;
+		if (is_cow_mapping(vm_flags))
+			return -EINVAL;
+
+		if (is_vma_cacheable) {
+			/*
+			 * Whilst the VMA owner expects cacheable mapping to this
+			 * PFN, hardware also has to support the FWB and CACHE DIC
+			 * features.
+			 *
+			 * ARM64 KVM relies on kernel VA mapping to the PFN to
+			 * perform cache maintenance as the CMO instructions work on
+			 * virtual addresses. VM_PFNMAP region are not necessarily
+			 * mapped to a KVA and hence the presence of hardware features
+			 * S2FWB and CACHE DIC are mandatory for cache maintenance.
+			 *
+			 * Check if the hardware supports it before allowing the VMA
+			 * owner request for cacheable mapping.
+			 */
+			if (!kvm_arch_supports_cacheable_pfnmap())
+				return -EFAULT;
+
+			/* Cannot degrade cachable to non cachable */
+			if (s2_force_noncacheable)
+				return -EINVAL;
+		} else {
+			/*
+			 * If the page was identified as device early by looking at
+			 * the VMA flags, vma_pagesize is already representing the
+			 * largest quantity we can map.  If instead it was mapped
+			 * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
+			 * and must not be upgraded.
+			 *
+			 * In both cases, we don't let transparent_hugepage_adjust()
+			 * change things at the last minute.
+			 */
+			s2_force_noncacheable = true;
+		}
 	} else if (logging_active && !write_fault) {
 		/*
 		 * Only actually map the page as writable if this was a write
@@ -2269,8 +2304,12 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 				break;
 			}
 
-			/* Cacheable PFNMAP is not allowed */
-			if (kvm_vma_is_cacheable(vma)) {
+			/*
+			 * Cacheable PFNMAP is allowed only if the hardware
+			 * supports it.
+			 */
+			if (kvm_vma_is_cacheable(vma) &&
+			    !kvm_arch_supports_cacheable_pfnmap()) {
 				ret = -EINVAL;
 				break;
 			}
-- 
2.34.1
Re: [PATCH v9 5/6] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
Posted by Jason Gunthorpe 3 months ago
On Sat, Jun 21, 2025 at 04:21:10AM +0000, ankita@nvidia.com wrote:
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1681,18 +1681,53 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (is_error_noslot_pfn(pfn))
>  		return -EFAULT;
>  
> +	/*
> +	 * Check if this is non-struct page memory PFN, and cannot support
> +	 * CMOs. It could potentially be unsafe to access as cachable.
> +	 */
>  	if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(pfn)) {
>  		/*
> -		 * If the page was identified as device early by looking at
> -		 * the VMA flags, vma_pagesize is already representing the
> -		 * largest quantity we can map.  If instead it was mapped
> -		 * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
> -		 * and must not be upgraded.
> -		 *
> -		 * In both cases, we don't let transparent_hugepage_adjust()
> -		 * change things at the last minute.
> +		 * COW VM_PFNMAP is possible when doing a MAP_PRIVATE
> +		 * /dev/mem mapping on systems that allow such mapping.
> +		 * Reject such case.
>  		 */
> -		s2_force_noncacheable = true;
> +		if (is_cow_mapping(vm_flags))
> +			return -EINVAL;

I still would like an explanation why we need to block this.

COW PFNMAP is like MIXEDMAP: you end up with a VMA where there is a
mixture of MMIO and normal pages. Arguably you are supposed to use
vm_normal_page(), not pfn_is_map_memory(), but that seems difficult
for KVM.

Given that we exclude the cacheable case with pfn_is_map_memory(), we
already know this is non-struct-page memory, so why do we need to
block the COW?

I think the basic rule we are going for is that within the VMA the
non-normal/special PTEs have to follow the vma->vm_page_prot while the
normal pages have to be cacheable.

So if we find a normal page (ie pfn_is_map_memory()) then we know it
is cacheable and s2_force_noncacheable = false. Otherwise we use the
vm_page_prot to decide if the special PTE is cacheable.
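
In code terms that rule is roughly (a sketch, reusing the series'
kvm_vma_is_cacheable() helper):

	if (pfn_is_map_memory(pfn)) {
		/* normal struct-page memory: known cacheable */
		s2_force_noncacheable = false;
	} else {
		/* special/non-struct-page PTE: follow the VMA pgprot */
		s2_force_noncacheable = !kvm_vma_is_cacheable(vma);
	}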

David, can you think of any reason to have this is_cow_mapping() test?

> +		if (is_vma_cacheable) {
> +			/*
> +			 * Whilst the VMA owner expects cacheable mapping to this
> +			 * PFN, hardware also has to support the FWB and CACHE DIC
> +			 * features.
> +			 *
> +			 * ARM64 KVM relies on kernel VA mapping to the PFN to
> +			 * perform cache maintenance as the CMO instructions work on
> +			 * virtual addresses. VM_PFNMAP region are not necessarily
> +			 * mapped to a KVA and hence the presence of hardware features
> +			 * S2FWB and CACHE DIC are mandatory for cache maintenance.
> +			 *
> +			 * Check if the hardware supports it before allowing the VMA
> +			 * owner request for cacheable mapping.
> +			 */
> +			if (!kvm_arch_supports_cacheable_pfnmap())
> +				return -EFAULT;
> +
> +			/* Cannot degrade cachable to non cachable */
> +			if (s2_force_noncacheable)
> +				return -EINVAL;

What am I missing? After the whole series is applied, this is the
first reference to s2_force_noncacheable after it is initialized to
false, so this can't happen?

> +		} else {
> +			/*
> +			 * If the page was identified as device early by looking at
> +			 * the VMA flags, vma_pagesize is already representing the
> +			 * largest quantity we can map.  If instead it was mapped
> +			 * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
> +			 * and must not be upgraded.
> +			 *
> +			 * In both cases, we don't let transparent_hugepage_adjust()
> +			 * change things at the last minute.
> +			 */
> +			s2_force_noncacheable = true;
> +		}


Then this logic that immediately follows:

        if (is_vma_cacheable && s2_force_noncacheable)
                return -EINVAL;

Doesn't make a lot of sense either: the only case that sets
s2_force_noncacheable = true is the else block of
'if (is_vma_cacheable)', so this is dead code too.

Seems like this still needs some cleanup to remove these impossible
conditions. The logic makes sense to me otherwise, though.

Jason
Re: [PATCH v9 5/6] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
Posted by David Hildenbrand 3 months ago
On 04.07.25 16:04, Jason Gunthorpe wrote:
> On Sat, Jun 21, 2025 at 04:21:10AM +0000, ankita@nvidia.com wrote:
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1681,18 +1681,53 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>   	if (is_error_noslot_pfn(pfn))
>>   		return -EFAULT;
>>   
>> +	/*
>> +	 * Check if this is non-struct page memory PFN, and cannot support
>> +	 * CMOs. It could potentially be unsafe to access as cachable.
>> +	 */
>>   	if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(pfn)) {
>>   		/*
>> -		 * If the page was identified as device early by looking at
>> -		 * the VMA flags, vma_pagesize is already representing the
>> -		 * largest quantity we can map.  If instead it was mapped
>> -		 * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
>> -		 * and must not be upgraded.
>> -		 *
>> -		 * In both cases, we don't let transparent_hugepage_adjust()
>> -		 * change things at the last minute.
>> +		 * COW VM_PFNMAP is possible when doing a MAP_PRIVATE
>> +		 * /dev/mem mapping on systems that allow such mapping.
>> +		 * Reject such case.
>>   		 */
>> -		s2_force_noncacheable = true;
>> +		if (is_cow_mapping(vm_flags))
>> +			return -EINVAL;
> 
> I still would like an explanation why we need to block this.
> 
> COW PFNMAP is like MIXEDMAP: you end up with a VMA where there is a
> mixture of MMIO and normal pages. Arguably you are supposed to use
> vm_normal_page(), not pfn_is_map_memory(), but that seems difficult
> for KVM.
> 
> Given that we exclude the cacheable case with pfn_is_map_memory(), we
> already know this is non-struct-page memory, so why do we need to
> block the COW?
> 
> I think the basic rule we are going for is that within the VMA the
> non-normal/special PTEs have to follow the vma->vm_page_prot while the
> normal pages have to be cacheable.
> 
> So if we find a normal page (ie pfn_is_map_memory()) then we know it
> is cacheable and s2_force_noncacheable = false. Otherwise we use the
> vm_page_prot to decide if the special PTE is cacheable.
> 
> David, can you think of any reason to have this is_cow_mapping() test?

I think with that reasoning, it should be fine to drop it.

I think the COW test made sense when we were talking about limiting it
to VM_PFNMAP only and simplifying by dropping other checks. Then it
would have identified that something is certainly not "normal" memory.

-- 
Cheers,

David / dhildenb