[PATCH v7 4/5] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags

ankita@nvidia.com posted 5 patches 3 months, 3 weeks ago
From: Ankit Agrawal <ankita@nvidia.com>

Today KVM forces the memory to either NORMAL or DEVICE_nGnRE
based on pfn_is_map_memory() (which tracks whether the device memory
is in the kernel map) and ignores the per-VMA flags that indicate the
memory attributes. KVM is thus restrictive and allows only memory
that has been added to the kernel map to be mapped as cacheable.

The device memory such as on the Grace Hopper/Blackwell systems
is interchangeable with DDR memory and retains properties such as
cacheability, unaligned accesses, atomics and handling of executable
faults. This requires the device memory to be mapped as NORMAL in
stage-2.

Given that the GPU device memory is not added to the kernel (but is
rather VMA mapped through remap_pfn_range() in the nvgrace-gpu module,
which sets VM_PFNMAP), pfn_is_map_memory() is false and KVM thus
prevents such memory from being mapped Normal cacheable. This patch
aims to solve that use case.

Note that when FWB is not enabled, the kernel expects to trivially do
cache management by flushing the memory, linearly converting a
kvm_pte to a phys_addr and then to a KVA; see
kvm_flush_dcache_to_poc(). The cache management thus relies on the
memory being mapped into the kernel. Moreover, the ARM64_HAS_CACHE_DIC
CPU cap allows KVM to avoid flushing the icache and turns
icache_inval_pou() into a NOP. These two capabilities are thus a
requirement of the cacheable PFNMAP feature. Make use of
kvm_arch_supports_cacheable_pfnmap() to check them.

A cacheability check is made by consulting the VMA pgprot value.
If the pgprot mapping type is cacheable, the memory is safe to map S2
cacheable, as the KVM S2 will have the same Normal memory type as the
VMA has in the S1 and KVM has no additional responsibility for safety.
Checking that the pgprot is NORMAL is thus a KVM sanity check.

Add a check for COW VM_PFNMAP and refuse such mappings.

No additional checks for MTE are needed as kvm_arch_prepare_memory_region()
already tests it at an early stage during memslot creation. There would
not even be a fault if the memslot is not created.

CC: Oliver Upton <oliver.upton@linux.dev>
CC: Sean Christopherson <seanjc@google.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 arch/arm64/kvm/mmu.c | 35 +++++++++++++++++++++++++++++------
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a71b77df7c96..6a3955e07b5e 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1660,6 +1660,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 	is_vma_cacheable = kvm_vma_is_cacheable(vma);
 
+	/* Reject COW VM_PFNMAP */
+	if ((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags))
+		return -EINVAL;
+
 	/* Don't use the VMA after the unlock -- it may have vanished */
 	vma = NULL;
 
@@ -1684,9 +1688,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		return -EFAULT;
 
 	if (!kvm_can_use_cmo_pfn(pfn)) {
-		if (is_vma_cacheable)
-			return -EINVAL;
-
 		/*
 		 * If the page was identified as device early by looking at
 		 * the VMA flags, vma_pagesize is already representing the
@@ -1696,8 +1697,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		 *
 		 * In both cases, we don't let transparent_hugepage_adjust()
 		 * change things at the last minute.
+		 *
+		 * Do not set device as the device memory is cacheable. Note
+		 * that such mapping is safe as the KVM S2 will have the same
+		 * Normal memory type as the VMA has in the S1.
 		 */
-		disable_cmo = true;
+		if (!is_vma_cacheable)
+			disable_cmo = true;
 	} else if (logging_active && !write_fault) {
 		/*
 		 * Only actually map the page as writable if this was a write
@@ -1784,6 +1790,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		prot |= KVM_PGTABLE_PROT_X;
 	}
 
+	/*
+	 *  When FWB is unsupported KVM needs to do cache flushes
+	 *  (via dcache_clean_inval_poc()) of the underlying memory. This is
+	 *  only possible if the memory is already mapped into the kernel map.
+	 *
+	 *  Outright reject as the cacheable device memory is not present in
+	 *  the kernel map and not suitable for cache management.
+	 */
+	if (is_vma_cacheable && !kvm_arch_supports_cacheable_pfnmap()) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
 	/*
 	 * Under the premise of getting a FSC_PERM fault, we just need to relax
 	 * permissions only if vma_pagesize equals fault_granule. Otherwise,
@@ -2271,8 +2290,12 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 				break;
 			}
 
-			/* Cacheable PFNMAP is not allowed */
-			if (kvm_vma_is_cacheable(vma)) {
+			/*
+			 * Cacheable PFNMAP is allowed only if the hardware
+			 * supports it.
+			 */
+			if (kvm_vma_is_cacheable(vma) &&
+			    !kvm_arch_supports_cacheable_pfnmap()) {
 				ret = -EINVAL;
 				break;
 			}
-- 
2.34.1
Re: [PATCH v7 4/5] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
Posted by Catalin Marinas 3 months, 3 weeks ago
On Wed, Jun 18, 2025 at 06:55:40AM +0000, ankita@nvidia.com wrote:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index a71b77df7c96..6a3955e07b5e 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1660,6 +1660,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  
>  	is_vma_cacheable = kvm_vma_is_cacheable(vma);
>  
> +	/* Reject COW VM_PFNMAP */
> +	if ((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags))
> +		return -EINVAL;

It may help to add a comment here why this needs to be rejected. I
forgot the details but tracked it down to an email from David a few
months ago:

https://lore.kernel.org/all/a2d95399-62ad-46d3-9e48-6fa90fd2c2f3@redhat.com/

> +
>  	/* Don't use the VMA after the unlock -- it may have vanished */
>  	vma = NULL;
>  
> @@ -1684,9 +1688,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  		return -EFAULT;
>  
>  	if (!kvm_can_use_cmo_pfn(pfn)) {
> -		if (is_vma_cacheable)
> -			return -EINVAL;
> -
>  		/*
>  		 * If the page was identified as device early by looking at
>  		 * the VMA flags, vma_pagesize is already representing the
> @@ -1696,8 +1697,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  		 *
>  		 * In both cases, we don't let transparent_hugepage_adjust()
>  		 * change things at the last minute.
> +		 *
> +		 * Do not set device as the device memory is cacheable. Note
> +		 * that such mapping is safe as the KVM S2 will have the same
> +		 * Normal memory type as the VMA has in the S1.
>  		 */
> -		disable_cmo = true;
> +		if (!is_vma_cacheable)
> +			disable_cmo = true;

I'm tempted to stick to the 'device' variable name. Or something like
s2_noncacheable. As I commented, it's not just about disabling CMOs.

>  	} else if (logging_active && !write_fault) {
>  		/*
>  		 * Only actually map the page as writable if this was a write
> @@ -1784,6 +1790,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  		prot |= KVM_PGTABLE_PROT_X;
>  	}
>  
> +	/*
> +	 *  When FWB is unsupported KVM needs to do cache flushes
> +	 *  (via dcache_clean_inval_poc()) of the underlying memory. This is
> +	 *  only possible if the memory is already mapped into the kernel map.
> +	 *
> +	 *  Outright reject as the cacheable device memory is not present in
> +	 *  the kernel map and not suitable for cache management.
> +	 */
> +	if (is_vma_cacheable && !kvm_arch_supports_cacheable_pfnmap()) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}

I'm missing the full context around this hunk but, judging by
indentation, does it also reject any cacheable vma even if it is not
PFNMAP on pre-FWB hardware?

-- 
Catalin
Re: [PATCH v7 4/5] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
Posted by Jason Gunthorpe 3 months, 3 weeks ago
On Wed, Jun 18, 2025 at 05:34:16PM +0100, Catalin Marinas wrote:
> > +		 *
> > +		 * Do not set device as the device memory is cacheable. Note
> > +		 * that such mapping is safe as the KVM S2 will have the same
> > +		 * Normal memory type as the VMA has in the S1.
> >  		 */
> > -		disable_cmo = true;
> > +		if (!is_vma_cacheable)
> > +			disable_cmo = true;
> 
> I'm tempted to stick to the 'device' variable name. Or something like
> s2_noncacheable. As I commented, it's not just about disabling CMOs.

I think it would be clearer to have two concepts/variable then because
the cases where it is really about preventing cachable access to
prevent aborts are not linked to the logic that checks pfn valid. We
have to detect those cases separately (through the VMA flags was it?).

Having these two things together is IMHO confusing..

Jason