[PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping

ankita@nvidia.com posted 5 patches 6 months, 4 weeks ago
There is a newer version of this series
[PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping
Posted by ankita@nvidia.com 6 months, 4 weeks ago
From: Ankit Agrawal <ankita@nvidia.com>

Fixes a security bug due to mismatched attributes between S1 and
S2 mapping.

Currently, it is possible for a region to be cacheable in S1, but mapped
non cached in S2. This creates a potential issue where the VMM may
sanitize cacheable memory across VMs using cacheable stores, ensuring
it is zeroed. However, if KVM subsequently assigns this memory to a VM
as uncached, the VM could end up accessing stale, non-zeroed data from
a previous VM, leading to unintended data exposure. This is a security
risk.

Block such mismatched-attribute cases by returning -EINVAL when userspace
tries to map PFNMAP memory as cacheable. Only allow NORMAL_NC and DEVICE_*.

CC: Oliver Upton <oliver.upton@linux.dev>
CC: Sean Christopherson <seanjc@google.com>
CC: Catalin Marinas <catalin.marinas@arm.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 arch/arm64/kvm/mmu.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 2feb6c6b63af..305a0e054f81 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1466,6 +1466,18 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_MTE_ALLOWED;
 }
 
+/*
+ * Determine the memory region cacheability from VMA's pgprot. This
+ * is used to set the stage 2 PTEs.
+ */
+static unsigned long mapping_type_noncacheable(pgprot_t page_prot)
+{
+	unsigned long mt = FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot));
+
+	return (mt == MT_NORMAL_NC || mt == MT_DEVICE_nGnRnE ||
+		mt == MT_DEVICE_nGnRE);
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_s2_trans *nested,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
@@ -1612,6 +1624,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
 
+	if ((vma->vm_flags & VM_PFNMAP) &&
+	    !mapping_type_noncacheable(vma->vm_page_prot))
+		return -EINVAL;
+
 	/* Don't use the VMA after the unlock -- it may have vanished */
 	vma = NULL;
 
@@ -2207,6 +2223,12 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 				ret = -EINVAL;
 				break;
 			}
+
+			/* Cacheable PFNMAP is not allowed */
+			if (!mapping_type_noncacheable(vma->vm_page_prot)) {
+				ret = -EINVAL;
+				break;
+			}
 		}
 		hva = min(reg_end, vma->vm_end);
 	} while (hva < reg_end);
-- 
2.34.1
Re: [PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping
Posted by Sean Christopherson 6 months, 2 weeks ago
On Sat, May 24, 2025, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Fixes a security bug due to mismatched attributes between S1 and
> S2 mapping.
> 
> Currently, it is possible for a region to be cacheable in S1, but mapped
> non cached in S2. This creates a potential issue where the VMM may
> sanitize cacheable memory across VMs using cacheable stores, ensuring
> it is zeroed. However, if KVM subsequently assigns this memory to a VM
> as uncached, the VM could end up accessing stale, non-zeroed data from
> a previous VM, leading to unintended data exposure. This is a security
> risk.
> 
> Block such mismatched-attribute cases by returning -EINVAL when userspace
> tries to map PFNMAP memory as cacheable. Only allow NORMAL_NC and DEVICE_*.
> 
> CC: Oliver Upton <oliver.upton@linux.dev>
> CC: Sean Christopherson <seanjc@google.com>
> CC: Catalin Marinas <catalin.marinas@arm.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  arch/arm64/kvm/mmu.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 2feb6c6b63af..305a0e054f81 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1466,6 +1466,18 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
>  	return vma->vm_flags & VM_MTE_ALLOWED;
>  }
>  
> +/*
> + * Determine the memory region cacheability from VMA's pgprot. This
> + * is used to set the stage 2 PTEs.
> + */
> +static unsigned long mapping_type_noncacheable(pgprot_t page_prot)

Return a bool.  And given that every caller asks whether the mapping is
cacheable, maybe invert this predicate?

> +{
> +	unsigned long mt = FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot));
> +
> +	return (mt == MT_NORMAL_NC || mt == MT_DEVICE_nGnRnE ||
> +		mt == MT_DEVICE_nGnRE);
> +}

No need for the parentheses.  And since the values are clumped together, maybe
use a switch statement to let the compiler optimize the checks (though I'm
guessing modern compilers will optimize either way).

E.g.

static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
{
	switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
	case MT_NORMAL_NC:
	case MT_DEVICE_nGnRnE:
	case MT_DEVICE_nGnRE:
		return false;
	default:
		return true;
	}
}
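
For anyone who wants to convince themselves that the inverted switch form is
a pure refactor of the v6 helper, here is a minimal userspace sketch. It is
not kernel code: attr_index() stands in for FIELD_GET(), pgprot values are
modeled as plain uint64_t, and the MT_* and PTE_ATTRINDX_* values are
assumptions copied from my reading of the arm64 headers, so check them
against your tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, modeled on the arm64 headers; verify against your tree. */
#define MT_NORMAL		0
#define MT_NORMAL_TAGGED	1
#define MT_NORMAL_NC		2
#define MT_DEVICE_nGnRnE	3
#define MT_DEVICE_nGnRE		4
#define PTE_ATTRINDX_SHIFT	2
#define PTE_ATTRINDX_MASK	(UINT64_C(7) << PTE_ATTRINDX_SHIFT)

static_assert(MT_DEVICE_nGnRE < 8, "attr index must fit the 3-bit field");

/* Hypothetical stand-in for FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(prot)). */
static unsigned int attr_index(uint64_t prot)
{
	return (unsigned int)((prot & PTE_ATTRINDX_MASK) >> PTE_ATTRINDX_SHIFT);
}

/* Shape of the v6 helper, with the return type changed to bool. */
static bool mapping_type_noncacheable(uint64_t prot)
{
	unsigned int mt = attr_index(prot);

	return mt == MT_NORMAL_NC || mt == MT_DEVICE_nGnRnE ||
	       mt == MT_DEVICE_nGnRE;
}

/* Shape of the inverted switch form; takes a raw prot value, not a VMA. */
static bool vma_is_cacheable(uint64_t prot)
{
	switch (attr_index(prot)) {
	case MT_NORMAL_NC:
	case MT_DEVICE_nGnRnE:
	case MT_DEVICE_nGnRE:
		return false;
	default:
		return true;
	}
}

/* Count attribute indexes where the two forms fail to be exact inverses. */
static int count_mismatches(void)
{
	int bad = 0;

	for (unsigned int mt = 0; mt < 8; mt++) {
		/* Set unrelated low and high bits to show they are ignored. */
		uint64_t prot = ((uint64_t)mt << PTE_ATTRINDX_SHIFT) |
				UINT64_C(0x3) | (UINT64_C(1) << 10);

		if (vma_is_cacheable(prot) == mapping_type_noncacheable(prot))
			bad++;
	}
	return bad;
}
```

Calling count_mismatches() from a test main() walks all eight possible
attribute indexes. Note that the default case treats any index that is not
one of the three listed types as cacheable, which is the same behavior as
negating the v6 helper.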


>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_s2_trans *nested,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
> @@ -1612,6 +1624,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  
>  	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
>  
> +	if ((vma->vm_flags & VM_PFNMAP) &&
> +	    !mapping_type_noncacheable(vma->vm_page_prot))

I don't think this is correct, and there's a very real chance this will break
existing setups.  PFNMAP memory isn't strictly device memory, and IIUC, KVM
forces DEVICE/NORMAL_NC based on kvm_is_device_pfn(), not based on VM_PFNMAP.

	if (kvm_is_device_pfn(pfn)) {
		/*
		 * If the page was identified as device early by looking at
		 * the VMA flags, vma_pagesize is already representing the
		 * largest quantity we can map.  If instead it was mapped
		 * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
		 * and must not be upgraded.
		 *
		 * In both cases, we don't let transparent_hugepage_adjust()
		 * change things at the last minute.
		 */
		device = true;
	}

	if (device) {
		if (vfio_allow_any_uc)
			prot |= KVM_PGTABLE_PROT_NORMAL_NC;
		else
			prot |= KVM_PGTABLE_PROT_DEVICE;
	} else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
		   (!nested || kvm_s2_trans_executable(nested))) {
		prot |= KVM_PGTABLE_PROT_X;
	}

which gets morphed into the hardware memtype attributes as:

	switch (prot & (KVM_PGTABLE_PROT_DEVICE |
			KVM_PGTABLE_PROT_NORMAL_NC)) {
	case KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC:
		return -EINVAL;
	case KVM_PGTABLE_PROT_DEVICE:
		if (prot & KVM_PGTABLE_PROT_X)
			return -EINVAL;
		attr = KVM_S2_MEMATTR(pgt, DEVICE_nGnRE);
		break;
	case KVM_PGTABLE_PROT_NORMAL_NC:
		if (prot & KVM_PGTABLE_PROT_X)
			return -EINVAL;
		attr = KVM_S2_MEMATTR(pgt, NORMAL_NC);
		break;
	default:
		attr = KVM_S2_MEMATTR(pgt, NORMAL);
	}

E.g. if the admin hides RAM from the kernel and manages it in userspace via
/dev/mem, this will break (I think).

So I believe what you want is something like this:

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index eeda92330ade..4129ab5ac871 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1466,6 +1466,18 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
        return vma->vm_flags & VM_MTE_ALLOWED;
 }
 
+static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
+{
+       switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
+       case MT_NORMAL_NC:
+       case MT_DEVICE_nGnRnE:
+       case MT_DEVICE_nGnRE:
+               return false;
+       default:
+               return true;
+       }
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
                          struct kvm_s2_trans *nested,
                          struct kvm_memory_slot *memslot, unsigned long hva,
@@ -1473,7 +1485,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 {
        int ret = 0;
        bool write_fault, writable, force_pte = false;
-       bool exec_fault, mte_allowed;
+       bool exec_fault, mte_allowed, is_vma_cacheable;
        bool device = false, vfio_allow_any_uc = false;
        unsigned long mmu_seq;
        phys_addr_t ipa = fault_ipa;
@@ -1615,6 +1627,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
        vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
 
+       is_vma_cacheable = kvm_vma_is_cacheable(vma);
+
        /* Don't use the VMA after the unlock -- it may have vanished */
        vma = NULL;
 
@@ -1639,6 +1653,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
                return -EFAULT;
 
        if (kvm_is_device_pfn(pfn)) {
+               if (is_vma_cacheable)
+                       return -EINVAL;
+
                /*
                 * If the page was identified as device early by looking at
                 * the VMA flags, vma_pagesize is already representing the
@@ -1722,6 +1739,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
                prot |= KVM_PGTABLE_PROT_X;
 
        if (device) {
+               if (is_vma_cacheable) {
+                       ret = -EINVAL;
+                       goto out;
+               }
+
                if (vfio_allow_any_uc)
                        prot |= KVM_PGTABLE_PROT_NORMAL_NC;
                else
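
To make the behavioral difference between the v6 check and the diff above
concrete, here is a rough userspace model. Everything in it is a
simplification for illustration: struct fault_info, enum s2_memattr,
check_v6() and select_s2_attr() are hypothetical names, the local "device"
flag is collapsed into a single device_pfn input, and the executable
permission handling is left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the stage-2 memory types KVM can pick. */
enum s2_memattr { S2_NORMAL, S2_NORMAL_NC, S2_DEVICE_nGnRE };

/* Hypothetical summary of what the fault path knows about the mapping. */
struct fault_info {
	bool vm_pfnmap;		/* VMA has VM_PFNMAP */
	bool vma_cacheable;	/* pgprot is not NORMAL_NC or DEVICE_* */
	bool device_pfn;	/* what kvm_is_device_pfn() would report */
	bool vfio_allow_any_uc;	/* VMA has VM_ALLOW_ANY_UNCACHED */
};

/* v6 behavior: reject every cacheable PFNMAP VMA. */
static int check_v6(const struct fault_info *f)
{
	return (f->vm_pfnmap && f->vma_cacheable) ? -EINVAL : 0;
}

/* Diff above: reject only when KVM would force a non-cacheable S2 type. */
static int select_s2_attr(const struct fault_info *f, enum s2_memattr *attr)
{
	if (f->device_pfn) {
		if (f->vma_cacheable)
			return -EINVAL;
		*attr = f->vfio_allow_any_uc ? S2_NORMAL_NC : S2_DEVICE_nGnRE;
		return 0;
	}
	*attr = S2_NORMAL;
	return 0;
}

void run_examples(void)
{
	enum s2_memattr attr = S2_NORMAL;
	struct fault_info f = {
		.vm_pfnmap = true, .vma_cacheable = true, .device_pfn = true,
	};

	/* Cacheable VMA over a "device" PFN: both approaches reject it. */
	assert(check_v6(&f) == -EINVAL);
	assert(select_s2_attr(&f, &attr) == -EINVAL);

	/* Cacheable PFNMAP over a non-"device" PFN: only v6 rejects it. */
	f.device_pfn = false;
	assert(check_v6(&f) == -EINVAL);
	assert(select_s2_attr(&f, &attr) == 0 && attr == S2_NORMAL);

	/* Non-cacheable VFIO mapping: both accept, S2 becomes NORMAL_NC. */
	f = (struct fault_info){
		.vm_pfnmap = true, .device_pfn = true,
		.vfio_allow_any_uc = true,
	};
	assert(check_v6(&f) == 0);
	assert(select_s2_attr(&f, &attr) == 0 && attr == S2_NORMAL_NC);
}
```

The middle case in run_examples() is the only one where the two checks
disagree in this model. Whether a given real setup lands in that case
depends on what kvm_is_device_pfn() reports for its PFNs, which the model
takes as an input rather than trying to answer.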
Re: [PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping
Posted by Jason Gunthorpe 6 months, 1 week ago
On Fri, Jun 06, 2025 at 11:11:56AM -0700, Sean Christopherson wrote:
> > @@ -1612,6 +1624,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >  
> >  	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> >  
> > +	if ((vma->vm_flags & VM_PFNMAP) &&
> > +	    !mapping_type_noncacheable(vma->vm_page_prot))
> 
> I don't think this is correct, and there's a very real chance this will break
> existing setups.  PFNMAP memory isn't strictly device memory, and IIUC, KVM
> > forces DEVICE/NORMAL_NC based on kvm_is_device_pfn(), not based on VM_PFNMAP.

kvm_is_device_pfn() effectively means KVM can't use CMOs on that
PFN. It doesn't really mean anything more.

PFNMAP says the same thing, or at least from a mm perspective we don't
want drivers taking PFNMAP memory and then trying to guess if there
are struct pages/KVAs for it. PFNMAP memory is supposed to be fully
opaque.

Though that confusion seems to be a separate issue from this patch.

> 	if (kvm_is_device_pfn(pfn)) {
> 		/*
> 		 * If the page was identified as device early by looking at
> 		 * the VMA flags, vma_pagesize is already representing the
> 		 * largest quantity we can map.  If instead it was mapped
> 		 * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
> 		 * and must not be upgraded.
> 		 *
> 		 * In both cases, we don't let transparent_hugepage_adjust()
> 		 * change things at the last minute.
> 		 */
> 		device = true;

"device" here is sort of a mis-nomer, it is really just trying to
set up the S2 so that CMOs are not going to be done.

Calling it 'disable_cmo' would sure make this code clearer..

> @@ -1639,6 +1653,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                 return -EFAULT;
>  
>         if (kvm_is_device_pfn(pfn)) {
> +               if (is_vma_cacheable)
> +                       return -EINVAL;
> +

E.g.

if (!kvm_can_use_cmo_pfn(pfn)) {
               if (is_vma_cacheable)
                       return -EINVAL;

>                  * If the page was identified as device early by looking at
>                  * the VMA flags, vma_pagesize is already representing the
> @@ -1722,6 +1739,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                 prot |= KVM_PGTABLE_PROT_X;
>  
>         if (device) {
> +               if (is_vma_cacheable) {
> +                       ret = -EINVAL;
> +                       goto out;
> +               }

if (disable_cmo) {
               if (is_vma_cacheable)
                       return -EINVAL;

Makes a lot more sense, right? If KVM can't do CMOs then it should not
attempt to use memory mapped into the VMA as cacheable.

>                 if (vfio_allow_any_uc)
>                         prot |= KVM_PGTABLE_PROT_NORMAL_NC;
>                 else
> 

Regardless, this seems good for this patch at least.

Jason
Re: [PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping
Posted by Sean Christopherson 6 months, 1 week ago
On Mon, Jun 09, 2025, Jason Gunthorpe wrote:
> On Fri, Jun 06, 2025 at 11:11:56AM -0700, Sean Christopherson wrote:
> > > @@ -1612,6 +1624,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >  
> > >  	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> > >  
> > > +	if ((vma->vm_flags & VM_PFNMAP) &&
> > > +	    !mapping_type_noncacheable(vma->vm_page_prot))
> > 
> > I don't think this is correct, and there's a very real chance this will break
> > existing setups.  PFNMAP memory isn't strictly device memory, and IIUC, KVM
> > > forces DEVICE/NORMAL_NC based on kvm_is_device_pfn(), not based on VM_PFNMAP.
> 
> kvm_is_device_pfn() effectively means KVM can't use CMOs on that
> PFN. It doesn't really mean anything more.

Ah, kvm_is_device_pfn() isn't actually detecting device memory, it's simply
detecting memory that isn't in the direct map.

> PFNMAP says the same thing, or at least from a mm perspective we don't
> want drivers taking PFNMAP memory and then trying to guess if there
> are struct pages/KVAs for it. PFNMAP memory is supposed to be fully
> opaque.
> 
> Though that confusion seems to be a separate issue from this patch.
> 
> > 	if (kvm_is_device_pfn(pfn)) {
> > 		/*
> > 		 * If the page was identified as device early by looking at
> > 		 * the VMA flags, vma_pagesize is already representing the
> > 		 * largest quantity we can map.  If instead it was mapped
> > 		 * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
> > 		 * and must not be upgraded.
> > 		 *
> > 		 * In both cases, we don't let transparent_hugepage_adjust()
> > 		 * change things at the last minute.
> > 		 */
> > 		device = true;
> 
> "device" here is sort of a mis-nomer, it is really just trying to
> set up the S2 so that CMOs are not going to be done.
> 
> Calling it 'disable_cmo' would sure make this code clearer..
> 
> > @@ -1639,6 +1653,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >                 return -EFAULT;
> >  
> >         if (kvm_is_device_pfn(pfn)) {
> > +               if (is_vma_cacheable)
> > +                       return -EINVAL;
> > +
> 
> eg
> 
> if (!kvm_can_use_cmo_pfn(pfn)) {
>                if (is_vma_cacheable)
>                        return -EINVAL;
> 
> >                  * If the page was identified as device early by looking at
> >                  * the VMA flags, vma_pagesize is already representing the
> > @@ -1722,6 +1739,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >                 prot |= KVM_PGTABLE_PROT_X;
> >  
> >         if (device) {
> > +               if (is_vma_cacheable) {
> > +                       ret = -EINVAL;
> > +                       goto out;
> > +               }
> 
> if (disable_cmo) {
>                if (is_vma_cacheable)
>                        return -EINVAL;
> 
> Makes a lot more sense, right? If KVM can't do CMOs then it should not
> attempt to use memory mapped into the VMA as cacheable.

Yes, for sure.

> >                 if (vfio_allow_any_uc)
> >                         prot |= KVM_PGTABLE_PROT_NORMAL_NC;
> >                 else
> > 
> 
> Regardless, this seems good for this patch at least.
> 
> Jason
Re: [PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping
Posted by Jason Gunthorpe 6 months, 3 weeks ago
On Sat, May 24, 2025 at 01:39:39AM +0000, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Fixes a security bug due to mismatched attributes between S1 and
> S2 mapping.
> 
> Currently, it is possible for a region to be cacheable in S1, but
> mapped

"cachable in the userspace VMA"

Jason