drivers/vfio/vfio_iommu_type1.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)
From: Max Boone <mboone@akamai.com>
A race between page table walking (e.g. via procfs numa_maps) and VFIO DMA
pinning can lead to temporary failures in follow_pfnmap_start(). When a
PUD entry is split and concurrently refaulted, the PFNMAP mapping may be
temporarily zapped, causing follow_pfnmap_start() to return an error.
Although follow_pfnmap_start() returns an -EINVAL this is not due to
invalid parameters, but rather because of the pfnmap being non-present.
Treat it as such, and retry by returning -EAGAIN, similar to how GUP
handles such races.
This avoids propagating an unexpected -EINVAL to userspace, like follows:
[dma_map]
dma_map iova=0x000000000000 size=0x000004000000 vaddr=0x00007f7800000000
dma_map FAILED iova=0x020000000000: [Errno 22] Invalid argument
dma_map iova=0x040000000000 size=0x000002000000 vaddr=0x00007f5780000000
Which would've succeeded on a retry.
Cc: stable@vger.kernel.org
Fixes: a77f9489f1d7 ("vfio: use the new follow_pfnmap API")
Signed-off-by: Max Boone <mboone@akamai.com>
---
drivers/vfio/vfio_iommu_type1.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 5167bec14..3a0d0bbb9 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -559,9 +559,17 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
if (ret)
return ret;
+ /*
+ * follow_pfnmap_start() returns -EINVAL for
+ * invalid parameters and non-present entries.
+ * If that happens here after a successful
+ * fixup_user_fault(), it is likely that the
+ * pfnmap has been zapped. Retry instead of
+ * failing.
+ */
ret = follow_pfnmap_start(&args);
if (ret)
- return ret;
+ return -EAGAIN;
}
if (write_fault && !args.writable) {
---
base-commit: 96ca4caf9066f5ebd35b561a521af588a8eb0215
change-id: 20260317-retry-pin-on-reclaimed-pud-dfb9e26eb8cf
Best regards,
--
Max Boone <mboone@akamai.com>
On Tue, 17 Mar 2026 16:07:45 +0100
Max Boone via B4 Relay <devnull+mboone.akamai.com@kernel.org> wrote:
> From: Max Boone <mboone@akamai.com>
>
> A race between page table walking (e.g. via procfs numa_maps) and VFIO DMA
> pinning can lead to temporary failures in follow_pfnmap_start(). When a
> PUD entry is split and concurrently refaulted, the PFNMAP mapping may be
> temporarily zapped, causing follow_pfnmap_start() to return an error.
>
> Although follow_pfnmap_start() returns an -EINVAL this is not due to
> invalid parameters, but rather because of the pfnmap being non-present.
> Treat it as such, and retry by returning -EAGAIN, similar to how GUP
> handles such races.
>
> This avoids propagating an unexpected -EINVAL to userspace, like follows:
> [dma_map]
> dma_map iova=0x000000000000 size=0x000004000000 vaddr=0x00007f7800000000
> dma_map FAILED iova=0x020000000000: [Errno 22] Invalid argument
> dma_map iova=0x040000000000 size=0x000002000000 vaddr=0x00007f5780000000
>
> Which would've succeeded on a retry.
>
> Cc: stable@vger.kernel.org
> Fixes: a77f9489f1d7 ("vfio: use the new follow_pfnmap API")
> Signed-off-by: Max Boone <mboone@akamai.com>
> ---
> drivers/vfio/vfio_iommu_type1.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 5167bec14..3a0d0bbb9 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -559,9 +559,17 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> if (ret)
> return ret;
>
> + /*
> + * follow_pfnmap_start() returns -EINVAL for
> + * invalid parameters and non-present entries.
> + * If that happens here after a successful
> + * fixup_user_fault(), it is likely that the
> + * pfnmap has been zapped. Retry instead of
> + * failing.
> + */
It's a little stronger than that, right? We're betting that the only
remaining non-zero return is due to a race and we can introduce what
appears to be potential for an infinite loop here because -EAGAIN will
get kicked out to redo the vma_lookup() and fixup_user_fault() should
return a genuine error if we're completely in the weeds. Should we
make this a little stronger and more specific? Thanks,
Alex
> ret = follow_pfnmap_start(&args);
> if (ret)
> - return ret;
> + return -EAGAIN;
> }
>
> if (write_fault && !args.writable) {
>
> ---
> base-commit: 96ca4caf9066f5ebd35b561a521af588a8eb0215
> change-id: 20260317-retry-pin-on-reclaimed-pud-dfb9e26eb8cf
>
> Best regards,
> On Mar 18, 2026, at 10:22 PM, Alex Williamson <alex@shazbot.org> wrote:
>
> […]
>
>> + /*
>> + * follow_pfnmap_start() returns -EINVAL for
>> + * invalid parameters and non-present entries.
>> + * If that happens here after a successful
>> + * fixup_user_fault(), it is likely that the
>> + * pfnmap has been zapped. Retry instead of
>> + * failing.
>> + */
>
> It's a little stronger than that, right? We're betting that the only
> remaining non-zero return is due to a race and we can introduce what
> appears to be potential for an infinite loop here because -EAGAIN will
> get kicked out to redo the vma_lookup() and fixup_user_fault() should
> return a genuine error if we're completely in the weeds. Should we
> make this a little stronger and more specific? Thanks,
I’d say that the best case would be to have follow_pfnmap_start() return
-EINVAL or -ENOENT w.r.t. which of the two return values it is. But then
again, we could theoretically run into an infinite loop I guess - as the zap
and faulting could run in lockstep (the race window is extremely small
though).
We could make the retry above bounded, and bubble up a -EBUSY such
that users of the ioctl can decide to retry instead of fail?
David, you mentioned that gup already has retry logic that we don’t have
with follow_fault_pfn() -> follow_pfnmap_start(). Would we potentially run
into an infinite loop with this change?
I see that the same pattern also appears in:
- virt/kvm/kvm_main.c:hva_to_pfn_remapped()
- s390/pci/pci_mmio.c:s390_pci_mmio_write()
And other users of the function are:
- drivers/virt/acrn/mm.c:acrn_vm_ram_map()
- mm/memory.c:generic_access_phys()
Maybe I can better draft in this patch with adding an -ENOENT return to
follow_pfnmap_start() and in the same patchset add retrying logic where
necessary (on first sight, kvm and vfio)?
>
> Alex
>
>> ret = follow_pfnmap_start(&args);
>> if (ret)
>> - return ret;
>> + return -EAGAIN;
>> }
>>
>> if (write_fault && !args.writable) {
>>
>> ---
>> base-commit: 96ca4caf9066f5ebd35b561a521af588a8eb0215
>> change-id: 20260317-retry-pin-on-reclaimed-pud-dfb9e26eb8cf
>>
>> Best regards,
>
On 3/19/26 09:36, Boone, Max wrote: > > >> On Mar 18, 2026, at 10:22 PM, Alex Williamson <alex@shazbot.org> wrote: >> >> […] >> >>> + /* >>> + * follow_pfnmap_start() returns -EINVAL for >>> + * invalid parameters and non-present entries. >>> + * If that happens here after a successful >>> + * fixup_user_fault(), it is likely that the >>> + * pfnmap has been zapped. Retry instead of >>> + * failing. >>> + */ >> >> It's a little stronger than that, right? We're betting that the only >> remaining non-zero return is due to a race and we can introduce what >> appears to be potential for an infinite loop here because -EAGAIN will >> get kicked out to redo the vma_lookup() and fixup_user_fault() should >> return a genuine error if we're completely in the weeds. Should we >> make this a little stronger and more specific? Thanks, > > I’d say that the best case would be to have follow_pfnmap_start() return > -EINVAL or -ENOENT w.r.t. which of the two return values it is. But then > again, we could theoretically run into an infinite loop I guess - as the zap > and faulting could run in lockstep (the race window is extremely small > though). Well, in theory :) To hit that race repeatedly, you'd really have to be quite lucky I guess. But the real question is: if user space triggered the pinning, and user space keeps hurting itself to make progress, is that a real problem? I guess the crucial part would be to a) Have some cond_resched(() in there? b) Checking for fatal signals somewhere? c) Possibly drop locks (mmap lock?) every now and then? For GUP, a) and b) are in place in __get_user_pages(). c) might be done, but I think it's less deterministic. > > We could make the retry above bounded, and bubble up a -EBUSY such > that users of the ioctl can decide to retry instead of fail? Would that be a possible ABI break? You'd really have to only do that in a case where user space does stupid things, I guess. > > David, you mentioned that gup already has retry logic that we don’t have > with follow_fault_pfn() -> follow_pfnmap_start(). Would we potentially run > into an infinite loop with this change? GUP triggers page faults through faultin_page(). If handle_mm_fault() returns * VM_FAULT_COMPLETED we return -EAGAIN * VM_FAULT_ERROR we return the error * VM_FAULT_RETRY we return -EBUSY * Otherwise 0 In the caller __get_user_pages(), we * Retry immediately with ret == 0 * Return to the GUP caller (letting it retry) with -EBUSY/-EAGAIN Having at least a) and b) sounds reasonable. Not sure about having c), might be tricky if we are not allowed to drop the lock. -- Cheers, David
On Thu, 19 Mar 2026 14:18:49 +0100 "David Hildenbrand (Arm)" <david@kernel.org> wrote: > On 3/19/26 09:36, Boone, Max wrote: > > > > > >> On Mar 18, 2026, at 10:22 PM, Alex Williamson <alex@shazbot.org> wrote: > >> > >> […] > >> > >>> + /* > >>> + * follow_pfnmap_start() returns -EINVAL for > >>> + * invalid parameters and non-present entries. > >>> + * If that happens here after a successful > >>> + * fixup_user_fault(), it is likely that the > >>> + * pfnmap has been zapped. Retry instead of > >>> + * failing. > >>> + */ > >> > >> It's a little stronger than that, right? We're betting that the only > >> remaining non-zero return is due to a race and we can introduce what > >> appears to be potential for an infinite loop here because -EAGAIN will > >> get kicked out to redo the vma_lookup() and fixup_user_fault() should > >> return a genuine error if we're completely in the weeds. Should we > >> make this a little stronger and more specific? Thanks, > > > > I’d say that the best case would be to have follow_pfnmap_start() return > > -EINVAL or -ENOENT w.r.t. which of the two return values it is. But then > > again, we could theoretically run into an infinite loop I guess - as the zap > > and faulting could run in lockstep (the race window is extremely small > > though). > > Well, in theory :) To hit that race repeatedly, you'd really have to be > quite lucky I guess. > > But the real question is: if user space triggered the pinning, and user > space keeps hurting itself to make progress, is that a real problem? I'd say no, that's not a problem. It's really just that masking all non-zero returns as -EAGAIN makes some assumptions about the other error conditions that could change over time, so minimally those dependencies should be clearly stated. Even better would be if we didn't need to make those assumptions. Thanks, Alex > I guess the crucial part would be to > > a) Have some cond_resched(() in there? > b) Checking for fatal signals somewhere? > c) Possibly drop locks (mmap lock?) every now and then? > > For GUP, a) and b) are in place in __get_user_pages(). > > c) might be done, but I think it's less deterministic. > > > > > We could make the retry above bounded, and bubble up a -EBUSY such > > that users of the ioctl can decide to retry instead of fail? > > Would that be a possible ABI break? You'd really have to only do that in > a case where user space does stupid things, I guess. > > > > > David, you mentioned that gup already has retry logic that we don’t have > > with follow_fault_pfn() -> follow_pfnmap_start(). Would we potentially run > > into an infinite loop with this change? > > GUP triggers page faults through faultin_page(). If handle_mm_fault() > returns > > * VM_FAULT_COMPLETED we return -EAGAIN > * VM_FAULT_ERROR we return the error > * VM_FAULT_RETRY we return -EBUSY > * Otherwise 0 > > In the caller __get_user_pages(), we > * Retry immediately with ret == 0 > * Return to the GUP caller (letting it retry) with -EBUSY/-EAGAIN > > Having at least a) and b) sounds reasonable. Not sure about having c), > might be tricky if we are not allowed to drop the lock. >
On 3/19/26 15:30, Alex Williamson wrote: > On Thu, 19 Mar 2026 14:18:49 +0100 > "David Hildenbrand (Arm)" <david@kernel.org> wrote: > >> On 3/19/26 09:36, Boone, Max wrote: >>> >>> >>> >>> I’d say that the best case would be to have follow_pfnmap_start() return >>> -EINVAL or -ENOENT w.r.t. which of the two return values it is. But then >>> again, we could theoretically run into an infinite loop I guess - as the zap >>> and faulting could run in lockstep (the race window is extremely small >>> though). >> >> Well, in theory :) To hit that race repeatedly, you'd really have to be >> quite lucky I guess. >> >> But the real question is: if user space triggered the pinning, and user >> space keeps hurting itself to make progress, is that a real problem? > > I'd say no, that's not a problem. It's really just that masking all > non-zero returns as -EAGAIN makes some assumptions about the other > error conditions that could change over time, so minimally those > dependencies should be clearly stated. Even better would be if we > didn't need to make those assumptions. I agree that follow_pfnmap_start() should be improved, to return -EINVAL on invalid parameters and -ENOENT on "no present entry"." And -ENOENT should be documented to be resolvable by triggering a fault. Looking at it, the retry label in the function also seems avoidable, and some other checks might be shaky ... Let me briefly see if I can rework that. -- Cheers, David
© 2016 - 2026 Red Hat, Inc.