From: Hou Tao <houtao1@huawei.com>

When vm_insert_page() fails in p2pmem_alloc_mmap(), p2pmem_alloc_mmap()
doesn't invoke percpu_ref_put() to free the per-cpu ref of pgmap
acquired after gen_pool_alloc_owner(), and memunmap_pages() will hang
forever when trying to remove the PCIe device.

Fix it by adding the missed percpu_ref_put().

Fixes: 7e9c7ef83d78 ("PCI/P2PDMA: Allow userspace VMA allocations through sysfs")
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 drivers/pci/p2pdma.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4a2fc7ab42c3..218c1f5252b6 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -152,6 +152,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
 	ret = vm_insert_page(vma, vaddr, page);
 	if (ret) {
 		gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
+		percpu_ref_put(ref);
 		return ret;
 	}
 	percpu_ref_get(ref);
--
2.29.2
On 2025-12-20 at 15:04 +1100, Hou Tao <houtao@huaweicloud.com> wrote...
> From: Hou Tao <houtao1@huawei.com>
>
> When vm_insert_page() fails in p2pmem_alloc_mmap(), p2pmem_alloc_mmap()
> doesn't invoke percpu_ref_put() to free the per-cpu ref of pgmap
> acquired after gen_pool_alloc_owner(), and memunmap_pages() will hang
> forever when trying to remove the PCIe device.
>
> Fix it by adding the missed percpu_ref_put().
This pairs with the percpu_ref_tryget_live_rcu() above, right? Might be worth
mentioning that as a comment, but overall looks good to me so feel free to add:

Reviewed-by: Alistair Popple <apopple@nvidia.com>
>
> Fixes: 7e9c7ef83d78 ("PCI/P2PDMA: Allow userspace VMA allocations through sysfs")
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
> drivers/pci/p2pdma.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 4a2fc7ab42c3..218c1f5252b6 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -152,6 +152,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
> ret = vm_insert_page(vma, vaddr, page);
> if (ret) {
> gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
> + percpu_ref_put(ref);
> return ret;
> }
> percpu_ref_get(ref);
> --
> 2.29.2
>
On Thu, Jan 08, 2026 at 02:23:16PM +1100, Alistair Popple wrote:
> On 2025-12-20 at 15:04 +1100, Hou Tao <houtao@huaweicloud.com> wrote...
> > From: Hou Tao <houtao1@huawei.com>
> >
> > When vm_insert_page() fails in p2pmem_alloc_mmap(), p2pmem_alloc_mmap()
> > doesn't invoke percpu_ref_put() to free the per-cpu ref of pgmap
> > acquired after gen_pool_alloc_owner(), and memunmap_pages() will hang
> > forever when trying to remove the PCIe device.
> >
> > Fix it by adding the missed percpu_ref_put().
>
> This pairs with the percpu_ref_tryget_live_rcu() above right? Might
> be worth mentioning that as a comment, but overall looks good to me
> so feel free to add:
>
> Reviewed-by: Alistair Popple <apopple@nvidia.com>
Added your Reviewed-by, thanks!

Would the following commit log address your suggestion?

  When the vm_insert_page() in p2pmem_alloc_mmap() failed, we did not
  invoke percpu_ref_put() to free the per-CPU pgmap ref acquired by
  percpu_ref_tryget_live_rcu(), which meant that PCI device removal would
  hang forever in memunmap_pages().

  Fix it by adding the missed percpu_ref_put().

Looking at this again, I'm confused about why in the normal, non-error
case, we do the percpu_ref_tryget_live_rcu(ref), followed by another
percpu_ref_get(ref) for each page, followed by just a single
percpu_ref_put() at the exit.

So we do ref_get() "1 + number of pages" times but we only do a single
ref_put(). Is there a loop of ref_put() for each page elsewhere?
> > Fixes: 7e9c7ef83d78 ("PCI/P2PDMA: Allow userspace VMA allocations through sysfs")
> > Signed-off-by: Hou Tao <houtao1@huawei.com>
> > ---
> > drivers/pci/p2pdma.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> > index 4a2fc7ab42c3..218c1f5252b6 100644
> > --- a/drivers/pci/p2pdma.c
> > +++ b/drivers/pci/p2pdma.c
> > @@ -152,6 +152,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
> > ret = vm_insert_page(vma, vaddr, page);
> > if (ret) {
> > gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
> > + percpu_ref_put(ref);
> > return ret;
> > }
> > percpu_ref_get(ref);
> > --
> > 2.29.2
> >
On 2026-01-09 at 02:55 +1100, Bjorn Helgaas <helgaas@kernel.org> wrote...
> On Thu, Jan 08, 2026 at 02:23:16PM +1100, Alistair Popple wrote:
> > On 2025-12-20 at 15:04 +1100, Hou Tao <houtao@huaweicloud.com> wrote...
> > > From: Hou Tao <houtao1@huawei.com>
> > >
> > > When vm_insert_page() fails in p2pmem_alloc_mmap(), p2pmem_alloc_mmap()
> > > doesn't invoke percpu_ref_put() to free the per-cpu ref of pgmap
> > > acquired after gen_pool_alloc_owner(), and memunmap_pages() will hang
> > > forever when trying to remove the PCIe device.
> > >
> > > Fix it by adding the missed percpu_ref_put().
> >
> > This pairs with the percpu_ref_tryget_live_rcu() above right? Might
> > be worth mentioning that as a comment, but overall looks good to me
> > so feel free to add:
> >
> > Reviewed-by: Alistair Popple <apopple@nvidia.com>
>
> Added your Reviewed-by, thanks!
>
> Would the following commit log address your suggestion?
>
> When the vm_insert_page() in p2pmem_alloc_mmap() failed, we did not
> invoke percpu_ref_put() to free the per-CPU pgmap ref acquired by
> percpu_ref_tryget_live_rcu(), which meant that PCI device removal would
> hang forever in memunmap_pages().
>
> Fix it by adding the missed percpu_ref_put().
Yes, that looks perfect. Thanks.
> Looking at this again, I'm confused about why in the normal, non-error
> case, we do the percpu_ref_tryget_live_rcu(ref), followed by another
> percpu_ref_get(ref) for each page, followed by just a single
> percpu_ref_put() at the exit.
>
> So we do ref_get() "1 + number of pages" times but we only do a single
> ref_put(). Is there a loop of ref_put() for each page elsewhere?
Right, the per-page ref_put() happens when the page is freed (i.e. when the
struct page refcount drops to zero) - in this case free_zone_device_folio()
will call p2pdma_folio_free(), which has the corresponding percpu_ref_put().

It would be nice to harmonize the pgmap refcounting across all ZONE_DEVICE
users. For example, for MEMORY_DEVICE_PRIVATE/COHERENT pages we could drop the
reference in the generic free_zone_device_folio() rather than in the specific
free callback. Although the whole thing is actually a bit redundant now and I
have debated removing it entirely - it really just serves as an optimised way
to do a sanity check that no pages are in use when memunmap_pages() is called.
The alternative would be just to check the refcount of every page.
> > > Fixes: 7e9c7ef83d78 ("PCI/P2PDMA: Allow userspace VMA allocations through sysfs")
> > > Signed-off-by: Hou Tao <houtao1@huawei.com>
> > > ---
> > > drivers/pci/p2pdma.c | 1 +
> > > 1 file changed, 1 insertion(+)
> > >
> > > diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> > > index 4a2fc7ab42c3..218c1f5252b6 100644
> > > --- a/drivers/pci/p2pdma.c
> > > +++ b/drivers/pci/p2pdma.c
> > > @@ -152,6 +152,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
> > > ret = vm_insert_page(vma, vaddr, page);
> > > if (ret) {
> > > gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
> > > + percpu_ref_put(ref);
> > > return ret;
> > > }
> > > percpu_ref_get(ref);
> > > --
> > > 2.29.2
> > >
>
On Fri, Jan 09, 2026 at 11:41:51AM +1100, Alistair Popple wrote:
> On 2026-01-09 at 02:55 +1100, Bjorn Helgaas <helgaas@kernel.org> wrote...
> > On Thu, Jan 08, 2026 at 02:23:16PM +1100, Alistair Popple wrote:
> > > On 2025-12-20 at 15:04 +1100, Hou Tao <houtao@huaweicloud.com> wrote...
> > > > From: Hou Tao <houtao1@huawei.com>
> > > >
> > > > When vm_insert_page() fails in p2pmem_alloc_mmap(), p2pmem_alloc_mmap()
> > > > doesn't invoke percpu_ref_put() to free the per-cpu ref of pgmap
> > > > acquired after gen_pool_alloc_owner(), and memunmap_pages() will hang
> > > > forever when trying to remove the PCIe device.
> > > >
> > > > Fix it by adding the missed percpu_ref_put().
> ...
> > Looking at this again, I'm confused about why in the normal, non-error
> > case, we do the percpu_ref_tryget_live_rcu(ref), followed by another
> > percpu_ref_get(ref) for each page, followed by just a single
> > percpu_ref_put() at the exit.
> >
> > So we do ref_get() "1 + number of pages" times but we only do a single
> > ref_put(). Is there a loop of ref_put() for each page elsewhere?
>
> Right, the per-page ref_put() happens when the page is freed (ie. the struct
> page refcount drops to zero) - in this case free_zone_device_folio() will call
> p2pdma_folio_free() which has the corresponding percpu_ref_put().

I don't see anything that looks like a loop to call ref_put() for each
page in free_zone_device_folio() or in p2pdma_folio_free(), but this
is all completely out of my range, so I'll take your word for it :)

Bjorn
On 2026-01-10 at 02:03 +1100, Bjorn Helgaas <helgaas@kernel.org> wrote...
> On Fri, Jan 09, 2026 at 11:41:51AM +1100, Alistair Popple wrote:
> > On 2026-01-09 at 02:55 +1100, Bjorn Helgaas <helgaas@kernel.org> wrote...
> > > On Thu, Jan 08, 2026 at 02:23:16PM +1100, Alistair Popple wrote:
> > > > On 2025-12-20 at 15:04 +1100, Hou Tao <houtao@huaweicloud.com> wrote...
> > > > > From: Hou Tao <houtao1@huawei.com>
> > > > >
> > > > > When vm_insert_page() fails in p2pmem_alloc_mmap(), p2pmem_alloc_mmap()
> > > > > doesn't invoke percpu_ref_put() to free the per-cpu ref of pgmap
> > > > > acquired after gen_pool_alloc_owner(), and memunmap_pages() will hang
> > > > > forever when trying to remove the PCIe device.
> > > > >
> > > > > Fix it by adding the missed percpu_ref_put().
> > ...
>
> > > Looking at this again, I'm confused about why in the normal, non-error
> > > case, we do the percpu_ref_tryget_live_rcu(ref), followed by another
> > > percpu_ref_get(ref) for each page, followed by just a single
> > > percpu_ref_put() at the exit.
> > >
> > > So we do ref_get() "1 + number of pages" times but we only do a single
> > > ref_put(). Is there a loop of ref_put() for each page elsewhere?
> >
> > Right, the per-page ref_put() happens when the page is freed (ie. the struct
> > page refcount drops to zero) - in this case free_zone_device_folio() will call
> > p2pdma_folio_free() which has the corresponding percpu_ref_put().
>
> I don't see anything that looks like a loop to call ref_put() for each
> page in free_zone_device_folio() or in p2pdma_folio_free(), but this
> is all completely out of my range, so I'll take your word for it :)
That's brave :-)
What happens is the core mm takes over managing the page lifetime once
vm_insert_page() has been (successfully) called to map the page:
	VM_WARN_ON_ONCE_PAGE(!page_ref_count(page), page);
	set_page_count(page, 1);
	ret = vm_insert_page(vma, vaddr, page);
	if (ret) {
		gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
		return ret;
	}
	percpu_ref_get(ref);
	put_page(page);
In the above sequence vm_insert_page() takes a page ref for each page it maps
into the user page tables with folio_get(). This reference is dropped when the
user page table entry is removed, typically by the loop in zap_pte_range().
Normally the user page table mapping is the only thing holding a reference so
it ends up calling folio_put()->free_zone_device_folio->...->ref_put() one page
at a time as the PTEs are removed from the page tables. At least that's what
happens conceptually - the TLB batching code makes it hard to actually see where
the folio_put() is called in this sequence.
Note the extra set_page_count(1) and put_page(page) in the above sequence is
just to make vm_insert_page() happy - it complains if you try to insert a page
with a zero page ref.
And looking at that sequence there is another minor bug - in the failure
path we are exiting the loop with the failed page ref count set to
1 from set_page_count(page, 1). That needs to be reset to zero with
set_page_count(page, 0) to avoid the VM_WARN_ON_ONCE_PAGE() if the page gets
reused. I will send a fix for that.
- Alistair
> Bjorn
On 2026-01-12 at 10:21 +1100, Alistair Popple <apopple@nvidia.com> wrote...
> On 2026-01-10 at 02:03 +1100, Bjorn Helgaas <helgaas@kernel.org> wrote...
> > On Fri, Jan 09, 2026 at 11:41:51AM +1100, Alistair Popple wrote:
> > > On 2026-01-09 at 02:55 +1100, Bjorn Helgaas <helgaas@kernel.org> wrote...
> > > > On Thu, Jan 08, 2026 at 02:23:16PM +1100, Alistair Popple wrote:
> > > > > On 2025-12-20 at 15:04 +1100, Hou Tao <houtao@huaweicloud.com> wrote...
> > > > > > From: Hou Tao <houtao1@huawei.com>
> > > > > >
> > > > > > When vm_insert_page() fails in p2pmem_alloc_mmap(), p2pmem_alloc_mmap()
> > > > > > doesn't invoke percpu_ref_put() to free the per-cpu ref of pgmap
> > > > > > acquired after gen_pool_alloc_owner(), and memunmap_pages() will hang
> > > > > > forever when trying to remove the PCIe device.
> > > > > >
> > > > > > Fix it by adding the missed percpu_ref_put().
> > > ...
> >
> > > > Looking at this again, I'm confused about why in the normal, non-error
> > > > case, we do the percpu_ref_tryget_live_rcu(ref), followed by another
> > > > percpu_ref_get(ref) for each page, followed by just a single
> > > > percpu_ref_put() at the exit.
> > > >
> > > > So we do ref_get() "1 + number of pages" times but we only do a single
> > > > ref_put(). Is there a loop of ref_put() for each page elsewhere?
> > >
> > > Right, the per-page ref_put() happens when the page is freed (ie. the struct
> > > page refcount drops to zero) - in this case free_zone_device_folio() will call
> > > p2pdma_folio_free() which has the corresponding percpu_ref_put().
> >
> > I don't see anything that looks like a loop to call ref_put() for each
> > page in free_zone_device_folio() or in p2pdma_folio_free(), but this
> > is all completely out of my range, so I'll take your word for it :)
>
> That's brave :-)
>
> What happens is the core mm takes over managing the page life time once
> vm_insert_page() has been (successfully) called to map the page:
>
> VM_WARN_ON_ONCE_PAGE(!page_ref_count(page), page);
> set_page_count(page, 1);
> ret = vm_insert_page(vma, vaddr, page);
> if (ret) {
> gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
> return ret;
> }
> percpu_ref_get(ref);
> put_page(page);
>
> In the above sequence vm_insert_page() takes a page ref for each page it maps
> into the user page tables with folio_get(). This reference is dropped when the
> user page table entry is removed, typically by the loop in zap_pte_range().
>
> Normally the user page table mapping is the only thing holding a reference so
> it ends up calling folio_put()->free_zone_device_folio->...->ref_put() one page
> at a time as the PTEs are removed from the page tables. At least that's what
> happens conceptually - the TLB batching code makes it hard to actually see where
> the folio_put() is called in this sequence.
>
> Note the extra set_page_count(1) and put_page(page) in the above sequence is
> just to make vm_insert_page() happy - it complains if you try to insert a page
> with a zero page ref.
>
> And looking at that sequence there is another minor bug - in the failure
> path we are exiting the loop with the failed page ref count set to
> 1 from set_page_count(page, 1). That needs to be reset to zero with
> set_page_count(page, 0) to avoid the VM_WARN_ON_ONCE_PAGE() if the page gets
> reused. I will send a fix for that.
Actually the whole failure path above seems wrong to me - we
free the entire allocation with gen_pool_free() even though
vm_insert_page() may have succeeded in mapping some pages. AFAICT the
generic VFS mmap code will call unmap_region() to undo any partial
mapping (see __mmap_new_file_vma) but that should end up calling
folio_put()->zone_free_device_range()->p2pdma_folio_free()->gen_pool_free_owner()
for the mapped pages even though we've already freed the entire pool.
> - Alistair
>
> > Bjorn
>
On 2026-01-12 at 11:12 +1100, Alistair Popple <apopple@nvidia.com> wrote...
> On 2026-01-12 at 10:21 +1100, Alistair Popple <apopple@nvidia.com> wrote...
> > On 2026-01-10 at 02:03 +1100, Bjorn Helgaas <helgaas@kernel.org> wrote...
> > > On Fri, Jan 09, 2026 at 11:41:51AM +1100, Alistair Popple wrote:
> > > > On 2026-01-09 at 02:55 +1100, Bjorn Helgaas <helgaas@kernel.org> wrote...
> > > > > On Thu, Jan 08, 2026 at 02:23:16PM +1100, Alistair Popple wrote:
> > > > > > On 2025-12-20 at 15:04 +1100, Hou Tao <houtao@huaweicloud.com> wrote...
> > > > > > > From: Hou Tao <houtao1@huawei.com>
> > > > > > >
> > > > > > > When vm_insert_page() fails in p2pmem_alloc_mmap(), p2pmem_alloc_mmap()
> > > > > > > doesn't invoke percpu_ref_put() to free the per-cpu ref of pgmap
> > > > > > > acquired after gen_pool_alloc_owner(), and memunmap_pages() will hang
> > > > > > > forever when trying to remove the PCIe device.
> > > > > > >
> > > > > > > Fix it by adding the missed percpu_ref_put().
> > > > ...
> > >
> > > > > Looking at this again, I'm confused about why in the normal, non-error
> > > > > case, we do the percpu_ref_tryget_live_rcu(ref), followed by another
> > > > > percpu_ref_get(ref) for each page, followed by just a single
> > > > > percpu_ref_put() at the exit.
> > > > >
> > > > > So we do ref_get() "1 + number of pages" times but we only do a single
> > > > > ref_put(). Is there a loop of ref_put() for each page elsewhere?
> > > >
> > > > Right, the per-page ref_put() happens when the page is freed (ie. the struct
> > > > page refcount drops to zero) - in this case free_zone_device_folio() will call
> > > > p2pdma_folio_free() which has the corresponding percpu_ref_put().
> > >
> > > I don't see anything that looks like a loop to call ref_put() for each
> > > page in free_zone_device_folio() or in p2pdma_folio_free(), but this
> > > is all completely out of my range, so I'll take your word for it :)
> >
> > That's brave :-)
> >
> > What happens is the core mm takes over managing the page life time once
> > vm_insert_page() has been (successfully) called to map the page:
> >
> > VM_WARN_ON_ONCE_PAGE(!page_ref_count(page), page);
> > set_page_count(page, 1);
> > ret = vm_insert_page(vma, vaddr, page);
> > if (ret) {
> > gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
> > return ret;
> > }
> > percpu_ref_get(ref);
> > put_page(page);
> >
> > In the above sequence vm_insert_page() takes a page ref for each page it maps
> > into the user page tables with folio_get(). This reference is dropped when the
> > user page table entry is removed, typically by the loop in zap_pte_range().
> >
> > Normally the user page table mapping is the only thing holding a reference so
> > it ends up calling folio_put()->free_zone_device_folio->...->ref_put() one page
> > at a time as the PTEs are removed from the page tables. At least that's what
> > happens conceptually - the TLB batching code makes it hard to actually see where
> > the folio_put() is called in this sequence.
> >
> > Note the extra set_page_count(1) and put_page(page) in the above sequence is
> > just to make vm_insert_page() happy - it complains if you try to insert a page
> > with a zero page ref.
> >
> > And looking at that sequence there is another minor bug - in the failure
> > path we are exiting the loop with the failed page ref count set to
> > 1 from set_page_count(page, 1). That needs to be reset to zero with
> > set_page_count(page, 0) to avoid the VM_WARN_ON_ONCE_PAGE() if the page gets
> > reused. I will send a fix for that.
>
> Actually the whole failure path above seems wrong to me - we
> free the entire allocation with gen_pool_free() even though
> vm_insert_page() may have succeeded in mapping some pages. AFAICT the
> generic VFS mmap code will call unmap_region() to undo any partial
> mapping (see __mmap_new_file_vma) but that should end up calling
> folio_put()->zone_free_device_range()->p2pdma_folio_free()->gen_pool_free_owner()
> for the mapped pages even though we've already freed the entire pool.
Oh, never mind - I hit send too soon. Ignore the above paragraph; I hadn't
noticed kaddr/len get updated at the end of the loop to account for the
successful mappings.
> > - Alistair
> >
> > > Bjorn
> >
>
On 2025-12-19 21:04, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> When vm_insert_page() fails in p2pmem_alloc_mmap(), p2pmem_alloc_mmap()
> doesn't invoke percpu_ref_put() to free the per-cpu ref of pgmap
> acquired after gen_pool_alloc_owner(), and memunmap_pages() will hang
> forever when trying to remove the PCIe device.
>
> Fix it by adding the missed percpu_ref_put().
>
> Fixes: 7e9c7ef83d78 ("PCI/P2PDMA: Allow userspace VMA allocations through sysfs")
> Signed-off-by: Hou Tao <houtao1@huawei.com>
Nice catch, thanks:

Reviewed-by: Logan Gunthorpe <logang@deltatee.com>

Logan