[PATCH 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool

Posted by Peter Xu 1 year, 2 months ago
In commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a
process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed"),
avoid_reserve was introduced for a special case of CoW on hugetlb private
mappings: the owner VMA is allocating yet another hugetlb folio that is
not reserved within the private vma's reserve map.

Later on, commit d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas
hole punched by fallocate") made alloc_huge_page() refuse to consume any
global reservation as long as avoid_reserve=true.  That doesn't look
correct: even though it keeps the allocation from consuming the global
reservation, it still takes one reservation from the spool (if a subpool
exists).  And since spool reserved pages are taken from the global
reservation, one global reservation is effectively consumed anyway.

Logically, this can make the global reservation count go wrong.
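
To illustrate, here is a simplified sketch of the pre-fix flow (names
follow mm/hugetlb.c, but details are elided; see the diff below for the
real code):

  gbl_chg = hugepage_subpool_get_pages(spool, 1); /* spool->rsv_hpages-- */
  if (avoid_reserve)
          gbl_chg = 1;  /* pretend no reservation was consumed... */
  /*
   * ...so the dequeue path, seeing gbl_chg != 0 and avoid_reserve set,
   * never does h->resv_huge_pages--, leaving the global reservation
   * count one too high.
   */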

I wrote a reproducer below to trigger this special path; every run of the
program causes the global reservation count to increment by one, until it
hits the number of free pages:

  #define _GNU_SOURCE             /* See feature_test_macros(7) */
  #include <stdio.h>
  #include <fcntl.h>
  #include <errno.h>
  #include <unistd.h>
  #include <stdlib.h>
  #include <sys/mman.h>

  #define  MSIZE  (2UL << 20)

  int main(int argc, char *argv[])
  {
      const char *path;
      int *buf;
      int fd, ret;
      pid_t child;

      if (argc < 2) {
          printf("usage: %s <hugetlb_file>\n", argv[0]);
          return -1;
      }

      path = argv[1];

      fd = open(path, O_RDWR | O_CREAT, 0666);
      if (fd < 0) {
          perror("open failed");
          return -1;
      }

      ret = fallocate(fd, 0, 0, MSIZE);
      if (ret != 0) {
          perror("fallocate");
          return -1;
      }

      buf = mmap(NULL, MSIZE, PROT_READ|PROT_WRITE,
                 MAP_PRIVATE, fd, 0);

      if (buf == MAP_FAILED) {
          perror("mmap() failed");
          return -1;
      }

      /* Allocate a page */
      *buf = 1;

      child = fork();
      if (child == 0) {
          /* child doesn't need to do anything */
          exit(0);
      }

      /* Trigger CoW from owner */
      *buf = 2;

      munmap(buf, MSIZE);
      close(fd);
      unlink(path);

      return 0;
  }

It can only be reproduced with a sub-mount when there are reserved pages in
the spool, like:

  # sysctl vm.nr_hugepages=128
  # mkdir ./hugetlb-pool
  # mount -t hugetlbfs -o min_size=8M,pagesize=2M none ./hugetlb-pool

Then run the reproducer on the mountpoint:

  # ./reproducer ./hugetlb-pool/test

Fix it by taking the reservation from the spool if available.  In general,
avoid_reserve is IMHO more about "avoid the vma resv map", not the spool's
reservation.

I copied stable, however I have no intention of backporting if it's not a
clean cherry-pick, because a private hugetlb mapping with a fork() on top
is too rare to hit.

Cc: linux-stable <stable@vger.kernel.org>
Fixes: d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 22 +++-------------------
 1 file changed, 3 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cec4b121193f..9ce69fd22a01 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1394,8 +1394,7 @@ static unsigned long available_huge_pages(struct hstate *h)
 
 static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 				struct vm_area_struct *vma,
-				unsigned long address, int avoid_reserve,
-				long chg)
+				unsigned long address, long chg)
 {
 	struct folio *folio = NULL;
 	struct mempolicy *mpol;
@@ -1411,10 +1410,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 	if (!vma_has_reserves(vma, chg) && !available_huge_pages(h))
 		goto err;
 
-	/* If reserves cannot be used, ensure enough pages are in the pool */
-	if (avoid_reserve && !available_huge_pages(h))
-		goto err;
-
 	gfp_mask = htlb_alloc_mask(h);
 	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
 
@@ -1430,7 +1425,7 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 		folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
 							nid, nodemask);
 
-	if (folio && !avoid_reserve && vma_has_reserves(vma, chg)) {
+	if (folio && vma_has_reserves(vma, chg)) {
 		folio_set_hugetlb_restore_reserve(folio);
 		h->resv_huge_pages--;
 	}
@@ -3007,17 +3002,6 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		gbl_chg = hugepage_subpool_get_pages(spool, 1);
 		if (gbl_chg < 0)
 			goto out_end_reservation;
-
-		/*
-		 * Even though there was no reservation in the region/reserve
-		 * map, there could be reservations associated with the
-		 * subpool that can be used.  This would be indicated if the
-		 * return value of hugepage_subpool_get_pages() is zero.
-		 * However, if avoid_reserve is specified we still avoid even
-		 * the subpool reservations.
-		 */
-		if (avoid_reserve)
-			gbl_chg = 1;
 	}
 
 	/* If this allocation is not consuming a reservation, charge it now.
@@ -3040,7 +3024,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	 * from the global free pool (global change).  gbl_chg == 0 indicates
 	 * a reservation exists for the allocation.
 	 */
-	folio = dequeue_hugetlb_folio_vma(h, vma, addr, avoid_reserve, gbl_chg);
+	folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
 	if (!folio) {
 		spin_unlock_irq(&hugetlb_lock);
 		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
-- 
2.47.0
Re: [PATCH 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool
Posted by Ackerley Tng 1 year, 1 month ago
Peter Xu <peterx@redhat.com> writes:

> In commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a
> process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed"),
> avoid_reserve was introduced for a special case of CoW on hugetlb private
> mappings: the owner VMA is allocating yet another hugetlb folio that is
> not reserved within the private vma's reserve map.
>
> Later on, commit d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas
> hole punched by fallocate") made alloc_huge_page() refuse to consume any
> global reservation as long as avoid_reserve=true.  That doesn't look
> correct: even though it keeps the allocation from consuming the global
> reservation, it still takes one reservation from the spool (if a subpool
> exists).  And since spool reserved pages are taken from the global
> reservation, one global reservation is effectively consumed anyway.
>
> Logically, this can make the global reservation count go wrong.
>
> I wrote a reproducer below

Thank you so much for looking into this!

> <snip>

I was able to reproduce this using your
reproducer. /sys/kernel/mm/hugepages/hugepages-2048kB/resv_hugepages
is not decremented even after the reproducer exits.

  # sysctl vm.nr_hugepages=16 
  vm.nr_hugepages = 16
  # mkdir ./hugetlb-pool
  # mount -t hugetlbfs -o min_size=8M,pagesize=2M none ./hugetlb-pool
  # for i in $(seq 16); do ./a.out hugetlb-pool/test; cat /sys/kernel/mm/hugepages/hugepages-2048kB/resv_hugepages; done
  5
  6
  7
  8
  9
  10
  11
  12
  13
  14
  15
  16
  16
  16
  16
  16
  # 

I'll go over the rest of your patches and dig into the meaning of `avoid_reserve`.
Re: [PATCH 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool
Posted by Ackerley Tng 1 year, 1 month ago
Ackerley Tng <ackerleytng@google.com> writes:

> <snip>
>
> I'll go over the rest of your patches and dig into the meaning of `avoid_reserve`.

Yes, after looking into this more deeply, I agree that avoid_reserve
means avoiding the reservations in the resv_map rather than reservations
in the subpool or hstate.

Here's more detail on what's going on in the reproducer, which I wrote up
as I reviewed Peter's patch:

1. On fallocate(), allocate page A
2. On mmap(), set up a vma without VM_MAYSHARE since MAP_PRIVATE was requested
3. On faulting *buf = 1, allocate a new page B, copy A to B because the mmap request was MAP_PRIVATE
4. On fork, prep for COW by marking page as read only. Both parent and child share B.
5. On faulting *buf = 2 (write fault), allocate page C, copy B to C
    + B now belongs to the child; C belongs to (is owned by) the parent
6. Child exits, B is freed
7. On munmap(), C is freed
8. On unlink(), A is freed

When C was allocated in the parent (the owner of the MAP_PRIVATE page, doing
copy-on-write), spool->rsv_hpages was decreased but h->resv_huge_pages was
not.  This is the root of the bug.

We should decrement h->resv_huge_pages if a reserved page from the subpool was
used, rather than based on whether avoid_reserve or vma_has_reserves() is set.
If avoid_reserve is set, the subpool shouldn't be checked for a reservation,
so we won't be decrementing h->resv_huge_pages anyway.

I agree with Peter's fix as a whole (the entire patch series).

Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>

---

Some definitions which might be helpful:

+ h->resv_huge_pages indicates number of reserved pages globally.
    + This number increases when pages are reserved
    + This number decreases when reserved pages are allocated, or when pages are unreserved
+ spool->rsv_hpages indicates number of reserved pages in this subpool.
    + This number increases when pages are reserved
    + This number decreases when reserved pages are allocated, or when pages are unreserved
+ h->resv_huge_pages should be the sum of all subpools' spool->rsv_hpages.
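
As a toy model (purely illustrative, not the kernel code), the two
counters can be pictured like this:

  /* Toy model of the reservation counters (illustrative only). */
  struct hstate { long resv_huge_pages; };      /* global reserved pages   */
  struct hugepage_subpool { long rsv_hpages; }; /* per-subpool reservation */

  /* Reserving n pages in a subpool also reserves them globally... */
  void reserve(struct hstate *h, struct hugepage_subpool *spool, long n)
  {
          spool->rsv_hpages += n;
          h->resv_huge_pages += n;
  }

  /*
   * ...so consuming a subpool reservation must decrement both counters,
   * keeping h->resv_huge_pages equal to the sum of all subpools'
   * rsv_hpages (ignoring vma reservations here for simplicity).  The
   * missing global decrement is exactly the bug.
   */
  void consume_reserved(struct hstate *h, struct hugepage_subpool *spool)
  {
          spool->rsv_hpages--;
          h->resv_huge_pages--;
  }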

More details on the flow in alloc_hugetlb_folio() which might be helpful:

hugepage_subpool_get_pages() returns "the number of pages by which the global
pools must be adjusted (upward)".  This return value is never negative except
on error (hugepage_subpool_get_pages() is always called with a positive delta).

Specifically in alloc_hugetlb_folio(), the return value is either 0 or 1 (other
than errors).

If the return value is 0, the subpool had enough reservations and so we should
decrement h->resv_huge_pages.

If the return value is 1, it means that this subpool did not have any more
reserved hugepages, and we need to get a page from the global
hstate. dequeue_hugetlb_folio_vma() will get us a page that was already
allocated.
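
A sketch of the caller's view (simplified from alloc_hugetlb_folio(),
error handling elided):

  gbl_chg = hugepage_subpool_get_pages(spool, 1);
  if (gbl_chg < 0)
          goto out;       /* error, e.g. the subpool's max limit was hit */
  /*
   * gbl_chg == 0: the subpool had a reservation; the global pool needs
   *               no upward adjustment.
   * gbl_chg == 1: no subpool reservation; one page must come out of the
   *               global free pool.
   */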

In dequeue_hugetlb_folio_vma(), if the vma doesn't have enough reserves for 1
page and there are no available_huge_pages() left, we quit dequeueing, since
we will need to allocate a new page.  If avoid_reserve is requested, meaning
we don't want to use the vma's reserves in the resv_map, we also check
available_huge_pages().  If there are available huge pages, we go on to
dequeue a page.

Then, we determine whether to decrement h->resv_huge_pages.  We should
decrement based on whether a reserved page from the subpool was used, not on
whether avoid_reserve or vma_has_reserves() is set.
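
A sketch of the intended decision (simplified; in Peter's patch the
subpool/vma check is folded into vma_has_reserves() via the chg
parameter):

  folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
  if (folio && vma_has_reserves(vma, gbl_chg)) {
          /* a reservation (vma resv_map or subpool) backs this folio */
          folio_set_hugetlb_restore_reserve(folio);
          h->resv_huge_pages--;
  }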

In the case where a surplus page needs to be allocated, the surplus page isn't
and doesn't need to be associated with a subpool, so no subpool hugepage number
tracking updates are required. h->resv_huge_pages still has to be updated... is
this where h->resv_huge_pages can go negative?
Re: [PATCH 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool
Posted by Peter Xu 1 year, 1 month ago
On Fri, Dec 27, 2024 at 11:15:44PM +0000, Ackerley Tng wrote:
> Ackerley Tng <ackerleytng@google.com> writes:
> 
> > <snip>
> >
> > I'll go over the rest of your patches and dig into the meaning of `avoid_reserve`.
> 
> Yes, after looking into this more deeply, I agree that avoid_reserve
> means avoiding the reservations in the resv_map rather than reservations
> in the subpool or hstate.
> 
> Here's more detail of what's going on in the reproducer that I wrote as I
> reviewed Peter's patch:
> 
> 1. On fallocate(), allocate page A
> 2. On mmap(), set up a vma without VM_MAYSHARE since MAP_PRIVATE was requested
> 3. On faulting *buf = 1, allocate a new page B, copy A to B because the mmap request was MAP_PRIVATE
> 4. On fork, prep for COW by marking page as read only. Both parent and child share B.
> 5. On faulting *buf = 2 (write fault), allocate page C, copy B to C
>     + B now belongs to the child; C belongs to (is owned by) the parent
> 6. Child exits, B is freed
> 7. On munmap(), C is freed
> 8. On unlink(), A is freed
> 
> When C was allocated in the parent (the owner of the MAP_PRIVATE page, doing
> copy-on-write), spool->rsv_hpages was decreased but h->resv_huge_pages was
> not.  This is the root of the bug.
> 
> We should decrement h->resv_huge_pages if a reserved page from the subpool was
> used, rather than based on whether avoid_reserve or vma_has_reserves() is set.
> If avoid_reserve is set, the subpool shouldn't be checked for a reservation,
> so we won't be decrementing h->resv_huge_pages anyway.
> 
> I agree with Peter's fix as a whole (the entire patch series).
> 
> Reviewed-by: Ackerley Tng <ackerleytng@google.com>
> Tested-by: Ackerley Tng <ackerleytng@google.com>
> 
> ---
> 
> Some definitions which might be helpful:
> 
> + h->resv_huge_pages indicates number of reserved pages globally.
>     + This number increases when pages are reserved
>     + This number decreases when reserved pages are allocated, or when pages are unreserved
> + spool->rsv_hpages indicates number of reserved pages in this subpool.
>     + This number increases when pages are reserved
>     + This number decreases when reserved pages are allocated, or when pages are unreserved
> + h->resv_huge_pages should be the sum of all subpools' spool->rsv_hpages.

I think you're correct.  One add-on comment: I think when taking vma
reservations into account, the global reservation should be a sum of
all spools' and all vmas' reservations.

> 
> More details on the flow in alloc_hugetlb_folio() which might be helpful:
> 
> hugepage_subpool_get_pages() returns "the number of pages by which the global
> pools must be adjusted (upward)".  This return value is never negative except
> on error (hugepage_subpool_get_pages() is always called with a positive delta).
> 
> Specifically in alloc_hugetlb_folio(), the return value is either 0 or 1 (other
> than errors).
> 
> If the return value is 0, the subpool had enough reservations and so we should
> decrement h->resv_huge_pages.
> 
> If the return value is 1, it means that this subpool did not have any more
> reserved hugepages, and we need to get a page from the global
> hstate. dequeue_hugetlb_folio_vma() will get us a page that was already
> allocated.
> 
> In dequeue_hugetlb_folio_vma(), if the vma doesn't have enough reserves for 1
> page and there are no available_huge_pages() left, we quit dequeueing, since
> we will need to allocate a new page.  If avoid_reserve is requested, meaning
> we don't want to use the vma's reserves in the resv_map, we also check
> available_huge_pages().  If there are available huge pages, we go on to
> dequeue a page.
> 
> Then, we determine whether to decrement h->resv_huge_pages.  We should
> decrement based on whether a reserved page from the subpool was used, not on
> whether avoid_reserve or vma_has_reserves() is set.
> 
> In the case where a surplus page needs to be allocated, the surplus page isn't
> and doesn't need to be associated with a subpool, so no subpool hugepage number
> tracking updates are required. h->resv_huge_pages still has to be updated... is
> this where h->resv_huge_pages can go negative?

This question doesn't sound relevant to the specific scenario that this
patch (or the reproducer attached to it) is about.  In this patch's
reproducer, no surplus page needs to be involved.

Going back to the question you're asking - I don't think resv_huge_pages
will go negative for the surplus case?

IIUC updating resv_huge_pages is the correct behavior even for surplus
pages, as long as gbl_chg==0.

The initial change was done by Naoya in commit a88c76954804 ("mm: hugetlb:
fix hugepage memory leak caused by wrong reserve count").  There is more
information in the commit log.  In general, when gbl_chg==0 it means we
consumed a global reservation, either in the vma or in the spool, so it must be
accounted globally after the folio is successfully allocated.  Here "being
accounted" should mean the global resv count will be properly decremented.
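
Roughly, a sketch of that fallback path (simplified, not the exact code):

  /* Fallback when nothing could be dequeued from the free pool. */
  folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
  if (folio && gbl_chg == 0) {
          /* a reservation (vma or spool) was consumed: account it */
          folio_set_hugetlb_restore_reserve(folio);
          h->resv_huge_pages--;
  }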

Thanks for taking a look, Ackerley!

-- 
Peter Xu