The acceleration of THP was done with ctx.page_mask; however, it is
ignored if **pages is non-NULL.

The old optimization was introduced in 2013 in commit 240aadeedc4a ("mm:
accelerate mm_populate() treatment of THP pages").  It didn't explain why
the **pages non-NULL case can't be optimized; possibly the major goal at
that time was mm_populate(), for which this was enough.

Optimize THP for all cases by properly looping over each subpage, doing
cache flushes, and boosting refcounts / pincounts where needed in one go.

This can be verified using gup_test below:

# chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10

Before: 13992.50 ( +-8.75%)
After: 378.50 (+-69.62%)

Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/gup.c | 36 +++++++++++++++++++++++++++++-------
1 file changed, 29 insertions(+), 7 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index a2d1b3c4b104..cdabc8ea783b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1210,16 +1210,38 @@ static long __get_user_pages(struct mm_struct *mm,
goto out;
}
next_page:
- if (pages) {
- pages[i] = page;
- flush_anon_page(vma, page, start);
- flush_dcache_page(page);
- ctx.page_mask = 0;
- }
-
page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
if (page_increm > nr_pages)
page_increm = nr_pages;
+
+ if (pages) {
+ struct page *subpage;
+ unsigned int j;
+
+ /*
+ * This must be a large folio (and doesn't need to
+ * be the whole folio; it can be part of it), do
+ * the refcount work for all the subpages too.
+ * Since we already hold refcount on the head page,
+ * it should never fail.
+ *
+ * NOTE: here the page may not be the head page
+ * e.g. when start addr is not thp-size aligned.
+ */
+ if (page_increm > 1)
+ WARN_ON_ONCE(
+ try_grab_folio(compound_head(page),
+ page_increm - 1,
+ foll_flags) == NULL);
+
+ for (j = 0; j < page_increm; j++) {
+ subpage = nth_page(page, j);
+ pages[i+j] = subpage;
+ flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
+ flush_dcache_page(subpage);
+ }
+ }
+
i += page_increm;
start += page_increm * PAGE_SIZE;
nr_pages -= page_increm;
--
2.40.1
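For reference, the page_increm computation above evaluates to "the number
of base pages left from 'start' until the end of the current huge page".
A minimal standalone sketch of that arithmetic (not kernel code; assumes
4 KiB pages and a 2 MiB PMD-mapped THP, so ctx.page_mask == 511):

	#include <stdio.h>

	#define PAGE_SHIFT	12
	#define PAGE_SIZE	(1UL << PAGE_SHIFT)
	#define THP_PAGE_MASK	511UL	/* stand-in for ctx.page_mask */

	int main(void)
	{
		/* start sits 16 pages (64 KiB) into a 2 MiB-aligned THP */
		unsigned long start = 0x200000UL + 16 * PAGE_SIZE;
		unsigned long page_increm;

		/* same expression as in __get_user_pages() */
		page_increm = 1 + (~(start >> PAGE_SHIFT) & THP_PAGE_MASK);

		/* prints 496, i.e. 512 - 16 pages remaining in this THP */
		printf("page_increm = %lu\n", page_increm);
		return 0;
	}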
On Tue, Jun 13, 2023 at 05:53:45PM -0400, Peter Xu wrote:
> The acceleration of THP was done with ctx.page_mask; however, it is
> ignored if **pages is non-NULL.
>
> The old optimization was introduced in 2013 in commit 240aadeedc4a ("mm:
> accelerate mm_populate() treatment of THP pages").  It didn't explain why
> the **pages non-NULL case can't be optimized; possibly the major goal at
> that time was mm_populate(), for which this was enough.
>
> Optimize THP for all cases by properly looping over each subpage, doing
> cache flushes, and boosting refcounts / pincounts where needed in one go.
>
> This can be verified using gup_test below:
>
> # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10
>
> Before: 13992.50 ( +-8.75%)
> After: 378.50 (+-69.62%)
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> mm/gup.c | 36 +++++++++++++++++++++++++++++-------
> 1 file changed, 29 insertions(+), 7 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index a2d1b3c4b104..cdabc8ea783b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1210,16 +1210,38 @@ static long __get_user_pages(struct mm_struct *mm,
> goto out;
> }
> next_page:
> - if (pages) {
> - pages[i] = page;
> - flush_anon_page(vma, page, start);
> - flush_dcache_page(page);
> - ctx.page_mask = 0;
> - }
> -
> page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
> if (page_increm > nr_pages)
> page_increm = nr_pages;
> +
> + if (pages) {
> + struct page *subpage;
> + unsigned int j;
> +
> + /*
> + * This must be a large folio (and doesn't need to
> + * be the whole folio; it can be part of it), do
> + * the refcount work for all the subpages too.
> + * Since we already hold refcount on the head page,
> + * it should never fail.
> + *
> + * NOTE: here the page may not be the head page
> + * e.g. when start addr is not thp-size aligned.
> + */
> + if (page_increm > 1)
> + WARN_ON_ONCE(
> + try_grab_folio(compound_head(page),
> + page_increm - 1,
> + foll_flags) == NULL);
I'm not sure this should warn but otherwise ignore a NULL return? This
feels like a case that could come up in reality,
e.g. folio_ref_try_add_rcu() fails, or !folio_is_longterm_pinnable().
Side-note: I _hate_ the semantics of GUP such that try_grab_folio()
(invoked, other than for huge page cases, by the GUP-fast logic) will
explicitly fail if neither FOLL_GET nor FOLL_PIN is specified,
differentiating it from try_grab_page() in this respect.
This is a side-note and not relevant here, as all callers to
__get_user_pages() either explicitly set FOLL_GET if not set by user (in
__get_user_pages_locked()) or don't set pages (e.g. in
faultin_vma_page_range()).
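That defaulting is roughly the below (paraphrased from memory, not a
verbatim quote of the source):

	/* roughly what __get_user_pages_locked() does for the legacy case */
	if (pages && !(flags & FOLL_PIN))
		flags |= FOLL_GET;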
> +
> + for (j = 0; j < page_increm; j++) {
> + subpage = nth_page(page, j);
> + pages[i+j] = subpage;
> + flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> + flush_dcache_page(subpage);
> + }
> + }
> +
> i += page_increm;
> start += page_increm * PAGE_SIZE;
> nr_pages -= page_increm;
> --
> 2.40.1
>
On Sat, Jun 17, 2023 at 09:27:22PM +0100, Lorenzo Stoakes wrote:
> On Tue, Jun 13, 2023 at 05:53:45PM -0400, Peter Xu wrote:
> > The acceleration of THP was done with ctx.page_mask; however, it is
> > ignored if **pages is non-NULL.
> >
> > The old optimization was introduced in 2013 in commit 240aadeedc4a ("mm:
> > accelerate mm_populate() treatment of THP pages").  It didn't explain why
> > the **pages non-NULL case can't be optimized; possibly the major goal at
> > that time was mm_populate(), for which this was enough.
> >
> > Optimize THP for all cases by properly looping over each subpage, doing
> > cache flushes, and boosting refcounts / pincounts where needed in one go.
> >
> > This can be verified using gup_test below:
> >
> > # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10
> >
> > Before: 13992.50 ( +-8.75%)
> > After: 378.50 (+-69.62%)
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > mm/gup.c | 36 +++++++++++++++++++++++++++++-------
> > 1 file changed, 29 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index a2d1b3c4b104..cdabc8ea783b 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1210,16 +1210,38 @@ static long __get_user_pages(struct mm_struct *mm,
> > goto out;
> > }
> > next_page:
> > - if (pages) {
> > - pages[i] = page;
> > - flush_anon_page(vma, page, start);
> > - flush_dcache_page(page);
> > - ctx.page_mask = 0;
> > - }
> > -
> > page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
> > if (page_increm > nr_pages)
> > page_increm = nr_pages;
> > +
> > + if (pages) {
> > + struct page *subpage;
> > + unsigned int j;
> > +
> > + /*
> > + * This must be a large folio (and doesn't need to
> > + * be the whole folio; it can be part of it), do
> > + * the refcount work for all the subpages too.
> > + * Since we already hold refcount on the head page,
> > + * it should never fail.
> > + *
> > + * NOTE: here the page may not be the head page
> > + * e.g. when start addr is not thp-size aligned.
> > + */
> > + if (page_increm > 1)
> > + WARN_ON_ONCE(
> > + try_grab_folio(compound_head(page),
> > + page_increm - 1,
> > + foll_flags) == NULL);
>
> I'm not sure this should warn but otherwise ignore a NULL return? This
> feels like a case that could come up in reality,
> e.g. folio_ref_try_add_rcu() fails, or !folio_is_longterm_pinnable().
Note that we already hold at least 1 refcount on the folio (also mentioned
in the comment above this chunk of code), so both folio_ref_try_add_rcu()
and folio_is_longterm_pinnable() should already have been called on the
same folio and passed.  If it were going to fail, it should have failed
already, afaict.  I still don't see how that could trigger unless the
refcount overflows.

What I can do here is still guard this try_grab_folio() and fail the GUP
if it fails for any reason.  Perhaps that also means I'll keep the one in
hugetlb_follow_page_mask() untouched too.  But I suppose keeping the
WARN_ON_ONCE() is still proper.
Thanks,
--
Peter Xu
On Mon, Jun 19, 2023 at 03:37:30PM -0400, Peter Xu wrote:
> What I can do here is still guard this try_grab_folio() and fail the GUP
> if it fails for any reason.  Perhaps that also means I'll keep the one in
> hugetlb_follow_page_mask() untouched too.  But I suppose keeping the
> WARN_ON_ONCE() is still proper.
Here's what I plan to post in the new version, taking care of
try_grab_folio() failures if they ever happen, and meanwhile removing the
compound_head() redundancy on the page.
__get_user_pages():
...
===8<===
/*
* This must be a large folio (and doesn't need to
* be the whole folio; it can be part of it), do
* the refcount work for all the subpages too.
*
* NOTE: here the page may not be the head page
* e.g. when start addr is not thp-size aligned.
* try_grab_folio() should have taken care of tail
* pages.
*/
if (page_increm > 1) {
struct folio *folio;
/*
* Since we already hold refcount on the
* large folio, this should never fail.
*/
folio = try_grab_folio(page, page_increm - 1,
foll_flags);
if (WARN_ON_ONCE(!folio)) {
/*
* Release the 1st page ref if the
* folio is problematic, fail hard.
*/
gup_put_folio(page_folio(page), 1,
foll_flags);
ret = -EFAULT;
goto out;
}
}
===8<===
Thanks,
--
Peter Xu
On Tue, Jun 13, 2023 at 05:53:45PM -0400, Peter Xu wrote:
> + if (page_increm > 1)
> + WARN_ON_ONCE(
> + try_grab_folio(compound_head(page),
You don't need to call compound_head() here; try_grab_folio() works
on tail pages just fine.
> + page_increm - 1,
> + foll_flags) == NULL);
> +
> + for (j = 0; j < page_increm; j++) {
> + subpage = nth_page(page, j);
> + pages[i+j] = subpage;
> + flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> + flush_dcache_page(subpage);
You're better off calling flush_dcache_folio() right at the end.
On Wed, Jun 14, 2023 at 03:58:34PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 13, 2023 at 05:53:45PM -0400, Peter Xu wrote:
> > + if (page_increm > 1)
> > + WARN_ON_ONCE(
> > + try_grab_folio(compound_head(page),
>
> You don't need to call compound_head() here; try_grab_folio() works
> on tail pages just fine.
I did it with caution because of two things I'm not sure about: both
is_pci_p2pdma_page() and is_longterm_pinnable_page() inside it call
is_zone_device_page() on the page*.

But I just noticed try_grab_folio() is also used in gup_pte_range(), where
the THP can be PTE-mapped, so I assume it at least needs to handle tail
pages well there.

Do we perhaps need the compound_head() in try_grab_folio() as a separate
patch?  Or maybe I was wrong about is_zone_device_page()?
>
> > + page_increm - 1,
> > + foll_flags) == NULL);
> > +
> > + for (j = 0; j < page_increm; j++) {
> > + subpage = nth_page(page, j);
> > + pages[i+j] = subpage;
> > + flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> > + flush_dcache_page(subpage);
>
> You're better off calling flush_dcache_folio() right at the end.
Will do.
Thanks,
--
Peter Xu
On Wed, Jun 14, 2023 at 11:19:48AM -0400, Peter Xu wrote:
> > > + for (j = 0; j < page_increm; j++) {
> > > + subpage = nth_page(page, j);
> > > + pages[i+j] = subpage;
> > > + flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> > > + flush_dcache_page(subpage);
> >
> > You're better off calling flush_dcache_folio() right at the end.
>
> Will do.
Ah, when I started to modify it I noticed it's a double-edged sword: we'd
then also flush the dcache over the whole folio even if we gup a single
page.  We'd only start to benefit once some arch actually implements
flush_dcache_folio() (which seems to be none, right now..), while we'd
already lose by amplifying the flush when gup'ing only part of a folio.

Perhaps I should keep it as-is, which is still accurate, always faster
than the old code, and definitely not a regression in any form?
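For illustration, the folio-flush variant would look roughly like below
(untested sketch), where the dcache flush then covers the entire folio
even when page_increm spans only part of it:

	for (j = 0; j < page_increm; j++) {
		subpage = nth_page(page, j);
		pages[i + j] = subpage;
		flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
	}
	/* one folio-wide flush instead of per-subpage flush_dcache_page() */
	flush_dcache_folio(page_folio(page));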
--
Peter Xu