[PATCH v2 1/2] mm/gup: stop leaking pinned pages in low memory conditions

John Hubbard posted 2 patches 1 month, 1 week ago
There is a newer version of this series
Posted by John Hubbard 1 month, 1 week ago
If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM)
family of functions, and requests "too many" pages, then the call will
erroneously leave pages pinned. This is visible in user space as an
actual memory leak.

Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM) calls
to exhaust memory.

The root cause of the problem is this sequence, within
__gup_longterm_locked():

    __get_user_pages_locked()
    rc = check_and_migrate_movable_pages()

...which gets retried in a loop. The loop error handling is incomplete,
clearly due to a somewhat unusual and complicated tri-state error API.
But anyway, if -ENOMEM, or in fact, any unexpected error is returned
from check_and_migrate_movable_pages(), then __gup_longterm_locked()
happily returns the error, while leaving the pages pinned.

In the failed case, which is an app that requests (via a device driver)
30720000000 bytes to be pinned, and then exits, I see this:

    $ grep foll /proc/vmstat
        nr_foll_pin_acquired 7502048
        nr_foll_pin_released 2048

And after applying this patch, it returns to balanced pins:

    $ grep foll /proc/vmstat
        nr_foll_pin_acquired 7502048
        nr_foll_pin_released 7502048
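
The two listings above can be checked mechanically. The following is a minimal userspace sketch, not kernel code: the helper names vmstat_value() and foll_pins_outstanding() are hypothetical, and the input is assumed to be text in the "<name> <value>" per-line form shown above. For the failed case it yields 7502048 - 2048 = 7500000 outstanding pins, which matches the 30720000000-byte request if pages are 4 KiB (an assumption; the page size is not stated here).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the value of counter `name` in a /proc/vmstat-style buffer
 * ("<name> <value>" per line), or -1 if the counter is absent.
 * Hypothetical helper, for illustration only.
 */
static long long vmstat_value(const char *text, const char *name)
{
	size_t len = strlen(name);
	const char *line = text;

	while (line && *line) {
		/* Skip leading indentation, as in the listings above. */
		while (*line == ' ' || *line == '\t')
			line++;
		if (strncmp(line, name, len) == 0 && line[len] == ' ') {
			long long v;

			if (sscanf(line + len, "%lld", &v) == 1)
				return v;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Pins still outstanding: acquired minus released, or -1 on parse failure. */
static long long foll_pins_outstanding(const char *text)
{
	long long acq = vmstat_value(text, "nr_foll_pin_acquired");
	long long rel = vmstat_value(text, "nr_foll_pin_released");

	if (acq < 0 || rel < 0)
		return -1;
	return acq - rel;
}
```

A nonzero result after the pinning process has exited is the leak signature; zero means the pins are balanced.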

Fix this by unpinning the pages that __get_user_pages_locked() has
pinned, in such error cases.

Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes")
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Shigeru Yoshida <syoshida@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index a82890b46a36..233c284e8e66 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2492,6 +2492,17 @@ static long __gup_longterm_locked(struct mm_struct *mm,
 
 		/* FOLL_LONGTERM implies FOLL_PIN */
 		rc = check_and_migrate_movable_pages(nr_pinned_pages, pages);
+
+		/*
+		 * The __get_user_pages_locked() call happens before we know if
+		 * it's possible to successfully complete the whole operation.
+		 * To compensate for this, if we get an unexpected error (such
+		 * as -ENOMEM) then we must unpin everything, before erroring
+		 * out.
+		 */
+		if (rc != -EAGAIN && rc != 0)
+			unpin_user_pages(pages, nr_pinned_pages);
+
 	} while (rc == -EAGAIN);
 	memalloc_pin_restore(flags);
 	return rc ? rc : nr_pinned_pages;
-- 
2.47.0
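
For readers following along, the control flow that the hunk above changes can be modeled in userspace. This is only a sketch under stated assumptions: every model_* name and the scripted return codes are hypothetical stand-ins, the counters merely imitate nr_foll_pin_acquired/nr_foll_pin_released, and nothing here is the kernel implementation. It shows the do/while retry loop leaking one batch of pins when -ENOMEM arrives, and staying balanced once the "rc != -EAGAIN && rc != 0" check is added.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Model-only pin accounting, loosely mirroring the vmstat counters. */
static long pins_acquired, pins_released;

/* Stand-in for __get_user_pages_locked(): pins nr pages. */
static long model_pin(long nr)
{
	pins_acquired += nr;
	return nr;
}

/* Stand-in for unpin_user_pages(). */
static void model_unpin(long nr)
{
	pins_released += nr;
}

/*
 * Stand-in for check_and_migrate_movable_pages() as the commit message
 * describes it: -EAGAIN comes back with the pages already unpinned (the
 * caller must re-pin), while -ENOMEM comes back with them still pinned.
 * Return codes are replayed from `script`.
 */
static int model_check(long nr, const int *script, int *step)
{
	int rc = script[(*step)++];

	if (rc == -EAGAIN)
		model_unpin(nr);
	return rc;
}

/* Model of the loop in __gup_longterm_locked(), with or without the fix. */
static long model_gup_longterm(long nr, const int *script, bool with_fix)
{
	int step = 0, rc;
	long nr_pinned;

	do {
		nr_pinned = model_pin(nr);
		rc = model_check(nr_pinned, script, &step);

		/* The check added by this patch. */
		if (with_fix && rc != -EAGAIN && rc != 0)
			model_unpin(nr_pinned);
	} while (rc == -EAGAIN);

	return rc ? rc : nr_pinned;
}
```

Replaying the script { -EAGAIN, -ENOMEM } with with_fix set to false leaves acquired and released apart by one batch; with with_fix set to true they end up equal.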
Re: [PATCH v2 1/2] mm/gup: stop leaking pinned pages in low memory conditions
Posted by David Hildenbrand 1 month, 1 week ago
On 18.10.24 03:17, John Hubbard wrote:
> If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM)
> family of functions, and requests "too many" pages, then the call will
> erroneously leave pages pinned. This is visible in user space as an
> actual memory leak.
> 
> Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM) calls
> to exhaust memory.
> 
> The root cause of the problem is this sequence, within
> __gup_longterm_locked():
> 
>      __get_user_pages_locked()
>      rc = check_and_migrate_movable_pages()
> 
> ...which gets retried in a loop. The loop error handling is incomplete,
> clearly due to a somewhat unusual and complicated tri-state error API.
> But anyway, if -ENOMEM, or in fact, any unexpected error is returned
> from check_and_migrate_movable_pages(), then __gup_longterm_locked()
> happily returns the error, while leaving the pages pinned.

Sorry for another comment, I am taking my time to look into the code again in more detail ...

migrate_longterm_unpinnable_folios() will always unpin all pages, no matter which error it returns:

a) If it returns -EAGAIN, it has already unpinned all folios.
b) If it returns any other error, it first calls unpin_folios().

So shouldn't the fix just be in check_and_migrate_movable_pages()?

diff --git a/mm/gup.c b/mm/gup.c
index a82890b46a36..81fc8314e687 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2403,8 +2403,9 @@ static int migrate_longterm_unpinnable_folios(
   * -EAGAIN. The caller should re-pin the entire range with FOLL_PIN and then
   * call this routine again.
   *
- * If an error other than -EAGAIN occurs, this indicates a migration failure.
- * The caller should give up, and propagate the error back up the call stack.
+ * If an error occurs, all folios are unpinned. If an error other than
+ * -EAGAIN occurs, this indicates a migration failure. The caller should give
+ * up, and propagate the error back up the call stack.
   *
   * If everything is OK and all folios in the range are allowed to be pinned,
   * then this routine leaves all folios pinned and returns zero for success.
@@ -2437,8 +2438,10 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
         long i, ret;
  
         folios = kmalloc_array(nr_pages, sizeof(*folios), GFP_KERNEL);
-       if (!folios)
+       if (!folios) {
+               unpin_user_pages(pages, nr_pages);
                 return -ENOMEM;
+       }
  
         for (i = 0; i < nr_pages; i++)
                 folios[i] = page_folio(pages[i]);



Then, check_and_migrate_movable_pages() will never return an error while
still leaving folios pinned.


If check_and_migrate_movable_pages() -> check_and_migrate_movable_folios()
returns "0", all folios remain pinned and no harm is done.


Consequently, I think patch #2 is not really required, because it doesn't
perform the temporary allocation that could fail with -ENOMEM.
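
The contract proposed above (the callee never returns an error while leaving pages pinned) can be sketched in userspace as follows. This is a model under stated assumptions, not the kernel code: the m_* names and the alloc_fails fault-injection knob are hypothetical, calloc() stands in for kmalloc_array(), and the -EAGAIN retry loop is omitted for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Model-only pin accounting. */
static long m_acquired, m_released;

/* Stand-in for unpin_user_pages(). */
static void m_unpin(long nr)
{
	m_released += nr;
}

/*
 * Model of check_and_migrate_movable_pages() with the proposed change:
 * if the temporary array cannot be allocated, unpin before returning
 * -ENOMEM. `alloc_fails` is a hypothetical fault-injection knob.
 */
static int m_check_and_migrate(long nr_pages, bool alloc_fails)
{
	void **folios = NULL;

	if (!alloc_fails)
		folios = calloc((size_t)nr_pages, sizeof(*folios));
	if (!folios) {
		m_unpin(nr_pages);
		return -ENOMEM;
	}

	/* Model: nothing needs migrating, so everything stays pinned. */
	free(folios);
	return 0;
}

/* With that contract, the caller needs no cleanup of its own on error. */
static long m_gup_longterm(long nr, bool alloc_fails)
{
	int rc;

	m_acquired += nr;	/* stand-in for __get_user_pages_locked() */
	rc = m_check_and_migrate(nr, alloc_fails);
	return rc ? rc : nr;
}
```

The design trade-off is the one raised in the follow-up: the cleanup moves into the callee, so the caller stays simple, but the callee's comment has to state clearly that an error return means the pages are already unpinned.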


Sorry for taking a closer look only now ...

-- 
Cheers,

David / dhildenb
Re: [PATCH v2 1/2] mm/gup: stop leaking pinned pages in low memory conditions
Posted by John Hubbard 1 month, 1 week ago
On 10/18/24 12:47 AM, David Hildenbrand wrote:
> On 18.10.24 03:17, John Hubbard wrote:
>> If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM)
>> family of functions, and requests "too many" pages, then the call will
>> erroneously leave pages pinned. This is visible in user space as an
>> actual memory leak.
>>
>> Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM) calls
>> to exhaust memory.
>>
>> The root cause of the problem is this sequence, within
>> __gup_longterm_locked():
>>
>>      __get_user_pages_locked()
>>      rc = check_and_migrate_movable_pages()
>>
>> ...which gets retried in a loop. The loop error handling is incomplete,
>> clearly due to a somewhat unusual and complicated tri-state error API.
>> But anyway, if -ENOMEM, or in fact, any unexpected error is returned
>> from check_and_migrate_movable_pages(), then __gup_longterm_locked()
>> happily returns the error, while leaving the pages pinned.
> 
> Sorry for another comment, I am taking my time to look into the code 
> again in more detail ...
> 
> migrate_longterm_unpinnable_folios() will always unpin all pages: no 
> matter which error it returns.
> 
> a) If it returns -EAGAIN, it unpinned all folios
> b) If it returns any error it first calls unpin_folios().
> 
> So shouldn't the fix just be in check_and_migrate_movable_pages()?

OK, sure. It's a little odd from a layering point of view, because the callee
"helpfully" unpins the pages for you (wheee!), but the updated comment
highlights that, at least.

And actually this whole thing of "pin the pages, just for a short time, even
though you're not allowed to" is partly why this area is so entertaining.

> 
> diff --git a/mm/gup.c b/mm/gup.c
> index a82890b46a36..81fc8314e687 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2403,8 +2403,9 @@ static int migrate_longterm_unpinnable_folios(
>    * -EAGAIN. The caller should re-pin the entire range with FOLL_PIN and then
>    * call this routine again.
>    *
> - * If an error other than -EAGAIN occurs, this indicates a migration failure.
> - * The caller should give up, and propagate the error back up the call stack.
> + * If an error occurs, all folios are unpinned. If an error other than
> + * -EAGAIN occurs, this indicates a migration failure. The caller should give
> + * up, and propagate the error back up the call stack.
>    *
>    * If everything is OK and all folios in the range are allowed to be pinned,
>    * then this routine leaves all folios pinned and returns zero for success.
> @@ -2437,8 +2438,10 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>          long i, ret;
> 
>          folios = kmalloc_array(nr_pages, sizeof(*folios), GFP_KERNEL);
> -       if (!folios)
> +       if (!folios) {
> +               unpin_user_pages(pages, nr_pages);
>                  return -ENOMEM;
> +       }
> 
>          for (i = 0; i < nr_pages; i++)
>                  folios[i] = page_folio(pages[i]);
> 
> 
> 
> Then, check_and_migrate_movable_pages() will never return with an error and
> having folios pinned.
> 
> 
> If check_and_migrate_movable_pages() -> check_and_migrate_movable_folios()
> returns "0", all folios remain pinned and no harm is done.
> 
> 
> Consequently, I think patch #2 is not really required, because it doesn't
> perform the temporary allocation that could fail with -ENOMEM.
> 

Yes!

> 
> Sorry for taking a closer look only now ...
> 

It's all still in review, so the timing is perfectly fine. I really
appreciate the closer look, it's definitely making things better.


thanks,
-- 
John Hubbard

Re: [PATCH v2 1/2] mm/gup: stop leaking pinned pages in low memory conditions
Posted by Alistair Popple 1 month ago
John Hubbard <jhubbard@nvidia.com> writes:

> On 10/18/24 12:47 AM, David Hildenbrand wrote:
>> On 18.10.24 03:17, John Hubbard wrote:

[...]

> And actually this whole thing of "pin the pages, just for a short time, even
> though you're not allowed to" is partly why this area is so entertaining.

I'm looking at your v3 but as an aside I disagree with this
statement. AFAIK you're always allowed to pin the pages for a short time
(ie. !FOLL_LONGTERM), or did I misunderstand your comment?

>> diff --git a/mm/gup.c b/mm/gup.c
>> index a82890b46a36..81fc8314e687 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2403,8 +2403,9 @@ static int migrate_longterm_unpinnable_folios(
>>    * -EAGAIN. The caller should re-pin the entire range with FOLL_PIN and then
>>    * call this routine again.
>>    *
>> - * If an error other than -EAGAIN occurs, this indicates a migration failure.
>> - * The caller should give up, and propagate the error back up the call stack.
>> + * If an error occurs, all folios are unpinned. If an error other than
>> + * -EAGAIN occurs, this indicates a migration failure. The caller should give
>> + * up, and propagate the error back up the call stack.
>>    *
>>    * If everything is OK and all folios in the range are allowed to be pinned,
>>    * then this routine leaves all folios pinned and returns zero for success.
>> @@ -2437,8 +2438,10 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>>          long i, ret;
>>
>>          folios = kmalloc_array(nr_pages, sizeof(*folios), GFP_KERNEL);
>> -       if (!folios)
>> +       if (!folios) {
>> +               unpin_user_pages(pages, nr_pages);
>>                  return -ENOMEM;
>> +       }
>>
>>          for (i = 0; i < nr_pages; i++)
>>                  folios[i] = page_folio(pages[i]);
>>
>> Then, check_and_migrate_movable_pages() will never return with an error and
>> having folios pinned.
>>
>> If check_and_migrate_movable_pages() -> check_and_migrate_movable_folios()
>> returns "0", all folios remain pinned and no harm is done.
>>
>> Consequently, I think patch #2 is not really required, because it doesn't
>> perform the temporary allocation that could fail with -ENOMEM.
>> 
>
> Yes!
>
>> Sorry for taking a closer look only now ...
>> 
>
> It's all still in review, so the timing is perfectly fine. I really
> appreciate the closer look, it's definitely making things better.
>
>
> thanks,
Re: [PATCH v2 1/2] mm/gup: stop leaking pinned pages in low memory conditions
Posted by John Hubbard 1 month ago
On 10/20/24 3:59 PM, Alistair Popple wrote:
> John Hubbard <jhubbard@nvidia.com> writes:
>> On 10/18/24 12:47 AM, David Hildenbrand wrote:
>>> On 18.10.24 03:17, John Hubbard wrote:
> [...]
>> And actually this whole thing of "pin the pages, just for a short time, even
>> though you're not allowed to" is partly why this area is so entertaining.
> 
> I'm looking at your v3 but as an aside I disagree with this
> statement. AFAIK you're always allowed to pin the pages for a short time
> (ie. !FOLL_LONGTERM), or did I misunderstand your comment?

Sort of: short term pins are allowed, but consider this point in the
code:

pin_user_pages(FOLL_PIN | FOLL_LONGTERM)
     __gup_longterm_locked()
          __get_user_pages_locked(FOLL_PIN | FOLL_LONGTERM)

Here, just before calling check_and_migrate_movable_pages(), we have
already filtered out any cases other than (FOLL_PIN | FOLL_LONGTERM).

And that means that code has taken a *longterm* pin of presumably short
duration (this incongruity bothers me), on pages that are not actually
allowed to be long term pinned. That also feels imperfect, even though
it is supposedly short duration...except that page migration is only
sort of short...hmmm.
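
The filtering described above can be sketched as a small predicate. This is a userspace illustration only: the MODEL_FOLL_* bit values are made up for the example and are not taken from the kernel headers, and model_takes_migrate_path() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only, not taken from the kernel headers. */
#define MODEL_FOLL_PIN      (1u << 0)
#define MODEL_FOLL_LONGTERM (1u << 1)

/*
 * True when a request would reach the pin-then-check_and_migrate sequence
 * discussed above, i.e. when it carries (FOLL_PIN | FOLL_LONGTERM).
 */
static bool model_takes_migrate_path(unsigned int gup_flags)
{
	/* Without FOLL_LONGTERM, pages are simply pinned and returned. */
	if (!(gup_flags & MODEL_FOLL_LONGTERM))
		return false;

	/* FOLL_LONGTERM implies FOLL_PIN (see the comment in the patch). */
	return (gup_flags & MODEL_FOLL_PIN) != 0;
}
```

Under this model, a plain short-term pin never enters the migration path, which is why the pin taken there is nominally long-term even though it is expected to be brief.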

I'm starting to think that migrating any ZONE_MOVABLE pages away first
might be better.

Since I'm already preparing that "wait for folio refcount" idea for
migration, which is almost related, I'll take a closer look at this
idea while I'm at it.


thanks,
-- 
John Hubbard