[v1] smaps / mm/gup: fix gup_can_follow_protnone fallout

[PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by David Hildenbrand 2 years, 6 months ago

This is my proposal on how to handle the fallout of 474098edac26
("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I
accidentially missed that follow_page() and smaps implicitly kept the
FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to
not trigger faults on PROT_NONE-mapped PTEs.

(maybe it's just me who considers that confusing)

Patch #1 is the original fix proposal, which patch #3 cleans up. Patch #2
is another fix for the issue on the follow_page() level pointed out by
Peter. Patch #4 documents the FOLL_FORCE situation.

Peter prefers a revert of that commit [1], I disagree and am still happy to
see FOLL_NUMA gone that implicitly relied on FOLL_FORCE.

An alternative might be to use an internal FOLL_PROTNONE or
FOLL_NO_PROTNONE flag in patch #3, not so sure about that.

Did a quick sanity test, will do more testing tomorrow.

[1] https://lkml.kernel.org/r/ZMK+jSDgOmJKySTr@x1n

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: liubo <liubo254@huawei.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>

David Hildenbrand (3):
  mm/gup: Make follow_page() succeed again on PROT_NONE PTEs/PMDs
  smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()
  mm/gup: document FOLL_FORCE behavior

liubo (1):
  smaps: Fix the abnormal memory statistics obtained through
    /proc/pid/smaps

 fs/proc/task_mmu.c       |  3 +--
 include/linux/mm_types.h | 25 ++++++++++++++++++++++++-
 mm/gup.c                 | 10 +++++++++-
 3 files changed, 34 insertions(+), 4 deletions(-)

-- 
2.41.0

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by Linus Torvalds 2 years, 6 months ago

On Thu, 27 Jul 2023 at 14:28, David Hildenbrand <david@redhat.com> wrote:
>
> This is my proposal on how to handle the fallout of 474098edac26
> ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I
> accidentially missed that follow_page() and smaps implicitly kept the
> FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to
> not trigger faults on PROT_NONE-mapped PTEs.

Ugh.

I hate how it uses FOLL_FORCE that is inherently scary.

Why do we have that "gup_can_follow_protnone()" logic AT ALL?

Couldn't we just get rid of that disgusting thing, and just say that
GUP (and follow_page()) always just ignores NUMA hinting, and always
just follows protnone?

We literally used to have this:

        if (!(gup_flags & FOLL_FORCE))
                gup_flags |= FOLL_NUMA;

ie we *always* set FOLL_NUMA for any sane situation. FOLL_FORCE should
be the rare crazy case.

The original reason for not setting FOLL_NUMA all the time is
documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
page faults from gup/gup_fast") from way back in 2012:

         * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
         * would be called on PROT_NONE ranges. We must never invoke
         * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
         * page faults would unprotect the PROT_NONE ranges if
         * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
         * bitflag. So to avoid that, don't set FOLL_NUMA if
         * FOLL_FORCE is set.

but I don't think the original reason for this is *true* any more.

Because then two years later in 2014, in commit c46a7c817e66 ("x86:
define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
Mel made the code able to distinguish between PROT_NONE and NUMA
pages, and he changed the comment above too.

But I get the very very strong feeling that instead of changing the
comment, he should have actually removed the comment and the code.

So I get the strong feeling that all these FOLL_NUMA games should just
go away. You removed the FOLL_NUMA bit, but you replaced it with using
FOLL_FORCE.

So rather than make this all even more opaque and make it even harder
to figure out why we have that code in the first place, I think it
should all just be removed.

The original reason for FOLL_NUMA simply does not exist any more. We
know exactly when a page is marked for NUMA faulting, and we should
simply *ignore* it for GUP and follow_page().

I think we should treat a NUMA-faulting page as just being present
(and not NUMA-fault it).

Am I missing something?

                  Linus

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by Peter Xu 2 years, 6 months ago

Hi, Linus,

On Fri, Jul 28, 2023 at 09:18:45AM -0700, Linus Torvalds wrote:
> The original reason for FOLL_NUMA simply does not exist any more. We
> know exactly when a page is marked for NUMA faulting, and we should
> simply *ignore* it for GUP and follow_page().
> 
> I think we should treat a NUMA-faulting page as just being present
> (and not NUMA-fault it).

But then does it means that any gup-only user will have numa balancing
completely disabled?  Since as long as the page will only be accessed by
GUP, the numa balancing will never trigger anyway..  I think KVM is
manipulating guest pages just like that.  Not sure whether it means it'll
void the whole numa effort there.

If we allow that GUP from happening (taking protnone as present) I assume
it'll also stop any further numa balancing on this very page to trigger
too, because even if some page fault handler triggered on this protnone
page later that is not GUP anymore, when it wants to migrate the page to
the other numa node it'll see someone is holding a reference on it already,
and then we should give up the balancing.

So to me FOLL_NUMA (or any identifier like it.. just to describe the
caller's need; some may really just want to fetch the pfn/page) still makes
sense.  But maybe I totally misunderstood above..

Thanks,

-- 
Peter Xu

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by Linus Torvalds 2 years, 6 months ago

On Fri, 28 Jul 2023 at 12:39, Peter Xu <peterx@redhat.com> wrote:
>
> But then does it means that any gup-only user will have numa balancing
> completely disabled?

Why would we ever care about a GUP-only user?

Who knows where the actual access is coming from? It might be some
device that is on a different node entirely.

And even if the access is local from the CPU, it

 (a) might have happened after we moved somewhere else

 (b) who cares about the extra possible NUMA overhead when we just
wasted *thousands* of cycles on GUP?

So NUMA balancing really doesn't seem to make sense for GUP anyway as
far as I can see.

Now, the other side of the same thing is that (a) NUMA faulting should
be fairly rare and (b) once you do GUP, who cares anyway, so you can
also argue that "once you do GUP you might as well NUMA-fault, because
performance simply isn't an issue".

But I really think the real argument is "once you do GUP, numa
faulting is just crazy".

I think what happened is

 - the GUP code couldn't tell NUMA and actual PROTNONE apart

 - so the GUP code would punch through PROTNONE even when it shouldn't

 - so people added FOLL_NUMA to say "I don't want you to punch
through, I want the NUMA fault"

 - but then FOLL_FORCE ends up meaning that you actually *do* want to
punch through - regardless of NUMA or not - and now the two got tied
together, and we end up with nonsensical garbage like

        if (!(gup_flags & FOLL_FORCE))
                gup_flags |= FOLL_NUMA;

   to say "oh, actually, to avoid punching through when we shouldn't,
we should NUMA fault".

so we ended up with that case where even if YOU DIDN'T CARE AT ALL,
you got FOLL_NUMA just so that you wouldn't punch through.

And now we're in the situation that we've confused FOLL_FORCE and
FOLL_NUMA, even though they have absolutely *nothing* to do with each
other, except for a random implementation detail about punching
through incorrectly that isn't even relevant any more.

I really think FOLL_NUMA should just go away. And that FOLL_FORCE
replacement for it is just wrong.  If you *don't* do something without
FOLL_FORCE, you damn well shouldn't do it just because FOLL_FORCE is
set.

The *only* semantic meaning FOLL_FORCE should have is that it
overrides the vma protections for debuggers (in a very limited
manner). It should *not* affect any NUMA faulting logic in any way,
shape, or form.

               Linus

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by David Hildenbrand 2 years, 6 months ago

On 28.07.23 22:23, Linus Torvalds wrote:
> On Fri, 28 Jul 2023 at 12:39, Peter Xu <peterx@redhat.com> wrote:
>>
>> But then does it means that any gup-only user will have numa balancing
>> completely disabled?
> 
> Why would we ever care about a GUP-only user?
> 
> Who knows where the actual access is coming from? It might be some
> device that is on a different node entirely.
> 
> And even if the access is local from the CPU, it
> 
>   (a) might have happened after we moved somewhere else
> 
>   (b) who cares about the extra possible NUMA overhead when we just
> wasted *thousands* of cycles on GUP?
> 
> So NUMA balancing really doesn't seem to make sense for GUP anyway as
> far as I can see.

I do agree regarding many GUP users.

But at least for KVM (and probably some others, but most probably KVM is 
the most important GUP user) it does make sense and we have to find a 
way to keep that working.

At least, removing it creates significantly more harm than having it, 
guaranteed :)

So would you rather favor a FOLL_NUMA that has to be passed from the 
outside by selected callers or a FOLL_NUMA that is set on the GUP path 
unconditionally (but left clear for follow_page())?

-- 
Cheers,

David / dhildenb

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by Linus Torvalds 2 years, 6 months ago

On Fri, 28 Jul 2023 at 13:33, David Hildenbrand <david@redhat.com> wrote:
>
> So would you rather favor a FOLL_NUMA that has to be passed from the
> outside by selected callers or a FOLL_NUMA that is set on the GUP path
> unconditionally (but left clear for follow_page())?

I'd rather see the FOLL_NUMA that has to be set by odd cases, and that
is never set by any sane user.

And it should not be called FOLL_NUMA. It should be called something
else. Because *not* having it doesn't disable following pages across
NUMA boundaries, and the name is actively misleading.

It sounds like what KVM actually wants is a "Do NOT follow NUMA pages,
I'll force a page fault".

And the fact that KVM wants a fault for NUMA pages shouldn't mean that
others - who clearly cannot care - get that insane behavior by
default.

The name should reflect that, instead of being the misleading mess of
FOLL_FORCE and bad naming that it is now.

So maybe it can be called "FOLL_HONOR_NUMA_FAULT" or something, to
make it clear that it's the *opposite* of FOLL_FORCE, and that it
honors the NUMA faulting that nobody should care about.

Then the KVM code can have a big comment about *why* it sets that bit.

Hmm? Can we please aim for something that is understandable and
documented? No odd implicit rules. No "force NUMA fault even when it
makes no sense". No tie-in with FOLL_FORCE.

                 Linus

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by David Hildenbrand 2 years, 6 months ago

On 28.07.23 22:50, Linus Torvalds wrote:
> On Fri, 28 Jul 2023 at 13:33, David Hildenbrand <david@redhat.com> wrote:
>>
>> So would you rather favor a FOLL_NUMA that has to be passed from the
>> outside by selected callers or a FOLL_NUMA that is set on the GUP path
>> unconditionally (but left clear for follow_page())?
> 
> I'd rather see the FOLL_NUMA that has to be set by odd cases, and that
> is never set by any sane user.

Thanks!

> 
> And it should not be called FOLL_NUMA. It should be called something
> else. Because *not* having it doesn't disable following pages across
> NUMA boundaries, and the name is actively misleading.
> 
> It sounds like what KVM actually wants is a "Do NOT follow NUMA pages,
> I'll force a page fault".
> 
> And the fact that KVM wants a fault for NUMA pages shouldn't mean that
> others - who clearly cannot care - get that insane behavior by
> default.

For KVM it represents actual CPU access. To map these pages into the VM 
MMU we have to look them up from the process -- in the context of the 
faulting CPU. So it makes a lot of sense for KVM. (which is also where 
autonuma gets heavily used)

> 
> The name should reflect that, instead of being the misleading mess of
> FOLL_FORCE and bad naming that it is now.
> 
> So maybe it can be called "FOLL_HONOR_NUMA_FAULT" or something, to
> make it clear that it's the *opposite* of FOLL_FORCE, and that it
> honors the NUMA faulting that nobody should care about.

Naming sounds much better to me.

> 
> Then the KVM code can have a big comment about *why* it sets that bit.

Yes.

> 
> Hmm? Can we please aim for something that is understandable and
> documented? No odd implicit rules. No "force NUMA fault even when it
> makes no sense". No tie-in with FOLL_FORCE.

I mean, I messed all that FOLL_NUMA handling up because I was very 
confused. So I'm all for better documentation.

Can we get a simple revert in first (without that FOLL_FORCE special 
casing and ideally with a better name) to handle stable backports, and 
I'll follow-up with more documentation and letting GUP callers pass in 
that flag instead?

That would help a lot. Then we also have more time to let that "move it 
to GUP callers" mature a bit in -next, to see if we find any surprises?

-- 
Cheers,

David / dhildenb

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by Peter Xu 2 years, 6 months ago

On Fri, Jul 28, 2023 at 11:02:46PM +0200, David Hildenbrand wrote:
> Can we get a simple revert in first (without that FOLL_FORCE special casing
> and ideally with a better name) to handle stable backports, and I'll
> follow-up with more documentation and letting GUP callers pass in that flag
> instead?
> 
> That would help a lot. Then we also have more time to let that "move it to
> GUP callers" mature a bit in -next, to see if we find any surprises?

As I raised my concern over the other thread, I still worry numa users can
be affected by this change. After all, numa isn't so uncommon to me, at
least fedora / rhel as CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y. I highly
suspect that's also true to major distros.  Meanwhile all kernel modules
use gup..

I'd say we can go ahead and try if we want, but I really don't know why
that helps in any form to move it to the callers.. with the risk of
breaking someone.

Logically it should also be always better to migrate earlier than later,
not only because the page will be local earlier, but also per I discussed
also in the other thread (that the gup can hold a ref to the page, and it
could potentially stop numa balancing to succeed later).

Thanks,

-- 
Peter Xu

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by John Hubbard 2 years, 6 months ago

On 7/28/23 14:20, Peter Xu wrote:
> On Fri, Jul 28, 2023 at 11:02:46PM +0200, David Hildenbrand wrote:
>> Can we get a simple revert in first (without that FOLL_FORCE special casing
>> and ideally with a better name) to handle stable backports, and I'll
>> follow-up with more documentation and letting GUP callers pass in that flag
>> instead?
>>
>> That would help a lot. Then we also have more time to let that "move it to
>> GUP callers" mature a bit in -next, to see if we find any surprises?
> 
> As I raised my concern over the other thread, I still worry numa users can
> be affected by this change. After all, numa isn't so uncommon to me, at
> least fedora / rhel as CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y. I highly
> suspect that's also true to major distros.  Meanwhile all kernel modules
> use gup..
> 
> I'd say we can go ahead and try if we want, but I really don't know why
> that helps in any form to move it to the callers.. with the risk of
> breaking someone.

It's worth the trouble, in order to clear up this historical mess. It's
helping *future* callers of the API, and future maintenance efforts. Yes
there is some risk, but it seems very manageable.

The story of how FOLL_NUMA and FOLL_FORCE became entangled was enlightening,
by the way, and now that I've read it I don't want to go back. :)


thanks,
-- 
John Hubbard
NVIDIA

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by Peter Xu 2 years, 6 months ago

Hello, John,

On Fri, Jul 28, 2023 at 02:32:12PM -0700, John Hubbard wrote:
> On 7/28/23 14:20, Peter Xu wrote:
> > On Fri, Jul 28, 2023 at 11:02:46PM +0200, David Hildenbrand wrote:
> > > Can we get a simple revert in first (without that FOLL_FORCE special casing
> > > and ideally with a better name) to handle stable backports, and I'll
> > > follow-up with more documentation and letting GUP callers pass in that flag
> > > instead?
> > > 
> > > That would help a lot. Then we also have more time to let that "move it to
> > > GUP callers" mature a bit in -next, to see if we find any surprises?
> > 
> > As I raised my concern over the other thread, I still worry numa users can
> > be affected by this change. After all, numa isn't so uncommon to me, at
> > least fedora / rhel as CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y. I highly
> > suspect that's also true to major distros.  Meanwhile all kernel modules
> > use gup..
> > 
> > I'd say we can go ahead and try if we want, but I really don't know why
> > that helps in any form to move it to the callers.. with the risk of
> > breaking someone.
> 
> It's worth the trouble, in order to clear up this historical mess. It's
> helping *future* callers of the API, and future maintenance efforts. Yes
> there is some risk, but it seems very manageable.
> 
> The story of how FOLL_NUMA and FOLL_FORCE became entangled was enlightening,
> by the way, and now that I've read it I don't want to go back. :)

Yeah I fully agree we should hopefully remove the NUMA / FORCE
tangling.. even if we want to revert back to the FOLL_NUMA flag we may want
to not revive that specific part.  I had a feeling that we're all on the
same page there.

It's more about the further step to make FOLL_NUMA opt-in for GUP.

Thanks,

-- 
Peter Xu

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by John Hubbard 2 years, 6 months ago

On 7/28/23 14:49, Peter Xu wrote:
>> The story of how FOLL_NUMA and FOLL_FORCE became entangled was enlightening,
>> by the way, and now that I've read it I don't want to go back. :)
> 
> Yeah I fully agree we should hopefully remove the NUMA / FORCE
> tangling.. even if we want to revert back to the FOLL_NUMA flag we may want
> to not revive that specific part.  I had a feeling that we're all on the
> same page there.
> 

Yes, I think so. :)

> It's more about the further step to make FOLL_NUMA opt-in for GUP.

Let's say "FOLL_HONOR_NUMA_FAULT" for this next discussion, but yes. So
given that our API allows passing in FOLL_ flags, I don't understand the
objection to letting different callers pass in, or not pass in, that
flag.

It's the perfect way to clean up the whole thing. As Linus suggested
slightly earlier here, there can be a comment at the call site,
explaining why KVM needs FOLL_HONOR_NUMA_FAULT, and you're good, right?

thanks,
-- 
John Hubbard
NVIDIA

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by Peter Xu 2 years, 6 months ago

On Fri, Jul 28, 2023 at 03:00:04PM -0700, John Hubbard wrote:
> On 7/28/23 14:49, Peter Xu wrote:
> > > The story of how FOLL_NUMA and FOLL_FORCE became entangled was enlightening,
> > > by the way, and now that I've read it I don't want to go back. :)
> > 
> > Yeah I fully agree we should hopefully remove the NUMA / FORCE
> > tangling.. even if we want to revert back to the FOLL_NUMA flag we may want
> > to not revive that specific part.  I had a feeling that we're all on the
> > same page there.
> > 
> 
> Yes, I think so. :)
> 
> > It's more about the further step to make FOLL_NUMA opt-in for GUP.
> 
> Let's say "FOLL_HONOR_NUMA_FAULT" for this next discussion, but yes. So
> given that our API allows passing in FOLL_ flags, I don't understand the
> objection to letting different callers pass in, or not pass in, that
> flag.
> 
> It's the perfect way to clean up the whole thing. As Linus suggested
> slightly earlier here, there can be a comment at the call site,
> explaining why KVM needs FOLL_HONOR_NUMA_FAULT, and you're good, right?

I'm good even if we want to experiment anything, as long as (at least) kvm
is all covered then I'm not against it, not yet strongly.

But again, IMHO we're not only talking about "cleaning up" of any flag - if
that falls into "cleanup" first, which I'm not 100% sure yet - we're
takling about changing GUP abi.  What I wanted to point out to be careful
is we're literally changing the GUP abi for all kernel modules on numa
implications.

Thanks,

-- 
Peter Xu

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by David Hildenbrand 2 years, 6 months ago

On 28.07.23 23:20, Peter Xu wrote:
> On Fri, Jul 28, 2023 at 11:02:46PM +0200, David Hildenbrand wrote:
>> Can we get a simple revert in first (without that FOLL_FORCE special casing
>> and ideally with a better name) to handle stable backports, and I'll
>> follow-up with more documentation and letting GUP callers pass in that flag
>> instead?
>>
>> That would help a lot. Then we also have more time to let that "move it to
>> GUP callers" mature a bit in -next, to see if we find any surprises?
> 
> As I raised my concern over the other thread, I still worry numa users can
> be affected by this change. After all, numa isn't so uncommon to me, at
> least fedora / rhel as CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y. I highly
> suspect that's also true to major distros.  Meanwhile all kernel modules
> use gup..
> 
> I'd say we can go ahead and try if we want, but I really don't know why
> that helps in any form to move it to the callers.. with the risk of
> breaking someone.
> 

Indeed, that's why I suggest to be a bit careful, especially with stable.

> Logically it should also be always better to migrate earlier than later,
> not only because the page will be local earlier, but also per I discussed
> also in the other thread (that the gup can hold a ref to the page, and it
> could potentially stop numa balancing to succeed later).

I get your point, but I also see the following cases (QEMU/KVM as example):

* User space triggers O_DIRECT. It will be short-lived. But is it really
   an access from that CPU (NUMA node) to that page? At least for KVM,
   you much rather want to let KVM trigger the NUMA fault on actual
   memory access from a guest VCPU, not from a QEMU iothread when pinning
   the page?

* vfio triggers FOLL_PIN|FOLL_LONGTERM from a random QEMU thread.
   Where should we migrate that page to? Would it actually be counter-
   productive to migrate it to the NUMA node of the setup thread? The
   longterm pin will turn the page unmovable, yes, but where to migrate
   it to?

-- 
Cheers,

David / dhildenb

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by Jason Gunthorpe 2 years, 6 months ago

On Fri, Jul 28, 2023 at 11:31:49PM +0200, David Hildenbrand wrote:
> * vfio triggers FOLL_PIN|FOLL_LONGTERM from a random QEMU thread.
>   Where should we migrate that page to? Would it actually be counter-
>   productive to migrate it to the NUMA node of the setup thread? The
>   longterm pin will turn the page unmovable, yes, but where to migrate
>   it to?

For VFIO & KVM you actively don't get any kind of numa balancing or
awareness. In this case qemu should probably strive to put the memory
on the numa node of the majorty of CPUs early on because it doesn't
get another shot at it.

In other cases it depends quite alot. Eg DPDK might want its VFIO
buffers to NUMA'd to the node that is close to the device, not the
CPU. Or vice versa. There is alot of micro sensitivity here at high
data rates. I think people today manually tune this by deliberately
allocating the memory to specific numas and then GUP should just leave
it alone.

FWIW, I'm reading this thread and I have no idea what the special
semantic is KVM needs from GUP, so I'm all for better documentation on
the GUP flag :)

Jason

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by Peter Xu 2 years, 6 months ago

On Fri, Jul 28, 2023 at 07:14:26PM -0300, Jason Gunthorpe wrote:
> On Fri, Jul 28, 2023 at 11:31:49PM +0200, David Hildenbrand wrote:
> > * vfio triggers FOLL_PIN|FOLL_LONGTERM from a random QEMU thread.
> >   Where should we migrate that page to? Would it actually be counter-
> >   productive to migrate it to the NUMA node of the setup thread? The
> >   longterm pin will turn the page unmovable, yes, but where to migrate
> >   it to?
> 
> For VFIO & KVM you actively don't get any kind of numa balancing or
> awareness. In this case qemu should probably strive to put the memory
> on the numa node of the majorty of CPUs early on because it doesn't
> get another shot at it.
> 
> In other cases it depends quite alot. Eg DPDK might want its VFIO
> buffers to NUMA'd to the node that is close to the device, not the
> CPU. Or vice versa. There is alot of micro sensitivity here at high
> data rates. I think people today manually tune this by deliberately
> allocating the memory to specific numas and then GUP should just leave
> it alone.

Right.

For the other O_DIRECT example - it seems to be a more generic issue to
"whether we should rely on the follow up accessor to decide the target node
of numa balancing".  To me at least for KVM's use case I'd still expect the
major paths to trigger that is still when guest accessing a page from vcpu
threads, that's still the GUP paths.

Thanks,

-- 
Peter Xu

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by David Hildenbrand 2 years, 6 months ago

On 28.07.23 21:39, Peter Xu wrote:
> Hi, Linus,
> 
> On Fri, Jul 28, 2023 at 09:18:45AM -0700, Linus Torvalds wrote:
>> The original reason for FOLL_NUMA simply does not exist any more. We
>> know exactly when a page is marked for NUMA faulting, and we should
>> simply *ignore* it for GUP and follow_page().
>>
>> I think we should treat a NUMA-faulting page as just being present
>> (and not NUMA-fault it).
> 
> But then does it means that any gup-only user will have numa balancing
> completely disabled?  Since as long as the page will only be accessed by
> GUP, the numa balancing will never trigger anyway..  I think KVM is
> manipulating guest pages just like that.  Not sure whether it means it'll
> void the whole numa effort there.
> 
> If we allow that GUP from happening (taking protnone as present) I assume
> it'll also stop any further numa balancing on this very page to trigger
> too, because even if some page fault handler triggered on this protnone
> page later that is not GUP anymore, when it wants to migrate the page to
> the other numa node it'll see someone is holding a reference on it already,
> and then we should give up the balancing.
> 
> So to me FOLL_NUMA (or any identifier like it.. just to describe the
> caller's need; some may really just want to fetch the pfn/page) still makes
> sense.  But maybe I totally misunderstood above..

Yes, I agree, took me a bit longer to realize (being a KVM developer :) 
... I'm really ready for the weekend).

So if this series is not acceptable then better revert that commit -- or 
let callers like KVM specify FOLL_NUMA.

-- 
Cheers,

David / dhildenb

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by David Hildenbrand 2 years, 6 months ago

On 28.07.23 18:18, Linus Torvalds wrote:
> On Thu, 27 Jul 2023 at 14:28, David Hildenbrand <david@redhat.com> wrote:
>>
>> This is my proposal on how to handle the fallout of 474098edac26
>> ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I
>> accidentially missed that follow_page() and smaps implicitly kept the
>> FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to
>> not trigger faults on PROT_NONE-mapped PTEs.
> 
> Ugh.

I was hoping for that reaction, with the assumption that we would get
something cleaner :)

> 
> I hate how it uses FOLL_FORCE that is inherently scary.

I hate FOLL_FORCE, but I hate FOLL_NUMA even more, because to me it
is FOLL_FORCE in disguise (currently and before 474098edac26, if
FOLL_FORCE is set, FOLL_NUMA won't be set and the other way around).

> 
> Why do we have that "gup_can_follow_protnone()" logic AT ALL?

That's what I was hoping for.

> 
> Couldn't we just get rid of that disgusting thing, and just say that
> GUP (and follow_page()) always just ignores NUMA hinting, and always
> just follows protnone?
> 
> We literally used to have this:
> 
>          if (!(gup_flags & FOLL_FORCE))
>                  gup_flags |= FOLL_NUMA;
> 
> ie we *always* set FOLL_NUMA for any sane situation. FOLL_FORCE should
> be the rare crazy case.

Yes, but my point would be that we now spell that "rare crazy case"
out for follow_page().

If you're talking about patch #1, I agree, therefore patch #3 to
avoid all that nasty FOLL_FORCE handling in GUP callers.

But yeah, if we can avoid all that, great.

> 
> The original reason for not setting FOLL_NUMA all the time is
> documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
> page faults from gup/gup_fast") from way back in 2012:
> 
>           * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
>           * would be called on PROT_NONE ranges. We must never invoke
>           * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
>           * page faults would unprotect the PROT_NONE ranges if
>           * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
>           * bitflag. So to avoid that, don't set FOLL_NUMA if
>           * FOLL_FORCE is set.

In handle_mm_fault(), we never call do_numa_page() if
!vma_is_accessible(). Same for do_huge_pmd_numa_page().

So, if we would ever end up triggering a page fault on
mprotect(PROT_NONE) ranges (i.e., via FOLL_FORCE), we
would simply do nothing.

At least that's the hope, I'll take a closer look just to make
sure we're good on all call paths.

> 
> but I don't think the original reason for this is *true* any more.
> 
> Because then two years later in 2014, in commit c46a7c817e66 ("x86:
> define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
> Mel made the code able to distinguish between PROT_NONE and NUMA
> pages, and he changed the comment above too.

CCing Mel.

I remember that pte_protnone() can only distinguished between
NUMA vs. actual mprotect(PROT_NONE) by looking at the VMA -- vma_is_accessible().

Indeed, include/linux/pgtable.h:

/*
  * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
  * the only case the kernel cares is for NUMA balancing and is only ever set
  * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked
  * _PAGE_PROTNONE so by default, implement the helper as "always no". It
  * is the responsibility of the caller to distinguish between PROT_NONE
  * protections and NUMA hinting fault protections.
  */

> 
> But I get the very very strong feeling that instead of changing the
> comment, he should have actually removed the comment and the code.
> 
> So I get the strong feeling that all these FOLL_NUMA games should just
> go away. You removed the FOLL_NUMA bit, but you replaced it with using
> FOLL_FORCE.
> 
> So rather than make this all even more opaque and make it even harder
> to figure out why we have that code in the first place, I think it
> should all just be removed.

At least to me, spelling FOLL_FORCE in follow_page() now out is much
less opaque then getting that implied by lack of FOLL_NUMA.

> 
> The original reason for FOLL_NUMA simply does not exist any more. We
> know exactly when a page is marked for NUMA faulting, and we should
> simply *ignore* it for GUP and follow_page().
> 
> I think we should treat a NUMA-faulting page as just being present
> (and not NUMA-fault it).
> 
> Am I missing something?

There was the case for "FOLL_PIN represents application behavior and
should trigger NUMA faults", but I guess that can be ignored.

But it would be much better to just remove all that if we can.

Let me look into some details.

Thanks Linus!

-- 
Cheers,

David / dhildenb

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by Mel Gorman 2 years, 6 months ago

On Fri, Jul 28, 2023 at 07:30:33PM +0200, David Hildenbrand wrote:
> On 28.07.23 18:18, Linus Torvalds wrote:
> > On Thu, 27 Jul 2023 at 14:28, David Hildenbrand <david@redhat.com> wrote:
> > > 
> > > This is my proposal on how to handle the fallout of 474098edac26
> > > ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I
> > > accidentially missed that follow_page() and smaps implicitly kept the
> > > FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to
> > > not trigger faults on PROT_NONE-mapped PTEs.
> > 
> > Ugh.
> 
> I was hoping for that reaction, with the assumption that we would get
> something cleaner :)
> 
> > 
> > I hate how it uses FOLL_FORCE that is inherently scary.
> 
> I hate FOLL_FORCE, but I hate FOLL_NUMA even more, because to me it
> is FOLL_FORCE in disguise (currently and before 474098edac26, if
> FOLL_FORCE is set, FOLL_NUMA won't be set and the other way around).
> 

FOLL_NUMA being conflated with FOLL_FORCE is almost certainly a historical
accident.

> > 
> > Why do we have that "gup_can_follow_protnone()" logic AT ALL?
> 
> That's what I was hoping for.
> 
> > 
> > Couldn't we just get rid of that disgusting thing, and just say that
> > GUP (and follow_page()) always just ignores NUMA hinting, and always
> > just follows protnone?
> > 
> > We literally used to have this:
> > 
> >          if (!(gup_flags & FOLL_FORCE))
> >                  gup_flags |= FOLL_NUMA;
> > 
> > ie we *always* set FOLL_NUMA for any sane situation. FOLL_FORCE should
> > be the rare crazy case.
> 
> Yes, but my point would be that we now spell that "rare crazy case"
> out for follow_page().
> 
> If you're talking about patch #1, I agree, therefore patch #3 to
> avoid all that nasty FOLL_FORCE handling in GUP callers.
> 
> But yeah, if we can avoid all that, great.
> 
> > 
> > The original reason for not setting FOLL_NUMA all the time is
> > documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
> > page faults from gup/gup_fast") from way back in 2012:
> > 
> >           * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
> >           * would be called on PROT_NONE ranges. We must never invoke
> >           * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
> >           * page faults would unprotect the PROT_NONE ranges if
> >           * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
> >           * bitflag. So to avoid that, don't set FOLL_NUMA if
> >           * FOLL_FORCE is set.
> 
> 
> In handle_mm_fault(), we never call do_numa_page() if
> !vma_is_accessible(). Same for do_huge_pmd_numa_page().
> 
> So, if we would ever end up triggering a page fault on
> mprotect(PROT_NONE) ranges (i.e., via FOLL_FORCE), we
> would simply do nothing.
> 
> At least that's the hope, I'll take a closer look just to make
> sure we're good on all call paths.
> 
> > 
> > but I don't think the original reason for this is *true* any more.
> > 
> > Because then two years later in 2014, in commit c46a7c817e66 ("x86:
> > define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
> > Mel made the code able to distinguish between PROT_NONE and NUMA
> > pages, and he changed the comment above too.
> 
> CCing Mel.
> 
> I remember that pte_protnone() can only distinguished between
> NUMA vs. actual mprotect(PROT_NONE) by looking at the VMA -- vma_is_accessible().
> 

Ok, as usual, I'm far behind and this thread massive but I'll respond
to this part before trying to digest the history of this and the current
implementation.

To the best of my recollection, FOLL_NUMA used to be a correctness issue
but that should no longer true. Initially, it was to prevent mixing up
"PROT_NONE" that was for NUMA hinting and "PROT_NONE" due to VMA
protections. Now the bits are different so this case should be
avoidable.

Later it was still a different correctness issue because PMD migration had
a hacky implementation without migration entries and a GUP could find a
page that was being collapsed and had to be serialised. That should also
now be avoidable.

At some point, FOLL_FORCE and FOLL_NUMA got conflated but they really should
not be related even if they are by accident. FOLL_FORCE (e.g. ptrace)
may have to process the fault and make the page resident and accessible
regardless of any other consequences. FOLL_NUMA ideally should be much
more specific. If the calling context only cares about the struct page
(e.g. smaps) then it's ok to get a reference to the page. If necessary,
it could clear the protection and lose the hinting fault although it's less
than ideal. Just needing the struct page for informational purposes though
should not be treated as a NUMA hinting fault because it has nothing to
do with the tasks memory reference behaviour.

A variant of FOLL_NUMA (FOLL_NUMA_HINT?) may still be required to indicate
the calling context is accessing the page for reasons that are equivalent
to a real memory access from a CPU related to the task mapping the page.
I didn't check but KVM may be an example of this when dealing with some MMU
faults as the page is being looked up on behalf of the task and presumably
from the same CPU the task was running run. Something like reading smaps
only needs the struct page but it should not be treated as a NUMA hinting
fault as the access has nothing to do with the task mapping the page.

> > The original reason for FOLL_NUMA simply does not exist any more. We
> > know exactly when a page is marked for NUMA faulting, and we should
> > simply *ignore* it for GUP and follow_page().
> > 

I'd be wary of completely ignoring it if there is any known calling context
that is equivalent to a memory access and the hinting fault should be
processed -- KVM may be an example.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by David Hildenbrand 2 years, 6 months ago

On 28.07.23 19:30, David Hildenbrand wrote:
> On 28.07.23 18:18, Linus Torvalds wrote:
>> On Thu, 27 Jul 2023 at 14:28, David Hildenbrand <david@redhat.com> wrote:
>>>
>>> This is my proposal on how to handle the fallout of 474098edac26
>>> ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I
>>> accidentially missed that follow_page() and smaps implicitly kept the
>>> FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to
>>> not trigger faults on PROT_NONE-mapped PTEs.
>>
>> Ugh.
> 
> I was hoping for that reaction, with the assumption that we would get
> something cleaner :)
> 
>>
>> I hate how it uses FOLL_FORCE that is inherently scary.
> 
> I hate FOLL_FORCE, but I hate FOLL_NUMA even more, because to me it
> is FOLL_FORCE in disguise (currently and before 474098edac26, if
> FOLL_FORCE is set, FOLL_NUMA won't be set and the other way around).
> 
>>
>> Why do we have that "gup_can_follow_protnone()" logic AT ALL?
> 
> That's what I was hoping for.
> 
>>
>> Couldn't we just get rid of that disgusting thing, and just say that
>> GUP (and follow_page()) always just ignores NUMA hinting, and always
>> just follows protnone?
>>
>> We literally used to have this:
>>
>>           if (!(gup_flags & FOLL_FORCE))
>>                   gup_flags |= FOLL_NUMA;
>>
>> ie we *always* set FOLL_NUMA for any sane situation. FOLL_FORCE should
>> be the rare crazy case.
> 
> Yes, but my point would be that we now spell that "rare crazy case"
> out for follow_page().
> 
> If you're talking about patch #1, I agree, therefore patch #3 to
> avoid all that nasty FOLL_FORCE handling in GUP callers.
> 
> But yeah, if we can avoid all that, great.
> 
>>
>> The original reason for not setting FOLL_NUMA all the time is
>> documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
>> page faults from gup/gup_fast") from way back in 2012:
>>
>>            * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
>>            * would be called on PROT_NONE ranges. We must never invoke
>>            * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
>>            * page faults would unprotect the PROT_NONE ranges if
>>            * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
>>            * bitflag. So to avoid that, don't set FOLL_NUMA if
>>            * FOLL_FORCE is set.
> 
> 
> In handle_mm_fault(), we never call do_numa_page() if
> !vma_is_accessible(). Same for do_huge_pmd_numa_page().
> 
> So, if we would ever end up triggering a page fault on
> mprotect(PROT_NONE) ranges (i.e., via FOLL_FORCE), we
> would simply do nothing.
> 
> At least that's the hope, I'll take a closer look just to make
> sure we're good on all call paths.
> 
>>
>> but I don't think the original reason for this is *true* any more.
>>
>> Because then two years later in 2014, in commit c46a7c817e66 ("x86:
>> define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
>> Mel made the code able to distinguish between PROT_NONE and NUMA
>> pages, and he changed the comment above too.
> 
> CCing Mel.
> 
> I remember that pte_protnone() can only distinguished between
> NUMA vs. actual mprotect(PROT_NONE) by looking at the VMA -- vma_is_accessible().
> 
> Indeed, include/linux/pgtable.h:
> 
> /*
>    * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
>    * the only case the kernel cares is for NUMA balancing and is only ever set
>    * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked
>    * _PAGE_PROTNONE so by default, implement the helper as "always no". It
>    * is the responsibility of the caller to distinguish between PROT_NONE
>    * protections and NUMA hinting fault protections.
>    */
> 
>>
>> But I get the very very strong feeling that instead of changing the
>> comment, he should have actually removed the comment and the code.
>>
>> So I get the strong feeling that all these FOLL_NUMA games should just
>> go away. You removed the FOLL_NUMA bit, but you replaced it with using
>> FOLL_FORCE.
>>
>> So rather than make this all even more opaque and make it even harder
>> to figure out why we have that code in the first place, I think it
>> should all just be removed.
> 
> At least to me, spelling FOLL_FORCE in follow_page() now out is much
> less opaque then getting that implied by lack of FOLL_NUMA.
> 
>>
>> The original reason for FOLL_NUMA simply does not exist any more. We
>> know exactly when a page is marked for NUMA faulting, and we should
>> simply *ignore* it for GUP and follow_page().
>>
>> I think we should treat a NUMA-faulting page as just being present
>> (and not NUMA-fault it).
>>
>> Am I missing something?
> 
> There was the case for "FOLL_PIN represents application behavior and
> should trigger NUMA faults", but I guess that can be ignored.


Re-reading commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page 
faults from gup/gup_fast"), it actually does spell out an important case 
that we should handle:

"KVM secondary MMU page faults will trigger the NUMA hinting page
  faults through gup_fast -> get_user_pages -> follow_page ->
  handle_mm_fault."

That is still the case today (and important). Not triggering NUMA 
hinting faults would degrade KVM.

Hmm. So three alternatives I see:

1) Use FOLL_FORCE in follow_page() to unconditionally disable protnone
    checks. Alternatively, have an internal FOLL_NO_PROTNONE flag if we
    don't like that.

2) Revert the commit and reintroduce unconditional FOLL_NUMA without
    FOLL_FORCE.

3) Have a FOLL_NUMA that callers like KVM can pass.

Thoughts?

-- 
Cheers,

David / dhildenb

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by Peter Xu 2 years, 6 months ago

On Fri, Jul 28, 2023 at 09:40:54PM +0200, David Hildenbrand wrote:
> Hmm. So three alternatives I see:
> 
> 1) Use FOLL_FORCE in follow_page() to unconditionally disable protnone
>    checks. Alternatively, have an internal FOLL_NO_PROTNONE flag if we
>    don't like that.
> 
> 2) Revert the commit and reintroduce unconditional FOLL_NUMA without
>    FOLL_FORCE.
> 
> 3) Have a FOLL_NUMA that callers like KVM can pass.

I'm afraid 3) means changing numa balancing to opt-in, probably no-go for
any non-kvm gup users as that could start to break there, even if making
smaps/follow_page happy again.

I keep worrying 1) on FOLL_FORCE abuse.

So I keep my vote on 2).

Thanks,

-- 
Peter Xu

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by David Hildenbrand 2 years, 6 months ago

On 28.07.23 21:50, Peter Xu wrote:
> On Fri, Jul 28, 2023 at 09:40:54PM +0200, David Hildenbrand wrote:
>> Hmm. So three alternatives I see:
>>
>> 1) Use FOLL_FORCE in follow_page() to unconditionally disable protnone
>>     checks. Alternatively, have an internal FOLL_NO_PROTNONE flag if we
>>     don't like that.
>>
>> 2) Revert the commit and reintroduce unconditional FOLL_NUMA without
>>     FOLL_FORCE.
>>
>> 3) Have a FOLL_NUMA that callers like KVM can pass.
> 
> I'm afraid 3) means changing numa balancing to opt-in, probably no-go for
> any non-kvm gup users as that could start to break there, even if making
> smaps/follow_page happy again.
> 
> I keep worrying 1) on FOLL_FORCE abuse.
> 
> So I keep my vote on 2).

Linus had a point that we can nowadays always set FOLL_NUMA, even with 
FOLL_FORCE.

... so maybe we want to let GUP always set FOLL_NUMA (even with 
FOLL_FORCE) and follow_page() never set FOLL_NUMA.

That would at least decouple FOLL_NUMA from FOLL_FORCE.

-- 
Cheers,

David / dhildenb

Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Posted by David Hildenbrand 2 years, 6 months ago

[...]

> 
> There was the case for "FOLL_PIN represents application behavior and
> should trigger NUMA faults", but I guess that can be ignored.
> 
> But it would be much better to just remove all that if we can.
> 
> Let me look into some details.

I just stumbled over the comment from Mel in follow_trans_huge_pmd():

/* Full NUMA hinting faults to serialise migration in fault paths */

It dates back to

commit 2b4847e73004c10ae6666c2e27b5c5430aed8698
Author: Mel Gorman <mgorman@suse.de>
Date:   Wed Dec 18 17:08:32 2013 -0800

     mm: numa: serialise parallel get_user_page against THP migration
     
     Base pages are unmapped and flushed from cache and TLB during normal
     page migration and replaced with a migration entry that causes any
     parallel NUMA hinting fault or gup to block until migration completes.
     
     THP does not unmap pages due to a lack of support for migration entries
     at a PMD level.  This allows races with get_user_pages and
     get_user_pages_fast which commit 3f926ab945b6 ("mm: Close races between
     THP migration and PMD numa clearing") made worse by introducing a
     pmd_clear_flush().
     
     This patch forces get_user_page (fast and normal) on a pmd_numa page to
     go through the slow get_user_page path where it will serialise against
     THP migration and properly account for the NUMA hinting fault.  On the
     migration side the page table lock is taken for each PTE update.


We nowadays do have migration entries at PMD level -- and setting FOLL_FORCE
could similarly trigger such a race.

So I suspect we're good.

-- 
Cheers,

David / dhildenb