add process_madvise() flags to modify behaviour

[RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Lorenzo Stoakes 8 months, 3 weeks ago

REVIEWERS NOTES:
================

This is a VERY EARLY version of the idea, it's relatively untested, and I'm
'putting it out there' for feedback. Any serious version of this will add a
bunch of self-tests to assert correct behaviour and I will more carefully
confirm everything's working.

This is based on discussion arising from Usama's series [0], SJ's input on
the thread around process_madvise() behaviour [1] (and a subsequent
response by me [2]) and prior discussion about a new madvise() interface
[3].

[0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
[1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
[2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
[3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/

================

Currently, we are rather restricted in how madvise() operations
proceed. While effort has been put in to expanding what process_madvise()
can do (that is - unrestricted application of advice to the local process
alongside recent improvements on the efficiency of TLB operations over
these batvches), we are still constrained by existing madvise() limitations
and default behaviours.

This series makes use of the currently unused flags field in
process_madvise() to provide more flexiblity.

It introduces four flags:

1. PMADV_SKIP_ERRORS

Currently, when an error arises applying advice in any individual VMA
(keeping in mind that a range specified to madvise() or as part of the
iovec passed to process_madvise()), the operation stops where it is and
returns an error.

This might not be the desired behaviour of the user, who may wish instead
for the operation to be 'best effort'. By setting this flag, that behaviour
is obtained.

Since process_madvise() would trivially, if skipping errors, simply return
the input vector size, we instead return the number of entries in the
vector which completed successfully without error.

The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.

2. PMADV_NO_ERROR_ON_UNMAPPED

Currently madvise() has the peculiar behaviour of, if the range specified
to it contains unmapped range(s), completing the full operation, but
ultimately returning -ENOMEM.

In the case of process_madvise(), this is fatal, as the operation will stop
immediately upon this occurring.

By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
unmapped areas to simply be entirely ignored.

3. PMADV_SET_FORK_EXEC_DEFAULT

It may be desirable for a user to specify that all VMAs mapped in a process
address space default to having an madvise() behaviour established by
default, in such a fashion as that this persists across fork/exec.

Since this is a very powerful option that would make no sense for many
advice modes, we explicitly only permit known-safe flags here (currently
MADV_HUGEPAGE and MADV_NOHUGEPAGE only).

4. PMADV_ENTIRE_ADDRESS_SPACE

It can be annoying, should a user wish to apply madvise() to all VMAs in an
address space, to have to add a singular large entry to the input iovec.

So provide sugar to permit this - PMADV_ENTIRE_ADDRESS_SPACE. If specified,
we expect the user to pass NULL and -1 to the vec and vlen parameters
respectively so they explicitly acknowledge that these will be ignored,
e.g.:

	process_madvise(PIDFD_SELF, NULL, -1, MADV_HUGEPAGE,
			PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SKIP_ERRORS);

Usually a user ought to prefer setting PMADV_SKIP_ERRORS here as it may
well be the case that incompatible VMAs will be encountered that ought to
be skipped.

If this is not set, the PMADV_NO_ERROR_ON_UNMAPPED (which was otherwise
implicitly implied by PMADV_SKIP_ERRORS) ought to be set as of course, the
entire address space spans at least some gaps.

Lorenzo Stoakes (5):
  mm: madvise: refactor madvise_populate()
  mm/madvise: add PMADV_SKIP_ERRORS process_madvise() flag
  mm/madvise: add PMADV_NO_ERROR_ON_UNMAPPED process_madvise() flag
  mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
  mm/madvise: add PMADV_ENTIRE_ADDRESS_SPACE process_madvise() flag

 include/uapi/asm-generic/mman-common.h |   6 +
 mm/madvise.c                           | 206 +++++++++++++++++++------
 2 files changed, 168 insertions(+), 44 deletions(-)

--
2.49.0

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Mike Rapoport 8 months, 3 weeks ago

(cc'ing linux-api)

On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> REVIEWERS NOTES:
> ================
> 
> This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> 'putting it out there' for feedback. Any serious version of this will add a
> bunch of self-tests to assert correct behaviour and I will more carefully
> confirm everything's working.
> 
> This is based on discussion arising from Usama's series [0], SJ's input on
> the thread around process_madvise() behaviour [1] (and a subsequent
> response by me [2]) and prior discussion about a new madvise() interface
> [3].
> 
> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> 
> ================
> 
> Currently, we are rather restricted in how madvise() operations
> proceed. While effort has been put in to expanding what process_madvise()
> can do (that is - unrestricted application of advice to the local process
> alongside recent improvements on the efficiency of TLB operations over
> these batvches), we are still constrained by existing madvise() limitations
> and default behaviours.
> 
> This series makes use of the currently unused flags field in
> process_madvise() to provide more flexiblity.
> 
> It introduces four flags:
> 
> 1. PMADV_SKIP_ERRORS
> 
> Currently, when an error arises applying advice in any individual VMA
> (keeping in mind that a range specified to madvise() or as part of the
> iovec passed to process_madvise()), the operation stops where it is and
> returns an error.
> 
> This might not be the desired behaviour of the user, who may wish instead
> for the operation to be 'best effort'. By setting this flag, that behaviour
> is obtained.
> 
> Since process_madvise() would trivially, if skipping errors, simply return
> the input vector size, we instead return the number of entries in the
> vector which completed successfully without error.
> 
> The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> 
> 2. PMADV_NO_ERROR_ON_UNMAPPED
> 
> Currently madvise() has the peculiar behaviour of, if the range specified
> to it contains unmapped range(s), completing the full operation, but
> ultimately returning -ENOMEM.
> 
> In the case of process_madvise(), this is fatal, as the operation will stop
> immediately upon this occurring.
> 
> By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> unmapped areas to simply be entirely ignored.
> 
> 3. PMADV_SET_FORK_EXEC_DEFAULT
> 
> It may be desirable for a user to specify that all VMAs mapped in a process
> address space default to having an madvise() behaviour established by
> default, in such a fashion as that this persists across fork/exec.
> 
> Since this is a very powerful option that would make no sense for many
> advice modes, we explicitly only permit known-safe flags here (currently
> MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
> 
> 4. PMADV_ENTIRE_ADDRESS_SPACE
> 
> It can be annoying, should a user wish to apply madvise() to all VMAs in an
> address space, to have to add a singular large entry to the input iovec.
> 
> So provide sugar to permit this - PMADV_ENTIRE_ADDRESS_SPACE. If specified,
> we expect the user to pass NULL and -1 to the vec and vlen parameters
> respectively so they explicitly acknowledge that these will be ignored,
> e.g.:
> 
> 	process_madvise(PIDFD_SELF, NULL, -1, MADV_HUGEPAGE,
> 			PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SKIP_ERRORS);
> 
> Usually a user ought to prefer setting PMADV_SKIP_ERRORS here as it may
> well be the case that incompatible VMAs will be encountered that ought to
> be skipped.
> 
> If this is not set, the PMADV_NO_ERROR_ON_UNMAPPED (which was otherwise
> implicitly implied by PMADV_SKIP_ERRORS) ought to be set as of course, the
> entire address space spans at least some gaps.
> 
> Lorenzo Stoakes (5):
>   mm: madvise: refactor madvise_populate()
>   mm/madvise: add PMADV_SKIP_ERRORS process_madvise() flag
>   mm/madvise: add PMADV_NO_ERROR_ON_UNMAPPED process_madvise() flag
>   mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
>   mm/madvise: add PMADV_ENTIRE_ADDRESS_SPACE process_madvise() flag
> 
>  include/uapi/asm-generic/mman-common.h |   6 +
>  mm/madvise.c                           | 206 +++++++++++++++++++------
>  2 files changed, 168 insertions(+), 44 deletions(-)
> 
> --
> 2.49.0
> 

-- 
Sincerely yours,
Mike.

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Shakeel Butt 8 months, 3 weeks ago

On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> REVIEWERS NOTES:
> ================
> 
> This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> 'putting it out there' for feedback. Any serious version of this will add a
> bunch of self-tests to assert correct behaviour and I will more carefully
> confirm everything's working.
> 
> This is based on discussion arising from Usama's series [0], SJ's input on
> the thread around process_madvise() behaviour [1] (and a subsequent
> response by me [2]) and prior discussion about a new madvise() interface
> [3].
> 
> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> 
> ================
> 
> Currently, we are rather restricted in how madvise() operations
> proceed. While effort has been put in to expanding what process_madvise()
> can do (that is - unrestricted application of advice to the local process
> alongside recent improvements on the efficiency of TLB operations over
> these batvches), we are still constrained by existing madvise() limitations
> and default behaviours.
> 
> This series makes use of the currently unused flags field in
> process_madvise() to provide more flexiblity.
> 
> It introduces four flags:
> 
> 1. PMADV_SKIP_ERRORS
> 
> Currently, when an error arises applying advice in any individual VMA
> (keeping in mind that a range specified to madvise() or as part of the
> iovec passed to process_madvise()), the operation stops where it is and
> returns an error.
> 
> This might not be the desired behaviour of the user, who may wish instead
> for the operation to be 'best effort'. By setting this flag, that behaviour
> is obtained.
> 
> Since process_madvise() would trivially, if skipping errors, simply return
> the input vector size, we instead return the number of entries in the
> vector which completed successfully without error.
> 
> The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> 
> 2. PMADV_NO_ERROR_ON_UNMAPPED
> 
> Currently madvise() has the peculiar behaviour of, if the range specified
> to it contains unmapped range(s), completing the full operation, but
> ultimately returning -ENOMEM.
> 
> In the case of process_madvise(), this is fatal, as the operation will stop
> immediately upon this occurring.
> 
> By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> unmapped areas to simply be entirely ignored.

Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why
PMADV_SKIP_ERRORS is not enough? I don't see a need for
PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where
PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS?

> 
> 3. PMADV_SET_FORK_EXEC_DEFAULT
> 
> It may be desirable for a user to specify that all VMAs mapped in a process
> address space default to having an madvise() behaviour established by
> default, in such a fashion as that this persists across fork/exec.
> 
> Since this is a very powerful option that would make no sense for many
> advice modes, we explicitly only permit known-safe flags here (currently
> MADV_HUGEPAGE and MADV_NOHUGEPAGE only).

Other flags seems general enough but this one is just weird. This is
exactly the scenario for prctl() like interface. You are trying to make
process_madvise() like prctl() and I can see process_madvise() would be
included in all the hate that prctl() receives. 

Let me ask in a different way. Eventually we want to be in a state where
hugepages works out of the box for all workloads. In that state what
would the need for this flag unless you have use-cases other than
hugepages. To me, there is a general consensus that prctl is a hacky
interface, so having some intermediate solution through prctl until
hugepages are good out of the box seems more reasonable.

> 
> 4. PMADV_ENTIRE_ADDRESS_SPACE
> 
> It can be annoying, should a user wish to apply madvise() to all VMAs in an
> address space, to have to add a singular large entry to the input iovec.
> 
> So provide sugar to permit this - PMADV_ENTIRE_ADDRESS_SPACE. If specified,
> we expect the user to pass NULL and -1 to the vec and vlen parameters
> respectively so they explicitly acknowledge that these will be ignored,
> e.g.:
> 
> 	process_madvise(PIDFD_SELF, NULL, -1, MADV_HUGEPAGE,
> 			PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SKIP_ERRORS);
> 

I still don't see a need for this flag. Why not the following?

process_madvise(PIDFD_SELF, NULL, -1, advise, PMADV_SKIP_ERRORS)?

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Lorenzo Stoakes 8 months, 3 weeks ago

On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote:
> On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> > REVIEWERS NOTES:
> > ================
> >
> > This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> > 'putting it out there' for feedback. Any serious version of this will add a
> > bunch of self-tests to assert correct behaviour and I will more carefully
> > confirm everything's working.
> >
> > This is based on discussion arising from Usama's series [0], SJ's input on
> > the thread around process_madvise() behaviour [1] (and a subsequent
> > response by me [2]) and prior discussion about a new madvise() interface
> > [3].
> >
> > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> >
> > ================
> >
> > Currently, we are rather restricted in how madvise() operations
> > proceed. While effort has been put in to expanding what process_madvise()
> > can do (that is - unrestricted application of advice to the local process
> > alongside recent improvements on the efficiency of TLB operations over
> > these batvches), we are still constrained by existing madvise() limitations
> > and default behaviours.
> >
> > This series makes use of the currently unused flags field in
> > process_madvise() to provide more flexiblity.
> >
> > It introduces four flags:
> >
> > 1. PMADV_SKIP_ERRORS
> >
> > Currently, when an error arises applying advice in any individual VMA
> > (keeping in mind that a range specified to madvise() or as part of the
> > iovec passed to process_madvise()), the operation stops where it is and
> > returns an error.
> >
> > This might not be the desired behaviour of the user, who may wish instead
> > for the operation to be 'best effort'. By setting this flag, that behaviour
> > is obtained.
> >
> > Since process_madvise() would trivially, if skipping errors, simply return
> > the input vector size, we instead return the number of entries in the
> > vector which completed successfully without error.
> >
> > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> >
> > 2. PMADV_NO_ERROR_ON_UNMAPPED
> >
> > Currently madvise() has the peculiar behaviour of, if the range specified
> > to it contains unmapped range(s), completing the full operation, but
> > ultimately returning -ENOMEM.
> >
> > In the case of process_madvise(), this is fatal, as the operation will stop
> > immediately upon this occurring.
> >
> > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> > unmapped areas to simply be entirely ignored.
>
> Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why
> PMADV_SKIP_ERRORS is not enough? I don't see a need for
> PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where
> PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS?

I thought I already explained this above:

	"In the case of process_madvise(), this is fatal, as the operation
	 will stop immediately upon this occurring."

This is somewhat bizarre behaviour. You specify multiple vector entries
spanning different ranges, but one spans some unmapped space and now the
whole process_madvise() operation grinds to a halt, except the vector entry
containing ranges including unmapped space is completed.

This is strange behaviour, and it makes sense to me to optionally disable
this.

If you were looping around doing an madvise(), this would _not_ occur, you
could filter out the -ENOMEM's. It's a really weird peculiarity in
process_madvise().

Moreover, you might not want an error reported, that possibly duplicates
_real_ -ENOMEM errors, when you simply encounter unmapped addresses.

Finally, if you perform an operation across the entire address space as
proposed you may wish to stop on actual error but not on the (inevitable at
least in 64-bit space) gaps you'll encounter.

>
> >
> > 3. PMADV_SET_FORK_EXEC_DEFAULT
> >
> > It may be desirable for a user to specify that all VMAs mapped in a process
> > address space default to having an madvise() behaviour established by
> > default, in such a fashion as that this persists across fork/exec.
> >
> > Since this is a very powerful option that would make no sense for many
> > advice modes, we explicitly only permit known-safe flags here (currently
> > MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
>
> Other flags seems general enough but this one is just weird. This is
> exactly the scenario for prctl() like interface. You are trying to make
> process_madvise() like prctl() and I can see process_madvise() would be
> included in all the hate that prctl() receives.

I'm really not sure what you mean. prctl() has no rhyme nor reason, so not
sure what a 'prctl() like interface' means here, and you're not explaining
what you mean by that.

Presumably you mean you find this odd as you feel it sits outside the realm
of madvise() behaviour.

But I'd suggest it does not - the idea is to align _everything_ with
madvise(). Rather than adding an entirely arbitrary function in prctl(), we
are going the other way - keeping everything relating to madvise()-like
modification of memory _in mm_ and _in madvise()_, rather than bitrotting
away in kernel/sys.c.

So we get alignment in the fact that we're saying 'we establish a _default_
madvise flag for a process'.

We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what
you guys at meta want while also opening the door to doing so in future if
it makes sense to.

This couldn't be more different than putting some arbitrary code relating
to mm in the 'netherrealm' of prctl().

>
> Let me ask in a different way. Eventually we want to be in a state where
> hugepages works out of the box for all workloads. In that state what
> would the need for this flag unless you have use-cases other than
> hugepages. To me, there is a general consensus that prctl is a hacky
> interface, so having some intermediate solution through prctl until
> hugepages are good out of the box seems more reasonable.

No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will
have to be supported. In a future where things are automatic, these may be
no-ops in 'auto' mode.

But the requirement to have these flags will still exist, the requirement
to do madvise() operations will still exist, we're simply expanding this
functionality.

The problem arises the other way around when we shovel mm stuff in
kernel/sys.c.

>
> >
> > 4. PMADV_ENTIRE_ADDRESS_SPACE
> >
> > It can be annoying, should a user wish to apply madvise() to all VMAs in an
> > address space, to have to add a singular large entry to the input iovec.
> >
> > So provide sugar to permit this - PMADV_ENTIRE_ADDRESS_SPACE. If specified,
> > we expect the user to pass NULL and -1 to the vec and vlen parameters
> > respectively so they explicitly acknowledge that these will be ignored,
> > e.g.:
> >
> > 	process_madvise(PIDFD_SELF, NULL, -1, MADV_HUGEPAGE,
> > 			PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SKIP_ERRORS);
> >
>
> I still don't see a need for this flag. Why not the following?
>
> process_madvise(PIDFD_SELF, NULL, -1, advise, PMADV_SKIP_ERRORS)?

I don't think iovec=NULL, vlen=-1 means 'everything' in any other interface
anywhere else in the kernel? Why would we reasonably expect anybody to
know/assume that here?

It's surely far better to very explicitly say as much? The idea with NULL,
-1 is that you're making sure the user knows any provided iovec, vlen will
be ignored, and better to just disallow the user providing those at all
(similar to mmap requiring -1 for fd for anon mappings).

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Shakeel Butt 8 months, 3 weeks ago

On Tue, May 20, 2025 at 07:45:43PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote:
> > On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> > > REVIEWERS NOTES:
> > > ================
> > >
> > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> > > 'putting it out there' for feedback. Any serious version of this will add a
> > > bunch of self-tests to assert correct behaviour and I will more carefully
> > > confirm everything's working.
> > >
> > > This is based on discussion arising from Usama's series [0], SJ's input on
> > > the thread around process_madvise() behaviour [1] (and a subsequent
> > > response by me [2]) and prior discussion about a new madvise() interface
> > > [3].
> > >
> > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> > > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> > > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> > > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> > >
> > > ================
> > >
> > > Currently, we are rather restricted in how madvise() operations
> > > proceed. While effort has been put in to expanding what process_madvise()
> > > can do (that is - unrestricted application of advice to the local process
> > > alongside recent improvements on the efficiency of TLB operations over
> > > these batvches), we are still constrained by existing madvise() limitations
> > > and default behaviours.
> > >
> > > This series makes use of the currently unused flags field in
> > > process_madvise() to provide more flexiblity.
> > >
> > > It introduces four flags:
> > >
> > > 1. PMADV_SKIP_ERRORS
> > >
> > > Currently, when an error arises applying advice in any individual VMA
> > > (keeping in mind that a range specified to madvise() or as part of the
> > > iovec passed to process_madvise()), the operation stops where it is and
> > > returns an error.
> > >
> > > This might not be the desired behaviour of the user, who may wish instead
> > > for the operation to be 'best effort'. By setting this flag, that behaviour
> > > is obtained.
> > >
> > > Since process_madvise() would trivially, if skipping errors, simply return
> > > the input vector size, we instead return the number of entries in the
> > > vector which completed successfully without error.
> > >
> > > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> > >
> > > 2. PMADV_NO_ERROR_ON_UNMAPPED
> > >
> > > Currently madvise() has the peculiar behaviour of, if the range specified
> > > to it contains unmapped range(s), completing the full operation, but
> > > ultimately returning -ENOMEM.
> > >
> > > In the case of process_madvise(), this is fatal, as the operation will stop
> > > immediately upon this occurring.
> > >
> > > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> > > unmapped areas to simply be entirely ignored.
> >
> > Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why
> > PMADV_SKIP_ERRORS is not enough? I don't see a need for
> > PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where
> > PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS?
> 
> I thought I already explained this above:
> 
> 	"In the case of process_madvise(), this is fatal, as the operation
> 	 will stop immediately upon this occurring."
> 
> This is somewhat bizarre behaviour. You specify multiple vector entries
> spanning different ranges, but one spans some unmapped space and now the
> whole process_madvise() operation grinds to a halt, except the vector entry
> containing ranges including unmapped space is completed.
> 
> This is strange behaviour, and it makes sense to me to optionally disable
> this.
> 
> If you were looping around doing an madvise(), this would _not_ occur, you
> could filter out the -ENOMEM's. It's a really weird peculiarity in
> process_madvise().
> 
> Moreover, you might not want an error reported, that possibly duplicates
> _real_ -ENOMEM errors, when you simply encounter unmapped addresses.
> 
> Finally, if you perform an operation across the entire address space as
> proposed you may wish to stop on actual error but not on the (inevitable at
> least in 64-bit space) gaps you'll encounter.

So, we *may* wish to stop on actual error, do you have a more concrete
example? We should not add an API on a case which may be needed. We can
always add stuff later when the actual concrete use-cases come up.

> 
> >
> > >
> > > 3. PMADV_SET_FORK_EXEC_DEFAULT
> > >
> > > It may be desirable for a user to specify that all VMAs mapped in a process
> > > address space default to having an madvise() behaviour established by
> > > default, in such a fashion as that this persists across fork/exec.
> > >
> > > Since this is a very powerful option that would make no sense for many
> > > advice modes, we explicitly only permit known-safe flags here (currently
> > > MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
> >
> > Other flags seems general enough but this one is just weird. This is
> > exactly the scenario for prctl() like interface. You are trying to make
> > process_madvise() like prctl() and I can see process_madvise() would be
> > included in all the hate that prctl() receives.
> 
> I'm really not sure what you mean. prctl() has no rhyme nor reason, so not
> sure what a 'prctl() like interface' means here, and you're not explaining
> what you mean by that.

I meant it applies a property at the task or process level and has
examples where those properties are inherited to children.

> 
> Presumably you mean you find this odd as you feel it sits outside the realm
> of madvise() behaviour.

The persistence across exec seems weird.

> 
> But I'd suggest it does not - the idea is to align _everything_ with
> madvise(). Rather than adding an entirely arbitrary function in prctl(), we
> are going the other way - keeping everything relating to madvise()-like
> modification of memory _in mm_ and _in madvise()_, rather than bitrotting
> away in kernel/sys.c.

The above para seems like you are talking about code which can be moved
to mm.

> 
> So we get alignment in the fact that we're saying 'we establish a _default_
> madvise flag for a process'.

I think this is an important point. So, we want to introduce a way to
set a process level property which can be inherited through fork/exec.
With that in mind, is process_madvise() (or even madvise()) really a
good interface for it? There is no need for address ranges for such
use-case.

> 
> We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what
> you guys at meta want while also opening the door to doing so in future if
> it makes sense to.

Please drop the "you guys at meta". We should be aiming for what is good
for all (or most) linux users. Whatever is done here will be
incorporated in systemd which will be used very widely.

> 
> This couldn't be more different than putting some arbitrary code relating
> to mm in the 'netherrealm' of prctl().
> 
> 
> >
> > Let me ask in a different way. Eventually we want to be in a state where
> > hugepages works out of the box for all workloads. In that state what
> > would the need for this flag unless you have use-cases other than
> > hugepages. To me, there is a general consensus that prctl is a hacky
> > interface, so having some intermediate solution through prctl until
> > hugepages are good out of the box seems more reasonable.
> 
> No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will
> have to be supported. In a future where things are automatic, these may be
> no-ops in 'auto' mode.
> 
> But the requirement to have these flags will still exist, the requirement
> to do madvise() operations will still exist, we're simply expanding this
> functionality.
> 
> The problem arises the other way around when we shovel mm stuff in
> kernel/sys.c.

I think you mixing the location of the code and the interface which will
remain relevant long term. I don't see process_madvise (or madvise) good
interface for this specific functionality (i.e. process level property
that gets inherited through fork/exec). Now we can add a new syscall for
this but to me, particularly for hugepage, this functionality is needed
temporarily until hugepages are good out of the box. However if there is
a use-case other than hugepages then yes, why not a new syscall.

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Liam R. Howlett 8 months, 3 weeks ago

* Shakeel Butt <shakeel.butt@linux.dev> [250520 15:49]:
> On Tue, May 20, 2025 at 07:45:43PM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote:
> > > On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> > > > REVIEWERS NOTES:
> > > > ================
> > > >
> > > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> > > > 'putting it out there' for feedback. Any serious version of this will add a
> > > > bunch of self-tests to assert correct behaviour and I will more carefully
> > > > confirm everything's working.
> > > >
> > > > This is based on discussion arising from Usama's series [0], SJ's input on
> > > > the thread around process_madvise() behaviour [1] (and a subsequent
> > > > response by me [2]) and prior discussion about a new madvise() interface
> > > > [3].
> > > >
> > > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> > > > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> > > > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> > > > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> > > >
> > > > ================

...

> > > >
> > > > 3. PMADV_SET_FORK_EXEC_DEFAULT
> > > >
> > > > It may be desirable for a user to specify that all VMAs mapped in a process
> > > > address space default to having an madvise() behaviour established by
> > > > default, in such a fashion as that this persists across fork/exec.
> > > >
> > > > Since this is a very powerful option that would make no sense for many
> > > > advice modes, we explicitly only permit known-safe flags here (currently
> > > > MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
> > >
> > > Other flags seems general enough but this one is just weird. This is
> > > exactly the scenario for prctl() like interface. You are trying to make
> > > process_madvise() like prctl() and I can see process_madvise() would be
> > > included in all the hate that prctl() receives.
> > 
> > I'm really not sure what you mean. prctl() has no rhyme nor reason, so not
> > sure what a 'prctl() like interface' means here, and you're not explaining
> > what you mean by that.
> 
> I meant it applies a property at the task or process level and has
> examples where those properties are inherited to children.

Okay, I don't think we should have any illusions of how clear cut either
of these things are. prctl has PR_SET_VMA, which isn't to to do with
the process.

madvise() can be called with a range covering the entire address space
and may or may not fail on gaps (depending on the call).

process_madvise() was to madvise on a process range - including all of
it, so I don't see either as the wrong place.  Or, really, the best
place.

> 
> > 
> > Presumably you mean you find this odd as you feel it sits outside the realm
> > of madvise() behaviour.
> 
> The persistence across exec seems weird.
> 
> > 
> > But I'd suggest it does not - the idea is to align _everything_ with
> > madvise(). Rather than adding an entirely arbitrary function in prctl(), we
> > are going the other way - keeping everything relating to madvise()-like
> > modification of memory _in mm_ and _in madvise()_, rather than bitrotting
> > away in kernel/sys.c.
> 
> The above para seems like you are talking about code which can be moved
> to mm.
> 
> > 
> > So we get alignment in the fact that we're saying 'we establish a _default_
> > madvise flag for a process'.
> 
> I think this is an important point. So, we want to introduce a way to
> set a process level property which can be inherited through fork/exec.
> With that in mind, is process_madvise() (or even madvise()) really a
> good interface for it? There is no need for address ranges for such
> use-case.
> 
> > 
> > We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what
> > you guys at meta want while also opening the door to doing so in future if
> > it makes sense to.
> 
> Please drop the "you guys at meta". We should be aiming for what is good
> for all (or most) linux users. Whatever is done here will be
> incorporated in systemd which will be used very widely.

Agreed, the aim is to provide the most flexible interface for the
majority of the users.  In my view, there isn't a clear way forward yet:
I don't want to make process_madvise() a dumping ground and I don't want
to make prctl() a bigger mess.

But right now, it seems there is more than one user arguing for this
particular implementation, but in fact there two employees at the same
company and that isn't clear in this conversation to the casual reader.
And the opinions are being cross-referenced as evidence of how others
view it.  I have no idea if you two know each other, or even if you know
that you are both at meta - but I think if I chimed in from some other
email address saying how crazy prctl() is over this other way, then
you'd probably feel compelled to say something about where I work (I
hope you would tbh)?

You could add a (meta) to your email address or signature so that it is
implicit that there may be other influences on opinion.  I'm not saying
there is an influence on opinion or that this opinion is wrong, but it
is not fully transparent and complicates the conversation.

I don't think it is unreasonable for people to point out that everyone
arguing for one solution is at the same company when it isn't obvious,
do you?

> 
> > 
> > This couldn't be more different than putting some arbitrary code relating
> > to mm in the 'netherrealm' of prctl().
> > 
> > 
> > >
> > > Let me ask in a different way. Eventually we want to be in a state where
> > > hugepages works out of the box for all workloads. In that state what
> > > would the need for this flag unless you have use-cases other than
> > > hugepages. To me, there is a general consensus that prctl is a hacky
> > > interface, so having some intermediate solution through prctl until
> > > hugepages are good out of the box seems more reasonable.

(oh my, I'm going to sound like a jerk here.. sorry)

This is a terrible argument.  This intermediate solution may outlive us
all.. it may even shorten some of our lives due to stress dealing with
it.  Will the removal of the prctl() call coincide with the year of the
Linux desktop?  Or maybe the removal of the mmap lock... wait.

The hacky interface of prctl() is one of my main gripes with that
proposal, it's not a reason to go with it.

> > 
> > No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will
> > have to be supported. In a future where things are automatic, these may be
> > no-ops in 'auto' mode.
> > 
> > But the requirement to have these flags will still exist, the requirement
> > to do madvise() operations will still exist, we're simply expanding this
> > functionality.
> > 
> > The problem arises the other way around when we shovel mm stuff in
> > kernel/sys.c.
> 
> I think you mixing the location of the code and the interface which will
> remain relevant long term. I don't see process_madvise (or madvise) good
> interface for this specific functionality (i.e. process level property
> that gets inherited through fork/exec).

Process is literally in the name and we are applying an madvise flag
across the entire range.

The inherited through fork/exec is a much stronger reason that I see for
not doing it in process_madvise().

> Now we can add a new syscall for
> this but to me, particularly for hugepage, this functionality is needed
> temporarily until hugepages are good out of the box.

I'm not holding my breath.

> However if there is
> a use-case other than hugepages then yes, why not a new syscall.
> 

We also have Barry looking at prctl() for avoiding LRU updates on exit
[1], which I guess is process related.. or page related.. really it is
mm flags related in his implementation, sort of like this feature.  But
it will probably only be used by android, so...

[1]. https://lore.kernel.org/all/20250514070820.51793-1-21cnbao@gmail.com/

Thanks,
Liam

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Lorenzo Stoakes 8 months, 3 weeks ago

On Tue, May 20, 2025 at 12:49:24PM -0700, Shakeel Butt wrote:
> On Tue, May 20, 2025 at 07:45:43PM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote:
> > > On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> > > > REVIEWERS NOTES:
> > > > ================
> > > >
> > > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> > > > 'putting it out there' for feedback. Any serious version of this will add a
> > > > bunch of self-tests to assert correct behaviour and I will more carefully
> > > > confirm everything's working.
> > > >
> > > > This is based on discussion arising from Usama's series [0], SJ's input on
> > > > the thread around process_madvise() behaviour [1] (and a subsequent
> > > > response by me [2]) and prior discussion about a new madvise() interface
> > > > [3].
> > > >
> > > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> > > > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> > > > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> > > > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> > > >
> > > > ================
> > > >
> > > > Currently, we are rather restricted in how madvise() operations
> > > > proceed. While effort has been put in to expanding what process_madvise()
> > > > can do (that is - unrestricted application of advice to the local process
> > > > alongside recent improvements on the efficiency of TLB operations over
> > > > these batvches), we are still constrained by existing madvise() limitations
> > > > and default behaviours.
> > > >
> > > > This series makes use of the currently unused flags field in
> > > > process_madvise() to provide more flexiblity.
> > > >
> > > > It introduces four flags:
> > > >
> > > > 1. PMADV_SKIP_ERRORS
> > > >
> > > > Currently, when an error arises applying advice in any individual VMA
> > > > (keeping in mind that a range specified to madvise() or as part of the
> > > > iovec passed to process_madvise()), the operation stops where it is and
> > > > returns an error.
> > > >
> > > > This might not be the desired behaviour of the user, who may wish instead
> > > > for the operation to be 'best effort'. By setting this flag, that behaviour
> > > > is obtained.
> > > >
> > > > Since process_madvise() would trivially, if skipping errors, simply return
> > > > the input vector size, we instead return the number of entries in the
> > > > vector which completed successfully without error.
> > > >
> > > > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> > > >
> > > > 2. PMADV_NO_ERROR_ON_UNMAPPED
> > > >
> > > > Currently madvise() has the peculiar behaviour of, if the range specified
> > > > to it contains unmapped range(s), completing the full operation, but
> > > > ultimately returning -ENOMEM.
> > > >
> > > > In the case of process_madvise(), this is fatal, as the operation will stop
> > > > immediately upon this occurring.
> > > >
> > > > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> > > > unmapped areas to simply be entirely ignored.
> > >
> > > Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why
> > > PMADV_SKIP_ERRORS is not enough? I don't see a need for
> > > PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where
> > > PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS?
> >
> > I thought I already explained this above:
> >
> > 	"In the case of process_madvise(), this is fatal, as the operation
> > 	 will stop immediately upon this occurring."
> >
> > This is somewhat bizarre behaviour. You specify multiple vector entries
> > spanning different ranges, but one spans some unmapped space and now the
> > whole process_madvise() operation grinds to a halt, except the vector entry
> > containing ranges including unmapped space is completed.
> >
> > This is strange behaviour, and it makes sense to me to optionally disable
> > this.
> >
> > If you were looping around doing an madvise(), this would _not_ occur, you
> > could filter out the -ENOMEM's. It's a really weird peculiarity in
> > process_madvise().
> >
> > Moreover, you might not want an error reported, that possibly duplicates
> > _real_ -ENOMEM errors, when you simply encounter unmapped addresses.
> >
> > Finally, if you perform an operation across the entire address space as
> > proposed you may wish to stop on actual error but not on the (inevitable at
> > least in 64-bit space) gaps you'll encounter.
>
> So, we *may* wish to stop on actual error, do you have a more concrete
> example? We should not add an API on a case which may be needed. We can
> always add stuff later when the actual concrete use-cases come up.

I feel like I just gave a concrete example?

It's useful to not have to be absolutely sure that the range specified
includes no unmapped ranges.

It's required for the MADV_[NO]HUGEPAGE use case proposed, specifically
applying the operation to the entire address space.

But I think it might make this a lot more concrete when I write tests - as
these will 'concretise' the interface and provide examples.

We can also choose to 'hide' this from users and add it back in as you say.

Perhaps worth just waiting for a respin with tests to see this in action.

>
> >
> > >
> > > >
> > > > 3. PMADV_SET_FORK_EXEC_DEFAULT
> > > >
> > > > It may be desirable for a user to specify that all VMAs mapped in a process
> > > > address space default to having an madvise() behaviour established by
> > > > default, in such a fashion as that this persists across fork/exec.
> > > >
> > > > Since this is a very powerful option that would make no sense for many
> > > > advice modes, we explicitly only permit known-safe flags here (currently
> > > > MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
> > >
> > > Other flags seems general enough but this one is just weird. This is
> > > exactly the scenario for prctl() like interface. You are trying to make
> > > process_madvise() like prctl() and I can see process_madvise() would be
> > > included in all the hate that prctl() receives.
> >
> > I'm really not sure what you mean. prctl() has no rhyme nor reason, so not
> > sure what a 'prctl() like interface' means here, and you're not explaining
> > what you mean by that.
>
> I meant it applies a property at the task or process level and has
> examples where those properties are inherited to children.

But de-facto several prctl() interfaces do not do this, and mm interfaces
like mlockall() for instance do do this?

_In practice_ prctl() is a random hodge-podge of stuff, subject to bitrot.

>
> >
> > Presumably you mean you find this odd as you feel it sits outside the realm
> > of madvise() behaviour.
>
> The persistence across exec seems weird.

OK I'm not quite sure how to quantify 'weird'?

As I argue below, the idea here is we're doing 'madvise by default'. So you
can either have prctl() invoke madvise() for some stuff, and then establish
some 'madvise by default' logic, or we can do it the other way, by doing
_as much as possible_ madvise() stuff in madvise, and add the
default-across-exec there as a highly controlled, very clear flag.

I continue to believe the latter is cleaner, more maintainable, and less
subject to bitrot.

And I would argue invoking madvise() from prctl() is similarly odd (though
pretty much everything that happens in prctl() is, by de-facto definition,
sort of odd :)

>
> >
> > But I'd suggest it does not - the idea is to align _everything_ with
> > madvise(). Rather than adding an entirely arbitrary function in prctl(), we
> > are going the other way - keeping everything relating to madvise()-like
> > modification of memory _in mm_ and _in madvise()_, rather than bitrotting
> > away in kernel/sys.c.
>
> The above para seems like you are talking about code which can be moved
> to mm.
>
> >
> > So we get alignment in the fact that we're saying 'we establish a _default_
> > madvise flag for a process'.
>
> I think this is an important point. So, we want to introduce a way to
> set a process level property which can be inherited through fork/exec.
> With that in mind, is process_madvise() (or even madvise()) really a
> good interface for it? There is no need for address ranges for such
> use-case.
>
> >
> > We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what
> > you guys at meta want while also opening the door to doing so in future if
> > it makes sense to.
>
> Please drop the "you guys at meta". We should be aiming for what is good
> for all (or most) linux users. Whatever is done here will be
> incorporated in systemd which will be used very widely.

...!

I've spent several hours doing review of Usama's series and proposing this
idea precisely to serve the community. I would ask you to please respect
that.

The point I'm making here re: you guys, is that we can both serve the
community and solve your problem - because aiming at both is _the only way
this change will get merged_.

I am doing my absolute best to try to reach this end.

Re: systemd, I'm not sure what you mean - has there been an indication that
they will using this? I'm not sure they make use of every prctl() interface
do they?

>
> >
> > This couldn't be more different than putting some arbitrary code relating
> > to mm in the 'netherrealm' of prctl().
> >
> >
> > >
> > > Let me ask in a different way. Eventually we want to be in a state where
> > > hugepages works out of the box for all workloads. In that state what
> > > would the need for this flag unless you have use-cases other than
> > > hugepages. To me, there is a general consensus that prctl is a hacky
> > > interface, so having some intermediate solution through prctl until
> > > hugepages are good out of the box seems more reasonable.
> >
> > No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will
> > have to be supported. In a future where things are automatic, these may be
> > no-ops in 'auto' mode.
> >
> > But the requirement to have these flags will still exist, the requirement
> > to do madvise() operations will still exist, we're simply expanding this
> > functionality.
> >
> > The problem arises the other way around when we shovel mm stuff in
> > kernel/sys.c.
>
> I think you mixing the location of the code and the interface which will
> remain relevant long term. I don't see process_madvise (or madvise) good
> interface for this specific functionality (i.e. process level property
> that gets inherited through fork/exec). Now we can add a new syscall for
> this but to me, particularly for hugepage, this functionality is needed
> temporarily until hugepages are good out of the box. However if there is
> a use-case other than hugepages then yes, why not a new syscall.
>

I disagree on several levels. Firstly of course 'process' is a vague term
here, you mean address space, rather mm_struct.

And really you're saying 'it's ok to associate mm_strut and madvise
specific stuff with prctl() but not ok to associate mm_struct stuff with
madvise code' which not quite compelling to me.

Overall you can view this approach as - 'we are making an madvise() flag a
default, optionally'.

And this justifies us therefore doing this via an madvise() API call.

It also further allows us to expand the capabilities of madvise() for free
- we can address long-standing issues around what is possible with these
system calls while also providing this interface. The idea is it's win/win.

Otherwise we're simply doing a bunch of madvise() stuff from prctl() in the
same way that PR_SET_VMA_ANON_NAME for instance (hey - that's a VMA-level
feature! What's that doing in prctl() :) - where we have a bunch of mm code
living over there, and we have to export mm-internal stuff to kernel/sys.c
and it's a mess.

As to the temporary nature of this stuff, you seem to have disregarded what
I've said here rather in relation to the persistence of the
MADV_[NO]HUGEPAGE flags which are required as uAPI. They therefore must
remain, and thus I don't find the argument compelling re: prctl() being
better suited. It seems more likely things will bitrot there?

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Shakeel Butt 8 months, 3 weeks ago

On Tue, May 20, 2025 at 09:39:37PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 12:49:24PM -0700, Shakeel Butt wrote:
> > On Tue, May 20, 2025 at 07:45:43PM +0100, Lorenzo Stoakes wrote:
> > > On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote:
> > > > On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> > > > > REVIEWERS NOTES:
> > > > > ================
> > > > >
> > > > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> > > > > 'putting it out there' for feedback. Any serious version of this will add a
> > > > > bunch of self-tests to assert correct behaviour and I will more carefully
> > > > > confirm everything's working.
> > > > >
> > > > > This is based on discussion arising from Usama's series [0], SJ's input on
> > > > > the thread around process_madvise() behaviour [1] (and a subsequent
> > > > > response by me [2]) and prior discussion about a new madvise() interface
> > > > > [3].
> > > > >
> > > > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> > > > > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> > > > > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> > > > > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> > > > >
> > > > > ================
> > > > >
> > > > > Currently, we are rather restricted in how madvise() operations
> > > > > proceed. While effort has been put in to expanding what process_madvise()
> > > > > can do (that is - unrestricted application of advice to the local process
> > > > > alongside recent improvements on the efficiency of TLB operations over
> > > > > these batvches), we are still constrained by existing madvise() limitations
> > > > > and default behaviours.
> > > > >
> > > > > This series makes use of the currently unused flags field in
> > > > > process_madvise() to provide more flexiblity.
> > > > >
> > > > > It introduces four flags:
> > > > >
> > > > > 1. PMADV_SKIP_ERRORS
> > > > >
> > > > > Currently, when an error arises applying advice in any individual VMA
> > > > > (keeping in mind that a range specified to madvise() or as part of the
> > > > > iovec passed to process_madvise()), the operation stops where it is and
> > > > > returns an error.
> > > > >
> > > > > This might not be the desired behaviour of the user, who may wish instead
> > > > > for the operation to be 'best effort'. By setting this flag, that behaviour
> > > > > is obtained.
> > > > >
> > > > > Since process_madvise() would trivially, if skipping errors, simply return
> > > > > the input vector size, we instead return the number of entries in the
> > > > > vector which completed successfully without error.
> > > > >
> > > > > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> > > > >
> > > > > 2. PMADV_NO_ERROR_ON_UNMAPPED
> > > > >
> > > > > Currently madvise() has the peculiar behaviour of, if the range specified
> > > > > to it contains unmapped range(s), completing the full operation, but
> > > > > ultimately returning -ENOMEM.
> > > > >
> > > > > In the case of process_madvise(), this is fatal, as the operation will stop
> > > > > immediately upon this occurring.
> > > > >
> > > > > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> > > > > unmapped areas to simply be entirely ignored.
> > > >
> > > > Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why
> > > > PMADV_SKIP_ERRORS is not enough? I don't see a need for
> > > > PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where
> > > > PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS?
> > >
> > > I thought I already explained this above:
> > >
> > > 	"In the case of process_madvise(), this is fatal, as the operation
> > > 	 will stop immediately upon this occurring."
> > >
> > > This is somewhat bizarre behaviour. You specify multiple vector entries
> > > spanning different ranges, but one spans some unmapped space and now the
> > > whole process_madvise() operation grinds to a halt, except the vector entry
> > > containing ranges including unmapped space is completed.
> > >
> > > This is strange behaviour, and it makes sense to me to optionally disable
> > > this.
> > >
> > > If you were looping around doing an madvise(), this would _not_ occur, you
> > > could filter out the -ENOMEM's. It's a really weird peculiarity in
> > > process_madvise().
> > >
> > > Moreover, you might not want an error reported, that possibly duplicates
> > > _real_ -ENOMEM errors, when you simply encounter unmapped addresses.
> > >
> > > Finally, if you perform an operation across the entire address space as
> > > proposed you may wish to stop on actual error but not on the (inevitable at
> > > least in 64-bit space) gaps you'll encounter.
> >
> > So, we *may* wish to stop on actual error, do you have a more concrete
> > example? We should not add an API on a case which may be needed. We can
> > always add stuff later when the actual concrete use-cases come up.
> 
> I feel like I just gave a concrete example?
> 
> It's useful to not have to be absolutely sure that the range specified
> includes no unmapped ranges.
> 
> It's required for the MADV_[NO]HUGEPAGE use case proposed, specifically
> applying the operation to the entire address space.
> 
> But I think it might make this a lot more concrete when I write tests - as
> these will 'concretise' the interface and provide examples.
> 
> We can also choose to 'hide' this from users and add it back in as you say.
> 
> Perhaps worth just waiting for a respin with tests to see this in action.

Yes, that makes sense.

> 
> >
> > >
> > > >
> > > > >
> > > > > 3. PMADV_SET_FORK_EXEC_DEFAULT
> > > > >
> > > > > It may be desirable for a user to specify that all VMAs mapped in a process
> > > > > address space default to having an madvise() behaviour established by
> > > > > default, in such a fashion as that this persists across fork/exec.
> > > > >
> > > > > Since this is a very powerful option that would make no sense for many
> > > > > advice modes, we explicitly only permit known-safe flags here (currently
> > > > > MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
> > > >
> > > > Other flags seems general enough but this one is just weird. This is
> > > > exactly the scenario for prctl() like interface. You are trying to make
> > > > process_madvise() like prctl() and I can see process_madvise() would be
> > > > included in all the hate that prctl() receives.
> > >
> > > I'm really not sure what you mean. prctl() has no rhyme nor reason, so not
> > > sure what a 'prctl() like interface' means here, and you're not explaining
> > > what you mean by that.
> >
> > I meant it applies a property at the task or process level and has
> > examples where those properties are inherited to children.
> 
> But de-facto several prctl() interfaces do not do this, and mm interfaces
> like mlockall() for instance do do this?
> 
> _In practice_ prctl() is a random hodge-podge of stuff, subject to bitrot.
> 

I don't disagree.

> >
> > >
> > > Presumably you mean you find this odd as you feel it sits outside the realm
> > > of madvise() behaviour.
> >
> > The persistence across exec seems weird.
> 
> OK I'm not quite sure how to quantify 'weird'?
> 
> As I argue below, the idea here is we're doing 'madvise by default'. So you
> can either have prctl() invoke madvise() for some stuff, and then establish
> some 'madvise by default' logic, or we can do it the other way, by doing
> _as much as possible_ madvise() stuff in madvise, and add the
> default-across-exec there as a highly controlled, very clear flag.
> 
> I continue to believe the latter is cleaner, more maintainable, and less
> subject to bitrot.
> 
> And I would argue invoking madvise() from prctl() is similarly odd (though
> pretty much everything that happens in prctl() is, by de-facto definition,
> sort of odd :)
> 

Yes, prctl() is weird as well but see my point at the end.

> >
> > >
> > > But I'd suggest it does not - the idea is to align _everything_ with
> > > madvise(). Rather than adding an entirely arbitrary function in prctl(), we
> > > are going the other way - keeping everything relating to madvise()-like
> > > modification of memory _in mm_ and _in madvise()_, rather than bitrotting
> > > away in kernel/sys.c.
> >
> > The above para seems like you are talking about code which can be moved
> > to mm.
> >
> > >
> > > So we get alignment in the fact that we're saying 'we establish a _default_
> > > madvise flag for a process'.
> >
> > I think this is an important point. So, we want to introduce a way to
> > set a process level property which can be inherited through fork/exec.
> > With that in mind, is process_madvise() (or even madvise()) really a
> > good interface for it? There is no need for address ranges for such
> > use-case.
> >
> > >
> > > We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what
> > > you guys at meta want while also opening the door to doing so in future if
> > > it makes sense to.
> >
> > Please drop the "you guys at meta". We should be aiming for what is good
> > for all (or most) linux users. Whatever is done here will be
> > incorporated in systemd which will be used very widely.
> 
> ...!
> 
> I've spent several hours doing review of Usama's series and proposing this
> idea precisely to serve the community. I would ask you to please respect
> that.
> 
> The point I'm making here re: you guys, is that we can both serve the
> community and solve your problem - because aiming at both is _the only way
> this change will get merged_.
> 
> I am doing my absolute best to try to reach this end.

Thanks for doing that. My point was having a robust long term solution
is preferrable over any specific company's preferred solution. So, it is
ok to push back if you feel that we are pushing a solution which will
benefit only a single or small set of users.

> 
> Re: systemd, I'm not sure what you mean - has there been an indication that
> they will using this? I'm not sure they make use of every prctl() interface
> do they?

We plan to add this in systemd which already has similar support for
ksm.

> 
> >
> > >
> > > This couldn't be more different than putting some arbitrary code relating
> > > to mm in the 'netherrealm' of prctl().
> > >
> > >
> > > >
> > > > Let me ask in a different way. Eventually we want to be in a state where
> > > > hugepages works out of the box for all workloads. In that state what
> > > > would the need for this flag unless you have use-cases other than
> > > > hugepages. To me, there is a general consensus that prctl is a hacky
> > > > interface, so having some intermediate solution through prctl until
> > > > hugepages are good out of the box seems more reasonable.
> > >
> > > No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will
> > > have to be supported. In a future where things are automatic, these may be
> > > no-ops in 'auto' mode.
> > >
> > > But the requirement to have these flags will still exist, the requirement
> > > to do madvise() operations will still exist, we're simply expanding this
> > > functionality.
> > >
> > > The problem arises the other way around when we shovel mm stuff in
> > > kernel/sys.c.
> >
> > I think you mixing the location of the code and the interface which will
> > remain relevant long term. I don't see process_madvise (or madvise) good
> > interface for this specific functionality (i.e. process level property
> > that gets inherited through fork/exec). Now we can add a new syscall for
> > this but to me, particularly for hugepage, this functionality is needed
> > temporarily until hugepages are good out of the box. However if there is
> > a use-case other than hugepages then yes, why not a new syscall.
> >
> 
> I disagree on several levels. Firstly of course 'process' is a vague term
> here, you mean address space, rather mm_struct.
> 
> And really you're saying 'it's ok to associate mm_strut and madvise
> specific stuff with prctl() but not ok to associate mm_struct stuff with
> madvise code' which not quite compelling to me.
> 
> Overall you can view this approach as - 'we are making an madvise() flag a
> default, optionally'.
> 
> And this justifies us therefore doing this via an madvise() API call.
> 
> It also further allows us to expand the capabilities of madvise() for free
> - we can address long-standing issues around what is possible with these
> system calls while also providing this interface. The idea is it's win/win.
> 
> Otherwise we're simply doing a bunch of madvise() stuff from prctl() in the
> same way that PR_SET_VMA_ANON_NAME for instance (hey - that's a VMA-level
> feature! What's that doing in prctl() :) - where we have a bunch of mm code
> living over there, and we have to export mm-internal stuff to kernel/sys.c
> and it's a mess.
> 
> As to the temporary nature of this stuff, you seem to have disregarded what
> I've said here rather in relation to the persistence of the
> MADV_[NO]HUGEPAGE flags which are required as uAPI. They therefore must
> remain, and thus I don't find the argument compelling re: prctl() being
> better suited. It seems more likely things will bitrot there?

I think we are talking past each other (and I blame myself for that). Let
me try again. (Please keep aside prctl or process_madvise). We need a
way to change the policy of a process (job) and at the moment we are
aligned that the job loader (like systemd) will set that policy and load
the target (fork/exec), so the policy persist across fork/exec. (If
someone has a better way to set the policy without race, please let us
know).

My argument is that process_madvise() is not a good interface to set
that policy because of its address range like interface. So, if not
process_madvise() then what? Should we add a new syscall? (BTW we had
very similar discussion on process_madvise(DONTNEED) on a remote process
vs a new syscall i.e. process_mrelease()).

Adding a new syscall requires that it should be generally useful and
hopefully have more use-cases. Now going back to the current specific
use-case where we want to override the hugepage related policy of a job,
do we expect to use this override forever? I believe this is temporary
because the only reason we need this is because hugepages are not yet
ready for prime time (many applications do not handle them well). In
future when hugepages will be awesome, do we still need this "override
the hugepage policy" syscall?

Now if we can show that this specific functionality is useful more than
hugepages then I think new syscall seems like the best way forward.
However if we think this functionality is only needed temporarily then
shoving it in prctl() seems reasonable to me. If we really don't want
prctl() based solution, I would recommend to discuss the new syscall
approach and see if we can comeup with a more general solution.

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Lorenzo Stoakes 8 months, 3 weeks ago

On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote:

[snip for clarity]

> I think we are talking past each other (and I blame myself for that). Let
> me try again. (Please keep aside prctl or process_madvise). We need a
> way to change the policy of a process (job) and at the moment we are
> aligned that the job loader (like systemd) will set that policy and load
> the target (fork/exec), so the policy persist across fork/exec. (If
> someone has a better way to set the policy without race, please let us
> know).

Ack, totally agree the kernel currently lacks a cohesive story for this
'adjust POLICY of a process and descendents', we have cgroups, but they're
more general than a process, we have namespaces, but that's for restricting
resources...

I think we all want the same thing here, ultimately.

>
> My argument is that process_madvise() is not a good interface to set
> that policy because of its address range like interface. So, if not
> process_madvise() then what? Should we add a new syscall? (BTW we had
> very similar discussion on process_madvise(DONTNEED) on a remote process
> vs a new syscall i.e. process_mrelease()).

Sure, and generally both proposed interfaces are at least _awkward_, for me
prctl() is a no-go unless we have no other choice, I won't go over my
objections to it yet again (and Liam has also raised his of course).

>
> Adding a new syscall requires that it should be generally useful and
> hopefully have more use-cases. Now going back to the current specific
> use-case where we want to override the hugepage related policy of a job,
> do we expect to use this override forever? I believe this is temporary
> because the only reason we need this is because hugepages are not yet
> ready for prime time (many applications do not handle them well). In
> future when hugepages will be awesome, do we still need this "override
> the hugepage policy" syscall?

As argued previously, I am not so sure it'll be temporary, given the
proposed future 'auto' mode will be a _mode_ and we will need to support
VM_[NO]HUGEPAGE scenarios forever (deep, deep sigh).

Also if you add it into systemd it definitely won't be right? There's no
'throwaway' here, and scouring through prctl() (what is actually documented
:), I am not sure anything ever is, frankly.

So the idea is to try to make this as generic as possible and to have it
sit with code it makes sense to sit with.

>
> Now if we can show that this specific functionality is useful more than
> hugepages then I think new syscall seems like the best way forward.
> However if we think this functionality is only needed temporarily then
> shoving it in prctl() seems reasonable to me. If we really don't want
> prctl() based solution, I would recommend to discuss the new syscall
> approach and see if we can comeup with a more general solution.
>

So, something Liam mentioned off-list was the beautifully named
'mmadvise()'. Idea being that we have a system call _explicitly for_
mm-wide modifications.

With Barry's series doing a prctl() for something similar, and a whole host
of mm->flags existing for modifying behaviour, it would seem a natural fit.

I could do a respin that does something like this instead.

What's a pity to me re: going away from process_madvise() is losing the
opportunity to be able to modify the, frankly broken, gaps handling and
also being able to do 'best effort' madvise ranges.

But I suppose those can always be separate series... :)

I guess let me work that up so we can see how that looks?

Cheers, Lorenzo

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Johannes Weiner 8 months, 3 weeks ago

On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
> So, something Liam mentioned off-list was the beautifully named
> 'mmadvise()'. Idea being that we have a system call _explicitly for_
> mm-wide modifications.
> 
> With Barry's series doing a prctl() for something similar, and a whole host
> of mm->flags existing for modifying behaviour, it would seem a natural fit.

That's an interesting idea.

So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it
might also be a good fit for the coredump stuff and ksm if we wanted
to incorporate them into that (although it would duplicate the
existing proc/prctl knobs). The other MMF_s are internal AFAICS.

I think my main concern would be making something very generic and
versatile without having sufficiently broad/popular usecases for it.

But no strong feelings either way. Like I said, I don't have a strong
dislike for prctl(), but this idea would obviously be cleaner if we
think there is enough of a demand for a new syscall.

> I guess let me work that up so we can see how that looks?

I think it's worth exploring!

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Mike Rapoport 8 months, 3 weeks ago

On Wed, May 21, 2025 at 01:32:00PM -0400, Johannes Weiner wrote:
> On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
> > So, something Liam mentioned off-list was the beautifully named
> > 'mmadvise()'. Idea being that we have a system call _explicitly for_
> > mm-wide modifications.
> > 
> > With Barry's series doing a prctl() for something similar, and a whole host
> > of mm->flags existing for modifying behaviour, it would seem a natural fit.
> 
> That's an interesting idea.
> 
> So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it
> might also be a good fit for the coredump stuff and ksm if we wanted
> to incorporate them into that (although it would duplicate the
> existing proc/prctl knobs). The other MMF_s are internal AFAICS.
> 
> I think my main concern would be making something very generic and
> versatile without having sufficiently broad/popular usecases for it.
> 
> But no strong feelings either way. Like I said, I don't have a strong
> dislike for prctl(), but this idea would obviously be cleaner if we
> think there is enough of a demand for a new syscall.

To me it seems like having a "global mm control" system call makes much
more sense that adding more arms to prctl or overloading process_madvise().

With a dedicated syscall it's much clearer that the operation targets an mm
and it works for the entire mm.
And two usescase seem enough to me to justify a new syscall.

And it's easier to reason about a dedicated syscall designed for a certain
operation that for multiplexed ioctl() style controls.

> > I guess let me work that up so we can see how that looks?
> 
> I think it's worth exploring!
> 

-- 
Sincerely yours,
Mike.

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Lorenzo Stoakes 8 months, 3 weeks ago

On Thu, May 22, 2025 at 06:32:11PM +0300, Mike Rapoport wrote:
> On Wed, May 21, 2025 at 01:32:00PM -0400, Johannes Weiner wrote:
> > On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
> > > So, something Liam mentioned off-list was the beautifully named
> > > 'mmadvise()'. Idea being that we have a system call _explicitly for_
> > > mm-wide modifications.
> > >
> > > With Barry's series doing a prctl() for something similar, and a whole host
> > > of mm->flags existing for modifying behaviour, it would seem a natural fit.
> >
> > That's an interesting idea.
> >
> > So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it
> > might also be a good fit for the coredump stuff and ksm if we wanted
> > to incorporate them into that (although it would duplicate the
> > existing proc/prctl knobs). The other MMF_s are internal AFAICS.
> >
> > I think my main concern would be making something very generic and
> > versatile without having sufficiently broad/popular usecases for it.
> >
> > But no strong feelings either way. Like I said, I don't have a strong
> > dislike for prctl(), but this idea would obviously be cleaner if we
> > think there is enough of a demand for a new syscall.
>
> To me it seems like having a "global mm control" system call makes much
> more sense that adding more arms to prctl or overloading process_madvise().

Agreed yeah.

>
> With a dedicated syscall it's much clearer that the operation targets an mm
> and it works for the entire mm.
> And two usescase seem enough to me to justify a new syscall.

Yes I think so!

>
> And it's easier to reason about a dedicated syscall designed for a certain
> operation that for multiplexed ioctl() style controls.

Yes!

>
> > > I guess let me work that up so we can see how that looks?
> >
> > I think it's worth exploring!
> >
>
> --
> Sincerely yours,
> Mike.

Thanks, I will distribute a proposed mctl() prototype API to people here (+ cc
you + linux-api also) so we can use that as a basis to assess this approach.

Thanks!

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by David Hildenbrand 8 months, 3 weeks ago

On 21.05.25 19:32, Johannes Weiner wrote:
> On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
>> So, something Liam mentioned off-list was the beautifully named
>> 'mmadvise()'. Idea being that we have a system call _explicitly for_
>> mm-wide modifications.

As stated elsewhere (e.g., THP cabal yesterday): mctrl() or sth like 
that might be better.

... or anything else that doesn't (ab)use the "advise" terminology in an 
interface that will not only consume advises.

>>
>> With Barry's series doing a prctl() for something similar, and a whole host
>> of mm->flags existing for modifying behaviour, it would seem a natural fit.
> 
> That's an interesting idea.
> 
> So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it
> might also be a good fit for the coredump stuff and ksm if we wanted
> to incorporate them into that (although it would duplicate the
> existing proc/prctl knobs). The other MMF_s are internal AFAICS.
> 
> I think my main concern would be making something very generic and
> versatile without having sufficiently broad/popular usecases for it.
> 
> But no strong feelings either way. Like I said, I don't have a strong
> dislike for prctl(), but this idea would obviously be cleaner if we
> think there is enough of a demand for a new syscall.

Same here. I am not 100% sure process_madvise() is really the right 
thing to extend, but I do enjoy the SET_DEFAULT_EXEC option. I am also 
not a big fan of prctl() ...

-- 
Cheers,

David / dhildenb

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Lorenzo Stoakes 8 months, 3 weeks ago

On Thu, May 22, 2025 at 02:45:30PM +0200, David Hildenbrand wrote:
> On 21.05.25 19:32, Johannes Weiner wrote:
> > On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
> > > So, something Liam mentioned off-list was the beautifully named
> > > 'mmadvise()'. Idea being that we have a system call _explicitly for_
> > > mm-wide modifications.
>
> As stated elsewhere (e.g., THP cabal yesterday): mctrl() or sth like that
> might be better.
>
> ... or anything else that doesn't (ab)use the "advise" terminology in an
> interface that will not only consume advises.

Ack, as per my other reply, will work up some pseudocode/API exploration (not
yet code) and mail it round to participants here so we can explore this idea.

>
> > >
> > > With Barry's series doing a prctl() for something similar, and a whole host
> > > of mm->flags existing for modifying behaviour, it would seem a natural fit.
> >
> > That's an interesting idea.
> >
> > So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it
> > might also be a good fit for the coredump stuff and ksm if we wanted
> > to incorporate them into that (although it would duplicate the
> > existing proc/prctl knobs). The other MMF_s are internal AFAICS.
> >
> > I think my main concern would be making something very generic and
> > versatile without having sufficiently broad/popular usecases for it.
> >
> > But no strong feelings either way. Like I said, I don't have a strong
> > dislike for prctl(), but this idea would obviously be cleaner if we
> > think there is enough of a demand for a new syscall.
>
> Same here. I am not 100% sure process_madvise() is really the right thing to
> extend, but I do enjoy the SET_DEFAULT_EXEC option. I am also not a big fan
> of prctl() ...

Yeah agreed on both actually! :)

>
> --
> Cheers,
>
> David / dhildenb
>

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Lorenzo Stoakes 8 months, 3 weeks ago

On Wed, May 21, 2025 at 01:32:00PM -0400, Johannes Weiner wrote:
> On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
> > So, something Liam mentioned off-list was the beautifully named
> > 'mmadvise()'. Idea being that we have a system call _explicitly for_
> > mm-wide modifications.
> >
> > With Barry's series doing a prctl() for something similar, and a whole host
> > of mm->flags existing for modifying behaviour, it would seem a natural fit.
>
> That's an interesting idea.

Thanks!

>
> So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it
> might also be a good fit for the coredump stuff and ksm if we wanted
> to incorporate them into that (although it would duplicate the
> existing proc/prctl knobs). The other MMF_s are internal AFAICS.
>
> I think my main concern would be making something very generic and
> versatile without having sufficiently broad/popular usecases for it.

Ack, the main argument here is to keep mm stuff together without bitrot.

The process_madvise() proposal advocated for the addition of flags that
would be useful in other circumstances such as eliminating the broken
behaviour around gaps etc which fulfilled this requirement naturally.

However it's a fair point that the fork/exec mm-scope stuff is awkwardly
placed (but equally so in prctl()).

It's an absolutely fair point as to specificity - but I would argue that
it's a _general_ thing to want to have mm-level state changes, and while
this might be specific _now_, in future the next specific thing can go
here, and the next etc.

Things that would each have been their own sort of special case in prctl()
can now live somewhere maintained by mm people, using core mm code and
avoiding bitrot.

I realise I (ugh mea culpa) missed a fairly eventful THP meeting, I think
David suggested 'mcontrol()' as a name for this? :)

In any event absolutely love to hear input from there from anybody on that
also!

>
> But no strong feelings either way. Like I said, I don't have a strong
> dislike for prctl(), but this idea would obviously be cleaner if we
> think there is enough of a demand for a new syscall.

I won't belabour the point by repeating the arguments in this area :)
generally I worry about seeing mm code proliferate in non-mm places.

>
> > I guess let me work that up so we can see how that looks?
>
> I think it's worth exploring!

Thanks!

If time permits and there isn't too much push back I"ll try spinning up an
RFC.

Cheers, Lorenzo

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Shakeel Butt 8 months, 3 weeks ago

On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote:
> 
[...]
> 
> So, something Liam mentioned off-list was the beautifully named
> 'mmadvise()'. Idea being that we have a system call _explicitly for_
> mm-wide modifications.
> 
> With Barry's series doing a prctl() for something similar, and a whole host
> of mm->flags existing for modifying behaviour, it would seem a natural fit.
> 
> I could do a respin that does something like this instead.
> 

Please let's first get consensus on this before starting the work.

Usama, David, Johannes and others, WDYT?

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Usama Arif 8 months, 3 weeks ago

On 21/05/2025 17:28, Shakeel Butt wrote:
> On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
>> On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote:
>>
> [...]
>>
>> So, something Liam mentioned off-list was the beautifully named
>> 'mmadvise()'. Idea being that we have a system call _explicitly for_
>> mm-wide modifications.
>>
>> With Barry's series doing a prctl() for something similar, and a whole host
>> of mm->flags existing for modifying behaviour, it would seem a natural fit.
>>
>> I could do a respin that does something like this instead.
>>
> 
> Please let's first get consensus on this before starting the work.
> 
> Usama, David, Johannes and others, WDYT?
> 

I would like that. Introducing another method might make the conversation a
lot more complex than it already is?

I have addressed the feedback from Lorenzo for the prctl series, but am
holding back sending it as RFC v4.
The v3 has a NACK on it so I would imagine it would discourage people
from reviewing it. If we are still progressing with sending patches, would it
make sense for me to wait a couple of days to see if there are any more comments
on it and send the RFC v4?

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Lorenzo Stoakes 8 months, 3 weeks ago

On Wed, May 21, 2025 at 05:57:48PM +0100, Usama Arif wrote:
>
>
> On 21/05/2025 17:28, Shakeel Butt wrote:
> > On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
> >> On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote:
> >>
> > [...]
> >>
> >> So, something Liam mentioned off-list was the beautifully named
> >> 'mmadvise()'. Idea being that we have a system call _explicitly for_
> >> mm-wide modifications.
> >>
> >> With Barry's series doing a prctl() for something similar, and a whole host
> >> of mm->flags existing for modifying behaviour, it would seem a natural fit.
> >>
> >> I could do a respin that does something like this instead.
> >>
> >
> > Please let's first get consensus on this before starting the work.
> >
> > Usama, David, Johannes and others, WDYT?
> >
>
> I would like that. Introducing another method might make the conversation a
> lot more complex than it already is?

The argument is about prctl() vs. another method.

Another method was proposed, arguments were made that it didn't seem
suitable, so an alternative was proposed.

I'm really not sure what complexity this adds?

And what better means of comparing approaches than actual code? Isn't an
entirely theoretical discussion less helpful?

This feels a little like dismissing my input (and I note, Liam's points
remain unanswered), and I have to admit, that is a little upsetting.

But I suppose one has to have a thick skin in the kernel.

>
> I have addressed the feedback from Lorenzo for the prctl series, but am
> holding back sending it as RFC v4.
> The v3 has a NACK on it so I would imagine it would discourage people
> from reviewing it. If we are still progressing with sending patches, would it
> make sense for me to wait a couple of days to see if there are any more comments
> on it and send the RFC v4?

I've no problem with you respinning RFC"s, as I've said more than once. The
NACK has been explained to you, it's regrettable, but necessary I feel when
explicit agreed-upon review has not been actioned (obviously I realise it
was a mistake - but this doesn't make it less important to be clear like
this).

As to stopping and having a conversation about which way forward makes most
sense - I feel like that's what I've been trying to do the whole time? I
would like to think my input is of value, it is a pity that it seems that
it apparently is not.

I am obviously happy to hear other people's input. This is what I've been
working very hard to try to establish, partly by providing _actual code_ so
we can see how things compare.

It seems generally people don't have much of a strong opinion about the
interface, other than Liam (forgive me if I am misinterpreting you Liam and
please correct if so) and myself very strongly disliking prctl() for
numerous aforementioned reasons.

I would suggest that in that case, and since we maintain madvise()
interfaces, it would make sense for us therefore to pursue an alternative
approach.

But for the absolute best means of determining a way forward I'd suggest an
RFC is best.

Thanks.

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Usama Arif 8 months, 3 weeks ago

On 21/05/2025 18:39, Lorenzo Stoakes wrote:
> On Wed, May 21, 2025 at 05:57:48PM +0100, Usama Arif wrote:
>>
>>
>> On 21/05/2025 17:28, Shakeel Butt wrote:
>>> On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
>>>> On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote:
>>>>
>>> [...]
>>>>
>>>> So, something Liam mentioned off-list was the beautifully named
>>>> 'mmadvise()'. Idea being that we have a system call _explicitly for_
>>>> mm-wide modifications.
>>>>
>>>> With Barry's series doing a prctl() for something similar, and a whole host
>>>> of mm->flags existing for modifying behaviour, it would seem a natural fit.
>>>>
>>>> I could do a respin that does something like this instead.
>>>>
>>>
>>> Please let's first get consensus on this before starting the work.
>>>
>>> Usama, David, Johannes and others, WDYT?
>>>
>>
>> I would like that. Introducing another method might make the conversation a
>> lot more complex than it already is?
> 
> The argument is about prctl() vs. another method.
> 

So there are a few things that have been on the mailing list prctl, process_madvise, 
bpf, cgroup based (although that was shot down). I meant adding another one to all
of them. But please see below.

> Another method was proposed, arguments were made that it didn't seem
> suitable, so an alternative was proposed.
> 
> I'm really not sure what complexity this adds?
> 
> And what better means of comparing approaches than actual code? Isn't an
> entirely theoretical discussion less helpful?

Sounds good!

> This feels a little like dismissing my input (and I note, Liam's points
> remain unanswered), and I have to admit, that is a little upsetting.
> 

The current prctl version is a lot lot better only because of your input.
The previous version had flags2, MMF flags, was breaking vm merge.
All of it was fixed only because of your valuable input.

So I have never dismissed your input, and try my best to reply quickly
to your and everyone elses reviews. We disagree on the best way forward
(whether prctl is appropriate) and I think disagreeing is definitely part
of good engineering. Please always assume I am acting in good faith.

As to not replying to Liams review, if its
https://lore.kernel.org/all/ass5hz6l26jc6xhbtybmgdjf55hmb3v4vvhrxhqampv6ohl67u@qi6iacwzwfk5/
I had already done it several hours ago.
If there was a email bashing prctl on how it is a bitrot and where everything
goes to die, I really dont know how to reply to that anymore.
Otherwise I might have missed it. I really hope one thing is
clear from our interactions over the past week, that I always try
and address feedback.

And for the upsetting part, from all the WTF reactions on my v2,
and all the rest of discussions we have had, I feel you have always
been upset with whatever I have sent (patches/replies), which I really
really don't understand.
We will have disagreements, its part of the dev process. The current version
wasnt going to get any acks, and was not going to get merged in any form,
so I really dont get it. Again, please always assume I am acting in good faith.

> But I suppose one has to have a thick skin in the kernel.
> 
>>
>> I have addressed the feedback from Lorenzo for the prctl series, but am
>> holding back sending it as RFC v4.
>> The v3 has a NACK on it so I would imagine it would discourage people
>> from reviewing it. If we are still progressing with sending patches, would it
>> make sense for me to wait a couple of days to see if there are any more comments
>> on it and send the RFC v4?
> 
> I've no problem with you respinning RFC"s, as I've said more than once. The
> NACK has been explained to you, it's regrettable, but necessary I feel when
> explicit agreed-upon review has not been actioned (obviously I realise it
> was a mistake - but this doesn't make it less important to be clear like
> this).
> 
> As to stopping and having a conversation about which way forward makes most
> sense - I feel like that's what I've been trying to do the whole time? I
> would like to think my input is of value, it is a pity that it seems that
> it apparently is not.
> 
> I am obviously happy to hear other people's input. This is what I've been
> working very hard to try to establish, partly by providing _actual code_ so
> we can see how things compare.
> 
> It seems generally people don't have much of a strong opinion about the
> interface, other than Liam (forgive me if I am misinterpreting you Liam and
> please correct if so) and myself very strongly disliking prctl() for
> numerous aforementioned reasons.
> 
> I would suggest that in that case, and since we maintain madvise()
> interfaces, it would make sense for us therefore to pursue an alternative
> approach.
> 
> But for the absolute best means of determining a way forward I'd suggest an
> RFC is best.

Sounds good to me.
> 
> Thanks.

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Lorenzo Stoakes 8 months, 3 weeks ago

I feel this will get circular. So let's not get lost in the weeds here.

Let's see what others think, and if not too much push-back I'll put out
another RFC for the mcontrol() concept and we can compare to your RFC and
use that to reach consensus if that works for you?

Thanks, Lorenzo

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Usama Arif 8 months, 3 weeks ago


On 21/05/2025 19:40, Lorenzo Stoakes wrote:
> I feel this will get circular. So let's not get lost in the weeds here.
> 
> Let's see what others think, and if not too much push-back I'll put out
> another RFC for the mcontrol() concept and we can compare to your RFC and
> use that to reach consensus if that works for you?
> 
> Thanks, Lorenzo

Sure sounds good, Thanks.

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Lorenzo Stoakes 8 months, 3 weeks ago

On Wed, May 21, 2025 at 09:28:43AM -0700, Shakeel Butt wrote:
> On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote:
> >
> [...]
> >
> > So, something Liam mentioned off-list was the beautifully named
> > 'mmadvise()'. Idea being that we have a system call _explicitly for_
> > mm-wide modifications.
> >
> > With Barry's series doing a prctl() for something similar, and a whole host
> > of mm->flags existing for modifying behaviour, it would seem a natural fit.
> >
> > I could do a respin that does something like this instead.
> >
>
> Please let's first get consensus on this before starting the work.

With respect Shakeel, I'll work on whatever I want, whenever I want.

These are RFC's, that is, Requests for Comment, intended to explore new
ideas and to show how they might look in practice so we can make the right
choice.

I feel like I've said this more than once now...!

>
> Usama, David, Johannes and others, WDYT?
>

But obviously it would be good to get some input from others, more so from
the actual maintainer (David) and reviewers though obviously everybody's
opinion matters.

Overall my view is prctl() is really something we should avoid here unless
we have absolutely no other choice. I've gone into detail as to why already
so won't belabour the point.

Thanks.

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Shakeel Butt 8 months, 3 weeks ago

On Wed, May 21, 2025 at 05:49:15PM +0100, Lorenzo Stoakes wrote:
[...]
> >
> > Please let's first get consensus on this before starting the work.
> 
> With respect Shakeel, I'll work on whatever I want, whenever I want.

I fail to understand why you would respond like that.

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by David Hildenbrand 8 months, 3 weeks ago

On 21.05.25 19:39, Shakeel Butt wrote:
> On Wed, May 21, 2025 at 05:49:15PM +0100, Lorenzo Stoakes wrote:
> [...]
>>>
>>> Please let's first get consensus on this before starting the work.
>>
>> With respect Shakeel, I'll work on whatever I want, whenever I want.
> 
> I fail to understand why you would respond like that.

Relax guys ... :) Really nothing to be fighting about.

Lorenzo has a lot of energy to play with things, to see how it would 
look. I wish I would have that much energy, but I have no idea where it 
went ... (well, okay, I have a suspicion) :P

At the same time, I hope (and assume :) ) that Lorenzo will get Usama 
involved in the development once we know what we want.

To summarize my current view:

1) ebpf: most people are are not a fan of that, and I agree, at least
    for this purpose. If we were talking about making better *placement*
    decisions using epbf, it would be a different story.

2) prctl(): the unloved child, and I can understand why. Maybe now is
    the right time to stop adding new MM things that feel weird in there.
    Maybe we should already have done that with the KSM toggle (guess who
    was involved in that ;) ).

3) process_madvise(): I think it's an interesting extension, but
    probably we should just have something that applies to the whole
    address space naturally. At least my take for now.

4) new syscall: worth exploring how it would look. I'm especially
    interested in flag options (e.g., SET_DEFAULT_EXEC) and how we could
    make them only apply to selected controls.

An API prototype of 4), not necessarily with the code yet, might be 
valuable.

In general, the "always/madvise/never" policies are really horrible. We 
should instead be prioritizing who gets THPs -- and only disable them 
for selected workloads.

Because splitting THPs up because a process is not allowed to use them, 
thereby increasing memory fragmentation, is really absolutely suboptimal.

But we don't have anything better right now.

So I would hope that we can at least turn the "always/VM_HUGEPAGE" into 
a "prioritize for largest (m)THPs possible" in a distant future.

If only changing the semantics of VM_NOHUGEPAGE to mean "deprioritize 
for THPs" couldn't break userfaultfd ... :( But maybe that can be worked 
around in the future somehow (e.g., when we detect userfaultfd usage, 
not sure, ...).

-- 
Cheers,

David / dhildenb

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Shakeel Butt 8 months, 3 weeks ago

On Thu, May 22, 2025 at 03:05:30PM +0200, David Hildenbrand wrote:
> On 21.05.25 19:39, Shakeel Butt wrote:
> > On Wed, May 21, 2025 at 05:49:15PM +0100, Lorenzo Stoakes wrote:
> > [...]
> > > > 
> > > > Please let's first get consensus on this before starting the work.
> > > 
> > > With respect Shakeel, I'll work on whatever I want, whenever I want.
> > 
> > I fail to understand why you would respond like that.
> 
> Relax guys ... :) Really nothing to be fighting about.

Agreed.

[...]

> 
> 
> To summarize my current view:
> 
> 1) ebpf: most people are are not a fan of that, and I agree, at least
>    for this purpose. If we were talking about making better *placement*
>    decisions using epbf, it would be a different story.

From placement decisions, do you mean placement between memory
tiers/nodes or something else?

> 
> 2) prctl(): the unloved child, and I can understand why. Maybe now is
>    the right time to stop adding new MM things that feel weird in there.
>    Maybe we should already have done that with the KSM toggle (guess who
>    was involved in that ;) ).

At the moment systemd is the user I know of and I think it would very
easy to migrate it to whatever new thing we decide here.

> 
> 3) process_madvise(): I think it's an interesting extension, but
>    probably we should just have something that applies to the whole
>    address space naturally. At least my take for now.
> 
> 4) new syscall: worth exploring how it would look. I'm especially
>    interested in flag options (e.g., SET_DEFAULT_EXEC) and how we could
>    make them only apply to selected controls.

Were there any previous discussion on SET_DEFAULT_EXEC? First time I am
hearing about it.

Overall I agree with your assessment and thus I was requesting to at
least discuss the new syscall option as well.

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by David Hildenbrand 8 months, 2 weeks ago

>>
>> To summarize my current view:
>>
>> 1) ebpf: most people are are not a fan of that, and I agree, at least
>>     for this purpose. If we were talking about making better *placement*
>>     decisions using epbf, it would be a different story.
> 
>  From placement decisions, do you mean placement between memory
> tiers/nodes or something else?

More like: which size to place, but it could be extended to other 
policies, maybe.

Assume we have a page fault and have to decide which size to place.

For a process that we really want to use THPs (VM_HUEPAGE?), we could 
use the largest free folio possible.

For a process that we don't want to spend valuable THPs on (VM_HUEPAGE 
not set?), we could use the smallest free folio possible.

Such a possibly might be encoded in an ebpf program I assume.

The hints (prioritize regions/processes, deprioritize 
regions/processes), such as VM_HUGEPAGE, inputs into such a program.
> 
>>
>> 2) prctl(): the unloved child, and I can understand why. Maybe now is
>>     the right time to stop adding new MM things that feel weird in there.
>>     Maybe we should already have done that with the KSM toggle (guess who
>>     was involved in that ;) ).
> 
> At the moment systemd is the user I know of and I think it would very
> easy to migrate it to whatever new thing we decide here.

Agreed.

> 
>>
>> 3) process_madvise(): I think it's an interesting extension, but
>>     probably we should just have something that applies to the whole
>>     address space naturally. At least my take for now.
>>
>> 4) new syscall: worth exploring how it would look. I'm especially
>>     interested in flag options (e.g., SET_DEFAULT_EXEC) and how we could
>>     make them only apply to selected controls.
> 
> Were there any previous discussion on SET_DEFAULT_EXEC? First time I am
> hearing about it.

I think it evolved in the discussion here from PMADV_SET_FORK_EXEC_DEFAULT.

> 
> Overall I agree with your assessment and thus I was requesting to at
> least discuss the new syscall option as well.

Yes.

I am still not sure if having a new "process" [1] mode would be a 
reasonable alternative to setting the VM_HUGEPAGE/VM_NOHUGEPAGE default. 
Assuming we would have a "process" mode, we could (a) set the policy 
per-process using the new syscall we discuss here, and options to (B) 
set the policy to use for the exec child and (c) maybe an option to seal 
the policy (depending on who is allowed to set the policy in the first 
place).

On the + side, we don't lose hints/instructions from the app 
(VM_HUGEPAGE/VM_NOHUGEPAGE) when changing the policy on an already 
running process.

The problem I see with the "process" policy is that people might want 
different "default" policies for processes, which means that we will 
have to add yet another toggle.


How I hate THP toggles. :)

[1] 
https://lore.kernel.org/all/CALOAHbB-KQ4+z-Lupv7RcxArfjX7qtWcrboMDdT4LdpoTXOMyw@mail.gmail.com/

-- 
Cheers,

David / dhildenb

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Lorenzo Stoakes 8 months, 3 weeks ago

TL;DR - action item on below is I'll put together a proposed API (without
code) and cc people here when I've done so, so we can take a look at how
mctl() or mmadvise() or whatever we call it might look :)

On Thu, May 22, 2025 at 03:05:30PM +0200, David Hildenbrand wrote:
> On 21.05.25 19:39, Shakeel Butt wrote:
> > On Wed, May 21, 2025 at 05:49:15PM +0100, Lorenzo Stoakes wrote:
> > [...]
> > > >
> > > > Please let's first get consensus on this before starting the work.
> > >
> > > With respect Shakeel, I'll work on whatever I want, whenever I want.
> >
> > I fail to understand why you would respond like that.
>
> Relax guys ... :) Really nothing to be fighting about.

Agreed...!

>
> Lorenzo has a lot of energy to play with things, to see how it would look. I
> wish I would have that much energy, but I have no idea where it went ...
> (well, okay, I have a suspicion) :P

We have cats rather than kids which might explain a bit ;)

>
> At the same time, I hope (and assume :) ) that Lorenzo will get Usama
> involved in the development once we know what we want.
>
>
> To summarize my current view:
>
> 1) ebpf: most people are are not a fan of that, and I agree, at least
>    for this purpose. If we were talking about making better *placement*
>    decisions using epbf, it would be a different story.

Yeah, I think overall we have a situation that is _bad_ in terms of
interface. We need something more fine-grained, but it's chicken and egg, and
there are genuine needs users have _now_.

So the whole discussion is about this.

>
> 2) prctl(): the unloved child, and I can understand why. Maybe now is
>    the right time to stop adding new MM things that feel weird in there.
>    Maybe we should already have done that with the KSM toggle (guess who
>    was involved in that ;) ).

I won't belabour this point, at this point I might get a reputation as
prctl()'s biggest hater otherwise :P

But one thing I will say is - systemd makes these things permanent (hey
that KSM thing that breaks VMA merging is literally an option in systemd,
wasn't aware :)

>
> 3) process_madvise(): I think it's an interesting extension, but
>    probably we should just have something that applies to the whole
>    address space naturally. At least my take for now.

Yeah that's the point of view I've come to, I mean the point was to try to
make this more generic in a way that _also_ got us improved control over
madvise() - sort of win/win.

But the 'default the process' thing is, as Shakeel and Liam rightly say,
just really out of band or doesn't quite fit this interface.

I may still put forward a patch to add flags for e.g. not breaking on gaps
but as a separate thing I think, I still think that'd be valuable (but I'll
provide solid at least self tests to make the point).

>
> 4) new syscall: worth exploring how it would look. I'm especially
>    interested in flag options (e.g., SET_DEFAULT_EXEC) and how we could
>    make them only apply to selected controls.

Yeah, this is exactly what I want to play with.

>
>
> An API prototype of 4), not necessarily with the code yet, might be
> valuable.

ACK, though I really find it valuable to code things up because so often
you figure out what works by trying to make it work in practice.

This is how guard regions happened for instance, we had a ton of
conversation like this, loads of back and forth, nobody quite knew, then I
wrote some prototype code and it became apparent that this thing was
doable.

I never intend the RFC to be _the work_ rather it's a 'proof of concept'
for discussion.

However, as we're still fairly vague on the API bit, I think in this case
it'll be valuable to do exactly what you suggest and simply prototype an
API around this without code.

So I'll do that and come up with something as a separate mail, cc'ing
people here.

>
> In general, the "always/madvise/never" policies are really horrible. We
> should instead be prioritizing who gets THPs -- and only disable them for
> selected workloads.

I couldn't agree more.

>
> Because splitting THPs up because a process is not allowed to use them,
> thereby increasing memory fragmentation, is really absolutely suboptimal.

Yes, there's a disconnect here between - a global resource (-ish :P) - and
process requirements.

>
> But we don't have anything better right now.

I feel like all this turmoil brings us closer to longer term solutions, if
perhaps via pain-inspired development (a new programming philosophy I
intend to trademark ;)

>
> So I would hope that we can at least turn the "always/VM_HUGEPAGE" into a
> "prioritize for largest (m)THPs possible" in a distant future.

I suspect we might still require some legacy settings so people don't
panic. Aren't uAPIs fun?

>
> If only changing the semantics of VM_NOHUGEPAGE to mean "deprioritize for
> THPs" couldn't break userfaultfd ... :( But maybe that can be worked around
> in the future somehow (e.g., when we detect userfaultfd usage, not sure,
> ...).

I hate how uffd is implemented (I like the concept of what it provides
though!) on multiple levels. It's crept into so much and the idea it's
putting restrictions on core stuff is just horrid.

I do feel though we may want to introduce something new for this though, as
'never' or 'no' suddenly not being no but 'deprioritise' could be pretty
concerning for people.

But on the other hand, this is a resource for the kernel to determine how
to manage as it sees fit so, perhaps we shouldn't care...

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks!

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by David Hildenbrand 8 months, 3 weeks ago

On 19.05.25 22:52, Lorenzo Stoakes wrote:
> REVIEWERS NOTES:
> ================
> 
> This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> 'putting it out there' for feedback. Any serious version of this will add a
> bunch of self-tests to assert correct behaviour and I will more carefully
> confirm everything's working.
> 
> This is based on discussion arising from Usama's series [0], SJ's input on
> the thread around process_madvise() behaviour [1] (and a subsequent
> response by me [2]) and prior discussion about a new madvise() interface
> [3].
> 
> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> 
> ================
> 
> Currently, we are rather restricted in how madvise() operations
> proceed. While effort has been put in to expanding what process_madvise()
> can do (that is - unrestricted application of advice to the local process
> alongside recent improvements on the efficiency of TLB operations over
> these batvches), we are still constrained by existing madvise() limitations
> and default behaviours.
> 
> This series makes use of the currently unused flags field in
> process_madvise() to provide more flexiblity.
> 

In general, sounds like an interesting approach.

> It introduces four flags:
> 
> 1. PMADV_SKIP_ERRORS
> 
> Currently, when an error arises applying advice in any individual VMA
> (keeping in mind that a range specified to madvise() or as part of the
> iovec passed to process_madvise()), the operation stops where it is and
> returns an error.
> 
> This might not be the desired behaviour of the user, who may wish instead
> for the operation to be 'best effort'. By setting this flag, that behaviour
> is obtained.
> 
> Since process_madvise() would trivially, if skipping errors, simply return
> the input vector size, we instead return the number of entries in the
> vector which completed successfully without error.

I would focus only on adding flags that we absolutely need to make the 
use case we have in mind work. We can always add other flags as we see 
fit for them (IOW, when really required ;) ).

Regarding MADV_HUGEPAGE / MADV_NOHUGEPAGE, this would not be required, 
right?

> 
> The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> 
> 2. PMADV_NO_ERROR_ON_UNMAPPED
> 
> Currently madvise() has the peculiar behaviour of, if the range specified
> to it contains unmapped range(s), completing the full operation, but
> ultimately returning -ENOMEM.
> 
> In the case of process_madvise(), this is fatal, as the operation will stop
> immediately upon this occurring.
> 
> By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> unmapped areas to simply be entirely ignored.

Again, is this really required? Couldn't we glue that to 
PMADV_ENTIRE_ADDRESS_SPACE for our use case? After all, I don't expect 
anybody to have something mapped into *the entire address space*.

Well, okay, maybe on 32bit, but still ... :)

> 
> 3. PMADV_SET_FORK_EXEC_DEFAULT
> 
> It may be desirable for a user to specify that all VMAs mapped in a process
> address space default to having an madvise() behaviour established by
> default, in such a fashion as that this persists across fork/exec.

This is very specific for MADV_HUGEPAGE only, so I wonder how we could 
either avoid that flag or just make it clear that it shall stick around ...

Having that sad, PMADV_SET_FORK_EXEC_DEFAULT is rather a suboptimal name :(

-- 
Cheers,

David / dhildenb

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Lorenzo Stoakes 8 months, 3 weeks ago

On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote:
> On 19.05.25 22:52, Lorenzo Stoakes wrote:
> > REVIEWERS NOTES:
> > ================
> >
> > This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> > 'putting it out there' for feedback. Any serious version of this will add a
> > bunch of self-tests to assert correct behaviour and I will more carefully
> > confirm everything's working.
> >
> > This is based on discussion arising from Usama's series [0], SJ's input on
> > the thread around process_madvise() behaviour [1] (and a subsequent
> > response by me [2]) and prior discussion about a new madvise() interface
> > [3].
> >
> > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> >
> > ================
> >
> > Currently, we are rather restricted in how madvise() operations
> > proceed. While effort has been put in to expanding what process_madvise()
> > can do (that is - unrestricted application of advice to the local process
> > alongside recent improvements on the efficiency of TLB operations over
> > these batvches), we are still constrained by existing madvise() limitations
> > and default behaviours.
> >
> > This series makes use of the currently unused flags field in
> > process_madvise() to provide more flexiblity.
> >
>
> In general, sounds like an interesting approach.

Thanks!

If you agree this is workable, then I'll go ahead and put some more effort
into writing tests etc. on the next respin.

>
> > It introduces four flags:
> >
> > 1. PMADV_SKIP_ERRORS
> >
> > Currently, when an error arises applying advice in any individual VMA
> > (keeping in mind that a range specified to madvise() or as part of the
> > iovec passed to process_madvise()), the operation stops where it is and
> > returns an error.
> >
> > This might not be the desired behaviour of the user, who may wish instead
> > for the operation to be 'best effort'. By setting this flag, that behaviour
> > is obtained.
> >
> > Since process_madvise() would trivially, if skipping errors, simply return
> > the input vector size, we instead return the number of entries in the
> > vector which completed successfully without error.
>
> I would focus only on adding flags that we absolutely need to make the use
> case we have in mind work. We can always add other flags as we see fit for
> them (IOW, when really required ;) ).
>
> Regarding MADV_HUGEPAGE / MADV_NOHUGEPAGE, this would not be required,
> right?

Sure, we can restrict this to only supported flags to be conservative.

The idea was though that somebody might want to simply do a 'best effort'
change.

However at the same time it's possibly a wee bit dangerous...

>
> >
> > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> >
> > 2. PMADV_NO_ERROR_ON_UNMAPPED
> >
> > Currently madvise() has the peculiar behaviour of, if the range specified
> > to it contains unmapped range(s), completing the full operation, but
> > ultimately returning -ENOMEM.
> >
> > In the case of process_madvise(), this is fatal, as the operation will stop
> > immediately upon this occurring.
> >
> > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> > unmapped areas to simply be entirely ignored.
>
> Again, is this really required? Couldn't we glue that to
> PMADV_ENTIRE_ADDRESS_SPACE for our use case? After all, I don't expect
> anybody to have something mapped into *the entire address space*.

Well, I think it's an ongoing issue that unmapped entries cause the whole thing
to break, I do think it makes sense to make this _generally_ available, actually.

Obviously we should probably make PMADV_ENTIRE_MAPPING imply
PMADV_NO_ERROR_ON_UNMAPPED in the same way that PMADV_SKIP_ERRORS implies
PMADV_NO_ERROR_ON_UNMAPPED.

And yes I don't think any sane person would map the entirety of the 64-bit
address space :P

>
> Well, okay, maybe on 32bit, but still ... :)

32 what? :P I deny its existence... (ugh ok I guess I have to ack it, but
even in that case it's not very likely either :)

>
> >
> > 3. PMADV_SET_FORK_EXEC_DEFAULT
> >
> > It may be desirable for a user to specify that all VMAs mapped in a process
> > address space default to having an madvise() behaviour established by
> > default, in such a fashion as that this persists across fork/exec.
>
> This is very specific for MADV_HUGEPAGE only, so I wonder how we could
> either avoid that flag or just make it clear that it shall stick around ...
>
> Having that sad, PMADV_SET_FORK_EXEC_DEFAULT is rather a suboptimal name :(

Yeah it's horrid, see Pedro's suggestions, e.g. PMADV_SET_DEFAULT |
PMADV_INHERIT_EXEC.

I even wonder about PMADV_, but that's probably fine. PRMADV sucks, PADV
sort of loses the mm bit, PMADV is probably best we can do!

>
> --
> Cheers,
>
> David / dhildenb
>

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Lorenzo Stoakes 8 months, 3 weeks ago

On Tue, May 20, 2025 at 06:47:48PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote:
[snip]
> > > 3. PMADV_SET_FORK_EXEC_DEFAULT
> > >
> > > It may be desirable for a user to specify that all VMAs mapped in a process
> > > address space default to having an madvise() behaviour established by
> > > default, in such a fashion as that this persists across fork/exec.
> >
> > This is very specific for MADV_HUGEPAGE only, so I wonder how we could
> > either avoid that flag or just make it clear that it shall stick around ...
> >

Sorry missed this bit.

I don't really like the idea of only for MADV_HUGEPAGE (and MADV_NOHUGEPAGE)
defaulting to doing this, I think that's far less clear than a user explicitly
asking for it, plus most users using process_madvise() wouldn't expect it.

So restricting this to these flags only and rejecting all others should cover
this off fine I think.

[snip]

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by David Hildenbrand 8 months, 3 weeks ago

On 20.05.25 20:25, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 06:47:48PM +0100, Lorenzo Stoakes wrote:
>> On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote:
> [snip]
>>>> 3. PMADV_SET_FORK_EXEC_DEFAULT
>>>>
>>>> It may be desirable for a user to specify that all VMAs mapped in a process
>>>> address space default to having an madvise() behaviour established by
>>>> default, in such a fashion as that this persists across fork/exec.
>>>
>>> This is very specific for MADV_HUGEPAGE only, so I wonder how we could
>>> either avoid that flag or just make it clear that it shall stick around ...
>>>
> 
> Sorry missed this bit.
> 
> I don't really like the idea of only for MADV_HUGEPAGE (and MADV_NOHUGEPAGE)
> defaulting to doing this, I think that's far less clear than a user explicitly
> asking for it, plus most users using process_madvise() wouldn't expect it.

The thing is that PMADV_ENTIRE_ADDRESS_SPACE already adds something 
unexpected: suddenly also *new* mappings will be affected.

But maybe that'd be covered by the PMADV_SET_DEFAULT idea.

Maybe

PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SET_DEFAULT | PMADV_SET_EXEC_DEFAULT

... hm, but "PMADV_ENTIRE_ADDRESS_SPACE" can then simply be "pass in the 
entire address space as a range" ... and ignore errors.

I'd probably have to see a revised proposal after the current discussions.

Anyhow, limiting the flags to a bare minimum for now usually makes 
sense, as long as it is expandable.

-- 
Cheers,

David / dhildenb

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Usama Arif 8 months, 3 weeks ago

On 20/05/2025 18:47, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote:
>> On 19.05.25 22:52, Lorenzo Stoakes wrote:
>>> REVIEWERS NOTES:
>>> ================
>>>
>>> This is a VERY EARLY version of the idea, it's relatively untested, and I'm
>>> 'putting it out there' for feedback. Any serious version of this will add a
>>> bunch of self-tests to assert correct behaviour and I will more carefully
>>> confirm everything's working.
>>>
>>> This is based on discussion arising from Usama's series [0], SJ's input on
>>> the thread around process_madvise() behaviour [1] (and a subsequent
>>> response by me [2]) and prior discussion about a new madvise() interface
>>> [3].
>>>
>>> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
>>> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
>>> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
>>> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
>>>
>>> ================
>>>
>>> Currently, we are rather restricted in how madvise() operations
>>> proceed. While effort has been put in to expanding what process_madvise()
>>> can do (that is - unrestricted application of advice to the local process
>>> alongside recent improvements on the efficiency of TLB operations over
>>> these batvches), we are still constrained by existing madvise() limitations
>>> and default behaviours.
>>>
>>> This series makes use of the currently unused flags field in
>>> process_madvise() to provide more flexiblity.
>>>
>>
>> In general, sounds like an interesting approach.
> 
> Thanks!
> 
> If you agree this is workable, then I'll go ahead and put some more effort
> into writing tests etc. on the next respin.
> 

So the prctl and process_madvise patches both are trying to accomplish a
similar end goal.

Would it make sense to discuss what would be the best way forward before we
continue developing the solutions? If we are not at that stage and a clear
picture has not formed yet, happy to continue refining the solutions.
But just thought I would check. 

I feel like changing process_madvise which was designed to work on an array
of iovec structures to have flags to skip errors and ignore the iovec
makes it function similar to a prctl call is not the right approach.
IMHO, prctl is a more direct solution to this.

I know that Lonenzo doesn't like prctl and wants to unify this in process_madvise.
But if in the end, we want to have a THP auto way which is truly transparent,
would it not be better to just have this as prctl and not change the madvise
structure? Maybe in a few years we wont need any of this, and it will be lost
in prctl and madvise wouldn't have changed for this?

Again, this is just to have a discussion (and not an aggressive argument :)),
and would love to get feedback from everyone in the community.
If its too early to have this discussion, its completely fine and we can
still keep developing the RFCs :)

Thanks!
Usama

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Lorenzo Stoakes 8 months, 3 weeks ago

On Tue, May 20, 2025 at 07:24:04PM +0100, Usama Arif wrote:
>
>
> On 20/05/2025 18:47, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote:
> >> On 19.05.25 22:52, Lorenzo Stoakes wrote:
> >>> REVIEWERS NOTES:
> >>> ================
> >>>
> >>> This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> >>> 'putting it out there' for feedback. Any serious version of this will add a
> >>> bunch of self-tests to assert correct behaviour and I will more carefully
> >>> confirm everything's working.
> >>>
> >>> This is based on discussion arising from Usama's series [0], SJ's input on
> >>> the thread around process_madvise() behaviour [1] (and a subsequent
> >>> response by me [2]) and prior discussion about a new madvise() interface
> >>> [3].
> >>>
> >>> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> >>> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> >>> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> >>> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> >>>
> >>> ================
> >>>
> >>> Currently, we are rather restricted in how madvise() operations
> >>> proceed. While effort has been put in to expanding what process_madvise()
> >>> can do (that is - unrestricted application of advice to the local process
> >>> alongside recent improvements on the efficiency of TLB operations over
> >>> these batvches), we are still constrained by existing madvise() limitations
> >>> and default behaviours.
> >>>
> >>> This series makes use of the currently unused flags field in
> >>> process_madvise() to provide more flexiblity.
> >>>
> >>
> >> In general, sounds like an interesting approach.
> >
> > Thanks!
> >
> > If you agree this is workable, then I'll go ahead and put some more effort
> > into writing tests etc. on the next respin.
> >
>
> So the prctl and process_madvise patches both are trying to accomplish a
> similar end goal.
>
> Would it make sense to discuss what would be the best way forward before we
> continue developing the solutions? If we are not at that stage and a clear
> picture has not formed yet, happy to continue refining the solutions.
> But just thought I would check.

As stated in the thread (I went out of my way to link everything above),
and stated with you cc'd in every discussion (I think!), this idea arose as
a concept that came out of my brainstorming whether we could better align
this concept with madvise().

This arose because of discussions had in-thread where we agreed it made
most sense to have this feature in effect perform a 'default madvise()'.

At this point, we are essentially _duplicating what madvise does_ in the
prctl() with your approach, only now all of the code that does that is
bitrotting in kernel/sys.c.

This approach was an attempt to avoid that.

It started as a 'crazy idea' I was throwing out there, as an RFC. The idea
was we could compare and contrast with the prctl() RFC.

Obviously this is gaining some traction now as a concept and your respin
was a little rushed out so needs rework.

So, apologies if it seems this is an attempt to supercede your series or
such, it truly wasn't intended that way, rather I have been working 12 hour
days trying to find the best way possible to _make what you want happen_
while also doing what's best for mm and madvise (among many other tasks of
course :)

So the idea is we can:

a. solve long-standing problems with madvise() that prevent it being used
   for certain things (e.g. the error on gaps which breaks
   process_madvise() and hides real errors)

b. Also provide this functionality in a way that the absolute most minimal
   delta from existing logic...

c. ...While keeping all this logic in mm and avoiding bitrot in
   kernel/sys.c.

So obviously my view is that this approach is superior to the prctl() one.

However the intent was that you would probably take a little longer in
spinning up an RFC, and then we could compare the two approaches, if people
didn't reject my 'crazy idea' :)

Obviously you respan your (kinda ;) RFC way quicker than expected, and then
there were a ton of bugs, which added workload and made it perhaps a little
more pressing as to deciding which should be pursued.

I would suggest holding off on your series until we see whether this one
works on this basis. But obviously this is simply my point of view.

To be clear however, this was not a planned series of events...

I mean equally, we are seeing several other series from Yafang, Nico and
Dev in relation to [m]THP and even a obliquely-related series from Barry,
so it seems we are in contention across many planes here :)

>
> I feel like changing process_madvise which was designed to work on an array
> of iovec structures to have flags to skip errors and ignore the iovec
> makes it function similar to a prctl call is not the right approach.
> IMHO, prctl is a more direct solution to this.

I don't know what you mean by 'function similar to a prctl call', or what
you mean by it being a more direct solution.

The problem with prctl() is there is no pattern, it's a dumping ground for
arbitrary stuff and a prime place for bitrot. It also means we give up
maintainership of mm-specific code.

It encourages misalignment with other interfaces and APIs.

What is being discussed here is an effort to _better align what you want
with an existing interface_ - if we treat this as a 'default madvise()'
plus having additional madvise functionality (apply to all, ignoring errors
e.g.) - and put all this code _alongside existing madvise code_ - this
makes this vastly more maintainable, futureproof and robust.

And keep in mind the proposed flags are _flags_ not default behaviours,
ones which we will carefully restrict to known flags, starting with the
flags you guys need.

>
> I know that Lonenzo doesn't like prctl and wants to unify this in process_madvise.
> But if in the end, we want to have a THP auto way which is truly transparent,
> would it not be better to just have this as prctl and not change the madvise
> structure? Maybe in a few years we wont need any of this, and it will be lost
> in prctl and madvise wouldn't have changed for this?

This really sounds a lot like your colleague Shakeel's objection, so I
don't want to duplicate my response too much, but as I said to him, at this
stage we would set THP mode to 'auto', but still have to support
MADV_[NO]HUGEPAGE.

We may then wish to set these as no-ops in auto mode right? But they'd
still have to exist for uAPI reasons.

So would it be better to do this in mm/madvise.c alongside literally all
other madvise() code and the existing handling for MADV_[NO]HUGEPAGE, or
would it be better to throw it in kernel/sys.c and hope people remember to
update it?

I think it's pretty clear what the answer to that is.

>
> Again, this is just to have a discussion (and not an aggressive argument :)),
> and would love to get feedback from everyone in the community.
> If its too early to have this discussion, its completely fine and we can
> still keep developing the RFCs :)

I try hard never to be aggressive, I am firm when I feel it is appropriate
to be so (it's important to push back when necessarily I feel), but I
always try to maintain civility as well as I can.

Of course I am imperfect, so apologies if it comes across any other way, I
really do try very hard to maintain a high standard of professionalism
on-list.

Again as I said, I really did not intend to supercede or interfere with
your series, this was just unfortunate timing and throwing out ideas to
reach the best solution.

My objection to prctl() is due to bit-rot, mm code in inappropriate places,
the fact it by nature lends itself to violating conventions and practices
that exist in other mm APIs, not some subjective sense of evil.

It is historically a place where 'things that people don't really care
about but don't quite want to NACK' go also, and to me suggests that the
problem hasn't necessarily been thought through to see how it might be
implemented by extending existing APIs or finding ways to achieve the same
thing that better align with existing intefaces.

To be clear - this concept is _all_ ultimately a product of your series and
your ideas leading the discussion within the community to this point where
a potential alternative has arisen - without which this would not have
occurred.

So the idea here is to simply explore what the best solution is that aligns
with what best serves the community.

>
> Thanks!
> Usama

Thanks, Lorenzo

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Usama Arif 8 months, 3 weeks ago


On 20/05/2025 20:21, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 07:24:04PM +0100, Usama Arif wrote:
>>
>>
>> On 20/05/2025 18:47, Lorenzo Stoakes wrote:
>>> On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote:
>>>> On 19.05.25 22:52, Lorenzo Stoakes wrote:
>>>>> REVIEWERS NOTES:
>>>>> ================
>>>>>
>>>>> This is a VERY EARLY version of the idea, it's relatively untested, and I'm
>>>>> 'putting it out there' for feedback. Any serious version of this will add a
>>>>> bunch of self-tests to assert correct behaviour and I will more carefully
>>>>> confirm everything's working.
>>>>>
>>>>> This is based on discussion arising from Usama's series [0], SJ's input on
>>>>> the thread around process_madvise() behaviour [1] (and a subsequent
>>>>> response by me [2]) and prior discussion about a new madvise() interface
>>>>> [3].
>>>>>
>>>>> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
>>>>> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
>>>>> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
>>>>> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
>>>>>
>>>>> ================
>>>>>
>>>>> Currently, we are rather restricted in how madvise() operations
>>>>> proceed. While effort has been put in to expanding what process_madvise()
>>>>> can do (that is - unrestricted application of advice to the local process
>>>>> alongside recent improvements on the efficiency of TLB operations over
>>>>> these batvches), we are still constrained by existing madvise() limitations
>>>>> and default behaviours.
>>>>>
>>>>> This series makes use of the currently unused flags field in
>>>>> process_madvise() to provide more flexiblity.
>>>>>
>>>>
>>>> In general, sounds like an interesting approach.
>>>
>>> Thanks!
>>>
>>> If you agree this is workable, then I'll go ahead and put some more effort
>>> into writing tests etc. on the next respin.
>>>
>>
>> So the prctl and process_madvise patches both are trying to accomplish a
>> similar end goal.
>>
>> Would it make sense to discuss what would be the best way forward before we
>> continue developing the solutions? If we are not at that stage and a clear
>> picture has not formed yet, happy to continue refining the solutions.
>> But just thought I would check.
> 
> As stated in the thread (I went out of my way to link everything above),
> and stated with you cc'd in every discussion (I think!), this idea arose as
> a concept that came out of my brainstorming whether we could better align
> this concept with madvise().
> 
> This arose because of discussions had in-thread where we agreed it made
> most sense to have this feature in effect perform a 'default madvise()'.
> 
> At this point, we are essentially _duplicating what madvise does_ in the
> prctl() with your approach, only now all of the code that does that is
> bitrotting in kernel/sys.c.
> 
> This approach was an attempt to avoid that.
> 
> It started as a 'crazy idea' I was throwing out there, as an RFC. The idea
> was we could compare and contrast with the prctl() RFC.
> 
> Obviously this is gaining some traction now as a concept and your respin
> was a little rushed out so needs rework.
> 
> So, apologies if it seems this is an attempt to supercede your series or
> such, it truly wasn't intended that way, rather I have been working 12 hour
> days trying to find the best way possible to _make what you want happen_
> while also doing what's best for mm and madvise (among many other tasks of
> course :)

Please don't burn out spending 12 hour days on this! Your feedback is very
amazing! (Thanks again!) It will take time to come up with a solution right
for the kernel!

> 
> So the idea is we can:
> 
> a. solve long-standing problems with madvise() that prevent it being used
>    for certain things (e.g. the error on gaps which breaks
>    process_madvise() and hides real errors)
> 
> b. Also provide this functionality in a way that the absolute most minimal
>    delta from existing logic...
> 
> c. ...While keeping all this logic in mm and avoiding bitrot in
>    kernel/sys.c.
> 
> So obviously my view is that this approach is superior to the prctl() one.
> 
> However the intent was that you would probably take a little longer in
> spinning up an RFC, and then we could compare the two approaches, if people
> didn't reject my 'crazy idea' :)
> 
> Obviously you respan your (kinda ;) RFC way quicker than expected, and then
> there were a ton of bugs, which added workload and made it perhaps a little
> more pressing as to deciding which should be pursued.
> 

I feel like a ton (i.e. a thousand) sounds like a lot, there are a couple of bugs
in a series I meant to send as an RFC (mistakes happen :)). I will wait a couple
of days to see how things develop and send a prctl RFC (will remember the tag this
time) and we can have a better comparison of what this would look like.

> I would suggest holding off on your series until we see whether this one
> works on this basis. But obviously this is simply my point of view.
> 
> To be clear however, this was not a planned series of events...
> 
> I mean equally, we are seeing several other series from Yafang, Nico and
> Dev in relation to [m]THP and even a obliquely-related series from Barry,
> so it seems we are in contention across many planes here :)
> 
>>
>> I feel like changing process_madvise which was designed to work on an array
>> of iovec structures to have flags to skip errors and ignore the iovec
>> makes it function similar to a prctl call is not the right approach.
>> IMHO, prctl is a more direct solution to this.
> 
> I don't know what you mean by 'function similar to a prctl call', or what
> you mean by it being a more direct solution.

So I thought I was going a bit crazy, but I am glad someone else raised this as well.

If we look at the man page for prctl it says it manipulates various aspects
of the behavior of the calling thread or process.

If we look at the man page for prcocess_madvise, it was for providing advice for the
address ranges described by iovec and n

What I mean (and I assume Shakeel meant as well) is this is changing what
process_madvise was designed for into changing  the behaviour of a process (i.e. prctl).
This is what I mean by 'function similar to a prctl call'.


> 
> The problem with prctl() is there is no pattern, it's a dumping ground for
> arbitrary stuff and a prime place for bitrot. It also means we give up
> maintainership of mm-specific code.
> 
> It encourages misalignment with other interfaces and APIs.
> 
> What is being discussed here is an effort to _better align what you want
> with an existing interface_ - if we treat this as a 'default madvise()'
> plus having additional madvise functionality (apply to all, ignoring errors
> e.g.) - and put all this code _alongside existing madvise code_ - this
> makes this vastly more maintainable, futureproof and robust.
> 
> And keep in mind the proposed flags are _flags_ not default behaviours,
> ones which we will carefully restrict to known flags, starting with the
> flags you guys need.
> 
>>
>> I know that Lonenzo doesn't like prctl and wants to unify this in process_madvise.
>> But if in the end, we want to have a THP auto way which is truly transparent,
>> would it not be better to just have this as prctl and not change the madvise
>> structure? Maybe in a few years we wont need any of this, and it will be lost
>> in prctl and madvise wouldn't have changed for this?
> 
> This really sounds a lot like your colleague Shakeel's objection, so I
> don't want to duplicate my response too much, but as I said to him, at this
> stage we would set THP mode to 'auto', but still have to support
> MADV_[NO]HUGEPAGE.
> 
> We may then wish to set these as no-ops in auto mode right? But they'd
> still have to exist for uAPI reasons.
> 
> So would it be better to do this in mm/madvise.c alongside literally all
> other madvise() code and the existing handling for MADV_[NO]HUGEPAGE, or
> would it be better to throw it in kernel/sys.c and hope people remember to
> update it?
> 
> I think it's pretty clear what the answer to that is.
> 
>>
>> Again, this is just to have a discussion (and not an aggressive argument :)),
>> and would love to get feedback from everyone in the community.
>> If its too early to have this discussion, its completely fine and we can
>> still keep developing the RFCs :)
> 
> I try hard never to be aggressive, I am firm when I feel it is appropriate
> to be so (it's important to push back when necessarily I feel), but I
> always try to maintain civility as well as I can.
> 
> Of course I am imperfect, so apologies if it comes across any other way, I
> really do try very hard to maintain a high standard of professionalism
> on-list.
> 
> Again as I said, I really did not intend to supercede or interfere with
> your series, this was just unfortunate timing and throwing out ideas to
> reach the best solution.
> 
> My objection to prctl() is due to bit-rot, mm code in inappropriate places,
> the fact it by nature lends itself to violating conventions and practices
> that exist in other mm APIs, not some subjective sense of evil.
> 
> It is historically a place where 'things that people don't really care
> about but don't quite want to NACK' go also, and to me suggests that the
> problem hasn't necessarily been thought through to see how it might be
> implemented by extending existing APIs or finding ways to achieve the same
> thing that better align with existing intefaces.
> 
> To be clear - this concept is _all_ ultimately a product of your series and
> your ideas leading the discussion within the community to this point where
> a potential alternative has arisen - without which this would not have
> occurred.
> 
> So the idea here is to simply explore what the best solution is that aligns
> with what best serves the community.
> 
>>
>> Thanks!
>> Usama
> 
> Thanks, Lorenzo

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Lorenzo Stoakes 8 months, 3 weeks ago

On Tue, May 20, 2025 at 08:42:50PM +0100, Usama Arif wrote:
>
>
> On 20/05/2025 20:21, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 07:24:04PM +0100, Usama Arif wrote:
> >>
> >>
> >> On 20/05/2025 18:47, Lorenzo Stoakes wrote:
> >>> On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote:
> >>>> On 19.05.25 22:52, Lorenzo Stoakes wrote:
> >>>>> REVIEWERS NOTES:
> >>>>> ================
> >>>>>
> >>>>> This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> >>>>> 'putting it out there' for feedback. Any serious version of this will add a
> >>>>> bunch of self-tests to assert correct behaviour and I will more carefully
> >>>>> confirm everything's working.
> >>>>>
> >>>>> This is based on discussion arising from Usama's series [0], SJ's input on
> >>>>> the thread around process_madvise() behaviour [1] (and a subsequent
> >>>>> response by me [2]) and prior discussion about a new madvise() interface
> >>>>> [3].
> >>>>>
> >>>>> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> >>>>> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> >>>>> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> >>>>> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> >>>>>
> >>>>> ================
> >>>>>
> >>>>> Currently, we are rather restricted in how madvise() operations
> >>>>> proceed. While effort has been put in to expanding what process_madvise()
> >>>>> can do (that is - unrestricted application of advice to the local process
> >>>>> alongside recent improvements on the efficiency of TLB operations over
> >>>>> these batvches), we are still constrained by existing madvise() limitations
> >>>>> and default behaviours.
> >>>>>
> >>>>> This series makes use of the currently unused flags field in
> >>>>> process_madvise() to provide more flexiblity.
> >>>>>
> >>>>
> >>>> In general, sounds like an interesting approach.
> >>>
> >>> Thanks!
> >>>
> >>> If you agree this is workable, then I'll go ahead and put some more effort
> >>> into writing tests etc. on the next respin.
> >>>
> >>
> >> So the prctl and process_madvise patches both are trying to accomplish a
> >> similar end goal.
> >>
> >> Would it make sense to discuss what would be the best way forward before we
> >> continue developing the solutions? If we are not at that stage and a clear
> >> picture has not formed yet, happy to continue refining the solutions.
> >> But just thought I would check.
> >
> > As stated in the thread (I went out of my way to link everything above),
> > and stated with you cc'd in every discussion (I think!), this idea arose as
> > a concept that came out of my brainstorming whether we could better align
> > this concept with madvise().
> >
> > This arose because of discussions had in-thread where we agreed it made
> > most sense to have this feature in effect perform a 'default madvise()'.
> >
> > At this point, we are essentially _duplicating what madvise does_ in the
> > prctl() with your approach, only now all of the code that does that is
> > bitrotting in kernel/sys.c.
> >
> > This approach was an attempt to avoid that.
> >
> > It started as a 'crazy idea' I was throwing out there, as an RFC. The idea
> > was we could compare and contrast with the prctl() RFC.
> >
> > Obviously this is gaining some traction now as a concept and your respin
> > was a little rushed out so needs rework.
> >
> > So, apologies if it seems this is an attempt to supercede your series or
> > such, it truly wasn't intended that way, rather I have been working 12 hour
> > days trying to find the best way possible to _make what you want happen_
> > while also doing what's best for mm and madvise (among many other tasks of
> > course :)
>
> Please don't burn out spending 12 hour days on this! Your feedback is very
> amazing! (Thanks again!) It will take time to come up with a solution right
> for the kernel!

Thanks I do try, the point is that everything is in good faith here... I'm just
trying to do what's right for the kernel.

>
> >
> > So the idea is we can:
> >
> > a. solve long-standing problems with madvise() that prevent it being used
> >    for certain things (e.g. the error on gaps which breaks
> >    process_madvise() and hides real errors)
> >
> > b. Also provide this functionality in a way that the absolute most minimal
> >    delta from existing logic...
> >
> > c. ...While keeping all this logic in mm and avoiding bitrot in
> >    kernel/sys.c.
> >
> > So obviously my view is that this approach is superior to the prctl() one.
> >
> > However the intent was that you would probably take a little longer in
> > spinning up an RFC, and then we could compare the two approaches, if people
> > didn't reject my 'crazy idea' :)
> >
> > Obviously you respan your (kinda ;) RFC way quicker than expected, and then
> > there were a ton of bugs, which added workload and made it perhaps a little
> > more pressing as to deciding which should be pursued.
> >
>
> I feel like a ton (i.e. a thousand) sounds like a lot, there are a couple of bugs
> in a series I meant to send as an RFC (mistakes happen :)). I will wait a couple
> of days to see how things develop and send a prctl RFC (will remember the tag this
> time) and we can have a better comparison of what this would look like.

Figure of speech ;) I didn't mean to disparage it, rather it felt rushed. Sorry
maybe should have more carefully phrased that. Text is a bad medium etc.

>
> > I would suggest holding off on your series until we see whether this one
> > works on this basis. But obviously this is simply my point of view.
> >
> > To be clear however, this was not a planned series of events...
> >
> > I mean equally, we are seeing several other series from Yafang, Nico and
> > Dev in relation to [m]THP and even a obliquely-related series from Barry,
> > so it seems we are in contention across many planes here :)
> >
> >>
> >> I feel like changing process_madvise which was designed to work on an array
> >> of iovec structures to have flags to skip errors and ignore the iovec
> >> makes it function similar to a prctl call is not the right approach.
> >> IMHO, prctl is a more direct solution to this.
> >
> > I don't know what you mean by 'function similar to a prctl call', or what
> > you mean by it being a more direct solution.
>
> So I thought I was going a bit crazy, but I am glad someone else raised this as well.
>
> If we look at the man page for prctl it says it manipulates various aspects
> of the behavior of the calling thread or process.
>
> If we look at the man page for prcocess_madvise, it was for providing advice for the
> address ranges described by iovec and n

And your proposal is to do precisely the same only also setting the default.

>
> What I mean (and I assume Shakeel meant as well) is this is changing what
> process_madvise was designed for into changing  the behaviour of a process (i.e. prctl).
> This is what I mean by 'function similar to a prctl call'.

Yeah I'm aware of the definition in the man page, it's really nebulous. I mean
even 'process' is unclear, do you mean task, do you mean thread leader? Do you
mean the shared virtual address space?

prctl() already does VMA-specific stuff. The overlap is enormous, and in
practice this is not a distinction that is in any real sense maintained.

prctl() is, de facto, a 'place where random stuff is put'.

And unavoidably you'll end up with mm-specific code (which now _must be
exported_) in kernel/sys.c.

At any rate, clearly there is still debate to be had here, so let's work on
both these series and see where things end up.

Obviously I'm open to feedback from ohers, but clearly I have a fairly
strong view on this :)

>
>
> >
> > The problem with prctl() is there is no pattern, it's a dumping ground for
> > arbitrary stuff and a prime place for bitrot. It also means we give up
> > maintainership of mm-specific code.
> >
> > It encourages misalignment with other interfaces and APIs.
> >
> > What is being discussed here is an effort to _better align what you want
> > with an existing interface_ - if we treat this as a 'default madvise()'
> > plus having additional madvise functionality (apply to all, ignoring errors
> > e.g.) - and put all this code _alongside existing madvise code_ - this
> > makes this vastly more maintainable, futureproof and robust.
> >
> > And keep in mind the proposed flags are _flags_ not default behaviours,
> > ones which we will carefully restrict to known flags, starting with the
> > flags you guys need.
> >
> >>
> >> I know that Lonenzo doesn't like prctl and wants to unify this in process_madvise.
> >> But if in the end, we want to have a THP auto way which is truly transparent,
> >> would it not be better to just have this as prctl and not change the madvise
> >> structure? Maybe in a few years we wont need any of this, and it will be lost
> >> in prctl and madvise wouldn't have changed for this?
> >
> > This really sounds a lot like your colleague Shakeel's objection, so I
> > don't want to duplicate my response too much, but as I said to him, at this
> > stage we would set THP mode to 'auto', but still have to support
> > MADV_[NO]HUGEPAGE.
> >
> > We may then wish to set these as no-ops in auto mode right? But they'd
> > still have to exist for uAPI reasons.
> >
> > So would it be better to do this in mm/madvise.c alongside literally all
> > other madvise() code and the existing handling for MADV_[NO]HUGEPAGE, or
> > would it be better to throw it in kernel/sys.c and hope people remember to
> > update it?
> >
> > I think it's pretty clear what the answer to that is.
> >
> >>
> >> Again, this is just to have a discussion (and not an aggressive argument :)),
> >> and would love to get feedback from everyone in the community.
> >> If its too early to have this discussion, its completely fine and we can
> >> still keep developing the RFCs :)
> >
> > I try hard never to be aggressive, I am firm when I feel it is appropriate
> > to be so (it's important to push back when necessarily I feel), but I
> > always try to maintain civility as well as I can.
> >
> > Of course I am imperfect, so apologies if it comes across any other way, I
> > really do try very hard to maintain a high standard of professionalism
> > on-list.
> >
> > Again as I said, I really did not intend to supercede or interfere with
> > your series, this was just unfortunate timing and throwing out ideas to
> > reach the best solution.
> >
> > My objection to prctl() is due to bit-rot, mm code in inappropriate places,
> > the fact it by nature lends itself to violating conventions and practices
> > that exist in other mm APIs, not some subjective sense of evil.
> >
> > It is historically a place where 'things that people don't really care
> > about but don't quite want to NACK' go also, and to me suggests that the
> > problem hasn't necessarily been thought through to see how it might be
> > implemented by extending existing APIs or finding ways to achieve the same
> > thing that better align with existing intefaces.
> >
> > To be clear - this concept is _all_ ultimately a product of your series and
> > your ideas leading the discussion within the community to this point where
> > a potential alternative has arisen - without which this would not have
> > occurred.
> >
> > So the idea here is to simply explore what the best solution is that aligns
> > with what best serves the community.
> >
> >>
> >> Thanks!
> >> Usama
> >
> > Thanks, Lorenzo
>

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Jann Horn 8 months, 3 weeks ago

On Mon, May 19, 2025 at 10:53 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> 3. PMADV_SET_FORK_EXEC_DEFAULT
>
> It may be desirable for a user to specify that all VMAs mapped in a process
> address space default to having an madvise() behaviour established by
> default, in such a fashion as that this persists across fork/exec.

Settings that persist across exec are generally a bit dodgy and you
have to ask questions like "what happens on setuid execution, could
this somehow influence the behavior of a setuid binary in a
security-relevant way", and I'm not sure whether that is the case with
hugepage flags but I guess it could maybe influence the alignment with
which mappings are created or something like that? And if you add more
flags to this list later, you'll again have to think about annoying
setuid considerations each time.

For comparison, personality flags are explicitly supposed to persist
across execve, but they can be dangerous (stuff like READ_IMPLIES_EXEC
and ADDR_NO_RANDOMIZE), so we have PER_CLEAR_ON_SETID which gets
cleared only if the execution is privileged. (Annoyingly, the
PER_CLEAR_ON_SETID handling is currently implemented separately for
each type of privileged execution we can have
[setuid/setgid/fscaps/selinux transition/apparmor transition/smack
transition], but I guess you could probably gate it on
bprm->secureexec instead...).

It would be nice if you could either make this a property of the
mm_struct that does not persist across exec, or if that would break
your intended usecase, alternatively wipe it on privileged execution.

> Since this is a very powerful option that would make no sense for many
> advice modes, we explicitly only permit known-safe flags here (currently
> MADV_HUGEPAGE and MADV_NOHUGEPAGE only).

Ah, sort of like a more generic version of prctl(PR_SET_THP_DISABLE,
flag, 0, 0, 0) that also allows opt-in on top of opt-out.

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Lorenzo Stoakes 8 months, 3 weeks ago

On Mon, May 19, 2025 at 11:53:43PM +0200, Jann Horn wrote:
> On Mon, May 19, 2025 at 10:53 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > 3. PMADV_SET_FORK_EXEC_DEFAULT
> >
> > It may be desirable for a user to specify that all VMAs mapped in a process
> > address space default to having an madvise() behaviour established by
> > default, in such a fashion as that this persists across fork/exec.
>
> Settings that persist across exec are generally a bit dodgy and you
> have to ask questions like "what happens on setuid execution, could
> this somehow influence the behavior of a setuid binary in a
> security-relevant way", and I'm not sure whether that is the case with
> hugepage flags but I guess it could maybe influence the alignment with
> which mappings are created or something like that? And if you add more
> flags to this list later, you'll again have to think about annoying
> setuid considerations each time.

I do absolutely agree they're dodgy, which is why this is significantly
restricted.

The intent is that we'd _only ever_ add anything to this list after such
careful consideration had been made.

And obviously then you can assert whatever checks make sense for each
permitted madvise() flag.

Obviously this change is motivated by Usama's series, looking at an
alternative approach (as well as - at the same time - expanding what we can
do with [process_]madvise() - an ongoing bugbear for some time).

So this provides a less 'tacked on' (and thus bit rotting) version of this
change, to enforce that behaviour is consistent, aligned with madvise()
generally, and also importantly, maintained by mm :)

>
> For comparison, personality flags are explicitly supposed to persist
> across execve, but they can be dangerous (stuff like READ_IMPLIES_EXEC
> and ADDR_NO_RANDOMIZE), so we have PER_CLEAR_ON_SETID which gets
> cleared only if the execution is privileged. (Annoyingly, the
> PER_CLEAR_ON_SETID handling is currently implemented separately for
> each type of privileged execution we can have
> [setuid/setgid/fscaps/selinux transition/apparmor transition/smack
> transition], but I guess you could probably gate it on
> bprm->secureexec instead...).
>
> It would be nice if you could either make this a property of the
> mm_struct that does not persist across exec, or if that would break
> your intended usecase, alternatively wipe it on privileged execution.

The use case specifically requires persistence, unfortunately (we are still
determining whether this makes sense however - it is by no means a 'done
deal' that we're accepting this as a thing).

I suppose wiping on privileged execution could be achieved by storing a
mask of these permitted flags and clearing that mask in mm->def_flags at
this point?

Mightn't a better solution be to just require ns_capable(CAP_SYS_ADMIN)?

Usama - would wiping these flags on privileged execution, or requiring a
privileged process to be able to do this break your use case?

>
> > Since this is a very powerful option that would make no sense for many
> > advice modes, we explicitly only permit known-safe flags here (currently
> > MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
>
> Ah, sort of like a more generic version of prctl(PR_SET_THP_DISABLE,
> flag, 0, 0, 0) that also allows opt-in on top of opt-out.

Right.

Thanks for taking a look at this! Your input on the security side is
absolutely invaluable :)

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Jann Horn 8 months, 3 weeks ago

On Tue, May 20, 2025 at 7:36 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Mon, May 19, 2025 at 11:53:43PM +0200, Jann Horn wrote:
> > For comparison, personality flags are explicitly supposed to persist
> > across execve, but they can be dangerous (stuff like READ_IMPLIES_EXEC
> > and ADDR_NO_RANDOMIZE), so we have PER_CLEAR_ON_SETID which gets
> > cleared only if the execution is privileged. (Annoyingly, the
> > PER_CLEAR_ON_SETID handling is currently implemented separately for
> > each type of privileged execution we can have
> > [setuid/setgid/fscaps/selinux transition/apparmor transition/smack
> > transition], but I guess you could probably gate it on
> > bprm->secureexec instead...).
> >
> > It would be nice if you could either make this a property of the
> > mm_struct that does not persist across exec, or if that would break
> > your intended usecase, alternatively wipe it on privileged execution.
>
> The use case specifically requires persistence, unfortunately (we are still
> determining whether this makes sense however - it is by no means a 'done
> deal' that we're accepting this as a thing).
>
> I suppose wiping on privileged execution could be achieved by storing a
> mask of these permitted flags and clearing that mask in mm->def_flags at
> this point?

Oh, I see, we're already inheriting VM_NOHUGEPAGE on execve through
mm->def_flags, with the bitmask VM_INIT_DEF_MASK controlling what is
inheritable? Hmmmm... I guess turning hugepages _off_ should be
fine...

Yeah I guess I'd do this by adding another bitmask
VM_INIT_DEF_MASK_SECUREEXEC or something like that, and then applying
that bitmask on setuid execution.

Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

Posted by Lorenzo Stoakes 8 months, 3 weeks ago

On Tue, May 20, 2025 at 06:04:47PM +0200, Jann Horn wrote:
> On Tue, May 20, 2025 at 7:36 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Mon, May 19, 2025 at 11:53:43PM +0200, Jann Horn wrote:
> > > For comparison, personality flags are explicitly supposed to persist
> > > across execve, but they can be dangerous (stuff like READ_IMPLIES_EXEC
> > > and ADDR_NO_RANDOMIZE), so we have PER_CLEAR_ON_SETID which gets
> > > cleared only if the execution is privileged. (Annoyingly, the
> > > PER_CLEAR_ON_SETID handling is currently implemented separately for
> > > each type of privileged execution we can have
> > > [setuid/setgid/fscaps/selinux transition/apparmor transition/smack
> > > transition], but I guess you could probably gate it on
> > > bprm->secureexec instead...).
> > >
> > > It would be nice if you could either make this a property of the
> > > mm_struct that does not persist across exec, or if that would break
> > > your intended usecase, alternatively wipe it on privileged execution.
> >
> > The use case specifically requires persistence, unfortunately (we are still
> > determining whether this makes sense however - it is by no means a 'done
> > deal' that we're accepting this as a thing).
> >
> > I suppose wiping on privileged execution could be achieved by storing a
> > mask of these permitted flags and clearing that mask in mm->def_flags at
> > this point?
>
> Oh, I see, we're already inheriting VM_NOHUGEPAGE on execve through
> mm->def_flags, with the bitmask VM_INIT_DEF_MASK controlling what is
> inheritable? Hmmmm... I guess turning hugepages _off_ should be
> fine...
>
> Yeah I guess I'd do this by adding another bitmask
> VM_INIT_DEF_MASK_SECUREEXEC or something like that, and then applying
> that bitmask on setuid execution.

I guess we could do it this way, as it would only otherwise limit a non-sys
admin user, and we should try to keep things as flexible as possible.

Let me do this for v2 and see how it works.

As it seems there's some general traction here I can also write some
tests...