[RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag

Lorenzo Stoakes posted 5 patches 7 months ago
[RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
Posted by Lorenzo Stoakes 7 months ago
It's useful in certain cases to be able to default-enable an madvise() flag
for all newly mapped VMAs, and for that to survive fork/exec.

The natural place to specify something like this is in an madvise()
invocation, and thus providing this functionality as a flag to
process_madvise() makes sense.

We intentionally limit this only to flags that we know should function
correctly without issue, and to be conservative about this, so we initially
limit ourselves only to MADV_HUGEPAGE, MADV_NOHUGEPAGE, that is - setting
the VM_HUGEPAGE, VM_NOHUGEPAGE VMA flags.

We implement this functionality by using the mm_struct->def_flags field.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/uapi/asm-generic/mman-common.h |  1 +
 mm/madvise.c                           | 43 +++++++++++++++++++++++++-
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 58c8a3fadf99..6998ea0ecc6d 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -94,5 +94,6 @@
 /* process_madvise() flags */
 #define PMADV_SKIP_ERRORS (1U << 0) /* Skip VMAs on errors, but carry on. Implies no error on unmapped. */
 #define PMADV_NO_ERROR_ON_UNMAPPED (1U << 1) /* Never report an error on unmapped ranges. */
+#define PMADV_SET_FORK_EXEC_DEFAULT (1U << 2) /* Set the behavior as a default that survives fork/exec. */
 
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index fd94ef27f909..9ea36800de3c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1869,6 +1869,40 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	return do_madvise(current->mm, start, len_in, behavior);
 }
 
+/*
+ * Are we permitted to set an madvise() behavior by default across the virtual
+ * address space, surviving fork/exec?
+ */
+static bool can_set_default_behavior(int behavior)
+{
+	switch (behavior) {
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	case MADV_HUGEPAGE:
+	case MADV_NOHUGEPAGE:
+		return true;
+#endif
+	default:
+		return false;
+	}
+}
+
+static void set_default_behavior(struct mm_struct *mm, int behavior)
+{
+	switch (behavior) {
+	case MADV_HUGEPAGE:
+		mm->def_flags &= ~VM_NOHUGEPAGE;
+		mm->def_flags |= VM_HUGEPAGE;
+		break;
+	case MADV_NOHUGEPAGE:
+		mm->def_flags &= ~VM_HUGEPAGE;
+		mm->def_flags |= VM_NOHUGEPAGE;
+		break;
+	default:
+		WARN_ON(1);
+		break;
+	}
+}
+
 /* Perform an madvise operation over a vector of addresses and lengths. */
 static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 			      int behavior, int flags)
@@ -1882,8 +1916,12 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 		.flags = flags,
 	};
 	bool can_skip = flags & PMADV_SKIP_ERRORS;
+	bool set_default = flags & PMADV_SET_FORK_EXEC_DEFAULT;
 	size_t skipped = 0;
 
+	if (set_default && !can_set_default_behavior(behavior))
+		return -EINVAL;
+
 	total_len = iov_iter_count(iter);
 
 	ret = madvise_lock(mm, behavior);
@@ -1931,6 +1969,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 		if (can_skip && madv_behavior.saw_error) {
 			skipped++;
 			madv_behavior.saw_error = false;
+		} else if (set_default) {
+			set_default_behavior(mm, behavior);
 		}
 
 		iov_iter_advance(iter, iter_iov_len(iter));
@@ -1951,7 +1991,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 
 static bool check_process_madvise_flags(unsigned int flags)
 {
-	unsigned int mask = PMADV_SKIP_ERRORS | PMADV_NO_ERROR_ON_UNMAPPED;
+	unsigned int mask = PMADV_SKIP_ERRORS | PMADV_NO_ERROR_ON_UNMAPPED |
+		PMADV_SET_FORK_EXEC_DEFAULT;
 
 	if (flags & ~mask)
 		return false;
-- 
2.49.0
Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
Posted by Johannes Weiner 7 months ago
On Mon, May 19, 2025 at 09:52:41PM +0100, Lorenzo Stoakes wrote:
> We intentionally limit this only to flags that we know should function
> correctly without issue, and to be conservative about this, so we initially
> limit ourselves only to MADV_HUGEPAGE, MADV_NOHUGEPAGE, that is - setting
> the VM_HUGEPAGE, VM_NOHUGEPAGE VMA flags.

Hm, but do we actually expect more to show up? Looking at the current
list of advisories, we have the conventional ones:

       MADV_NORMAL
              No special treatment.  This is the default.

       MADV_RANDOM
              Expect page references in random order.  (Hence, read ahead may be less useful than normally.)

       MADV_SEQUENTIAL
              Expect page references in sequential order.  (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)

       MADV_WILLNEED
              Expect access in the near future.  (Hence, it might be a good idea to read some pages ahead.)

       MADV_DONTNEED
              Do not expect access in the near future.  (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.)

where only RANDOM and SEQUENTIAL are actual policies. But since those
apply to file mappings only, they don't seem to make much sense for a
process-wide setting.

For Linux-specific advisories we have

	MADV_REMOVE		- action
	MADV_DONTFORK		- policy, but not sure how this could work as a process-wide thing
	MADV_HWPOISON		- action
	MADV_MERGEABLE		- policy, but we have a prctl() for process-wide settings
	MADV_SOFTOFFLINE	- action
	MADV_HUGEPAGE		- policy, but we have a prctl() for process-wide disabling
	MADV_COLLAPSE		- action
	MADV_DONTDUMP		- policy, but there is /proc/<pid>/coredump_filter for process-wide settings
	MADV_FREE		- action
	MADV_WIPEONFORK		- policy, but similar to DONTFORK, not sure how this would work process-wide
	MADV_COLD		- action
	MADV_PAGEOUT		- action
	MADV_POPULATE_READ	- action
	MADV_POPULATE_WRITE	- action
	MADV_GUARD_INSTALL	- action

So the vast majority of advisories are either one-off actions, or they
are policies that naturally only make sense for very specific ranges.

KSM and THP really seem like the notable exception[1], rather than a
rule. And we already *have* prctl() ops to modify per-process policies
for these. So I'm reluctant to agree we should drill open and expand
madvise() to cover them - especially with the goal or the expectation
that THP per-process policies shouldn't matter that much down the line.

I will admit, I don't hate prctl() as much as you do. It *is* a fairly
broad interface - setting per-process policies. So it's bound to pick
odds and ends of various subsystems that don't quite fit elsewhere.

Is it the right choice everywhere? Of course not. And can its
broadness be abused to avoid real interface design? Absolutely.

I don't think that makes it inherently bad, though. As long as we make
an honest effort to find the best home for new knobs.

But IMO, in this case it is. The inheritance-along-the-process-tree
behavior that we want here is an established pattern in prctl().

And *because* that propagation is a security-sensitive pattern
(although I don't think the THP policy specifically *is* a security
issue), to me it makes more sense to keep this behavior confined to
prctl() at least, and not add it to madvise too.

[1] Maybe we should have added sys_andrea() to cover those, as well as
    the SECCOMP prctl()s ;)
Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
Posted by Lorenzo Stoakes 6 months, 3 weeks ago
On Tue, May 20, 2025 at 06:26:09PM -0400, Johannes Weiner wrote:
> On Mon, May 19, 2025 at 09:52:41PM +0100, Lorenzo Stoakes wrote:
> > We intentionally limit this only to flags that we know should function
> > correctly without issue, and to be conservative about this, so we initially
> > limit ourselves only to MADV_HUGEPAGE, MADV_NOHUGEPAGE, that is - setting
> > the VM_HUGEPAGE, VM_NOHUGEPAGE VMA flags.
>
> Hm, but do we actually expect more to show up? Looking at the current
> list of advisories, we have the conventional ones:
>
>        MADV_NORMAL
>               No special treatment.  This is the default.
>
>        MADV_RANDOM
>               Expect page references in random order.  (Hence, read ahead may be less useful than normally.)
>
>        MADV_SEQUENTIAL
>               Expect page references in sequential order.  (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)
>
>        MADV_WILLNEED
>               Expect access in the near future.  (Hence, it might be a good idea to read some pages ahead.)
>
>        MADV_DONTNEED
>               Do not expect access in the near future.  (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.)
>
> where only RANDOM and SEQUENTIAL are actual policies. But since those
> apply to file mappings only, they don't seem to make much sense for a
> process-wide setting.
>
> For Linux-specific advisories we have
>
> 	MADV_REMOVE		- action
> 	MADV_DONTFORK		- policy, but not sure how this could work as a process-wide thing
> 	MADV_HWPOISON		- action
> 	MADV_MERGEABLE		- policy, but we have a prctl() for process-wide settings
> 	MADV_SOFTOFFLINE	- action
> 	MADV_HUGEPAGE		- policy, but we have a prctl() for process-wide disabling
> 	MADV_COLLAPSE		- action
> 	MADV_DONTDUMP		- policy, but there is /proc/<pid>/coredump_filter for process-wide settings
> 	MADV_FREE		- action
> 	MADV_WIPEONFORK		- policy, but similar to DONTFORK, not sure how this would work process-wide
> 	MADV_COLD		- action
> 	MADV_PAGEOUT		- action
> 	MADV_POPULATE_READ	- action
> 	MADV_POPULATE_WRITE	- action
> 	MADV_GUARD_INSTALL	- action
>
> So the vast majority of advisories are either one-off actions, or they
> are policies that naturally only make sense for very specific ranges.
>
> KSM and THP really seem like the notable exception[1], rather than a
> rule. And we already *have* prctl() ops to modify per-process policies
> for these. So I'm reluctant to agree we should drill open and expand
> madvise() to cover them - especially with the goal or the expectation
> that THP per-process policies shouldn't matter that much down the line.
>
> I will admit, I don't hate prctl() as much as you do. It *is* a fairly
> broad interface - setting per-process policies. So it's bound to pick
> odds and ends of various subsystems that don't quite fit elsewhere.
>
> Is it the right choice everywhere? Of course not. And can its
> broadness be abused to avoid real interface design? Absolutely.
>
> I don't think that makes it inherently bad, though. As long as we make
> an honest effort to find the best home for new knobs.
>
> But IMO, in this case it is. The inheritance-along-the-process-tree
> behavior that we want here is an established pattern in prctl().
>
> And *because* that propagation is a security-sensitive pattern
> (although I don't think the THP policy specifically *is* a security
> issue), to me it makes more sense to keep this behavior confined to
> prctl() at least, and not add it to madvise too.
>
> [1] Maybe we should have added sys_andrea() to cover those, as well as
>     the SECCOMP prctl()s ;)

So sorry to have missed you really excellent and well thought-out reply
here Johannes (grr mail etc.).

You make a really good point, and I think this does overall, sadly, suggest
the concept of a 'general' madvise() call while, in a sense, 'neat',
doesn't quite fit.

I do think it aligns well with a dedicated mctl() API call, have sent out
an RFC discussion thread about this so we can determine if this makes
sense.

Thanks, Lorenzo
Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
Posted by Pedro Falcato 7 months ago
On Mon, May 19, 2025 at 09:52:41PM +0100, Lorenzo Stoakes wrote:
> It's useful in certain cases to be able to default-enable an madvise() flag
> for all newly mapped VMAs, and for that to survive fork/exec.
> 
> The natural place to specify something like this is in an madvise()
> invocation, and thus providing this functionality as a flag to
> process_madvise() makes sense.
> 
> We intentionally limit this only to flags that we know should function
> correctly without issue, and to be conservative about this, so we initially
> limit ourselves only to MADV_HUGEPAGE, MADV_NOHUGEPAGE, that is - setting
> the VM_HUGEPAGE, VM_NOHUGEPAGE VMA flags.
> 
> We implement this functionality by using the mm_struct->def_flags field.

This seems super specific. How about this:

- PMADV_FUTURE (mirrors MCL_FUTURE). This only applies the flag to future VMAs in the current process.
- PMADV_INHERIT_FORK. This makes it so the flag is propagated to child processes (does not imply PMADV_FUTURE)
- PMADV_INHERIT_EXEC. This makes it so the flag is propagated through the execve boundary
  (and this is where we'd filter for 'safe' flags, at least through the secureexec boundary). Does not imply
  FUTURE nor INHERIT_FORK.

and, while we're at it, rename PMADV_ENTIRE_ADDRESS_SPACE to PMADV_CURRENT, to align it with MCL_CURRENT.

Thoughts?

-- 
Pedro
Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
Posted by Jann Horn 7 months ago
On Tue, May 20, 2025 at 10:38 AM Pedro Falcato <pfalcato@suse.de> wrote:
> - PMADV_INHERIT_FORK. This makes it so the flag is propagated to child processes (does not imply PMADV_FUTURE)

Do we think there will be flags settable through this API that we
explicitly _don't_ want to inherit on fork()? My understanding is that
sort of the default for fork() is to inherit everything, and things
that don't get inherited are weird special cases (like mlock() state).
(While the default for execve() is to inherit nothing about the MM.)

(I guess you could make a case that in a fork+exec sequence, the child
might not want to create hugepage between fork and exec... but this is
probably not the right place to control that?)
Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
Posted by Lorenzo Stoakes 7 months ago
On Tue, May 20, 2025 at 06:11:49PM +0200, Jann Horn wrote:
> On Tue, May 20, 2025 at 10:38 AM Pedro Falcato <pfalcato@suse.de> wrote:
> > - PMADV_INHERIT_FORK. This makes it so the flag is propagated to child processes (does not imply PMADV_FUTURE)
>
> Do we think there will be flags settable through this API that we
> explicitly _don't_ want to inherit on fork()? My understanding is that
> sort of the default for fork() is to inherit everything, and things
> that don't get inherited are weird special cases (like mlock() state).
> (While the default for execve() is to inherit nothing about the MM.)

Yeah this is true. It is the exception rather than the rule...

>
> (I guess you could make a case that in a fork+exec sequence, the child
> might not want to create hugepage between fork and exec... but this is
> probably not the right place to control that?)

From my point of view it simply reads better :)

But perhaps we can drop the fork bit and leave that implied.
Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
Posted by David Hildenbrand 7 months ago
On 20.05.25 18:19, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 06:11:49PM +0200, Jann Horn wrote:
>> On Tue, May 20, 2025 at 10:38 AM Pedro Falcato <pfalcato@suse.de> wrote:
>>> - PMADV_INHERIT_FORK. This makes it so the flag is propagated to child processes (does not imply PMADV_FUTURE)
>>
>> Do we think there will be flags settable through this API that we
>> explicitly _don't_ want to inherit on fork()? My understanding is that
>> sort of the default for fork() is to inherit everything, and things
>> that don't get inherited are weird special cases (like mlock() state).
>> (While the default for execve() is to inherit nothing about the MM.)
> 
> Yeah this is true. It is the exception rather than the rule...
> 
>>
>> (I guess you could make a case that in a fork+exec sequence, the child
>> might not want to create hugepage between fork and exec... but this is
>> probably not the right place to control that?)
> 
>  From my point of view it simply reads better :)
> 
> But perhaps we can drop the fork bit and leave that implied.

Yes, we should. Exec is where the "fun" begins.

-- 
Cheers,

David / dhildenb

Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
Posted by Lorenzo Stoakes 7 months ago
On Tue, May 20, 2025 at 09:38:50AM +0100, Pedro Falcato wrote:
> On Mon, May 19, 2025 at 09:52:41PM +0100, Lorenzo Stoakes wrote:
> > It's useful in certain cases to be able to default-enable an madvise() flag
> > for all newly mapped VMAs, and for that to survive fork/exec.
> >
> > The natural place to specify something like this is in an madvise()
> > invocation, and thus providing this functionality as a flag to
> > process_madvise() makes sense.
> >
> > We intentionally limit this only to flags that we know should function
> > correctly without issue, and to be conservative about this, so we initially
> > limit ourselves only to MADV_HUGEPAGE, MADV_NOHUGEPAGE, that is - setting
> > the VM_HUGEPAGE, VM_NOHUGEPAGE VMA flags.
> >
> > We implement this functionality by using the mm_struct->def_flags field.
>
> This seems super specific. How about this:
>
> - PMADV_FUTURE (mirrors MCL_FUTURE). This only applies the flag to future VMAs in the current process.
> - PMADV_INHERIT_FORK. This makes it so the flag is propagated to child processes (does not imply PMADV_FUTURE)
> - PMADV_INHERIT_EXEC. This makes it so the flag is propagated through the execve boundary
>   (and this is where we'd filter for 'safe' flags, at least through the secureexec boundary). Does not imply
>   FUTURE nor INHERIT_FORK.

I don't know how we could implement separate current process, fork, exec, fork/exec.
mm->def_flags is propagated this way automatically.

And again on the security stuff, I think the correct answer is to require sys
admin capability to be able to use this option _at all_. This simplifies
everything.

To have this kind of thing we'd have to add a whole new mechanism, literally
just for this, and I'd really rather not generate brand new mm_struct flags for
every possible mode (in fact that would probably makes the whole thing
intractible), or add a new field there for this.

The idea is that we get the advantages of an improved madvise interface, while
also providing the interface Usama wants without having to add some hideous
prctl() whose logic is disconnected from the rest of madvise(), while being, in
effect, a 'default madvise() for new mappings'.

So while specific to the case, nothing prevents us in future adding more
functionality if we want.

We could also potentially:

- add PMADV_SET_DEFAULT (I'm iffy about PMADV_FUTURE... but whichever we go with)
- add PMADV_INHERIT_FORK
- add PMADV_INHERIT_EXEC

And only support PMADV_SET_DEFAULT | PMADV_INHERIT_FORK | PMADV_INHERIT_EXEC for
now.

THen we could have the security semantics you specify (require cap sys admin on
PMADV_INHERIT_EXEC) but have that propagate to the only supported case.

What do you think?

>
> and, while we're at it, rename PMADV_ENTIRE_ADDRESS_SPACE to PMADV_CURRENT, to align it with MCL_CURRENT.

I'm not sure making the mlock()/madvise() stuff analagous is a good idea, as
they have different semantics. I'd rather keep these flags descriptive. Though
I'm open to alternative naming of course...

Also keep in mind these flags are for mlockall(), whose name already tells you
you're locking everything :)

>
> Thoughts?
>
> --
> Pedro
Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
Posted by Pedro Falcato 7 months ago
On Tue, May 20, 2025 at 11:21:33AM +0100, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 09:38:50AM +0100, Pedro Falcato wrote:
> > On Mon, May 19, 2025 at 09:52:41PM +0100, Lorenzo Stoakes wrote:
> > > It's useful in certain cases to be able to default-enable an madvise() flag
> > > for all newly mapped VMAs, and for that to survive fork/exec.
> > >
> > > The natural place to specify something like this is in an madvise()
> > > invocation, and thus providing this functionality as a flag to
> > > process_madvise() makes sense.
> > >
> > > We intentionally limit this only to flags that we know should function
> > > correctly without issue, and to be conservative about this, so we initially
> > > limit ourselves only to MADV_HUGEPAGE, MADV_NOHUGEPAGE, that is - setting
> > > the VM_HUGEPAGE, VM_NOHUGEPAGE VMA flags.
> > >
> > > We implement this functionality by using the mm_struct->def_flags field.
> >
> > This seems super specific. How about this:
> >
> > - PMADV_FUTURE (mirrors MCL_FUTURE). This only applies the flag to future VMAs in the current process.
> > - PMADV_INHERIT_FORK. This makes it so the flag is propagated to child processes (does not imply PMADV_FUTURE)
> > - PMADV_INHERIT_EXEC. This makes it so the flag is propagated through the execve boundary
> >   (and this is where we'd filter for 'safe' flags, at least through the secureexec boundary). Does not imply
> >   FUTURE nor INHERIT_FORK.
> 
> I don't know how we could implement separate current process, fork, exec, fork/exec.
> mm->def_flags is propagated this way automatically.
> 
> And again on the security stuff, I think the correct answer is to require sys
> admin capability to be able to use this option _at all_. This simplifies
> everything.
> 
> To have this kind of thing we'd have to add a whole new mechanism, literally
> just for this, and I'd really rather not generate brand new mm_struct flags for
> every possible mode (in fact that would probably makes the whole thing
> intractible), or add a new field there for this.
> 
> The idea is that we get the advantages of an improved madvise interface, while
> also providing the interface Usama wants without having to add some hideous
> prctl() whose logic is disconnected from the rest of madvise(), while being, in
> effect, a 'default madvise() for new mappings'.
> 
> So while specific to the case, nothing prevents us in future adding more
> functionality if we want.
> 
> We could also potentially:
> 
> - add PMADV_SET_DEFAULT (I'm iffy about PMADV_FUTURE... but whichever we go with)
> - add PMADV_INHERIT_FORK
> - add PMADV_INHERIT_EXEC
> 
> And only support PMADV_SET_DEFAULT | PMADV_INHERIT_FORK | PMADV_INHERIT_EXEC for
> now.
>
> THen we could have the security semantics you specify (require cap sys admin on
> PMADV_INHERIT_EXEC) but have that propagate to the only supported case.
> 
> What do you think?
>

If you don't want to add new fields, this option seems fine.
And then if any other usecase pops up, we're ready.
 
> >
> > and, while we're at it, rename PMADV_ENTIRE_ADDRESS_SPACE to PMADV_CURRENT, to align it with MCL_CURRENT.
> 
> I'm not sure making the mlock()/madvise() stuff analagous is a good idea, as
> they have different semantics. I'd rather keep these flags descriptive. Though
> I'm open to alternative naming of course...

Semantics are similar I think? And I do think getting shorter names is a good
idea, however I won't insist too hard on this.

-- 
Pedro
Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
Posted by Lorenzo Stoakes 7 months ago
On Tue, May 20, 2025 at 12:41:00PM +0100, Pedro Falcato wrote:
> On Tue, May 20, 2025 at 11:21:33AM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 09:38:50AM +0100, Pedro Falcato wrote:
> > > On Mon, May 19, 2025 at 09:52:41PM +0100, Lorenzo Stoakes wrote:
> > > > It's useful in certain cases to be able to default-enable an madvise() flag
> > > > for all newly mapped VMAs, and for that to survive fork/exec.
> > > >
> > > > The natural place to specify something like this is in an madvise()
> > > > invocation, and thus providing this functionality as a flag to
> > > > process_madvise() makes sense.
> > > >
> > > > We intentionally limit this only to flags that we know should function
> > > > correctly without issue, and to be conservative about this, so we initially
> > > > limit ourselves only to MADV_HUGEPAGE, MADV_NOHUGEPAGE, that is - setting
> > > > the VM_HUGEPAGE, VM_NOHUGEPAGE VMA flags.
> > > >
> > > > We implement this functionality by using the mm_struct->def_flags field.
> > >
> > > This seems super specific. How about this:
> > >
> > > - PMADV_FUTURE (mirrors MCL_FUTURE). This only applies the flag to future VMAs in the current process.
> > > - PMADV_INHERIT_FORK. This makes it so the flag is propagated to child processes (does not imply PMADV_FUTURE)
> > > - PMADV_INHERIT_EXEC. This makes it so the flag is propagated through the execve boundary
> > >   (and this is where we'd filter for 'safe' flags, at least through the secureexec boundary). Does not imply
> > >   FUTURE nor INHERIT_FORK.
> >
> > I don't know how we could implement separate current process, fork, exec, fork/exec.
> > mm->def_flags is propagated this way automatically.
> >
> > And again on the security stuff, I think the correct answer is to require sys
> > admin capability to be able to use this option _at all_. This simplifies
> > everything.
> >
> > To have this kind of thing we'd have to add a whole new mechanism, literally
> > just for this, and I'd really rather not generate brand new mm_struct flags for
> > every possible mode (in fact that would probably makes the whole thing
> > intractible), or add a new field there for this.
> >
> > The idea is that we get the advantages of an improved madvise interface, while
> > also providing the interface Usama wants without having to add some hideous
> > prctl() whose logic is disconnected from the rest of madvise(), while being, in
> > effect, a 'default madvise() for new mappings'.
> >
> > So while specific to the case, nothing prevents us in future adding more
> > functionality if we want.
> >
> > We could also potentially:
> >
> > - add PMADV_SET_DEFAULT (I'm iffy about PMADV_FUTURE... but whichever we go with)
> > - add PMADV_INHERIT_FORK
> > - add PMADV_INHERIT_EXEC
> >
> > And only support PMADV_SET_DEFAULT | PMADV_INHERIT_FORK | PMADV_INHERIT_EXEC for
> > now.
> >
> > THen we could have the security semantics you specify (require cap sys admin on
> > PMADV_INHERIT_EXEC) but have that propagate to the only supported case.
> >
> > What do you think?
> >
>
> If you don't want to add new fields, this option seems fine.
> And then if any other usecase pops up, we're ready.

Yeah sounds fair, will do on respin!

>
> > >
> > > and, while we're at it, rename PMADV_ENTIRE_ADDRESS_SPACE to PMADV_CURRENT, to align it with MCL_CURRENT.
> >
> > I'm not sure making the mlock()/madvise() stuff analagous is a good idea, as
> > they have different semantics. I'd rather keep these flags descriptive. Though
> > I'm open to alternative naming of course...
>
> Semantics are similar I think? And I do think getting shorter names is a good
> idea, however I won't insist too hard on this.

Yeah perhaps with _ALL_ thrown in to make this clear... :) warming to it... ;)

>
> --
> Pedro