There are peculiarities within the kernel where what is very clearly mm
code is performed elsewhere arbitrarily.
This violates separation of concerns and makes it harder to refactor code
to make changes to how fundamental initialisation and operation of mm logic
is performed.
One such case is the creation of the VMA containing the initial stack upon
execve()'ing a new process. This is currently performed in __bprm_mm_init()
in fs/exec.c.
Abstract this operation to create_init_stack_vma(). This allows us to limit
use of vma allocation and free code to fork and mm only.
We previously did the same for the step at which we relocate the initial
stack VMA downwards via relocate_vma_down(), now we move the initial VMA
establishment too.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
fs/exec.c | 51 +---------------------------------
include/linux/mm.h | 2 ++
mm/mmap.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 72 insertions(+), 50 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 8e4ea5f1e64c..ef34a68ef825 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -244,56 +244,7 @@ static void flush_arg_page(struct linux_binprm *bprm, unsigned long pos,
static int __bprm_mm_init(struct linux_binprm *bprm)
{
- int err;
- struct vm_area_struct *vma = NULL;
- struct mm_struct *mm = bprm->mm;
-
- bprm->vma = vma = vm_area_alloc(mm);
- if (!vma)
- return -ENOMEM;
- vma_set_anonymous(vma);
-
- if (mmap_write_lock_killable(mm)) {
- err = -EINTR;
- goto err_free;
- }
-
- /*
- * Need to be called with mmap write lock
- * held, to avoid race with ksmd.
- */
- err = ksm_execve(mm);
- if (err)
- goto err_ksm;
-
- /*
- * Place the stack at the largest stack address the architecture
- * supports. Later, we'll move this to an appropriate place. We don't
- * use STACK_TOP because that can depend on attributes which aren't
- * configured yet.
- */
- BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP);
- vma->vm_end = STACK_TOP_MAX;
- vma->vm_start = vma->vm_end - PAGE_SIZE;
- vm_flags_init(vma, VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP);
- vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
-
- err = insert_vm_struct(mm, vma);
- if (err)
- goto err;
-
- mm->stack_vm = mm->total_vm = 1;
- mmap_write_unlock(mm);
- bprm->p = vma->vm_end - sizeof(void *);
- return 0;
-err:
- ksm_exit(mm);
-err_ksm:
- mmap_write_unlock(mm);
-err_free:
- bprm->vma = NULL;
- vm_area_free(vma);
- return err;
+ return create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p);
}
static bool valid_arg_len(struct linux_binprm *bprm, long len)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9b701cfbef22..fa84e59a99bb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3223,6 +3223,8 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
extern void exit_mmap(struct mm_struct *);
+int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap,
+ unsigned long *top_mem_p);
int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift);
bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, bool write);
diff --git a/mm/mmap.c b/mm/mmap.c
index bd210aaf7ebd..ec8572a93418 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1717,6 +1717,75 @@ static int __meminit init_reserve_notifier(void)
}
subsys_initcall(init_reserve_notifier);
+/*
+ * Establish the stack VMA in an execve'd process, located temporarily at the
+ * maximum stack address provided by the architecture.
+ *
+ * We later relocate this downwards in relocate_vma_down().
+ *
+ * This function is almost certainly NOT what you want for anything other than
+ * early executable initialisation.
+ *
+ * On success, returns 0 and sets *vmap to the stack VMA and *top_mem_p to the
+ * maximum addressable location in the stack (that is capable of storing a
+ * system word of data).
+ */
+int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap,
+ unsigned long *top_mem_p)
+{
+ int err;
+ struct vm_area_struct *vma = vm_area_alloc(mm);
+
+ if (!vma)
+ return -ENOMEM;
+
+ vma_set_anonymous(vma);
+
+ if (mmap_write_lock_killable(mm)) {
+ err = -EINTR;
+ goto err_free;
+ }
+
+ /*
+ * Need to be called with mmap write lock
+ * held, to avoid race with ksmd.
+ */
+ err = ksm_execve(mm);
+ if (err)
+ goto err_ksm;
+
+ /*
+ * Place the stack at the largest stack address the architecture
+ * supports. Later, we'll move this to an appropriate place. We don't
+ * use STACK_TOP because that can depend on attributes which aren't
+ * configured yet.
+ */
+ BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP);
+ vma->vm_end = STACK_TOP_MAX;
+ vma->vm_start = vma->vm_end - PAGE_SIZE;
+ vm_flags_init(vma, VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP);
+ vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
+
+ err = insert_vm_struct(mm, vma);
+ if (err)
+ goto err;
+
+ mm->stack_vm = mm->total_vm = 1;
+ mmap_write_unlock(mm);
+ *vmap = vma;
+ *top_mem_p = vma->vm_end - sizeof(void *);
+ return 0;
+
+err:
+ ksm_exit(mm);
+err_ksm:
+ mmap_write_unlock(mm);
+err_free:
+ *vmap = NULL;
+ vm_area_free(vma);
+ return err;
+}
+
/*
* Relocate a VMA downwards by shift bytes. There cannot be any VMAs between
* this VMA and its relocated range, which will now reside at [vma->vm_start -
--
2.49.0
On Fri, Apr 25, 2025 at 03:54:34PM +0100, Lorenzo Stoakes wrote:
> There are peculiarities within the kernel where what is very clearly mm
> code is performed elsewhere arbitrarily.
>
> This violates separation of concerns and makes it harder to refactor code
> to make changes to how fundamental initialisation and operation of mm logic
> is performed.
>
> One such case is the creation of the VMA containing the initial stack upon
> execve()'ing a new process. This is currently performed in __bprm_mm_init()
> in fs/exec.c.
>
> Abstract this operation to create_init_stack_vma(). This allows us to limit
> use of vma allocation and free code to fork and mm only.
>
> We previously did the same for the step at which we relocate the initial
> stack VMA downwards via relocate_vma_down(), now we move the initial VMA
> establishment too.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> ---
> fs/exec.c | 51 +---------------------------------
I'm kind of on the fence about this. On the one hand, yes, it's all vma
goo, and should live with the rest of vma code, as you suggest. On the
other had, exec is the only consumer of this behavior, and moving it
out of fs/exec.c means that changes to the code that specifically only
impacts exec are now in a separate file, and will no longer get exec
maintainer/reviewer CCs (based on MAINTAINERS file matching). Exec is
notoriously fragile, so I'm kind of generally paranoid about changes to
its behaviors going unnoticed.
In defense of moving it, yes, this routine has gotten updates over the
many years, but it's relatively stable. But at least one thing has gone in
without exec maintainer review recently (I would have Acked it, but the
point is review): 9e567ca45f ("mm/ksm: fix ksm exec support for prctl")
Everything else was before I took on the role officially (Nov 2022).
So I guess I'm asking, how do we make sure stuff pulled out of exec
still gets exec maintainer review?
> [...]
> static int __bprm_mm_init(struct linux_binprm *bprm)
> {
> - int err;
> [...]
> - return err;
> + return create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p);
> }
I'd prefer __bprm_mm_init() go away if it's just a 1:1 wrapper now.
However, it doesn't really look like it makes too much sense for the NOMMU
logic get moved as well, since it explicitly depends on exec-specific
values (MAX_ARG_PAGES), so perhaps something like this:
diff --git a/fs/exec.c b/fs/exec.c
index 8e4ea5f1e64c..313dc70e0012 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -382,9 +382,13 @@ static int bprm_mm_init(struct linux_binprm *bprm)
bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK];
task_unlock(current->group_leader);
- err = __bprm_mm_init(bprm);
+#ifndef CONFIG_MMU
+ bprm->p = PAGE_SIZE * MAX_ARG_PAGES - sizeof(void *);
+#else
+ err = create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p);
if (err)
goto err;
+#endif
return 0;
On a related note, I'd like to point out that my claim that exec is
the only consumer here, is slightly a lie. Technically this is correct,
but only because this is specifically setting up the _stack_.
The rest of the VMA setup actually surrounds this code (another
reason I remain unhappy about moving it). Specifically the mm_alloc()
before __bprm_mm_init (which is reached through alloc_brpm()). And
then, following alloc_bprm() in do_execveat_common(), is the call to
setup_new_exec(), which does the rest of the VMA setup, specifically
arch_pick_mmap_layout() and related fiddling.
The "create userspace VMA" logic, mostly through mm_alloc(), is
used in a few places (e.g. text poking), but the "bring up a _usable_
userspace VMA" logic (i.e. one also with functional mmap) is repeated in
lib/kunit/alloc_user.c for allowing testing of code that touches userspace
(see kunit_attach_mm() and the kunit_vm_mmap() users). (But these tests
don't actually run userspace code, so no stack is set up.)
I guess what I'm trying to say is that I think we need a more clearly
defined "create usable userspace VMA" API, as we've got at least 3
scattered approaches right now: exec ("everything"), non-mmap-non-stack
users (text poking, et al), and mmap-but-not-stack users (kunit tests).
And the One True User of a full userspace VMA, exec, has the full setup
scattered into several phases, mostly due to needing to separate those
phases because it needs to progressively gather the information needed
to correctly configure each piece:
- set up userspace VMA at all (mm_alloc)
- set up a stack because exec args need to go somewhere (__bprm_mm_init)
- move stack to the right place (depends on executable binary and task bits)
- set up mmap (arch_pick_mmap_layout) to actually load executable binary
(depends on arch, binary, and task bits)
Hopefully this all explains why I'm uncomfortable to see __bprm_mm_init
get relocated. It'll _probably_ be fine, but I get antsy about changes
to code that only exec uses...
--
Kees Cook
On Fri, Apr 25, 2025 at 10:09:34AM -0700, Kees Cook wrote:
> On Fri, Apr 25, 2025 at 03:54:34PM +0100, Lorenzo Stoakes wrote:
> > There are peculiarities within the kernel where what is very clearly mm
> > code is performed elsewhere arbitrarily.
> >
> > This violates separation of concerns and makes it harder to refactor code
> > to make changes to how fundamental initialisation and operation of mm logic
> > is performed.
> >
> > One such case is the creation of the VMA containing the initial stack upon
> > execve()'ing a new process. This is currently performed in __bprm_mm_init()
> > in fs/exec.c.
> >
> > Abstract this operation to create_init_stack_vma(). This allows us to limit
> > use of vma allocation and free code to fork and mm only.
> >
> > We previously did the same for the step at which we relocate the initial
> > stack VMA downwards via relocate_vma_down(), now we move the initial VMA
> > establishment too.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > fs/exec.c | 51 +---------------------------------
>
> I'm kind of on the fence about this. On the one hand, yes, it's all vma
> goo, and should live with the rest of vma code, as you suggest. On the
> other had, exec is the only consumer of this behavior, and moving it
> out of fs/exec.c means that changes to the code that specifically only
> impacts exec are now in a separate file, and will no longer get exec
> maintainer/reviewer CCs (based on MAINTAINERS file matching). Exec is
> notoriously fragile, so I'm kind of generally paranoid about changes to
> its behaviors going unnoticed.
>
> In defense of moving it, yes, this routine has gotten updates over the
> many years, but it's relatively stable. But at least one thing has gone in
> without exec maintainer review recently (I would have Acked it, but the
> point is review): 9e567ca45f ("mm/ksm: fix ksm exec support for prctl")
> Everything else was before I took on the role officially (Nov 2022).
>
> So I guess I'm asking, how do we make sure stuff pulled out of exec
> still gets exec maintainer review?
I think we have two options here:
1. Separate out this code into mm/vma_exec.c and treat it like
mm/vma_init.c, then add you as a reviewer, so you have visibility on
everything that happens there.
2. Add you as a reviewer to memory mapping in general.
I think 1 is preferable actually, as it'll reduce noise for you and then
you'll _always_ get notified about change in this code.
Note that we have done this previously for similar reasons with
relocate_vma_down() we could move this function into that file.
For the sake of saving time given time zone differences, let me explore
option 1in a v3, and obviously if that doesn't work for you let me know! :)
We'll have to then have fs/exec.c include ../mm/vma_exec.h, so it is _only_
exec that gets access to this.
>
> > [...]
> > static int __bprm_mm_init(struct linux_binprm *bprm)
> > {
> > - int err;
> > [...]
> > - return err;
> > + return create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p);
> > }
>
> I'd prefer __bprm_mm_init() go away if it's just a 1:1 wrapper now.
> However, it doesn't really look like it makes too much sense for the NOMMU
> logic get moved as well, since it explicitly depends on exec-specific
> values (MAX_ARG_PAGES), so perhaps something like this:
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 8e4ea5f1e64c..313dc70e0012 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -382,9 +382,13 @@ static int bprm_mm_init(struct linux_binprm *bprm)
> bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK];
> task_unlock(current->group_leader);
>
> - err = __bprm_mm_init(bprm);
> +#ifndef CONFIG_MMU
> + bprm->p = PAGE_SIZE * MAX_ARG_PAGES - sizeof(void *);
> +#else
> + err = create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p);
> if (err)
> goto err;
> +#endif
>
> return 0;
Sure I can respin with this.
>
>
>
> On a related note, I'd like to point out that my claim that exec is
> the only consumer here, is slightly a lie. Technically this is correct,
> but only because this is specifically setting up the _stack_.
>
> The rest of the VMA setup actually surrounds this code (another
> reason I remain unhappy about moving it). Specifically the mm_alloc()
> before __bprm_mm_init (which is reached through alloc_brpm()). And
> then, following alloc_bprm() in do_execveat_common(), is the call to
> setup_new_exec(), which does the rest of the VMA setup, specifically
> arch_pick_mmap_layout() and related fiddling.
No other callers try to allocate/free vmas, which is the issue here.
>
> The "create userspace VMA" logic, mostly through mm_alloc(), is
> used in a few places (e.g. text poking), but the "bring up a _usable_
> userspace VMA" logic (i.e. one also with functional mmap) is repeated in
> lib/kunit/alloc_user.c for allowing testing of code that touches userspace
> (see kunit_attach_mm() and the kunit_vm_mmap() users). (But these tests
> don't actually run userspace code, so no stack is set up.)
Hm but this seems something different? This is using the mm_alloc()
interface?
>
> I guess what I'm trying to say is that I think we need a more clearly
> defined "create usable userspace VMA" API, as we've got at least 3
> scattered approaches right now: exec ("everything"), non-mmap-non-stack
> users (text poking, et al), and mmap-but-not-stack users (kunit tests).
But only exec is actually allocating and essentially 'forcing in' a stack
vma like this?
I mean I'm all for having some nicely abstracted interface rather than
scattering open-coded stuff around, but the one and only objective _here_
is to disallow the use of very clearly mm-specific internal APIs elsewhere.
exec is of course 'special' in this stack behaviour, but I feel we can
express this 'specialness' better by having this kind of well-defined,
well-described functions exposed by mm, rather than handing over absolutely
fundamental API calls to any part of the kernel that wants them.
>
> And the One True User of a full userspace VMA, exec, has the full setup
> scattered into several phases, mostly due to needing to separate those
> phases because it needs to progressively gather the information needed
> to correctly configure each piece:
> - set up userspace VMA at all (mm_alloc)
> - set up a stack because exec args need to go somewhere (__bprm_mm_init)
> - move stack to the right place (depends on executable binary and task bits)
> - set up mmap (arch_pick_mmap_layout) to actually load executable binary
> (depends on arch, binary, and task bits)
>
> Hopefully this all explains why I'm uncomfortable to see __bprm_mm_init
> get relocated. It'll _probably_ be fine, but I get antsy about changes
> to code that only exec uses...
Right, I understand and appreciate that, for sure. This is fiddly core
code, you worry about things breaking - I mean I feel you (don't ask me
about brk(), please please don't ask me about brk() :P)
So hopefully the proposed solution with vma_exec.c should cover this off
nicely?
>
> --
> Kees Cook
Cheers, Lorenzo
On Mon, Apr 28, 2025 at 09:53:05AM +0100, Lorenzo Stoakes wrote:
> On Fri, Apr 25, 2025 at 10:09:34AM -0700, Kees Cook wrote:
> > On Fri, Apr 25, 2025 at 03:54:34PM +0100, Lorenzo Stoakes wrote:
> > > There are peculiarities within the kernel where what is very clearly mm
> > > code is performed elsewhere arbitrarily.
> > >
> > > This violates separation of concerns and makes it harder to refactor code
> > > to make changes to how fundamental initialisation and operation of mm logic
> > > is performed.
> > >
> > > One such case is the creation of the VMA containing the initial stack upon
> > > execve()'ing a new process. This is currently performed in __bprm_mm_init()
> > > in fs/exec.c.
> > >
> > > Abstract this operation to create_init_stack_vma(). This allows us to limit
> > > use of vma allocation and free code to fork and mm only.
> > >
> > > We previously did the same for the step at which we relocate the initial
> > > stack VMA downwards via relocate_vma_down(), now we move the initial VMA
> > > establishment too.
> > >
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > Acked-by: David Hildenbrand <david@redhat.com>
> > > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > > fs/exec.c | 51 +---------------------------------
> >
> > I'm kind of on the fence about this. On the one hand, yes, it's all vma
> > goo, and should live with the rest of vma code, as you suggest. On the
> > other had, exec is the only consumer of this behavior, and moving it
> > out of fs/exec.c means that changes to the code that specifically only
> > impacts exec are now in a separate file, and will no longer get exec
> > maintainer/reviewer CCs (based on MAINTAINERS file matching). Exec is
> > notoriously fragile, so I'm kind of generally paranoid about changes to
> > its behaviors going unnoticed.
> >
> > In defense of moving it, yes, this routine has gotten updates over the
> > many years, but it's relatively stable. But at least one thing has gone in
> > without exec maintainer review recently (I would have Acked it, but the
> > point is review): 9e567ca45f ("mm/ksm: fix ksm exec support for prctl")
> > Everything else was before I took on the role officially (Nov 2022).
> >
> > So I guess I'm asking, how do we make sure stuff pulled out of exec
> > still gets exec maintainer review?
>
> I think we have two options here:
>
> 1. Separate out this code into mm/vma_exec.c and treat it like
> mm/vma_init.c, then add you as a reviewer, so you have visibility on
> everything that happens there.
>
Actually, (off-list) Vlastimil made the very good suggestion that we can
just add this new file to both exec and memory mapping sections, have
tested it and it works!
So I think this should cover off your concerns?
[snip]
On April 28, 2025 3:46:06 AM PDT, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>On Mon, Apr 28, 2025 at 09:53:05AM +0100, Lorenzo Stoakes wrote:
>> On Fri, Apr 25, 2025 at 10:09:34AM -0700, Kees Cook wrote:
>> > On Fri, Apr 25, 2025 at 03:54:34PM +0100, Lorenzo Stoakes wrote:
>> > > There are peculiarities within the kernel where what is very clearly mm
>> > > code is performed elsewhere arbitrarily.
>> > >
>> > > This violates separation of concerns and makes it harder to refactor code
>> > > to make changes to how fundamental initialisation and operation of mm logic
>> > > is performed.
>> > >
>> > > One such case is the creation of the VMA containing the initial stack upon
>> > > execve()'ing a new process. This is currently performed in __bprm_mm_init()
>> > > in fs/exec.c.
>> > >
>> > > Abstract this operation to create_init_stack_vma(). This allows us to limit
>> > > use of vma allocation and free code to fork and mm only.
>> > >
>> > > We previously did the same for the step at which we relocate the initial
>> > > stack VMA downwards via relocate_vma_down(), now we move the initial VMA
>> > > establishment too.
>> > >
>> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> > > Acked-by: David Hildenbrand <david@redhat.com>
>> > > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>> > > ---
>> > > fs/exec.c | 51 +---------------------------------
>> >
>> > I'm kind of on the fence about this. On the one hand, yes, it's all vma
>> > goo, and should live with the rest of vma code, as you suggest. On the
>> > other had, exec is the only consumer of this behavior, and moving it
>> > out of fs/exec.c means that changes to the code that specifically only
>> > impacts exec are now in a separate file, and will no longer get exec
>> > maintainer/reviewer CCs (based on MAINTAINERS file matching). Exec is
>> > notoriously fragile, so I'm kind of generally paranoid about changes to
>> > its behaviors going unnoticed.
>> >
>> > In defense of moving it, yes, this routine has gotten updates over the
>> > many years, but it's relatively stable. But at least one thing has gone in
>> > without exec maintainer review recently (I would have Acked it, but the
>> > point is review): 9e567ca45f ("mm/ksm: fix ksm exec support for prctl")
>> > Everything else was before I took on the role officially (Nov 2022).
>> >
>> > So I guess I'm asking, how do we make sure stuff pulled out of exec
>> > still gets exec maintainer review?
>>
>> I think we have two options here:
>>
>> 1. Separate out this code into mm/vma_exec.c and treat it like
>> mm/vma_init.c, then add you as a reviewer, so you have visibility on
>> everything that happens there.
>>
>
>Actually, (off-list) Vlastimil made the very good suggestion that we can
>just add this new file to both exec and memory mapping sections, have
>tested it and it works!
>
>So I think this should cover off your concerns?
>
>[snip]
Yes, this is brilliant! Totally works for me; thank you (and Vlastimil) for finding a good way to do it.
-Kees
--
Kees Cook
© 2016 - 2025 Red Hat, Inc.