[v3] Avoid MAP_FIXED gap exposure

[PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()

Posted by Liam R. Howlett 1 year, 7 months ago

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
call, so use it instead of looping over the vmas twice.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/mmap.c | 36 ++++--------------------------------
 1 file changed, 4 insertions(+), 32 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index b2de26683903..62edaabf3987 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -400,27 +400,6 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
 		anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
 }
 
-static unsigned long count_vma_pages_range(struct mm_struct *mm,
-		unsigned long addr, unsigned long end,
-		unsigned long *nr_accounted)
-{
-	VMA_ITERATOR(vmi, mm, addr);
-	struct vm_area_struct *vma;
-	unsigned long nr_pages = 0;
-
-	*nr_accounted = 0;
-	for_each_vma_range(vmi, vma, end) {
-		unsigned long vm_start = max(addr, vma->vm_start);
-		unsigned long vm_end = min(end, vma->vm_end);
-
-		nr_pages += PHYS_PFN(vm_end - vm_start);
-		if (vma->vm_flags & VM_ACCOUNT)
-			*nr_accounted += PHYS_PFN(vm_end - vm_start);
-	}
-
-	return nr_pages;
-}
-
 static void __vma_link_file(struct vm_area_struct *vma,
 			    struct address_space *mapping)
 {
@@ -2946,17 +2925,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	pgoff_t vm_pgoff;
 	int error = -ENOMEM;
 	VMA_ITERATOR(vmi, mm, addr);
-	unsigned long nr_pages, nr_accounted;
-
-	nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
-
-	/* Check against address space limit. */
-	/*
-	 * MAP_FIXED may remove pages of mappings that intersects with requested
-	 * mapping. Account for the pages it would unmap.
-	 */
-	if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
-		return -ENOMEM;
 
 	if (unlikely(!can_modify_mm(mm, addr, end)))
 		return -EPERM;
@@ -2987,6 +2955,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 			vma_iter_next_range(&vmi);
 	}
 
+	/* Check against address space limit. */
+	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
+		goto abort_munmap;
+
 	/*
 	 * Private writable mapping: check memory availability
 	 */
-- 
2.43.0

Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()

Posted by Lorenzo Stoakes 1 year, 7 months ago

On Thu, Jul 04, 2024 at 02:27:18PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
> call, so use it instead of looping over the vmas twice.

Predictably indeed you removed the thing I commented on in the last patch
;) but at least this time I predicted it! ;)

>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
>  mm/mmap.c | 36 ++++--------------------------------
>  1 file changed, 4 insertions(+), 32 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index b2de26683903..62edaabf3987 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -400,27 +400,6 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
>  		anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
>  }
>
> -static unsigned long count_vma_pages_range(struct mm_struct *mm,
> -		unsigned long addr, unsigned long end,
> -		unsigned long *nr_accounted)
> -{
> -	VMA_ITERATOR(vmi, mm, addr);
> -	struct vm_area_struct *vma;
> -	unsigned long nr_pages = 0;
> -
> -	*nr_accounted = 0;
> -	for_each_vma_range(vmi, vma, end) {
> -		unsigned long vm_start = max(addr, vma->vm_start);
> -		unsigned long vm_end = min(end, vma->vm_end);
> -
> -		nr_pages += PHYS_PFN(vm_end - vm_start);
> -		if (vma->vm_flags & VM_ACCOUNT)
> -			*nr_accounted += PHYS_PFN(vm_end - vm_start);
> -	}
> -
> -	return nr_pages;
> -}
> -
>  static void __vma_link_file(struct vm_area_struct *vma,
>  			    struct address_space *mapping)
>  {
> @@ -2946,17 +2925,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	pgoff_t vm_pgoff;
>  	int error = -ENOMEM;
>  	VMA_ITERATOR(vmi, mm, addr);
> -	unsigned long nr_pages, nr_accounted;
> -
> -	nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> -
> -	/* Check against address space limit. */
> -	/*
> -	 * MAP_FIXED may remove pages of mappings that intersects with requested
> -	 * mapping. Account for the pages it would unmap.
> -	 */
> -	if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> -		return -ENOMEM;
>
>  	if (unlikely(!can_modify_mm(mm, addr, end)))
>  		return -EPERM;
> @@ -2987,6 +2955,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  			vma_iter_next_range(&vmi);
>  	}
>
> +	/* Check against address space limit. */
> +	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
> +		goto abort_munmap;
> +

I know you can literally only do this after the vms_gather_munmap_vmas(),
but this does change where we check this, so for instance we do
arch_unmap() without having checked may_expand_vm().

However I assume this is fine?

>  	/*
>  	 * Private writable mapping: check memory availability
>  	 */
> --
> 2.43.0
>

Looks good to me generally,

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()

Posted by Liam R. Howlett 1 year, 7 months ago

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240708 08:53]:
> On Thu, Jul 04, 2024 at 02:27:18PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
> > call, so use it instead of looping over the vmas twice.
> 
> Predictably indeed you removed the thing I commented on in the last patch
> ;) but at least this time I predicted it! ;)
> 
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > ---
> >  mm/mmap.c | 36 ++++--------------------------------
> >  1 file changed, 4 insertions(+), 32 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index b2de26683903..62edaabf3987 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -400,27 +400,6 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
> >  		anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
> >  }
> >
> > -static unsigned long count_vma_pages_range(struct mm_struct *mm,
> > -		unsigned long addr, unsigned long end,
> > -		unsigned long *nr_accounted)
> > -{
> > -	VMA_ITERATOR(vmi, mm, addr);
> > -	struct vm_area_struct *vma;
> > -	unsigned long nr_pages = 0;
> > -
> > -	*nr_accounted = 0;
> > -	for_each_vma_range(vmi, vma, end) {
> > -		unsigned long vm_start = max(addr, vma->vm_start);
> > -		unsigned long vm_end = min(end, vma->vm_end);
> > -
> > -		nr_pages += PHYS_PFN(vm_end - vm_start);
> > -		if (vma->vm_flags & VM_ACCOUNT)
> > -			*nr_accounted += PHYS_PFN(vm_end - vm_start);
> > -	}
> > -
> > -	return nr_pages;
> > -}
> > -
> >  static void __vma_link_file(struct vm_area_struct *vma,
> >  			    struct address_space *mapping)
> >  {
> > @@ -2946,17 +2925,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >  	pgoff_t vm_pgoff;
> >  	int error = -ENOMEM;
> >  	VMA_ITERATOR(vmi, mm, addr);
> > -	unsigned long nr_pages, nr_accounted;
> > -
> > -	nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> > -
> > -	/* Check against address space limit. */
> > -	/*
> > -	 * MAP_FIXED may remove pages of mappings that intersects with requested
> > -	 * mapping. Account for the pages it would unmap.
> > -	 */
> > -	if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> > -		return -ENOMEM;
> >
> >  	if (unlikely(!can_modify_mm(mm, addr, end)))
> >  		return -EPERM;
> > @@ -2987,6 +2955,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >  			vma_iter_next_range(&vmi);
> >  	}
> >
> > +	/* Check against address space limit. */
> > +	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
> > +		goto abort_munmap;
> > +
> 
> I know you can literally only do this after the vms_gather_munmap_vmas(),
> but this does change where we check this, so for instance we do
> arch_unmap() without having checked may_expand_vm().
> 
> However I assume this is fine?

Thanks for pointing this out.

The functionality here has changed
--- from ---
may_expand_vm() check
can_modify_mm() check
arch_unmap()
vms_gather_munmap_vmas()
...

--- to ---
can_modify_mm() check
arch_unmap()
vms_gather_munmap_vmas()
may_expand_vm() check
...

vms_gather_munmap_vmas() does nothing but figures out what to do later,
but could use memory and can fail.

The user implications are:

1. The return type on the error may change to -EPERM from -ENOMEM, if
you are not allowed to expand and are trying to overwrite mseal()'ed
VMAs. That seems so very rare that I'm not sure it's worth mentioning.

2. arch_unmap() called prior to may_expand_vm().
powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
within the unmap range.  User implication of this means that an
application my set the vdso to NULL prior to hitting the -ENOMEM case in
may_expand_vm() due to the address space limit.

Assuming the removal of the vdso does not cause the application to seg
fault, then the user visible change is that any vdso call after a failed
mmap(MAP_FIXED) call would result in a seg fault.  The only reason it
would fail is if the mapping process was attempting to map a large
enough area over the vdso (which is accounted and in the vma tree,
afaict) and ran out of memory. Note that this situation could arise
already since we could run out of memory (not accounting) after the
arch_unmap() call within the kernel.

The code today can suffer the same fate, but not by the accounting
failure.  It can happen due to failure to allocate a new vma,
do_vmi_munmap() failure after the arch_unmap() call, or any of the other
failure scenarios later in the mmap_region() function.

At the very least, this requires an expanded change log.

> 
> >  	/*
> >  	 * Private writable mapping: check memory availability
> >  	 */
> > --
> > 2.43.0
> >
> 
> Looks good to me generally,
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()

Posted by Lorenzo Stoakes 1 year, 7 months ago

On Mon, Jul 08, 2024 at 04:43:15PM GMT, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240708 08:53]:
> > On Thu, Jul 04, 2024 at 02:27:18PM GMT, Liam R. Howlett wrote:
> > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > >
> > > The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
> > > call, so use it instead of looping over the vmas twice.
> >
> > Predictably indeed you removed the thing I commented on in the last patch
> > ;) but at least this time I predicted it! ;)
> >
> > >
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > ---
> > >  mm/mmap.c | 36 ++++--------------------------------
> > >  1 file changed, 4 insertions(+), 32 deletions(-)
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index b2de26683903..62edaabf3987 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -400,27 +400,6 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
> > >  		anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
> > >  }
> > >
> > > -static unsigned long count_vma_pages_range(struct mm_struct *mm,
> > > -		unsigned long addr, unsigned long end,
> > > -		unsigned long *nr_accounted)
> > > -{
> > > -	VMA_ITERATOR(vmi, mm, addr);
> > > -	struct vm_area_struct *vma;
> > > -	unsigned long nr_pages = 0;
> > > -
> > > -	*nr_accounted = 0;
> > > -	for_each_vma_range(vmi, vma, end) {
> > > -		unsigned long vm_start = max(addr, vma->vm_start);
> > > -		unsigned long vm_end = min(end, vma->vm_end);
> > > -
> > > -		nr_pages += PHYS_PFN(vm_end - vm_start);
> > > -		if (vma->vm_flags & VM_ACCOUNT)
> > > -			*nr_accounted += PHYS_PFN(vm_end - vm_start);
> > > -	}
> > > -
> > > -	return nr_pages;
> > > -}
> > > -
> > >  static void __vma_link_file(struct vm_area_struct *vma,
> > >  			    struct address_space *mapping)
> > >  {
> > > @@ -2946,17 +2925,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  	pgoff_t vm_pgoff;
> > >  	int error = -ENOMEM;
> > >  	VMA_ITERATOR(vmi, mm, addr);
> > > -	unsigned long nr_pages, nr_accounted;
> > > -
> > > -	nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> > > -
> > > -	/* Check against address space limit. */
> > > -	/*
> > > -	 * MAP_FIXED may remove pages of mappings that intersects with requested
> > > -	 * mapping. Account for the pages it would unmap.
> > > -	 */
> > > -	if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> > > -		return -ENOMEM;
> > >
> > >  	if (unlikely(!can_modify_mm(mm, addr, end)))
> > >  		return -EPERM;
> > > @@ -2987,6 +2955,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  			vma_iter_next_range(&vmi);
> > >  	}
> > >
> > > +	/* Check against address space limit. */
> > > +	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
> > > +		goto abort_munmap;
> > > +
> >
> > I know you can literally only do this after the vms_gather_munmap_vmas(),
> > but this does change where we check this, so for instance we do
> > arch_unmap() without having checked may_expand_vm().
> >
> > However I assume this is fine?
>
> Thanks for pointing this out.
>
> The functionality here has changed
> --- from ---
> may_expand_vm() check
> can_modify_mm() check
> arch_unmap()
> vms_gather_munmap_vmas()
> ...
>
> --- to ---
> can_modify_mm() check
> arch_unmap()
> vms_gather_munmap_vmas()
> may_expand_vm() check
> ...
>
> vms_gather_munmap_vmas() does nothing but figures out what to do later,
> but could use memory and can fail.
>
> The user implications are:
>
> 1. The return type on the error may change to -EPERM from -ENOMEM, if
> you are not allowed to expand and are trying to overwrite mseal()'ed
> VMAs. That seems so very rare that I'm not sure it's worth mentioning.
>
>
> 2. arch_unmap() called prior to may_expand_vm().
> powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
> within the unmap range.  User implication of this means that an
> application my set the vdso to NULL prior to hitting the -ENOMEM case in
> may_expand_vm() due to the address space limit.
>
> Assuming the removal of the vdso does not cause the application to seg
> fault, then the user visible change is that any vdso call after a failed
> mmap(MAP_FIXED) call would result in a seg fault.  The only reason it
> would fail is if the mapping process was attempting to map a large
> enough area over the vdso (which is accounted and in the vma tree,
> afaict) and ran out of memory. Note that this situation could arise
> already since we could run out of memory (not accounting) after the
> arch_unmap() call within the kernel.
>
> The code today can suffer the same fate, but not by the accounting
> failure.  It can happen due to failure to allocate a new vma,
> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
> failure scenarios later in the mmap_region() function.
>
> At the very least, this requires an expanded change log.

Indeed, also (as mentioned on IRC) I feel like we need to look at whether
we _truly_ need this arch_unmap() call for a single, rather antiquated,
architecture.

I mean why are they unmapping the VDSO, why is that valid, why does it need
that field to be set to NULL, is it possible to signify that in some other
way etc.?

Regardless, I think the change you make here is fine and shouldn't be a
blocker for your changes at all.

But agreed, I do think expanding the change log a bit would be helpful.

>
> >
> > >  	/*
> > >  	 * Private writable mapping: check memory availability
> > >  	 */
> > > --
> > > 2.43.0
> > >
> >
> > Looks good to me generally,
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()

Posted by Michael Ellerman 1 year, 7 months ago

Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes:
> On Mon, Jul 08, 2024 at 04:43:15PM GMT, Liam R. Howlett wrote:
>>
...
>> The functionality here has changed
>> --- from ---
>> may_expand_vm() check
>> can_modify_mm() check
>> arch_unmap()
>> vms_gather_munmap_vmas()
>> ...
>>
>> --- to ---
>> can_modify_mm() check
>> arch_unmap()
>> vms_gather_munmap_vmas()
>> may_expand_vm() check
>> ...
>>
>> vms_gather_munmap_vmas() does nothing but figures out what to do later,
>> but could use memory and can fail.
>>
>> The user implications are:
>>
>> 1. The return type on the error may change to -EPERM from -ENOMEM, if
>> you are not allowed to expand and are trying to overwrite mseal()'ed
>> VMAs. That seems so very rare that I'm not sure it's worth mentioning.
>>
>>
>> 2. arch_unmap() called prior to may_expand_vm().
>> powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
>> within the unmap range.  User implication of this means that an
>> application my set the vdso to NULL prior to hitting the -ENOMEM case in
>> may_expand_vm() due to the address space limit.
>>
>> Assuming the removal of the vdso does not cause the application to seg
>> fault, then the user visible change is that any vdso call after a failed
>> mmap(MAP_FIXED) call would result in a seg fault.  The only reason it
>> would fail is if the mapping process was attempting to map a large
>> enough area over the vdso (which is accounted and in the vma tree,
>> afaict) and ran out of memory. Note that this situation could arise
>> already since we could run out of memory (not accounting) after the
>> arch_unmap() call within the kernel.
>>
>> The code today can suffer the same fate, but not by the accounting
>> failure.  It can happen due to failure to allocate a new vma,
>> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
>> failure scenarios later in the mmap_region() function.
>>
>> At the very least, this requires an expanded change log.
>
> Indeed, also (as mentioned on IRC) I feel like we need to look at whether
> we _truly_ need this arch_unmap() call for a single, rather antiquated,
> architecture.

You can call it "niche" or "irrelevant" or "fringe", but "antiquated" is
factually wrong :) Power10 came out of the fab just a few years ago at
7nm.

> I mean why are they unmapping the VDSO, why is that valid, why does it need
> that field to be set to NULL, is it possible to signify that in some other
> way etc.?

It was originally for CRIU. So a niche workload on a niche architecture.

But from the commit that added it, it sounds like CRIU was using mremap,
which should be handled these days by vdso_mremap(). So it could be that
arch_unmap() is not actually needed for CRIU anymore.

Then I guess we have to decide if removing our arch_unmap() would be an
ABI break, regardless of whether CRIU needs it or not.

cheers

Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()

Posted by Lorenzo Stoakes 1 year, 7 months ago

On Wed, Jul 10, 2024 at 10:28:01PM GMT, Michael Ellerman wrote:
> Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes:
> > On Mon, Jul 08, 2024 at 04:43:15PM GMT, Liam R. Howlett wrote:
> >>
> ...
> >> The functionality here has changed
> >> --- from ---
> >> may_expand_vm() check
> >> can_modify_mm() check
> >> arch_unmap()
> >> vms_gather_munmap_vmas()
> >> ...
> >>
> >> --- to ---
> >> can_modify_mm() check
> >> arch_unmap()
> >> vms_gather_munmap_vmas()
> >> may_expand_vm() check
> >> ...
> >>
> >> vms_gather_munmap_vmas() does nothing but figures out what to do later,
> >> but could use memory and can fail.
> >>
> >> The user implications are:
> >>
> >> 1. The return type on the error may change to -EPERM from -ENOMEM, if
> >> you are not allowed to expand and are trying to overwrite mseal()'ed
> >> VMAs. That seems so very rare that I'm not sure it's worth mentioning.
> >>
> >>
> >> 2. arch_unmap() called prior to may_expand_vm().
> >> powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
> >> within the unmap range.  User implication of this means that an
> >> application my set the vdso to NULL prior to hitting the -ENOMEM case in
> >> may_expand_vm() due to the address space limit.
> >>
> >> Assuming the removal of the vdso does not cause the application to seg
> >> fault, then the user visible change is that any vdso call after a failed
> >> mmap(MAP_FIXED) call would result in a seg fault.  The only reason it
> >> would fail is if the mapping process was attempting to map a large
> >> enough area over the vdso (which is accounted and in the vma tree,
> >> afaict) and ran out of memory. Note that this situation could arise
> >> already since we could run out of memory (not accounting) after the
> >> arch_unmap() call within the kernel.
> >>
> >> The code today can suffer the same fate, but not by the accounting
> >> failure.  It can happen due to failure to allocate a new vma,
> >> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
> >> failure scenarios later in the mmap_region() function.
> >>
> >> At the very least, this requires an expanded change log.
> >
> > Indeed, also (as mentioned on IRC) I feel like we need to look at whether
> > we _truly_ need this arch_unmap() call for a single, rather antiquated,
> > architecture.
>
> You can call it "niche" or "irrelevant" or "fringe", but "antiquated" is
> factually wrong :) Power10 came out of the fab just a few years ago at
> 7nm.

Fair point ;) perhaps we could go with "rarified"? :>)

>
> > I mean why are they unmapping the VDSO, why is that valid, why does it need
> > that field to be set to NULL, is it possible to signify that in some other
> > way etc.?
>
> It was originally for CRIU. So a niche workload on a niche architecture.
>
> But from the commit that added it, it sounds like CRIU was using mremap,
> which should be handled these days by vdso_mremap(). So it could be that
> arch_unmap() is not actually needed for CRIU anymore.

Oh that's interesting!

>
> Then I guess we have to decide if removing our arch_unmap() would be an
> ABI break, regardless of whether CRIU needs it or not.

Seems to me like an internal implementation detail that should hopefully
not result in anything that should have visible ABI impact?

I guess this is something we ought to assess. It would be useful to
eliminate hooks where we can so we can better control VMA behaviour without
having to worry about an arch being able to do arbitrary things at
unexpected times, especially pertinent where we change the order of things.

>
> cheers

Thanks for taking a look!

Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()

Posted by LEROY Christophe 1 year, 7 months ago


Le 10/07/2024 à 14:45, Lorenzo Stoakes a écrit :
> On Wed, Jul 10, 2024 at 10:28:01PM GMT, Michael Ellerman wrote:
>> Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes:
>>> On Mon, Jul 08, 2024 at 04:43:15PM GMT, Liam R. Howlett wrote:
>>>>
>> ...
>>>> The functionality here has changed
>>>> --- from ---
>>>> may_expand_vm() check
>>>> can_modify_mm() check
>>>> arch_unmap()
>>>> vms_gather_munmap_vmas()
>>>> ...
>>>>
>>>> --- to ---
>>>> can_modify_mm() check
>>>> arch_unmap()
>>>> vms_gather_munmap_vmas()
>>>> may_expand_vm() check
>>>> ...
>>>>
>>>> vms_gather_munmap_vmas() does nothing but figures out what to do later,
>>>> but could use memory and can fail.
>>>>
>>>> The user implications are:
>>>>
>>>> 1. The return type on the error may change to -EPERM from -ENOMEM, if
>>>> you are not allowed to expand and are trying to overwrite mseal()'ed
>>>> VMAs. That seems so very rare that I'm not sure it's worth mentioning.
>>>>
>>>>
>>>> 2. arch_unmap() called prior to may_expand_vm().
>>>> powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
>>>> within the unmap range.  User implication of this means that an
>>>> application my set the vdso to NULL prior to hitting the -ENOMEM case in
>>>> may_expand_vm() due to the address space limit.
>>>>
>>>> Assuming the removal of the vdso does not cause the application to seg
>>>> fault, then the user visible change is that any vdso call after a failed
>>>> mmap(MAP_FIXED) call would result in a seg fault.  The only reason it
>>>> would fail is if the mapping process was attempting to map a large
>>>> enough area over the vdso (which is accounted and in the vma tree,
>>>> afaict) and ran out of memory. Note that this situation could arise
>>>> already since we could run out of memory (not accounting) after the
>>>> arch_unmap() call within the kernel.
>>>>
>>>> The code today can suffer the same fate, but not by the accounting
>>>> failure.  It can happen due to failure to allocate a new vma,
>>>> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
>>>> failure scenarios later in the mmap_region() function.
>>>>
>>>> At the very least, this requires an expanded change log.
>>>
>>> Indeed, also (as mentioned on IRC) I feel like we need to look at whether
>>> we _truly_ need this arch_unmap() call for a single, rather antiquated,
>>> architecture.
>>
>> You can call it "niche" or "irrelevant" or "fringe", but "antiquated" is
>> factually wrong :) Power10 came out of the fab just a few years ago at
>> 7nm.
> 
> Fair point ;) perhaps we could go with "rarified"? :>)
> 
>>
>>> I mean why are they unmapping the VDSO, why is that valid, why does it need
>>> that field to be set to NULL, is it possible to signify that in some other
>>> way etc.?
>>
>> It was originally for CRIU. So a niche workload on a niche architecture.
>>
>> But from the commit that added it, it sounds like CRIU was using mremap,
>> which should be handled these days by vdso_mremap(). So it could be that
>> arch_unmap() is not actually needed for CRIU anymore.
> 
> Oh that's interesting!
> 
>>
>> Then I guess we have to decide if removing our arch_unmap() would be an
>> ABI break, regardless of whether CRIU needs it or not.
> 
> Seems to me like an internal implementation detail that should hopefully
> not result in anything that should have visible ABI impact?
> 
> I guess this is something we ought to assess. It would be useful to
> eliminate hooks where we can so we can better control VMA behaviour without
> having to worry about an arch being able to do arbitrary things at
> unexpected times, especially pertinent where we change the order of things.
> 

I see you are talking about arch_unmap(). I didn't follow the entire 
discussion but we have some related stuff here: 
https://github.com/linuxppc/issues/issues/241

If I remember correctly arch_unmap() should have gone away we Dmitry's 
series 
https://lore.kernel.org/lkml/20210611180242.711399-1-dima@arista.com/#r 
but it hasn't been applied yet.

Christophe

Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()

Posted by Liam R. Howlett 1 year, 7 months ago

* LEROY Christophe <christophe.leroy2@cs-soprasteria.com> [240710 08:59]:
> 
...
> >>>>
> >>>> Assuming the removal of the vdso does not cause the application to seg
> >>>> fault, then the user visible change is that any vdso call after a failed
> >>>> mmap(MAP_FIXED) call would result in a seg fault.  The only reason it
> >>>> would fail is if the mapping process was attempting to map a large
> >>>> enough area over the vdso (which is accounted and in the vma tree,
> >>>> afaict) and ran out of memory. Note that this situation could arise
> >>>> already since we could run out of memory (not accounting) after the
> >>>> arch_unmap() call within the kernel.
> >>>>
> >>>> The code today can suffer the same fate, but not by the accounting
> >>>> failure.  It can happen due to failure to allocate a new vma,
> >>>> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
> >>>> failure scenarios later in the mmap_region() function.
> >>>>
> >>>> At the very least, this requires an expanded change log.
> >>>
...

> >>> I mean why are they unmapping the VDSO, why is that valid, why does it need
> >>> that field to be set to NULL, is it possible to signify that in some other
> >>> way etc.?
> >>
> >> It was originally for CRIU. So a niche workload on a niche architecture.
> >>
> >> But from the commit that added it, it sounds like CRIU was using mremap,
> >> which should be handled these days by vdso_mremap(). So it could be that
> >> arch_unmap() is not actually needed for CRIU anymore.
> > 
> > Oh that's interesting!
> > 
> >>
> >> Then I guess we have to decide if removing our arch_unmap() would be an
> >> ABI break, regardless of whether CRIU needs it or not.
> > 
> > Seems to me like an internal implementation detail that should hopefully
> > not result in anything that should have visible ABI impact?
> > 
> > I guess this is something we ought to assess. It would be useful to
> > eliminate hooks where we can so we can better control VMA behaviour without
> > having to worry about an arch being able to do arbitrary things at
> > unexpected times, especially pertinent where we change the order of things.
> > 
> 
> I see you are talking about arch_unmap(). I didn't follow the entire 
> discussion but we have some related stuff here: 
> https://github.com/linuxppc/issues/issues/241
> 
> If I remember correctly arch_unmap() should have gone away we Dmitry's 
> series 
> https://lore.kernel.org/lkml/20210611180242.711399-1-dima@arista.com/#r 
> but it hasn't been applied yet.
> 

That is good news!

To review, ppc is the only arch using this now and it sounds like you
want to remove it too.

Considering the age of that thread and the possibility of conflict with
my series, can I drop the entire arch_unmap() function instead of
modifying the handling in core mm?  I'm going to assume that's okay and
start working on this for v4 (because there hasn't been a public reply
for v4 since 2023/10/11).

This would mean less arch-specific hooks and that's always a good idea.

Thanks,
Liam

Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()

Posted by LEROY Christophe 1 year, 7 months ago


Le 10/07/2024 à 18:09, Liam R. Howlett a écrit :
> * LEROY Christophe <christophe.leroy2@cs-soprasteria.com> [240710 08:59]:
>>
> ...
>>>>>>
>>>>>> Assuming the removal of the vdso does not cause the application to seg
>>>>>> fault, then the user visible change is that any vdso call after a failed
>>>>>> mmap(MAP_FIXED) call would result in a seg fault.  The only reason it
>>>>>> would fail is if the mapping process was attempting to map a large
>>>>>> enough area over the vdso (which is accounted and in the vma tree,
>>>>>> afaict) and ran out of memory. Note that this situation could arise
>>>>>> already since we could run out of memory (not accounting) after the
>>>>>> arch_unmap() call within the kernel.
>>>>>>
>>>>>> The code today can suffer the same fate, but not by the accounting
>>>>>> failure.  It can happen due to failure to allocate a new vma,
>>>>>> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
>>>>>> failure scenarios later in the mmap_region() function.
>>>>>>
>>>>>> At the very least, this requires an expanded change log.
>>>>>
> ...
>
>>>>> I mean why are they unmapping the VDSO, why is that valid, why does it need
>>>>> that field to be set to NULL, is it possible to signify that in some other
>>>>> way etc.?
>>>>
>>>> It was originally for CRIU. So a niche workload on a niche architecture.
>>>>
>>>> But from the commit that added it, it sounds like CRIU was using mremap,
>>>> which should be handled these days by vdso_mremap(). So it could be that
>>>> arch_unmap() is not actually needed for CRIU anymore.
>>>
>>> Oh that's interesting!
>>>
>>>>
>>>> Then I guess we have to decide if removing our arch_unmap() would be an
>>>> ABI break, regardless of whether CRIU needs it or not.
>>>
>>> Seems to me like an internal implementation detail that should hopefully
>>> not result in anything that should have visible ABI impact?
>>>
>>> I guess this is something we ought to assess. It would be useful to
>>> eliminate hooks where we can so we can better control VMA behaviour without
>>> having to worry about an arch being able to do arbitrary things at
>>> unexpected times, especially pertinent where we change the order of things.
>>>
>>
>> I see you are talking about arch_unmap(). I didn't follow the entire
>> discussion but we have some related stuff here:
>> https://github.com/linuxppc/issues/issues/241
>>
>> If I remember correctly arch_unmap() should have gone away we Dmitry's
>> series
>> https://lore.kernel.org/lkml/20210611180242.711399-1-dima@arista.com/#r
>> but it hasn't been applied yet.
>>
>
> That is good news!
>
> To review, ppc is the only arch using this now and it sounds like you
> want to remove it too.

Yes want to remove it but needs to be replaced by a more generic
core-based equivalent.

>
> Considering the age of that thread and the possibility of conflict with
> my series, can I drop the entire arch_unmap() function instead of
> modifying the handling in core mm?  I'm going to assume that's okay and
> start working on this for v4 (because there hasn't been a public reply
> for v4 since 2023/10/11).

drop it yes but not without implementing a replacement in core mm like
proposed by Dmitry.

>
> This would mean less arch-specific hooks and that's always a good idea.
>

Indeed.

Christophe

Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()

Posted by Dmitry Safonov 1 year, 7 months ago

Hi Liam,

On Wed, Jul 10, 2024 at 5:09 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * LEROY Christophe <christophe.leroy2@cs-soprasteria.com> [240710 08:59]:
> >
> ...
> > >>>>
> > >>>> Assuming the removal of the vdso does not cause the application to seg
> > >>>> fault, then the user visible change is that any vdso call after a failed
> > >>>> mmap(MAP_FIXED) call would result in a seg fault.  The only reason it
> > >>>> would fail is if the mapping process was attempting to map a large
> > >>>> enough area over the vdso (which is accounted and in the vma tree,
> > >>>> afaict) and ran out of memory. Note that this situation could arise
> > >>>> already since we could run out of memory (not accounting) after the
> > >>>> arch_unmap() call within the kernel.
> > >>>>
> > >>>> The code today can suffer the same fate, but not by the accounting
> > >>>> failure.  It can happen due to failure to allocate a new vma,
> > >>>> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
> > >>>> failure scenarios later in the mmap_region() function.
> > >>>>
> > >>>> At the very least, this requires an expanded change log.
> > >>>
> ...
>
> > >>> I mean why are they unmapping the VDSO, why is that valid, why does it need
> > >>> that field to be set to NULL, is it possible to signify that in some other
> > >>> way etc.?
> > >>
> > >> It was originally for CRIU. So a niche workload on a niche architecture.
> > >>
> > >> But from the commit that added it, it sounds like CRIU was using mremap,
> > >> which should be handled these days by vdso_mremap(). So it could be that
> > >> arch_unmap() is not actually needed for CRIU anymore.
> > >
> > > Oh that's interesting!
> > >
> > >>
> > >> Then I guess we have to decide if removing our arch_unmap() would be an
> > >> ABI break, regardless of whether CRIU needs it or not.
> > >
> > > Seems to me like an internal implementation detail that should hopefully
> > > not result in anything that should have visible ABI impact?
> > >
> > > I guess this is something we ought to assess. It would be useful to
> > > eliminate hooks where we can so we can better control VMA behaviour without
> > > having to worry about an arch being able to do arbitrary things at
> > > unexpected times, especially pertinent where we change the order of things.
> > >
> >
> > I see you are talking about arch_unmap(). I didn't follow the entire
> > discussion but we have some related stuff here:
> > https://github.com/linuxppc/issues/issues/241
> >
> > If I remember correctly arch_unmap() should have gone away we Dmitry's
> > series
> > https://lore.kernel.org/lkml/20210611180242.711399-1-dima@arista.com/#r
> > but it hasn't been applied yet.
> >
>
> That is good news!
>
> To review, ppc is the only arch using this now and it sounds like you
> want to remove it too.
>
> Considering the age of that thread and the possibility of conflict with
> my series, can I drop the entire arch_unmap() function instead of
> modifying the handling in core mm?  I'm going to assume that's okay and
> start working on this for v4 (because there hasn't been a public reply
> for v4 since 2023/10/11).

Yeah, this kind of felt through the cracks. I meant to find time to
push v4, but from my job side I got motivated to do core networking
changes that were required by customers, from the other side I got
demotivated a bit by slight pushback for v3 with "justify why is it
needed at all?". For changes that are mostly cleanups and refactoring.

So, usually I don't give up on patches sets that yet make sense to me,
but priorities+motivation changed and the set moved lower on my todo
list.

If you have time and urge to finish this patch set, you are more than
welcome to adopt it :-)
Otherwise, I'll try to find time for them, but not in near short-term
as at this moment I cook tcp & selftests changes that I'd love to see
upstream.

> This would mean less arch-specific hooks and that's always a good idea.

Thanks,
           Dmitry

Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()

Posted by Liam R. Howlett 1 year, 7 months ago

Cc'ing Dave Hansen on this.

* Liam R. Howlett <Liam.Howlett@oracle.com> [240708 16:43]:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240708 08:53]:
> > On Thu, Jul 04, 2024 at 02:27:18PM GMT, Liam R. Howlett wrote:
> > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > >
> > > The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
> > > call, so use it instead of looping over the vmas twice.
> > 
> > Predictably indeed you removed the thing I commented on in the last patch
> > ;) but at least this time I predicted it! ;)
> > 
> > >
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > ---
> > >  mm/mmap.c | 36 ++++--------------------------------
> > >  1 file changed, 4 insertions(+), 32 deletions(-)
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index b2de26683903..62edaabf3987 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c

...

> > >  static void __vma_link_file(struct vm_area_struct *vma,
> > >  			    struct address_space *mapping)
> > >  {
> > > @@ -2946,17 +2925,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  	pgoff_t vm_pgoff;
> > >  	int error = -ENOMEM;
> > >  	VMA_ITERATOR(vmi, mm, addr);
> > > -	unsigned long nr_pages, nr_accounted;
> > > -
> > > -	nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> > > -
> > > -	/* Check against address space limit. */
> > > -	/*
> > > -	 * MAP_FIXED may remove pages of mappings that intersects with requested
> > > -	 * mapping. Account for the pages it would unmap.
> > > -	 */
> > > -	if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> > > -		return -ENOMEM;
> > >
> > >  	if (unlikely(!can_modify_mm(mm, addr, end)))
> > >  		return -EPERM;
> > > @@ -2987,6 +2955,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  			vma_iter_next_range(&vmi);
> > >  	}
> > >
> > > +	/* Check against address space limit. */
> > > +	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
> > > +		goto abort_munmap;
> > > +
> > 
> > I know you can literally only do this after the vms_gather_munmap_vmas(),
> > but this does change where we check this, so for instance we do
> > arch_unmap() without having checked may_expand_vm().
> > 
> > However I assume this is fine?
> 
> Thanks for pointing this out.
> 
> The functionality here has changed
> --- from ---
> may_expand_vm() check
> can_modify_mm() check
> arch_unmap()
> vms_gather_munmap_vmas()
> ...
> 
> --- to ---
> can_modify_mm() check
> arch_unmap()
> vms_gather_munmap_vmas()
> may_expand_vm() check
> ...
> 
> vms_gather_munmap_vmas() does nothing but figures out what to do later,
> but could use memory and can fail.
> 
> The user implications are:
> 
> 1. The return type on the error may change to -EPERM from -ENOMEM, if
> you are not allowed to expand and are trying to overwrite mseal()'ed
> VMAs. That seems so very rare that I'm not sure it's worth mentioning.
> 
> 
> 2. arch_unmap() called prior to may_expand_vm().
> powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
> within the unmap range.  User implication of this means that an
> application my set the vdso to NULL prior to hitting the -ENOMEM case in
> may_expand_vm() due to the address space limit.
> 
> Assuming the removal of the vdso does not cause the application to seg
> fault, then the user visible change is that any vdso call after a failed
> mmap(MAP_FIXED) call would result in a seg fault.  The only reason it
> would fail is if the mapping process was attempting to map a large
> enough area over the vdso (which is accounted and in the vma tree,
> afaict) and ran out of memory. Note that this situation could arise
> already since we could run out of memory (not accounting) after the
> arch_unmap() call within the kernel.
> 
> The code today can suffer the same fate, but not by the accounting
> failure.  It can happen due to failure to allocate a new vma,
> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
> failure scenarios later in the mmap_region() function.
>
> At the very least, this requires an expanded change log.

After doing a deep dive into the vdso issue, I think it would be best to
remove the arch_unmap() call completely in a later patch set by changing
the two areas highlighted by Dave in patch 5a28fc94c914 "x86/mpx,
mm/core: Fix recursive munmap() corruption" back in 2019 in regards to
the powerpc pointer use.  But that's for later work.

In the above mentioned patch, the arch_unmap() was moved to an earlier
time to avoid removing the same vma twice from the rbtree.  Since the
mpx code no longer removes the vma and powerpc never removed the vma, it
seems safe to reorder the calls as such:

can_modify_mm() check
vms_gather_munmap_vmas()
may_expand_vm() check
arch_unmap()

This seems very much fine because:
- powerpc is the only platform doing _anything_ in arch_unmap().
- powerpc used to work with the arch_unmap() call  after the vma was
  completely dropped.
- The vma isn't even dropped by this point and so all proposed changes
  will be completely undone in the rare case of may_expand_vm() failure.
- The arch_unmap() call doesn't need to be that early anymore anyways
  (mpx was dropped by Dave in 2020 git id ccaaaf6fe5a5).

I will make the order change in v4 of the patch series in its own patch.

Thanks,
Liam

Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()

Posted by Liam R. Howlett 1 year, 7 months ago

* Liam R. Howlett <Liam.Howlett@oracle.com> [240709 10:42]:
> Cc'ing Dave Hansen on this.

Really adding Dave to the discussion.

> 
> * Liam R. Howlett <Liam.Howlett@oracle.com> [240708 16:43]:
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240708 08:53]:
> > > On Thu, Jul 04, 2024 at 02:27:18PM GMT, Liam R. Howlett wrote:
> > > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > > >
> > > > The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
> > > > call, so use it instead of looping over the vmas twice.
> > > 
> > > Predictably indeed you removed the thing I commented on in the last patch
> > > ;) but at least this time I predicted it! ;)
> > > 
> > > >
> > > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > > ---
> > > >  mm/mmap.c | 36 ++++--------------------------------
> > > >  1 file changed, 4 insertions(+), 32 deletions(-)
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index b2de26683903..62edaabf3987 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> 
> ...
> 
> > > >  static void __vma_link_file(struct vm_area_struct *vma,
> > > >  			    struct address_space *mapping)
> > > >  {
> > > > @@ -2946,17 +2925,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > >  	pgoff_t vm_pgoff;
> > > >  	int error = -ENOMEM;
> > > >  	VMA_ITERATOR(vmi, mm, addr);
> > > > -	unsigned long nr_pages, nr_accounted;
> > > > -
> > > > -	nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> > > > -
> > > > -	/* Check against address space limit. */
> > > > -	/*
> > > > -	 * MAP_FIXED may remove pages of mappings that intersects with requested
> > > > -	 * mapping. Account for the pages it would unmap.
> > > > -	 */
> > > > -	if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> > > > -		return -ENOMEM;
> > > >
> > > >  	if (unlikely(!can_modify_mm(mm, addr, end)))
> > > >  		return -EPERM;
> > > > @@ -2987,6 +2955,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > >  			vma_iter_next_range(&vmi);
> > > >  	}
> > > >
> > > > +	/* Check against address space limit. */
> > > > +	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
> > > > +		goto abort_munmap;
> > > > +
> > > 
> > > I know you can literally only do this after the vms_gather_munmap_vmas(),
> > > but this does change where we check this, so for instance we do
> > > arch_unmap() without having checked may_expand_vm().
> > > 
> > > However I assume this is fine?
> > 
> > Thanks for pointing this out.
> > 
> > The functionality here has changed
> > --- from ---
> > may_expand_vm() check
> > can_modify_mm() check
> > arch_unmap()
> > vms_gather_munmap_vmas()
> > ...
> > 
> > --- to ---
> > can_modify_mm() check
> > arch_unmap()
> > vms_gather_munmap_vmas()
> > may_expand_vm() check
> > ...
> > 
> > vms_gather_munmap_vmas() does nothing but figures out what to do later,
> > but could use memory and can fail.
> > 
> > The user implications are:
> > 
> > 1. The return type on the error may change to -EPERM from -ENOMEM, if
> > you are not allowed to expand and are trying to overwrite mseal()'ed
> > VMAs. That seems so very rare that I'm not sure it's worth mentioning.
> > 
> > 
> > 2. arch_unmap() called prior to may_expand_vm().
> > powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
> > within the unmap range.  User implication of this means that an
> > application my set the vdso to NULL prior to hitting the -ENOMEM case in
> > may_expand_vm() due to the address space limit.
> > 
> > Assuming the removal of the vdso does not cause the application to seg
> > fault, then the user visible change is that any vdso call after a failed
> > mmap(MAP_FIXED) call would result in a seg fault.  The only reason it
> > would fail is if the mapping process was attempting to map a large
> > enough area over the vdso (which is accounted and in the vma tree,
> > afaict) and ran out of memory. Note that this situation could arise
> > already since we could run out of memory (not accounting) after the
> > arch_unmap() call within the kernel.
> > 
> > The code today can suffer the same fate, but not by the accounting
> > failure.  It can happen due to failure to allocate a new vma,
> > do_vmi_munmap() failure after the arch_unmap() call, or any of the other
> > failure scenarios later in the mmap_region() function.
> >
> > At the very least, this requires an expanded change log.
> 
> After doing a deep dive into the vdso issue, I think it would be best to
> remove the arch_unmap() call completely in a later patch set by changing
> the two areas highlighted by Dave in patch 5a28fc94c914 "x86/mpx,
> mm/core: Fix recursive munmap() corruption" back in 2019 in regards to
> the powerpc pointer use.  But that's for later work.
> 
> In the above mentioned patch, the arch_unmap() was moved to an earlier
> time to avoid removing the same vma twice from the rbtree.  Since the
> mpx code no longer removes the vma and powerpc never removed the vma, it
> seems safe to reorder the calls as such:
> 
> can_modify_mm() check
> vms_gather_munmap_vmas()
> may_expand_vm() check
> arch_unmap()
> 
> This seems very much fine because:
> - powerpc is the only platform doing _anything_ in arch_unmap().
> - powerpc used to work with the arch_unmap() call  after the vma was
>   completely dropped.
> - The vma isn't even dropped by this point and so all proposed changes
>   will be completely undone in the rare case of may_expand_vm() failure.
> - The arch_unmap() call doesn't need to be that early anymore anyways
>   (mpx was dropped by Dave in 2020 git id ccaaaf6fe5a5).
> 
> I will make the order change in v4 of the patch series in its own patch.
> 
> Thanks,
> Liam

Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()

Posted by Dave Hansen 1 year, 7 months ago

On 7/9/24 07:52, Liam R. Howlett wrote:
...
> - The arch_unmap() call doesn't need to be that early anymore anyways
>   (mpx was dropped by Dave in 2020 git id ccaaaf6fe5a5).

Yep.  MPX is long gone and x86 doesn't do anything with arch_unmap() any
more.  Moving it around won't affect x86 at all.

Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()

Posted by Lorenzo Stoakes 1 year, 7 months ago

On Tue, Jul 09, 2024 at 10:42:41AM GMT, Liam R. Howlett wrote:
> Cc'ing Dave Hansen on this.
>
> * Liam R. Howlett <Liam.Howlett@oracle.com> [240708 16:43]:
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240708 08:53]:
> > > On Thu, Jul 04, 2024 at 02:27:18PM GMT, Liam R. Howlett wrote:
> > > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > > >
> > > > The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
> > > > call, so use it instead of looping over the vmas twice.
> > >
> > > Predictably indeed you removed the thing I commented on in the last patch
> > > ;) but at least this time I predicted it! ;)
> > >
> > > >
> > > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > > ---
> > > >  mm/mmap.c | 36 ++++--------------------------------
> > > >  1 file changed, 4 insertions(+), 32 deletions(-)
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index b2de26683903..62edaabf3987 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
>
> ...
>
> > > >  static void __vma_link_file(struct vm_area_struct *vma,
> > > >  			    struct address_space *mapping)
> > > >  {
> > > > @@ -2946,17 +2925,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > >  	pgoff_t vm_pgoff;
> > > >  	int error = -ENOMEM;
> > > >  	VMA_ITERATOR(vmi, mm, addr);
> > > > -	unsigned long nr_pages, nr_accounted;
> > > > -
> > > > -	nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> > > > -
> > > > -	/* Check against address space limit. */
> > > > -	/*
> > > > -	 * MAP_FIXED may remove pages of mappings that intersects with requested
> > > > -	 * mapping. Account for the pages it would unmap.
> > > > -	 */
> > > > -	if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> > > > -		return -ENOMEM;
> > > >
> > > >  	if (unlikely(!can_modify_mm(mm, addr, end)))
> > > >  		return -EPERM;
> > > > @@ -2987,6 +2955,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > >  			vma_iter_next_range(&vmi);
> > > >  	}
> > > >
> > > > +	/* Check against address space limit. */
> > > > +	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
> > > > +		goto abort_munmap;
> > > > +
> > >
> > > I know you can literally only do this after the vms_gather_munmap_vmas(),
> > > but this does change where we check this, so for instance we do
> > > arch_unmap() without having checked may_expand_vm().
> > >
> > > However I assume this is fine?
> >
> > Thanks for pointing this out.
> >
> > The functionality here has changed
> > --- from ---
> > may_expand_vm() check
> > can_modify_mm() check
> > arch_unmap()
> > vms_gather_munmap_vmas()
> > ...
> >
> > --- to ---
> > can_modify_mm() check
> > arch_unmap()
> > vms_gather_munmap_vmas()
> > may_expand_vm() check
> > ...
> >
> > vms_gather_munmap_vmas() does nothing but figures out what to do later,
> > but could use memory and can fail.
> >
> > The user implications are:
> >
> > 1. The return type on the error may change to -EPERM from -ENOMEM, if
> > you are not allowed to expand and are trying to overwrite mseal()'ed
> > VMAs. That seems so very rare that I'm not sure it's worth mentioning.
> >
> >
> > 2. arch_unmap() called prior to may_expand_vm().
> > powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
> > within the unmap range.  User implication of this means that an
> > application my set the vdso to NULL prior to hitting the -ENOMEM case in
> > may_expand_vm() due to the address space limit.
> >
> > Assuming the removal of the vdso does not cause the application to seg
> > fault, then the user visible change is that any vdso call after a failed
> > mmap(MAP_FIXED) call would result in a seg fault.  The only reason it
> > would fail is if the mapping process was attempting to map a large
> > enough area over the vdso (which is accounted and in the vma tree,
> > afaict) and ran out of memory. Note that this situation could arise
> > already since we could run out of memory (not accounting) after the
> > arch_unmap() call within the kernel.
> >
> > The code today can suffer the same fate, but not by the accounting
> > failure.  It can happen due to failure to allocate a new vma,
> > do_vmi_munmap() failure after the arch_unmap() call, or any of the other
> > failure scenarios later in the mmap_region() function.
> >
> > At the very least, this requires an expanded change log.
>
> After doing a deep dive into the vdso issue, I think it would be best to
> remove the arch_unmap() call completely in a later patch set by changing
> the two areas highlighted by Dave in patch 5a28fc94c914 "x86/mpx,
> mm/core: Fix recursive munmap() corruption" back in 2019 in regards to
> the powerpc pointer use.  But that's for later work.

Our replies coincided but yes absolutely. But should be a separate patch,
agreed.

>
> In the above mentioned patch, the arch_unmap() was moved to an earlier
> time to avoid removing the same vma twice from the rbtree.  Since the
> mpx code no longer removes the vma and powerpc never removed the vma, it
> seems safe to reorder the calls as such:
>
> can_modify_mm() check
> vms_gather_munmap_vmas()
> may_expand_vm() check
> arch_unmap()
>
> This seems very much fine because:
> - powerpc is the only platform doing _anything_ in arch_unmap().
> - powerpc used to work with the arch_unmap() call  after the vma was
>   completely dropped.
> - The vma isn't even dropped by this point and so all proposed changes
>   will be completely undone in the rare case of may_expand_vm() failure.
> - The arch_unmap() call doesn't need to be that early anymore anyways
>   (mpx was dropped by Dave in 2020 git id ccaaaf6fe5a5).
>
> I will make the order change in v4 of the patch series in its own patch.
>

Great!

> Thanks,
> Liam