[RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook

Lorenzo Stoakes posted 3 patches 9 months, 1 week ago
There is a newer version of this series
include/linux/fs.h               | 38 +++++++++++++++
include/linux/mm_types.h         | 24 ++++++++++
mm/memory.c                      |  3 +-
mm/mmap.c                        |  2 +-
mm/secretmem.c                   | 14 +++---
mm/vma.c                         | 82 ++++++++++++++++++++++++++------
tools/testing/vma/vma_internal.h | 79 ++++++++++++++++++++++++++++--
7 files changed, 214 insertions(+), 28 deletions(-)
[RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook
Posted by Lorenzo Stoakes 9 months, 1 week ago
During the mmap() of a file-backed mapping, we invoke the underlying driver
file's mmap() callback in order to perform driver/file system
initialisation of the underlying VMA.

This has been a source of issues in the past, including a significant
security concern relating to unwinding of error state discovered by Jann
Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
error path behaviour") which performed the recent, significant, rework of
mmap() as a whole.

However, we have had a fly in the ointment remain - drivers have a great
deal of freedom in the .mmap() hook to manipulate VMA state (as well as
page table state).

This can be problematic, as we can no longer reason sensibly about VMA
state once the call is complete (the ability to do - anything - here does
rather interfere with that).

In addition, callers may choose to do odd or unusual things which might
interfere with subsequent steps in the mmap() process, and it may do so and
then raise an error, requiring very careful unwinding of state about which
we can make no assumptions.

Rather than providing such an open-ended interface, this series provides an
alternative, far more restrictive one - we expose a whitelist of fields
which can be adjusted by the driver, along with immutable state upon which
the driver can make such decisions:

struct vm_area_desc {
	/* Immutable state. */
	struct mm_struct *mm;
	unsigned long start;
	unsigned long end;

	/* Mutable fields. Populated with initial state. */
	pgoff_t pgoff;
	struct file *file;
	vm_flags_t vm_flags;
	pgprot_t page_prot;

	/* Write-only fields. */
	const struct vm_operations_struct *vm_ops;
	void *private_data;
};

The mmap logic then updates the state used to either merge with a VMA or
establish a new VMA based upon this logic.

This is achieved via new file hook .mmap_prepare(), which is, importantly,
invoked very early on in the mmap() process.

If an error arises, we can very simply abort the operation with very little
unwinding of state required.

The existing logic contains another, related, peccadillo - since the
.mmap() callback might do anything, it may also cause a previously
unmergeable VMA to become mergeable with adjacent VMAs.

Right now the logic will retry a merge like this only if the driver changes
VMA flags, and changes them in such a way that a merge might succeed (that
is, the flags are not 'special', that is do not contain any of the flags
specified in VM_SPECIAL).

This has also been the source of a great deal of pain - it's hard to
reason about an .mmap() callback that might do - anything - but it's also
hard to reason about setting up a VMA and writing to the maple tree, only
to do it again utilising a great deal of shared state.

Since .mmap_prepare() sets fields before the first merge is even attempted,
the use of this callback obviates the need for this retry merge logic.

A driver may only specify .mmap_prepare() or the deprecated .mmap()
callback. In future we may add futher callbacks beyond .mmap_prepare() to
faciliate all use cass as we convert drivers.

In researching this change, I examined every .mmap() callback, and
discovered only a very few that set VMA state in such a way that a. the VMA
flags changed and b. this would be mergeable.

In the majority of cases, it turns out that drivers are mapping kernel
memory and thus ultimately set VM_PFNMAP, VM_MIXEDMAP, or other unmergeable
VM_SPECIAL flags.

Of those that remain I identified a number of cases which are only
applicable in DAX, setting the VM_HUGEPAGE flag:

* dax_mmap()
* erofs_file_mmap()
* ext4_file_mmap()
* xfs_file_mmap()

For this remerge to not occur and to impact users, each of these cases
would require a user to mmap() files using DAX, in parts, immediately
adjacent to one another.

This is a very unlikely usecase and so it does not appear to be worthwhile
to adjust this functionality accordingly.

We can, however, very quickly do so if needed by simply adding an
.mmap_prepare() callback to these as required.

There are two further non-DAX cases I idenitfied:

* orangefs_file_mmap() - Clears VM_RAND_READ if set, replacing with
  VM_SEQ_READ.
* usb_stream_hwdep_mmap() - Sets VM_DONTDUMP.

Both of these cases again seem very unlikely to be mmap()'d immediately
adjacent to one another in a fashion that would result in a merge.

Finally, we are left with a viable case:

* secretmem_mmap() - Set VM_LOCKED, VM_DONTDUMP.

This is viable enough that the mm selftests trigger the logic as a matter
of course. Therefore, this series replace the .secretmem_mmap() hook with
.secret_mmap_prepare().

RFC v2:
* Renamed .mmap_proto() to .mmap_prepare() as per David.
* Made .mmap_prepare(), .mmap() mutually exclusive.
* Updated call_mmap() to bail out if .mmap_prepare() callback present as per Jann.
* Renamed vma_proto to vm_area_desc as per Mike.
* Added accidentally missing page_prot assignment/read from vm_area_desc.
* Renamed vm_area_desc->flags to vm_flags as per Liam.
* Added [__]call_mmap_prepare() for consistency with call_mmap().
* Added file_has_xxx_hook() helpers.
* Renamed file_has_valid_mmap() to file_has_valid_mmap_hooks() and check that
  the hook is mutually exclusive.

RFC v1:
https://lore.kernel.org/all/cover.1746040540.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (3):
  mm: introduce new .mmap_prepare() file callback
  mm: secretmem: convert to .mmap_prepare() hook
  mm/vma: remove mmap() retry merge

 include/linux/fs.h               | 38 +++++++++++++++
 include/linux/mm_types.h         | 24 ++++++++++
 mm/memory.c                      |  3 +-
 mm/mmap.c                        |  2 +-
 mm/secretmem.c                   | 14 +++---
 mm/vma.c                         | 82 ++++++++++++++++++++++++++------
 tools/testing/vma/vma_internal.h | 79 ++++++++++++++++++++++++++++--
 7 files changed, 214 insertions(+), 28 deletions(-)

-- 
2.49.0
Re: [RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook
Posted by Jan Kara 9 months, 1 week ago
On Thu 01-05-25 18:25:26, Lorenzo Stoakes wrote:
> During the mmap() of a file-backed mapping, we invoke the underlying driver
> file's mmap() callback in order to perform driver/file system
> initialisation of the underlying VMA.
> 
> This has been a source of issues in the past, including a significant
> security concern relating to unwinding of error state discovered by Jann
> Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> error path behaviour") which performed the recent, significant, rework of
> mmap() as a whole.
> 
> However, we have had a fly in the ointment remain - drivers have a great
> deal of freedom in the .mmap() hook to manipulate VMA state (as well as
> page table state).
> 
> This can be problematic, as we can no longer reason sensibly about VMA
> state once the call is complete (the ability to do - anything - here does
> rather interfere with that).
> 
> In addition, callers may choose to do odd or unusual things which might
> interfere with subsequent steps in the mmap() process, and it may do so and
> then raise an error, requiring very careful unwinding of state about which
> we can make no assumptions.
> 
> Rather than providing such an open-ended interface, this series provides an
> alternative, far more restrictive one - we expose a whitelist of fields
> which can be adjusted by the driver, along with immutable state upon which
> the driver can make such decisions:
> 
> struct vm_area_desc {
> 	/* Immutable state. */
> 	struct mm_struct *mm;
> 	unsigned long start;
> 	unsigned long end;
> 
> 	/* Mutable fields. Populated with initial state. */
> 	pgoff_t pgoff;
> 	struct file *file;
> 	vm_flags_t vm_flags;
> 	pgprot_t page_prot;
> 
> 	/* Write-only fields. */
> 	const struct vm_operations_struct *vm_ops;
> 	void *private_data;
> };
> 
> The mmap logic then updates the state used to either merge with a VMA or
> establish a new VMA based upon this logic.
> 
> This is achieved via new file hook .mmap_prepare(), which is, importantly,
> invoked very early on in the mmap() process.
> 
> If an error arises, we can very simply abort the operation with very little
> unwinding of state required.

Looks sensible. So is there a plan to transform existing .mmap hooks to
.mmap_prepare hooks? I agree that for most filesystems this should be just
easy 1:1 replacement and AFAIU this would be prefered?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook
Posted by Lorenzo Stoakes 9 months, 1 week ago
On Fri, May 02, 2025 at 02:20:38PM +0200, Jan Kara wrote:
> On Thu 01-05-25 18:25:26, Lorenzo Stoakes wrote:
> > During the mmap() of a file-backed mapping, we invoke the underlying driver
> > file's mmap() callback in order to perform driver/file system
> > initialisation of the underlying VMA.
> >
> > This has been a source of issues in the past, including a significant
> > security concern relating to unwinding of error state discovered by Jann
> > Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> > error path behaviour") which performed the recent, significant, rework of
> > mmap() as a whole.
> >
> > However, we have had a fly in the ointment remain - drivers have a great
> > deal of freedom in the .mmap() hook to manipulate VMA state (as well as
> > page table state).
> >
> > This can be problematic, as we can no longer reason sensibly about VMA
> > state once the call is complete (the ability to do - anything - here does
> > rather interfere with that).
> >
> > In addition, callers may choose to do odd or unusual things which might
> > interfere with subsequent steps in the mmap() process, and it may do so and
> > then raise an error, requiring very careful unwinding of state about which
> > we can make no assumptions.
> >
> > Rather than providing such an open-ended interface, this series provides an
> > alternative, far more restrictive one - we expose a whitelist of fields
> > which can be adjusted by the driver, along with immutable state upon which
> > the driver can make such decisions:
> >
> > struct vm_area_desc {
> > 	/* Immutable state. */
> > 	struct mm_struct *mm;
> > 	unsigned long start;
> > 	unsigned long end;
> >
> > 	/* Mutable fields. Populated with initial state. */
> > 	pgoff_t pgoff;
> > 	struct file *file;
> > 	vm_flags_t vm_flags;
> > 	pgprot_t page_prot;
> >
> > 	/* Write-only fields. */
> > 	const struct vm_operations_struct *vm_ops;
> > 	void *private_data;
> > };
> >
> > The mmap logic then updates the state used to either merge with a VMA or
> > establish a new VMA based upon this logic.
> >
> > This is achieved via new file hook .mmap_prepare(), which is, importantly,
> > invoked very early on in the mmap() process.
> >
> > If an error arises, we can very simply abort the operation with very little
> > unwinding of state required.
>
> Looks sensible. So is there a plan to transform existing .mmap hooks to
> .mmap_prepare hooks? I agree that for most filesystems this should be just
> easy 1:1 replacement and AFAIU this would be prefered?

Thanks!

Yeah the intent is to convert _all_ callers away from .mmap() so we can
lock down what drivers are doing and be able to (relatively) safely make
assumptions about what's going on in mmap logic.

As David points out, we may need to add new callbacks to account for other
requirements by drivers but .mmap_prepare() forms the foundation.

Great to hear about filesystems, having done a big scan through .mmap()
hooks I am pretty convinced _most_ can be pretty easily replaced, there are
a few tricky things though, most notably things that 'pass through'
.mmap(), as raised by (other) Jan[n].

For now we fudge that case by just disallowing any driver that doesn't
implement .mmap(), though I think in future we can temporarily work around
this by having a wrapper function that takes a driver with an
.mmap_prepare() callback, populate a vm_area_desc object from the vma, then
'do the right thing' in setting vma fields once complete.

Once the conversion is complete we can drop this.

So overall I think this is very workable, and it's very helpful that we can
do this incrementally over a period time, given the number of .mmap()
callbacks that need changing :)

I am happy to dedicate my time to doing this (as this is near and dear to
my heart) though will of course encourage anybody else willing to help with
the effort :)

I think it'll lead to a more stable kernel overall.

>
> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

Cheers, Lorenzo
Re: [RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook
Posted by Christian Brauner 9 months, 1 week ago
On Fri, May 02, 2025 at 01:59:49PM +0100, Lorenzo Stoakes wrote:
> On Fri, May 02, 2025 at 02:20:38PM +0200, Jan Kara wrote:
> > On Thu 01-05-25 18:25:26, Lorenzo Stoakes wrote:
> > > During the mmap() of a file-backed mapping, we invoke the underlying driver
> > > file's mmap() callback in order to perform driver/file system
> > > initialisation of the underlying VMA.
> > >
> > > This has been a source of issues in the past, including a significant
> > > security concern relating to unwinding of error state discovered by Jann
> > > Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> > > error path behaviour") which performed the recent, significant, rework of
> > > mmap() as a whole.
> > >
> > > However, we have had a fly in the ointment remain - drivers have a great
> > > deal of freedom in the .mmap() hook to manipulate VMA state (as well as
> > > page table state).
> > >
> > > This can be problematic, as we can no longer reason sensibly about VMA
> > > state once the call is complete (the ability to do - anything - here does
> > > rather interfere with that).
> > >
> > > In addition, callers may choose to do odd or unusual things which might
> > > interfere with subsequent steps in the mmap() process, and it may do so and
> > > then raise an error, requiring very careful unwinding of state about which
> > > we can make no assumptions.
> > >
> > > Rather than providing such an open-ended interface, this series provides an
> > > alternative, far more restrictive one - we expose a whitelist of fields
> > > which can be adjusted by the driver, along with immutable state upon which
> > > the driver can make such decisions:
> > >
> > > struct vm_area_desc {
> > > 	/* Immutable state. */
> > > 	struct mm_struct *mm;
> > > 	unsigned long start;
> > > 	unsigned long end;
> > >
> > > 	/* Mutable fields. Populated with initial state. */
> > > 	pgoff_t pgoff;
> > > 	struct file *file;
> > > 	vm_flags_t vm_flags;
> > > 	pgprot_t page_prot;
> > >
> > > 	/* Write-only fields. */
> > > 	const struct vm_operations_struct *vm_ops;
> > > 	void *private_data;
> > > };
> > >
> > > The mmap logic then updates the state used to either merge with a VMA or
> > > establish a new VMA based upon this logic.
> > >
> > > This is achieved via new file hook .mmap_prepare(), which is, importantly,
> > > invoked very early on in the mmap() process.
> > >
> > > If an error arises, we can very simply abort the operation with very little
> > > unwinding of state required.
> >
> > Looks sensible. So is there a plan to transform existing .mmap hooks to
> > .mmap_prepare hooks? I agree that for most filesystems this should be just
> > easy 1:1 replacement and AFAIU this would be prefered?
> 
> Thanks!
> 
> Yeah the intent is to convert _all_ callers away from .mmap() so we can
> lock down what drivers are doing and be able to (relatively) safely make
> assumptions about what's going on in mmap logic.
> 
> As David points out, we may need to add new callbacks to account for other

The plural is a little worrying, let's please aim minimize the number of
new methods we need for this.
Re: [RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook
Posted by Lorenzo Stoakes 9 months, 1 week ago
On Mon, May 05, 2025 at 03:37:39PM +0200, Christian Brauner wrote:
> On Fri, May 02, 2025 at 01:59:49PM +0100, Lorenzo Stoakes wrote:
> > On Fri, May 02, 2025 at 02:20:38PM +0200, Jan Kara wrote:
> > > On Thu 01-05-25 18:25:26, Lorenzo Stoakes wrote:
> > > > During the mmap() of a file-backed mapping, we invoke the underlying driver
> > > > file's mmap() callback in order to perform driver/file system
> > > > initialisation of the underlying VMA.
> > > >
> > > > This has been a source of issues in the past, including a significant
> > > > security concern relating to unwinding of error state discovered by Jann
> > > > Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> > > > error path behaviour") which performed the recent, significant, rework of
> > > > mmap() as a whole.
> > > >
> > > > However, we have had a fly in the ointment remain - drivers have a great
> > > > deal of freedom in the .mmap() hook to manipulate VMA state (as well as
> > > > page table state).
> > > >
> > > > This can be problematic, as we can no longer reason sensibly about VMA
> > > > state once the call is complete (the ability to do - anything - here does
> > > > rather interfere with that).
> > > >
> > > > In addition, callers may choose to do odd or unusual things which might
> > > > interfere with subsequent steps in the mmap() process, and it may do so and
> > > > then raise an error, requiring very careful unwinding of state about which
> > > > we can make no assumptions.
> > > >
> > > > Rather than providing such an open-ended interface, this series provides an
> > > > alternative, far more restrictive one - we expose a whitelist of fields
> > > > which can be adjusted by the driver, along with immutable state upon which
> > > > the driver can make such decisions:
> > > >
> > > > struct vm_area_desc {
> > > > 	/* Immutable state. */
> > > > 	struct mm_struct *mm;
> > > > 	unsigned long start;
> > > > 	unsigned long end;
> > > >
> > > > 	/* Mutable fields. Populated with initial state. */
> > > > 	pgoff_t pgoff;
> > > > 	struct file *file;
> > > > 	vm_flags_t vm_flags;
> > > > 	pgprot_t page_prot;
> > > >
> > > > 	/* Write-only fields. */
> > > > 	const struct vm_operations_struct *vm_ops;
> > > > 	void *private_data;
> > > > };
> > > >
> > > > The mmap logic then updates the state used to either merge with a VMA or
> > > > establish a new VMA based upon this logic.
> > > >
> > > > This is achieved via new file hook .mmap_prepare(), which is, importantly,
> > > > invoked very early on in the mmap() process.
> > > >
> > > > If an error arises, we can very simply abort the operation with very little
> > > > unwinding of state required.
> > >
> > > Looks sensible. So is there a plan to transform existing .mmap hooks to
> > > .mmap_prepare hooks? I agree that for most filesystems this should be just
> > > easy 1:1 replacement and AFAIU this would be prefered?
> >
> > Thanks!
> >
> > Yeah the intent is to convert _all_ callers away from .mmap() so we can
> > lock down what drivers are doing and be able to (relatively) safely make
> > assumptions about what's going on in mmap logic.
> >
> > As David points out, we may need to add new callbacks to account for other
>
> The plural is a little worrying, let's please aim minimize the number of
> new methods we need for this.

Ack. My intent is maximum one more, but to try to avoid it at all costs.