During the mmap() of a file-backed mapping, we invoke the underlying
driver's f_op->mmap() callback in order to perform driver/file system
initialisation of the underlying VMA.
This has been a source of issues in the past, including a significant
security concern relating to unwinding of error state discovered by Jann
Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
error path behaviour") which performed the recent, significant, rework of
mmap() as a whole.
However, a fly in the ointment remains - drivers have a great deal of
freedom in the .mmap() hook to manipulate VMA state (as well as page table
state).
This can be problematic, as we can no longer reason sensibly about VMA
state once the call is complete (the ability to do - anything - here does
rather interfere with that).
In addition, callbacks may choose to do odd or unusual things which might
interfere with subsequent steps in the mmap() process, and may do so only to
then raise an error, requiring very careful unwinding of state about which
we can make no assumptions.
Rather than providing such an open-ended interface, this series provides an
alternative, far more restrictive one - we expose a whitelist of fields
which can be adjusted by the driver, along with immutable state upon which
the driver can make such decisions:
struct vma_proto {
        /* Immutable state. */
        struct mm_struct *mm;
        unsigned long start;
        unsigned long end;

        /* Mutable fields. Populated with initial state. */
        pgoff_t pgoff;
        struct file *file;
        vm_flags_t flags;
        pgprot_t page_prot;

        /* Write-only fields. */
        const struct vm_operations_struct *vm_ops;
        void *private_data;
};
The mmap logic then uses this state to either merge with an existing VMA or
establish a new VMA as appropriate.
This is achieved via the new f_op hook .mmap_proto(), which is, importantly,
invoked very early on in the mmap() process.
If an error arises, we can very simply abort the operation with very little
unwinding of state required.
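To make this concrete, a driver hook under this scheme might look something
like the sketch below. This is illustrative only - the exact .mmap_proto()
signature is defined by the series, and the my_dev_* names are invented for
the example:

/*
 * Illustrative sketch only - assumes the proposed struct vma_proto and an
 * .mmap_proto() hook taking a pointer to it; the my_dev_* names are
 * hypothetical. Assumes <linux/fs.h> and <linux/mm.h>.
 */
struct my_dev {
        unsigned long mappable_size;    /* Hypothetical driver state. */
};

static const struct vm_operations_struct my_dev_vm_ops; /* .fault etc. omitted from this sketch. */

static int my_dev_mmap_proto(struct vma_proto *proto)
{
        struct my_dev *dev = proto->file->private_data;

        /* The immutable state can be used to validate the request... */
        if (proto->end - proto->start > dev->mappable_size)
                return -EINVAL;

        /* ...while only the whitelisted mutable fields may be written. */
        proto->flags |= VM_DONTDUMP;
        proto->vm_ops = &my_dev_vm_ops;
        proto->private_data = dev;

        /* An error return aborts the mmap() with trivial unwinding. */
        return 0;
}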
The existing logic contains another, related, peccadillo - since the
.mmap() callback might do anything, it may also cause a previously
unmergeable VMA to become mergeable with adjacent VMAs.
Right now the logic will retry a merge like this only if the driver changes
the VMA flags, and changes them in such a way that a merge might succeed
(that is, the flags are not 'special' - they do not contain any of the flags
specified in VM_SPECIAL).
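Expressed as a paraphrase (not the literal mm code), the retry condition is
roughly:

/*
 * Paraphrased sketch of the existing retry-merge check; names are
 * illustrative, this is not the literal kernel code.
 */
static bool want_retry_merge(vm_flags_t orig_flags, vm_flags_t new_flags)
{
        /* Only retry if the .mmap() callback actually changed the flags... */
        if (new_flags == orig_flags)
                return false;

        /* ...and only if the result contains no unmergeable VM_SPECIAL bits. */
        return !(new_flags & VM_SPECIAL);
}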
This has been the source of a great deal of pain for a while - it is hard
to reason about an .mmap() callback that might do - anything - but it's
also hard to reason about setting up a VMA and writing to the maple tree,
only to do it again utilising a great deal of shared state.
Since .mmap_proto() sets fields before the first merge is even attempted,
the use of this callback obviates the need for this retry merge logic.
A driver can specify either .mmap_proto(), .mmap() or both. This provides
maximum flexibility.
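So, for example, a driver could wire up either or both hooks in its
file_operations as sketched here (again illustrative only - the my_dev_*
names are invented and the exact hook prototype is defined by the series):

/* Sketch only; the my_dev_* handlers are assumed defined elsewhere. */
static const struct file_operations my_dev_fops = {
        .owner          = THIS_MODULE,
        .open           = my_dev_open,
        /* New-style hook: runs early and touches only whitelisted fields. */
        .mmap_proto     = my_dev_mmap_proto,
        /* The existing hook can still be provided alongside it if needed. */
        .mmap           = my_dev_mmap,
};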
In researching this change, I examined every .mmap() callback, and
discovered only a very few that set VMA state in such a way that (a) the VMA
flags are changed and (b) the resulting VMA would be mergeable.
In the majority of cases, it turns out that drivers are mapping kernel
memory and thus ultimately set VM_PFNMAP, VM_MIXEDMAP, or other unmergeable
VM_SPECIAL flags.
Of those that remain I identified a number of cases which are only
applicable in DAX, setting the VM_HUGEPAGE flag:
* dax_mmap()
* erofs_file_mmap()
* ext4_file_mmap()
* xfs_file_mmap()
For the removal of this retry merge to impact users, each of these cases
would require a user to mmap() files using DAX, in parts, immediately
adjacent to one another.
This is a very unlikely use case, so it does not appear to be worthwhile
to adjust this functionality accordingly.
We can, however, very quickly do so if needed by simply adding an
.mmap_proto() hook to these as required.
There are two further non-DAX cases I identified:
* orangefs_file_mmap() - Clears VM_RAND_READ if set, replacing with
VM_SEQ_READ.
* usb_stream_hwdep_mmap() - Sets VM_DONTDUMP.
Both of these cases again seem very unlikely to be mmap()'d immediately
adjacent to one another in a fashion that would result in a merge.
Finally, we are left with a viable case:
* secretmem_mmap() - Sets VM_LOCKED, VM_DONTDUMP.
This is viable enough that the mm selftests trigger the logic as a matter of
course. Therefore, this series replaces the secretmem_mmap() hook with an
.mmap_proto() equivalent.
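For reference, a hypothetical sketch of what the converted secretmem hook
could look like is given below - the actual patch in this series may differ,
and the hook signature is assumed as above:

/*
 * Hypothetical sketch of the secretmem conversion; the real patch may
 * differ. It mirrors the checks done by the existing secretmem_mmap().
 */
static int secretmem_mmap_proto(struct vma_proto *proto)
{
        const unsigned long len = proto->end - proto->start;

        if ((proto->flags & (VM_SHARED | VM_MAYSHARE)) == 0)
                return -EINVAL;

        if (!mlock_future_ok(proto->mm, proto->flags | VM_LOCKED, len))
                return -EAGAIN;

        /* Flags are set before the first merge attempt, so no retry is needed. */
        proto->flags |= VM_LOCKED | VM_DONTDUMP;
        proto->vm_ops = &secretmem_vm_ops;

        return 0;
}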
Lorenzo Stoakes (3):
mm: introduce new .mmap_proto() f_op callback
mm: secretmem: convert to .mmap_proto() hook
mm/vma: remove mmap() retry merge
include/linux/fs.h | 7 +++
include/linux/mm_types.h | 24 ++++++++
mm/memory.c | 3 +-
mm/mmap.c | 2 +-
mm/secretmem.c | 14 ++---
mm/vma.c | 99 +++++++++++++++++++++++++++-----
tools/testing/vma/vma_internal.h | 38 ++++++++++++
7 files changed, 164 insertions(+), 23 deletions(-)
--
2.49.0
On Wed, Apr 30, 2025 at 9:59 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> A driver can specify either .mmap_proto(), .mmap() or both. This provides
> maximum flexibility.

Just to check I understand the intent correctly: The idea here is that
.mmap_proto() is, at least for now, only for drivers who want to trigger
merging, right? If a driver doesn't need merging, the normal .mmap() handler
can still do all the things it was able to do before (like changing the
->vm_file pointer through vma_set_file(), or changing VMA flags in allowed
ways)?
On Wed, Apr 30, 2025 at 11:29:46PM +0200, Jann Horn wrote:
> On Wed, Apr 30, 2025 at 9:59 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > A driver can specify either .mmap_proto(), .mmap() or both. This provides
> > maximum flexibility.
>
> Just to check I understand the intent correctly: The idea here is that
> .mmap_proto() is, at least for now, only for drivers who want to
> trigger merging, right? If a driver doesn't need merging, the normal
> .mmap() handler can still do all the things it was able to do before
> (like changing the ->vm_file pointer through vma_set_file(), or
> changing VMA flags in allowed ways)?

No, the intent is that this forms the basis of an entirely new set of
callbacks to use _instead_ of .mmap().

The vma_set_file() semantics would need to be changed actually, I will update
the logic to do this implicitly when the user sets vma_proto (or whatever this
gets renamed to :P)->file - we can do this automagically.

However the first use is indeed to be able to remove this merge retry. I have
gone through the .mmap() callbacks and identified the only one that seems to
meaningfully care, so it's a great first use.

Two birds with one stone, and forming the foundation for future work to
prevent drivers from having carte blanche to do whatever they like on .mmap()
wrt VMAs.
On Thu, May 01, 2025 at 11:27:51AM +0100, Lorenzo Stoakes wrote:
> On Wed, Apr 30, 2025 at 11:29:46PM +0200, Jann Horn wrote:
> > On Wed, Apr 30, 2025 at 9:59 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > > A driver can specify either .mmap_proto(), .mmap() or both. This provides
> > > maximum flexibility.
> >
> > Just to check I understand the intent correctly: The idea here is that
> > .mmap_proto() is, at least for now, only for drivers who want to
> > trigger merging, right? If a driver doesn't need merging, the normal
> > .mmap() handler can still do all the things it was able to do before
> > (like changing the ->vm_file pointer through vma_set_file(), or
> > changing VMA flags in allowed ways)?
>
> No, the intent is that this forms the basis of an entirely new set of
> callbacks to use _instead_ of .mmap().
>
> The vma_set_file() semantics would need to be changed actually, I will
> update the logic to do this implicitly when the user sets vma_proto (or
> whatever this gets renamed to :P)->file - we can do this automagically.

Actually there is no need, as the file pointer we pass to the caller does not
have an incremented reference count (it is pinned by the .mmap() call); we
only increment it once it's in the VMA, so this is unnecessary.

> However the first use is indeed to be able to remove this merge retry. I
> have gone through the .mmap() callbacks and identified the only one that
> seems to meaningfully care, so it's a great first use.
>
> Two birds with one stone, and forming the foundation for future work to
> prevent drivers from having carte blanche to do whatever they like on
> .mmap() wrt VMAs.
Apologies, made a mistake that I wrongly assumed tooling would catch... let me
try sending this again...