[PATCH 00/16] expand mmap_prepare functionality, port more users

Lorenzo Stoakes posted 16 patches 1 day, 9 hours ago
Documentation/filesystems/porting.rst |   9 ++
Documentation/filesystems/vfs.rst     |  35 +++++++
arch/csky/include/asm/pgtable.h       |   5 +
arch/mips/alchemy/common/setup.c      |  28 +++++-
arch/mips/include/asm/pgtable.h       |  10 ++
arch/s390/kernel/crash_dump.c         |   6 +-
arch/sparc/include/asm/pgtable_32.h   |  29 +++++-
arch/sparc/include/asm/pgtable_64.h   |  29 +++++-
drivers/char/mem.c                    |  80 ++++++++-------
drivers/dax/device.c                  |  32 +++---
fs/cramfs/inode.c                     | 134 ++++++++++++++++++--------
fs/hugetlbfs/inode.c                  |  86 +++++++++--------
fs/ntfs3/file.c                       |   2 +-
fs/proc/inode.c                       |  13 ++-
fs/proc/vmcore.c                      |  53 +++++++---
fs/resctrl/pseudo_lock.c              |  56 ++++++++---
include/linux/fs.h                    |   4 +
include/linux/mm.h                    |  53 +++++++++-
include/linux/mm_types.h              |   5 +
include/linux/proc_fs.h               |   5 +
include/linux/shmem_fs.h              |   3 +-
include/linux/vmalloc.h               |  10 +-
kernel/kcov.c                         |  40 +++++---
kernel/relay.c                        |  32 +++---
mm/memory.c                           | 128 +++++++++++++++---------
mm/secretmem.c                        |   2 +-
mm/shmem.c                            |  49 +++++++---
mm/util.c                             |  18 +++-
mm/vma.c                              |  96 +++++++++++++++---
mm/vmalloc.c                          |  16 ++-
tools/testing/vma/vma_internal.h      |  31 +++++-
31 files changed, 810 insertions(+), 289 deletions(-)
[PATCH 00/16] expand mmap_prepare functionality, port more users
Posted by Lorenzo Stoakes 1 day, 9 hours ago
Since commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file
callback"), the f_op->mmap hook has been deprecated in favour of
f_op->mmap_prepare.

This was introduced in order to make it possible for us to eventually
eliminate the f_op->mmap hook which is highly problematic as it allows
drivers and filesystems raw access to a VMA which is not yet correctly
initialised.

This hook also introduces complexity for the memory mapping operation, as
we must correctly unwind what we do should an error arise.

Overall this interface being so open has caused significant problems for
us, including security issues. It is important for us to simply eliminate
this as a source of problems.

Therefore this series continues what was established by extending the
functionality further to permit more drivers and filesystems to use
mmap_prepare.

After updating some areas that can simply use mmap_prepare as-is, and
performing some housekeeping, we then introduce two new hooks:

f_op->mmap_complete - this is invoked at the point of the VMA having been
correctly inserted, though with the VMA write lock still held. mmap_prepare
must also be specified.

This expands the use of mmap_prepare to those callers which need to
prepopulate mappings, as well as any which do genuinely require access to
the VMA.

It's simple - we will let the caller access the VMA, but only once it's
established. At this point unwinding is simple - we just unmap the
VMA.

The VMA is also then correctly initialised at this stage so there can be no
issues arising from a not-fully initialised VMA at this point.

The other newly added hook is:

f_op->mmap_abort - this is only valid in conjunction with mmap_prepare and
mmap_complete. This is called should an error arise between mmap_prepare
and mmap_complete (not as a result of mmap_prepare but rather some other
part of the mapping logic).

This is required in case mmap_prepare wishes to establish state or locks
which need to be cleaned up on completion. If we did not provide this, then
this could not be permitted as this cleanup would otherwise not occur
should the mapping fail between the two calls.
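
As a rough illustration of the ordering described above, here is a toy
user-space model of the three hooks. It is only a sketch: toy_desc,
toy_vma, toy_fops, toy_mmap() and the demo_* callbacks are all hypothetical
names invented for this example, not kernel API, and the choice to have the
complete hook release the state taken in prepare is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model. Every name here is hypothetical, not kernel API. */
struct toy_desc { bool state_held; };           /* filled in by prepare */
struct toy_vma  { bool inserted; bool populated; };

struct toy_fops {
	int  (*prepare)(struct toy_desc *desc);
	int  (*complete)(struct toy_vma *vma, struct toy_desc *desc);
	void (*abort)(struct toy_desc *desc);
};

static int demo_locks_held;     /* state established by demo_prepare() */
static int demo_abort_calls;

static int demo_prepare(struct toy_desc *desc)
{
	demo_locks_held++;              /* establish state needing cleanup */
	desc->state_held = true;
	return 0;
}

static int demo_complete(struct toy_vma *vma, struct toy_desc *desc)
{
	assert(vma->inserted);          /* only ever sees an established VMA */
	vma->populated = true;          /* e.g. prepopulate the mapping */
	demo_locks_held--;              /* assumption: complete drops the state */
	desc->state_held = false;
	return 0;
}

static void demo_abort(struct toy_desc *desc)
{
	demo_locks_held--;              /* clean up what prepare established */
	desc->state_held = false;
	demo_abort_calls++;
}

/* Models the core mapping path; fail_between simulates a core-code error. */
static int toy_mmap(const struct toy_fops *fops, bool fail_between)
{
	struct toy_desc desc = { 0 };
	struct toy_vma vma = { 0 };
	int err;

	err = fops->prepare(&desc);
	if (err)
		return err;             /* prepare itself failed: no abort call */

	if (fail_between) {             /* error between prepare and complete */
		if (fops->abort)
			fops->abort(&desc);
		return -ENOMEM;
	}

	vma.inserted = true;            /* the VMA is now established */
	if (fops->complete) {
		err = fops->complete(&vma, &desc);
		if (err) {
			vma.inserted = false;   /* unwinding: just unmap */
			return err;
		}
	}
	return 0;
}
```

Calling toy_mmap() with fail_between set exercises the abort path; in both
paths the state taken by demo_prepare() ends up released.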

We then add split remap_pfn_range*() functions which allow for PFN remap (a
typical mapping prepopulation operation) split between a prepare/complete
step, as well as io_remap_pfn_range_prepare, complete for a similar
purpose.
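
To make the split concrete, here is a minimal user-space sketch of a
two-step remap. It assumes that the prepare step only records intent on a
descriptor (flags and the starting PFN) and that the complete step verifies
those flags before filling in entries; that division of labour, along with
toy_desc, toy_vma, TOY_REMAP_FLAGS and both functions, is an assumption made
for illustration and not the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; values and names are assumptions for this sketch. */
#define TOY_REMAP_FLAGS 0x3UL
#define TOY_NPAGES      4

struct toy_desc { unsigned long flags; unsigned long pfn; };
struct toy_vma  { unsigned long flags; unsigned long pte[TOY_NPAGES]; };

/* Step 1: runs before the VMA exists, so it may only touch the descriptor. */
static void toy_remap_prepare(struct toy_desc *desc, unsigned long pfn)
{
	desc->flags |= TOY_REMAP_FLAGS;
	desc->pfn = pfn;
}

/* Step 2: runs on the established VMA and does the actual population. */
static int toy_remap_complete(struct toy_vma *vma, unsigned long pfn,
			      size_t npages)
{
	/* Refuse to populate unless every required flag was set up front. */
	if ((vma->flags & TOY_REMAP_FLAGS) != TOY_REMAP_FLAGS)
		return -EINVAL;
	if (npages > TOY_NPAGES)
		return -EINVAL;

	for (size_t i = 0; i < npages; i++)
		vma->pte[i] = pfn + i;  /* one contiguous PFN per page */
	return 0;
}
```

A caller would run toy_remap_prepare() on the descriptor, copy the
descriptor flags into the VMA once it is established, and only then call
toy_remap_complete().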

From there we update various mm-adjacent logic to use this functionality as
a first set of changes, as well as resctrl and cramfs filesystems to round
off the non-stacked filesystem instances.


REVIEWER NOTE:
~~~~~~~~~~~~~~

I considered putting the complete, abort callbacks in vm_ops, however this
won't work because then we would be unable to adjust helpers like
generic_file_mmap_prepare() (which provides vm_ops) to provide the correct
complete, abort callbacks.

Conceptually it also makes more sense to have these in f_op as they are
one-off operations performed at mmap time to establish the VMA, rather than
a property of the VMA itself.

Lorenzo Stoakes (16):
  mm/shmem: update shmem to use mmap_prepare
  device/dax: update devdax to use mmap_prepare
  mm: add vma_desc_size(), vma_desc_pages() helpers
  relay: update relay to use mmap_prepare
  mm/vma: rename mmap internal functions to avoid confusion
  mm: introduce the f_op->mmap_complete, mmap_abort hooks
  doc: update porting, vfs documentation for mmap_[complete, abort]
  mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
  mm: introduce io_remap_pfn_range_prepare, complete
  mm/hugetlb: update hugetlbfs to use mmap_prepare, mmap_complete
  mm: update mem char driver to use mmap_prepare, mmap_complete
  mm: update resctl to use mmap_prepare, mmap_complete, mmap_abort
  mm: update cramfs to use mmap_prepare, mmap_complete
  fs/proc: add proc_mmap_[prepare, complete] hooks for procfs
  fs/proc: update vmcore to use .proc_mmap_[prepare, complete]
  kcov: update kcov to use mmap_prepare, mmap_complete

--
2.51.0
Re: [PATCH 00/16] expand mmap_prepare functionality, port more users
Posted by Alexander Gordeev 12 hours ago
On Mon, Sep 08, 2025 at 12:10:31PM +0100, Lorenzo Stoakes wrote:

Hi Lorenzo,

I am getting this warning with this series applied:

[Tue Sep  9 10:25:34 2025] ------------[ cut here ]------------
[Tue Sep  9 10:25:34 2025] WARNING: CPU: 0 PID: 563 at mm/memory.c:2942 remap_pfn_range_internal+0x36e/0x420
[Tue Sep  9 10:25:34 2025] Modules linked in: diag288_wdt(E) watchdog(E) ghash_s390(E) des_generic(E) prng(E) aes_s390(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha_common(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) pkey(E) autofs4(E) overlay(E) squashfs(E) loop(E)
[Tue Sep  9 10:25:34 2025] Unloaded tainted modules: hmac_s390(E):1
[Tue Sep  9 10:25:34 2025] CPU: 0 UID: 0 PID: 563 Comm: makedumpfile Tainted: G            E       6.17.0-rc4-gcc-mmap-00410-g87e982e900f0 #288 PREEMPT 
[Tue Sep  9 10:25:34 2025] Tainted: [E]=UNSIGNED_MODULE
[Tue Sep  9 10:25:34 2025] Hardware name: IBM 8561 T01 703 (LPAR)
[Tue Sep  9 10:25:34 2025] Krnl PSW : 0704d00180000000 00007fffe07f5ef2 (remap_pfn_range_internal+0x372/0x420)
[Tue Sep  9 10:25:34 2025]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[Tue Sep  9 10:25:34 2025] Krnl GPRS: 0000000004044400 001c0f000188b024 0000000000000000 001c0f000188b022
[Tue Sep  9 10:25:34 2025]            000078000c458120 000078000a0ca800 00000f000188b022 0000000000000711
[Tue Sep  9 10:25:34 2025]            000003ffa6e05000 00000f000188b024 000003ffa6a05000 0000000004044400
[Tue Sep  9 10:25:34 2025]            000003ffa7aadfa8 00007fffe2c35ea0 001c000000000000 00007f7fe0faf000
[Tue Sep  9 10:25:34 2025] Krnl Code: 00007fffe07f5ee6: 47000700                bc      0,1792
                                      00007fffe07f5eea: af000000                mc      0,0
                                     #00007fffe07f5eee: af000000                mc      0,0
                                     >00007fffe07f5ef2: a7f4ff11                brc     15,00007fffe07f5d14
                                      00007fffe07f5ef6: b904002b                lgr     %r2,%r11
                                      00007fffe07f5efa: c0e5000918bb    brasl   %r14,00007fffe0919070
                                      00007fffe07f5f00: a7f4ff39                brc     15,00007fffe07f5d72
                                      00007fffe07f5f04: e320f0c80004    lg      %r2,200(%r15)
[Tue Sep  9 10:25:34 2025] Call Trace:
[Tue Sep  9 10:25:34 2025]  [<00007fffe07f5ef2>] remap_pfn_range_internal+0x372/0x420 
[Tue Sep  9 10:25:34 2025]  [<00007fffe07f5fd4>] remap_pfn_range_complete+0x34/0x70 
[Tue Sep  9 10:25:34 2025]  [<00007fffe019879e>] remap_oldmem_pfn_range+0x13e/0x1a0 
[Tue Sep  9 10:25:34 2025]  [<00007fffe0bd3550>] mmap_complete_vmcore+0x520/0x7b0 
[Tue Sep  9 10:25:34 2025]  [<00007fffe077b05a>] __compat_vma_mmap_prepare+0x3ea/0x550 
[Tue Sep  9 10:25:34 2025]  [<00007fffe0ba27f0>] pde_mmap+0x160/0x1a0 
[Tue Sep  9 10:25:34 2025]  [<00007fffe0ba3750>] proc_reg_mmap+0xd0/0x180 
[Tue Sep  9 10:25:34 2025]  [<00007fffe0859904>] __mmap_new_vma+0x444/0x1290 
[Tue Sep  9 10:25:34 2025]  [<00007fffe085b0b4>] __mmap_region+0x964/0x1090 
[Tue Sep  9 10:25:34 2025]  [<00007fffe085dc7e>] mmap_region+0xde/0x250 
[Tue Sep  9 10:25:34 2025]  [<00007fffe08065fc>] do_mmap+0x80c/0xc30 
[Tue Sep  9 10:25:34 2025]  [<00007fffe077c708>] vm_mmap_pgoff+0x218/0x370 
[Tue Sep  9 10:25:34 2025]  [<00007fffe080467e>] ksys_mmap_pgoff+0x2ee/0x400 
[Tue Sep  9 10:25:34 2025]  [<00007fffe0804a3a>] __s390x_sys_old_mmap+0x15a/0x1d0 
[Tue Sep  9 10:25:34 2025]  [<00007fffe29f1cd6>] __do_syscall+0x146/0x410 
[Tue Sep  9 10:25:34 2025]  [<00007fffe2a17e1e>] system_call+0x6e/0x90 
[Tue Sep  9 10:25:34 2025] 2 locks held by makedumpfile/563:
[Tue Sep  9 10:25:34 2025]  #0: 000078000a0caab0 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x16e/0x370
[Tue Sep  9 10:25:34 2025]  #1: 00007fffe3864f50 (vmcore_cb_srcu){.+.+}-{0:0}, at: mmap_complete_vmcore+0x20c/0x7b0
[Tue Sep  9 10:25:34 2025] Last Breaking-Event-Address:
[Tue Sep  9 10:25:34 2025]  [<00007fffe07f5d0e>] remap_pfn_range_internal+0x18e/0x420
[Tue Sep  9 10:25:34 2025] irq event stamp: 19113
[Tue Sep  9 10:25:34 2025] hardirqs last  enabled at (19121): [<00007fffe0391910>] __up_console_sem+0xe0/0x120
[Tue Sep  9 10:25:34 2025] hardirqs last disabled at (19128): [<00007fffe03918f2>] __up_console_sem+0xc2/0x120
[Tue Sep  9 10:25:34 2025] softirqs last  enabled at (4934): [<00007fffe021cb8e>] handle_softirqs+0x70e/0xed0
[Tue Sep  9 10:25:34 2025] softirqs last disabled at (3919): [<00007fffe021b670>] __irq_exit_rcu+0x2e0/0x380
[Tue Sep  9 10:25:34 2025] ---[ end trace 0000000000000000 ]---

Thanks!
Re: [PATCH 00/16] expand mmap_prepare functionality, port more users
Posted by Lorenzo Stoakes 12 hours ago
On Tue, Sep 09, 2025 at 10:31:24AM +0200, Alexander Gordeev wrote:
> On Mon, Sep 08, 2025 at 12:10:31PM +0100, Lorenzo Stoakes wrote:
>
> Hi Lorenzo,
>
> I am getting this warning with this series applied:
>
> [Tue Sep  9 10:25:34 2025] ------------[ cut here ]------------
> [Tue Sep  9 10:25:34 2025] WARNING: CPU: 0 PID: 563 at mm/memory.c:2942 remap_pfn_range_internal+0x36e/0x420

OK yeah this is a very silly error :)

I'm asserting:

		VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) == VM_REMAP_FLAGS);

So err.. this should be:

		VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) != VM_REMAP_FLAGS);

This was a super late addition to the code and obviously I didn't test this as
well as I did the remap code in general, apologies.

Will fix on respin! :)

Cheers, Lorenzo
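
For readers following along, the polarity of the two checks can be compared
in a small user-space sketch. The flag names and bit values below are
placeholders chosen for this example (the thread does not spell out what
VM_REMAP_FLAGS contains), and warn_as_posted() / warn_as_fixed() are
hypothetical helpers that return true when the warning would fire.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bits; assumptions for this sketch, not kernel values. */
#define TOY_FLAG_A         0x1UL
#define TOY_FLAG_B         0x2UL
#define TOY_REQUIRED_FLAGS (TOY_FLAG_A | TOY_FLAG_B)

/* Condition as posted: fires when all required flags ARE set. */
static bool warn_as_posted(unsigned long flags)
{
	return (flags & TOY_REQUIRED_FLAGS) == TOY_REQUIRED_FLAGS;
}

/* Condition as corrected: fires when any required flag is missing. */
static bool warn_as_fixed(unsigned long flags)
{
	return (flags & TOY_REQUIRED_FLAGS) != TOY_REQUIRED_FLAGS;
}
```

With all required flags set, the posted form warns on a perfectly valid
mapping, while the corrected form stays quiet and only warns when a flag is
missing.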

Re: [PATCH 00/16] expand mmap_prepare functionality, port more users
Posted by Jan Kara 1 day, 7 hours ago
Hi Lorenzo!

On Mon 08-09-25 12:10:31, Lorenzo Stoakes wrote:
> Since commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file
> callback"), the f_op->mmap hook has been deprecated in favour of
> f_op->mmap_prepare.
> 
> This was introduced in order to make it possible for us to eventually
> eliminate the f_op->mmap hook which is highly problematic as it allows
> drivers and filesystems raw access to a VMA which is not yet correctly
> initialised.
> 
> This hook also introduces complexity for the memory mapping operation, as
> we must correctly unwind what we do should an error arise.
> 
> Overall this interface being so open has caused significant problems for
> us, including security issues. It is important for us to simply eliminate
> this as a source of problems.
> 
> Therefore this series continues what was established by extending the
> functionality further to permit more drivers and filesystems to use
> mmap_prepare.
> 
> After updating some areas that can simply use mmap_prepare as-is, and
> performing some housekeeping, we then introduce two new hooks:
> 
> f_op->mmap_complete - this is invoked at the point of the VMA having been
> correctly inserted, though with the VMA write lock still held. mmap_prepare
> must also be specified.
> 
> This expands the use of mmap_prepare to those callers which need to
> prepopulate mappings, as well as any which do genuinely require access to
> the VMA.
> 
> It's simple - we will let the caller access the VMA, but only once it's
> established. At this point unwinding is simple - we just unmap the
> VMA.
> 
> The VMA is also then correctly initialised at this stage so there can be no
> issues arising from a not-fully initialised VMA at this point.
> 
> The other newly added hook is:
> 
> f_op->mmap_abort - this is only valid in conjunction with mmap_prepare and
> mmap_complete. This is called should an error arise between mmap_prepare
> and mmap_complete (not as a result of mmap_prepare but rather some other
> part of the mapping logic).
> 
> This is required in case mmap_prepare wishes to establish state or locks
> which need to be cleaned up on completion. If we did not provide this, then
> this could not be permitted as this cleanup would otherwise not occur
> should the mapping fail between the two calls.

So seeing these new hooks makes me wonder: Shouldn't we rather implement
mmap(2) in a way more similar to how other f_op hooks behave, like ->read or
->write? I.e., a hook called at a rather high level - something like from
vm_mmap_pgoff() or a similar level - which would just call library
functions from MM for the stuff it needs to do. Filesystems would just do
their checks and call the generic mmap function with the vm_ops they want
to use, more complex users could then fill in the VMA before releasing
mmap_lock or do cleanup in case of failure... This would seem like a more
understandable API than several hooks with rules about what gets called
when.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [PATCH 00/16] expand mmap_prepare functionality, port more users
Posted by Lorenzo Stoakes 1 day, 6 hours ago
On Mon, Sep 08, 2025 at 03:27:52PM +0200, Jan Kara wrote:
> Hi Lorenzo!

Hey! :)

> > After updating some areas that can simply use mmap_prepare as-is, and
> > performing some housekeeping, we then introduce two new hooks:
> >
> > f_op->mmap_complete - this is invoked at the point of the VMA having been
> > correctly inserted, though with the VMA write lock still held. mmap_prepare
> > must also be specified.
> >
> > This expands the use of mmap_prepare to those callers which need to
> > prepopulate mappings, as well as any which do genuinely require access to
> > the VMA.
> >
> > It's simple - we will let the caller access the VMA, but only once it's
> > established. At this point unwinding is simple - we just unmap the
> > VMA.
> >
> > The VMA is also then correctly initialised at this stage so there can be no
> > issues arising from a not-fully initialised VMA at this point.
> >
> > The other newly added hook is:
> >
> > f_op->mmap_abort - this is only valid in conjunction with mmap_prepare and
> > mmap_complete. This is called should an error arise between mmap_prepare
> > and mmap_complete (not as a result of mmap_prepare but rather some other
> > part of the mapping logic).
> >
> > This is required in case mmap_prepare wishes to establish state or locks
> > which need to be cleaned up on completion. If we did not provide this, then
> > this could not be permitted as this cleanup would otherwise not occur
> > should the mapping fail between the two calls.
>
> So seeing these new hooks makes me wonder: Shouldn't we rather implement
> mmap(2) in a way more similar to how other f_op hooks behave, like ->read or
> ->write? I.e., a hook called at a rather high level - something like from
> vm_mmap_pgoff() or a similar level - which would just call library
> functions from MM for the stuff it needs to do. Filesystems would just do
> their checks and call the generic mmap function with the vm_ops they want
> to use, more complex users could then fill in the VMA before releasing
> mmap_lock or do cleanup in case of failure... This would seem like a more
> understandable API than several hooks with rules about what gets called
> when.

We can't just do everything at this level, because we need:

a. Information to actually know how to map the VMA before putting it in the
   maple tree.
b. Once it's there, anything else we need to do (typically - prepopulate).

The crux of this change is to avoid horrors around the VMA being passed
around not yet being properly initialised, and yet being accessible for
drivers to do 'whatever' with.

Ideally we'd have only one case, and for _nearly all_ filesystems this is
how it is actually.

But sadly some _do need_ to do extra work afterwards, most notably,
prepopulation.

Cheers, Lorenzo
Re: [PATCH 00/16] expand mmap_prepare functionality, port more users
Posted by Jason Gunthorpe 1 day, 6 hours ago
On Mon, Sep 08, 2025 at 03:48:36PM +0100, Lorenzo Stoakes wrote:
> But sadly some _do need_ to do extra work afterwards, most notably,
> prepopulation.

I think Jan is suggesting something more like

mmap_op()
{
   struct vma_desc desc = {};

   desc.[..] = x
   desc.[..] = y
   desc.[..] = z
   vma = vma_alloc(desc);

   ret = remap_pfn(vma);
   if (ret) goto err_vma;

   return vma_commit(vma);

err_vma:
   vma_dealloc(vma);
  return ERR_PTR(ret);
}

Jason
Re: [PATCH 00/16] expand mmap_prepare functionality, port more users
Posted by Lorenzo Stoakes 1 day, 5 hours ago
On Mon, Sep 08, 2025 at 12:04:04PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 03:48:36PM +0100, Lorenzo Stoakes wrote:
> > But sadly some _do need_ to do extra work afterwards, most notably,
> > prepopulation.
>
> I think Jan is suggesting something more like
>
> mmap_op()
> {
>    struct vma_desc desc = {};
>
>    desc.[..] = x
>    desc.[..] = y
>    desc.[..] = z
>    vma = vma_alloc(desc);
>
>    ret = remap_pfn(vma);
>    if (ret) goto err_vma;
>
>    return vma_commit(vma);
>
> err_vma:
>    vma_dealloc(vma);
>   return ERR_PTR(ret);
> }
>
> Jason

Right, unfortunately the locking and the subtle issues around memory mapping
really preclude something like this I think. We really do need to keep control
over that.

And since partly the motivation here is 'drivers do insane things when given too
much freedom', I feel this would not improve that :)

If you look at do_mmap() -> mmap_region() -> __mmap_region() etc. you can see a
lot of that.

We also had a security issue arise as a result of incorrect error path handling,
I don't think letting a driver writer handle that is wise.

It's a nice idea, but I just think this stuff is too sensitive for that. And in
any case, it wouldn't likely be tractable to convert legacy code to this.

Cheers, Lorenzo