Since commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file
callback"), the f_op->mmap hook has been deprecated in favour of
f_op->mmap_prepare.

This was introduced in order to make it possible for us to eventually
eliminate the f_op->mmap hook, which is highly problematic as it allows
drivers and filesystems raw access to a VMA which is not yet correctly
initialised.

This hook also introduces complexity for the memory mapping operation, as
we must correctly unwind what we do should an error arise.

Overall this interface being so open has caused significant problems for
us, including security issues, so it is important for us to simply
eliminate this as a source of problems.

Therefore this series continues what was established by extending the
functionality further to permit more drivers and filesystems to use
mmap_prepare.

After updating some areas that can simply use mmap_prepare as-is, and
performing some housekeeping, we then introduce two new hooks:

f_op->mmap_complete - this is invoked at the point of the VMA having been
correctly inserted, though with the VMA write lock still held. mmap_prepare
must also be specified.

This expands the use of mmap_prepare to those callers which need to
prepopulate mappings, as well as any which do genuinely require access to
the VMA.

It's simple - we will let the caller access the VMA, but only once it's
established. At this point unwinding is simple - we just unmap the VMA.

The VMA is also then correctly initialised at this stage so there can be no
issues arising from a not-fully initialised VMA at this point.

The other newly added hook is:

f_op->mmap_abort - this is only valid in conjunction with mmap_prepare and
mmap_complete. This is called should an error arise between mmap_prepare
and mmap_complete (not as a result of mmap_prepare but rather some other
part of the mapping logic).

This is required in case mmap_prepare wishes to establish state or locks
which need to be cleaned up on completion. If we did not provide this, then
this could not be permitted as this cleanup would otherwise not occur
should the mapping fail between the two calls.

We then add split remap_pfn_range*() functions which allow for PFN remap (a
typical mapping prepopulation operation) to be split between a prepare and
a complete step, as well as io_remap_pfn_range_[prepare, complete] for a
similar purpose.

From there we update various mm-adjacent logic to use this functionality as
a first set of changes, as well as the resctrl and cramfs filesystems to
round off the non-stacked filesystem instances.


REVIEWER NOTE:
~~~~~~~~~~~~~~

I considered putting the complete, abort callbacks in vm_ops, however this
won't work because then we would be unable to adjust helpers like
generic_file_mmap_prepare() (which provides vm_ops) to provide the correct
complete, abort callbacks.

Conceptually it also makes more sense to have these in f_op as they are
one-off operations performed at mmap time to establish the VMA, rather than
a property of the VMA itself.

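To make the ordering of the three hooks concrete, here is a minimal
userspace sketch of the lifecycle as the cover letter describes it. It is
not kernel code: struct model_desc, struct model_vma, struct model_ops,
model_mmap() and the demo_* hooks are all hypothetical names invented for
illustration, and the real hook signatures in the series may well differ.
The only things taken from the text above are the rules: abort runs only
for failures between prepare and complete, and a failure in complete is
unwound by simply unmapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are NOT the kernel's types. */
struct model_desc { unsigned long start, end; void *private_data; };
struct model_vma { unsigned long start, end; void *private_data; bool inserted; };

struct model_ops {
    int (*prepare)(struct model_desc *desc);   /* may take locks, set state */
    int (*complete)(struct model_vma *vma);    /* sees an established VMA */
    void (*abort)(void *private_data);         /* undoes prepare's state */
};

enum model_result { MODEL_OK, MODEL_PREPARE_FAILED, MODEL_ABORTED, MODEL_UNMAPPED };

/*
 * Models the ordering described in the cover letter. core_fails injects a
 * failure in the mapping logic that runs between the two hooks.
 */
static enum model_result model_mmap(const struct model_ops *ops,
                                    struct model_desc *desc,
                                    struct model_vma *vma, bool core_fails)
{
    assert(ops && ops->prepare);
    vma->inserted = false;

    if (ops->prepare(desc))
        return MODEL_PREPARE_FAILED;    /* abort is not called for this */

    if (core_fails) {
        if (ops->abort)
            ops->abort(desc->private_data);
        return MODEL_ABORTED;
    }

    /* Only now does a fully initialised VMA exist. */
    vma->start = desc->start;
    vma->end = desc->end;
    vma->private_data = desc->private_data;
    vma->inserted = true;

    if (ops->complete && ops->complete(vma)) {
        vma->inserted = false;          /* unwinding is "just unmap" */
        return MODEL_UNMAPPED;
    }
    return MODEL_OK;
}

/* Example hooks: prepare takes a pretend lock, complete or abort drops it. */
static int demo_lock_depth;
static bool demo_complete_fails;

static int demo_prepare(struct model_desc *desc)
{
    demo_lock_depth++;
    desc->private_data = &demo_lock_depth;
    return 0;
}

static int demo_complete(struct model_vma *vma)
{
    (*(int *)vma->private_data)--;
    return demo_complete_fails ? -1 : 0;
}

static void demo_abort(void *private_data)
{
    (*(int *)private_data)--;
}

static const struct model_ops demo_ops = {
    .prepare = demo_prepare,
    .complete = demo_complete,
    .abort = demo_abort,
};
```

Calling model_mmap(&demo_ops, &desc, &vma, true) exercises the abort path;
in every path the pretend lock taken in demo_prepare() ends up released,
which is the property mmap_abort exists to guarantee.
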
Lorenzo Stoakes (16):
  mm/shmem: update shmem to use mmap_prepare
  device/dax: update devdax to use mmap_prepare
  mm: add vma_desc_size(), vma_desc_pages() helpers
  relay: update relay to use mmap_prepare
  mm/vma: rename mmap internal functions to avoid confusion
  mm: introduce the f_op->mmap_complete, mmap_abort hooks
  doc: update porting, vfs documentation for mmap_[complete, abort]
  mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
  mm: introduce io_remap_pfn_range_prepare, complete
  mm/hugetlb: update hugetlbfs to use mmap_prepare, mmap_complete
  mm: update mem char driver to use mmap_prepare, mmap_complete
  mm: update resctl to use mmap_prepare, mmap_complete, mmap_abort
  mm: update cramfs to use mmap_prepare, mmap_complete
  fs/proc: add proc_mmap_[prepare, complete] hooks for procfs
  fs/proc: update vmcore to use .proc_mmap_[prepare, complete]
  kcov: update kcov to use mmap_prepare, mmap_complete

 Documentation/filesystems/porting.rst |   9 ++
 Documentation/filesystems/vfs.rst     |  35 +++++++
 arch/csky/include/asm/pgtable.h       |   5 +
 arch/mips/alchemy/common/setup.c      |  28 +++++-
 arch/mips/include/asm/pgtable.h       |  10 ++
 arch/s390/kernel/crash_dump.c         |   6 +-
 arch/sparc/include/asm/pgtable_32.h   |  29 +++++-
 arch/sparc/include/asm/pgtable_64.h   |  29 +++++-
 drivers/char/mem.c                    |  80 ++++++++-------
 drivers/dax/device.c                  |  32 +++---
 fs/cramfs/inode.c                     | 134 ++++++++++++++++++--------
 fs/hugetlbfs/inode.c                  |  86 +++++++++--------
 fs/ntfs3/file.c                       |   2 +-
 fs/proc/inode.c                       |  13 ++-
 fs/proc/vmcore.c                      |  53 +++++++---
 fs/resctrl/pseudo_lock.c              |  56 ++++++++---
 include/linux/fs.h                    |   4 +
 include/linux/mm.h                    |  53 +++++++++-
 include/linux/mm_types.h              |   5 +
 include/linux/proc_fs.h               |   5 +
 include/linux/shmem_fs.h              |   3 +-
 include/linux/vmalloc.h               |  10 +-
 kernel/kcov.c                         |  40 +++++---
 kernel/relay.c                        |  32 +++---
 mm/memory.c                           | 128 +++++++++++++++---------
 mm/secretmem.c                        |   2 +-
 mm/shmem.c                            |  49 +++++++---
 mm/util.c                             |  18 +++-
 mm/vma.c                              |  96 +++++++++++++++---
 mm/vmalloc.c                          |  16 ++-
 tools/testing/vma/vma_internal.h      |  31 +++++-
 31 files changed, 810 insertions(+), 289 deletions(-)

--
2.51.0

On Mon, Sep 08, 2025 at 12:10:31PM +0100, Lorenzo Stoakes wrote:

Hi Lorenzo,

I am getting this warning with this series applied:

[Tue Sep 9 10:25:34 2025] ------------[ cut here ]------------
[Tue Sep 9 10:25:34 2025] WARNING: CPU: 0 PID: 563 at mm/memory.c:2942 remap_pfn_range_internal+0x36e/0x420
[Tue Sep 9 10:25:34 2025] Modules linked in: diag288_wdt(E) watchdog(E) ghash_s390(E) des_generic(E) prng(E) aes_s390(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha_common(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) pkey(E) autofs4(E) overlay(E) squashfs(E) loop(E)
[Tue Sep 9 10:25:34 2025] Unloaded tainted modules: hmac_s390(E):1
[Tue Sep 9 10:25:34 2025] CPU: 0 UID: 0 PID: 563 Comm: makedumpfile Tainted: G E 6.17.0-rc4-gcc-mmap-00410-g87e982e900f0 #288 PREEMPT
[Tue Sep 9 10:25:34 2025] Tainted: [E]=UNSIGNED_MODULE
[Tue Sep 9 10:25:34 2025] Hardware name: IBM 8561 T01 703 (LPAR)
[Tue Sep 9 10:25:34 2025] Krnl PSW : 0704d00180000000 00007fffe07f5ef2 (remap_pfn_range_internal+0x372/0x420)
[Tue Sep 9 10:25:34 2025]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[Tue Sep 9 10:25:34 2025] Krnl GPRS: 0000000004044400 001c0f000188b024 0000000000000000 001c0f000188b022
[Tue Sep 9 10:25:34 2025]            000078000c458120 000078000a0ca800 00000f000188b022 0000000000000711
[Tue Sep 9 10:25:34 2025]            000003ffa6e05000 00000f000188b024 000003ffa6a05000 0000000004044400
[Tue Sep 9 10:25:34 2025]            000003ffa7aadfa8 00007fffe2c35ea0 001c000000000000 00007f7fe0faf000
[Tue Sep 9 10:25:34 2025] Krnl Code: 00007fffe07f5ee6: 47000700      bc 0,1792
                                     00007fffe07f5eea: af000000      mc 0,0
                                    #00007fffe07f5eee: af000000      mc 0,0
                                    >00007fffe07f5ef2: a7f4ff11      brc 15,00007fffe07f5d14
                                     00007fffe07f5ef6: b904002b      lgr %r2,%r11
                                     00007fffe07f5efa: c0e5000918bb  brasl %r14,00007fffe0919070
                                     00007fffe07f5f00: a7f4ff39      brc 15,00007fffe07f5d72
                                     00007fffe07f5f04: e320f0c80004  lg %r2,200(%r15)
[Tue Sep 9 10:25:34 2025] Call Trace:
[Tue Sep 9 10:25:34 2025]  [<00007fffe07f5ef2>] remap_pfn_range_internal+0x372/0x420
[Tue Sep 9 10:25:34 2025]  [<00007fffe07f5fd4>] remap_pfn_range_complete+0x34/0x70
[Tue Sep 9 10:25:34 2025]  [<00007fffe019879e>] remap_oldmem_pfn_range+0x13e/0x1a0
[Tue Sep 9 10:25:34 2025]  [<00007fffe0bd3550>] mmap_complete_vmcore+0x520/0x7b0
[Tue Sep 9 10:25:34 2025]  [<00007fffe077b05a>] __compat_vma_mmap_prepare+0x3ea/0x550
[Tue Sep 9 10:25:34 2025]  [<00007fffe0ba27f0>] pde_mmap+0x160/0x1a0
[Tue Sep 9 10:25:34 2025]  [<00007fffe0ba3750>] proc_reg_mmap+0xd0/0x180
[Tue Sep 9 10:25:34 2025]  [<00007fffe0859904>] __mmap_new_vma+0x444/0x1290
[Tue Sep 9 10:25:34 2025]  [<00007fffe085b0b4>] __mmap_region+0x964/0x1090
[Tue Sep 9 10:25:34 2025]  [<00007fffe085dc7e>] mmap_region+0xde/0x250
[Tue Sep 9 10:25:34 2025]  [<00007fffe08065fc>] do_mmap+0x80c/0xc30
[Tue Sep 9 10:25:34 2025]  [<00007fffe077c708>] vm_mmap_pgoff+0x218/0x370
[Tue Sep 9 10:25:34 2025]  [<00007fffe080467e>] ksys_mmap_pgoff+0x2ee/0x400
[Tue Sep 9 10:25:34 2025]  [<00007fffe0804a3a>] __s390x_sys_old_mmap+0x15a/0x1d0
[Tue Sep 9 10:25:34 2025]  [<00007fffe29f1cd6>] __do_syscall+0x146/0x410
[Tue Sep 9 10:25:34 2025]  [<00007fffe2a17e1e>] system_call+0x6e/0x90
[Tue Sep 9 10:25:34 2025] 2 locks held by makedumpfile/563:
[Tue Sep 9 10:25:34 2025]  #0: 000078000a0caab0 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x16e/0x370
[Tue Sep 9 10:25:34 2025]  #1: 00007fffe3864f50 (vmcore_cb_srcu){.+.+}-{0:0}, at: mmap_complete_vmcore+0x20c/0x7b0
[Tue Sep 9 10:25:34 2025] Last Breaking-Event-Address:
[Tue Sep 9 10:25:34 2025]  [<00007fffe07f5d0e>] remap_pfn_range_internal+0x18e/0x420
[Tue Sep 9 10:25:34 2025] irq event stamp: 19113
[Tue Sep 9 10:25:34 2025] hardirqs last enabled at (19121): [<00007fffe0391910>] __up_console_sem+0xe0/0x120
[Tue Sep 9 10:25:34 2025] hardirqs last disabled at (19128): [<00007fffe03918f2>] __up_console_sem+0xc2/0x120
[Tue Sep 9 10:25:34 2025] softirqs last enabled at (4934): [<00007fffe021cb8e>] handle_softirqs+0x70e/0xed0
[Tue Sep 9 10:25:34 2025] softirqs last disabled at (3919): [<00007fffe021b670>] __irq_exit_rcu+0x2e0/0x380
[Tue Sep 9 10:25:34 2025] ---[ end trace 0000000000000000 ]---

Thanks!

On Tue, Sep 09, 2025 at 10:31:24AM +0200, Alexander Gordeev wrote:
> On Mon, Sep 08, 2025 at 12:10:31PM +0100, Lorenzo Stoakes wrote:
>
> Hi Lorenzo,
>
> I am getting this warning with this series applied:
>
> [Tue Sep 9 10:25:34 2025] ------------[ cut here ]------------
> [Tue Sep 9 10:25:34 2025] WARNING: CPU: 0 PID: 563 at mm/memory.c:2942 remap_pfn_range_internal+0x36e/0x420

OK yeah this is a very silly error :)

I'm asserting:

        VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) == VM_REMAP_FLAGS);

So err.. this should be:

        VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) != VM_REMAP_FLAGS);

This was a super late addition to the code and obviously I didn't test this
as well as I did the remap code in general, apologies.

Will fix on respin! :)

Cheers, Lorenzo

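For readers following along, the inverted check can be reproduced in
userspace. This is a sketch under assumptions: the four flag values are
copied from include/linux/mm.h as I understand it, VM_REMAP_FLAGS is
assumed to be the union of the flags that remap_pfn_range() sets, and
warns_as_posted(), warns_as_fixed() and check_conditions() are
hypothetical helper names, not anything in the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values, mirroring include/linux/mm.h. */
#define VM_PFNMAP      0x00000400UL
#define VM_IO          0x00004000UL
#define VM_DONTEXPAND  0x00040000UL
#define VM_DONTDUMP    0x04000000UL

/* Assumed definition: every flag a PFN-remapped VMA is expected to carry. */
#define VM_REMAP_FLAGS (VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP)

/* Condition as originally posted: fires when every remap flag IS set. */
static bool warns_as_posted(unsigned long vm_flags)
{
    return (vm_flags & VM_REMAP_FLAGS) == VM_REMAP_FLAGS;
}

/* Corrected condition: fires when any remap flag is missing. */
static bool warns_as_fixed(unsigned long vm_flags)
{
    return (vm_flags & VM_REMAP_FLAGS) != VM_REMAP_FLAGS;
}

static void check_conditions(void)
{
    unsigned long good = VM_REMAP_FLAGS;     /* correctly prepared VMA */
    unsigned long bad = VM_IO | VM_PFNMAP;   /* two flags missing */

    assert(warns_as_posted(good));    /* spurious warning, as reported */
    assert(!warns_as_fixed(good));
    assert(!warns_as_posted(bad));    /* the real problem goes unnoticed */
    assert(warns_as_fixed(bad));
}
```

With `==` the warning fires precisely on the well-formed case and stays
silent on the malformed one, which matches the report: the VMA was fine
and the assertion was backwards.
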
Hi Lorenzo!

On Mon 08-09-25 12:10:31, Lorenzo Stoakes wrote:
> Since commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file
> callback"), the f_op->mmap hook has been deprecated in favour of
> f_op->mmap_prepare.
>
> This was introduced in order to make it possible for us to eventually
> eliminate the f_op->mmap hook which is highly problematic as it allows
> drivers and filesystems raw access to a VMA which is not yet correctly
> initialised.
>
> This hook also introduces complexity for the memory mapping operation, as
> we must correctly unwind what we do should an error arise.
>
> Overall this interface being so open has caused significant problems for
> us, including security issues, it is important for us to simply eliminate
> this as a source of problems.
>
> Therefore this series continues what was established by extending the
> functionality further to permit more drivers and filesystems to use
> mmap_prepare.
>
> After updating some areas that can simply use mmap_prepare as-is, and
> performing some housekeeping, we then introduce two new hooks:
>
> f_op->mmap_complete - this is invoked at the point of the VMA having been
> correctly inserted, though with the VMA write lock still held. mmap_prepare
> must also be specified.
>
> This expands the use of mmap_prepare to those callers which need to
> prepopulate mappings, as well as any which do genuinely require access to
> the VMA.
>
> It's simple - we will let the caller access the VMA, but only once it's
> established. At this point unwinding is simple - we just unmap the VMA.
>
> The VMA is also then correctly initialised at this stage so there can be no
> issues arising from a not-fully initialised VMA at this point.
>
> The other newly added hook is:
>
> f_op->mmap_abort - this is only valid in conjunction with mmap_prepare and
> mmap_complete. This is called should an error arise between mmap_prepare
> and mmap_complete (not as a result of mmap_prepare but rather some other
> part of the mapping logic).
>
> This is required in case mmap_prepare wishes to establish state or locks
> which need to be cleaned up on completion. If we did not provide this, then
> this could not be permitted as this cleanup would otherwise not occur
> should the mapping fail between the two calls.

So seeing these new hooks makes me wonder: Shouldn't we rather implement
mmap(2) in a way more similar to how other f_op hooks like ->read or
->write behave? I.e., a hook called at a rather high level - something like
from vm_mmap_pgoff() or a similar level - which would just call library
functions from MM for the stuff it needs to do. Filesystems would just do
their checks and call the generic mmap function with the vm_ops they want
to use, more complex users could then fill in the VMA before releasing
mmap_lock or do cleanup in case of failure... This would seem like a more
understandable API than several hooks with rules about what gets called
when.

								Honza

--
Jan Kara <jack@suse.com>
SUSE Labs, CR

On Mon, Sep 08, 2025 at 03:27:52PM +0200, Jan Kara wrote:
> Hi Lorenzo!

Hey! :)

> > After updating some areas that can simply use mmap_prepare as-is, and
> > performing some housekeeping, we then introduce two new hooks:
> >
> > f_op->mmap_complete - this is invoked at the point of the VMA having been
> > correctly inserted, though with the VMA write lock still held. mmap_prepare
> > must also be specified.
> >
> > This expands the use of mmap_prepare to those callers which need to
> > prepopulate mappings, as well as any which do genuinely require access to
> > the VMA.
> >
> > It's simple - we will let the caller access the VMA, but only once it's
> > established. At this point unwinding is simple - we just unmap the VMA.
> >
> > The VMA is also then correctly initialised at this stage so there can be no
> > issues arising from a not-fully initialised VMA at this point.
> >
> > The other newly added hook is:
> >
> > f_op->mmap_abort - this is only valid in conjunction with mmap_prepare and
> > mmap_complete. This is called should an error arise between mmap_prepare
> > and mmap_complete (not as a result of mmap_prepare but rather some other
> > part of the mapping logic).
> >
> > This is required in case mmap_prepare wishes to establish state or locks
> > which need to be cleaned up on completion. If we did not provide this, then
> > this could not be permitted as this cleanup would otherwise not occur
> > should the mapping fail between the two calls.
>
> So seeing these new hooks makes me wonder: Shouldn't we rather implement
> mmap(2) in a way more similar to how other f_op hooks like ->read or
> ->write behave? I.e., a hook called at a rather high level - something like
> from vm_mmap_pgoff() or a similar level - which would just call library
> functions from MM for the stuff it needs to do. Filesystems would just do
> their checks and call the generic mmap function with the vm_ops they want
> to use, more complex users could then fill in the VMA before releasing
> mmap_lock or do cleanup in case of failure... This would seem like a more
> understandable API than several hooks with rules about what gets called
> when.

We can't just do everything at this level, because we need:

a. Information to actually know how to map the VMA before putting it in the
   maple tree.
b. Once it's there, anything else we need to do (typically - prepopulate).

The crux of this change is to avoid horrors around the VMA being passed
around not yet being properly initialised, and yet being accessible for
drivers to do 'whatever' with.

Ideally we'd have only one case, and for _nearly all_ filesystems this is
how it is actually.

But sadly some _do need_ to do extra work afterwards, most notably,
prepopulation.

Cheers, Lorenzo

On Mon, Sep 08, 2025 at 03:48:36PM +0100, Lorenzo Stoakes wrote:

> But sadly some _do need_ to do extra work afterwards, most notably,
> prepopulation.

I think Jan is suggesting something more like

mmap_op()
{
        struct vma_desc desc = {};

        desc.[..] = x
        desc.[..] = y
        desc.[..] = z
        vma = vma_alloc(desc);

        ret = remap_pfn(vma);
        if (ret)
                goto err_vma;

        return vma_commit(vma);

err_vma:
        vma_dealloc(vma);
        return ERR_PTR(ret);
}

Jason

On Mon, Sep 08, 2025 at 12:04:04PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 03:48:36PM +0100, Lorenzo Stoakes wrote:
> > But sadly some _do need_ to do extra work afterwards, most notably,
> > prepopulation.
>
> I think Jan is suggesting something more like
>
> mmap_op()
> {
>         struct vma_desc desc = {};
>
>         desc.[..] = x
>         desc.[..] = y
>         desc.[..] = z
>         vma = vma_alloc(desc);
>
>         ret = remap_pfn(vma);
>         if (ret)
>                 goto err_vma;
>
>         return vma_commit(vma);
>
> err_vma:
>         vma_dealloc(vma);
>         return ERR_PTR(ret);
> }
>
> Jason

Right, unfortunately the locking and the subtle issues around memory
mapping really preclude something like this I think. We really do need to
keep control over that.

And since partly the motivation here is 'drivers do insane things when
given too much freedom', I feel this would not improve that :)

If you look at do_mmap() -> mmap_region() -> __mmap_region() etc. you can
see a lot of that.

We also had a security issue arise as a result of incorrect error path
handling; I don't think letting a driver writer handle that is wise.

It's a nice idea, but I just think this stuff is too sensitive for that.
And in any case, it wouldn't likely be tractable to convert legacy code to
this.

Cheers, Lorenzo