Memory pages shared between processes require page table entries
(PTEs) in each process that maps them. Each of these PTEs consumes
some memory, and as long as the number of mappings being maintained
is small enough, the space consumed by page tables is not
objectionable. When very few memory pages are shared between
processes, the number of PTEs to maintain is mostly constrained by
the number of pages of memory on the system. As the number of shared
pages and the number of times pages are shared go up, the amount of
memory consumed by page tables starts to become significant. This
issue does not apply to threads: any number of threads can share the
same pages inside a process while sharing the same PTEs. Extending
this same model to sharing pages across processes can eliminate the
issue for cross-process sharing as well.
Some field deployments commonly see memory pages shared across
thousands of processes. On x86_64, each page requires a PTE that is
8 bytes long, which is very small compared to the 4K page size. When
2000 processes map the same page in their address space, each one of
them requires 8 bytes for its PTE, and together that adds up to 8K
of memory just to hold the PTEs for one 4K page. On a database
server with a 300GB SGA, a system crash was seen with an
out-of-memory condition when 1500+ clients tried to share this SGA
even though the system had 512GB of memory. On this server, the
worst-case scenario of all 1500 processes mapping every page of the
SGA would have required 878GB+ just for the PTEs. If these PTEs
could be shared, a substantial amount of memory could be saved.
This patch series implements a mechanism that allows userspace
processes to opt into sharing PTEs. It adds a new in-memory
filesystem - msharefs. A file created on msharefs represents a
shared region where all processes mapping that region will map
objects within it with shared PTEs. When the file is created,
a new host mm struct is created to hold the shared page tables
and vmas for objects later mapped into the shared region. This
host mm struct is associated with the file and not with a task.
When a process mmap's the shared region, a vm flag VM_MSHARE
is added to the vma. On page fault the vma is checked for the
presence of the VM_MSHARE flag. If found, the host mm is
searched for a vma that covers the fault address. Fault handling
then continues using that host vma which establishes PTEs in the
host mm. Fault handling in a shared region also links the shared
page table to the process page table if the shared page table
already exists.
Ioctls are used to map and unmap objects in the shared region and
to (eventually) perform other operations on the shared objects such
as changing protections.
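For reference, here is a userspace-side sketch of the mapping request
structure implied by the examples below. This is a hypothetical
reconstruction from the fields used in this cover letter; the
authoritative definition lives in include/uapi/linux/msharefs.h in
the series, and the actual field types, ordering, and padding may
differ:

```c
#include <stdint.h>

/*
 * Hypothetical sketch of struct mshare_create, reconstructed from
 * the fields used in this cover letter's examples. Not the real
 * uapi header.
 */
struct mshare_create {
	uint64_t region_offset;	/* byte offset of the mapping within the mshare region */
	uint64_t size;		/* length of the mapping */
	uint64_t offset;	/* offset into the backing file (0 for anon) */
	uint32_t prot;		/* PROT_* bits for the mapping */
	uint32_t flags;		/* MAP_* bits, e.g. MAP_ANONYMOUS | MAP_SHARED */
	int32_t fd;		/* backing file, or -1 with MAP_ANONYMOUS */
};
```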
API
===
The steps to use this feature are:
1. Mount msharefs on /sys/fs/mshare -
mount -t msharefs msharefs /sys/fs/mshare
2. mshare regions have alignment and size requirements. The start
address for the region must be aligned to an address boundary and
be a multiple of fixed size. This alignment and size requirement
can be obtained by reading the file /sys/fs/mshare/mshare_info
which returns a number in text format. mshare regions must be
aligned to this boundary and be a multiple of this size.
3. For the process creating an mshare region:
a. Create a file on /sys/fs/mshare, for example -
fd = open("/sys/fs/mshare/shareme",
O_RDWR|O_CREAT|O_EXCL, 0600);
b. Establish the size of the region
ftruncate(fd, BUFFER_SIZE);
c. Map some memory in the region
struct mshare_create mcreate;
mcreate.region_offset = 0;
mcreate.size = BUFFER_SIZE;
mcreate.offset = 0;
mcreate.prot = PROT_READ | PROT_WRITE;
mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
mcreate.fd = -1;
ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate)
d. Map the mshare region into the process
mmap((void *)TB(2), BUFFER_SIZE, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_SHARED, fd, 0);
e. Write and read to mshared region normally.
4. For processes attaching an mshare region:
a. Open the file on msharefs, for example -
fd = open("/sys/fs/mshare/shareme", O_RDWR);
b. Get information about mshare'd region from the file:
struct stat sb;
fstat(fd, &sb);
mshare_size = sb.st_size;
c. Map the mshare'd region into the process
mmap((void *)TB(2), mshare_size, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_SHARED, fd, 0);
5. To delete the mshare region -
unlink("/sys/fs/mshare/shareme");
Example Code
============
A snippet of the code that a donor process would run:
-----------------
struct mshare_create mcreate;
char buf[128] = { 0 };

fd = open("/sys/fs/mshare/mshare_info", O_RDONLY);
read(fd, buf, sizeof(buf) - 1);
alignsize = atoi(buf);
close(fd);
fd = open("/sys/fs/mshare/shareme", O_RDWR|O_CREAT|O_EXCL, 0600);
start = alignsize * 4;
size = alignsize * 2;
ftruncate(fd, size);
mcreate.region_offset = 0;
mcreate.size = size;
mcreate.offset = 0;
mcreate.prot = PROT_READ | PROT_WRITE;
mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
mcreate.fd = -1;
ret = ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);
if (ret < 0)
perror("ERROR: MSHAREFS_CREATE_MAPPING");
addr = mmap((void *)start, size, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_SHARED, fd, 0);
if (addr == MAP_FAILED)
perror("ERROR: mmap failed");
strncpy(addr, "Some random shared text",
sizeof("Some random shared text"));
-----------------
A snippet of the code that a consumer process would execute:
-----------------
fd = open("/sys/fs/mshare/shareme", O_RDWR);
fstat(fd, &sb);
size = sb.st_size;
if (!size) {
	fprintf(stderr, "ERROR: mshare region not initialized\n");
	exit(1);
}
addr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (addr == MAP_FAILED)
	perror("ERROR: mmap failed");
printf("Guest mmap at %p:\n", addr);
printf("%s\n", (char *)addr);
printf("\nDone\n");
-----------------
v3:
- Based on mm-new as of 2025-08-15
- (Fix) When unmapping an msharefs VMA, unlink the shared page tables
from the process page table using the new unmap_page_range vm_ops hook
rather than having free_pgtables() skip mshare VMAs (Jann Horn).
- (Fix) Keep a reference count on shared PUD pages to prevent UAF when
the unmap of objects in the mshare region also frees shared page
tables.
- (New) Support mapping files and anonymous hugetlb memory in an mshare
region.
- (New) Implement ownership of mshare regions. The process that
creates an mshare region is assigned as the owner. See the patch for
details.
- (Changed) Undid previous attempt at cgroup support. Cgroup accounting
is now directed to the owner process.
- (TBD) Support for mmu notifiers is not yet implemented. There are some
hurdles to be overcome. Mentioned here because it came up in comments
on the v2 series (Jann Horn).
v2:
(https://lore.kernel.org/all/20250404021902.48863-1-anthony.yznaga@oracle.com/)
- Based on mm-unstable as of 2025-04-03 (8ff02705ba8f)
- Set mshare size via fallocate or ftruncate instead of MSHAREFS_SET_SIZE.
Removed MSHAREFS_SET_SIZE/MSHAREFS_GET_SIZE ioctls. Use stat to get size.
(David H)
- Remove spinlock from mshare_data. Initializing the size is protected by
the inode lock.
- Support mapping a single mshare region at different virtual addresses.
- Support system selection of the start address when mmap'ing an mshare
region.
- Changed MSHAREFS_CREATE_MAPPING and MSHAREFS_UNMAP to use a byte offset
to specify the start of a mapping.
- Updated documentation.
v1:
(https://lore.kernel.org/linux-mm/20250124235454.84587-1-anthony.yznaga@oracle.com/)
- Based on mm-unstable mm-hotfixes-stable-2025-01-16-21-11
- Use mshare size instead of start address to check if mshare region
has been initialized.
- Share page tables at PUD level instead of PGD.
- Rename vma_is_shared() to vma_is_mshare() (James H / David H)
- Introduce and use mmap_read_lock_nested() (Kirill)
- Use an mmu notifier to flush all TLBs when updating shared pagetable
mappings. (Dave Hansen)
- Move logic for finding the shared vma to use to handle a fault from
handle_mm_fault() to do_user_addr_fault() because the arch-specific
fault handling checks vma flags for access permissions.
- Add CONFIG_MSHARE / ARCH_SUPPORTS_MSHARE
- Add msharefs_get_unmapped_area()
- Implemented vm_ops->unmap_page_range (Kirill)
- Update free_pgtables/free_pgd_range to free process pagetable levels
but not shared pagetable levels.
- A first take at cgroup support
RFC v2 -> v3:
- Now based on 6.11-rc5
- Addressed many comments from v2.
- Simplified filesystem code. Removed refcounting of the
shared mm_struct allocated for an mshare file. The mm_struct
and the pagetables and mappings it contains are freed when
the inode is evicted.
- Switched to an ioctl-based interface. Ioctls implemented
are used to set and get the start address and size of an
mshare region and to map objects into an mshare region
(only anon shared memory is supported in this series).
- Updated example code
[1] v2: https://lore.kernel.org/linux-mm/cover.1656531090.git.khalid.aziz@oracle.com/
RFC v1 -> v2:
- Eliminated mshare and mshare_unlink system calls and
replaced API with standard mmap and unlink (Based upon
v1 patch discussions and LSF/MM discussions)
- All fd based API (based upon feedback and suggestions from
Andy Lutomirski, Eric Biederman, Kirill and others)
- Added a file /sys/fs/mshare/mshare_info to provide
alignment and size requirement info (based upon feedback
from Dave Hansen, Mark Hemment and discussions at LSF/MM)
- Addressed TODOs in v1
- Added support for directories in msharefs
- Added locks around any time vma is touched (Dave Hansen)
- Eliminated the need to point vm_mm in original vmas to the
newly synthesized mshare mm
- Ensured mmap_read_unlock is called for correct mm in
handle_mm_fault (Dave Hansen)
Anthony Yznaga (15):
mm/mshare: allocate an mm_struct for msharefs files
mm/mshare: add ways to set the size of an mshare region
mm/mshare: flush all TLBs when updating PTEs in an mshare range
sched/numa: do not scan msharefs vmas
mm: add mmap_read_lock_killable_nested()
mm: add and use unmap_page_range vm_ops hook
mm: introduce PUD page table shared count
x86/mm: enable page table sharing
mm: create __do_mmap() to take an mm_struct * arg
mm: pass the mm in vma_munmap_struct
sched/mshare: mshare ownership
mm/mshare: Add an ioctl for unmapping objects in an mshare region
mm/mshare: support mapping files and anon hugetlb in an mshare region
mm/mshare: provide a way to identify an mm as an mshare host mm
mm/mshare: charge fault handling allocations to the mshare owner
Khalid Aziz (7):
mm: Add msharefs filesystem
mm/mshare: pre-populate msharefs with information file
mm/mshare: make msharefs writable and support directories
mm/mshare: Add a vma flag to indicate an mshare region
mm/mshare: Add mmap support
mm/mshare: prepare for page table sharing support
mm/mshare: Add an ioctl for mapping objects in an mshare region
Documentation/filesystems/index.rst | 1 +
Documentation/filesystems/msharefs.rst | 96 ++
.../userspace-api/ioctl/ioctl-number.rst | 1 +
arch/Kconfig | 3 +
arch/x86/Kconfig | 1 +
arch/x86/mm/fault.c | 40 +-
include/linux/mm.h | 52 +
include/linux/mm_types.h | 38 +-
include/linux/mmap_lock.h | 7 +
include/linux/mshare.h | 25 +
include/linux/sched.h | 5 +
include/trace/events/mmflags.h | 7 +
include/uapi/linux/magic.h | 1 +
include/uapi/linux/msharefs.h | 38 +
ipc/shm.c | 17 +
kernel/exit.c | 1 +
kernel/fork.c | 1 +
kernel/sched/fair.c | 3 +-
mm/Kconfig | 11 +
mm/Makefile | 4 +
mm/hugetlb.c | 25 +
mm/memory.c | 76 +-
mm/mmap.c | 10 +-
mm/mshare.c | 942 ++++++++++++++++++
mm/vma.c | 22 +-
mm/vma.h | 3 +-
26 files changed, 1385 insertions(+), 45 deletions(-)
create mode 100644 Documentation/filesystems/msharefs.rst
create mode 100644 include/linux/mshare.h
create mode 100644 include/uapi/linux/msharefs.h
create mode 100644 mm/mshare.c
--
2.47.1
On 20.08.25 03:03, Anthony Yznaga wrote:
> Memory pages shared between processes require page table entries
> (PTEs) for each process. [...]
>
> This patch series implements a mechanism that allows userspace
> processes to opt into sharing PTEs. It adds a new in-memory
> filesystem - msharefs. [...] Fault handling in a shared region also
> links the shared page table to the process page table if the shared
> page table already exists.

Regarding the overall design, two important questions:

In the context of this series, how do we handle VMA-modifying
functions like mprotect/some madvise/mlock/mempolicy/...? Are they
currently blocked when applied to a mshare VMA?

And how are we handling other page table walkers that don't modify
VMAs like MADV_DONTNEED, smaps, migrate_pages, ... etc?

--
Cheers

David / dhildenb
On Mon, Sep 08, 2025 at 10:32:22PM +0200, David Hildenbrand wrote:
> In the context of this series, how do we handle VMA-modifying functions like
> mprotect/some madvise/mlock/mempolicy/...? Are they currently blocked when
> applied to a mshare VMA?

I haven't been following this series recently, so I'm not sure what
Anthony will say. My expectation is that the shared VMA is somewhat
transparent to these operations; that is they are faulty if they span
the boundary of the mshare VMA, but otherwise they pass through and
affect the shared VMAs.

That does raise the interesting question of how mlockall() affects
an mshare VMA. I'm tempted to say that it should affect the shared
VMA, but reasonable people might well disagree with me and have
excellent arguments.

> And how are we handling other page table walkers that don't modify VMAs like
> MADV_DONTNEED, smaps, migrate_pages, ... etc?

I'd expect those to walk into the shared region too.
On 9/8/25 1:59 PM, Matthew Wilcox wrote:
> On Mon, Sep 08, 2025 at 10:32:22PM +0200, David Hildenbrand wrote:
>> In the context of this series, how do we handle VMA-modifying functions like
>> mprotect/some madvise/mlock/mempolicy/...? Are they currently blocked when
>> applied to a mshare VMA?
>
> I haven't been following this series recently, so I'm not sure what
> Anthony will say. My expectation is that the shared VMA is somewhat
> transparent to these operations; that is they are faulty if they span
> the boundary of the mshare VMA, but otherwise they pass through and
> affect the shared VMAs.
>
> That does raise the interesting question of how mlockall() affects
> an mshare VMA. I'm tempted to say that it should affect the shared
> VMA, but reasonable people might well disagree with me and have
> excellent arguments.
>
>> And how are we handling other page table walkers that don't modify VMAs like
>> MADV_DONTNEED, smaps, migrate_pages, ... etc?
>
> I'd expect those to walk into the shared region too.

I've received conflicting feedback in previous discussions that things
like protection changes should be done via ioctl. I do think some
things are appropriate for ioctl, like map and unmap, but I also like
the idea of the existing APIs being transparent to mshare so long as
they are operating entirely within an mshare range and not crossing
boundaries.
On 08.09.25 23:14, Anthony Yznaga wrote:
> On 9/8/25 1:59 PM, Matthew Wilcox wrote:
>> [...]
>> That does raise the interesting question of how mlockall() affects
>> an mshare VMA. I'm tempted to say that it should affect the shared
>> VMA, but reasonable people might well disagree with me and have
>> excellent arguments.

Right, I think there are (at least) two possible models.

(A) It's just a special file mapping.

How that special file is orchestrated is not controlled through VMA
change operations (mprotect etc) from one process but through
dedicated ioctls.

(B) It's something different.

VMA change operations will affect how that file is orchestrated but
not modify how the VMA in each process looks like.

I still believe that (A) is clean and (B) is asking for trouble. But
in any case, this is one of the most vital parts of mshare integration
and should be documented clearly.

>>> And how are we handling other page table walkers that don't modify
>>> VMAs like MADV_DONTNEED, smaps, migrate_pages, ... etc?
>>
>> I'd expect those to walk into the shared region too.
>
> I've received conflicting feedback in previous discussions that
> things like protection changes should be done via ioctl. I do think
> some things are appropriate for ioctl, like map and unmap, but I
> also like the idea of the existing APIs being transparent to mshare
> so long as they are operating entirely within an mshare range and
> not crossing boundaries.

We have to be very careful here to not create a mess (this is all
going to be unchangeable API later), and getting the opinion from
other VMA handling folks (i.e., Lorenzo, Liam, Vlastimil, Pedro) will
be crucial.

So can you answer the questions I raised in more detail? In
particular, how it works with the current series and what the current
long-term plans are?

Thanks!

--
Cheers

David / dhildenb
On Tue, Sep 09, 2025 at 09:53:35AM +0200, David Hildenbrand wrote:
> We have to be very careful here to not create a mess (this is all going to
> be unchangeable API later), and getting the opinion from other VMA handling
> folks (i.e., Lorenzo, Liam, Vlastimil, Pedro) will be crucial.

BTW I am 100% planning to take a look here and reply properly, just
working through a backlog atm :) [as usual :P]

Cheers, Lorenzo
On 9/9/25 12:53 AM, David Hildenbrand wrote:
> [...]
> We have to be very careful here to not create a mess (this is all
> going to be unchangeable API later), and getting the opinion from
> other VMA handling folks (i.e., Lorenzo, Liam, Vlastimil, Pedro)
> will be crucial.
>
> So can you answer the questions I raised in more detail? In
> particular, how it works with the current series and what the
> current long-term plans are?

With respect to the current series there are some deficiencies. For
madvise(), there are some advices like MADV_DONTNEED that will operate
on the shared page table without taking the needed locks. Many will
fail for various reasons. I'll add a check to reject trying to apply
advice to msharefs VMAs. The plan is to add an ioctl for applying
advice to the memory in an mshare region. If it makes sense to make it
more transparent then I think that's something that could come later.

Things like migrate_pages() that use the rmap to get to VMAs are in
better shape because they will naturally find the real VMA with its
vm_mm pointing to an mshare mm.

Another area I'm currently working on is ensuring mmu notifiers work.
There is some locking trickery to work out there.

Currently the plan is to add ioctls for protection changes, advice,
and whatever else makes sense. I'm definitely open to feedback on any
aspect of this.