From: Khalid Aziz <khalid@kernel.org>
Add a ram-based filesystem that contains page table sharing
information and files that enable processes to share page tables.
This patch adds the basic filesystem that can be mounted and
a CONFIG_MSHARE option for compiling support in a kernel.
Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
Documentation/filesystems/msharefs.rst | 107 +++++++++++++++++++++++++
include/uapi/linux/magic.h | 1 +
mm/Kconfig | 9 +++
mm/Makefile | 4 +
mm/mshare.c | 96 ++++++++++++++++++++++
5 files changed, 217 insertions(+)
create mode 100644 Documentation/filesystems/msharefs.rst
create mode 100644 mm/mshare.c
diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
new file mode 100644
index 000000000000..c3c7168aa18f
--- /dev/null
+++ b/Documentation/filesystems/msharefs.rst
@@ -0,0 +1,107 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================
+msharefs - a filesystem to support shared page tables
+=====================================================
+
+msharefs is a ram-based filesystem that allows multiple processes to
+share page table entries for shared pages. To enable support for
+msharefs the kernel must be compiled with CONFIG_MSHARE set.
+
+msharefs is typically mounted like this::
+
+ mount -t msharefs none /sys/fs/mshare
+
+A file created on msharefs creates a new shared region where all
+processes mapping that region will map it using shared page table
+entries. ioctls are used to initialize or retrieve the start address
+and size of a shared region and to map objects in the shared
+region. It is important to note that an msharefs file is a control
+file for the shared region and does not contain the contents
+of the region itself.
+
+Here are the basic steps for using mshare::
+
+1. Mount msharefs on /sys/fs/mshare::
+
+ mount -t msharefs msharefs /sys/fs/mshare
+
+2. mshare regions have alignment and size requirements. Start
+ address for the region must be aligned to an address boundary and
+ be a multiple of fixed size. This alignment and size requirement
+ can be obtained by reading the file ``/sys/fs/mshare/mshare_info``
+ which returns a number in text format. mshare regions must be
+ aligned to this boundary and be a multiple of this size.
+
+3. For the process creating an mshare region::
+
+a. Create a file on /sys/fs/mshare, for example:
+
+.. code-block:: c
+
+ fd = open("/sys/fs/mshare/shareme",
+ O_RDWR|O_CREAT|O_EXCL, 0600);
+
+b. Establish the starting address and size of the region:
+
+.. code-block:: c
+
+ struct mshare_info minfo;
+
+ minfo.start = TB(2);
+ minfo.size = BUFFER_SIZE;
+ ioctl(fd, MSHAREFS_SET_SIZE, &minfo);
+
+c. Map some memory in the region:
+
+.. code-block:: c
+
+ struct mshare_create mcreate;
+
+ mcreate.addr = TB(2);
+ mcreate.size = BUFFER_SIZE;
+ mcreate.offset = 0;
+ mcreate.prot = PROT_READ | PROT_WRITE;
+ mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
+ mcreate.fd = -1;
+
+ ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);
+
+d. Map the mshare region into the process:
+
+.. code-block:: c
+
+ mmap((void *)TB(2), BUF_SIZE,
+ PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+
+e. Write and read to mshared region normally.
+
+
+4. For processes attaching an mshare region::
+
+a. Open the file on msharefs, for example:
+
+.. code-block:: c
+
+ fd = open("/sys/fs/mshare/shareme", O_RDWR);
+
+b. Get information about mshare'd region from the file:
+
+.. code-block:: c
+
+ struct mshare_info minfo;
+
+ ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
+
+c. Map the mshare'd region into the process:
+
+.. code-block:: c
+
+ mmap(minfo.start, minfo.size,
+ PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+
+5. To delete the mshare region:
+
+.. code-block:: c
+
+ unlink("/sys/fs/mshare/shareme");
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index bb575f3ab45e..e53dd6063cba 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -103,5 +103,6 @@
#define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
#define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
#define PID_FS_MAGIC 0x50494446 /* "PIDF" */
+#define MSHARE_MAGIC 0x4d534852 /* "MSHR" */
#endif /* __LINUX_MAGIC_H__ */
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b501db06417..ba3dbe31f86a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1358,6 +1358,15 @@ config PT_RECLAIM
Note: now only empty user PTE page table pages will be reclaimed.
+config MSHARE
+ bool "Mshare"
+ depends on MMU
+ help
+ Enable msharefs: A ram-based filesystem that allows multiple
+ processes to share page table entries for shared pages. A file
+ created on msharefs represents a shared region where all processes
+ mapping that region will map objects within it with shared PTEs.
+ Ioctls are used to configure and map objects into the shared region
source "mm/damon/Kconfig"
diff --git a/mm/Makefile b/mm/Makefile
index 850386a67b3e..68bc967863f9 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -48,6 +48,10 @@ ifdef CONFIG_64BIT
mmu-$(CONFIG_MMU) += mseal.o
endif
+ifdef CONFIG_MSHARE
+mmu-$(CONFIG_MMU) += mshare.o
+endif
+
obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
maccess.o page-writeback.o folio-compat.o \
readahead.o swap.o truncate.o vmscan.o shrinker.o \
diff --git a/mm/mshare.c b/mm/mshare.c
new file mode 100644
index 000000000000..49d32e0c20d2
--- /dev/null
+++ b/mm/mshare.c
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Enable cooperating processes to share page table between
+ * them to reduce the extra memory consumed by multiple copies
+ * of page tables.
+ *
+ * This code adds an in-memory filesystem - msharefs.
+ * msharefs is used to manage page table sharing
+ *
+ *
+ * Copyright (C) 2024 Oracle Corp. All rights reserved.
+ * Author: Khalid Aziz <khalid@kernel.org>
+ *
+ */
+
+#include <linux/fs.h>
+#include <linux/fs_context.h>
+#include <uapi/linux/magic.h>
+
+static const struct file_operations msharefs_file_operations = {
+ .open = simple_open,
+};
+
+static const struct super_operations mshare_s_ops = {
+ .statfs = simple_statfs,
+};
+
+static int
+msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+ struct inode *inode;
+
+ sb->s_blocksize = PAGE_SIZE;
+ sb->s_blocksize_bits = PAGE_SHIFT;
+ sb->s_magic = MSHARE_MAGIC;
+ sb->s_op = &mshare_s_ops;
+ sb->s_time_gran = 1;
+
+ inode = new_inode(sb);
+ if (!inode)
+ return -ENOMEM;
+
+ inode->i_ino = 1;
+ inode->i_mode = S_IFDIR | 0777;
+ simple_inode_init_ts(inode);
+ inode->i_op = &simple_dir_inode_operations;
+ inode->i_fop = &simple_dir_operations;
+ set_nlink(inode, 2);
+
+ sb->s_root = d_make_root(inode);
+ if (!sb->s_root)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static int
+msharefs_get_tree(struct fs_context *fc)
+{
+ return get_tree_nodev(fc, msharefs_fill_super);
+}
+
+static const struct fs_context_operations msharefs_context_ops = {
+ .get_tree = msharefs_get_tree,
+};
+
+static int
+mshare_init_fs_context(struct fs_context *fc)
+{
+ fc->ops = &msharefs_context_ops;
+ return 0;
+}
+
+static struct file_system_type mshare_fs = {
+ .name = "msharefs",
+ .init_fs_context = mshare_init_fs_context,
+ .kill_sb = kill_litter_super,
+};
+
+static int __init
+mshare_init(void)
+{
+ int ret;
+
+ ret = sysfs_create_mount_point(fs_kobj, "mshare");
+ if (ret)
+ return ret;
+
+ ret = register_filesystem(&mshare_fs);
+ if (ret)
+ sysfs_remove_mount_point(fs_kobj, "mshare");
+
+ return ret;
+}
+
+core_initcall(mshare_init);
--
2.43.5
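Step 2 of the documentation above says that the start address and size of an mshare region must both be multiples of the value read from ``/sys/fs/mshare/mshare_info``. The following is a minimal userspace sketch of that check, not code from this patch. It assumes the file holds a single number; the patch does not say whether it is decimal or hex, so the parser accepts either. The helper names ``mshare_parse_alignment``, ``mshare_region_ok`` and ``mshare_read_alignment`` are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the number that mshare_info is documented
 * to return in text format. Base 0 accepts decimal or 0x-prefixed hex
 * (an assumption; the patch does not specify the base).
 * Returns 0 when the text does not start with a number.
 */
static uint64_t mshare_parse_alignment(const char *text)
{
	char *end;
	unsigned long long val;

	errno = 0;
	val = strtoull(text, &end, 0);
	if (errno || end == text)
		return 0;
	return (uint64_t)val;
}

/*
 * Hypothetical helper: true when start is aligned to align and size is
 * a non-zero multiple of align.
 */
static bool mshare_region_ok(uint64_t start, uint64_t size, uint64_t align)
{
	if (align == 0 || size == 0)
		return false;
	return (start % align) == 0 && (size % align) == 0;
}

/*
 * Hypothetical helper: read the alignment value from path, for example
 * "/sys/fs/mshare/mshare_info". Returns 0 if the file cannot be read.
 */
static uint64_t mshare_read_alignment(const char *path)
{
	char buf[64];
	uint64_t align = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return 0;
	if (fgets(buf, sizeof(buf), f))
		align = mshare_parse_alignment(buf);
	fclose(f);
	return align;
}
```

A caller would typically run ``mshare_region_ok(start, size, mshare_read_alignment("/sys/fs/mshare/mshare_info"))`` before attempting to set up a region, and treat a returned alignment of 0 as "mshare not available".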
On Fri, Jan 24, 2025 at 03:54:35PM -0800, Anthony Yznaga wrote:
> diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
> new file mode 100644
> index 000000000000..c3c7168aa18f
> --- /dev/null
> +++ b/Documentation/filesystems/msharefs.rst
> @@ -0,0 +1,107 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================================================
> +msharefs - a filesystem to support shared page tables
> +=====================================================
> +
> +msharefs is a ram-based filesystem that allows multiple processes to
> +share page table entries for shared pages. To enable support for
> +msharefs the kernel must be compiled with CONFIG_MSHARE set.
> +
> +msharefs is typically mounted like this::
> +
> + mount -t msharefs none /sys/fs/mshare
> +
> +A file created on msharefs creates a new shared region where all
> +processes mapping that region will map it using shared page table
> +entries. ioctls are used to initialize or retrieve the start address
> +and size of a shared region and to map objects in the shared
> +region. It is important to note that an msharefs file is a control
> +file for the shared region and does not contain the contents
> +of the region itself.
> +
> +Here are the basic steps for using mshare::
> +
> +1. Mount msharefs on /sys/fs/mshare::
> +
> + mount -t msharefs msharefs /sys/fs/mshare
> +
> +2. mshare regions have alignment and size requirements. Start
> + address for the region must be aligned to an address boundary and
> + be a multiple of fixed size. This alignment and size requirement
> + can be obtained by reading the file ``/sys/fs/mshare/mshare_info``
> + which returns a number in text format. mshare regions must be
> + aligned to this boundary and be a multiple of this size.
> +
> +3. For the process creating an mshare region::
> +
> +a. Create a file on /sys/fs/mshare, for example:
Should the creating mshare region sublist be nested list?
> +
> +.. code-block:: c
> +
> + fd = open("/sys/fs/mshare/shareme",
> + O_RDWR|O_CREAT|O_EXCL, 0600);
> +
> +b. Establish the starting address and size of the region:
> +
> +.. code-block:: c
> +
> + struct mshare_info minfo;
> +
> + minfo.start = TB(2);
> + minfo.size = BUFFER_SIZE;
> + ioctl(fd, MSHAREFS_SET_SIZE, &minfo);
> +
> +c. Map some memory in the region:
> +
> +.. code-block:: c
> +
> + struct mshare_create mcreate;
> +
> + mcreate.addr = TB(2);
> + mcreate.size = BUFFER_SIZE;
> + mcreate.offset = 0;
> + mcreate.prot = PROT_READ | PROT_WRITE;
> + mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
> + mcreate.fd = -1;
> +
> + ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);
> +
> +d. Map the mshare region into the process:
> +
> +.. code-block:: c
> +
> + mmap((void *)TB(2), BUF_SIZE,
> + PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> +
> +e. Write and read to mshared region normally.
> +
> +
> +4. For processes attaching an mshare region::
> +
> +a. Open the file on msharefs, for example:
> +
> +.. code-block:: c
> +
> + fd = open("/sys/fs/mshare/shareme", O_RDWR);
> +
> +b. Get information about mshare'd region from the file:
> +
> +.. code-block:: c
> +
> + struct mshare_info minfo;
> +
> + ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
> +
> +c. Map the mshare'd region into the process:
> +
> +.. code-block:: c
> +
> + mmap(minfo.start, minfo.size,
> + PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> +
> +5. To delete the mshare region:
> +
> +.. code-block:: c
> +
> + unlink("/sys/fs/mshare/shareme");
Sphinx reports htmldocs warnings:
Documentation/filesystems/msharefs.rst:25: WARNING: Literal block expected; none found. [docutils]
Documentation/filesystems/msharefs.rst:38: WARNING: Literal block expected; none found. [docutils]
Documentation/filesystems/msharefs.rst:82: WARNING: Literal block expected; none found. [docutils]
Thanks.
--
An old man doll... just what I always wanted! - Clara
On 2/3/25 5:52 PM, Bagas Sanjaya wrote:
> On Fri, Jan 24, 2025 at 03:54:35PM -0800, Anthony Yznaga wrote:
>> diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
>> new file mode 100644
>> index 000000000000..c3c7168aa18f
>> --- /dev/null
>> +++ b/Documentation/filesystems/msharefs.rst
>> @@ -0,0 +1,107 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +=====================================================
>> +msharefs - a filesystem to support shared page tables
>> +=====================================================
>> +
>> +msharefs is a ram-based filesystem that allows multiple processes to
>> +share page table entries for shared pages. To enable support for
>> +msharefs the kernel must be compiled with CONFIG_MSHARE set.
>> +
>> +msharefs is typically mounted like this::
>> +
>> + mount -t msharefs none /sys/fs/mshare
>> +
>> +A file created on msharefs creates a new shared region where all
>> +processes mapping that region will map it using shared page table
>> +entries. ioctls are used to initialize or retrieve the start address
>> +and size of a shared region and to map objects in the shared
>> +region. It is important to note that an msharefs file is a control
>> +file for the shared region and does not contain the contents
>> +of the region itself.
>> +
>> +Here are the basic steps for using mshare::
>> +
>> +1. Mount msharefs on /sys/fs/mshare::
>> +
>> + mount -t msharefs msharefs /sys/fs/mshare
>> +
>> +2. mshare regions have alignment and size requirements. Start
>> + address for the region must be aligned to an address boundary and
>> + be a multiple of fixed size. This alignment and size requirement
>> + can be obtained by reading the file ``/sys/fs/mshare/mshare_info``
>> + which returns a number in text format. mshare regions must be
>> + aligned to this boundary and be a multiple of this size.
>> +
>> +3. For the process creating an mshare region::
>> +
>> +a. Create a file on /sys/fs/mshare, for example:
> Should the creating mshare region sublist be nested list?
Can you expand on that? Do you mean create an mshare region as a
directory and populate it with files representing the mappings that are
created in the region?
>
>> +
>> +.. code-block:: c
>> +
>> + fd = open("/sys/fs/mshare/shareme",
>> + O_RDWR|O_CREAT|O_EXCL, 0600);
>> +
>> +b. Establish the starting address and size of the region:
>> +
>> +.. code-block:: c
>> +
>> + struct mshare_info minfo;
>> +
>> + minfo.start = TB(2);
>> + minfo.size = BUFFER_SIZE;
>> + ioctl(fd, MSHAREFS_SET_SIZE, &minfo);
>> +
>> +c. Map some memory in the region:
>> +
>> +.. code-block:: c
>> +
>> + struct mshare_create mcreate;
>> +
>> + mcreate.addr = TB(2);
>> + mcreate.size = BUFFER_SIZE;
>> + mcreate.offset = 0;
>> + mcreate.prot = PROT_READ | PROT_WRITE;
>> + mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
>> + mcreate.fd = -1;
>> +
>> + ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);
>> +
>> +d. Map the mshare region into the process:
>> +
>> +.. code-block:: c
>> +
>> + mmap((void *)TB(2), BUF_SIZE,
>> + PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>> +
>> +e. Write and read to mshared region normally.
>> +
>> +
>> +4. For processes attaching an mshare region::
>> +
>> +a. Open the file on msharefs, for example:
>> +
>> +.. code-block:: c
>> +
>> + fd = open("/sys/fs/mshare/shareme", O_RDWR);
>> +
>> +b. Get information about mshare'd region from the file:
>> +
>> +.. code-block:: c
>> +
>> + struct mshare_info minfo;
>> +
>> + ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
>> +
>> +c. Map the mshare'd region into the process:
>> +
>> +.. code-block:: c
>> +
>> + mmap(minfo.start, minfo.size,
>> + PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>> +
>> +5. To delete the mshare region:
>> +
>> +.. code-block:: c
>> +
>> + unlink("/sys/fs/mshare/shareme");
> Sphinx reports htmldocs warnings:
>
> Documentation/filesystems/msharefs.rst:25: WARNING: Literal block expected; none found. [docutils]
> Documentation/filesystems/msharefs.rst:38: WARNING: Literal block expected; none found. [docutils]
> Documentation/filesystems/msharefs.rst:82: WARNING: Literal block expected; none found. [docutils]
Thanks. Will fix this.
Anthony
>
> Thanks.
>
Just nits:
On 1/24/25 3:54 PM, Anthony Yznaga wrote:
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 1b501db06417..ba3dbe31f86a 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1358,6 +1358,15 @@ config PT_RECLAIM
>
> Note: now only empty user PTE page table pages will be reclaimed.
>
> +config MSHARE
> + bool "Mshare"
> + depends on MMU
> + help
> + Enable msharefs: A ram-based filesystem that allows multiple
RAM-based
> + processes to share page table entries for shared pages. A file
> + created on msharefs represents a shared region where all processes
> + mapping that region will map objects within it with shared PTEs.
> + Ioctls are used to configure and map objects into the shared region
End the sentence above with a period.
>
> source "mm/damon/Kconfig"
--
~Randy
On 1/24/25 7:13 PM, Randy Dunlap wrote:
> Just nits:
>
>
> On 1/24/25 3:54 PM, Anthony Yznaga wrote:
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 1b501db06417..ba3dbe31f86a 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1358,6 +1358,15 @@ config PT_RECLAIM
>>
>> Note: now only empty user PTE page table pages will be reclaimed.
>>
>> +config MSHARE
>> + bool "Mshare"
>> + depends on MMU
>> + help
>> + Enable msharefs: A ram-based filesystem that allows multiple
> RAM-based
>
>> + processes to share page table entries for shared pages. A file
>> + created on msharefs represents a shared region where all processes
>> + mapping that region will map objects within it with shared PTEs.
>> + Ioctls are used to configure and map objects into the shared region
> End the sentence above with a period.
Thanks, Randy. Appreciate the comments.
Anthony
>
>>
>> source "mm/damon/Kconfig"
On Sat, Jan 25, 2025 at 12:05:47PM -0800, Anthony Yznaga wrote:
>
> On 1/24/25 7:13 PM, Randy Dunlap wrote:
> > Just nits:
> >
> >
> > On 1/24/25 3:54 PM, Anthony Yznaga wrote:
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 1b501db06417..ba3dbe31f86a 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -1358,6 +1358,15 @@ config PT_RECLAIM
> > > Note: now only empty user PTE page table pages will be reclaimed.
> > > +config MSHARE
> > > + bool "Mshare"
> > > + depends on MMU
> > > + help
> > > + Enable msharefs: A ram-based filesystem that allows multiple
> > RAM-based
But it's not a ram-based filesystem. It's a pseudo-filesystem like
procfs. It doesn't have any memory of its own.
On 1/25/25 1:10 PM, Matthew Wilcox wrote:
> On Sat, Jan 25, 2025 at 12:05:47PM -0800, Anthony Yznaga wrote:
>> On 1/24/25 7:13 PM, Randy Dunlap wrote:
>>> Just nits:
>>>
>>>
>>> On 1/24/25 3:54 PM, Anthony Yznaga wrote:
>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>> index 1b501db06417..ba3dbe31f86a 100644
>>>> --- a/mm/Kconfig
>>>> +++ b/mm/Kconfig
>>>> @@ -1358,6 +1358,15 @@ config PT_RECLAIM
>>>> Note: now only empty user PTE page table pages will be reclaimed.
>>>> +config MSHARE
>>>> + bool "Mshare"
>>>> + depends on MMU
>>>> + help
>>>> + Enable msharefs: A ram-based filesystem that allows multiple
>>> RAM-based
> But it's not a ram-based filesystem. It's a pseudo-filesystem like
> procfs. It doesn't have any memory of its own.
Right. I'll clear that up.
Anthony
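Since the patch sets ``sb->s_magic = MSHARE_MAGIC`` and wires up ``simple_statfs``, userspace should be able to tell whether a path sits on msharefs by comparing the ``f_type`` returned by statfs(2) against that magic, the same way other pseudo-filesystems are detected. A minimal sketch under that assumption follows. The helper names ``magic_to_tag`` and ``is_msharefs`` are hypothetical, and the magic value is copied from the ``include/uapi/linux/magic.h`` hunk in this patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/vfs.h>

#ifndef MSHARE_MAGIC
#define MSHARE_MAGIC 0x4d534852	/* "MSHR", value taken from the patch */
#endif

/*
 * Hypothetical helper: decode a 32-bit filesystem magic into its four
 * ASCII bytes, most significant byte first. out must hold 5 bytes.
 */
static void magic_to_tag(uint32_t magic, char out[5])
{
	out[0] = (char)((magic >> 24) & 0xff);
	out[1] = (char)((magic >> 16) & 0xff);
	out[2] = (char)((magic >> 8) & 0xff);
	out[3] = (char)(magic & 0xff);
	out[4] = '\0';
}

/*
 * Hypothetical helper: true when statfs() succeeds on path and reports
 * MSHARE_MAGIC as the filesystem type. Any statfs() failure is treated
 * as "not msharefs".
 */
static bool is_msharefs(const char *path)
{
	struct statfs st;

	if (statfs(path, &st) != 0)
		return false;
	return (uint32_t)st.f_type == MSHARE_MAGIC;
}
```

A program could call ``is_msharefs("/sys/fs/mshare")`` before creating region files, so that it fails early when the filesystem is not mounted there.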