[PATCH 00/20] Add support for shared PTEs across processes
Posted by Anthony Yznaga 1 year ago
Memory pages shared between processes require page table entries
(PTEs) in each process. Each of these PTEs consumes some memory, and
as long as the number of mappings being maintained is small enough,
the space consumed by page tables is not objectionable. When very few
memory pages are shared between processes, the number of PTEs to
maintain is mostly constrained by the number of pages of memory on
the system. As the number of shared pages and the number of times
pages are shared go up, the amount of memory consumed by page tables
starts to become significant. This issue does not apply to threads:
any number of threads can share the same pages inside a process while
sharing the same PTEs. Extending this same model to sharing pages
across processes can eliminate the overhead for cross-process sharing
as well.

Some of the field deployments commonly see memory pages shared
across 1000s of processes. On x86_64, each page requires a PTE that
is 8 bytes long which is very small compared to the 4K page
size. When 2000 processes map the same page in their address space,
however, each one of them requires 8 bytes for its PTE, and together
that adds up to 16K of memory just to hold the PTEs for one 4K
page. On a database server with a 300GB SGA, a system crash due to an
out-of-memory condition was seen when 1500+ clients tried to share
this SGA even though the system had 512GB of memory. In the worst
case on this server, all 1500 processes mapping every page of the SGA
would have required 878GB+ just for the PTEs. If these PTEs could be
shared, a substantial amount of memory would be saved.
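
For reference, the arithmetic behind those numbers (approximate,
assuming 4K pages and 8-byte PTEs):

        300GB SGA / 4KB per page        = ~78.6 million pages
        78.6 million PTEs * 8 bytes     = ~600MB of PTEs per process
        600MB of PTEs * 1500 processes  = ~879GB of PTEs in total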

This patch series implements a mechanism that allows userspace
processes to opt into sharing PTEs. It adds a new in-memory
filesystem - msharefs. A file created on msharefs represents a
shared region where all processes mapping that region will map
objects within it with shared PTEs. When the file is created,
a new host mm struct is created to hold the shared page tables
and vmas for objects later mapped into the shared region. This
host mm struct is associated with the file and not with a task.
When a process mmaps the shared region, the vm flag VM_MSHARE
is added to the vma. On page fault the vma is checked for the
presence of the VM_MSHARE flag. If found, the host mm is
searched for a vma that covers the fault address, and fault
handling continues using that host vma, which establishes PTEs
in the host mm. If the shared page table already exists, fault
handling also links it into the process page table.
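
A rough conceptual sketch of that fault-side redirection is shown
below. This is not code from the patches; the helper for looking up
the host mm and the exact locking and error handling are assumptions.

        /* Called only after VM_MSHARE has been found set on @vma. */
        static vm_fault_t mshare_fault_sketch(struct vm_area_struct *vma,
                                              unsigned long addr,
                                              unsigned int flags,
                                              struct pt_regs *regs)
        {
                /* Hypothetical helper: the host mm hangs off the msharefs file. */
                struct mm_struct *host_mm = mshare_host_mm_of(vma);
                struct vm_area_struct *host_vma;
                vm_fault_t ret;

                mmap_read_lock(host_mm);
                /* Find the vma in the host mm that covers the fault address. */
                host_vma = find_vma(host_mm, addr);
                if (!host_vma || addr < host_vma->vm_start) {
                        mmap_read_unlock(host_mm);
                        return VM_FAULT_SIGSEGV;
                }
                /*
                 * Establish PTEs in the host mm; the shared page table is
                 * then linked into the faulting process's page table.
                 */
                ret = handle_mm_fault(host_vma, addr, flags, regs);
                mmap_read_unlock(host_mm);
                return ret;
        }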

Ioctls are used to set/get the start address and size of the host
mm, to map objects into the shared region, and to (eventually)
perform other operations on the shared objects such as changing
protections.
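
For reference, the ioctl argument structures implied by the examples
below look roughly like this. This is a sketch reconstructed from the
fields used in this cover letter; the exact layout and field types in
include/uapi/linux/msharefs.h may differ.

        struct mshare_info {
                __u64 start;    /* start address of the mshare region */
                __u64 size;     /* size of the mshare region */
        };

        struct mshare_create {
                __u64 addr;     /* address within the region to map at */
                __u64 size;     /* size of the mapping */
                __u64 offset;   /* offset into fd, if fd >= 0 */
                __u32 prot;     /* PROT_* flags */
                __u32 flags;    /* MAP_* flags */
                __s32 fd;       /* backing file, or -1 for anonymous memory */
        };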

API
===

mshare does not introduce a new API. It instead uses existing APIs
to implement page table sharing. The steps to use this feature are:

1. Mount msharefs on /sys/fs/mshare -
        mount -t msharefs msharefs /sys/fs/mshare

2. mshare regions have alignment and size requirements. The start
   address of a region must be aligned to a fixed boundary, and the
   region size must be a multiple of that same value. The value can
   be obtained by reading the file /sys/fs/mshare/mshare_info, which
   returns a single number in text format. (A short alignment sketch
   is shown after this list.)

3. For the process creating an mshare region:
        a. Create a file on /sys/fs/mshare, for example -
                fd = open("/sys/fs/mshare/shareme",
                                O_RDWR|O_CREAT|O_EXCL, 0600);

        b. Establish the starting address and size of the region
                struct mshare_info minfo;

                minfo.start = TB(2);
                minfo.size = BUFFER_SIZE;
                ioctl(fd, MSHAREFS_SET_SIZE, &minfo)

        c. Map some memory in the region
                struct mshare_create mcreate;

                mcreate.addr = TB(2);
                mcreate.size = BUFFER_SIZE;
                mcreate.offset = 0;
                mcreate.prot = PROT_READ | PROT_WRITE;
                mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
                mcreate.fd = -1;

                ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate)

        d. Map the mshare region into the process
                mmap((void *)TB(2), BUFFER_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

        e. Write and read to mshared region normally.

4. For processes attaching an mshare region:
        a. Open the file on msharefs, for example -
                fd = open("/sys/fs/mshare/shareme", O_RDWR);

        b. Get information about mshare'd region from the file:
                struct mshare_info minfo;

                ioctl(fd, MSHAREFS_GET_SIZE, &minfo);

        c. Map the mshare'd region into the process
                mmap((void *)minfo.start, minfo.size,
                        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

5. To delete the mshare region -
                unlink("/sys/fs/mshare/shareme");



Example Code
============

A snippet of the code that a donor process would run is shown below:

-----------------
        struct mshare_info minfo;
        struct mshare_create mcreate;
        char req[128] = {0};
        unsigned long alignsize, start, size;
        void *addr;
        int fd, ret;

        /* Read the alignment/size requirement from mshare_info. */
        fd = open("/sys/fs/mshare/mshare_info", O_RDONLY);
        read(fd, req, sizeof(req) - 1);
        alignsize = atol(req);
        close(fd);

        /* Create the mshare file and establish the region. */
        fd = open("/sys/fs/mshare/shareme", O_RDWR|O_CREAT|O_EXCL, 0600);
        start = alignsize * 4;
        size = alignsize * 2;

        minfo.start = start;
        minfo.size = size;
        ret = ioctl(fd, MSHAREFS_SET_SIZE, &minfo);
        if (ret < 0)
                perror("ERROR: MSHAREFS_SET_SIZE");

        /* Map anonymous shared memory covering the whole region. */
        mcreate.addr = start;
        mcreate.size = size;
        mcreate.offset = 0;
        mcreate.prot = PROT_READ | PROT_WRITE;
        mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
        mcreate.fd = -1;
        ret = ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);
        if (ret < 0)
                perror("ERROR: MSHAREFS_CREATE_MAPPING");

        /* Map the mshare region into this process and write to it. */
        addr = mmap((void *)start, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
                perror("ERROR: mmap failed");

        strncpy(addr, "Some random shared text",
                        sizeof("Some random shared text"));
-----------------

A snippet of the code that a consumer process would execute is shown below:

-----------------
        struct mshare_info minfo;
        void *addr;
        int fd, ret;

        fd = open("/sys/fs/mshare/shareme", O_RDWR);

        /* Retrieve the start address and size of the mshare region. */
        ret = ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
        if (ret < 0)
                perror("ERROR: MSHAREFS_GET_SIZE");
        if (!minfo.size)
                fprintf(stderr, "ERROR: mshare region not init'd\n");

        /* Map the region and read back what the donor wrote. */
        addr = mmap((void *)minfo.start, minfo.size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
                perror("ERROR: mmap failed");

        printf("Guest mmap at %p:\n", addr);
        printf("%s\n", (char *)addr);
        printf("\nDone\n");

-----------------

v1:
  - Based on mm-unstable mm-hotfixes-stable-2025-01-16-21-11
  - Use mshare size instead of start address to check if mshare region
    has been initialized.
  - Share page tables at PUD level instead of PGD.
  - Rename vma_is_shared() to vma_is_mshare() (James H / David H)
  - Introduce and use mmap_read_lock_nested() (Kirill)
  - Use an mmu notifier to flush all TLBs when updating shared pagetable
    mappings. (Dave Hansen)
  - Move logic for finding the shared vma to use to handle a fault from
    handle_mm_fault() to do_user_addr_fault() because the arch-specific
    fault handling checks vma flags for access permissions.
  - Add CONFIG_MSHARE / ARCH_SUPPORTS_MSHARE 
  - Add msharefs_get_unmapped_area()
  - Implemented vm_ops->unmap_page_range (Kirill)
  - Update free_pgtables/free_pgd_range to free process pagetable levels
    but not shared pagetable levels.
  - A first take at cgroup support

RFC v2 -> v3:
  - Now based on 6.11-rc5
  - Addressed many comments from v2.
  - Simplified filesystem code. Removed refcounting of the
    shared mm_struct allocated for an mshare file. The mm_struct
    and the pagetables and mappings it contains are freed when
    the inode is evicted.
  - Switched to an ioctl-based interface. Ioctls implemented
    are used to set and get the start address and size of an
    mshare region and to map objects into an mshare region
    (only anon shared memory is supported in this series).
  - Updated example code

[1] v2: https://lore.kernel.org/linux-mm/cover.1656531090.git.khalid.aziz@oracle.com/


RFC v1 -> v2:
  - Eliminated mshare and mshare_unlink system calls and
    replaced API with standard mmap and unlink (Based upon
    v1 patch discussions and LSF/MM discussions)
  - All fd based API (based upon feedback and suggestions from
    Andy Lutomirski, Eric Biederman, Kirill and others)
  - Added a file /sys/fs/mshare/mshare_info to provide
    alignment and size requirement info (based upon feedback
    from Dave Hansen, Mark Hemment and discussions at LSF/MM)
  - Addressed TODOs in v1
  - Added support for directories in msharefs
  - Added locks around any time vma is touched (Dave Hansen)
  - Eliminated the need to point vm_mm in original vmas to the
    newly synthesized mshare mm
  - Ensured mmap_read_unlock is called for correct mm in
    handle_mm_fault (Dave Hansen)


Anthony Yznaga (13):
  mm/mshare: allocate an mm_struct for msharefs files
  mm/mshare: flush all TLBs when updating PTEs in an mshare range
  sched/numa: do not scan msharefs vmas
  mm: add mmap_read_lock_killable_nested()
  mm: add and use unmap_page_range vm_ops hook
  x86/mm: enable page table sharing
  mm: create __do_mmap() to take an mm_struct * arg
  mm: pass the mm in vma_munmap_struct
  mshare: add MSHAREFS_CREATE_MAPPING
  mshare: add MSHAREFS_UNMAP
  mm/mshare: provide a way to identify an mm as an mshare host mm
  mm/mshare: get memcg from current->mm instead of mshare mm
  mm/mshare: associate a mem cgroup with an mshare file

Khalid Aziz (7):
  mm: Add msharefs filesystem
  mm/mshare: pre-populate msharefs with information file
  mm/mshare: make msharefs writable and support directories
  mm/mshare: Add ioctl support
  mm/mshare: Add a vma flag to indicate an mshare region
  mm/mshare: Add mmap support
  mm/mshare: prepare for page table sharing support

 Documentation/filesystems/msharefs.rst        | 107 +++
 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 arch/Kconfig                                  |   3 +
 arch/x86/Kconfig                              |   1 +
 arch/x86/mm/fault.c                           |  48 +-
 include/linux/memcontrol.h                    |   3 +
 include/linux/mm.h                            |  56 ++
 include/linux/mm_types.h                      |   2 +
 include/linux/mmap_lock.h                     |   7 +
 include/trace/events/mmflags.h                |   7 +
 include/uapi/linux/magic.h                    |   1 +
 include/uapi/linux/msharefs.h                 |  45 ++
 ipc/shm.c                                     |  17 +
 kernel/sched/fair.c                           |   3 +-
 mm/Kconfig                                    |   9 +
 mm/Makefile                                   |   4 +
 mm/hugetlb.c                                  |  25 +
 mm/memcontrol.c                               |   3 +-
 mm/memory.c                                   |  74 +-
 mm/mmap.c                                     |   7 +-
 mm/mshare.c                                   | 708 ++++++++++++++++++
 mm/vma.c                                      |  25 +-
 mm/vma.h                                      |   3 +-
 23 files changed, 1108 insertions(+), 51 deletions(-)
 create mode 100644 Documentation/filesystems/msharefs.rst
 create mode 100644 include/uapi/linux/msharefs.h
 create mode 100644 mm/mshare.c

-- 
2.43.5
Re: [PATCH 00/20] Add support for shared PTEs across processes
Posted by Andrew Morton 1 year ago
On Fri, 24 Jan 2025 15:54:34 -0800 Anthony Yznaga <anthony.yznaga@oracle.com> wrote:

> Some of the field deployments commonly see memory pages shared
> across 1000s of processes. On x86_64, each page requires a PTE that
> is 8 bytes long which is very small compared to the 4K page
> size.

Dumb question: why aren't these applications using huge pages?
Re: [PATCH 00/20] Add support for shared PTEs across processes
Posted by Anthony Yznaga 1 year ago
On 1/28/25 4:11 PM, Andrew Morton wrote:
> On Fri, 24 Jan 2025 15:54:34 -0800 Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>
>> Some of the field deployments commonly see memory pages shared
>> across 1000s of processes. On x86_64, each page requires a PTE that
>> is 8 bytes long which is very small compared to the 4K page
>> size.
> Dumb question: why aren't these applications using huge pages?
>
They often are using hugetlbfs but would also benefit from having page 
tables shared for other kinds of memory such as shmem, tmpfs or dax.
Re: [PATCH 00/20] Add support for shared PTEs across processes
Posted by Matthew Wilcox 1 year ago
On Tue, Jan 28, 2025 at 04:25:22PM -0800, Anthony Yznaga wrote:
> 
> On 1/28/25 4:11 PM, Andrew Morton wrote:
> > On Fri, 24 Jan 2025 15:54:34 -0800 Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> > 
> > > Some of the field deployments commonly see memory pages shared
> > > across 1000s of processes. On x86_64, each page requires a PTE that
> > > is 8 bytes long which is very small compared to the 4K page
> > > size.
> > Dumb question: why aren't these applications using huge pages?
> > 
> They often are using hugetlbfs but would also benefit from having page
> tables shared for other kinds of memory such as shmem, tmpfs or dax.

... and the implementation of PMD sharing in hugetlbfs is horrible.  In
addition to inverting the locking order (see gigantic comment in rmap.c),
the semantics aren't what the Oracle DB wants, and it's inefficient.

So when we were looking at implementing page table sharing for DAX, we
examined _and rejected_ porting the hugetlbfs approach.  We've discussed
this extensively at the last three LSFMM sessions where mshare has been
a topic, and in previous submissions of mshare.  So seeing the question
being asked yet again is disheartening.
Re: [PATCH 00/20] Add support for shared PTEs across processes
Posted by David Hildenbrand 1 year ago
> API
> ===
> 
> mshare does not introduce a new API. It instead uses existing APIs
> to implement page table sharing. The steps to use this feature are:
> 
> 1. Mount msharefs on /sys/fs/mshare -
>          mount -t msharefs msharefs /sys/fs/mshare
> 
> 2. mshare regions have alignment and size requirements. Start
>     address for the region must be aligned to an address boundary and
>     be a multiple of fixed size. This alignment and size requirement
>     can be obtained by reading the file /sys/fs/mshare/mshare_info
>     which returns a number in text format. mshare regions must be
>     aligned to this boundary and be a multiple of this size.
> 
> 3. For the process creating an mshare region:
>          a. Create a file on /sys/fs/mshare, for example -
>                  fd = open("/sys/fs/mshare/shareme",
>                                  O_RDWR|O_CREAT|O_EXCL, 0600);
> 
>          b. Establish the starting address and size of the region
>                  struct mshare_info minfo;
> 
>                  minfo.start = TB(2);
>                  minfo.size = BUFFER_SIZE;
>                  ioctl(fd, MSHAREFS_SET_SIZE, &minfo)

We could set the size using ftruncate, just like for any other file. It 
would have to be the first thing after creating the file, and before we 
allow any other modifications.

Ideally, we'd be able to get rid of the "start", use something reasonable 
(e.g., TB(2)) internally, and allow processes to mmap() it at different 
(suitably-aligned) addresses.

I recall we discussed that in the past. Did you stumble over real 
blockers such that we really must mmap() the file at the same address in 
all processes? I recall some things around TLB flushing, but not sure. 
So we might have to stick to an mmap address for now.

When using fallocate/stat to set/query the file size, we could end up with:

/*
  * Set the address where this file can be mapped into processes. Other
  * addresses are not supported for now, and mmap will fail. Changing the
  * mmap address after mappings were already created is not supported.
  */
MSHAREFS_SET_MMAP_ADDRESS
MSHAREFS_GET_MMAP_ADDRESS
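
A hypothetical usage sketch of that alternative follows.
MSHAREFS_SET_MMAP_ADDRESS/MSHAREFS_GET_MMAP_ADDRESS are only proposed
above and not implemented in this series, and the argument type is an
assumption:

        fd = open("/sys/fs/mshare/shareme", O_RDWR|O_CREAT|O_EXCL, 0600);
        ftruncate(fd, BUFFER_SIZE);        /* size the file like any other */

        unsigned long mmap_addr = TB(2);
        ioctl(fd, MSHAREFS_SET_MMAP_ADDRESS, &mmap_addr);

        /* An attaching process would query the address before mapping. */
        ioctl(fd, MSHAREFS_GET_MMAP_ADDRESS, &mmap_addr);
        mmap((void *)mmap_addr, BUFFER_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);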


> 
>          c. Map some memory in the region
>                  struct mshare_create mcreate;
> 
>                  mcreate.addr = TB(2);

Can we use the offset into the virtual file instead? We should be able 
to perform that translation internally fairly easily I assume.

>                  mcreate.size = BUFFER_SIZE;
>                  mcreate.offset = 0;
>                  mcreate.prot = PROT_READ | PROT_WRITE;
>                  mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
>                  mcreate.fd = -1;
> 
>                  ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate)

Would examples with multiple mappings work already in this version?

Did you experiment with other mappings (e.g., ordinary shared file 
mappings), and what are the blockers to make that fly?

> 
>          d. Map the mshare region into the process
>                  mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
>                          MAP_SHARED, fd, 0);
> 
>          e. Write and read to mshared region normally.
> 
> 4. For processes attaching an mshare region:
>          a. Open the file on msharefs, for example -
>                  fd = open("/sys/fs/mshare/shareme", O_RDWR);
> 
>          b. Get information about mshare'd region from the file:
>                  struct mshare_info minfo;
> 
>                  ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
> 
>          c. Map the mshare'd region into the process
>                  mmap(minfo.start, minfo.size,
>                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> 
> 5. To delete the mshare region -
>                  unlink("/sys/fs/mshare/shareme");
> 

I recall discussions around cgroup accounting, OOM handling etc. I 
thought the conclusion was that we need an "mshare process" where the 
memory is accounted to, and once that process is killed (e.g., OOM), it 
must tear down all mappings/pages etc.

How does your design currently look like in that regard? E.g., how can 
OOM handling make progress, how is cgroup accounting handled?

-- 
Cheers,

David / dhildenb
Re: [PATCH 00/20] Add support for shared PTEs across processes
Posted by Anthony Yznaga 1 year ago
On 1/28/25 1:36 AM, David Hildenbrand wrote:
>> API
>> ===
>>
>> mshare does not introduce a new API. It instead uses existing APIs
>> to implement page table sharing. The steps to use this feature are:
>>
>> 1. Mount msharefs on /sys/fs/mshare -
>>          mount -t msharefs msharefs /sys/fs/mshare
>>
>> 2. mshare regions have alignment and size requirements. Start
>>     address for the region must be aligned to an address boundary and
>>     be a multiple of fixed size. This alignment and size requirement
>>     can be obtained by reading the file /sys/fs/mshare/mshare_info
>>     which returns a number in text format. mshare regions must be
>>     aligned to this boundary and be a multiple of this size.
>>
>> 3. For the process creating an mshare region:
>>          a. Create a file on /sys/fs/mshare, for example -
>>                  fd = open("/sys/fs/mshare/shareme",
>>                                  O_RDWR|O_CREAT|O_EXCL, 0600);
>>
>>          b. Establish the starting address and size of the region
>>                  struct mshare_info minfo;
>>
>>                  minfo.start = TB(2);
>>                  minfo.size = BUFFER_SIZE;
>>                  ioctl(fd, MSHAREFS_SET_SIZE, &minfo)
>
> We could set the size using ftruncate, just like for any other file. 
> It would have to be the first thing after creating the file, and 
> before we allow any other modifications.

I'll look into this.


>
> Ideally, we'd be able to get rid of the "start", use something 
> reasonable (e.g., TB(2)) internally, and allow processes to mmap() it 
> at different (suitably-aligned) addresses.
>
> I recall we discussed that in the past. Did you stumble over real 
> blockers such that we really must mmap() the file at the same address 
> in all processes? I recall some things around TLB flushing, but not 
> sure. So we might have to stick to an mmap address for now.

It's not hard to implement this. It does have the effect that rmap walks 
will find the internal VA rather than the actual VA for a given process. 
For TLB flushing this isn't a problem for the current implementation 
because all TLBs are flushed entirely. I don't know if there might be 
other complications. It does mean that an offset rather than address 
should be used when creating a mapping as you point out below.


>
> When using fallocate/stat to set/query the file size, we could end up 
> with:
>
> /*
>  * Set the address where this file can be mapped into processes. Other
>  * addresses are not supported for now, and mmap will fail. Changing the
>  * mmap address after mappings were already created is not supported.
>  */
> MSHAREFS_SET_MMAP_ADDRESS
> MSHAREFS_GET_MMAP_ADDRESS

I'll look into this, too.


>
>
>>
>>          c. Map some memory in the region
>>                  struct mshare_create mcreate;
>>
>>                  mcreate.addr = TB(2);
>
> Can we use the offset into the virtual file instead? We should be able 
> to perform that translation internally fairly easily I assume.

Yes, an offset would be preferable. Especially if mapping the same file 
at different VAs is implemented.


>
>>                  mcreate.size = BUFFER_SIZE;
>>                  mcreate.offset = 0;
>>                  mcreate.prot = PROT_READ | PROT_WRITE;
>>                  mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
>>                  mcreate.fd = -1;
>>
>>                  ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate)
>
> Would examples with multiple mappings work already in this version?
>
> Did you experiment with other mappings (e.g., ordinary shared file 
> mappings), and what are the blockers to make that fly?

Yes, multiple mappings works. And it's straightforward to make shared 
file mappings work. I have a patch where I basically just copied code 
from ksys_mmap_pgoff() into msharefs_create_mapping(). Needs some 
refactoring and finessing to make it a real patch.


>
>>
>>          d. Map the mshare region into the process
>>                  mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
>>                          MAP_SHARED, fd, 0);
>>
>>          e. Write and read to mshared region normally.
>>
>> 4. For processes attaching an mshare region:
>>          a. Open the file on msharefs, for example -
>>                  fd = open("/sys/fs/mshare/shareme", O_RDWR);
>>
>>          b. Get information about mshare'd region from the file:
>>                  struct mshare_info minfo;
>>
>>                  ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
>>
>>          c. Map the mshare'd region into the process
>>                  mmap(minfo.start, minfo.size,
>>                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>
>> 5. To delete the mshare region -
>>                  unlink("/sys/fs/mshare/shareme");
>>
>
> I recall discussions around cgroup accounting, OOM handling etc. I 
> thought the conclusion was that we need an "mshare process" where the 
> memory is accounted to, and once that process is killed (e.g., OOM), 
> it must tear down all mappings/pages etc.
>
> How does your design currently look like in that regard? E.g., how can 
> OOM handling make progress, how is cgroup accounting handled?


There was some discussion on this at last year's LSF/MM, but it seemed 
more like ideas rather than a conclusion on an approach. In any case, 
tearing down everything if an owning process is killed does not work for 
our internal use cases, and I think that was mentioned somewhere in 
discussions. Plus it seems to me that yanking the mappings away from the 
unsuspecting non-owner processes could be quite catastrophic. Shouldn't 
an mshare virtual file be treated like any other in-memory file? Or do 
such files get zapped somehow by OOM? Not saying we shouldn't do 
anything for OOM, but I'm not sure what the answer is.


Cgroups are tricky. At the mm alignment meeting last year a use case was 
brought up where it would be desirable to have all pagetable pages 
charged to one memcg rather than have them charged on a first touch 
basis. It was proposed that perhaps an mshare file could be associated with 
a cgroup at the time it is created. I have figured out a way to do this 
but I'm not versed enough in cgroups to know if the approach is viable. 
The last three patches provided this functionality as well as 
functionality that ensures a newly faulted in page is charged to the 
current process. If everything, pagetable and faulted pages, should be 
charged to the same cgroup then more work is definitely required. 
Hopefully this provides enough context to move towards a complete solution.


Anthony

Re: [PATCH 00/20] Add support for shared PTEs across processes
Posted by Bagas Sanjaya 1 year ago
On Fri, Jan 24, 2025 at 03:54:34PM -0800, Anthony Yznaga wrote:
> v1:
>   - Based on mm-unstable mm-hotfixes-stable-2025-01-16-21-11

Seems like I can't cleanly apply this series on the aforementioned tag.
Can you give me the exact base commit?

Confused...

-- 
An old man doll... just what I always wanted! - Clara
Re: [PATCH 00/20] Add support for shared PTEs across processes
Posted by Anthony Yznaga 1 year ago
On 1/27/25 11:11 PM, Bagas Sanjaya wrote:
> On Fri, Jan 24, 2025 at 03:54:34PM -0800, Anthony Yznaga wrote:
>> v1:
>>    - Based on mm-unstable mm-hotfixes-stable-2025-01-16-21-11
> Seems like I can't cleanly apply this series on the aforementioned tag.
> Can you give me the exact base commit?
>
> Confused...
>
Hmm, maybe I goofed something. Last commit was:

103978aab801 mm/compaction: fix UBSAN shift-out-of-bounds warning


Anthony
Re: [PATCH 00/20] Add support for shared PTEs across processes
Posted by Andrew Morton 1 year ago
On Fri, 24 Jan 2025 15:54:34 -0800 Anthony Yznaga <anthony.yznaga@oracle.com> wrote:

> Memory pages shared between processes require page table entries
> (PTEs) for each process. Each of these PTEs consume some of
> the memory and as long as the number of mappings being maintained
> is small enough, this space consumed by page tables is not
> objectionable. When very few memory pages are shared between
> processes, the number of PTEs to maintain is mostly constrained by
> the number of pages of memory on the system. As the number of shared
> pages and the number of times pages are shared goes up, amount of
> memory consumed by page tables starts to become significant. This
> issue does not apply to threads. Any number of threads can share the
> same pages inside a process while sharing the same PTEs. Extending
> this same model to sharing pages across processes can eliminate this
> issue for sharing across processes as well.
> 
> ...
>
> API
> ===
> 
> mshare does not introduce a new API. It instead uses existing APIs
> to implement page table sharing. The steps to use this feature are:
> 
> 1. Mount msharefs on /sys/fs/mshare -
>         mount -t msharefs msharefs /sys/fs/mshare
> 
> 2. mshare regions have alignment and size requirements. Start
>    address for the region must be aligned to an address boundary and
>    be a multiple of fixed size. This alignment and size requirement
>    can be obtained by reading the file /sys/fs/mshare/mshare_info
>    which returns a number in text format. mshare regions must be
>    aligned to this boundary and be a multiple of this size.
> 
> 3. For the process creating an mshare region:
>         a. Create a file on /sys/fs/mshare, for example -
>                 fd = open("/sys/fs/mshare/shareme",
>                                 O_RDWR|O_CREAT|O_EXCL, 0600);
> 
>         b. Establish the starting address and size of the region
>                 struct mshare_info minfo;
> 
>                 minfo.start = TB(2);
>                 minfo.size = BUFFER_SIZE;
>                 ioctl(fd, MSHAREFS_SET_SIZE, &minfo)
> 
>         c. Map some memory in the region
>                 struct mshare_create mcreate;
> 
>                 mcreate.addr = TB(2);
>                 mcreate.size = BUFFER_SIZE;
>                 mcreate.offset = 0;
>                 mcreate.prot = PROT_READ | PROT_WRITE;
>                 mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
>                 mcreate.fd = -1;
> 
>                 ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate)

I'm not really understanding why step c exists.  It's basically an
mmap() so why can't this be done within step d?

>         d. Map the mshare region into the process
>                 mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
>                         MAP_SHARED, fd, 0);
> 
>         e. Write and read to mshared region normally.
> 
> 4. For processes attaching an mshare region:
>         a. Open the file on msharefs, for example -
>                 fd = open("/sys/fs/mshare/shareme", O_RDWR);
> 
>         b. Get information about mshare'd region from the file:
>                 struct mshare_info minfo;
> 
>                 ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
> 
>         c. Map the mshare'd region into the process
>                 mmap(minfo.start, minfo.size,
>                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> 
> 5. To delete the mshare region -
>                 unlink("/sys/fs/mshare/shareme");
> 

The userspace interface is the thing we should initially consider.  I'm
having ancient memories of hugetlbfs.  Over time it was seen that
hugetlbfs was too standalone and huge pages became more (and more (and
more (and more))) integrated into regular MM code.  Can we expect a
similar evolution with pte-shared memory and if so, is this the correct
interface to be starting out with?
Re: [PATCH 00/20] Add support for shared PTEs across processes
Posted by David Hildenbrand 1 year ago
On 27.01.25 23:33, Andrew Morton wrote:
> On Fri, 24 Jan 2025 15:54:34 -0800 Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> 
>> Memory pages shared between processes require page table entries
>> (PTEs) for each process. Each of these PTEs consume some of
>> the memory and as long as the number of mappings being maintained
>> is small enough, this space consumed by page tables is not
>> objectionable. When very few memory pages are shared between
>> processes, the number of PTEs to maintain is mostly constrained by
>> the number of pages of memory on the system. As the number of shared
>> pages and the number of times pages are shared goes up, amount of
>> memory consumed by page tables starts to become significant. This
>> issue does not apply to threads. Any number of threads can share the
>> same pages inside a process while sharing the same PTEs. Extending
>> this same model to sharing pages across processes can eliminate this
>> issue for sharing across processes as well.
>>
>> ...
>>
>> API
>> ===
>>
>> mshare does not introduce a new API. It instead uses existing APIs
>> to implement page table sharing. The steps to use this feature are:
>>
>> 1. Mount msharefs on /sys/fs/mshare -
>>          mount -t msharefs msharefs /sys/fs/mshare
>>
>> 2. mshare regions have alignment and size requirements. Start
>>     address for the region must be aligned to an address boundary and
>>     be a multiple of fixed size. This alignment and size requirement
>>     can be obtained by reading the file /sys/fs/mshare/mshare_info
>>     which returns a number in text format. mshare regions must be
>>     aligned to this boundary and be a multiple of this size.
>>
>> 3. For the process creating an mshare region:
>>          a. Create a file on /sys/fs/mshare, for example -
>>                  fd = open("/sys/fs/mshare/shareme",
>>                                  O_RDWR|O_CREAT|O_EXCL, 0600);
>>
>>          b. Establish the starting address and size of the region
>>                  struct mshare_info minfo;
>>
>>                  minfo.start = TB(2);
>>                  minfo.size = BUFFER_SIZE;
>>                  ioctl(fd, MSHAREFS_SET_SIZE, &minfo)
>>
>>          c. Map some memory in the region
>>                  struct mshare_create mcreate;
>>
>>                  mcreate.addr = TB(2);
>>                  mcreate.size = BUFFER_SIZE;
>>                  mcreate.offset = 0;
>>                  mcreate.prot = PROT_READ | PROT_WRITE;
>>                  mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
>>                  mcreate.fd = -1;
>>
>>                  ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate)
> 
> I'm not really understanding why step c exists.  It's basically an
> mmap() so why can't this be done within step d?

Conceptually, it's defining the content of the virtual file: by creating 
mappings/unmapping mappings/changing mappings. Some applications will 
require multiple different mappings in such a virtual file.

Processes mmap the resulting virtual file.

-- 
Cheers,

David / dhildenb
Re: [PATCH 00/20] Add support for shared PTEs across processes
Posted by Anthony Yznaga 1 year ago
On 1/27/25 2:33 PM, Andrew Morton wrote:
> On Fri, 24 Jan 2025 15:54:34 -0800 Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>
>> Memory pages shared between processes require page table entries
>> (PTEs) for each process. Each of these PTEs consume some of
>> the memory and as long as the number of mappings being maintained
>> is small enough, this space consumed by page tables is not
>> objectionable. When very few memory pages are shared between
>> processes, the number of PTEs to maintain is mostly constrained by
>> the number of pages of memory on the system. As the number of shared
>> pages and the number of times pages are shared goes up, amount of
>> memory consumed by page tables starts to become significant. This
>> issue does not apply to threads. Any number of threads can share the
>> same pages inside a process while sharing the same PTEs. Extending
>> this same model to sharing pages across processes can eliminate this
>> issue for sharing across processes as well.
>>
>> ...
>>
>> API
>> ===
>>
>> mshare does not introduce a new API. It instead uses existing APIs
>> to implement page table sharing. The steps to use this feature are:
>>
>> 1. Mount msharefs on /sys/fs/mshare -
>>          mount -t msharefs msharefs /sys/fs/mshare
>>
>> 2. mshare regions have alignment and size requirements. Start
>>     address for the region must be aligned to an address boundary and
>>     be a multiple of fixed size. This alignment and size requirement
>>     can be obtained by reading the file /sys/fs/mshare/mshare_info
>>     which returns a number in text format. mshare regions must be
>>     aligned to this boundary and be a multiple of this size.
>>
>> 3. For the process creating an mshare region:
>>          a. Create a file on /sys/fs/mshare, for example -
>>                  fd = open("/sys/fs/mshare/shareme",
>>                                  O_RDWR|O_CREAT|O_EXCL, 0600);
>>
>>          b. Establish the starting address and size of the region
>>                  struct mshare_info minfo;
>>
>>                  minfo.start = TB(2);
>>                  minfo.size = BUFFER_SIZE;
>>                  ioctl(fd, MSHAREFS_SET_SIZE, &minfo)
>>
>>          c. Map some memory in the region
>>                  struct mshare_create mcreate;
>>
>>                  mcreate.addr = TB(2);
>>                  mcreate.size = BUFFER_SIZE;
>>                  mcreate.offset = 0;
>>                  mcreate.prot = PROT_READ | PROT_WRITE;
>>                  mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
>>                  mcreate.fd = -1;
>>
>>                  ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate)
> I'm not really understanding why step c exists.  It's basically an
> mmap() so why can't this be done within step d?

One way to think of it is that step d establishes a window to the mshare 
region and the objects mapped within it.

Discussions on earlier iterations of mshare pushed back strongly on 
introducing special casing in the mmap path to redirect mmaps that fell 
within an mshare region to map into an mshare mm. Even then it gets 
messier for munmap, i.e. does an unmap of the whole range mean unmap the 
window or unmap the objects within it.

>
>>          d. Map the mshare region into the process
>>                  mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
>>                          MAP_SHARED, fd, 0);
>>
>>          e. Write and read to mshared region normally.
>>
>> 4. For processes attaching an mshare region:
>>          a. Open the file on msharefs, for example -
>>                  fd = open("/sys/fs/mshare/shareme", O_RDWR);
>>
>>          b. Get information about mshare'd region from the file:
>>                  struct mshare_info minfo;
>>
>>                  ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
>>
>>          c. Map the mshare'd region into the process
>>                  mmap(minfo.start, minfo.size,
>>                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>
>> 5. To delete the mshare region -
>>                  unlink("/sys/fs/mshare/shareme");
>>
> The userspace interface is the thing we should initially consider.  I'm
> having ancient memories of hugetlbfs.  Over time it was seen that
> hugetlbfs was too standalone and huge pages became more (and more (and
> more (and more))) integrated into regular MM code.  Can we expect a
> similar evolution with pte-shared memory and if so, is this the correct
> interface to be starting out with?

I don't know. This is an approach that has been refined through a number 
of discussions, but I'm certainly open to alternatives.


Anthony