[RFC PATCH 0/6] Introduce Copy-On-Write to Page Table

Chih-En Lin posted 6 patches 3 years, 11 months ago
There is a newer version of this series
When creating a user process, the kernel usually uses the Copy-On-Write
(COW) mechanism to save memory and the time cost of copying. COW defers
the work of copying private memory and shares it across the processes
as read-only. If either process wants to write to this memory, it takes
a page fault and copies the shared pages, so the process gets its own
private memory right there; this is called breaking COW.

Presently this technique is only applied to the mapped memory itself;
the entire page table still needs to be copied from the parent. This
can cost a lot of time and memory when the parent already has many page
tables allocated. For example, here is the memory state for forking a
process that maps 1 GB of memory.

	    mmap before fork         mmap after fork
MemTotal:       32746776 kB             32746776 kB
MemFree:        31468152 kB             31463244 kB
AnonPages:       1073836 kB              1073628 kB
Mapped:            39520 kB                39992 kB
PageTables:         3356 kB                 5432 kB

This series introduces Copy-On-Write for the page table itself. It
only implements COW at the PTE level, and it is based on the paper
On-Demand Fork [1]. Summary of the paper's implementation:

- Only implements COW for anonymous mappings.
- Only does COW on a PTE table whose range is entirely covered by a
  single VMA.
- Uses a reference count to control the COW PTE table's lifetime.
  Decrease the counter when breaking COW or dereferencing the COW PTE
  table; when the counter drops to zero, free the PTE table.
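The reference-count rule in the last bullet can be sketched as follows (the struct and helper names here are illustrative, not the symbols added by this series):

```c
#include <assert.h>

/* Illustrative model of the COW PTE table lifetime: each process
 * sharing the table holds one reference; the table is freed when
 * the last reference is dropped. */
struct cow_pte_table {
    int refcount;   /* processes currently sharing this PTE table */
    int freed;      /* stand-in for the table having been freed */
};

/* fork(): the child starts sharing the parent's PTE table. */
static void cow_pte_get(struct cow_pte_table *t)
{
    t->refcount++;
}

/* Break COW, or drop a reference on unmap/exit: free on last put. */
static void cow_pte_put(struct cow_pte_table *t)
{
    assert(t->refcount > 0);
    if (--t->refcount == 0)
        t->freed = 1;   /* stand-in for freeing the PTE table page */
}
```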

The paper is based on v5.6, and this patch is for v5.18-rc6. This
patch also has some differences from the paper's version. To reduce
the work of duplicating page tables, I relaxed the restrictions on the
COW page table: excluding brk and shared memory, it does COW on all
the PTE tables. When the reference count is one, we reuse the table
while breaking COW. To handle the page table state of the process, it
adds an ownership to the COW PTE table: the address of the PMD entry
is used as the owner of the PTE table, so the COW PTE table's state
can be maintained in the owner's RSS and pgtable_bytes.

If we did the COW of a PTE table each time we touch a PMD entry, the
reference count of the COW PTE table could not be preserved correctly.
Since the address range of a VMA may cover only part of a PTE table,
the copying function uses the VMAs to traverse the page table, so it
may increase the reference count of the same COW PTE table multiple
times in one COW page table fork, while generically it should only
increase once as the child references it. To solve this problem,
before doing the COW it checks whether the destination PMD entry
already exists and whether the reference count of the source PTE table
is already more than one.
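One way to picture that check (the names below are purely illustrative, not the series' actual copy path): when the per-VMA copy loop revisits a PMD slot whose child entry already points at the shared PTE table, the reference count must not be bumped again.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: several VMAs may fall into the same PTE table, so the
 * per-VMA copy loop can visit one PMD slot repeatedly in one fork.
 * The child's reference is counted only when its PMD entry is first
 * populated. */
struct pmd_slot { void *pte_table; };

static int share_pte_table(struct pmd_slot *dst, void *src_table,
                           int refcount)
{
    if (dst->pte_table == src_table)
        return refcount;        /* already shared: do not count twice */
    dst->pte_table = src_table; /* child's PMD entry now set */
    return refcount + 1;        /* one reference per child, exactly once */
}
```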

Here is the memory state for forking a process that maps 1 GB of
memory with this patch applied.

            mmap before fork         mmap after fork
MemTotal:       32746776 kB             32746776 kB
MemFree:        31471324 kB             31468888 kB
AnonPages:       1073628 kB              1073660 kB
Mapped:            39264 kB                39504 kB
PageTables:         3304 kB                 3396 kB

TODO list:
- Handle swap.
- Rewrite the TLB flush for zapping the COW PTE table.
- Experiment with COW on the entire page table. (Now just for the PTE level.)
- Bug in some cases from copy_pte_range()::vm_normal_page()::print_bad_pte().
- Bug: bad RSS counter when doing COW PTE table forking multiple times.

[1] https://dl.acm.org/doi/10.1145/3447786.3456258

This patch is based on v5.18-rc6.

---

Chih-En Lin (6):
  mm: Add a new mm flag for Copy-On-Write PTE table
  mm: clone3: Add CLONE_COW_PGTABLE flag
  mm, pgtable: Add ownership for the PTE table
  mm: Add COW PTE fallback function
  mm, pgtable: Add the reference counter for COW PTE
  mm: Expand Copy-On-Write to PTE table

 include/linux/mm.h             |   2 +
 include/linux/mm_types.h       |   2 +
 include/linux/pgtable.h        |  44 +++++
 include/linux/sched/coredump.h |   5 +-
 include/uapi/linux/sched.h     |   1 +
 kernel/fork.c                  |   6 +-
 mm/memory.c                    | 329 ++++++++++++++++++++++++++++++---
 mm/mmap.c                      |   4 +
 mm/mremap.c                    |   5 +
 9 files changed, 373 insertions(+), 25 deletions(-)

-- 
2.36.1
Re: [External] [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
Posted by Qi Zheng 3 years, 11 months ago

On 2022/5/20 2:31 AM, Chih-En Lin wrote:
> When creating a user process, the kernel usually uses the Copy-On-Write
> (COW) mechanism to save memory and the time cost of copying. COW defers
> the work of copying private memory and shares it across the processes
> as read-only. If either process wants to write to this memory, it takes
> a page fault and copies the shared pages, so the process gets its own
> private memory right there; this is called breaking COW.
> 
> Presently this technique is only applied to the mapped memory itself;
> the entire page table still needs to be copied from the parent. This
> can cost a lot of time and memory when the parent already has many page
> tables allocated. For example, here is the memory state for forking a
> process that maps 1 GB of memory.
> 
> 	    mmap before fork         mmap after fork
> MemTotal:       32746776 kB             32746776 kB
> MemFree:        31468152 kB             31463244 kB
> AnonPages:       1073836 kB              1073628 kB
> Mapped:            39520 kB                39992 kB
> PageTables:         3356 kB                 5432 kB
> 
> This series introduces Copy-On-Write for the page table itself. It
> only implements COW at the PTE level, and it is based on the paper
> On-Demand Fork [1]. Summary of the paper's implementation:
> 
> - Only implements COW for anonymous mappings.
> - Only does COW on a PTE table whose range is entirely covered by a
>    single VMA.
> - Uses a reference count to control the COW PTE table's lifetime.
>    Decrease the counter when breaking COW or dereferencing the COW PTE
>    table; when the counter drops to zero, free the PTE table.
> 

Hi,

To reduce empty user PTE tables, I also introduced a reference
count (pte_ref) for user PTE tables in my patches [1][2]. It is used
to track the usage of each user PTE table.

The following will hold a pte_ref:
  - A !pte_none() entry, such as a regular page table entry that maps
    physical pages, or a swap entry, or a migration entry, etc.
  - A visitor to the PTE page table entries, such as a page table walker.
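A toy model of the pte_ref rule above (illustrative names, not the code from the linked series): every !pte_none() entry and every active visitor contributes one reference, and the PTE table page becomes freeable when the count drops to zero.

```c
#include <assert.h>

/* Each mapped (!pte_none()) entry and each visitor (e.g. a page
 * table walker) holds one pte_ref on the PTE table page. */
struct pte_table_page { int pte_ref; };

static void pte_ref_get(struct pte_table_page *p)
{
    p->pte_ref++;
}

/* Returns 1 when the last reference is dropped, i.e. when the empty
 * PTE table page can be freed. */
static int pte_ref_put(struct pte_table_page *p)
{
    assert(p->pte_ref > 0);
    return --p->pte_ref == 0;
}
```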

With COW PTE, a new holder (the process using the COW PTE) is added.

It's funny; it leads me to see more meaning in pte_ref.

Thanks,
Qi

[1] [RFC PATCH 00/18] Try to free user PTE page table pages
     link: 
https://lore.kernel.org/lkml/20220429133552.33768-1-zhengqi.arch@bytedance.com/
     (percpu_ref version)

[2] [PATCH v3 00/15] Free user PTE page table pages
     link: 
https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/
     (atomic count version)

-- 
Thanks,
Qi
Re: [External] [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
Posted by Chih-En Lin 3 years, 11 months ago
On Sat, May 21, 2022 at 04:59:19PM +0800, Qi Zheng wrote:
> Hi,
> 
> To reduce empty user PTE tables, I also introduced a reference
> count (pte_ref) for user PTE tables in my patches [1][2]. It is used
> to track the usage of each user PTE table.
> 
> The following will hold a pte_ref:
>  - A !pte_none() entry, such as a regular page table entry that maps
>    physical pages, or a swap entry, or a migration entry, etc.
>  - A visitor to the PTE page table entries, such as a page table walker.
> 
> With COW PTE, a new holder (the process using the COW PTE) is added.
> 
> It's funny; it leads me to see more meaning in pte_ref.
> 
> Thanks,
> Qi
> 
> [1] [RFC PATCH 00/18] Try to free user PTE page table pages
>     link: https://lore.kernel.org/lkml/20220429133552.33768-1-zhengqi.arch@bytedance.com/
>     (percpu_ref version)
> 
> [2] [PATCH v3 00/15] Free user PTE page table pages
>     link: https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/
>     (atomic count version)
> 
> -- 
> Thanks,
> Qi

Hi,

I saw your patches a few months ago.
Actually, my independent study at school is on tracing the page table,
and one of the topics is your patches. Your pte_ref was really helpful.
It's great to see you have more ideas for pte_ref.

Thanks.
Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
Posted by David Hildenbrand 3 years, 11 months ago
On 19.05.22 20:31, Chih-En Lin wrote:
> When creating a user process, the kernel usually uses the Copy-On-Write
> (COW) mechanism to save memory and the time cost of copying. COW defers
> the work of copying private memory and shares it across the processes
> as read-only. If either process wants to write to this memory, it takes
> a page fault and copies the shared pages, so the process gets its own
> private memory right there; this is called breaking COW.

Yes. Lately we've been dealing with advanced COW+GUP pinnings (which
resulted in PageAnonExclusive, which should hit upstream soon), and
hearing about COW of page tables (and wondering how it will interact
with the mapcount, refcount, PageAnonExclusive of anonymous pages) makes
me feel a bit uneasy :)

> 
> Presently this technique is only applied to the mapped memory itself;
> the entire page table still needs to be copied from the parent. This
> can cost a lot of time and memory when the parent already has many page
> tables allocated. For example, here is the memory state for forking a
> process that maps 1 GB of memory.
> 
> 	    mmap before fork         mmap after fork
> MemTotal:       32746776 kB             32746776 kB
> MemFree:        31468152 kB             31463244 kB
> AnonPages:       1073836 kB              1073628 kB
> Mapped:            39520 kB                39992 kB
> PageTables:         3356 kB                 5432 kB


I'm missing the most important point: why do we care and why should we
care to make our COW/fork implementation even more complicated?

Yes, we might save some page tables and we might reduce the fork() time,
however, which specific workload really benefits from this and why do we
really care about that workload? Without even hearing about an example
user in this cover letter (unless I missed it), I naturally wonder about
relevance in practice.

I assume it really only matters if we fork() relatively large processes,
like databases for snapshotting. However, fork() is already a pretty
severe performance hit due to COW, and there are alternatives getting
developed as a replacement for such use cases (e.g., uffd-wp).

I'm also missing a performance evaluation: I'd expect some simple
workloads that use fork() might be even slower after fork() with this
change.

(I don't have time to read the paper, I'd expect an independent summary
in the cover letter)


I have tons of questions regarding rmap, accounting, GUP, page table
walkers, OOM situations in page walkers, but at this point I am not
(yet) convinced that the added complexity is really worth it. So I'd
appreciate some additional information.



[...]

> TODO list:
> - Handle the swap

Scary if that's not easy to handle :/

> - Rewrite the TLB flush for zapping the COW PTE table.
> - Experiment COW to the entire page table. (Now just for PTE level)
> - Bug in some case from copy_pte_range()::vm_normal_page()::print_bad_pte().
> - Bug of Bad RSS counter in multiple times COW PTE table forking.



-- 
Thanks,

David / dhildenb
Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
Posted by Matthew Wilcox 3 years, 11 months ago
On Sat, May 21, 2022 at 06:07:27PM +0200, David Hildenbrand wrote:
> I'm missing the most important point: why do we care and why should we
> care to make our COW/fork implementation even more complicated?
> 
> Yes, we might save some page tables and we might reduce the fork() time,
> however, which specific workload really benefits from this and why do we
> really care about that workload? Without even hearing about an example
> user in this cover letter (unless I missed it), I naturally wonder about
> relevance in practice.

As I get older (and crankier), I get less convinced that fork() is
really the right solution for implementing system().  I feel that a
better model is to create a process with zero threads, but have an fd
to it.  Then manipulate the child process through its fd (eg mmap
ld.so, open new fds in that process's fdtable, etc).  Closing the fd
launches a new thread in the process (ensuring nobody has an fd to a
running process, particularly one which is setuid).
Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
Posted by Andy Lutomirski 3 years, 11 months ago
On 5/21/22 13:12, Matthew Wilcox wrote:
> On Sat, May 21, 2022 at 06:07:27PM +0200, David Hildenbrand wrote:
>> I'm missing the most important point: why do we care and why should we
>> care to make our COW/fork implementation even more complicated?
>>
>> Yes, we might save some page tables and we might reduce the fork() time,
>> however, which specific workload really benefits from this and why do we
>> really care about that workload? Without even hearing about an example
>> user in this cover letter (unless I missed it), I naturally wonder about
>> relevance in practice.
> 
> As I get older (and crankier), I get less convinced that fork() is
> really the right solution for implementing system().  I feel that a
> better model is to create a process with zero threads, but have an fd
> to it.  Then manipulate the child process through its fd (eg mmap
> ld.so, open new fds in that process's fdtable, etc).  Closing the fd
> launches a new thread in the process (ensuring nobody has an fd to a
> running process, particularly one which is setuid).

Heh, I learned serious programming on Windows, and I thought fork() was 
entertaining, cool, and a bad idea when I first learned about it.  (I 
admit I did think the fact that POSIX fork and exec had many fewer 
arguments than CreateProcess was a good thing.)  Don't even get me 
started on setuid -- if I had my way, distros would set NO_NEW_PRIVS on 
boot for the entire system.

I can see a rather different use for this type of shared-pagetable 
technology, though: monstrous MAP_SHARED mappings.  For database and 
some VM users, multiple processes will map the same file.  If there was 
a way to ensure appropriate alignment (or at least encourage it) and a 
way to handle mappings that don't cover the whole file, then having 
multiple mappings share the same page tables could be a decent 
efficiency gain.  This doesn't even need COW -- it's "just" pagetable 
sharing.

It's probably a pipe dream, but I like to imagine that the bookkeeping 
that would enable this would also enable a much less ad-hoc concept of 
who owns which pagetable page.  Then things like x86's KPTI LDT mappings 
would be less disgusting under the hood.

Android would probably like a similar feature for MAP_ANONYMOUS or that 
could otherwise enable Zygote to share paging structures (ideally 
without fork(), although that's my dream, not necessarily Android's). 
This is more complex, since COW is involved.  Also possibly less 
valuable -- possibly the entire benefit and then some would be achieved 
by using huge pages for Zygote and arranging for CoWing one normal-size 
page out of a hugepage COW mapping to only COW the one page.

--Andy
Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
Posted by Matthew Wilcox 3 years, 11 months ago
On Sat, May 21, 2022 at 03:19:24PM -0700, Andy Lutomirski wrote:
> I can see a rather different use for this type of shared-pagetable
> technology, though: monstrous MAP_SHARED mappings.  For database and some VM
> users, multiple processes will map the same file.  If there was a way to
> ensure appropriate alignment (or at least encourage it) and a way to handle
> mappings that don't cover the whole file, then having multiple mappings
> share the same page tables could be a decent efficiency gain.  This doesn't
> even need COW -- it's "just" pagetable sharing.

The mshare proposal did not get a warm reception at LSFMM ;-(

The conceptual model doesn't seem to work for the MM developers who were
in the room.  "Fear" was the most-used word.  Not sure how we're going
to get to a model of sharing page tables that doesn't scare people.
Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
Posted by Andy Lutomirski 3 years, 11 months ago

On Sat, May 21, 2022, at 5:31 PM, Matthew Wilcox wrote:
> On Sat, May 21, 2022 at 03:19:24PM -0700, Andy Lutomirski wrote:
>> I can see a rather different use for this type of shared-pagetable
>> technology, though: monstrous MAP_SHARED mappings.  For database and some VM
>> users, multiple processes will map the same file.  If there was a way to
>> ensure appropriate alignment (or at least encourage it) and a way to handle
>> mappings that don't cover the whole file, then having multiple mappings
>> share the same page tables could be a decent efficiency gain.  This doesn't
>> even need COW -- it's "just" pagetable sharing.
>
> The mshare proposal did not get a warm reception at LSFMM ;-(
>
> The conceptual model doesn't seem to work for the MM developers who were
> in the room.  "Fear" was the most-used word.  Not sure how we're going
> to get to a model of sharing page tables that doesn't scare people.

FWIW, I didn’t like mshare.  mshare was weird: it seemed to have one mm own some page tables and other mms share them.  I’m talking about having a *file* own page tables and mms map them.  This seems less fear-inducing to me.  Circular dependencies are impossible, mmap calls don’t need to propagate, etc.

It would still be quite a change, though.
Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
Posted by Matthew Wilcox 3 years, 11 months ago
On Sun, May 22, 2022 at 08:20:05AM -0700, Andy Lutomirski wrote:
> On Sat, May 21, 2022, at 5:31 PM, Matthew Wilcox wrote:
> > On Sat, May 21, 2022 at 03:19:24PM -0700, Andy Lutomirski wrote:
> >> I can see a rather different use for this type of shared-pagetable
> >> technology, though: monstrous MAP_SHARED mappings.  For database and some VM
> >> users, multiple processes will map the same file.  If there was a way to
> >> ensure appropriate alignment (or at least encourage it) and a way to handle
> >> mappings that don't cover the whole file, then having multiple mappings
> >> share the same page tables could be a decent efficiency gain.  This doesn't
> >> even need COW -- it's "just" pagetable sharing.
> >
> > The mshare proposal did not get a warm reception at LSFMM ;-(
> >
> > The conceptual model doesn't seem to work for the MM developers who were
> > in the room.  "Fear" was the most-used word.  Not sure how we're going
> > to get to a model of sharing page tables that doesn't scare people.
> 
> FWIW, I didn’t like mshare.  mshare was weird: it seemed to have
> one mm own some page tables and other mms share them.  I’m talking
> about having a *file* own page tables and mms map them.  This seems less
> fear-inducing to me.  Circular dependencies are impossible, mmap calls
> don’t need to propagate, etc.

OK, so that doesn't work for our use case.  We need an object to own page
tables that can be shared between different (co-operating) processes.
Because we need the property that calling mprotect() changes the
protection in all processes at the same time.

Obviously we want that object to be referenced by a file descriptor, and
it can also have a name.  That object doesn't have to be an mm_struct.
Maybe that would be enough of a change to remove the fear.
Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
Posted by David Hildenbrand 3 years, 11 months ago
On 21.05.22 22:12, Matthew Wilcox wrote:
> On Sat, May 21, 2022 at 06:07:27PM +0200, David Hildenbrand wrote:
>> I'm missing the most important point: why do we care and why should we
>> care to make our COW/fork implementation even more complicated?
>>
>> Yes, we might save some page tables and we might reduce the fork() time,
>> however, which specific workload really benefits from this and why do we
>> really care about that workload? Without even hearing about an example
>> user in this cover letter (unless I missed it), I naturally wonder about
>> relevance in practice.
> 
> As I get older (and crankier), I get less convinced that fork() is
> really the right solution for implementing system().

Heh, I couldn't agree more. IMHO, fork() is mostly a blast from the
past. There *are* still a lot of users, and there are a couple of sane
use cases.

Consequently, I am not convinced that it is something to optimize for,
especially if it adds additional complexity. For the use case of
snapshotting, we have better mechanisms nowadays (uffd-wp) that avoid
messing with copying address spaces.

Calling fork()/system() from a big, performance-sensitive process is
usually a bad idea.

Note: there is a (for me) interesting paper about this topic from 2019
("A fork() in the road") [1], although it might be a bit biased coming
from Microsoft research :). It comes to a similar conclusion regarding
fork and how it should or shouldn't dictate our OS design.

[1] https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/

-- 
Thanks,

David / dhildenb
Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
Posted by Chih-En Lin 3 years, 11 months ago
On Sat, May 21, 2022 at 06:07:27PM +0200, David Hildenbrand wrote:
> On 19.05.22 20:31, Chih-En Lin wrote:
> > When creating a user process, the kernel usually uses the Copy-On-Write
> > (COW) mechanism to save memory and the time cost of copying. COW defers
> > the work of copying private memory and shares it across the processes
> > as read-only. If either process wants to write to this memory, it takes
> > a page fault and copies the shared pages, so the process gets its own
> > private memory right there; this is called breaking COW.
> 
> Yes. Lately we've been dealing with advanced COW+GUP pinnings (which
> resulted in PageAnonExclusive, which should hit upstream soon), and
> hearing about COW of page tables (and wondering how it will interact
> with the mapcount, refcount, PageAnonExclusive of anonymous pages) makes
> me feel a bit uneasy :)

I saw this patch series and know how complicated handling COW of
the physical page is [1][2][3][4]. So the COW page table tends to
restrict the sharing to the page table only. This means any
modification of a physical page will trigger breaking COW of the page
table.

The present implementation will only update the physical page
information in the RSS of the owner process of the COW PTE. Generally,
the owner is the parent process. And the state of the page, like
refcount and mapcount, will not change under the COW page table.

But if any situation leads to the COW page table needing to consider
the state of the physical page, it might be fretful. ;-)

> > 
> > Presently this technique is only applied to the mapped memory itself;
> > the entire page table still needs to be copied from the parent. This
> > can cost a lot of time and memory when the parent already has many page
> > tables allocated. For example, here is the memory state for forking a
> > process that maps 1 GB of memory.
> > 
> > 	    mmap before fork         mmap after fork
> > MemTotal:       32746776 kB             32746776 kB
> > MemFree:        31468152 kB             31463244 kB
> > AnonPages:       1073836 kB              1073628 kB
> > Mapped:            39520 kB                39992 kB
> > PageTables:         3356 kB                 5432 kB
> 
> 
> I'm missing the most important point: why do we care and why should we
> care to make our COW/fork implementation even more complicated?
> 
> Yes, we might save some page tables and we might reduce the fork() time,
> however, which specific workload really benefits from this and why do we
> really care about that workload? Without even hearing about an example
> user in this cover letter (unless I missed it), I naturally wonder about
> relevance in practice.
> 
> I assume it really only matters if we fork() relatively large processes,
> like databases for snapshotting. However, fork() is already a pretty
> severe performance hit due to COW, and there are alternatives getting
> developed as a replacement for such use cases (e.g., uffd-wp).
> 
> I'm also missing a performance evaluation: I'd expect some simple
> workloads that use fork() might be even slower after fork() with this
> change.
> 

The paper includes a list of benchmarks of the time cost of On-Demand
Fork. For example, on Redis, the mean time of fork() when taking the
snapshot: default fork() took 7.40 ms; On-Demand Fork (COW PTE tables)
took 0.12 ms. But some other cases, like the response latency
distribution of the Apache HTTP Server, do not benefit significantly
from On-Demand Fork.

For the COW page table from this patch, I also used perf to analyze
the time cost, but it does not look different from the default fork().

Here is the report; mmap-sfork is the COW page table version:

 Performance counter stats for './mmap-fork' (100 runs):

            373.92 msec task-clock                #    0.992 CPUs utilized            ( +-  0.09% )
                 1      context-switches          #    2.656 /sec                     ( +-  6.03% )
                 0      cpu-migrations            #    0.000 /sec
               881      page-faults               #    2.340 K/sec                    ( +-  0.02% )
     1,860,460,792      cycles                    #    4.941 GHz                      ( +-  0.08% )
     1,451,024,912      instructions              #    0.78  insn per cycle           ( +-  0.00% )
       310,129,843      branches                  #  823.559 M/sec                    ( +-  0.01% )
         1,552,469      branch-misses             #    0.50% of all branches          ( +-  0.38% )

          0.377007 +- 0.000480 seconds time elapsed  ( +-  0.13% )

 Performance counter stats for './mmap-sfork' (100 runs):

            373.04 msec task-clock                #    0.992 CPUs utilized            ( +-  0.10% )
                 1      context-switches          #    2.660 /sec                     ( +-  6.58% )
                 0      cpu-migrations            #    0.000 /sec
               877      page-faults               #    2.333 K/sec                    ( +-  0.08% )
     1,851,843,683      cycles                    #    4.926 GHz                      ( +-  0.08% )
     1,451,763,414      instructions              #    0.78  insn per cycle           ( +-  0.00% )
       310,270,268      branches                  #  825.352 M/sec                    ( +-  0.01% )
         1,649,486      branch-misses             #    0.53% of all branches          ( +-  0.49% )

          0.376095 +- 0.000478 seconds time elapsed  ( +-  0.13% )

So, COW of the page table may reduce the time of forking, but it does
so by transferring the copy work to later operations that modify the
physical pages.

> (I don't have time to read the paper, I'd expect an independent summary
> in the cover letter)

Sure, I will add more performance evaluations and descriptions in the
next version.

> I have tons of questions regarding rmap, accounting, GUP, page table
> walkers, OOM situations in page walkers, but at this point I am not
> (yet) convinced that the added complexity is really worth it. So I'd
> appreciate some additional information.

It seems like I have a lot of work to do. ;-)

> 
> [...]
> 
> > TODO list:
> > - Handle the swap
> 
> Scary if that's not easy to handle :/

;-)

> -- 
> Thanks,
> 
> David / dhildenb
>

Thanks!

[1] https://lore.kernel.org/all/20220131162940.210846-1-david@redhat.com/T/
[2] https://lore.kernel.org/linux-mm/20220315104741.63071-2-david@redhat.com/T/
[3] https://lore.kernel.org/linux-mm/51afa7a7-15c5-8769-78db-ed2d134792f4@redhat.com/T/
[4] https://lore.kernel.org/all/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com/
Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
Posted by David Hildenbrand 3 years, 11 months ago
On 21.05.22 20:50, Chih-En Lin wrote:
> On Sat, May 21, 2022 at 06:07:27PM +0200, David Hildenbrand wrote:
>> On 19.05.22 20:31, Chih-En Lin wrote:
>>> When creating a user process, the kernel usually uses the Copy-On-Write
>>> (COW) mechanism to save memory and the time cost of copying. COW defers
>>> the work of copying private memory and shares it across the processes
>>> as read-only. If either process wants to write to this memory, it takes
>>> a page fault and copies the shared pages, so the process gets its own
>>> private memory right there; this is called breaking COW.
>>
>> Yes. Lately we've been dealing with advanced COW+GUP pinnings (which
>> resulted in PageAnonExclusive, which should hit upstream soon), and
>> hearing about COW of page tables (and wondering how it will interact
>> with the mapcount, refcount, PageAnonExclusive of anonymous pages) makes
>> me feel a bit uneasy :)
> 
> I saw this patch series and know how complicated handling COW of
> the physical page is [1][2][3][4]. So the COW page table tends to
> restrict the sharing to the page table only. This means any
> modification of a physical page will trigger breaking COW of the page
> table.
> 
> The present implementation will only update the physical page
> information in the RSS of the owner process of the COW PTE. Generally,
> the owner is the parent process. And the state of the page, like
> refcount and mapcount, will not change under the COW page table.
> 
> But if any situation leads to the COW page table needing to consider
> the state of the physical page, it might be fretful. ;-)

I haven't looked into the details of how GUP deals with these COW page
tables. But I suspect there might be problems with page pinning:
skipping copy_present_page() even for R/O pages is usually problematic
with R/O pinnings of pages. I might be just wrong.

> 
>>>
>>> Presently this technique is only applied to the mapped memory itself;
>>> the entire page table still needs to be copied from the parent. This
>>> can cost a lot of time and memory when the parent already has many page
>>> tables allocated. For example, here is the memory state for forking a
>>> process that maps 1 GB of memory.
>>>
>>> 	    mmap before fork         mmap after fork
>>> MemTotal:       32746776 kB             32746776 kB
>>> MemFree:        31468152 kB             31463244 kB
>>> AnonPages:       1073836 kB              1073628 kB
>>> Mapped:            39520 kB                39992 kB
>>> PageTables:         3356 kB                 5432 kB
>>
>>
>> I'm missing the most important point: why do we care and why should we
>> care to make our COW/fork implementation even more complicated?
>>
>> Yes, we might save some page tables and we might reduce the fork() time,
>> however, which specific workload really benefits from this and why do we
>> really care about that workload? Without even hearing about an example
>> user in this cover letter (unless I missed it), I naturally wonder about
>> relevance in practice.
>>
>> I assume it really only matters if we fork() relatively large processes,
>> like databases for snapshotting. However, fork() is already a pretty
>> severe performance hit due to COW, and there are alternatives getting
>> developed as a replacement for such use cases (e.g., uffd-wp).
>>
>> I'm also missing a performance evaluation: I'd expect some simple
>> workloads that use fork() might be even slower after fork() with this
>> change.
>>
> 
> The paper includes a list of benchmarks of the time cost of On-Demand
> Fork. For example, on Redis, the mean time of fork() when taking the
> snapshot: default fork() took 7.40 ms; On-Demand Fork (COW PTE tables)
> took 0.12 ms. But some other cases, like the response latency
> distribution of the Apache HTTP Server, do not benefit significantly
> from On-Demand Fork.

Thanks. I expected that snapshotting would pop up and be one of the most
prominent users that could benefit. However, for that specific use case
I am convinced that uffd-wp is the better choice and fork() is just the
old way of doing it, having nothing better at hand. QEMU already
implements snapshotting of VMs that way and I remember that redis also
intended to implement support for uffd-wp. Not sure what happened with
that and if there is anything missing to make it work.

> 
> For the COW page table from this patch, I also used perf to analyze
> the time cost, but it does not look different from the default fork().

Interesting, thanks for sharing.

> 
> Here is the report; mmap-sfork is the COW page table version:
> 
>  Performance counter stats for './mmap-fork' (100 runs):
> 
>             373.92 msec task-clock                #    0.992 CPUs utilized            ( +-  0.09% )
>                  1      context-switches          #    2.656 /sec                     ( +-  6.03% )
>                  0      cpu-migrations            #    0.000 /sec
>                881      page-faults               #    2.340 K/sec                    ( +-  0.02% )
>      1,860,460,792      cycles                    #    4.941 GHz                      ( +-  0.08% )
>      1,451,024,912      instructions              #    0.78  insn per cycle           ( +-  0.00% )
>        310,129,843      branches                  #  823.559 M/sec                    ( +-  0.01% )
>          1,552,469      branch-misses             #    0.50% of all branches          ( +-  0.38% )
> 
>           0.377007 +- 0.000480 seconds time elapsed  ( +-  0.13% )
> 
>  Performance counter stats for './mmap-sfork' (100 runs):
> 
>             373.04 msec task-clock                #    0.992 CPUs utilized            ( +-  0.10% )
>                  1      context-switches          #    2.660 /sec                     ( +-  6.58% )
>                  0      cpu-migrations            #    0.000 /sec
>                877      page-faults               #    2.333 K/sec                    ( +-  0.08% )
>      1,851,843,683      cycles                    #    4.926 GHz                      ( +-  0.08% )
>      1,451,763,414      instructions              #    0.78  insn per cycle           ( +-  0.00% )
>        310,270,268      branches                  #  825.352 M/sec                    ( +-  0.01% )
>          1,649,486      branch-misses             #    0.53% of all branches          ( +-  0.49% )
> 
>           0.376095 +- 0.000478 seconds time elapsed  ( +-  0.13% )
> 
> So, COW of the page table may reduce the time of forking, but it does
> so by transferring the copy work to later operations that modify the
> physical pages.

Right.

> 
>> I have tons of questions regarding rmap, accounting, GUP, page table
>> walkers, OOM situations in page walkers, but at this point I am not
>> (yet) convinced that the added complexity is really worth it. So I'd
>> appreciate some additional information.
> 
> It seems like I have a lot of work to do. ;-)

Messing with page tables and COW is usually like opening a can of worms :)

-- 
Thanks,

David / dhildenb