v3 -> v4
- Add Kconfig, CONFIG_COW_PTE, since some architectures, e.g.,
  s390 and powerpc32, don't support the PMD entry and PTE table
  operations.
- Fix mismatched type of break_cow_pte_range() in
  migrate_vma_collect_pmd().
- Don't break COW PTE in folio_referenced_one().
- Fix the wrong VMA range checking in break_cow_pte_range().
- Only break COW when we modify the soft-dirty bit in
  clear_refs_pte_range().
- Handle do_swap_page() with COW PTE in mm/memory.c and mm/khugepaged.c.
- Change the TLB flush from flush_tlb_mm_range() (x86-specific) to
  tlb_flush_pmd_range().
- Handle VM_DONTCOPY with COW PTE fork.
- Fix the wrong address and invalid vma in recover_pte_range().
- Fix the infinite page fault loop in the GUP routine.
  In mm/gup.c:follow_pfn_pte(), instead of calling the break COW PTE
  handler, we return -EMLINK to let GUP handle the page fault
  (call faultin_page() in __get_user_pages()).
- Return not_found(pvmw) if breaking COW PTE fails in
  page_vma_mapped_walk().
- Since COW PTE produces the same results as the normal kernel for the
  COW selftest, it probably passes the COW selftest.

# [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB)
not ok 33 No leak from parent into child
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with hugetlb (2048 kB)
not ok 44 No leak from parent into child
# [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB)
not ok 55 No leak from child into parent
# [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB)
not ok 66 No leak from child into parent

Bail out! 4 out of 147 tests failed
# Totals: pass:143 fail:4 xfail:0 xpass:0 skip:0 error:0

See more information about the anon COW hugetlb tests:
https://patchwork.kernel.org/project/linux-mm/patch/20220927110120.106906-5-david@redhat.com/

v3: https://lore.kernel.org/linux-mm/20221220072743.3039060-1-shiyn.lin@gmail.com/T/

RFC v2 -> v3
- Change the sysctl with PID to prctl(PR_SET_COW_PTE).
- Account all the COW PTE mapped pages in fork() instead of deferring
  it to the page fault (break COW PTE).
- If there is an unshareable mapped page (maybe pinned or private
  device), recover all the entries that were already handled by COW PTE
  fork, then copy to the new table.
- Remove the COW_PTE_OWNER_EXCLUSIVE flag and handle the only remaining
  GUP case, follow_pfn_pte().
- Remove the PTE ownership since we don't need it.
- Use the PTE lock to protect breaking COW PTE and freeing the COW-ed
  PTE table.
- Do TLB flushing in the break COW PTE handler.
- Handle THP, KSM, madvise, mprotect, uffd, and migrate device.
- Handle the replacement page of uprobe.
- Handle clear_refs_write() of fs/proc.
- All of the benchmark gains dropped because of the accounting and the
  PTE lock. The v3 benchmarks are worse than RFC v2; most cases are
  similar to the normal fork, but one use case (TriforceAFL) is still
  better than the normal fork version.

RFC v2: https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/

RFC v1 -> RFC v2
- Change the clone flag method to a sysctl with PID.
- Change the MMF_COW_PGTABLE flag to two flags, MMF_COW_PTE and
  MMF_COW_PTE_READY, for the sysctl.
- Change the owner pointer to use the folio padding.
- Handle all the VMAs that cover the PTE table when breaking COW PTE.
- Remove the self-defined refcount and use _refcount of the page
  table page.
- Add the exclusive flag to let the page table be owned by only one
  task in some situations.
- Invalidate the address range MMU notifier and start the
  write_seqcount when breaking COW PTE.
- Handle the swap cache and swapoff.

RFC v1: https://lore.kernel.org/all/20220519183127.3909598-1-shiyn.lin@gmail.com/

---

Currently, copy-on-write is only used for the mapped memory; the child
process still needs to copy the entire page table from the parent
process during forking. The parent process might take a lot of time and
memory to copy the page table when the parent has a big page table
allocated. For example, the memory usage of a process after forking with
1 GB mapped memory is as follows:

          DEFAULT FORK
          parent          child
VmRSS:    1049688 kB      1048688 kB
VmPTE:    2096 kB         2096 kB

This patch introduces copy-on-write (COW) for the PTE-level page
tables. COW PTE improves performance in situations where the user needs
copies of the program to run in isolated environments. Feedback-based
fuzzers (e.g., AFL) and serverless/microservice frameworks are two major
examples. For instance, COW PTE achieves a 1.03x throughput increase
when running TriforceAFL.

After applying COW to the PTE, the memory usage after forking is as
follows:

          COW PTE
          parent          child
VmRSS:    1049968 kB      2576 kB
VmPTE:    2096 kB         44 kB

The results show that this patch significantly decreases memory usage.
The latency numbers are discussed later.

Real-world application benchmarks
=================================

We ran benchmarks for fuzzing and VM cloning. The experiments were done
with either the normal fork or the fork with COW PTE.

With AFL (LLVM mode) and SQLite, COW PTE (52.15 execs/sec) is slightly
worse than the normal fork version (53.50 execs/sec).

fork
       execs_per_sec     unix_time        time
count      28.000000  2.800000e+01   28.000000
mean       53.496786  1.671270e+09   96.107143
std         3.625060  7.194717e+01   71.947172
min        35.350000  1.671270e+09    0.000000
25%        53.967500  1.671270e+09   33.750000
50%        54.235000  1.671270e+09   92.000000
75%        54.525000  1.671270e+09  149.250000
max        55.100000  1.671270e+09  275.000000

COW PTE
       execs_per_sec     unix_time        time
count      34.000000  3.400000e+01   34.000000
mean       52.150000  1.671268e+09  103.323529
std         3.218271  7.507682e+01   75.076817
min        34.250000  1.671268e+09    0.000000
25%        52.500000  1.671268e+09   42.250000
50%        52.750000  1.671268e+09   94.500000
75%        52.952500  1.671268e+09  150.750000
max        53.680000  1.671268e+09  285.000000

With TriforceAFL, which does kernel fuzzing with QEMU, COW PTE
(105.54 execs/sec) achieves a 1.03x throughput increase over the normal
fork version (102.30 execs/sec).

fork
       execs_per_sec     unix_time        time
count      38.000000  3.800000e+01   38.000000
mean      102.299737  1.671269e+09  156.289474
std        20.139268  8.717113e+01   87.171130
min         6.600000  1.671269e+09    0.000000
25%        95.657500  1.671269e+09   82.250000
50%       109.950000  1.671269e+09  176.500000
75%       113.972500  1.671269e+09  223.750000
max       118.790000  1.671269e+09  281.000000

COW PTE
       execs_per_sec     unix_time        time
count      42.000000  4.200000e+01   42.000000
mean      105.540714  1.671269e+09  163.476190
std        19.443517  8.858845e+01   88.588453
min         6.200000  1.671269e+09    0.000000
25%        96.585000  1.671269e+09  123.500000
50%       113.925000  1.671269e+09  180.500000
75%       116.940000  1.671269e+09  233.500000
max       121.090000  1.671269e+09  286.000000

Microbenchmark - syscall latency
================================

We ran microbenchmarks to measure the latency of a fork syscall with
sizes of mapped memory ranging from 0 to 512 MB. The results show that
the latency of a normal fork reaches 10 ms. The latency of a fork with
COW PTE is also around 10 ms.
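For reference, the VmRSS and VmPTE numbers above are read from
/proc/<pid>/status. A minimal reader for those two fields might look
like the sketch below; this is only an illustration, not the exact
measurement harness used for the numbers in this letter:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Print the VmRSS and VmPTE lines of a process from /proc/<pid>/status. */
static void print_mem_usage(long pid)
{
	char path[64], line[256];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%ld/status", pid);
	f = fopen(path, "r");
	if (!f) {
		perror("fopen");
		return;
	}
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "VmRSS:", 6) || !strncmp(line, "VmPTE:", 6))
			fputs(line, stdout);
	}
	fclose(f);
}

int main(int argc, char *argv[])
{
	/* With no argument, report on the current process. */
	print_mem_usage(argc > 1 ? atol(argv[1]) : (long)getpid());
	return 0;
}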
Microbenchmark - page fault latency
===================================

We conducted some microbenchmarks to measure page fault latency with
different patterns of access to a 512 MB memory buffer after forking.

In the first experiment, the program accesses the entire 512 MB of
memory by writing to all the pages consecutively. The experiment is
done with the normal fork and with the fork with COW PTE, and
calculates the average latency of a single access: 0.000795 ms for COW
PTE versus 0.000770 ms for the normal fork. Here are the raw numbers:

Page fault - Access to the entire 512 MB memory
  fork mean: 0.000770 ms
  fork median: 0.000769 ms
  fork std: 0.000010 ms

  COW PTE mean: 0.000795 ms
  COW PTE median: 0.000795 ms
  COW PTE std: 0.000009 ms

The second experiment simulates real-world applications with sparse
accesses. The program randomly accesses the memory by writing to one
random page 1 million times and calculates the average access time.
After that, we run both versions 100 times to get the averages. The
result shows that COW PTE (0.000029 ms) is similar to the normal fork
(0.000026 ms).

Page fault - Random access
  fork mean: 0.000026 ms
  fork median: 0.000025 ms
  fork std: 0.000002 ms

  COW PTE mean: 0.000029 ms
  COW PTE median: 0.000026 ms
  COW PTE std: 0.000004 ms

All the tests were run with QEMU and the kernel was built with the
x86_64 default config (v3 patch set).

Summary
=======

In summary, COW PTE reduces the memory footprint of processes and
improves the performance for some use cases. This patch is based on the
paper "On-demand-fork: a microsecond fork for memory-intensive and
latency-sensitive applications" [1] from Purdue University.

Any comments and suggestions are welcome.

Thanks,
Chih-En Lin

---

[1] https://dl.acm.org/doi/10.1145/3447786.3456258

This patch is based on v6.2-rc7.

---

Chih-En Lin (14):
  mm: Allow user to control COW PTE via prctl
  mm: Add Copy-On-Write PTE to fork()
  mm: Add break COW PTE fault and helper functions
  mm/rmap: Break COW PTE in rmap walking
  mm/khugepaged: Break COW PTE before scanning pte
  mm/ksm: Break COW PTE before modify shared PTE
  mm/madvise: Handle COW-ed PTE with madvise()
  mm/gup: Trigger break COW PTE before calling follow_pfn_pte()
  mm/mprotect: Break COW PTE before changing protection
  mm/userfaultfd: Support COW PTE
  mm/migrate_device: Support COW PTE
  fs/proc: Support COW PTE with clear_refs_write
  events/uprobes: Break COW PTE before replacing page
  mm: fork: Enable COW PTE to fork system call

 fs/proc/task_mmu.c                 |   5 +
 include/linux/mm.h                 |  37 ++
 include/linux/pgtable.h            |   6 +
 include/linux/rmap.h               |   2 +
 include/linux/sched/coredump.h     |  12 +-
 include/trace/events/huge_memory.h |   1 +
 include/uapi/linux/prctl.h         |   6 +
 kernel/events/uprobes.c            |   2 +-
 kernel/fork.c                      |   7 +
 kernel/sys.c                       |  11 +
 mm/Kconfig                         |   9 +
 mm/gup.c                           |   8 +-
 mm/khugepaged.c                    |  35 +-
 mm/ksm.c                           |   4 +-
 mm/madvise.c                       |  13 +
 mm/memory.c                        | 642 ++++++++++++++++++++++++++++-
 mm/migrate.c                       |   3 +-
 mm/migrate_device.c                |   2 +
 mm/mmap.c                          |   4 +
 mm/mprotect.c                      |   9 +
 mm/mremap.c                        |   2 +
 mm/page_vma_mapped.c               |   4 +
 mm/rmap.c                          |   9 +-
 mm/swapfile.c                      |   2 +
 mm/userfaultfd.c                   |   6 +
 mm/vmscan.c                        |   3 +-
 26 files changed, 826 insertions(+), 18 deletions(-)

-- 
2.34.1
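As context for the discussion that follows: the series is opt-in per
process via the new prctl(PR_SET_COW_PTE) (patch 1). A minimal usage
sketch, assuming the constant is exposed as PR_SET_COW_PTE (the test
programs later in this thread pass the raw value 65):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/wait.h>

#ifndef PR_SET_COW_PTE
#define PR_SET_COW_PTE 65	/* value used by the test programs in this thread */
#endif

int main(void)
{
	pid_t pid;

	/* Opt this process into COW PTE before forking. */
	if (prctl(PR_SET_COW_PTE, 0, 0, 0, 0))
		perror("prctl(PR_SET_COW_PTE)");

	pid = fork();
	if (pid < 0) {
		perror("fork");
		exit(1);
	}
	if (pid == 0) {
		/* Child: PTE tables are COW-shared until broken by a write. */
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	return 0;
}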
On Mon, Feb 6, 2023 at 10:52 PM Chih-En Lin <shiyn.lin@gmail.com> wrote: > > v3 -> v4 > - Add Kconfig, CONFIG_COW_PTE, since some of the architectures, e.g., > s390 and powerpc32, don't support the PMD entry and PTE table > operations. > - Fix unmatch type of break_cow_pte_range() in > migrate_vma_collect_pmd(). > - Don’t break COW PTE in folio_referenced_one(). > - Fix the wrong VMA range checking in break_cow_pte_range(). > - Only break COW when we modify the soft-dirty bit in > clear_refs_pte_range(). > - Handle do_swap_page() with COW PTE in mm/memory.c and mm/khugepaged.c. > - Change the tlb flush from flush_tlb_mm_range() (x86 specific) to > tlb_flush_pmd_range(). > - Handle VM_DONTCOPY with COW PTE fork. > - Fix the wrong address and invalid vma in recover_pte_range(). > - Fix the infinite page fault loop in GUP routine. > In mm/gup.c:follow_pfn_pte(), instead of calling the break COW PTE > handler, we return -EMLINK to let the GUP handles the page fault > (call faultin_page() in __get_user_pages()). > - return not_found(pvmw) if the break COW PTE failed in > page_vma_mapped_walk(). > - Since COW PTE has the same result as the normal COW selftest, it > probably passed the COW selftest. > > # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB) > not ok 33 No leak from parent into child > # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with hugetlb (2048 kB) > not ok 44 No leak from parent into child > # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB) > not ok 55 No leak from child into parent > # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB) > not ok 66 No leak from child into parent > > Bail out! 4 out of 147 tests failed > # Totals: pass:143 fail:4 xfail:0 xpass:0 skip:0 error:0 > See the more information about anon cow hugetlb tests: > https://patchwork.kernel.org/project/linux-mm/patch/20220927110120.106906-5-david@redhat.com/ > > > v3: https://lore.kernel.org/linux-mm/20221220072743.3039060-1-shiyn.lin@gmail.com/T/ > > RFC v2 -> v3 > - Change the sysctl with PID to prctl(PR_SET_COW_PTE). > - Account all the COW PTE mapped pages in fork() instead of defer it to > page fault (break COW PTE). > - If there is an unshareable mapped page (maybe pinned or private > device), recover all the entries that are already handled by COW PTE > fork, then copy to the new one. > - Remove COW_PTE_OWNER_EXCLUSIVE flag and handle the only case of GUP, > follow_pfn_pte(). > - Remove the PTE ownership since we don't need it. > - Use pte lock to protect the break COW PTE and free COW-ed PTE. > - Do TLB flushing in break COW PTE handler. > - Handle THP, KSM, madvise, mprotect, uffd and migrate device. > - Handle the replacement page of uprobe. > - Handle the clear_refs_write() of fs/proc. > - All of the benchmarks dropped since the accounting and pte lock. > The benchmarks of v3 is worse than RFC v2, most of the cases are > similar to the normal fork, but there still have an use case > (TriforceAFL) is better than the normal fork version. > > RFC v2: https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/ > > RFC v1 -> RFC v2 > - Change the clone flag method to sysctl with PID. > - Change the MMF_COW_PGTABLE flag to two flags, MMF_COW_PTE and > MMF_COW_PTE_READY, for the sysctl. > - Change the owner pointer to use the folio padding. > - Handle all the VMAs that cover the PTE table when doing the break COW PTE. 
> - Remove the self-defined refcount to use the _refcount for the page
>   table page.
> - Add the exclusive flag to let the page table only own by one task in
>   some situations.
> - Invalidate address range MMU notifier and start the write_seqcount
>   when doing the break COW PTE.
> - Handle the swap cache and swapoff.
>
> RFC v1: https://lore.kernel.org/all/20220519183127.3909598-1-shiyn.lin@gmail.com/
>
> ---
>
> Currently, copy-on-write is only used for the mapped memory; the child
> process still needs to copy the entire page table from the parent
> process during forking. The parent process might take a lot of time and
> memory to copy the page table when the parent has a big page table
> allocated. For example, the memory usage of a process after forking with
> 1 GB mapped memory is as follows:

For some reason, I was not able to reproduce performance improvements
with a simple fork() performance measurement program. The results that
I saw are the following:

Base:
Fork latency per gigabyte: 0.004416 seconds
Fork latency per gigabyte: 0.004382 seconds
Fork latency per gigabyte: 0.004442 seconds
COW kernel:
Fork latency per gigabyte: 0.004524 seconds
Fork latency per gigabyte: 0.004764 seconds
Fork latency per gigabyte: 0.004547 seconds

AMD EPYC 7B12 64-Core Processor
Base:
Fork latency per gigabyte: 0.003923 seconds
Fork latency per gigabyte: 0.003909 seconds
Fork latency per gigabyte: 0.003955 seconds
COW kernel:
Fork latency per gigabyte: 0.004221 seconds
Fork latency per gigabyte: 0.003882 seconds
Fork latency per gigabyte: 0.003854 seconds

Given that the page table for the child is not copied, I was expecting
the performance to be better with the COW kernel, and also not to
depend on the size of the parent.

Test program:

#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <sys/types.h>

#define USEC 1000000
#define GIG (1ul << 30)
#define NGIG 32
#define SIZE (NGIG * GIG)
#define NPROC 16

void main() {
	int page_size = getpagesize();
	struct timeval start, end;
	long duration, i;
	char *p;

	p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	madvise(p, SIZE, MADV_NOHUGEPAGE);

	/* Touch every page */
	for (i = 0; i < SIZE; i += page_size)
		p[i] = 0;

	gettimeofday(&start, NULL);
	for (i = 0; i < NPROC; i++) {
		int pid = fork();

		if (pid == 0) {
			sleep(30);
			exit(0);
		}
	}
	gettimeofday(&end, NULL);
	/* Normalize per proc and per gig */
	duration = ((end.tv_sec - start.tv_sec) * USEC
		    + (end.tv_usec - start.tv_usec)) / NPROC / NGIG;
	printf("Fork latency per gigabyte: %ld.%06ld seconds\n",
	       duration / USEC, duration % USEC);
}
On Fri, Feb 10, 2023 at 2:16 AM Pasha Tatashin <pasha.tatashin@soleen.com> wrote: > > On Mon, Feb 6, 2023 at 10:52 PM Chih-En Lin <shiyn.lin@gmail.com> wrote: > > > > v3 -> v4 > > - Add Kconfig, CONFIG_COW_PTE, since some of the architectures, e.g., > > s390 and powerpc32, don't support the PMD entry and PTE table > > operations. > > - Fix unmatch type of break_cow_pte_range() in > > migrate_vma_collect_pmd(). > > - Don’t break COW PTE in folio_referenced_one(). > > - Fix the wrong VMA range checking in break_cow_pte_range(). > > - Only break COW when we modify the soft-dirty bit in > > clear_refs_pte_range(). > > - Handle do_swap_page() with COW PTE in mm/memory.c and mm/khugepaged.c. > > - Change the tlb flush from flush_tlb_mm_range() (x86 specific) to > > tlb_flush_pmd_range(). > > - Handle VM_DONTCOPY with COW PTE fork. > > - Fix the wrong address and invalid vma in recover_pte_range(). > > - Fix the infinite page fault loop in GUP routine. > > In mm/gup.c:follow_pfn_pte(), instead of calling the break COW PTE > > handler, we return -EMLINK to let the GUP handles the page fault > > (call faultin_page() in __get_user_pages()). > > - return not_found(pvmw) if the break COW PTE failed in > > page_vma_mapped_walk(). > > - Since COW PTE has the same result as the normal COW selftest, it > > probably passed the COW selftest. > > > > # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB) > > not ok 33 No leak from parent into child > > # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with hugetlb (2048 kB) > > not ok 44 No leak from parent into child > > # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB) > > not ok 55 No leak from child into parent > > # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB) > > not ok 66 No leak from child into parent > > > > Bail out! 4 out of 147 tests failed > > # Totals: pass:143 fail:4 xfail:0 xpass:0 skip:0 error:0 > > See the more information about anon cow hugetlb tests: > > https://patchwork.kernel.org/project/linux-mm/patch/20220927110120.106906-5-david@redhat.com/ > > > > > > v3: https://lore.kernel.org/linux-mm/20221220072743.3039060-1-shiyn.lin@gmail.com/T/ > > > > RFC v2 -> v3 > > - Change the sysctl with PID to prctl(PR_SET_COW_PTE). > > - Account all the COW PTE mapped pages in fork() instead of defer it to > > page fault (break COW PTE). > > - If there is an unshareable mapped page (maybe pinned or private > > device), recover all the entries that are already handled by COW PTE > > fork, then copy to the new one. > > - Remove COW_PTE_OWNER_EXCLUSIVE flag and handle the only case of GUP, > > follow_pfn_pte(). > > - Remove the PTE ownership since we don't need it. > > - Use pte lock to protect the break COW PTE and free COW-ed PTE. > > - Do TLB flushing in break COW PTE handler. > > - Handle THP, KSM, madvise, mprotect, uffd and migrate device. > > - Handle the replacement page of uprobe. > > - Handle the clear_refs_write() of fs/proc. > > - All of the benchmarks dropped since the accounting and pte lock. > > The benchmarks of v3 is worse than RFC v2, most of the cases are > > similar to the normal fork, but there still have an use case > > (TriforceAFL) is better than the normal fork version. > > > > RFC v2: https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/ > > > > RFC v1 -> RFC v2 > > - Change the clone flag method to sysctl with PID. 
> > - Change the MMF_COW_PGTABLE flag to two flags, MMF_COW_PTE and
> >   MMF_COW_PTE_READY, for the sysctl.
> > - Change the owner pointer to use the folio padding.
> > - Handle all the VMAs that cover the PTE table when doing the break COW PTE.
> > - Remove the self-defined refcount to use the _refcount for the page
> >   table page.
> > - Add the exclusive flag to let the page table only own by one task in
> >   some situations.
> > - Invalidate address range MMU notifier and start the write_seqcount
> >   when doing the break COW PTE.
> > - Handle the swap cache and swapoff.
> >
> > RFC v1: https://lore.kernel.org/all/20220519183127.3909598-1-shiyn.lin@gmail.com/
> >
> > ---
> >
> > Currently, copy-on-write is only used for the mapped memory; the child
> > process still needs to copy the entire page table from the parent
> > process during forking. The parent process might take a lot of time and
> > memory to copy the page table when the parent has a big page table
> > allocated. For example, the memory usage of a process after forking with
> > 1 GB mapped memory is as follows:
>
> For some reason, I was not able to reproduce performance improvements
> with a simple fork() performance measurement program. The results that
> I saw are the following:
>
> Base:
> Fork latency per gigabyte: 0.004416 seconds
> Fork latency per gigabyte: 0.004382 seconds
> Fork latency per gigabyte: 0.004442 seconds
> COW kernel:
> Fork latency per gigabyte: 0.004524 seconds
> Fork latency per gigabyte: 0.004764 seconds
> Fork latency per gigabyte: 0.004547 seconds
>
> AMD EPYC 7B12 64-Core Processor
> Base:
> Fork latency per gigabyte: 0.003923 seconds
> Fork latency per gigabyte: 0.003909 seconds
> Fork latency per gigabyte: 0.003955 seconds
> COW kernel:
> Fork latency per gigabyte: 0.004221 seconds
> Fork latency per gigabyte: 0.003882 seconds
> Fork latency per gigabyte: 0.003854 seconds
>
> Given, that page table for child is not copied, I was expecting the
> performance to be better with COW kernel, and also not to depend on
> the size of the parent.

Yes, the child won't duplicate the page table, but fork will still
traverse all the page table entries to do the accounting.
And, since this patch extends COW to the PTE table level, it's not
mapped-page (page table entry) grained anymore, so we have to guarantee
that every mapped page in such a page table is available for COW
mapping. This kind of checking also costs some time.
As a result, because of the accounting and the checking, the COW PTE
fork still depends on the size of the parent, so the improvement might
not be significant.

Actually, at RFC v1 and v2, we proposed a version that skips that work,
and we got a significant improvement. You can see the number from the
RFC v2 cover letter [1]:
"In short, with 512 MB mapped memory, COW PTE decreases latency by 93%
for normal fork"

However, it might break the existing logic of the refcount/mapcount of
the page and destabilize the system.
[1] https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/#me2340d963c2758a2561c39cb3baf42c478dfe548
[2] https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/#mbc33221f00c7cf3d71839b45fc23862a5dac3014

> Test program:
>
> #include <time.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/time.h>
> #include <sys/mman.h>
> #include <sys/types.h>
>
> #define USEC 1000000
> #define GIG (1ul << 30)
> #define NGIG 32
> #define SIZE (NGIG * GIG)
> #define NPROC 16
>
> void main() {
>         int page_size = getpagesize();
>         struct timeval start, end;
>         long duration, i;
>         char *p;
>
>         p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         if (p == MAP_FAILED) {
>                 perror("mmap");
>                 exit(1);
>         }
>         madvise(p, SIZE, MADV_NOHUGEPAGE);
>
>         /* Touch every page */
>         for (i = 0; i < SIZE; i += page_size)
>                 p[i] = 0;
>
>         gettimeofday(&start, NULL);
>         for (i = 0; i < NPROC; i++) {
>                 int pid = fork();
>
>                 if (pid == 0) {
>                         sleep(30);
>                         exit(0);
>                 }
>         }
>         gettimeofday(&end, NULL);
>         /* Normolize per proc and per gig */
>         duration = ((end.tv_sec - start.tv_sec) * USEC
>                     + (end.tv_usec - start.tv_usec)) / NPROC / NGIG;
>         printf("Fork latency per gigabyte: %ld.%06ld seconds\n",
>                duration / USEC, duration % USEC);
> }

I'm not sure that taking only a few measurements is enough. So, I
rewrote your test program to run multiple times, focus on a single
fork in each iteration, and compute the average time:

fork.log: 0.000498
odfork.log: 0.000469

Test program:

#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/prctl.h>

#define USEC 1000000
#define GIG (1ul << 30)
#define NGIG 4
#define SIZE (NGIG * GIG)
#define NPROC 16

int main(void)
{
	unsigned int i = 0;
	unsigned long j = 0;
	int pid, page_size = getpagesize();
	struct timeval start, end;
	long duration;
	char *p;

	prctl(65, 0, 0, 0, 0);	/* PR_SET_COW_PTE */

	for (i = 0; i < NPROC; i++) {
		p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}
		madvise(p, SIZE, MADV_NOHUGEPAGE);

		/* Touch every page */
		for (j = 0; j < SIZE; j += page_size)
			p[j] = 0;

		gettimeofday(&start, NULL);
		pid = fork();
		switch (pid) {
		case -1:
			perror("fork");
			exit(1);
		case 0: /* child */
			return 0;
		default: /* parent */
			gettimeofday(&end, NULL);
			duration = ((end.tv_sec - start.tv_sec) * USEC
				    + (end.tv_usec - start.tv_usec)) / NPROC / NGIG; // seconds
			printf("%ld.%06ld\n", duration / USEC, duration % USEC);
			waitpid(pid, NULL, 0);
			munmap(p, SIZE);
			p = NULL;
		}
	}
}

Script:

import numpy

def calc_mean(file):
    np_tmp = numpy.loadtxt(file, usecols=range(0,1))
    print("{}: {:6f}".format(file, np_tmp.mean()))

calc_mean("fork.log")
calc_mean("odfork.log")

I didn't make the memory size and process number bigger because it ran
on my laptop, and I can't access my server for some reason.

Thanks,
Chih-En Lin
> > > Currently, copy-on-write is only used for the mapped memory; the child > > > process still needs to copy the entire page table from the parent > > > process during forking. The parent process might take a lot of time and > > > memory to copy the page table when the parent has a big page table > > > allocated. For example, the memory usage of a process after forking with > > > 1 GB mapped memory is as follows: > > > > For some reason, I was not able to reproduce performance improvements > > with a simple fork() performance measurement program. The results that > > I saw are the following: > > > > Base: > > Fork latency per gigabyte: 0.004416 seconds > > Fork latency per gigabyte: 0.004382 seconds > > Fork latency per gigabyte: 0.004442 seconds > > COW kernel: > > Fork latency per gigabyte: 0.004524 seconds > > Fork latency per gigabyte: 0.004764 seconds > > Fork latency per gigabyte: 0.004547 seconds > > > > AMD EPYC 7B12 64-Core Processor > > Base: > > Fork latency per gigabyte: 0.003923 seconds > > Fork latency per gigabyte: 0.003909 seconds > > Fork latency per gigabyte: 0.003955 seconds > > COW kernel: > > Fork latency per gigabyte: 0.004221 seconds > > Fork latency per gigabyte: 0.003882 seconds > > Fork latency per gigabyte: 0.003854 seconds > > > > Given, that page table for child is not copied, I was expecting the > > performance to be better with COW kernel, and also not to depend on > > the size of the parent. > > Yes, the child won't duplicate the page table, but fork will still > traverse all the page table entries to do the accounting. > And, since this patch expends the COW to the PTE table level, it's not > the mapped page (page table entry) grained anymore, so we have to > guarantee that all the mapped page is available to do COW mapping in > the such page table. > This kind of checking also costs some time. > As a result, since the accounting and the checking, the COW PTE fork > still depends on the size of the parent so the improvement might not > be significant. The current version of the series does not provide any performance improvements for fork(). I would recommend removing claims from the cover letter about better fork() performance, as this may be misleading for those looking for a way to speed up forking. In my case, I was looking to speed up Redis OSS, which relies on fork() to create consistent snapshots for driving replicates/backups. The O(N) per-page operation causes fork() to be slow, so I was hoping that this series, which does not duplicate the VA during fork(), would make the operation much quicker. > Actually, at the RFC v1 and v2, we proposed the version of skipping > those works, and we got a significant improvement. You can see the > number from RFC v2 cover letter [1]: > "In short, with 512 MB mapped memory, COW PTE decreases latency by 93% > for normal fork" I suspect the 93% improvement (when the mapcount was not updated) was only for VAs with 4K pages. With 2M mappings this series did not provide any benefit is this correct? > > However, it might break the existing logic of the refcount/mapcount of > the page and destabilize the system. This makes sense. > [1] https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/#me2340d963c2758a2561c39cb3baf42c478dfe548 > [2] https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/#mbc33221f00c7cf3d71839b45fc23862a5dac3014
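The fork()-based snapshot pattern described above can be sketched as
follows; this is only a schematic illustration of the idea, not Redis
code. The child keeps a copy-on-write view of the parent's memory that
stays consistent while the parent continues to modify its own copy:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
	size_t size = 64 << 20;		/* the "dataset" to snapshot */
	char *data = malloc(size);
	pid_t pid;

	if (!data)
		return 1;
	memset(data, 'A', size);

	pid = fork();
	if (pid < 0)
		return 1;
	if (pid == 0) {
		/* Child: write out the frozen view, then exit. */
		FILE *f = fopen("snapshot.bin", "w");
		if (f) {
			fwrite(data, 1, size, f);
			fclose(f);
		}
		_exit(0);
	}
	/* Parent: keeps serving; its writes don't affect the child's view. */
	memset(data, 'B', size);
	waitpid(pid, NULL, 0);
	free(data);
	return 0;
}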
On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote: > > > > Currently, copy-on-write is only used for the mapped memory; the child > > > > process still needs to copy the entire page table from the parent > > > > process during forking. The parent process might take a lot of time and > > > > memory to copy the page table when the parent has a big page table > > > > allocated. For example, the memory usage of a process after forking with > > > > 1 GB mapped memory is as follows: > > > > > > For some reason, I was not able to reproduce performance improvements > > > with a simple fork() performance measurement program. The results that > > > I saw are the following: > > > > > > Base: > > > Fork latency per gigabyte: 0.004416 seconds > > > Fork latency per gigabyte: 0.004382 seconds > > > Fork latency per gigabyte: 0.004442 seconds > > > COW kernel: > > > Fork latency per gigabyte: 0.004524 seconds > > > Fork latency per gigabyte: 0.004764 seconds > > > Fork latency per gigabyte: 0.004547 seconds > > > > > > AMD EPYC 7B12 64-Core Processor > > > Base: > > > Fork latency per gigabyte: 0.003923 seconds > > > Fork latency per gigabyte: 0.003909 seconds > > > Fork latency per gigabyte: 0.003955 seconds > > > COW kernel: > > > Fork latency per gigabyte: 0.004221 seconds > > > Fork latency per gigabyte: 0.003882 seconds > > > Fork latency per gigabyte: 0.003854 seconds > > > > > > Given, that page table for child is not copied, I was expecting the > > > performance to be better with COW kernel, and also not to depend on > > > the size of the parent. > > > > Yes, the child won't duplicate the page table, but fork will still > > traverse all the page table entries to do the accounting. > > And, since this patch expends the COW to the PTE table level, it's not > > the mapped page (page table entry) grained anymore, so we have to > > guarantee that all the mapped page is available to do COW mapping in > > the such page table. > > This kind of checking also costs some time. > > As a result, since the accounting and the checking, the COW PTE fork > > still depends on the size of the parent so the improvement might not > > be significant. > > The current version of the series does not provide any performance > improvements for fork(). I would recommend removing claims from the > cover letter about better fork() performance, as this may be > misleading for those looking for a way to speed up forking. In my From v3 to v4, I changed the implementation of the COW fork() part to do the accounting and checking. At the time, I also removed most of the descriptions about the better fork() performance. Maybe it's not enough and still has some misleading. I will fix this in the next version. Thanks. > case, I was looking to speed up Redis OSS, which relies on fork() to > create consistent snapshots for driving replicates/backups. The O(N) > per-page operation causes fork() to be slow, so I was hoping that this > series, which does not duplicate the VA during fork(), would make the > operation much quicker. Indeed, at first, I tried to avoid the O(N) per-page operation by deferring the accounting and the swap stuff to the page fault. But, as I mentioned, it's not suitable for the mainline. Honestly, for improving the fork(), I have an idea to skip the per-page operation without breaking the logic. However, this will introduce the complicated mechanism and may has the overhead for other features. It might not be worth it. 
It's hard to strike a balance between an over-complicated mechanism
with (probably) better performance and keeping the data consistent with
the page status. So, I would focus on a safe and stable approach first.

> > Actually, at the RFC v1 and v2, we proposed the version of skipping
> > those works, and we got a significant improvement. You can see the
> > number from RFC v2 cover letter [1]:
> > "In short, with 512 MB mapped memory, COW PTE decreases latency by 93%
> > for normal fork"
>
> I suspect the 93% improvement (when the mapcount was not updated) was
> only for VAs with 4K pages. With 2M mappings this series did not
> provide any benefit is this correct?

Yes. In this case, the COW PTE performance is similar to the normal
fork().

> >
> > However, it might break the existing logic of the refcount/mapcount of
> > the page and destabilize the system.
>
> This makes sense.

;)

Thanks,
Chih-En Lin
On 10.02.23 18:20, Chih-En Lin wrote: > On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote: >>>>> Currently, copy-on-write is only used for the mapped memory; the child >>>>> process still needs to copy the entire page table from the parent >>>>> process during forking. The parent process might take a lot of time and >>>>> memory to copy the page table when the parent has a big page table >>>>> allocated. For example, the memory usage of a process after forking with >>>>> 1 GB mapped memory is as follows: >>>> >>>> For some reason, I was not able to reproduce performance improvements >>>> with a simple fork() performance measurement program. The results that >>>> I saw are the following: >>>> >>>> Base: >>>> Fork latency per gigabyte: 0.004416 seconds >>>> Fork latency per gigabyte: 0.004382 seconds >>>> Fork latency per gigabyte: 0.004442 seconds >>>> COW kernel: >>>> Fork latency per gigabyte: 0.004524 seconds >>>> Fork latency per gigabyte: 0.004764 seconds >>>> Fork latency per gigabyte: 0.004547 seconds >>>> >>>> AMD EPYC 7B12 64-Core Processor >>>> Base: >>>> Fork latency per gigabyte: 0.003923 seconds >>>> Fork latency per gigabyte: 0.003909 seconds >>>> Fork latency per gigabyte: 0.003955 seconds >>>> COW kernel: >>>> Fork latency per gigabyte: 0.004221 seconds >>>> Fork latency per gigabyte: 0.003882 seconds >>>> Fork latency per gigabyte: 0.003854 seconds >>>> >>>> Given, that page table for child is not copied, I was expecting the >>>> performance to be better with COW kernel, and also not to depend on >>>> the size of the parent. >>> >>> Yes, the child won't duplicate the page table, but fork will still >>> traverse all the page table entries to do the accounting. >>> And, since this patch expends the COW to the PTE table level, it's not >>> the mapped page (page table entry) grained anymore, so we have to >>> guarantee that all the mapped page is available to do COW mapping in >>> the such page table. >>> This kind of checking also costs some time. >>> As a result, since the accounting and the checking, the COW PTE fork >>> still depends on the size of the parent so the improvement might not >>> be significant. >> >> The current version of the series does not provide any performance >> improvements for fork(). I would recommend removing claims from the >> cover letter about better fork() performance, as this may be >> misleading for those looking for a way to speed up forking. In my > > From v3 to v4, I changed the implementation of the COW fork() part to do > the accounting and checking. At the time, I also removed most of the > descriptions about the better fork() performance. Maybe it's not enough > and still has some misleading. I will fix this in the next version. > Thanks. > >> case, I was looking to speed up Redis OSS, which relies on fork() to >> create consistent snapshots for driving replicates/backups. The O(N) >> per-page operation causes fork() to be slow, so I was hoping that this >> series, which does not duplicate the VA during fork(), would make the >> operation much quicker. > > Indeed, at first, I tried to avoid the O(N) per-page operation by > deferring the accounting and the swap stuff to the page fault. But, > as I mentioned, it's not suitable for the mainline. > > Honestly, for improving the fork(), I have an idea to skip the per-page > operation without breaking the logic. However, this will introduce the > complicated mechanism and may has the overhead for other features. It > might not be worth it. 
> It's hard to strike a balance between the
> over-complicated mechanism with (probably) better performance and data
> consistency with the page status. So, I would focus on the safety and
> stable approach at first.

Yes, it is most probably possible, but complexity, robustness and
maintainability have to be considered as well.

Thanks for implementing this approach (only deduplication without other
optimizations) and evaluating it accordingly. It's certainly "cleaner",
such that we only have to mess with unsharing and not with other
accounting/pinning/mapcount thingies. But it also highlights how
intrusive even this basic deduplication approach already is -- and that
most benefits of the original approach require even more complexity on
top.

I am not quite sure if the benefit is worth the price (I am not the one
to decide and I would like to hear other opinions).

My quick thoughts after skimming over the core parts of this series:

(1) Forgetting to break COW on a PTE in some pgtable walker feels quite
    likely (meaning that it might be fairly error-prone), as does
    forgetting to break COW on a PTE table and then accidentally
    modifying the shared table.
(2) break_cow_pte() can fail, which means that we can fail some
    operations (possibly silently halfway through) now. For example,
    looking at your change_pte_range() change, I suspect it's wrong.
(3) handle_cow_pte_fault() looks quite complicated and needs quite some
    double-checking: we temporarily clear the PMD, to reset it
    afterwards. I am not sure if that is correct. For example, what
    stops another page fault stumbling over that pmd_none() and
    allocating an empty page table? Maybe there are some locking details
    missing or they are very subtle such that we better document them. I
    recall that THP played quite some tricks to make such cases work ...

>
>>> Actually, at the RFC v1 and v2, we proposed the version of skipping
>>> those works, and we got a significant improvement. You can see the
>>> number from RFC v2 cover letter [1]:
>>> "In short, with 512 MB mapped memory, COW PTE decreases latency by 93%
>>> for normal fork"
>>
>> I suspect the 93% improvement (when the mapcount was not updated) was
>> only for VAs with 4K pages. With 2M mappings this series did not
>> provide any benefit is this correct?
>
> Yes. In this case, the COW PTE performance is similar to the normal
> fork().

The thing with THP is that during fork(), we always allocate a backup
PTE table, to be able to PTE-map the THP whenever we have to. Otherwise
we'd have to eventually fail some operations we don't want to fail --
similar to the case where break_cow_pte() could fail now due to -ENOMEM
although we really don't want to fail (e.g., change_pte_range()).

I always considered that wasteful, because in many scenarios, we'll
never ever split a THP and possibly waste memory.

Optimizing that for THP (e.g., don't always allocate backup THP, have
some global allocation backup pool for splits + refill when
close-to-empty) might provide similar fork() improvements, both in speed
and memory consumption when it comes to anonymous memory.

-- 
Thanks,

David / dhildenb
On Tue, Feb 14, 2023 at 1:58 AM David Hildenbrand <david@redhat.com> wrote: > > On 10.02.23 18:20, Chih-En Lin wrote: > > On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote: > >>>>> Currently, copy-on-write is only used for the mapped memory; the child > >>>>> process still needs to copy the entire page table from the parent > >>>>> process during forking. The parent process might take a lot of time and > >>>>> memory to copy the page table when the parent has a big page table > >>>>> allocated. For example, the memory usage of a process after forking with > >>>>> 1 GB mapped memory is as follows: > >>>> > >>>> For some reason, I was not able to reproduce performance improvements > >>>> with a simple fork() performance measurement program. The results that > >>>> I saw are the following: > >>>> > >>>> Base: > >>>> Fork latency per gigabyte: 0.004416 seconds > >>>> Fork latency per gigabyte: 0.004382 seconds > >>>> Fork latency per gigabyte: 0.004442 seconds > >>>> COW kernel: > >>>> Fork latency per gigabyte: 0.004524 seconds > >>>> Fork latency per gigabyte: 0.004764 seconds > >>>> Fork latency per gigabyte: 0.004547 seconds > >>>> > >>>> AMD EPYC 7B12 64-Core Processor > >>>> Base: > >>>> Fork latency per gigabyte: 0.003923 seconds > >>>> Fork latency per gigabyte: 0.003909 seconds > >>>> Fork latency per gigabyte: 0.003955 seconds > >>>> COW kernel: > >>>> Fork latency per gigabyte: 0.004221 seconds > >>>> Fork latency per gigabyte: 0.003882 seconds > >>>> Fork latency per gigabyte: 0.003854 seconds > >>>> > >>>> Given, that page table for child is not copied, I was expecting the > >>>> performance to be better with COW kernel, and also not to depend on > >>>> the size of the parent. > >>> > >>> Yes, the child won't duplicate the page table, but fork will still > >>> traverse all the page table entries to do the accounting. > >>> And, since this patch expends the COW to the PTE table level, it's not > >>> the mapped page (page table entry) grained anymore, so we have to > >>> guarantee that all the mapped page is available to do COW mapping in > >>> the such page table. > >>> This kind of checking also costs some time. > >>> As a result, since the accounting and the checking, the COW PTE fork > >>> still depends on the size of the parent so the improvement might not > >>> be significant. > >> > >> The current version of the series does not provide any performance > >> improvements for fork(). I would recommend removing claims from the > >> cover letter about better fork() performance, as this may be > >> misleading for those looking for a way to speed up forking. In my > > > > From v3 to v4, I changed the implementation of the COW fork() part to do > > the accounting and checking. At the time, I also removed most of the > > descriptions about the better fork() performance. Maybe it's not enough > > and still has some misleading. I will fix this in the next version. > > Thanks. > > > >> case, I was looking to speed up Redis OSS, which relies on fork() to > >> create consistent snapshots for driving replicates/backups. The O(N) > >> per-page operation causes fork() to be slow, so I was hoping that this > >> series, which does not duplicate the VA during fork(), would make the > >> operation much quicker. > > > > Indeed, at first, I tried to avoid the O(N) per-page operation by > > deferring the accounting and the swap stuff to the page fault. But, > > as I mentioned, it's not suitable for the mainline. 
> > > > Honestly, for improving the fork(), I have an idea to skip the per-page > > operation without breaking the logic. However, this will introduce the > > complicated mechanism and may has the overhead for other features. It > > might not be worth it. It's hard to strike a balance between the > > over-complicated mechanism with (probably) better performance and data > > consistency with the page status. So, I would focus on the safety and > > stable approach at first. > > Yes, it is most probably possible, but complexity, robustness and > maintainability have to be considered as well. > > Thanks for implementing this approach (only deduplication without other > optimizations) and evaluating it accordingly. It's certainly "cleaner", > such that we only have to mess with unsharing and not with other > accounting/pinning/mapcount thingies. But it also highlights how > intrusive even this basic deduplication approach already is -- and that > most benefits of the original approach requires even more complexity on top. > > I am not quite sure if the benefit is worth the price (I am not to > decide and I would like to hear other options). > > My quick thoughts after skimming over the core parts of this series > > (1) forgetting to break COW on a PTE in some pgtable walker feels quite > likely (meaning that it might be fairly error-prone) and forgetting > to break COW on a PTE table, accidentally modifying the shared > table. > (2) break_cow_pte() can fail, which means that we can fail some > operations (possibly silently halfway through) now. For example, > looking at your change_pte_range() change, I suspect it's wrong. > (3) handle_cow_pte_fault() looks quite complicated and needs quite some > double-checking: we temporarily clear the PMD, to reset it > afterwards. I am not sure if that is correct. For example, what > stops another page fault stumbling over that pmd_none() and > allocating an empty page table? Maybe there are some locking details > missing or they are very subtle such that we better document them. I > recall that THP played quite some tricks to make such cases work ... > > > > >>> Actually, at the RFC v1 and v2, we proposed the version of skipping > >>> those works, and we got a significant improvement. You can see the > >>> number from RFC v2 cover letter [1]: > >>> "In short, with 512 MB mapped memory, COW PTE decreases latency by 93% > >>> for normal fork" > >> > >> I suspect the 93% improvement (when the mapcount was not updated) was > >> only for VAs with 4K pages. With 2M mappings this series did not > >> provide any benefit is this correct? > > > > Yes. In this case, the COW PTE performance is similar to the normal > > fork(). > > > The thing with THP is, that during fork(), we always allocate a backup > PTE table, to be able to PTE-map the THP whenever we have to. Otherwise > we'd have to eventually fail some operations we don't want to fail -- > similar to the case where break_cow_pte() could fail now due to -ENOMEM > although we really don't want to fail (e.g., change_pte_range() ). > > I always considered that wasteful, because in many scenarios, we'll > never ever split a THP and possibly waste memory. When you say "split THP", do you mean split the compound page to base pages? IIUC the backup PTE table page is used to guarantee the PMD split (just convert pmd mapped THP to PTE-mapped but not split the compound page) succeed. You may already notice there is no return value for PMD split. 
The PMD split may be called quite often, for example, from
MADV_DONTNEED, mbind, mlock, and even in the memory reclamation context
(THP swap).

>
> Optimizing that for THP (e.g., don't always allocate backup THP, have
> some global allocation backup pool for splits + refill when
> close-to-empty) might provide similar fork() improvements, both in speed
> and memory consumption when it comes to anonymous memory.

It might work. But it may be much more complicated than you think when
handling multiple parallel PMD splits.

>
> --
> Thanks,
>
> David / dhildenb
>
On 14.02.23 18:23, Yang Shi wrote: > On Tue, Feb 14, 2023 at 1:58 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 10.02.23 18:20, Chih-En Lin wrote: >>> On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote: >>>>>>> Currently, copy-on-write is only used for the mapped memory; the child >>>>>>> process still needs to copy the entire page table from the parent >>>>>>> process during forking. The parent process might take a lot of time and >>>>>>> memory to copy the page table when the parent has a big page table >>>>>>> allocated. For example, the memory usage of a process after forking with >>>>>>> 1 GB mapped memory is as follows: >>>>>> >>>>>> For some reason, I was not able to reproduce performance improvements >>>>>> with a simple fork() performance measurement program. The results that >>>>>> I saw are the following: >>>>>> >>>>>> Base: >>>>>> Fork latency per gigabyte: 0.004416 seconds >>>>>> Fork latency per gigabyte: 0.004382 seconds >>>>>> Fork latency per gigabyte: 0.004442 seconds >>>>>> COW kernel: >>>>>> Fork latency per gigabyte: 0.004524 seconds >>>>>> Fork latency per gigabyte: 0.004764 seconds >>>>>> Fork latency per gigabyte: 0.004547 seconds >>>>>> >>>>>> AMD EPYC 7B12 64-Core Processor >>>>>> Base: >>>>>> Fork latency per gigabyte: 0.003923 seconds >>>>>> Fork latency per gigabyte: 0.003909 seconds >>>>>> Fork latency per gigabyte: 0.003955 seconds >>>>>> COW kernel: >>>>>> Fork latency per gigabyte: 0.004221 seconds >>>>>> Fork latency per gigabyte: 0.003882 seconds >>>>>> Fork latency per gigabyte: 0.003854 seconds >>>>>> >>>>>> Given, that page table for child is not copied, I was expecting the >>>>>> performance to be better with COW kernel, and also not to depend on >>>>>> the size of the parent. >>>>> >>>>> Yes, the child won't duplicate the page table, but fork will still >>>>> traverse all the page table entries to do the accounting. >>>>> And, since this patch expends the COW to the PTE table level, it's not >>>>> the mapped page (page table entry) grained anymore, so we have to >>>>> guarantee that all the mapped page is available to do COW mapping in >>>>> the such page table. >>>>> This kind of checking also costs some time. >>>>> As a result, since the accounting and the checking, the COW PTE fork >>>>> still depends on the size of the parent so the improvement might not >>>>> be significant. >>>> >>>> The current version of the series does not provide any performance >>>> improvements for fork(). I would recommend removing claims from the >>>> cover letter about better fork() performance, as this may be >>>> misleading for those looking for a way to speed up forking. In my >>> >>> From v3 to v4, I changed the implementation of the COW fork() part to do >>> the accounting and checking. At the time, I also removed most of the >>> descriptions about the better fork() performance. Maybe it's not enough >>> and still has some misleading. I will fix this in the next version. >>> Thanks. >>> >>>> case, I was looking to speed up Redis OSS, which relies on fork() to >>>> create consistent snapshots for driving replicates/backups. The O(N) >>>> per-page operation causes fork() to be slow, so I was hoping that this >>>> series, which does not duplicate the VA during fork(), would make the >>>> operation much quicker. >>> >>> Indeed, at first, I tried to avoid the O(N) per-page operation by >>> deferring the accounting and the swap stuff to the page fault. But, >>> as I mentioned, it's not suitable for the mainline. 
>>> >>> Honestly, for improving the fork(), I have an idea to skip the per-page >>> operation without breaking the logic. However, this will introduce the >>> complicated mechanism and may has the overhead for other features. It >>> might not be worth it. It's hard to strike a balance between the >>> over-complicated mechanism with (probably) better performance and data >>> consistency with the page status. So, I would focus on the safety and >>> stable approach at first. >> >> Yes, it is most probably possible, but complexity, robustness and >> maintainability have to be considered as well. >> >> Thanks for implementing this approach (only deduplication without other >> optimizations) and evaluating it accordingly. It's certainly "cleaner", >> such that we only have to mess with unsharing and not with other >> accounting/pinning/mapcount thingies. But it also highlights how >> intrusive even this basic deduplication approach already is -- and that >> most benefits of the original approach requires even more complexity on top. >> >> I am not quite sure if the benefit is worth the price (I am not to >> decide and I would like to hear other options). >> >> My quick thoughts after skimming over the core parts of this series >> >> (1) forgetting to break COW on a PTE in some pgtable walker feels quite >> likely (meaning that it might be fairly error-prone) and forgetting >> to break COW on a PTE table, accidentally modifying the shared >> table. >> (2) break_cow_pte() can fail, which means that we can fail some >> operations (possibly silently halfway through) now. For example, >> looking at your change_pte_range() change, I suspect it's wrong. >> (3) handle_cow_pte_fault() looks quite complicated and needs quite some >> double-checking: we temporarily clear the PMD, to reset it >> afterwards. I am not sure if that is correct. For example, what >> stops another page fault stumbling over that pmd_none() and >> allocating an empty page table? Maybe there are some locking details >> missing or they are very subtle such that we better document them. I >> recall that THP played quite some tricks to make such cases work ... >> >>> >>>>> Actually, at the RFC v1 and v2, we proposed the version of skipping >>>>> those works, and we got a significant improvement. You can see the >>>>> number from RFC v2 cover letter [1]: >>>>> "In short, with 512 MB mapped memory, COW PTE decreases latency by 93% >>>>> for normal fork" >>>> >>>> I suspect the 93% improvement (when the mapcount was not updated) was >>>> only for VAs with 4K pages. With 2M mappings this series did not >>>> provide any benefit is this correct? >>> >>> Yes. In this case, the COW PTE performance is similar to the normal >>> fork(). >> >> >> The thing with THP is, that during fork(), we always allocate a backup >> PTE table, to be able to PTE-map the THP whenever we have to. Otherwise >> we'd have to eventually fail some operations we don't want to fail -- >> similar to the case where break_cow_pte() could fail now due to -ENOMEM >> although we really don't want to fail (e.g., change_pte_range() ). >> >> I always considered that wasteful, because in many scenarios, we'll >> never ever split a THP and possibly waste memory. > > When you say "split THP", do you mean split the compound page to base > pages? IIUC the backup PTE table page is used to guarantee the PMD > split (just convert pmd mapped THP to PTE-mapped but not split the > compound page) succeed. You may already notice there is no return > value for PMD split. 
Yes, as I raised in my other reply. > > The PMD split may be called quite often, for example, MADV_DONTNEED, > mbind, mlock, and even in memory reclamation context (THP swap). Yes, but with a single MADV_DONTNEED call you cannot PTE-map more than 2 THP (all other overlapped THP will get zapped). Same with most other operations. There are corner cases, though. I recall that s390x/kvm wants to break all THP in a given VMA range. But that operation could safely fail if we can't do that. Certainly needs some investigation, that's most probably why it hasn't been done yet. > >> >> Optimizing that for THP (e.g., don't always allocate backup THP, have >> some global allocation backup pool for splits + refill when >> close-to-empty) might provide similar fork() improvements, both in speed >> and memory consumption when it comes to anonymous memory. > > It might work. But may be much more complicated than what you thought > when handling multiple parallel PMD splits. I consider the whole PTE-table linking to THPs complicated enough to eventually replace it by something differently complicated that wastes less memory ;) -- Thanks, David / dhildenb
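For reference, the "Fork latency per gigabyte" numbers quoted above come from a simple measurement program that is not included in the thread; the following is only a minimal userspace sketch of that kind of benchmark (illustrative, not the program Pasha actually ran):

/* forkbench.c -- build with: cc -O2 -o forkbench forkbench.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
	size_t gigs = argc > 1 ? strtoul(argv[1], NULL, 0) : 1;
	size_t len = gigs << 30;
	struct timespec t0, t1;
	char *mem;
	pid_t pid;

	mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(mem, 1, len);	/* populate the parent's page tables */

	for (int i = 0; i < 3; i++) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		pid = fork();
		if (pid == 0)
			_exit(0);	/* child exits immediately */
		clock_gettime(CLOCK_MONOTONIC, &t1);
		waitpid(pid, NULL, 0);

		double s = (t1.tv_sec - t0.tv_sec) +
			   (t1.tv_nsec - t0.tv_nsec) / 1e9;
		printf("Fork latency per gigabyte: %f seconds\n", s / gigs);
	}
	return 0;
}

Since the child exits without touching anything, what this times is essentially the per-entry work done in the parent during fork() (page-table copy, or with this series the traversal/accounting), which is why the per-gigabyte latency still tracks the parent's mapped size.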
On Tue, Feb 14, 2023 at 9:39 AM David Hildenbrand <david@redhat.com> wrote: > > On 14.02.23 18:23, Yang Shi wrote: > > On Tue, Feb 14, 2023 at 1:58 AM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 10.02.23 18:20, Chih-En Lin wrote: > >>> On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote: > >>>>>>> Currently, copy-on-write is only used for the mapped memory; the child > >>>>>>> process still needs to copy the entire page table from the parent > >>>>>>> process during forking. The parent process might take a lot of time and > >>>>>>> memory to copy the page table when the parent has a big page table > >>>>>>> allocated. For example, the memory usage of a process after forking with > >>>>>>> 1 GB mapped memory is as follows: > >>>>>> > >>>>>> For some reason, I was not able to reproduce performance improvements > >>>>>> with a simple fork() performance measurement program. The results that > >>>>>> I saw are the following: > >>>>>> > >>>>>> Base: > >>>>>> Fork latency per gigabyte: 0.004416 seconds > >>>>>> Fork latency per gigabyte: 0.004382 seconds > >>>>>> Fork latency per gigabyte: 0.004442 seconds > >>>>>> COW kernel: > >>>>>> Fork latency per gigabyte: 0.004524 seconds > >>>>>> Fork latency per gigabyte: 0.004764 seconds > >>>>>> Fork latency per gigabyte: 0.004547 seconds > >>>>>> > >>>>>> AMD EPYC 7B12 64-Core Processor > >>>>>> Base: > >>>>>> Fork latency per gigabyte: 0.003923 seconds > >>>>>> Fork latency per gigabyte: 0.003909 seconds > >>>>>> Fork latency per gigabyte: 0.003955 seconds > >>>>>> COW kernel: > >>>>>> Fork latency per gigabyte: 0.004221 seconds > >>>>>> Fork latency per gigabyte: 0.003882 seconds > >>>>>> Fork latency per gigabyte: 0.003854 seconds > >>>>>> > >>>>>> Given, that page table for child is not copied, I was expecting the > >>>>>> performance to be better with COW kernel, and also not to depend on > >>>>>> the size of the parent. > >>>>> > >>>>> Yes, the child won't duplicate the page table, but fork will still > >>>>> traverse all the page table entries to do the accounting. > >>>>> And, since this patch expends the COW to the PTE table level, it's not > >>>>> the mapped page (page table entry) grained anymore, so we have to > >>>>> guarantee that all the mapped page is available to do COW mapping in > >>>>> the such page table. > >>>>> This kind of checking also costs some time. > >>>>> As a result, since the accounting and the checking, the COW PTE fork > >>>>> still depends on the size of the parent so the improvement might not > >>>>> be significant. > >>>> > >>>> The current version of the series does not provide any performance > >>>> improvements for fork(). I would recommend removing claims from the > >>>> cover letter about better fork() performance, as this may be > >>>> misleading for those looking for a way to speed up forking. In my > >>> > >>> From v3 to v4, I changed the implementation of the COW fork() part to do > >>> the accounting and checking. At the time, I also removed most of the > >>> descriptions about the better fork() performance. Maybe it's not enough > >>> and still has some misleading. I will fix this in the next version. > >>> Thanks. > >>> > >>>> case, I was looking to speed up Redis OSS, which relies on fork() to > >>>> create consistent snapshots for driving replicates/backups. The O(N) > >>>> per-page operation causes fork() to be slow, so I was hoping that this > >>>> series, which does not duplicate the VA during fork(), would make the > >>>> operation much quicker. 
> >>> > >>> Indeed, at first, I tried to avoid the O(N) per-page operation by > >>> deferring the accounting and the swap stuff to the page fault. But, > >>> as I mentioned, it's not suitable for the mainline. > >>> > >>> Honestly, for improving the fork(), I have an idea to skip the per-page > >>> operation without breaking the logic. However, this will introduce the > >>> complicated mechanism and may has the overhead for other features. It > >>> might not be worth it. It's hard to strike a balance between the > >>> over-complicated mechanism with (probably) better performance and data > >>> consistency with the page status. So, I would focus on the safety and > >>> stable approach at first. > >> > >> Yes, it is most probably possible, but complexity, robustness and > >> maintainability have to be considered as well. > >> > >> Thanks for implementing this approach (only deduplication without other > >> optimizations) and evaluating it accordingly. It's certainly "cleaner", > >> such that we only have to mess with unsharing and not with other > >> accounting/pinning/mapcount thingies. But it also highlights how > >> intrusive even this basic deduplication approach already is -- and that > >> most benefits of the original approach requires even more complexity on top. > >> > >> I am not quite sure if the benefit is worth the price (I am not to > >> decide and I would like to hear other options). > >> > >> My quick thoughts after skimming over the core parts of this series > >> > >> (1) forgetting to break COW on a PTE in some pgtable walker feels quite > >> likely (meaning that it might be fairly error-prone) and forgetting > >> to break COW on a PTE table, accidentally modifying the shared > >> table. > >> (2) break_cow_pte() can fail, which means that we can fail some > >> operations (possibly silently halfway through) now. For example, > >> looking at your change_pte_range() change, I suspect it's wrong. > >> (3) handle_cow_pte_fault() looks quite complicated and needs quite some > >> double-checking: we temporarily clear the PMD, to reset it > >> afterwards. I am not sure if that is correct. For example, what > >> stops another page fault stumbling over that pmd_none() and > >> allocating an empty page table? Maybe there are some locking details > >> missing or they are very subtle such that we better document them. I > >> recall that THP played quite some tricks to make such cases work ... > >> > >>> > >>>>> Actually, at the RFC v1 and v2, we proposed the version of skipping > >>>>> those works, and we got a significant improvement. You can see the > >>>>> number from RFC v2 cover letter [1]: > >>>>> "In short, with 512 MB mapped memory, COW PTE decreases latency by 93% > >>>>> for normal fork" > >>>> > >>>> I suspect the 93% improvement (when the mapcount was not updated) was > >>>> only for VAs with 4K pages. With 2M mappings this series did not > >>>> provide any benefit is this correct? > >>> > >>> Yes. In this case, the COW PTE performance is similar to the normal > >>> fork(). > >> > >> > >> The thing with THP is, that during fork(), we always allocate a backup > >> PTE table, to be able to PTE-map the THP whenever we have to. Otherwise > >> we'd have to eventually fail some operations we don't want to fail -- > >> similar to the case where break_cow_pte() could fail now due to -ENOMEM > >> although we really don't want to fail (e.g., change_pte_range() ). 
> >> > >> I always considered that wasteful, because in many scenarios, we'll > >> never ever split a THP and possibly waste memory. > > > > When you say "split THP", do you mean split the compound page to base > > pages? IIUC the backup PTE table page is used to guarantee the PMD > > split (just convert pmd mapped THP to PTE-mapped but not split the > > compound page) succeed. You may already notice there is no return > > value for PMD split. > > Yes, as I raised in my other reply. > > > > > The PMD split may be called quite often, for example, MADV_DONTNEED, > > mbind, mlock, and even in memory reclamation context (THP swap). > > Yes, but with a single MADV_DONTNEED call you cannot PTE-map more than 2 > THP (all other overlapped THP will get zapped). Same with most other > operations. My point is there may be multiple processes calling PMD split on different THPs at the same time. > > There are corner cases, though. I recall that s390x/kvm wants to break > all THP in a given VMA range. But that operation could safely fail if we > can't do that. I suppose that is the THP split (splitting the compound page), which may fail. > > Certainly needs some investigation, that's most probably why it hasn't > been done yet. > > > > >> > >> Optimizing that for THP (e.g., don't always allocate backup THP, have > >> some global allocation backup pool for splits + refill when > >> close-to-empty) might provide similar fork() improvements, both in speed > >> and memory consumption when it comes to anonymous memory. > > > > It might work. But may be much more complicated than what you thought > > when handling multiple parallel PMD splits. > > > I consider the whole PTE-table linking to THPs complicated enough to > eventually replace it by something differently complicated that wastes > less memory ;) Maybe... > > -- > Thanks, > > David / dhildenb >
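As a concrete illustration of the single-call case discussed above, here is a hedged userspace sketch: punching one 4 KiB hole in the middle of a (hopefully) THP-backed range triggers exactly one PMD split, which is where the deposited/backup PTE table gets consumed. Whether a THP is installed at all depends on the transparent_hugepage configuration, so treat this as illustrative only.

/* pmd_split_demo.c -- build with: cc -O2 -o pmd_split_demo pmd_split_demo.c */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define MB (1024UL * 1024UL)

int main(void)
{
	/* Over-allocate so we can carve out a 2 MiB-aligned block. */
	char *map = mmap(NULL, 4 * MB, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *aligned;

	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	aligned = (char *)(((uintptr_t)map + 2 * MB - 1) & ~(2 * MB - 1));

	madvise(aligned, 2 * MB, MADV_HUGEPAGE);
	memset(aligned, 1, 2 * MB);	/* fault in, ideally as a single THP */

	/*
	 * Zapping one 4 KiB page cannot be done on a PMD mapping, so the
	 * kernel first PTE-maps the THP (PMD split) and then clears that
	 * single PTE.
	 */
	madvise(aligned + MB, 4096, MADV_DONTNEED);

	printf("inspect AnonHugePages in /proc/self/smaps, then press enter\n");
	getchar();
	return 0;
}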
On Tue, Feb 14, 2023 at 10:58:30AM +0100, David Hildenbrand wrote: > On 10.02.23 18:20, Chih-En Lin wrote: > > On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote: > > > > > > Currently, copy-on-write is only used for the mapped memory; the child > > > > > > process still needs to copy the entire page table from the parent > > > > > > process during forking. The parent process might take a lot of time and > > > > > > memory to copy the page table when the parent has a big page table > > > > > > allocated. For example, the memory usage of a process after forking with > > > > > > 1 GB mapped memory is as follows: > > > > > > > > > > For some reason, I was not able to reproduce performance improvements > > > > > with a simple fork() performance measurement program. The results that > > > > > I saw are the following: > > > > > > > > > > Base: > > > > > Fork latency per gigabyte: 0.004416 seconds > > > > > Fork latency per gigabyte: 0.004382 seconds > > > > > Fork latency per gigabyte: 0.004442 seconds > > > > > COW kernel: > > > > > Fork latency per gigabyte: 0.004524 seconds > > > > > Fork latency per gigabyte: 0.004764 seconds > > > > > Fork latency per gigabyte: 0.004547 seconds > > > > > > > > > > AMD EPYC 7B12 64-Core Processor > > > > > Base: > > > > > Fork latency per gigabyte: 0.003923 seconds > > > > > Fork latency per gigabyte: 0.003909 seconds > > > > > Fork latency per gigabyte: 0.003955 seconds > > > > > COW kernel: > > > > > Fork latency per gigabyte: 0.004221 seconds > > > > > Fork latency per gigabyte: 0.003882 seconds > > > > > Fork latency per gigabyte: 0.003854 seconds > > > > > > > > > > Given, that page table for child is not copied, I was expecting the > > > > > performance to be better with COW kernel, and also not to depend on > > > > > the size of the parent. > > > > > > > > Yes, the child won't duplicate the page table, but fork will still > > > > traverse all the page table entries to do the accounting. > > > > And, since this patch expends the COW to the PTE table level, it's not > > > > the mapped page (page table entry) grained anymore, so we have to > > > > guarantee that all the mapped page is available to do COW mapping in > > > > the such page table. > > > > This kind of checking also costs some time. > > > > As a result, since the accounting and the checking, the COW PTE fork > > > > still depends on the size of the parent so the improvement might not > > > > be significant. > > > > > > The current version of the series does not provide any performance > > > improvements for fork(). I would recommend removing claims from the > > > cover letter about better fork() performance, as this may be > > > misleading for those looking for a way to speed up forking. In my > > > > From v3 to v4, I changed the implementation of the COW fork() part to do > > the accounting and checking. At the time, I also removed most of the > > descriptions about the better fork() performance. Maybe it's not enough > > and still has some misleading. I will fix this in the next version. > > Thanks. > > > > > case, I was looking to speed up Redis OSS, which relies on fork() to > > > create consistent snapshots for driving replicates/backups. The O(N) > > > per-page operation causes fork() to be slow, so I was hoping that this > > > series, which does not duplicate the VA during fork(), would make the > > > operation much quicker. > > > > Indeed, at first, I tried to avoid the O(N) per-page operation by > > deferring the accounting and the swap stuff to the page fault. 
But, > > as I mentioned, it's not suitable for the mainline. > > > > Honestly, for improving the fork(), I have an idea to skip the per-page > > operation without breaking the logic. However, this will introduce the > > complicated mechanism and may has the overhead for other features. It > > might not be worth it. It's hard to strike a balance between the > > over-complicated mechanism with (probably) better performance and data > > consistency with the page status. So, I would focus on the safety and > > stable approach at first. > > Yes, it is most probably possible, but complexity, robustness and > maintainability have to be considered as well. > > Thanks for implementing this approach (only deduplication without other > optimizations) and evaluating it accordingly. It's certainly "cleaner", such > that we only have to mess with unsharing and not with other > accounting/pinning/mapcount thingies. But it also highlights how intrusive > even this basic deduplication approach already is -- and that most benefits > of the original approach requires even more complexity on top. > > I am not quite sure if the benefit is worth the price (I am not to decide > and I would like to hear other options). I'm looking at the discussion of page table sharing from 2002 [1]. It looks like in 2002 ~ 2006 there were also some patches that tried to improve fork(). After that, I also saw a thread about the benchmark of another shared page table patch. I can't find the original patch though [2]. But I found what is probably the same patch in 2005 [3]; it also mentioned the previous benchmark discussion: " For those familiar with the shared page table patch I did a couple of years ago, this patch does not implement copy-on-write page tables for private mappings. Analysis showed the cost and complexity far outweighed any potential benefit. " However, things might be different right now. For example, the implementation: we have the split page table lock now, so we don't have to consider the page_table_share_lock thing. Also, presently, we have different use cases (shells [2] vs. VM cloning and fuzzing) to consider. Nonetheless, I still think the discussion can give us some insight. BTW, it seems like the 2002 patch [1] is different from the 2002 [2] and 2005 [3] ones. [1] https://lkml.iu.edu/hypermail/linux/kernel/0202.2/0102.html [2] https://lore.kernel.org/linux-mm/3E02FACD.5B300794@digeo.com/ [3] https://lore.kernel.org/linux-mm/7C49DFF721CB4E671DB260F9@%5B10.1.1.4%5D/T/#u > My quick thoughts after skimming over the core parts of this series > > (1) forgetting to break COW on a PTE in some pgtable walker feels quite > likely (meaning that it might be fairly error-prone) and forgetting > to break COW on a PTE table, accidentally modifying the shared > table. Maybe I should also handle arch/ and other parts. I will keep looking for the places I missed. > (2) break_cow_pte() can fail, which means that we can fail some > operations (possibly silently halfway through) now. For example, > looking at your change_pte_range() change, I suspect it's wrong. Maybe I should add WARN_ON() and skip the failed COW PTE. > (3) handle_cow_pte_fault() looks quite complicated and needs quite some > double-checking: we temporarily clear the PMD, to reset it > afterwards. I am not sure if that is correct. For example, what > stops another page fault stumbling over that pmd_none() and > allocating an empty page table? Maybe there are some locking details > missing or they are very subtle such that we better document them.
I > recall that THP played quite some tricks to make such cases work ... I think that holding mmap_write_lock may be enough (I added mmap_assert_write_locked() in the fault function btw). But, I might be wrong. I will look at the THP stuff to see how they work. Thanks. Thanks for the review. > > > > > > Actually, at the RFC v1 and v2, we proposed the version of skipping > > > > those works, and we got a significant improvement. You can see the > > > > number from RFC v2 cover letter [1]: > > > > "In short, with 512 MB mapped memory, COW PTE decreases latency by 93% > > > > for normal fork" > > > > > > I suspect the 93% improvement (when the mapcount was not updated) was > > > only for VAs with 4K pages. With 2M mappings this series did not > > > provide any benefit is this correct? > > > > Yes. In this case, the COW PTE performance is similar to the normal > > fork(). > > > The thing with THP is, that during fork(), we always allocate a backup PTE > table, to be able to PTE-map the THP whenever we have to. Otherwise we'd > have to eventually fail some operations we don't want to fail -- similar to > the case where break_cow_pte() could fail now due to -ENOMEM although we > really don't want to fail (e.g., change_pte_range() ). > > I always considered that wasteful, because in many scenarios, we'll never > ever split a THP and possibly waste memory. > > Optimizing that for THP (e.g., don't always allocate backup THP, have some > global allocation backup pool for splits + refill when close-to-empty) might > provide similar fork() improvements, both in speed and memory consumption > when it comes to anonymous memory. When collapsing huge pages, do/can they reuse those PTEs for backup? So, we don't have to allocate the PTE or maintain the pool. Thanks, Chih-En Lin
>>> >>> Honestly, for improving the fork(), I have an idea to skip the per-page >>> operation without breaking the logic. However, this will introduce the >>> complicated mechanism and may has the overhead for other features. It >>> might not be worth it. It's hard to strike a balance between the >>> over-complicated mechanism with (probably) better performance and data >>> consistency with the page status. So, I would focus on the safety and >>> stable approach at first. >> >> Yes, it is most probably possible, but complexity, robustness and >> maintainability have to be considered as well. >> >> Thanks for implementing this approach (only deduplication without other >> optimizations) and evaluating it accordingly. It's certainly "cleaner", such >> that we only have to mess with unsharing and not with other >> accounting/pinning/mapcount thingies. But it also highlights how intrusive >> even this basic deduplication approach already is -- and that most benefits >> of the original approach requires even more complexity on top. >> >> I am not quite sure if the benefit is worth the price (I am not to decide >> and I would like to hear other options). > > I'm looking at the discussion of page table sharing in 2002 [1]. > It looks like in 2002 ~ 2006, there also have some patches try to > improve fork(). > > After that, I also saw one thread which is about another shared page > table patch's benchmark. I can't find the original patch though [2]. > But, I found the probably same patch in 2005 [3], it also mentioned > the previous benchmark discussion: > > " > For those familiar with the shared page table patch I did a couple of years > ago, this patch does not implement copy-on-write page tables for private > mappings. Analysis showed the cost and complexity far outweighed any > potential benefit. > " Thanks for the pointer, interesting read. And my personal opinion is that part of that statement still hold true :) > > However, it might be different right now. For example, the implemetation > . We have split page table lock now, so we don't have to consider the > page_table_share_lock thing. Also, presently, we have different use > cases (shells [2] v.s. VM cloning and fuzzing) to consider. > > Nonetheless, I still think the discussion can provide some of the mind > to us. > > BTW, It seems like the 2002 patch [1] is different from the 2002 [2] > and 2005 [3]. > > [1] https://lkml.iu.edu/hypermail/linux/kernel/0202.2/0102.html > [2] https://lore.kernel.org/linux-mm/3E02FACD.5B300794@digeo.com/ > [3] https://lore.kernel.org/linux-mm/7C49DFF721CB4E671DB260F9@%5B10.1.1.4%5D/T/#u > >> My quick thoughts after skimming over the core parts of this series >> >> (1) forgetting to break COW on a PTE in some pgtable walker feels quite >> likely (meaning that it might be fairly error-prone) and forgetting >> to break COW on a PTE table, accidentally modifying the shared >> table. > > Maybe I should also handle arch/ and others parts. > I will keep looking at where I missed. One could add sanity checks when modifying a PTE while the PTE table is still marked shared ... but I guess there are some valid reasons where we might want to modify shared PTE tables (rmap). > >> (2) break_cow_pte() can fail, which means that we can fail some >> operations (possibly silently halfway through) now. For example, >> looking at your change_pte_range() change, I suspect it's wrong. > > Maybe I should add WARN_ON() and skip the failed COW PTE. One way or the other we'll have to handle it. 
WARN_ON() sounds wrong for handling OOM situations (e.g., if only that cgroup is OOM). > >> (3) handle_cow_pte_fault() looks quite complicated and needs quite some >> double-checking: we temporarily clear the PMD, to reset it >> afterwards. I am not sure if that is correct. For example, what >> stops another page fault stumbling over that pmd_none() and >> allocating an empty page table? Maybe there are some locking details >> missing or they are very subtle such that we better document them. I >> recall that THP played quite some tricks to make such cases work ... > > I think that holding mmap_write_lock may be enough (I added > mmap_assert_write_locked() in the fault function btw). But, I might > be wrong. I will look at the THP stuff to see how they work. Thanks. > Ehm, but page faults don't hold the mmap lock writable? And so are other callers, like MADV_DONTNEED or MADV_FREE. handle_pte_fault()->handle_pte_fault()->mmap_assert_write_locked() should bail out. Either I am missing something or you didn't test with lockdep enabled :) Note that there are upstream efforts to use only a VMA lock (and some people even want to perform some page faults only protected by RCU). -- Thanks, David / dhildenb
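To make the lock expectation above concrete, here is a hedged kernel-style sketch (not code from this series; cow_pte_fault_sketch() is a made-up name): the fault path holds the mmap lock for read, so the strongest assertion a fault handler can make is mmap_assert_locked().

/* Hypothetical sketch only -- illustrates the lockdep point, not the series. */
static vm_fault_t cow_pte_fault_sketch(struct vm_fault *vmf)
{
	struct mm_struct *mm = vmf->vma->vm_mm;

	/*
	 * Page faults take the mmap lock for read, so this would trip
	 * lockdep as soon as the path is exercised:
	 *
	 *	mmap_assert_write_locked(mm);
	 */
	mmap_assert_locked(mm);		/* held for read or write */

	/* ... unshare the COW-ed PTE table under the pmd/pte locks ... */
	return 0;
}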
On Tue, Feb 14, 2023 at 05:58:45PM +0100, David Hildenbrand wrote: > > > > > > > > > Honestly, for improving the fork(), I have an idea to skip the per-page > > > > operation without breaking the logic. However, this will introduce the > > > > complicated mechanism and may has the overhead for other features. It > > > > might not be worth it. It's hard to strike a balance between the > > > > over-complicated mechanism with (probably) better performance and data > > > > consistency with the page status. So, I would focus on the safety and > > > > stable approach at first. > > > > > > Yes, it is most probably possible, but complexity, robustness and > > > maintainability have to be considered as well. > > > > > > Thanks for implementing this approach (only deduplication without other > > > optimizations) and evaluating it accordingly. It's certainly "cleaner", such > > > that we only have to mess with unsharing and not with other > > > accounting/pinning/mapcount thingies. But it also highlights how intrusive > > > even this basic deduplication approach already is -- and that most benefits > > > of the original approach requires even more complexity on top. > > > > > > I am not quite sure if the benefit is worth the price (I am not to decide > > > and I would like to hear other options). > > > > I'm looking at the discussion of page table sharing in 2002 [1]. > > It looks like in 2002 ~ 2006, there also have some patches try to > > improve fork(). > > > > After that, I also saw one thread which is about another shared page > > table patch's benchmark. I can't find the original patch though [2]. > > But, I found the probably same patch in 2005 [3], it also mentioned > > the previous benchmark discussion: > > > > " > > For those familiar with the shared page table patch I did a couple of years > > ago, this patch does not implement copy-on-write page tables for private > > mappings. Analysis showed the cost and complexity far outweighed any > > potential benefit. > > " > > Thanks for the pointer, interesting read. And my personal opinion is that > part of that statement still hold true :) ;) > > > > However, it might be different right now. For example, the implemetation > > . We have split page table lock now, so we don't have to consider the > > page_table_share_lock thing. Also, presently, we have different use > > cases (shells [2] v.s. VM cloning and fuzzing) to consider. > > > > Nonetheless, I still think the discussion can provide some of the mind > > to us. > > > > BTW, It seems like the 2002 patch [1] is different from the 2002 [2] > > and 2005 [3]. > > > > [1] https://lkml.iu.edu/hypermail/linux/kernel/0202.2/0102.html > > [2] https://lore.kernel.org/linux-mm/3E02FACD.5B300794@digeo.com/ > > [3] https://lore.kernel.org/linux-mm/7C49DFF721CB4E671DB260F9@%5B10.1.1.4%5D/T/#u > > > > > My quick thoughts after skimming over the core parts of this series > > > > > > (1) forgetting to break COW on a PTE in some pgtable walker feels quite > > > likely (meaning that it might be fairly error-prone) and forgetting > > > to break COW on a PTE table, accidentally modifying the shared > > > table. > > > > Maybe I should also handle arch/ and others parts. > > I will keep looking at where I missed. > > One could add sanity checks when modifying a PTE while the PTE table is > still marked shared ... but I guess there are some valid reasons where we > might want to modify shared PTE tables (rmap). Sounds good for adding sanity checks. I will look at this. 
One valid reason that comes to mind might be the referenced bit (rmap). > > > > > (2) break_cow_pte() can fail, which means that we can fail some > > > operations (possibly silently halfway through) now. For example, > > > looking at your change_pte_range() change, I suspect it's wrong. > > > > Maybe I should add WARN_ON() and skip the failed COW PTE. > > One way or the other we'll have to handle it. WARN_ON() sounds wrong for > handling OOM situations (e.g., if only that cgroup is OOM). Or we could do the same thing you mentioned: " For example, __split_huge_pmd() is currently not able to report a failure. I assume that we could sleep in there. And if we're not able to allocate any memory in there (with sleeping), maybe the process should be zapped either way by the OOM killer. " But instead of zapping the process, we would just skip the failed COW PTE. I don't think users expect their process to be killed because they changed the protection. > > > > > (3) handle_cow_pte_fault() looks quite complicated and needs quite some > > > double-checking: we temporarily clear the PMD, to reset it > > > afterwards. I am not sure if that is correct. For example, what > > > stops another page fault stumbling over that pmd_none() and > > > allocating an empty page table? Maybe there are some locking details > > > missing or they are very subtle such that we better document them. I > > > recall that THP played quite some tricks to make such cases work ... > > > > I think that holding mmap_write_lock may be enough (I added > > mmap_assert_write_locked() in the fault function btw). But, I might > > be wrong. I will look at the THP stuff to see how they work. Thanks. > > > > Ehm, but page faults don't hold the mmap lock writable? And so are other > callers, like MADV_DONTNEED or MADV_FREE. > > handle_pte_fault()->handle_pte_fault()->mmap_assert_write_locked() should > bail out. > > Either I am missing something or you didn't test with lockdep enabled :) You're right. I thought I had enabled lockdep. I'm not sure why I had it in my mind that the page fault path takes the mmap lock for write; the page fault holds the mmap lock for read, not for write. ;-) I should check/test all the locks again. Thanks. > > Note that there are upstream efforts to use only a VMA lock (and some people > even want to perform some page faults only protected by RCU). I saw the discussion (https://lwn.net/Articles/906852/) before. If the page fault handler only uses a VMA lock, handle_cow_pte_fault() might not be affected since it only takes one VMA at a time. handle_cow_pte_fault() just allocates the PTE table and copies the COW mapping entries to the new one; the checking and accounting are already handled in copy_cow_pte_range(). But if we decide to skip the per-page operation during fork(), we will have to handle the VMA lock (or RCU) for the accounting and other stuff. It might be more complicated than before... Thanks, Chih-En Lin
On 14.02.23 18:54, Chih-En Lin wrote: >>> >>>> (2) break_cow_pte() can fail, which means that we can fail some >>>> operations (possibly silently halfway through) now. For example, >>>> looking at your change_pte_range() change, I suspect it's wrong. >>> >>> Maybe I should add WARN_ON() and skip the failed COW PTE. >> >> One way or the other we'll have to handle it. WARN_ON() sounds wrong for >> handling OOM situations (e.g., if only that cgroup is OOM). > > Or we should do the same thing like you mentioned: > " > For example, __split_huge_pmd() is currently not able to report a > failure. I assume that we could sleep in there. And if we're not able to > allocate any memory in there (with sleeping), maybe the process should > be zapped either way by the OOM killer. > " > > But instead of zapping the process, we just skip the failed COW PTE. > I don't think the user will expect their process to be killed by > changing the protection. The process is consuming more memory than it is capable of consuming. The process most probably would have died earlier without the PTE optimization. But yeah, it all gets tricky ... > >>> >>>> (3) handle_cow_pte_fault() looks quite complicated and needs quite some >>>> double-checking: we temporarily clear the PMD, to reset it >>>> afterwards. I am not sure if that is correct. For example, what >>>> stops another page fault stumbling over that pmd_none() and >>>> allocating an empty page table? Maybe there are some locking details >>>> missing or they are very subtle such that we better document them. I >>>> recall that THP played quite some tricks to make such cases work ... >>> >>> I think that holding mmap_write_lock may be enough (I added >>> mmap_assert_write_locked() in the fault function btw). But, I might >>> be wrong. I will look at the THP stuff to see how they work. Thanks. >>> >> >> Ehm, but page faults don't hold the mmap lock writable? And so are other >> callers, like MADV_DONTNEED or MADV_FREE. >> >> handle_pte_fault()->handle_pte_fault()->mmap_assert_write_locked() should >> bail out. >> >> Either I am missing something or you didn't test with lockdep enabled :) > > You're right. I thought I enabled the lockdep. > And, why do I have the page fault will handle the mmap lock writable in my mind. > The page fault holds the mmap lock readable instead of writable. > ;-) > > I should check/test all the locks again. > Thanks. Note that we have other ways of traversing page tables, especially, using the rmap which does not hold the mmap lock. Not sure if there are similar issues when suddenly finding no page table where there logically should be one. Or when a page table gets replaced and modified, while rmap code still walks the shared copy. Hm. -- Thanks, David / dhildenb
On Tue, Feb 14, 2023 at 06:59:50PM +0100, David Hildenbrand wrote: > On 14.02.23 18:54, Chih-En Lin wrote: > > > > > > > > > (2) break_cow_pte() can fail, which means that we can fail some > > > > > operations (possibly silently halfway through) now. For example, > > > > > looking at your change_pte_range() change, I suspect it's wrong. > > > > > > > > Maybe I should add WARN_ON() and skip the failed COW PTE. > > > > > > One way or the other we'll have to handle it. WARN_ON() sounds wrong for > > > handling OOM situations (e.g., if only that cgroup is OOM). > > > > Or we should do the same thing like you mentioned: > > " > > For example, __split_huge_pmd() is currently not able to report a > > failure. I assume that we could sleep in there. And if we're not able to > > allocate any memory in there (with sleeping), maybe the process should > > be zapped either way by the OOM killer. > > " > > > > But instead of zapping the process, we just skip the failed COW PTE. > > I don't think the user will expect their process to be killed by > > changing the protection. > > The process is consuming more memory than it is capable of consuming. The > process most probably would have died earlier without the PTE optimization. > > But yeah, it all gets tricky ... > > > > > > > > > > > > (3) handle_cow_pte_fault() looks quite complicated and needs quite some > > > > > double-checking: we temporarily clear the PMD, to reset it > > > > > afterwards. I am not sure if that is correct. For example, what > > > > > stops another page fault stumbling over that pmd_none() and > > > > > allocating an empty page table? Maybe there are some locking details > > > > > missing or they are very subtle such that we better document them. I > > > > > recall that THP played quite some tricks to make such cases work ... > > > > > > > > I think that holding mmap_write_lock may be enough (I added > > > > mmap_assert_write_locked() in the fault function btw). But, I might > > > > be wrong. I will look at the THP stuff to see how they work. Thanks. > > > > > > > > > > Ehm, but page faults don't hold the mmap lock writable? And so are other > > > callers, like MADV_DONTNEED or MADV_FREE. > > > > > > handle_pte_fault()->handle_pte_fault()->mmap_assert_write_locked() should > > > bail out. > > > > > > Either I am missing something or you didn't test with lockdep enabled :) > > > > You're right. I thought I enabled the lockdep. > > And, why do I have the page fault will handle the mmap lock writable in my mind. > > The page fault holds the mmap lock readable instead of writable. > > ;-) > > > > I should check/test all the locks again. > > Thanks. > > Note that we have other ways of traversing page tables, especially, using > the rmap which does not hold the mmap lock. Not sure if there are similar > issues when suddenly finding no page table where there logically should be > one. Or when a page table gets replaced and modified, while rmap code still > walks the shared copy. Hm. It seems like I should take carefully for the page table entry in page fault with rmap. ;) While the rmap code walks the page table, it will hold the pt lock. So, maybe I should hold the old (shared) PTE table's lock in handle_cow_pte_fault() all the time. Thanks, Chih-En Lin
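For readers following along, a hedged kernel-style sketch (again, not from the series; inspect_one_pte() is a made-up helper) of the split page-table lock pattern being referred to: rmap walkers and fault handlers both take the per-table lock via pte_offset_map_lock(), so holding the shared table's lock while it is being replaced would exclude concurrent walkers of that table.

/* Hypothetical illustration of the split-PTL pattern, not series code. */
static void inspect_one_pte(struct mm_struct *mm, pmd_t *pmd,
			    unsigned long addr)
{
	spinlock_t *ptl;
	pte_t *pte;

	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);	/* per-table lock */

	/*
	 * *pte cannot change underneath us here; rmap walkers take the
	 * same lock before touching entries in this table.
	 */

	pte_unmap_unlock(pte, ptl);
}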
On 14.02.23 17:58, David Hildenbrand wrote: > >>>> >>>> Honestly, for improving the fork(), I have an idea to skip the per-page >>>> operation without breaking the logic. However, this will introduce the >>>> complicated mechanism and may has the overhead for other features. It >>>> might not be worth it. It's hard to strike a balance between the >>>> over-complicated mechanism with (probably) better performance and data >>>> consistency with the page status. So, I would focus on the safety and >>>> stable approach at first. >>> >>> Yes, it is most probably possible, but complexity, robustness and >>> maintainability have to be considered as well. >>> >>> Thanks for implementing this approach (only deduplication without other >>> optimizations) and evaluating it accordingly. It's certainly "cleaner", such >>> that we only have to mess with unsharing and not with other >>> accounting/pinning/mapcount thingies. But it also highlights how intrusive >>> even this basic deduplication approach already is -- and that most benefits >>> of the original approach requires even more complexity on top. >>> >>> I am not quite sure if the benefit is worth the price (I am not to decide >>> and I would like to hear other options). >> >> I'm looking at the discussion of page table sharing in 2002 [1]. >> It looks like in 2002 ~ 2006, there also have some patches try to >> improve fork(). >> >> After that, I also saw one thread which is about another shared page >> table patch's benchmark. I can't find the original patch though [2]. >> But, I found the probably same patch in 2005 [3], it also mentioned >> the previous benchmark discussion: >> >> " >> For those familiar with the shared page table patch I did a couple of years >> ago, this patch does not implement copy-on-write page tables for private >> mappings. Analysis showed the cost and complexity far outweighed any >> potential benefit. >> " > > Thanks for the pointer, interesting read. And my personal opinion is > that part of that statement still hold true :) > >> >> However, it might be different right now. For example, the implemetation >> . We have split page table lock now, so we don't have to consider the >> page_table_share_lock thing. Also, presently, we have different use >> cases (shells [2] v.s. VM cloning and fuzzing) to consider. Oh, and because I stumbled over it, just as an interesting pointer on QEMU devel: "[PATCH 00/10] Retire Fork-Based Fuzzing" [1] [1] https://lore.kernel.org/all/20230205042951.3570008-1-alxndr@bu.edu/T/#u -- Thanks, David / dhildenb
On Tue, Feb 14, 2023 at 06:03:58PM +0100, David Hildenbrand wrote: > On 14.02.23 17:58, David Hildenbrand wrote: > > > > > > > > > > > > Honestly, for improving the fork(), I have an idea to skip the per-page > > > > > operation without breaking the logic. However, this will introduce the > > > > > complicated mechanism and may has the overhead for other features. It > > > > > might not be worth it. It's hard to strike a balance between the > > > > > over-complicated mechanism with (probably) better performance and data > > > > > consistency with the page status. So, I would focus on the safety and > > > > > stable approach at first. > > > > > > > > Yes, it is most probably possible, but complexity, robustness and > > > > maintainability have to be considered as well. > > > > > > > > Thanks for implementing this approach (only deduplication without other > > > > optimizations) and evaluating it accordingly. It's certainly "cleaner", such > > > > that we only have to mess with unsharing and not with other > > > > accounting/pinning/mapcount thingies. But it also highlights how intrusive > > > > even this basic deduplication approach already is -- and that most benefits > > > > of the original approach requires even more complexity on top. > > > > > > > > I am not quite sure if the benefit is worth the price (I am not to decide > > > > and I would like to hear other options). > > > > > > I'm looking at the discussion of page table sharing in 2002 [1]. > > > It looks like in 2002 ~ 2006, there also have some patches try to > > > improve fork(). > > > > > > After that, I also saw one thread which is about another shared page > > > table patch's benchmark. I can't find the original patch though [2]. > > > But, I found the probably same patch in 2005 [3], it also mentioned > > > the previous benchmark discussion: > > > > > > " > > > For those familiar with the shared page table patch I did a couple of years > > > ago, this patch does not implement copy-on-write page tables for private > > > mappings. Analysis showed the cost and complexity far outweighed any > > > potential benefit. > > > " > > > > Thanks for the pointer, interesting read. And my personal opinion is > > that part of that statement still hold true :) > > > > > > > > However, it might be different right now. For example, the implemetation > > > . We have split page table lock now, so we don't have to consider the > > > page_table_share_lock thing. Also, presently, we have different use > > > cases (shells [2] v.s. VM cloning and fuzzing) to consider. > > > Oh, and because I stumbled over it, just as an interesting pointer on QEMU > devel: > > "[PATCH 00/10] Retire Fork-Based Fuzzing" [1] > > [1] https://lore.kernel.org/all/20230205042951.3570008-1-alxndr@bu.edu/T/#u Thanks for the information. It's interesting. Thanks, Chih-En Lin
> > The thing with THP is, that during fork(), we always allocate a backup PTE > > table, to be able to PTE-map the THP whenever we have to. Otherwise we'd > > have to eventually fail some operations we don't want to fail -- similar to > > the case where break_cow_pte() could fail now due to -ENOMEM although we > > really don't want to fail (e.g., change_pte_range() ). > > > > I always considered that wasteful, because in many scenarios, we'll never > > ever split a THP and possibly waste memory. > > > > Optimizing that for THP (e.g., don't always allocate backup THP, have some > > global allocation backup pool for splits + refill when close-to-empty) might > > provide similar fork() improvements, both in speed and memory consumption > > when it comes to anonymous memory. > > When collapsing huge pages, do/can they reuse those PTEs for backup? > So, we don't have to allocate the PTE or maintain the pool. It might not work for all pages, as collapsing pages might have had holes in the user page table, and there were no PTE tables. Pasha
On Tue, Feb 14, 2023 at 11:30:26AM -0500, Pasha Tatashin wrote: > > > The thing with THP is, that during fork(), we always allocate a backup PTE > > > table, to be able to PTE-map the THP whenever we have to. Otherwise we'd > > > have to eventually fail some operations we don't want to fail -- similar to > > > the case where break_cow_pte() could fail now due to -ENOMEM although we > > > really don't want to fail (e.g., change_pte_range() ). > > > > > > I always considered that wasteful, because in many scenarios, we'll never > > > ever split a THP and possibly waste memory. > > > > > > Optimizing that for THP (e.g., don't always allocate backup THP, have some > > > global allocation backup pool for splits + refill when close-to-empty) might > > > provide similar fork() improvements, both in speed and memory consumption > > > when it comes to anonymous memory. > > > > When collapsing huge pages, do/can they reuse those PTEs for backup? > > So, we don't have to allocate the PTE or maintain the pool. > > It might not work for all pages, as collapsing pages might have had > holes in the user page table, and there were no PTE tables. So if there are holes in the user page table, after we do the collapsing and then the splitting, do those holes get filled? Assuming they do, I think that's the reason why it can't work for all the pages. But after those operations, will the user get additional and unexpected memory (coming from the huge page filling in the holes)? I'm a little bit confused now. Thanks, Chih-En Lin
On Tue, Feb 14, 2023 at 1:42 PM Chih-En Lin <shiyn.lin@gmail.com> wrote: > > On Tue, Feb 14, 2023 at 11:30:26AM -0500, Pasha Tatashin wrote: > > > > The thing with THP is, that during fork(), we always allocate a backup PTE > > > > table, to be able to PTE-map the THP whenever we have to. Otherwise we'd > > > > have to eventually fail some operations we don't want to fail -- similar to > > > > the case where break_cow_pte() could fail now due to -ENOMEM although we > > > > really don't want to fail (e.g., change_pte_range() ). > > > > > > > > I always considered that wasteful, because in many scenarios, we'll never > > > > ever split a THP and possibly waste memory. > > > > > > > > Optimizing that for THP (e.g., don't always allocate backup THP, have some > > > > global allocation backup pool for splits + refill when close-to-empty) might > > > > provide similar fork() improvements, both in speed and memory consumption > > > > when it comes to anonymous memory. > > > > > > When collapsing huge pages, do/can they reuse those PTEs for backup? > > > So, we don't have to allocate the PTE or maintain the pool. > > > > It might not work for all pages, as collapsing pages might have had > > holes in the user page table, and there were no PTE tables. > > So if there have holes in the user page table, after we doing the > collapsing and then splitting. Do those holes be filled? Assume it is, > then, I think it's the reason why it's not work for all the pages. > > But, after those operations, Will the user get the additional and > unexpected memory (which is from the huge page filling)? Yes, more memory is going to be allocated for a process in such THP collapse case. This is similar to madvise huge pages, and touching the first byte may allocate 2M. Pasha
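The "touching the first byte may allocate 2M" behaviour is easy to observe from userspace. A hedged sketch (whether a THP is actually installed depends on the transparent_hugepage policy and on memory availability):

/* thp_touch.c -- build with: cc -O2 -o thp_touch thp_touch.c */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>

#define SZ_2M (2UL << 20)

static long resident_pages(void)
{
	long size = 0, resident = -1;
	FILE *f = fopen("/proc/self/statm", "r");

	if (f && fscanf(f, "%ld %ld", &size, &resident) != 2)
		resident = -1;
	if (f)
		fclose(f);
	return resident;
}

int main(void)
{
	/* Over-allocate so we can carve out a 2 MiB-aligned block. */
	char *raw = mmap(NULL, 2 * SZ_2M, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *p;

	if (raw == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	p = (char *)(((uintptr_t)raw + SZ_2M - 1) & ~(SZ_2M - 1));
	madvise(p, SZ_2M, MADV_HUGEPAGE);

	printf("resident before first touch: %ld pages\n", resident_pages());
	p[0] = 1;	/* one byte written, a whole 2 MiB THP may be faulted in */
	printf("resident after first touch:  %ld pages\n", resident_pages());
	return 0;
}

The same kind of effect applies to collapse: collapsing a range that had holes ends up backing those holes with memory, which is the "additional and unexpected memory" being asked about below.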
On Tue, Feb 14, 2023 at 01:52:16PM -0500, Pasha Tatashin wrote: > On Tue, Feb 14, 2023 at 1:42 PM Chih-En Lin <shiyn.lin@gmail.com> wrote: > > > > On Tue, Feb 14, 2023 at 11:30:26AM -0500, Pasha Tatashin wrote: > > > > > The thing with THP is, that during fork(), we always allocate a backup PTE > > > > > table, to be able to PTE-map the THP whenever we have to. Otherwise we'd > > > > > have to eventually fail some operations we don't want to fail -- similar to > > > > > the case where break_cow_pte() could fail now due to -ENOMEM although we > > > > > really don't want to fail (e.g., change_pte_range() ). > > > > > > > > > > I always considered that wasteful, because in many scenarios, we'll never > > > > > ever split a THP and possibly waste memory. > > > > > > > > > > Optimizing that for THP (e.g., don't always allocate backup THP, have some > > > > > global allocation backup pool for splits + refill when close-to-empty) might > > > > > provide similar fork() improvements, both in speed and memory consumption > > > > > when it comes to anonymous memory. > > > > > > > > When collapsing huge pages, do/can they reuse those PTEs for backup? > > > > So, we don't have to allocate the PTE or maintain the pool. > > > > > > It might not work for all pages, as collapsing pages might have had > > > holes in the user page table, and there were no PTE tables. > > > > So if there have holes in the user page table, after we doing the > > collapsing and then splitting. Do those holes be filled? Assume it is, > > then, I think it's the reason why it's not work for all the pages. > > > > But, after those operations, Will the user get the additional and > > unexpected memory (which is from the huge page filling)? > > Yes, more memory is going to be allocated for a process in such THP > collapse case. This is similar to madvise huge pages, and touching the > first byte may allocate 2M. Thanks for the explanation. Yeah, It seems like the reuse case can't work for all the pages. Thanks, Chih-En Lin
On Tue, Feb 14, 2023 at 4:58 AM David Hildenbrand <david@redhat.com> wrote: > > On 10.02.23 18:20, Chih-En Lin wrote: > > On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote: > >>>>> Currently, copy-on-write is only used for the mapped memory; the child > >>>>> process still needs to copy the entire page table from the parent > >>>>> process during forking. The parent process might take a lot of time and > >>>>> memory to copy the page table when the parent has a big page table > >>>>> allocated. For example, the memory usage of a process after forking with > >>>>> 1 GB mapped memory is as follows: > >>>> > >>>> For some reason, I was not able to reproduce performance improvements > >>>> with a simple fork() performance measurement program. The results that > >>>> I saw are the following: > >>>> > >>>> Base: > >>>> Fork latency per gigabyte: 0.004416 seconds > >>>> Fork latency per gigabyte: 0.004382 seconds > >>>> Fork latency per gigabyte: 0.004442 seconds > >>>> COW kernel: > >>>> Fork latency per gigabyte: 0.004524 seconds > >>>> Fork latency per gigabyte: 0.004764 seconds > >>>> Fork latency per gigabyte: 0.004547 seconds > >>>> > >>>> AMD EPYC 7B12 64-Core Processor > >>>> Base: > >>>> Fork latency per gigabyte: 0.003923 seconds > >>>> Fork latency per gigabyte: 0.003909 seconds > >>>> Fork latency per gigabyte: 0.003955 seconds > >>>> COW kernel: > >>>> Fork latency per gigabyte: 0.004221 seconds > >>>> Fork latency per gigabyte: 0.003882 seconds > >>>> Fork latency per gigabyte: 0.003854 seconds > >>>> > >>>> Given, that page table for child is not copied, I was expecting the > >>>> performance to be better with COW kernel, and also not to depend on > >>>> the size of the parent. > >>> > >>> Yes, the child won't duplicate the page table, but fork will still > >>> traverse all the page table entries to do the accounting. > >>> And, since this patch expends the COW to the PTE table level, it's not > >>> the mapped page (page table entry) grained anymore, so we have to > >>> guarantee that all the mapped page is available to do COW mapping in > >>> the such page table. > >>> This kind of checking also costs some time. > >>> As a result, since the accounting and the checking, the COW PTE fork > >>> still depends on the size of the parent so the improvement might not > >>> be significant. > >> > >> The current version of the series does not provide any performance > >> improvements for fork(). I would recommend removing claims from the > >> cover letter about better fork() performance, as this may be > >> misleading for those looking for a way to speed up forking. In my > > > > From v3 to v4, I changed the implementation of the COW fork() part to do > > the accounting and checking. At the time, I also removed most of the > > descriptions about the better fork() performance. Maybe it's not enough > > and still has some misleading. I will fix this in the next version. > > Thanks. > > > >> case, I was looking to speed up Redis OSS, which relies on fork() to > >> create consistent snapshots for driving replicates/backups. The O(N) > >> per-page operation causes fork() to be slow, so I was hoping that this > >> series, which does not duplicate the VA during fork(), would make the > >> operation much quicker. > > > > Indeed, at first, I tried to avoid the O(N) per-page operation by > > deferring the accounting and the swap stuff to the page fault. But, > > as I mentioned, it's not suitable for the mainline. 
> > > > Honestly, for improving the fork(), I have an idea to skip the per-page > > operation without breaking the logic. However, this will introduce the > > complicated mechanism and may has the overhead for other features. It > > might not be worth it. It's hard to strike a balance between the > > over-complicated mechanism with (probably) better performance and data > > consistency with the page status. So, I would focus on the safety and > > stable approach at first. > > Yes, it is most probably possible, but complexity, robustness and > maintainability have to be considered as well. > > Thanks for implementing this approach (only deduplication without other > optimizations) and evaluating it accordingly. It's certainly "cleaner", > such that we only have to mess with unsharing and not with other > accounting/pinning/mapcount thingies. But it also highlights how > intrusive even this basic deduplication approach already is -- and that > most benefits of the original approach requires even more complexity on top. > > I am not quite sure if the benefit is worth the price (I am not to > decide and I would like to hear other options). > > My quick thoughts after skimming over the core parts of this series > > (1) forgetting to break COW on a PTE in some pgtable walker feels quite > likely (meaning that it might be fairly error-prone) and forgetting > to break COW on a PTE table, accidentally modifying the shared > table. > (2) break_cow_pte() can fail, which means that we can fail some > operations (possibly silently halfway through) now. For example, > looking at your change_pte_range() change, I suspect it's wrong. > (3) handle_cow_pte_fault() looks quite complicated and needs quite some > double-checking: we temporarily clear the PMD, to reset it > afterwards. I am not sure if that is correct. For example, what > stops another page fault stumbling over that pmd_none() and > allocating an empty page table? Maybe there are some locking details > missing or they are very subtle such that we better document them. I > recall that THP played quite some tricks to make such cases work ... > > > > >>> Actually, at the RFC v1 and v2, we proposed the version of skipping > >>> those works, and we got a significant improvement. You can see the > >>> number from RFC v2 cover letter [1]: > >>> "In short, with 512 MB mapped memory, COW PTE decreases latency by 93% > >>> for normal fork" > >> > >> I suspect the 93% improvement (when the mapcount was not updated) was > >> only for VAs with 4K pages. With 2M mappings this series did not > >> provide any benefit is this correct? > > > > Yes. In this case, the COW PTE performance is similar to the normal > > fork(). > > > The thing with THP is, that during fork(), we always allocate a backup > PTE table, to be able to PTE-map the THP whenever we have to. Otherwise > we'd have to eventually fail some operations we don't want to fail -- > similar to the case where break_cow_pte() could fail now due to -ENOMEM > although we really don't want to fail (e.g., change_pte_range() ). > > I always considered that wasteful, because in many scenarios, we'll > never ever split a THP and possibly waste memory. Yes, it does sound wasteful for a pretty rare corner case that combines splitting THP in a process, and not having enough memory to allocate PTE page tables. 
> Optimizing that for THP (e.g., don't always allocate backup THP, have > some global allocation backup pool for splits + refill when > close-to-empty) might provide similar fork() improvements, both in speed > and memory consumption when it comes to anonymous memory. This sounds like a reasonable way to optimize the fork performance for processes with large RSS, which in most cases would have 2M THP mappings. When you say global pool, do you mean per machine, per cgroup, or per process? Pasha > > -- > Thanks, > > David / dhildenb > >
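For reference, below is a minimal sketch of the kind of fork() microbenchmark discussed above. The exact program Pasha used is not part of this thread, so the structure here is an assumption: map and fault in anonymous memory, time a single fork(), and report the latency per gigabyte.

/*
 * Hypothetical fork-latency microbenchmark (illustration only, not the
 * program referenced above): populate N GiB of anonymous memory in the
 * parent, time fork(), and print seconds per GiB.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
	size_t gigabytes = argc > 1 ? strtoul(argv[1], NULL, 0) : 1;
	size_t len = gigabytes << 30;
	struct timespec start, end;
	pid_t pid;

	char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Touch every page so the parent's page tables are fully populated. */
	for (size_t off = 0; off < len; off += 4096)
		mem[off] = 1;

	clock_gettime(CLOCK_MONOTONIC, &start);
	pid = fork();
	if (pid == 0)
		_exit(0);	/* child exits immediately */
	clock_gettime(CLOCK_MONOTONIC, &end);
	waitpid(pid, NULL, 0);

	double secs = (end.tv_sec - start.tv_sec) +
		      (end.tv_nsec - start.tv_nsec) / 1e9;
	printf("Fork latency per gigabyte: %f seconds\n", secs / gigabytes);
	return 0;
}

Note that a benchmark of this shape only exercises the fork() path itself; since the child exits without touching memory, it does not include any later break-COW page faults, which is why the per-entry accounting done during fork dominates the numbers quoted above.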
On 14.02.23 14:07, Pasha Tatashin wrote: > On Tue, Feb 14, 2023 at 4:58 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 10.02.23 18:20, Chih-En Lin wrote: >>> On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote: >>>>>>> Currently, copy-on-write is only used for the mapped memory; the child >>>>>>> process still needs to copy the entire page table from the parent >>>>>>> process during forking. The parent process might take a lot of time and >>>>>>> memory to copy the page table when the parent has a big page table >>>>>>> allocated. For example, the memory usage of a process after forking with >>>>>>> 1 GB mapped memory is as follows: >>>>>> >>>>>> For some reason, I was not able to reproduce performance improvements >>>>>> with a simple fork() performance measurement program. The results that >>>>>> I saw are the following: >>>>>> >>>>>> Base: >>>>>> Fork latency per gigabyte: 0.004416 seconds >>>>>> Fork latency per gigabyte: 0.004382 seconds >>>>>> Fork latency per gigabyte: 0.004442 seconds >>>>>> COW kernel: >>>>>> Fork latency per gigabyte: 0.004524 seconds >>>>>> Fork latency per gigabyte: 0.004764 seconds >>>>>> Fork latency per gigabyte: 0.004547 seconds >>>>>> >>>>>> AMD EPYC 7B12 64-Core Processor >>>>>> Base: >>>>>> Fork latency per gigabyte: 0.003923 seconds >>>>>> Fork latency per gigabyte: 0.003909 seconds >>>>>> Fork latency per gigabyte: 0.003955 seconds >>>>>> COW kernel: >>>>>> Fork latency per gigabyte: 0.004221 seconds >>>>>> Fork latency per gigabyte: 0.003882 seconds >>>>>> Fork latency per gigabyte: 0.003854 seconds >>>>>> >>>>>> Given, that page table for child is not copied, I was expecting the >>>>>> performance to be better with COW kernel, and also not to depend on >>>>>> the size of the parent. >>>>> >>>>> Yes, the child won't duplicate the page table, but fork will still >>>>> traverse all the page table entries to do the accounting. >>>>> And, since this patch expends the COW to the PTE table level, it's not >>>>> the mapped page (page table entry) grained anymore, so we have to >>>>> guarantee that all the mapped page is available to do COW mapping in >>>>> the such page table. >>>>> This kind of checking also costs some time. >>>>> As a result, since the accounting and the checking, the COW PTE fork >>>>> still depends on the size of the parent so the improvement might not >>>>> be significant. >>>> >>>> The current version of the series does not provide any performance >>>> improvements for fork(). I would recommend removing claims from the >>>> cover letter about better fork() performance, as this may be >>>> misleading for those looking for a way to speed up forking. In my >>> >>> From v3 to v4, I changed the implementation of the COW fork() part to do >>> the accounting and checking. At the time, I also removed most of the >>> descriptions about the better fork() performance. Maybe it's not enough >>> and still has some misleading. I will fix this in the next version. >>> Thanks. >>> >>>> case, I was looking to speed up Redis OSS, which relies on fork() to >>>> create consistent snapshots for driving replicates/backups. The O(N) >>>> per-page operation causes fork() to be slow, so I was hoping that this >>>> series, which does not duplicate the VA during fork(), would make the >>>> operation much quicker. >>> >>> Indeed, at first, I tried to avoid the O(N) per-page operation by >>> deferring the accounting and the swap stuff to the page fault. But, >>> as I mentioned, it's not suitable for the mainline. 
>>> >>> Honestly, for improving the fork(), I have an idea to skip the per-page >>> operation without breaking the logic. However, this will introduce the >>> complicated mechanism and may has the overhead for other features. It >>> might not be worth it. It's hard to strike a balance between the >>> over-complicated mechanism with (probably) better performance and data >>> consistency with the page status. So, I would focus on the safety and >>> stable approach at first. >> >> Yes, it is most probably possible, but complexity, robustness and >> maintainability have to be considered as well. >> >> Thanks for implementing this approach (only deduplication without other >> optimizations) and evaluating it accordingly. It's certainly "cleaner", >> such that we only have to mess with unsharing and not with other >> accounting/pinning/mapcount thingies. But it also highlights how >> intrusive even this basic deduplication approach already is -- and that >> most benefits of the original approach requires even more complexity on top. >> >> I am not quite sure if the benefit is worth the price (I am not to >> decide and I would like to hear other options). >> >> My quick thoughts after skimming over the core parts of this series >> >> (1) forgetting to break COW on a PTE in some pgtable walker feels quite >> likely (meaning that it might be fairly error-prone) and forgetting >> to break COW on a PTE table, accidentally modifying the shared >> table. >> (2) break_cow_pte() can fail, which means that we can fail some >> operations (possibly silently halfway through) now. For example, >> looking at your change_pte_range() change, I suspect it's wrong. >> (3) handle_cow_pte_fault() looks quite complicated and needs quite some >> double-checking: we temporarily clear the PMD, to reset it >> afterwards. I am not sure if that is correct. For example, what >> stops another page fault stumbling over that pmd_none() and >> allocating an empty page table? Maybe there are some locking details >> missing or they are very subtle such that we better document them. I >> recall that THP played quite some tricks to make such cases work ... >> >>> >>>>> Actually, at the RFC v1 and v2, we proposed the version of skipping >>>>> those works, and we got a significant improvement. You can see the >>>>> number from RFC v2 cover letter [1]: >>>>> "In short, with 512 MB mapped memory, COW PTE decreases latency by 93% >>>>> for normal fork" >>>> >>>> I suspect the 93% improvement (when the mapcount was not updated) was >>>> only for VAs with 4K pages. With 2M mappings this series did not >>>> provide any benefit is this correct? >>> >>> Yes. In this case, the COW PTE performance is similar to the normal >>> fork(). >> >> >> The thing with THP is, that during fork(), we always allocate a backup >> PTE table, to be able to PTE-map the THP whenever we have to. Otherwise >> we'd have to eventually fail some operations we don't want to fail -- >> similar to the case where break_cow_pte() could fail now due to -ENOMEM >> although we really don't want to fail (e.g., change_pte_range() ). >> >> I always considered that wasteful, because in many scenarios, we'll >> never ever split a THP and possibly waste memory. > > Yes, it does sound wasteful for a pretty rare corner case that > combines splitting THP in a process, and not having enough memory to > allocate PTE page tables. 
> >> Optimizing that for THP (e.g., don't always allocate backup THP, have >> some global allocation backup pool for splits + refill when >> close-to-empty) might provide similar fork() improvements, both in speed >> and memory consumption when it comes to anonymous memory. > > This sounds like a reasonable way to optimize the fork performance for > processes with large RSS, which in most cases would have 2M THP > mappings. When you say global pool, do you mean per machine, per > cgroup, or per process? Good question. I recall that the problem is that we sometimes need a new pgtable when splitting a THP, but (a) we might be under spinlock and cannot sleep. We need an atomic allocation that might fail. For this, a pool might be helpful. (b) we might actually be out of memory. My gut feeling is that a global pool would be sufficient, only to be used when we run into (a) or (b) to be able to make progress in these rare cases. Something that would be interesting to evaluate is which THP split operations might require some way to recover when really OOM. For example, __split_huge_pmd() is currently not able to report a failure. I assume that we could sleep in there. And if we're not able to allocate any memory in there (with sleeping), maybe the process should be zapped either way by the OOM killer. -- Thanks, David / dhildenb
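Below is a rough userspace model of the reserve-pool idea sketched above; all names and sizes are invented for illustration, and this is not kernel code. The idea: keep a small number of preallocated page-table-sized buffers, dip into the reserve when an allocation must not sleep (or fails outright), and refill the reserve from a context that is allowed to sleep.

/*
 * Illustrative model of a "global backup pool" for page-table pages.
 * Hypothetical names throughout; the real design question (global vs.
 * per-node vs. cgroup-aware) is exactly what is being discussed above.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define PTE_TABLE_SIZE 4096
#define POOL_TARGET    16	/* keep this many spare tables in reserve */

static void *pool[POOL_TARGET];
static int pool_count;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Refill the reserve up to POOL_TARGET; call only where sleeping is allowed. */
static void pool_refill(void)
{
	pthread_mutex_lock(&pool_lock);
	while (pool_count < POOL_TARGET) {
		void *p = malloc(PTE_TABLE_SIZE);
		if (!p)
			break;
		pool[pool_count++] = p;
	}
	pthread_mutex_unlock(&pool_lock);
}

/*
 * Allocate a backup table. @can_sleep models GFP_KERNEL vs. GFP_ATOMIC:
 * in "atomic" contexts we go straight to the reserve.
 */
static void *backup_table_alloc(bool can_sleep)
{
	void *p = can_sleep ? malloc(PTE_TABLE_SIZE) : NULL;

	if (!p) {
		pthread_mutex_lock(&pool_lock);
		if (pool_count > 0)
			p = pool[--pool_count];
		pthread_mutex_unlock(&pool_lock);
	}
	return p;	/* may still be NULL: the caller must be able to fail */
}

int main(void)
{
	pool_refill();
	void *t = backup_table_alloc(false);	/* "atomic" path uses the pool */
	printf("got table from pool: %p, %d spares left\n", t, pool_count);
	free(t);
	pool_refill();				/* top up once it is safe to sleep */
	return 0;
}

The kernel's existing mempool_t implements a similar reserve-and-refill pattern, so a real implementation would likely build on that rather than open-coding a freelist; the granularity question (per machine, per cgroup, per process) raised above remains open either way.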
On Sat, Feb 11, 2023 at 01:20:10AM +0800, Chih-En Lin wrote: > On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote: > > > > > Currently, copy-on-write is only used for the mapped memory; the child > > > > > process still needs to copy the entire page table from the parent > > > > > process during forking. The parent process might take a lot of time and > > > > > memory to copy the page table when the parent has a big page table > > > > > allocated. For example, the memory usage of a process after forking with > > > > > 1 GB mapped memory is as follows: > > > > > > > > For some reason, I was not able to reproduce performance improvements > > > > with a simple fork() performance measurement program. The results that > > > > I saw are the following: > > > > > > > > Base: > > > > Fork latency per gigabyte: 0.004416 seconds > > > > Fork latency per gigabyte: 0.004382 seconds > > > > Fork latency per gigabyte: 0.004442 seconds > > > > COW kernel: > > > > Fork latency per gigabyte: 0.004524 seconds > > > > Fork latency per gigabyte: 0.004764 seconds > > > > Fork latency per gigabyte: 0.004547 seconds > > > > > > > > AMD EPYC 7B12 64-Core Processor > > > > Base: > > > > Fork latency per gigabyte: 0.003923 seconds > > > > Fork latency per gigabyte: 0.003909 seconds > > > > Fork latency per gigabyte: 0.003955 seconds > > > > COW kernel: > > > > Fork latency per gigabyte: 0.004221 seconds > > > > Fork latency per gigabyte: 0.003882 seconds > > > > Fork latency per gigabyte: 0.003854 seconds > > > > > > > > Given, that page table for child is not copied, I was expecting the > > > > performance to be better with COW kernel, and also not to depend on > > > > the size of the parent. > > > > > > Yes, the child won't duplicate the page table, but fork will still > > > traverse all the page table entries to do the accounting. > > > And, since this patch expends the COW to the PTE table level, it's not > > > the mapped page (page table entry) grained anymore, so we have to > > > guarantee that all the mapped page is available to do COW mapping in > > > the such page table. > > > This kind of checking also costs some time. > > > As a result, since the accounting and the checking, the COW PTE fork > > > still depends on the size of the parent so the improvement might not > > > be significant. > > > > The current version of the series does not provide any performance > > improvements for fork(). I would recommend removing claims from the > > cover letter about better fork() performance, as this may be > > misleading for those looking for a way to speed up forking. In my > > From v3 to v4, I changed the implementation of the COW fork() part to do Sorry, it's "RFC v2 to v3". > the accounting and checking. At the time, I also removed most of the > descriptions about the better fork() performance. Maybe it's not enough > and still has some misleading. I will fix this in the next version. > Thanks. Thanks, Chih-En Lin