[PATCH v2 6/7] x86: mm: free page table pages by RCU instead of semi RCU

Qi Zheng posted 7 patches 3 weeks, 3 days ago
There is a newer version of this series
Posted by Qi Zheng 3 weeks, 3 days ago
Currently, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, page table pages
are freed by semi RCU, that is:

 - batch table freeing: asynchronous free by RCU
 - single table freeing: IPI + synchronous free

In this way, page tables can be traversed locklessly by disabling IRQs in
paths such as fast GUP. But this is not enough for freeing empty PTE page
table pages in paths other than munmap and exit_mmap, because the IPI
cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}().

In preparation for supporting reclamation of empty PTE page table pages,
let single tables also be freed by RCU like batch table freeing. Then
pte_offset_map() etc. can also be used to prevent a PTE page from being
freed.
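
As a rough illustration of the consumer side (a hypothetical sketch, not
part of this patch), a lockless walker relies on the RCU read-side section
that pte_offset_map() enters to keep the PTE page alive; handle_pte() is a
made-up helper:

	pte_t *pte = pte_offset_map(pmd, addr);	/* implies rcu_read_lock() */

	if (pte) {
		pte_t entry = ptep_get(pte);

		/* the PTE page cannot be RCU-freed while it is mapped here */
		handle_pte(entry);		/* hypothetical helper */
		pte_unmap(pte);			/* implies rcu_read_unlock() */
	}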

Like pte_free_defer(), we can also safely use ptdesc->pt_rcu_head to free
the page table pages (a rough sketch of the relevant ptdesc layout follows
the list below):

 - The pt_rcu_head is unioned with pt_list and pmd_huge_pte.

 - For pt_list, it is used to manage PGD pages on x86. Fortunately,
   tlb_remove_table() will not be used to free PGD pages, so it is safe
   to use pt_rcu_head.

 - For pmd_huge_pte, we will do zap_deposited_table() before freeing the
   PMD page, so it is also safe.
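
For reference, the relevant part of struct ptdesc looks roughly like this
(paraphrased from include/linux/mm_types.h; check the current tree for the
exact layout):

	struct ptdesc {
		unsigned long __page_flags;

		union {
			struct rcu_head pt_rcu_head;	/* reused here for the RCU free */
			struct list_head pt_list;	/* x86: only used for PGD pages */
			struct {
				unsigned long _pt_pad_1;
				pgtable_t pmd_huge_pte;	/* zapped before the PMD page is freed */
			};
		};
		/* remaining fields omitted */
	};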

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 arch/x86/include/asm/tlb.h | 19 +++++++++++++++++++
 arch/x86/kernel/paravirt.c |  7 +++++++
 arch/x86/mm/pgtable.c      | 10 +++++++++-
 mm/mmu_gather.c            |  9 ++++++++-
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 580636cdc257b..e223b53a8b190 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -34,4 +34,23 @@ static inline void __tlb_remove_table(void *table)
 	free_page_and_swap_cache(table);
 }
 
+#ifdef CONFIG_PT_RECLAIM
+static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	free_page_and_swap_cache(page);
+}
+
+static inline void __tlb_remove_table_one(void *table)
+{
+	struct page *page;
+
+	page = table;
+	call_rcu(&page->rcu_head, __tlb_remove_table_one_rcu);
+}
+#define __tlb_remove_table_one __tlb_remove_table_one
+#endif /* CONFIG_PT_RECLAIM */
+
 #endif /* _ASM_X86_TLB_H */
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index fec3815335558..89688921ea62e 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -59,10 +59,17 @@ void __init native_pv_lock_init(void)
 		static_branch_enable(&virt_spin_lock_key);
 }
 
+#ifndef CONFIG_PT_RECLAIM
 static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
 {
 	tlb_remove_page(tlb, table);
 }
+#else
+static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+	tlb_remove_table(tlb, table);
+}
+#endif
 
 struct static_key paravirt_steal_enabled;
 struct static_key paravirt_steal_rq_enabled;
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 5745a354a241c..69a357b15974a 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -19,12 +19,20 @@ EXPORT_SYMBOL(physical_mask);
 #endif
 
 #ifndef CONFIG_PARAVIRT
+#ifndef CONFIG_PT_RECLAIM
 static inline
 void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
 {
 	tlb_remove_page(tlb, table);
 }
-#endif
+#else
+static inline
+void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+	tlb_remove_table(tlb, table);
+}
+#endif /* !CONFIG_PT_RECLAIM */
+#endif /* !CONFIG_PARAVIRT */
 
 gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;
 
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99b3e9408aa0f..d948479ca09e6 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -311,10 +311,17 @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb)
 	}
 }
 
+#ifndef __tlb_remove_table_one
+static inline void __tlb_remove_table_one(void *table)
+{
+	__tlb_remove_table(table);
+}
+#endif
+
 static void tlb_remove_table_one(void *table)
 {
 	tlb_remove_table_sync_one();
-	__tlb_remove_table(table);
+	__tlb_remove_table_one(table);
 }
 
 static void tlb_table_flush(struct mmu_gather *tlb)
-- 
2.20.1
Re: [PATCH v2 6/7] x86: mm: free page table pages by RCU instead of semi RCU
Posted by Jann Horn 2 weeks, 3 days ago
+x86 MM maintainers - x86@kernel.org was already cc'ed, but I don't
know if that is enough for them to see it, and I haven't seen them
comment on this series yet; I think you need an ack from them for this
change.

On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table pages
> will be freed by semi RCU, that is:
>
>  - batch table freeing: asynchronous free by RCU
>  - single table freeing: IPI + synchronous free
>
> In this way, the page table can be lockless traversed by disabling IRQ in
> paths such as fast GUP. But this is not enough to free the empty PTE page
> table pages in paths other that munmap and exit_mmap path, because IPI
> cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}().
>
> In preparation for supporting empty PTE page table pages reclaimation,
> let single table also be freed by RCU like batch table freeing. Then we
> can also use pte_offset_map() etc to prevent PTE page from being freed.

I applied your series locally and followed the page table freeing path
that the reclaim feature would use on x86-64. Looks like it goes like
this with the series applied:

free_pte
  pte_free_tlb
    __pte_free_tlb
      ___pte_free_tlb
        paravirt_tlb_remove_table
          tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM]
            [no-free-memory slowpath:]
              tlb_table_invalidate
              tlb_remove_table_one
                tlb_remove_table_sync_one [does IPI for GUP-fast]
                __tlb_remove_table_one [frees via RCU]
            [fastpath:]
              tlb_table_flush
                tlb_remove_table_free [frees via RCU]
          native_tlb_remove_table [CONFIG_PARAVIRT on native]
            tlb_remove_table [see above]

Basically, the only remaining case in which
paravirt_tlb_remove_table() does not use tlb_remove_table() with RCU
delay is !CONFIG_PARAVIRT && !CONFIG_PT_RECLAIM. Given that
CONFIG_PT_RECLAIM is defined as "default y" when supported, I guess
that means X86's direct page table freeing path will almost never be
used? If it stays that way and the X86 folks don't see a performance
impact from using RCU to free tables on munmap() / process exit, I
guess we might want to get rid of the direct page table freeing path
on x86 at some point to simplify things...

(That simplification might also help prepare for Intel Remote Action
Request, if that is a thing people want...)

> Like pte_free_defer(), we can also safely use ptdesc->pt_rcu_head to free
> the page table pages:
>
>  - The pt_rcu_head is unioned with pt_list and pmd_huge_pte.
>
>  - For pt_list, it is used to manage the PGD page in x86. Fortunately
>    tlb_remove_table() will not be used for free PGD pages, so it is safe
>    to use pt_rcu_head.
>
>  - For pmd_huge_pte, we will do zap_deposited_table() before freeing the
>    PMD page, so it is also safe.

Please also update the comments on "struct ptdesc" accordingly.

> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  arch/x86/include/asm/tlb.h | 19 +++++++++++++++++++
>  arch/x86/kernel/paravirt.c |  7 +++++++
>  arch/x86/mm/pgtable.c      | 10 +++++++++-
>  mm/mmu_gather.c            |  9 ++++++++-
>  4 files changed, 43 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
> index 580636cdc257b..e223b53a8b190 100644
> --- a/arch/x86/include/asm/tlb.h
> +++ b/arch/x86/include/asm/tlb.h
> @@ -34,4 +34,23 @@ static inline void __tlb_remove_table(void *table)
>         free_page_and_swap_cache(table);
>  }
>
> +#ifdef CONFIG_PT_RECLAIM
> +static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
> +{
> +       struct page *page;
> +
> +       page = container_of(head, struct page, rcu_head);
> +       free_page_and_swap_cache(page);
> +}

Why free_page_and_swap_cache()? Page tables shouldn't have swap cache,
so I think something like put_page() would do the job.
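
For illustration, the RCU callback with that change might look something
like this (just a sketch of the suggestion, not a tested patch):

	static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
	{
		struct page *page = container_of(head, struct page, rcu_head);

		/* page tables have no swap cache, dropping the reference is enough */
		put_page(page);
	}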

> +static inline void __tlb_remove_table_one(void *table)
> +{
> +       struct page *page;
> +
> +       page = table;
> +       call_rcu(&page->rcu_head, __tlb_remove_table_one_rcu);
> +}
> +#define __tlb_remove_table_one __tlb_remove_table_one
> +#endif /* CONFIG_PT_RECLAIM */
> +
>  #endif /* _ASM_X86_TLB_H */
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index fec3815335558..89688921ea62e 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -59,10 +59,17 @@ void __init native_pv_lock_init(void)
>                 static_branch_enable(&virt_spin_lock_key);
>  }
>
> +#ifndef CONFIG_PT_RECLAIM
>  static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
>  {
>         tlb_remove_page(tlb, table);
>  }
> +#else
> +static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
> +{
> +       tlb_remove_table(tlb, table);
> +}
> +#endif
>
>  struct static_key paravirt_steal_enabled;
>  struct static_key paravirt_steal_rq_enabled;
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 5745a354a241c..69a357b15974a 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -19,12 +19,20 @@ EXPORT_SYMBOL(physical_mask);
>  #endif
>
>  #ifndef CONFIG_PARAVIRT
> +#ifndef CONFIG_PT_RECLAIM
>  static inline
>  void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
>  {
>         tlb_remove_page(tlb, table);
>  }
> -#endif
> +#else
> +static inline
> +void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
> +{
> +       tlb_remove_table(tlb, table);
> +}
> +#endif /* !CONFIG_PT_RECLAIM */
> +#endif /* !CONFIG_PARAVIRT */
>
>  gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;
>
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 99b3e9408aa0f..d948479ca09e6 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -311,10 +311,17 @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb)
>         }
>  }
>
> +#ifndef __tlb_remove_table_one
> +static inline void __tlb_remove_table_one(void *table)
> +{
> +       __tlb_remove_table(table);
> +}
> +#endif
> +
>  static void tlb_remove_table_one(void *table)
>  {
>         tlb_remove_table_sync_one();
> -       __tlb_remove_table(table);
> +       __tlb_remove_table_one(table);
>  }
>
>  static void tlb_table_flush(struct mmu_gather *tlb)
> --
> 2.20.1
>
Re: [PATCH v2 6/7] x86: mm: free page table pages by RCU instead of semi RCU
Posted by Qi Zheng 2 weeks, 2 days ago
Hi Jann,

On 2024/11/8 06:39, Jann Horn wrote:
> +x86 MM maintainers - x86@kernel.org was already cc'ed, but I don't
> know if that is enough for them to see it, and I haven't seen them
> comment on this series yet; I think you need an ack from them for this
> change.

Yes, thanks to Jann for cc'ing the x86 MM maintainers; I look forward to
their feedback!

> 
> On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>> Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table pages
>> will be freed by semi RCU, that is:
>>
>>   - batch table freeing: asynchronous free by RCU
>>   - single table freeing: IPI + synchronous free
>>
>> In this way, the page table can be lockless traversed by disabling IRQ in
>> paths such as fast GUP. But this is not enough to free the empty PTE page
>> table pages in paths other that munmap and exit_mmap path, because IPI
>> cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}().
>>
>> In preparation for supporting empty PTE page table pages reclaimation,
>> let single table also be freed by RCU like batch table freeing. Then we
>> can also use pte_offset_map() etc to prevent PTE page from being freed.
> 
> I applied your series locally and followed the page table freeing path
> that the reclaim feature would use on x86-64. Looks like it goes like
> this with the series applied:

Yes.

> 
> free_pte
>    pte_free_tlb
>      __pte_free_tlb
>        ___pte_free_tlb
>          paravirt_tlb_remove_table
>            tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM]
>              [no-free-memory slowpath:]
>                tlb_table_invalidate
>                tlb_remove_table_one
>                  tlb_remove_table_sync_one [does IPI for GUP-fast]

		   ^
		   It seems that this step can be omitted when
		   CONFIG_PT_RECLAIM is enabled, because GUP-fast
		   disables IRQs, which can also serve as the RCU
		   critical section.

>                  __tlb_remove_table_one [frees via RCU]
>              [fastpath:]
>                tlb_table_flush
>                  tlb_remove_table_free [frees via RCU]
>            native_tlb_remove_table [CONFIG_PARAVIRT on native]
>              tlb_remove_table [see above]
> 
> Basically, the only remaining case in which
> paravirt_tlb_remove_table() does not use tlb_remove_table() with RCU
> delay is !CONFIG_PARAVIRT && !CONFIG_PT_RECLAIM. Given that
> CONFIG_PT_RECLAIM is defined as "default y" when supported, I guess
> that means X86's direct page table freeing path will almost never be
> used? If it stays that way and the X86 folks don't see a performance
> impact from using RCU to free tables on munmap() / process exit, I
> guess we might want to get rid of the direct page table freeing path
> on x86 at some point to simplify things...

In theory, using RCU to asynchronously free PTE pages should make the
munmap() / process-exit path faster. I can try to grab some data.

> 
> (That simplification might also help prepare for Intel Remote Action
> Request, if that is a thing people want...)

If so, even better!

> 
>> Like pte_free_defer(), we can also safely use ptdesc->pt_rcu_head to free
>> the page table pages:
>>
>>   - The pt_rcu_head is unioned with pt_list and pmd_huge_pte.
>>
>>   - For pt_list, it is used to manage the PGD page in x86. Fortunately
>>     tlb_remove_table() will not be used for free PGD pages, so it is safe
>>     to use pt_rcu_head.
>>
>>   - For pmd_huge_pte, we will do zap_deposited_table() before freeing the
>>     PMD page, so it is also safe.
> 
> Please also update the comments on "struct ptdesc" accordingly.

OK, will do.

> 
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   arch/x86/include/asm/tlb.h | 19 +++++++++++++++++++
>>   arch/x86/kernel/paravirt.c |  7 +++++++
>>   arch/x86/mm/pgtable.c      | 10 +++++++++-
>>   mm/mmu_gather.c            |  9 ++++++++-
>>   4 files changed, 43 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
>> index 580636cdc257b..e223b53a8b190 100644
>> --- a/arch/x86/include/asm/tlb.h
>> +++ b/arch/x86/include/asm/tlb.h
>> @@ -34,4 +34,23 @@ static inline void __tlb_remove_table(void *table)
>>          free_page_and_swap_cache(table);
>>   }
>>
>> +#ifdef CONFIG_PT_RECLAIM
>> +static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
>> +{
>> +       struct page *page;
>> +
>> +       page = container_of(head, struct page, rcu_head);
>> +       free_page_and_swap_cache(page);
>> +}
> 
> Why free_page_and_swap_cache()? Page tables shouldn't have swap cache,
> so I think something like put_page() would do the job.

Ah, I just did the same thing as __tlb_remove_table(). But I have the
same doubt as you: why does __tlb_remove_table() need to call
free_page_and_swap_cache() instead of put_page()?

Thanks,
Qi

> 
>> +static inline void __tlb_remove_table_one(void *table)
>> +{
>> +       struct page *page;
>> +
>> +       page = table;
>> +       call_rcu(&page->rcu_head, __tlb_remove_table_one_rcu);
>> +}
>> +#define __tlb_remove_table_one __tlb_remove_table_one
>> +#endif /* CONFIG_PT_RECLAIM */
>> +
>>   #endif /* _ASM_X86_TLB_H */
>> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
>> index fec3815335558..89688921ea62e 100644
>> --- a/arch/x86/kernel/paravirt.c
>> +++ b/arch/x86/kernel/paravirt.c
>> @@ -59,10 +59,17 @@ void __init native_pv_lock_init(void)
>>                  static_branch_enable(&virt_spin_lock_key);
>>   }
>>
>> +#ifndef CONFIG_PT_RECLAIM
>>   static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
>>   {
>>          tlb_remove_page(tlb, table);
>>   }
>> +#else
>> +static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
>> +{
>> +       tlb_remove_table(tlb, table);
>> +}
>> +#endif
>>
>>   struct static_key paravirt_steal_enabled;
>>   struct static_key paravirt_steal_rq_enabled;
>> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
>> index 5745a354a241c..69a357b15974a 100644
>> --- a/arch/x86/mm/pgtable.c
>> +++ b/arch/x86/mm/pgtable.c
>> @@ -19,12 +19,20 @@ EXPORT_SYMBOL(physical_mask);
>>   #endif
>>
>>   #ifndef CONFIG_PARAVIRT
>> +#ifndef CONFIG_PT_RECLAIM
>>   static inline
>>   void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
>>   {
>>          tlb_remove_page(tlb, table);
>>   }
>> -#endif
>> +#else
>> +static inline
>> +void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
>> +{
>> +       tlb_remove_table(tlb, table);
>> +}
>> +#endif /* !CONFIG_PT_RECLAIM */
>> +#endif /* !CONFIG_PARAVIRT */
>>
>>   gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;
>>
>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>> index 99b3e9408aa0f..d948479ca09e6 100644
>> --- a/mm/mmu_gather.c
>> +++ b/mm/mmu_gather.c
>> @@ -311,10 +311,17 @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb)
>>          }
>>   }
>>
>> +#ifndef __tlb_remove_table_one
>> +static inline void __tlb_remove_table_one(void *table)
>> +{
>> +       __tlb_remove_table(table);
>> +}
>> +#endif
>> +
>>   static void tlb_remove_table_one(void *table)
>>   {
>>          tlb_remove_table_sync_one();
>> -       __tlb_remove_table(table);
>> +       __tlb_remove_table_one(table);
>>   }
>>
>>   static void tlb_table_flush(struct mmu_gather *tlb)
>> --
>> 2.20.1
>>
Re: [PATCH v2 6/7] x86: mm: free page table pages by RCU instead of semi RCU
Posted by Qi Zheng 1 week, 4 days ago

On 2024/11/8 15:38, Qi Zheng wrote:
> Hi Jann,
> 
> On 2024/11/8 06:39, Jann Horn wrote:
>> +x86 MM maintainers - x86@kernel.org was already cc'ed, but I don't
>> know if that is enough for them to see it, and I haven't seen them
>> comment on this series yet; I think you need an ack from them for this
>> change.
> 
> Yes, thanks to Jann for cc-ing x86 MM maintainers, and look forward to
> their feedback!
> 
>>
>> On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> 
>> wrote:
>>> Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table 
>>> pages
>>> will be freed by semi RCU, that is:
>>>
>>>   - batch table freeing: asynchronous free by RCU
>>>   - single table freeing: IPI + synchronous free
>>>
>>> In this way, the page table can be lockless traversed by disabling 
>>> IRQ in
>>> paths such as fast GUP. But this is not enough to free the empty PTE 
>>> page
>>> table pages in paths other that munmap and exit_mmap path, because IPI
>>> cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}().
>>>
>>> In preparation for supporting empty PTE page table pages reclaimation,
>>> let single table also be freed by RCU like batch table freeing. Then we
>>> can also use pte_offset_map() etc to prevent PTE page from being freed.
>>
>> I applied your series locally and followed the page table freeing path
>> that the reclaim feature would use on x86-64. Looks like it goes like
>> this with the series applied:
> 
> Yes.
> 
>>
>> free_pte
>>    pte_free_tlb
>>      __pte_free_tlb
>>        ___pte_free_tlb
>>          paravirt_tlb_remove_table
>>            tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM]
>>              [no-free-memory slowpath:]
>>                tlb_table_invalidate
>>                tlb_remove_table_one
>>                  tlb_remove_table_sync_one [does IPI for GUP-fast]
> 
>             ^
>             It seems that this step can be ommitted when
>             CONFIG_PT_RECLAIM is enabled, because GUP-fast will
>             disable IRQ, which can also serve as the RCU critical
>             section.
> 
>>                  __tlb_remove_table_one [frees via RCU]
>>              [fastpath:]
>>                tlb_table_flush
>>                  tlb_remove_table_free [frees via RCU]
>>            native_tlb_remove_table [CONFIG_PARAVIRT on native]
>>              tlb_remove_table [see above]
>>
>> Basically, the only remaining case in which
>> paravirt_tlb_remove_table() does not use tlb_remove_table() with RCU
>> delay is !CONFIG_PARAVIRT && !CONFIG_PT_RECLAIM. Given that
>> CONFIG_PT_RECLAIM is defined as "default y" when supported, I guess
>> that means X86's direct page table freeing path will almost never be
>> used? If it stays that way and the X86 folks don't see a performance
>> impact from using RCU to free tables on munmap() / process exit, I
>> guess we might want to get rid of the direct page table freeing path
>> on x86 at some point to simplify things...
> 
> In theory, using RCU to asynchronously free PTE pages should make
> munmap() / process exit path faster. I can try to grab some data.
> 

I ran 'stress-ng --mmap 1 --mmap-bytes 1G' and grabbed the data with
bpftrace like this:

bpftrace -e 'tracepoint:syscalls:sys_enter_munmap /comm == "stress-ng"/
    { @start[tid] = nsecs; }
  tracepoint:syscalls:sys_exit_munmap /@start[tid]/
    { @ns[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }
  interval:s:1 { exit(); }'

The results are as follows:

without patch:

@ns[stress-ng]:
[1K, 2K)           99566 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)           77756 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@            |
[4K, 8K)           32545 |@@@@@@@@@@@@@@@@                                    |
[8K, 16K)            442 |                                                    |
[16K, 32K)            69 |                                                    |
[32K, 64K)             1 |                                                    |
[64K, 128K)            1 |                                                    |
[128K, 256K)          14 |                                                    |
[256K, 512K)          14 |                                                    |
[512K, 1M)            68 |                                                    |

with patch:

@ns[stress-ng]:
[512, 1K)             69 |                                                    |
[1K, 2K)           53921 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)           47088 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[4K, 8K)           20583 |@@@@@@@@@@@@@@@@@@@                                 |
[8K, 16K)            659 |                                                    |
[16K, 32K)            93 |                                                    |
[32K, 64K)            24 |                                                    |
[64K, 128K)           14 |                                                    |
[128K, 256K)           6 |                                                    |
[256K, 512K)          10 |                                                    |
[512K, 1M)            29 |                                                    |

It doesn't seem to have much effect on munmap.

Thanks,
Qi

Re: [PATCH v2 6/7] x86: mm: free page table pages by RCU instead of semi RCU
Posted by Jann Horn 2 weeks, 2 days ago
On Fri, Nov 8, 2024 at 8:38 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> On 2024/11/8 06:39, Jann Horn wrote:
> > On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> > free_pte
> >    pte_free_tlb
> >      __pte_free_tlb
> >        ___pte_free_tlb
> >          paravirt_tlb_remove_table
> >            tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM]
> >              [no-free-memory slowpath:]
> >                tlb_table_invalidate
> >                tlb_remove_table_one
> >                  tlb_remove_table_sync_one [does IPI for GUP-fast]
>
>                    ^
>                    It seems that this step can be ommitted when
>                    CONFIG_PT_RECLAIM is enabled, because GUP-fast will
>                    disable IRQ, which can also serve as the RCU critical
>                    section.

Yeah, I think so too.

> >> +#ifdef CONFIG_PT_RECLAIM
> >> +static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
> >> +{
> >> +       struct page *page;
> >> +
> >> +       page = container_of(head, struct page, rcu_head);
> >> +       free_page_and_swap_cache(page);
> >> +}
> >
> > Why free_page_and_swap_cache()? Page tables shouldn't have swap cache,
> > so I think something like put_page() would do the job.
>
> Ah, I just did the same thing as __tlb_remove_table(). But I also
> have the same doubt as you, why does __tlb_remove_table() need to
> call free_page_and_swap_cache() instead of put_page().

I think commit 9e52fc2b50de3a1c08b44f94c610fbe998c0031a probably just
copy-pasted it from a more generic page freeing path...
Re: [PATCH v2 6/7] x86: mm: free page table pages by RCU instead of semi RCU
Posted by Qi Zheng 2 weeks, 1 day ago

On 2024/11/9 04:09, Jann Horn wrote:
> On Fri, Nov 8, 2024 at 8:38 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>> On 2024/11/8 06:39, Jann Horn wrote:
>>> On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>> free_pte
>>>     pte_free_tlb
>>>       __pte_free_tlb
>>>         ___pte_free_tlb
>>>           paravirt_tlb_remove_table
>>>             tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM]
>>>               [no-free-memory slowpath:]
>>>                 tlb_table_invalidate
>>>                 tlb_remove_table_one
>>>                   tlb_remove_table_sync_one [does IPI for GUP-fast]
>>
>>                     ^
>>                     It seems that this step can be ommitted when
>>                     CONFIG_PT_RECLAIM is enabled, because GUP-fast will
>>                     disable IRQ, which can also serve as the RCU critical
>>                     section.
> 
> Yeah, I think so too.

Will remove this step in the next version.
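
A rough sketch of what that could look like (hypothetical, not part of
this series; the !CONFIG_PT_RECLAIM case must keep the IPI because it
still frees the table directly):

	static void tlb_remove_table_one(void *table)
	{
		/* with CONFIG_PT_RECLAIM the table is freed via RCU, so no IPI is needed */
		if (!IS_ENABLED(CONFIG_PT_RECLAIM))
			tlb_remove_table_sync_one();
		__tlb_remove_table_one(table);
	}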

> 
>>>> +#ifdef CONFIG_PT_RECLAIM
>>>> +static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
>>>> +{
>>>> +       struct page *page;
>>>> +
>>>> +       page = container_of(head, struct page, rcu_head);
>>>> +       free_page_and_swap_cache(page);
>>>> +}
>>>
>>> Why free_page_and_swap_cache()? Page tables shouldn't have swap cache,
>>> so I think something like put_page() would do the job.
>>
>> Ah, I just did the same thing as __tlb_remove_table(). But I also
>> have the same doubt as you, why does __tlb_remove_table() need to
>> call free_page_and_swap_cache() instead of put_page().
> 
> I think commit 9e52fc2b50de3a1c08b44f94c610fbe998c0031a probably just
> copy-pasted it from a more generic page freeing path...

I guess so. Will use put_page() instead of free_page_and_swap_cache()
in the next version. But for __tlb_remove_table(), I'd prefer to change
that in a separate patch.

Thanks!