The SWAP_MAP_SHMEM state was originally introduced in the commit
aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
swap entry belongs to shmem during swapoff.

However, swapoff has since been rewritten drastically in the commit
b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now having
swap count == SWAP_MAP_SHMEM is basically the same as having
swap count == 1, and swap_shmem_alloc() behaves analogously to
swap_duplicate().

This RFC proposes the removal of this state and the associated helper to
simplify the state machine (both mentally and code-wise). We will also
have an extra state/special value that can be repurposed (for swap entries
that never get re-duplicated).

Another motivation (albeit a bit premature at the moment) is the new swap
abstraction I am currently working on, that would allow for swap/zswap
decoupling, swapoff optimization, etc. The fewer states and swap API
functions there are, the simpler the conversion will be.

I am sending this series first as an RFC, just in case I missed something
or misunderstood this state, or if someone has a swap optimization in mind
for shmem that would require this special state.

Swap experts, let me know if I'm mistaken :) Otherwise if there is no
objection I will resend this patch series again for merging.

Nhat Pham (2):
  swapfile: add a batched variant for swap_duplicate()
  swap: shmem: remove SWAP_MAP_SHMEM

 include/linux/swap.h | 16 ++++++++--------
 mm/shmem.c           |  2 +-
 mm/swapfile.c        | 28 +++++++++-------------------
 3 files changed, 18 insertions(+), 28 deletions(-)


base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
--
2.43.5
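For context, the two helpers being compared look roughly like this before the series. This is a simplified sketch paraphrased from mm/swapfile.c; the exact signatures vary across kernel versions (recent kernels pass an extra nr argument for mTHP batching), so treat it as illustrative rather than the exact upstream code:

```c
/* Simplified sketch -- not the exact upstream code. */
void swap_shmem_alloc(swp_entry_t entry)
{
	/* shmem marks its single reference with the special value... */
	__swap_duplicate(entry, SWAP_MAP_SHMEM);
}

int swap_duplicate(swp_entry_t entry)
{
	int err = 0;

	/*
	 * ...while other users bump the count by one, spilling into a
	 * continuation page once the count would exceed SWAP_MAP_MAX.
	 */
	while (!err && __swap_duplicate(entry, 1) == -ENOMEM)
		err = add_swap_count_continuation(entry, GFP_ATOMIC);
	return err;
}
```

Since swapoff no longer special-cases SWAP_MAP_SHMEM, the two paths are effectively equivalent for shmem's single reference, which is the redundancy the series removes.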
Hi Nhat, On Mon, Sep 23, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote: > > The SWAP_MAP_SHMEM state was originally introduced in the commit > aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a > swap entry belongs to shmem during swapoff. > > However, swapoff has since been rewritten drastically in the commit > b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now > having swap count == SWAP_MAP_SHMEM value is basically the same as having > swap count == 1, and swap_shmem_alloc() behaves analogously to > swap_duplicate() > > This RFC proposes the removal of this state and the associated helper to > simplify the state machine (both mentally and code-wise). We will also > have an extra state/special value that can be repurposed (for swap entries > that never gets re-duplicated). Please help me understand. After having this patch, the shmem swap entry will also use the swap continue as the only way to count shmem swap entries, am I missing something? That seems to have some performance hit for the shmem. Because unlike anonymous memory, The shmem can easily have a very high swap count. I would expect there to be some performance hit for the high share swap count usage case. Do you have any test number on high shared swap count usage that says it is negligible to remove it? Also if you remove the in the SWAP_MAP_SHMEM, wouldn't you need remove the counter in the shmem as well. Because the shmem counter is no longer connected to the swap count any more. Huge, do you have any feedback on this change? Chris > Another motivation (albeit a bit premature at the moment) is the new swap > abstraction I am currently working on, that would allow for swap/zswap > decoupling, swapoff optimization, etc. The fewer states and swap API > functions there are, the simpler the conversion will be. > > I am sending this series first as an RFC, just in case I missed something > or misunderstood this state, or if someone has a swap optimization in mind > for shmem that would require this special state. > > Swap experts, let me know if I'm mistaken :) Otherwise if there is no > objection I will resend this patch series again for merging. > > Nhat Pham (2): > swapfile: add a batched variant for swap_duplicate() > swap: shmem: remove SWAP_MAP_SHMEM > > include/linux/swap.h | 16 ++++++++-------- > mm/shmem.c | 2 +- > mm/swapfile.c | 28 +++++++++------------------- > 3 files changed, 18 insertions(+), 28 deletions(-) > > > base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049 > -- > 2.43.5
On Tue, Sep 24, 2024 at 1:16 PM Chris Li <chrisl@kernel.org> wrote: > > Hi Nhat, > > On Mon, Sep 23, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > > The SWAP_MAP_SHMEM state was originally introduced in the commit > > aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a > > swap entry belongs to shmem during swapoff. > > > > However, swapoff has since been rewritten drastically in the commit > > b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now > > having swap count == SWAP_MAP_SHMEM value is basically the same as having > > swap count == 1, and swap_shmem_alloc() behaves analogously to > > swap_duplicate() > > > > This RFC proposes the removal of this state and the associated helper to > > simplify the state machine (both mentally and code-wise). We will also > > have an extra state/special value that can be repurposed (for swap entries > > that never gets re-duplicated). > > Please help me understand. After having this patch, the shmem swap > entry will also use the swap continue as the only way to count shmem > swap entries, am I missing something? > > That seems to have some performance hit for the shmem. Because unlike > anonymous memory, The shmem can easily have a very high swap count. > I would expect there to be some performance hit for the high share > swap count usage case. > Do you have any test number on high shared swap count usage that says > it is negligible to remove it? Shmem only calls swap_duplicate() once in shmem_writepage() when we add the swap entry to the page cache. We do not increment the swap count once for each user like anon memory. IOW, the page cache is the only user of the shmem swap entry, so when we remove SWAP_MAP_SHMEM the swap count of shmem pages will either be 0 or 1 (disregarding SWAP_HAS_CACHE). So the swap continuation code is not used here. > > Also if you remove the in the SWAP_MAP_SHMEM, wouldn't you need > remove the counter in the shmem as well. Because the shmem counter is > no longer connected to the swap count any more. Not sure what you mean here. > > Huge, do you have any feedback on this change? Hugh* > > Chris
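To make the "count is either 0 or 1" point concrete, here is a hypothetical illustration of the swap_map byte for a swapped-out shmem page that is still in the swap cache. It is plain userspace arithmetic using the constants from include/linux/swap.h, not kernel code:

```c
#include <assert.h>

/* Values as defined in include/linux/swap.h at the time of this thread. */
#define SWAP_HAS_CACHE	0x40	/* entry is (also) in the swap cache */
#define SWAP_MAP_SHMEM	0xbf	/* "owned by shmem/tmpfs" special value */

int main(void)
{
	/* One shmem owner (the page cache) plus the swap-cache bit: */
	unsigned char before = SWAP_MAP_SHMEM | SWAP_HAS_CACHE;	/* 0xff */
	unsigned char after  = 1 | SWAP_HAS_CACHE;		/* 0x41 */

	/*
	 * Either way there is exactly one non-cache reference; the series
	 * just stops encoding it with a dedicated special value, so the
	 * count stays far below SWAP_MAP_MAX and never needs continuation.
	 */
	assert(before == 0xff && after == 0x41);
	return 0;
}
```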
On 2024/9/24 07:11, Nhat Pham wrote: > The SWAP_MAP_SHMEM state was originally introduced in the commit > aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a > swap entry belongs to shmem during swapoff. > > However, swapoff has since been rewritten drastically in the commit > b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now > having swap count == SWAP_MAP_SHMEM value is basically the same as having > swap count == 1, and swap_shmem_alloc() behaves analogously to > swap_duplicate() > > This RFC proposes the removal of this state and the associated helper to > simplify the state machine (both mentally and code-wise). We will also > have an extra state/special value that can be repurposed (for swap entries > that never gets re-duplicated). > > Another motivation (albeit a bit premature at the moment) is the new swap > abstraction I am currently working on, that would allow for swap/zswap > decoupling, swapoff optimization, etc. The fewer states and swap API > functions there are, the simpler the conversion will be. > > I am sending this series first as an RFC, just in case I missed something > or misunderstood this state, or if someone has a swap optimization in mind > for shmem that would require this special state. The idea makes sense to me. I did a quick test with shmem mTHP, and encountered the following warning which is triggered by 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate(). [ 81.064967] ------------[ cut here ]------------ [ 81.064968] WARNING: CPU: 4 PID: 6852 at mm/swapfile.c:3623 __swap_duplicate+0x1d0/0x2e0 [ 81.064994] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 81.064995] pc : __swap_duplicate+0x1d0/0x2e0 [ 81.064997] lr : swap_duplicate_nr+0x30/0x70 [......] [ 81.065019] Call trace: [ 81.065019] __swap_duplicate+0x1d0/0x2e0 [ 81.065021] swap_duplicate_nr+0x30/0x70 [ 81.065022] shmem_writepage+0x24c/0x438 [ 81.065024] pageout+0x104/0x2e0 [ 81.065026] shrink_folio_list+0x7f0/0xe60 [ 81.065027] reclaim_folio_list+0x90/0x178 [ 81.065029] reclaim_pages+0x128/0x1a8 [ 81.065030] madvise_cold_or_pageout_pte_range+0x80c/0xd10 [ 81.065031] walk_pmd_range.isra.0+0x1b8/0x3a0 [ 81.065033] walk_pud_range+0x120/0x1b0 [ 81.065035] walk_pgd_range+0x150/0x1a8 [ 81.065036] __walk_page_range+0xa4/0xb8 [ 81.065038] walk_page_range+0x1c8/0x250 [ 81.065039] madvise_pageout+0xf4/0x280 [ 81.065041] madvise_vma_behavior+0x268/0x3f0 [ 81.065042] madvise_walk_vmas.constprop.0+0xb8/0x128 [ 81.065043] do_madvise.part.0+0xe8/0x2a0 [ 81.065044] __arm64_sys_madvise+0x64/0x78 [ 81.065046] invoke_syscall.constprop.0+0x54/0xe8 [ 81.065048] do_el0_svc+0xa4/0xc0 [ 81.065050] el0_svc+0x2c/0xb0 [ 81.065052] el0t_64_sync_handler+0xb8/0xc0 [ 81.065054] el0t_64_sync+0x14c/0x150 > Swap experts, let me know if I'm mistaken :) Otherwise if there is no > objection I will resend this patch series again for merging. > > Nhat Pham (2): > swapfile: add a batched variant for swap_duplicate() > swap: shmem: remove SWAP_MAP_SHMEM > > include/linux/swap.h | 16 ++++++++-------- > mm/shmem.c | 2 +- > mm/swapfile.c | 28 +++++++++------------------- > 3 files changed, 18 insertions(+), 28 deletions(-) > > > base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang <baolin.wang@linux.alibaba.com> wrote: > > > > On 2024/9/24 07:11, Nhat Pham wrote: > > The SWAP_MAP_SHMEM state was originally introduced in the commit > > aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a > > swap entry belongs to shmem during swapoff. > > > > However, swapoff has since been rewritten drastically in the commit > > b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now > > having swap count == SWAP_MAP_SHMEM value is basically the same as having > > swap count == 1, and swap_shmem_alloc() behaves analogously to > > swap_duplicate() > > > > This RFC proposes the removal of this state and the associated helper to > > simplify the state machine (both mentally and code-wise). We will also > > have an extra state/special value that can be repurposed (for swap entries > > that never gets re-duplicated). > > > > Another motivation (albeit a bit premature at the moment) is the new swap > > abstraction I am currently working on, that would allow for swap/zswap > > decoupling, swapoff optimization, etc. The fewer states and swap API > > functions there are, the simpler the conversion will be. > > > > I am sending this series first as an RFC, just in case I missed something > > or misunderstood this state, or if someone has a swap optimization in mind > > for shmem that would require this special state. > > The idea makes sense to me. I did a quick test with shmem mTHP, and > encountered the following warning which is triggered by > 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate(). Apparently __swap_duplicate() does not currently handle increasing the swap count for multiple swap entries by 1 (i.e. usage == 1) because it does not handle rolling back count increases when swap_count_continued() fails. I guess this voids my Reviewed-by until we sort this out. Technically swap_count_continued() won't ever be called for shmem because we only ever increment the count by 1, but there is no way to know this in __swap_duplicate() without SWAP_HAS_SHMEM. > > [ 81.064967] ------------[ cut here ]------------ > [ 81.064968] WARNING: CPU: 4 PID: 6852 at mm/swapfile.c:3623 > __swap_duplicate+0x1d0/0x2e0 > [ 81.064994] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS > BTYPE=--) > [ 81.064995] pc : __swap_duplicate+0x1d0/0x2e0 > [ 81.064997] lr : swap_duplicate_nr+0x30/0x70 > [......] 
> [ 81.065019] Call trace: > [ 81.065019] __swap_duplicate+0x1d0/0x2e0 > [ 81.065021] swap_duplicate_nr+0x30/0x70 > [ 81.065022] shmem_writepage+0x24c/0x438 > [ 81.065024] pageout+0x104/0x2e0 > [ 81.065026] shrink_folio_list+0x7f0/0xe60 > [ 81.065027] reclaim_folio_list+0x90/0x178 > [ 81.065029] reclaim_pages+0x128/0x1a8 > [ 81.065030] madvise_cold_or_pageout_pte_range+0x80c/0xd10 > [ 81.065031] walk_pmd_range.isra.0+0x1b8/0x3a0 > [ 81.065033] walk_pud_range+0x120/0x1b0 > [ 81.065035] walk_pgd_range+0x150/0x1a8 > [ 81.065036] __walk_page_range+0xa4/0xb8 > [ 81.065038] walk_page_range+0x1c8/0x250 > [ 81.065039] madvise_pageout+0xf4/0x280 > [ 81.065041] madvise_vma_behavior+0x268/0x3f0 > [ 81.065042] madvise_walk_vmas.constprop.0+0xb8/0x128 > [ 81.065043] do_madvise.part.0+0xe8/0x2a0 > [ 81.065044] __arm64_sys_madvise+0x64/0x78 > [ 81.065046] invoke_syscall.constprop.0+0x54/0xe8 > [ 81.065048] do_el0_svc+0xa4/0xc0 > [ 81.065050] el0_svc+0x2c/0xb0 > [ 81.065052] el0t_64_sync_handler+0xb8/0xc0 > [ 81.065054] el0t_64_sync+0x14c/0x150 > > > Swap experts, let me know if I'm mistaken :) Otherwise if there is no > > objection I will resend this patch series again for merging. > > > > Nhat Pham (2): > > swapfile: add a batched variant for swap_duplicate() > > swap: shmem: remove SWAP_MAP_SHMEM > > > > include/linux/swap.h | 16 ++++++++-------- > > mm/shmem.c | 2 +- > > mm/swapfile.c | 28 +++++++++------------------- > > 3 files changed, 18 insertions(+), 28 deletions(-) > > > > > > base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
On 2024/9/24 10:15, Yosry Ahmed wrote: > On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang > <baolin.wang@linux.alibaba.com> wrote: >> >> >> >> On 2024/9/24 07:11, Nhat Pham wrote: >>> The SWAP_MAP_SHMEM state was originally introduced in the commit >>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a >>> swap entry belongs to shmem during swapoff. >>> >>> However, swapoff has since been rewritten drastically in the commit >>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now >>> having swap count == SWAP_MAP_SHMEM value is basically the same as having >>> swap count == 1, and swap_shmem_alloc() behaves analogously to >>> swap_duplicate() >>> >>> This RFC proposes the removal of this state and the associated helper to >>> simplify the state machine (both mentally and code-wise). We will also >>> have an extra state/special value that can be repurposed (for swap entries >>> that never gets re-duplicated). >>> >>> Another motivation (albeit a bit premature at the moment) is the new swap >>> abstraction I am currently working on, that would allow for swap/zswap >>> decoupling, swapoff optimization, etc. The fewer states and swap API >>> functions there are, the simpler the conversion will be. >>> >>> I am sending this series first as an RFC, just in case I missed something >>> or misunderstood this state, or if someone has a swap optimization in mind >>> for shmem that would require this special state. >> >> The idea makes sense to me. I did a quick test with shmem mTHP, and >> encountered the following warning which is triggered by >> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate(). > > Apparently __swap_duplicate() does not currently handle increasing the > swap count for multiple swap entries by 1 (i.e. usage == 1) because it > does not handle rolling back count increases when > swap_count_continued() fails. > > I guess this voids my Reviewed-by until we sort this out. Technically > swap_count_continued() won't ever be called for shmem because we only > ever increment the count by 1, but there is no way to know this in > __swap_duplicate() without SWAP_HAS_SHMEM. Agreed. An easy solution might be to add a new boolean parameter to indicate whether the SHMEM swap entry count is increasing? diff --git a/mm/swapfile.c b/mm/swapfile.c index cebc244ee60f..21f1eec2c30a 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3607,7 +3607,7 @@ void si_swapinfo(struct sysinfo *val) * - swap-cache reference is requested but the entry is not used. -> ENOENT * - swap-mapped reference requested but needs continued swap count. -> ENOMEM */ -static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) +static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr, bool shmem) { struct swap_info_struct *si; struct swap_cluster_info *ci; @@ -3620,7 +3620,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) offset = swp_offset(entry); VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); - VM_WARN_ON(usage == 1 && nr > 1); + VM_WARN_ON(usage == 1 && nr > 1 && !shmem); ci = lock_cluster_or_swap_info(si, offset); err = 0; @@ -3661,7 +3661,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) has_cache = SWAP_HAS_CACHE; else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) count += usage; - else if (swap_count_continued(si, offset + i, count)) + else if (!shmem && swap_count_continued(si, offset + i, count)) count = COUNT_CONTINUED; else { /*
On Mon, Sep 23, 2024 at 8:25 PM Baolin Wang <baolin.wang@linux.alibaba.com> wrote: > > > On 2024/9/24 10:15, Yosry Ahmed wrote: > > On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang > > <baolin.wang@linux.alibaba.com> wrote: > >> > >> > >> > >> On 2024/9/24 07:11, Nhat Pham wrote: > >>> The SWAP_MAP_SHMEM state was originally introduced in the commit > >>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a > >>> swap entry belongs to shmem during swapoff. > >>> > >>> However, swapoff has since been rewritten drastically in the commit > >>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now > >>> having swap count == SWAP_MAP_SHMEM value is basically the same as having > >>> swap count == 1, and swap_shmem_alloc() behaves analogously to > >>> swap_duplicate() > >>> > >>> This RFC proposes the removal of this state and the associated helper to > >>> simplify the state machine (both mentally and code-wise). We will also > >>> have an extra state/special value that can be repurposed (for swap entries > >>> that never gets re-duplicated). > >>> > >>> Another motivation (albeit a bit premature at the moment) is the new swap > >>> abstraction I am currently working on, that would allow for swap/zswap > >>> decoupling, swapoff optimization, etc. The fewer states and swap API > >>> functions there are, the simpler the conversion will be. > >>> > >>> I am sending this series first as an RFC, just in case I missed something > >>> or misunderstood this state, or if someone has a swap optimization in mind > >>> for shmem that would require this special state. > >> > >> The idea makes sense to me. I did a quick test with shmem mTHP, and > >> encountered the following warning which is triggered by > >> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate(). > > > > Apparently __swap_duplicate() does not currently handle increasing the > > swap count for multiple swap entries by 1 (i.e. usage == 1) because it > > does not handle rolling back count increases when > > swap_count_continued() fails. > > > > I guess this voids my Reviewed-by until we sort this out. Technically > > swap_count_continued() won't ever be called for shmem because we only > > ever increment the count by 1, but there is no way to know this in > > __swap_duplicate() without SWAP_HAS_SHMEM. Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to remove the swapfile check (that's another can of worms, but I need data before submitting the patch to remove it...) One thing we can do is instead of warning here, we can handle it in the for loop check, where we have access to count - that's the point of having that for-loop check anyway? :) There's a couple of ways to go about it: 1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 ); (or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX)) 2. Alternatively, instead of warning here, we can simply return -ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since this MUST succeed. Either solutions should follow with careful documentation to make it clear the expectation/guarantee of the new API. Yosry, Baolin, how do you two feel about this? Would something like this work? I need to test it first, but let me know if I'm missing something. If this does not work, we can do what Baolin is suggesting, and perhaps maintain the swap_shmem_alloc() helper. It's less than ideal, but at least we still lose a state...
On Tue, Sep 24, 2024 at 7:32 AM Nhat Pham <nphamcs@gmail.com> wrote: > > On Mon, Sep 23, 2024 at 8:25 PM Baolin Wang > <baolin.wang@linux.alibaba.com> wrote: > > > > > > On 2024/9/24 10:15, Yosry Ahmed wrote: > > > On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang > > > <baolin.wang@linux.alibaba.com> wrote: > > >> > > >> > > >> > > >> On 2024/9/24 07:11, Nhat Pham wrote: > > >>> The SWAP_MAP_SHMEM state was originally introduced in the commit > > >>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a > > >>> swap entry belongs to shmem during swapoff. > > >>> > > >>> However, swapoff has since been rewritten drastically in the commit > > >>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now > > >>> having swap count == SWAP_MAP_SHMEM value is basically the same as having > > >>> swap count == 1, and swap_shmem_alloc() behaves analogously to > > >>> swap_duplicate() > > >>> > > >>> This RFC proposes the removal of this state and the associated helper to > > >>> simplify the state machine (both mentally and code-wise). We will also > > >>> have an extra state/special value that can be repurposed (for swap entries > > >>> that never gets re-duplicated). > > >>> > > >>> Another motivation (albeit a bit premature at the moment) is the new swap > > >>> abstraction I am currently working on, that would allow for swap/zswap > > >>> decoupling, swapoff optimization, etc. The fewer states and swap API > > >>> functions there are, the simpler the conversion will be. > > >>> > > >>> I am sending this series first as an RFC, just in case I missed something > > >>> or misunderstood this state, or if someone has a swap optimization in mind > > >>> for shmem that would require this special state. > > >> > > >> The idea makes sense to me. I did a quick test with shmem mTHP, and > > >> encountered the following warning which is triggered by > > >> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate(). > > > > > > Apparently __swap_duplicate() does not currently handle increasing the > > > swap count for multiple swap entries by 1 (i.e. usage == 1) because it > > > does not handle rolling back count increases when > > > swap_count_continued() fails. > > > > > > I guess this voids my Reviewed-by until we sort this out. Technically > > > swap_count_continued() won't ever be called for shmem because we only > > > ever increment the count by 1, but there is no way to know this in > > > __swap_duplicate() without SWAP_HAS_SHMEM. > > Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to > remove the swapfile check (that's another can of worms, but I need > data before submitting the patch to remove it...) > > One thing we can do is instead of warning here, we can handle it in > the for loop check, where we have access to count - that's the point > of having that for-loop check anyway? :) > > There's a couple of ways to go about it: > > 1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 ); Hmm that should work, although it's a bit complicated tbh. > (or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX)) I think this will make the warning very hard to hit if there's a misuse of __swap_duplicate(). It will only be hit when an entry needs count continuation, which I am not sure is very common. If there's a bug, the warning will potentially catch it too late, if ever. The side effect here is failing to decrement the swap count of some swap entries which will lead to them never being freed, essentially leaking swap capacity slowly over time. 
I am not sure if there are more detrimental effects. > > 2. Alternatively, instead of warning here, we can simply return > -ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since > this MUST succeed. We still fail to rollback incremented counts though when we return -ENOMEM, right? Maybe I didn't get what you mean. > > Either solutions should follow with careful documentation to make it > clear the expectation/guarantee of the new API. > > Yosry, Baolin, how do you two feel about this? Would something like > this work? I need to test it first, but let me know if I'm missing > something. > > If this does not work, we can do what Baolin is suggesting, and > perhaps maintain the swap_shmem_alloc() helper. It's less than ideal, > but at least we still lose a state... Depending on the complexity tbh, right now removing SWAP_MAP_SHMEM is just a cleanup with small wins, so if it's too complicated to remove it it may not be worth it. I am assuming with your ongoing work, it becomes much more valuable, so maybe if it's too complicated we can defer it until the benefits are realizable?
On Tue, Sep 24, 2024 at 8:08 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Tue, Sep 24, 2024 at 7:32 AM Nhat Pham <nphamcs@gmail.com> wrote: > > > > On Mon, Sep 23, 2024 at 8:25 PM Baolin Wang > > <baolin.wang@linux.alibaba.com> wrote: > > > > > > > > > On 2024/9/24 10:15, Yosry Ahmed wrote: > > > > On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang > > > > <baolin.wang@linux.alibaba.com> wrote: > > > >> > > > >> > > > >> > > > >> On 2024/9/24 07:11, Nhat Pham wrote: > > > >>> The SWAP_MAP_SHMEM state was originally introduced in the commit > > > >>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a > > > >>> swap entry belongs to shmem during swapoff. > > > >>> > > > >>> However, swapoff has since been rewritten drastically in the commit > > > >>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now > > > >>> having swap count == SWAP_MAP_SHMEM value is basically the same as having > > > >>> swap count == 1, and swap_shmem_alloc() behaves analogously to > > > >>> swap_duplicate() > > > >>> > > > >>> This RFC proposes the removal of this state and the associated helper to > > > >>> simplify the state machine (both mentally and code-wise). We will also > > > >>> have an extra state/special value that can be repurposed (for swap entries > > > >>> that never gets re-duplicated). > > > >>> > > > >>> Another motivation (albeit a bit premature at the moment) is the new swap > > > >>> abstraction I am currently working on, that would allow for swap/zswap > > > >>> decoupling, swapoff optimization, etc. The fewer states and swap API > > > >>> functions there are, the simpler the conversion will be. > > > >>> > > > >>> I am sending this series first as an RFC, just in case I missed something > > > >>> or misunderstood this state, or if someone has a swap optimization in mind > > > >>> for shmem that would require this special state. > > > >> > > > >> The idea makes sense to me. I did a quick test with shmem mTHP, and > > > >> encountered the following warning which is triggered by > > > >> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate(). > > > > > > > > Apparently __swap_duplicate() does not currently handle increasing the > > > > swap count for multiple swap entries by 1 (i.e. usage == 1) because it > > > > does not handle rolling back count increases when > > > > swap_count_continued() fails. > > > > > > > > I guess this voids my Reviewed-by until we sort this out. Technically > > > > swap_count_continued() won't ever be called for shmem because we only > > > > ever increment the count by 1, but there is no way to know this in > > > > __swap_duplicate() without SWAP_HAS_SHMEM. > > > > Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to > > remove the swapfile check (that's another can of worms, but I need > > data before submitting the patch to remove it...) > > > > One thing we can do is instead of warning here, we can handle it in > > the for loop check, where we have access to count - that's the point > > of having that for-loop check anyway? :) > > > > There's a couple of ways to go about it: > > > > 1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 ); > > Hmm that should work, although it's a bit complicated tbh. > > > (or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX)) > > I think this will make the warning very hard to hit if there's a > misuse of __swap_duplicate(). It will only be hit when an entry needs > count continuation, which I am not sure is very common. 
If there's a > bug, the warning will potentially catch it too late, if ever. > > The side effect here is failing to decrement the swap count of some > swap entries which will lead to them never being freed, essentially > leaking swap capacity slowly over time. I am not sure if there are > more detrimental effects. > > > > > 2. Alternatively, instead of warning here, we can simply return > > -ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since > > this MUST succeed. > > We still fail to rollback incremented counts though when we return > -ENOMEM, right? Maybe I didn't get what you mean. My understanding now is that there are two for loops. One for loop that checks the entry's states, and one for loop that does the actual incrementing work (or state modification). We can check in the first for loop, if it is safe to proceed: if (!count && !has_cache) { err = -ENOENT; } else if (usage == SWAP_HAS_CACHE) { if (has_cache) err = -EEXIST; } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) { err = -EINVAL; } else if (usage == 1 && nr > 1 && (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX)) { /* the batched variants currently do not support rollback */ err = -ENOMEM; } At this point, IIUC, we have not done any incrementing, so no rollback needed? :) > > > > > Either solutions should follow with careful documentation to make it > > clear the expectation/guarantee of the new API. > > > > Yosry, Baolin, how do you two feel about this? Would something like > > this work? I need to test it first, but let me know if I'm missing > > something. > > > > If this does not work, we can do what Baolin is suggesting, and > > perhaps maintain the swap_shmem_alloc() helper. It's less than ideal, > > but at least we still lose a state... > > Depending on the complexity tbh, right now removing SWAP_MAP_SHMEM is > just a cleanup with small wins, so if it's too complicated to remove > it it may not be worth it. I am assuming with your ongoing work, it > becomes much more valuable, so maybe if it's too complicated we can > defer it until the benefits are realizable? I agree :)
Nhat Pham <nphamcs@gmail.com> writes:

[snip]

> My understanding now is that there are two for loops. One for loop
> that checks the entry's states, and one for loop that does the actual
> incrementing work (or state modification).
>
> We can check in the first for loop, if it is safe to proceed:
>
> if (!count && !has_cache) {
>         err = -ENOENT;
> } else if (usage == SWAP_HAS_CACHE) {
>         if (has_cache)
>                 err = -EEXIST;
> } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
>         err = -EINVAL;
> } else if (usage == 1 && nr > 1 && (count & ~COUNT_CONTINUED) >=
> SWAP_MAP_MAX)) {
>         /* the batched variants currently do not support rollback */
>         err = -ENOMEM;
> }
>
> At this point, IIUC, we have not done any incrementing, so no rollback
> needed? :)

I think that it's better to add a VM_WARN_ONCE() here. If someone
enabled 'nr > 1' for __swap_duplicate(), the issue will be more
explicit.

[snip]

--
Best Regards,
Huang, Ying
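A hypothetical sketch of what that branch could look like with the warning folded in; this is only a fragment of the first check loop, keeping Nhat's condition from above:

```c
} else if (usage == 1 && nr > 1 &&
	   (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX) {
	/*
	 * A batched duplicate cannot roll back partially-added count
	 * continuations, so refuse here before anything is modified.
	 * No current caller should ever get here.
	 */
	VM_WARN_ONCE(1, "batched swap duplication needs count continuation");
	err = -ENOMEM;
}
```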
On 2024/9/24 23:48, Nhat Pham wrote: > On Tue, Sep 24, 2024 at 8:08 AM Yosry Ahmed <yosryahmed@google.com> wrote: >> >> On Tue, Sep 24, 2024 at 7:32 AM Nhat Pham <nphamcs@gmail.com> wrote: >>> >>> On Mon, Sep 23, 2024 at 8:25 PM Baolin Wang >>> <baolin.wang@linux.alibaba.com> wrote: >>>> >>>> >>>> On 2024/9/24 10:15, Yosry Ahmed wrote: >>>>> On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang >>>>> <baolin.wang@linux.alibaba.com> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On 2024/9/24 07:11, Nhat Pham wrote: >>>>>>> The SWAP_MAP_SHMEM state was originally introduced in the commit >>>>>>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a >>>>>>> swap entry belongs to shmem during swapoff. >>>>>>> >>>>>>> However, swapoff has since been rewritten drastically in the commit >>>>>>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now >>>>>>> having swap count == SWAP_MAP_SHMEM value is basically the same as having >>>>>>> swap count == 1, and swap_shmem_alloc() behaves analogously to >>>>>>> swap_duplicate() >>>>>>> >>>>>>> This RFC proposes the removal of this state and the associated helper to >>>>>>> simplify the state machine (both mentally and code-wise). We will also >>>>>>> have an extra state/special value that can be repurposed (for swap entries >>>>>>> that never gets re-duplicated). >>>>>>> >>>>>>> Another motivation (albeit a bit premature at the moment) is the new swap >>>>>>> abstraction I am currently working on, that would allow for swap/zswap >>>>>>> decoupling, swapoff optimization, etc. The fewer states and swap API >>>>>>> functions there are, the simpler the conversion will be. >>>>>>> >>>>>>> I am sending this series first as an RFC, just in case I missed something >>>>>>> or misunderstood this state, or if someone has a swap optimization in mind >>>>>>> for shmem that would require this special state. >>>>>> >>>>>> The idea makes sense to me. I did a quick test with shmem mTHP, and >>>>>> encountered the following warning which is triggered by >>>>>> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate(). >>>>> >>>>> Apparently __swap_duplicate() does not currently handle increasing the >>>>> swap count for multiple swap entries by 1 (i.e. usage == 1) because it >>>>> does not handle rolling back count increases when >>>>> swap_count_continued() fails. >>>>> >>>>> I guess this voids my Reviewed-by until we sort this out. Technically >>>>> swap_count_continued() won't ever be called for shmem because we only >>>>> ever increment the count by 1, but there is no way to know this in >>>>> __swap_duplicate() without SWAP_HAS_SHMEM. >>> >>> Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to >>> remove the swapfile check (that's another can of worms, but I need >>> data before submitting the patch to remove it...) >>> >>> One thing we can do is instead of warning here, we can handle it in >>> the for loop check, where we have access to count - that's the point >>> of having that for-loop check anyway? :) >>> >>> There's a couple of ways to go about it: >>> >>> 1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 ); >> >> Hmm that should work, although it's a bit complicated tbh. >> >>> (or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX)) >> >> I think this will make the warning very hard to hit if there's a >> misuse of __swap_duplicate(). It will only be hit when an entry needs >> count continuation, which I am not sure is very common. If there's a >> bug, the warning will potentially catch it too late, if ever. 
>> >> The side effect here is failing to decrement the swap count of some >> swap entries which will lead to them never being freed, essentially >> leaking swap capacity slowly over time. I am not sure if there are >> more detrimental effects. >> >>> >>> 2. Alternatively, instead of warning here, we can simply return >>> -ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since >>> this MUST succeed. >> >> We still fail to rollback incremented counts though when we return >> -ENOMEM, right? Maybe I didn't get what you mean. > > My understanding now is that there are two for loops. One for loop > that checks the entry's states, and one for loop that does the actual > incrementing work (or state modification). > > We can check in the first for loop, if it is safe to proceed: > > if (!count && !has_cache) { > err = -ENOENT; > } else if (usage == SWAP_HAS_CACHE) { > if (has_cache) > err = -EEXIST; > } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) { > err = -EINVAL; > } else if (usage == 1 && nr > 1 && (count & ~COUNT_CONTINUED) >= > SWAP_MAP_MAX)) { > /* the batched variants currently do not support rollback */ > err = -ENOMEM; > } > > At this point, IIUC, we have not done any incrementing, so no rollback > needed? :) Right, looks good (although need some cleanup pointed by Yosry). >>> >>> Either solutions should follow with careful documentation to make it >>> clear the expectation/guarantee of the new API. >>> >>> Yosry, Baolin, how do you two feel about this? Would something like >>> this work? I need to test it first, but let me know if I'm missing >>> something. >>> >>> If this does not work, we can do what Baolin is suggesting, and >>> perhaps maintain the swap_shmem_alloc() helper. It's less than ideal, >>> but at least we still lose a state... >> >> Depending on the complexity tbh, right now removing SWAP_MAP_SHMEM is >> just a cleanup with small wins, so if it's too complicated to remove >> it it may not be worth it. I am assuming with your ongoing work, it >> becomes much more valuable, so maybe if it's too complicated we can >> defer it until the benefits are realizable? > > I agree :) One benefit I can mention is that removing 'SWAP_MAP_SHMEM' can help to batch free shmem swap entries in __swap_entries_free(), similar to the commit bea67dcc5eea ("mm: attempt to batch free swap entries for zap_pte_range()") did, which can improve the performance of shmem mTHP munmap() function based on my testing. Without this patch set, I need do following changes to batch free shmem swap entries: diff --git a/mm/swapfile.c b/mm/swapfile.c index 0cded32414a1..94e28cd60c52 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -163,7 +163,7 @@ static bool swap_is_last_map(struct swap_info_struct *si, unsigned char *map_end = map + nr_pages; unsigned char count = *map; - if (swap_count(count) != 1) + if (swap_count(count) != 1 && swap_count(count) != SWAP_MAP_SHMEM) return false; while (++map < map_end) { @@ -1503,10 +1503,10 @@ static bool __swap_entries_free(struct swap_info_struct *si, unsigned int type = swp_type(entry); struct swap_cluster_info *ci; bool has_cache = false; - unsigned char count; + unsigned char count = swap_count(data_race(si->swap_map[offset])); int i; - if (nr <= 1 || swap_count(data_race(si->swap_map[offset])) != 1) + if (nr <= 1 || (count != 1 && count != SWAP_MAP_SHMEM)) goto fallback; /* cross into another cluster */ if (nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER)
On Tue, Sep 24, 2024 at 6:53 PM Baolin Wang <baolin.wang@linux.alibaba.com> wrote: > > > One benefit I can mention is that removing 'SWAP_MAP_SHMEM' can help to > batch free shmem swap entries in __swap_entries_free(), similar to the > commit bea67dcc5eea ("mm: attempt to batch free swap entries for > zap_pte_range()") did, which can improve the performance of shmem mTHP > munmap() function based on my testing. Yeah, the problem with having an extraneous state is you have to constantly check for it in code, and/or keep it in mind when you develop things. I've been constantly having to check for this state when I develop code around this area, and it gets old fast. If we can use it to optimize something, I can understand keeping it. But it just seems like dead code to me :) My preference is to do this as simply as possible - add another case (usage == 1, nr > 1, and we need to add swap continuations) in the check in __swap_duplicate()'s first loop, and just WARN right there. That case CANNOT happen UNLESS we introduce a bug, or have a new use case. When we actually have a use case, we can always introduce handling/fallback logic for that case. Barry, Yosry, Baolin, Ying, how do you feel about this?
On Thu, Sep 26, 2024 at 2:37 AM Nhat Pham <nphamcs@gmail.com> wrote: > > On Tue, Sep 24, 2024 at 6:53 PM Baolin Wang > <baolin.wang@linux.alibaba.com> wrote: > > > > > > One benefit I can mention is that removing 'SWAP_MAP_SHMEM' can help to > > batch free shmem swap entries in __swap_entries_free(), similar to the > > commit bea67dcc5eea ("mm: attempt to batch free swap entries for > > zap_pte_range()") did, which can improve the performance of shmem mTHP > > munmap() function based on my testing. > > Yeah, the problem with having an extraneous state is you have to > constantly check for it in code, and/or keep it in mind when you > develop things. I've been constantly having to check for this state > when I develop code around this area, and it gets old fast. > > If we can use it to optimize something, I can understand keeping it. > But it just seems like dead code to me :) > > My preference is to do this as simply as possible - add another case > (usage == 1, nr > 1, and we need to add swap continuations) in the > check in __swap_duplicate()'s first loop, and just WARN right there. > > That case CANNOT happen UNLESS we introduce a bug, or have a new use > case. When we actually have a use case, we can always introduce > handling/fallback logic for that case. > > Barry, Yosry, Baolin, Ying, how do you feel about this? Hi Nhat, I’m not entirely clear on your point. If your proposal is to support the case where usage == 1 and nr > 1 only when we don’t require CONTINUED, and to issue a warning once we determine that CONTINUED is needed, then I’m completely on board with that approach. It seems that your intention is to simply relocate the existing warning to the scenario where CONTINUED is actually required, rather than maintaining a warning for the case where usage == 1 and nr > 1 at all times? I wasn't actually suggesting a rollback as you posted: err = __swap_duplicate(entry, 1, nr); if (err == -ENOMEM) { /* fallback to non-batched version */ for (i = 0; i < nr; i++) { cur_entry = (swp_entry_t){entry.val + i}; if (swap_duplicate(cur_entry)) { /* rollback */ while (--i >= 0) { cur_entry = (swp_entry_t){entry.val + i}; swap_free(cur_entry); } I suggested checking for all errors except -ENOMEM in the first loop. If all conditions indicate that only CONTINUED is necessary (with no other issues like ENOENT, EEXIST, etc., for any entry), we can proceed with a for loop for swap_duplicate(), eliminating the need for a rollback. However, I agree that we can proceed with that only when there is actually a user involved. (There's no need to address it right now.) Thanks Barry
Nhat Pham <nphamcs@gmail.com> writes: > On Tue, Sep 24, 2024 at 6:53 PM Baolin Wang > <baolin.wang@linux.alibaba.com> wrote: >> >> >> One benefit I can mention is that removing 'SWAP_MAP_SHMEM' can help to >> batch free shmem swap entries in __swap_entries_free(), similar to the >> commit bea67dcc5eea ("mm: attempt to batch free swap entries for >> zap_pte_range()") did, which can improve the performance of shmem mTHP >> munmap() function based on my testing. > > Yeah, the problem with having an extraneous state is you have to > constantly check for it in code, and/or keep it in mind when you > develop things. I've been constantly having to check for this state > when I develop code around this area, and it gets old fast. > > If we can use it to optimize something, I can understand keeping it. > But it just seems like dead code to me :) > > My preference is to do this as simply as possible - add another case > (usage == 1, nr > 1, and we need to add swap continuations) in the > check in __swap_duplicate()'s first loop, and just WARN right there. > > That case CANNOT happen UNLESS we introduce a bug, or have a new use > case. When we actually have a use case, we can always introduce > handling/fallback logic for that case. > > Barry, Yosry, Baolin, Ying, how do you feel about this? Sounds good to me to just WARN now. We can always improve when it's necessary. -- Best Regards, Huang, Ying
On 2024/9/26 09:59, Huang, Ying wrote: > Nhat Pham <nphamcs@gmail.com> writes: > >> On Tue, Sep 24, 2024 at 6:53 PM Baolin Wang >> <baolin.wang@linux.alibaba.com> wrote: >>> >>> >>> One benefit I can mention is that removing 'SWAP_MAP_SHMEM' can help to >>> batch free shmem swap entries in __swap_entries_free(), similar to the >>> commit bea67dcc5eea ("mm: attempt to batch free swap entries for >>> zap_pte_range()") did, which can improve the performance of shmem mTHP >>> munmap() function based on my testing. >> >> Yeah, the problem with having an extraneous state is you have to >> constantly check for it in code, and/or keep it in mind when you >> develop things. I've been constantly having to check for this state >> when I develop code around this area, and it gets old fast. >> >> If we can use it to optimize something, I can understand keeping it. >> But it just seems like dead code to me :) >> >> My preference is to do this as simply as possible - add another case >> (usage == 1, nr > 1, and we need to add swap continuations) in the >> check in __swap_duplicate()'s first loop, and just WARN right there. >> >> That case CANNOT happen UNLESS we introduce a bug, or have a new use >> case. When we actually have a use case, we can always introduce >> handling/fallback logic for that case. >> >> Barry, Yosry, Baolin, Ying, how do you feel about this? > > Sounds good to me to just WARN now. We can always improve when it's > necessary. +1. Agreed.
[..] > > > > > Apparently __swap_duplicate() does not currently handle increasing the > > > > > swap count for multiple swap entries by 1 (i.e. usage == 1) because it > > > > > does not handle rolling back count increases when > > > > > swap_count_continued() fails. > > > > > > > > > > I guess this voids my Reviewed-by until we sort this out. Technically > > > > > swap_count_continued() won't ever be called for shmem because we only > > > > > ever increment the count by 1, but there is no way to know this in > > > > > __swap_duplicate() without SWAP_HAS_SHMEM. > > > > > > Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to > > > remove the swapfile check (that's another can of worms, but I need > > > data before submitting the patch to remove it...) > > > > > > One thing we can do is instead of warning here, we can handle it in > > > the for loop check, where we have access to count - that's the point > > > of having that for-loop check anyway? :) > > > > > > There's a couple of ways to go about it: > > > > > > 1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 ); > > > > Hmm that should work, although it's a bit complicated tbh. > > > > > (or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX)) > > > > I think this will make the warning very hard to hit if there's a > > misuse of __swap_duplicate(). It will only be hit when an entry needs > > count continuation, which I am not sure is very common. If there's a > > bug, the warning will potentially catch it too late, if ever. > > > > The side effect here is failing to decrement the swap count of some > > swap entries which will lead to them never being freed, essentially > > leaking swap capacity slowly over time. I am not sure if there are > > more detrimental effects. > > > > > > > > 2. Alternatively, instead of warning here, we can simply return > > > -ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since > > > this MUST succeed. > > > > We still fail to rollback incremented counts though when we return > > -ENOMEM, right? Maybe I didn't get what you mean. > > My understanding now is that there are two for loops. One for loop > that checks the entry's states, and one for loop that does the actual > incrementing work (or state modification). > > We can check in the first for loop, if it is safe to proceed: > > if (!count && !has_cache) { > err = -ENOENT; > } else if (usage == SWAP_HAS_CACHE) { > if (has_cache) > err = -EEXIST; > } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) { > err = -EINVAL; > } else if (usage == 1 && nr > 1 && (count & ~COUNT_CONTINUED) >= > SWAP_MAP_MAX)) { > /* the batched variants currently do not support rollback */ > err = -ENOMEM; > } Hmm yeah I think something like this should work and is arguably better than just warning, although this needs cleaning up: - We already know usage != SWAP_HAS_CACHE, so no need to check if usage == 1. - We already know (count & ~COUNT_CONTINUED) is larger than SWAP_MAP_MAX, so we should check if it's equal to SWAP_MAP_MAX. - We should probably just calculate count & ~COUNT_CONTINUED above the if conditions at this point. I would also like to hear what Barry thinks since he added this (and I just realized he is not CC'd).
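Putting those cleanups together (noting that the earlier -EINVAL branch already rules out counts above SWAP_MAP_MAX, so only the == SWAP_MAP_MAX case is left to catch), the first loop's check could end up looking something like the sketch below. It is hypothetical and untested, and map_count is a new local introduced here only for readability:

```c
	for (i = 0; i < nr; i++) {
		count = si->swap_map[offset + i];
		has_cache = count & SWAP_HAS_CACHE;
		count &= ~SWAP_HAS_CACHE;
		map_count = count & ~COUNT_CONTINUED;

		if (!count && !has_cache) {
			err = -ENOENT;
		} else if (usage == SWAP_HAS_CACHE) {
			if (has_cache)
				err = -EEXIST;
		} else if (map_count > SWAP_MAP_MAX) {
			err = -EINVAL;
		} else if (nr > 1 && map_count == SWAP_MAP_MAX) {
			/* batched increments cannot add continuations (no rollback) */
			err = -ENOMEM;
		}

		if (err)
			break;
	}
```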
On Wed, Sep 25, 2024 at 2:12 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > [..] > > > > > > Apparently __swap_duplicate() does not currently handle increasing the > > > > > > swap count for multiple swap entries by 1 (i.e. usage == 1) because it > > > > > > does not handle rolling back count increases when > > > > > > swap_count_continued() fails. > > > > > > > > > > > > I guess this voids my Reviewed-by until we sort this out. Technically > > > > > > swap_count_continued() won't ever be called for shmem because we only > > > > > > ever increment the count by 1, but there is no way to know this in > > > > > > __swap_duplicate() without SWAP_HAS_SHMEM. > > > > > > > > Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to > > > > remove the swapfile check (that's another can of worms, but I need > > > > data before submitting the patch to remove it...) > > > > > > > > One thing we can do is instead of warning here, we can handle it in > > > > the for loop check, where we have access to count - that's the point > > > > of having that for-loop check anyway? :) > > > > > > > > There's a couple of ways to go about it: > > > > > > > > 1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 ); > > > > > > Hmm that should work, although it's a bit complicated tbh. > > > > > > > (or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX)) > > > > > > I think this will make the warning very hard to hit if there's a > > > misuse of __swap_duplicate(). It will only be hit when an entry needs > > > count continuation, which I am not sure is very common. If there's a > > > bug, the warning will potentially catch it too late, if ever. > > > > > > The side effect here is failing to decrement the swap count of some > > > swap entries which will lead to them never being freed, essentially > > > leaking swap capacity slowly over time. I am not sure if there are > > > more detrimental effects. > > > > > > > > > > > 2. Alternatively, instead of warning here, we can simply return > > > > -ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since > > > > this MUST succeed. > > > > > > We still fail to rollback incremented counts though when we return > > > -ENOMEM, right? Maybe I didn't get what you mean. > > > > My understanding now is that there are two for loops. One for loop > > that checks the entry's states, and one for loop that does the actual > > incrementing work (or state modification). > > > > We can check in the first for loop, if it is safe to proceed: > > > > if (!count && !has_cache) { > > err = -ENOENT; > > } else if (usage == SWAP_HAS_CACHE) { > > if (has_cache) > > err = -EEXIST; > > } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) { > > err = -EINVAL; > > } else if (usage == 1 && nr > 1 && (count & ~COUNT_CONTINUED) >= > > SWAP_MAP_MAX)) { > > /* the batched variants currently do not support rollback */ > > err = -ENOMEM; > > } > > Hmm yeah I think something like this should work and is arguably > better than just warning, although this needs cleaning up: > - We already know usage != SWAP_HAS_CACHE, so no need to check if usage == 1. > - We already know (count & ~COUNT_CONTINUED) is larger than > SWAP_MAP_MAX, so we should check if it's equal to SWAP_MAP_MAX. > - We should probably just calculate count & ~COUNT_CONTINUED above the > if conditions at this point. > > I would also like to hear what Barry thinks since he added this (and I > just realized he is not CC'd). 
I am perfectly fine with the approach: in the first loop, if we find all
entries don't need CONTINUED, we can run the 2nd loop even for usage == 1
and nr > 1. This is almost always true for a real product, where anon folios
are unlikely to be fork-shared by so many processes. But we need to fall
back to iterating nr times if this really happens:

int swap_duplicate_nr(swp_entry_t entry, int nr)
{
        ....
        if (nr > 1 and ENOMEM) {
                for (nr entries) {
                        __swap_duplicate(entry, 1, 1);
                        entry = next_entry;
                }
        }

seems a bit ugly? Maybe we can keep the swap_duplicate(swp_entry_t entry)
there? Then avoid __swap_duplicate(entry, 1, 1)?

Thanks
Barry
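For what it's worth, a cleaned-up sketch of that fallback shape. It is hypothetical and untested; it assumes the single-entry swap_duplicate() is kept, as Barry suggests, and that swap_duplicate_nr() is the batched helper added by patch 1 of this series:

```c
int swap_duplicate_nr(swp_entry_t entry, int nr)
{
	int err, i;

	/* Fast path: bump all nr counts in one go under the cluster lock. */
	err = __swap_duplicate(entry, 1, nr);
	if (err != -ENOMEM)
		return err;

	/*
	 * At least one entry would need a count continuation, which the
	 * batched path refuses before modifying anything -- so no rollback
	 * is needed.  Fall back to the existing one-at-a-time helper,
	 * which can allocate continuations as it goes.
	 */
	for (i = 0; i < nr; i++) {
		err = swap_duplicate((swp_entry_t){ entry.val + i });
		if (err)
			break;
	}
	return err;
}
```

Whether carrying this fallback is worthwhile can, as discussed above, be deferred until a caller that actually batch-duplicates highly shared entries shows up.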
Barry Song <21cnbao@gmail.com> writes:

[snip]

> I am perfectly fine with the approach, in the first loop, if we find all entries
> don't need CONTINUED, we can run the 2nd loop even for usage==1
> and nr > 1. this is almost always true for a real product where anon folios
> are unlikely to be fork-shared by so many processes.

One possible use case is ksm. Where the map count could be large.

--
Best Regards,
Huang, Ying
On Wed, Sep 25, 2024 at 3:28 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> [snip]
>
> > I am perfectly fine with the approach, in the first loop, if we find all entries
> > don't need CONTINUED, we can run the 2nd loop even for usage==1
> > and nr > 1. this is almost always true for a real product where anon folios
> > are unlikely to be fork-shared by so many processes.
>
> One possible use case is ksm. Where the map count could be large.

Sorry, I overlooked the KSM case, but it seems this doesn't significantly
change the overall direction. :-) Since we can fall back to swap_duplicate()
and handle each entry one by one after the first loop returns -ENOMEM, KSM
isn't a concern either.

>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
On Mon, Sep 23, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote: > > The SWAP_MAP_SHMEM state was originally introduced in the commit > aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a > swap entry belongs to shmem during swapoff. > > However, swapoff has since been rewritten drastically in the commit > b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now > having swap count == SWAP_MAP_SHMEM value is basically the same as having > swap count == 1, and swap_shmem_alloc() behaves analogously to > swap_duplicate() > > This RFC proposes the removal of this state and the associated helper to > simplify the state machine (both mentally and code-wise). We will also > have an extra state/special value that can be repurposed (for swap entries > that never gets re-duplicated). > > Another motivation (albeit a bit premature at the moment) is the new swap > abstraction I am currently working on, that would allow for swap/zswap > decoupling, swapoff optimization, etc. The fewer states and swap API > functions there are, the simpler the conversion will be. > > I am sending this series first as an RFC, just in case I missed something > or misunderstood this state, or if someone has a swap optimization in mind > for shmem that would require this special state. I have the same patch sitting in a tree somewhere from when I tried working on swap abstraction, except then swap_shmem_alloc() did not take a 'nr' argument so I did not need swap_duplicate_nr(). I was going to send it out with other swap code cleanups I had, but I ended up deciding to do nothing. So for what it's worth I think this is correct: Reviewed-by: Yosry Ahmed <yosryahmed@google.com> > > Swap experts, let me know if I'm mistaken :) Otherwise if there is no > objection I will resend this patch series again for merging. > > Nhat Pham (2): > swapfile: add a batched variant for swap_duplicate() > swap: shmem: remove SWAP_MAP_SHMEM > > include/linux/swap.h | 16 ++++++++-------- > mm/shmem.c | 2 +- > mm/swapfile.c | 28 +++++++++------------------- > 3 files changed, 18 insertions(+), 28 deletions(-) > > > base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049 > -- > 2.43.5