This series addresses two issues and introduces a minor improvement in the zswap global shrinker:

1. Fix the memcg iteration logic that breaks iteration on offline memcgs.
2. Fix the error path that aborts on expected error codes.
3. Add proactive shrinking at 91% full, for a 90% accept threshold.

These patches need to be applied in this order to avoid potential loops caused by the first issue. Patch 3 can be applied independently, but the two issues must be resolved to ensure the shrinker can evict pages.

Previously, the zswap pool could be filled with old pages that the shrinker failed to evict, leading to zswap rejecting new pages. With this series applied, the shrinker will continue to evict pages until the pool reaches the accept_thr_percent threshold proactively, as documented, and maintain the pool to keep recent pages.

As a side effect of changes in the hysteresis logic, zswap will no longer reject pages under the max pool limit.

With this series, reclaims smaller than the proactive shrinking amount finish instantly and trigger background shrinking. Admins can check whether new pages are buffered by zswap by monitoring the pool_limit_hit counter.

Changes since v0:
mm: zswap: fix global shrinker memcg iteration
- Drop and reacquire the spinlock before skipping a memcg.
- Add comments to clarify the locking mechanism.
mm: zswap: proactive shrinking before pool size limit is hit
- Remove an unneeded check before scheduling work.
- Change the shrink start threshold to accept_thr_percent + 1%.

Shrinking now starts at accept_thr_percent + 1%. Previously, the threshold was at the midpoint between 100% and the accept threshold.

If a workload needs 10% of space to buffer the average reclaim amount, the previous patch required setting accept_thr_percent to 80%. For 50%, the threshold became 0%, which is neither acceptable nor clear for admins. We could use the accept percent as the shrink threshold directly, but then the shrinker would likely be called too frequently around the accept threshold, so a 1% minimum gap was added to the shrink threshold.

----

Takero Funaki (3):
  mm: zswap: fix global shrinker memcg iteration
  mm: zswap: fix global shrinker error handling logic
  mm: zswap: proactive shrinking before pool size limit is hit

 Documentation/admin-guide/mm/zswap.rst |  17 ++-
 mm/zswap.c                             | 172 ++++++++++++++++++-------
 2 files changed, 136 insertions(+), 53 deletions(-)

--
2.43.0
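To make the threshold arithmetic concrete, below is a minimal user-space sketch (not the actual mm/zswap.c change) of the check that decides when background shrinking should start: the pool has exceeded accept_thr_percent + 1% of the max pool size, so that the hysteresis settles back at accept_thr_percent. Function and variable names are illustrative.

#include <stdbool.h>
#include <stdio.h>

/*
 * Illustrative model of the proposed hysteresis, not the kernel code:
 * background shrinking starts once the pool exceeds
 * (accept_thr_percent + 1)% of the max pool size, and the shrinker
 * stops again once usage is back down to accept_thr_percent.
 */
static bool should_start_bg_shrink(unsigned long pool_pages,
                                   unsigned long max_pool_pages,
                                   unsigned int accept_thr_percent)
{
        unsigned long start_thr = max_pool_pages * (accept_thr_percent + 1) / 100;

        return pool_pages > start_thr;
}

int main(void)
{
        unsigned long max_pool = 100000; /* pages */

        /* with a 90% accept threshold, shrinking starts above 91% */
        printf("at 90%%: %d\n", should_start_bg_shrink(90000, max_pool, 90));
        printf("at 92%%: %d\n", should_start_bg_shrink(92000, max_pool, 90));
        return 0;
}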
On Sat, Jun 8, 2024 at 8:53 AM Takero Funaki <flintglass@gmail.com> wrote:
>
> This series addresses two issues and introduces a minor improvement in
> zswap global shrinker:
>
> 1. Fix the memcg iteration logic that breaks iteration on offline memcgs.
> 2. Fix the error path that aborts on expected error codes.
> 3. Add proactive shrinking at 91% full, for 90% accept threshold.
>

Taking a step back from the correctness conversation, could you include in the changelog of the patches and cover letter a realistic scenario, along with user-space-visible metrics that show (ideally all 4, but at least some of the following):

1. A user problem (that affects performance, usability, etc.) is happening.

2. The root cause is what we are trying to fix (e.g., in patch 1, we are skipping over memcgs unnecessarily in the global shrinker loop).

3. The fix alleviates the root cause in (2).

4. The userspace-visible problem goes away or is less serious.

I have already hinted at this in a previous response, but the global shrinker is rarely triggered in production. There are lots of factors that would prevent this from triggering:

1. The max zswap pool size is 20% of memory by default, which is a lot.

2. Swapfile size also limits the amount of data storable in the zswap pool.

3. Other cgroup constraints (memory.max, memory.zswap.max, memory.swap.max) also limit a cgroup's zswap usage.

I do agree that patch 1 at least is fixing a problem, and probably patch 2 too, but please justify why we are investing in the extra complexity to fix this problem in the first place.
2024年6月14日(金) 0:22 Nhat Pham <nphamcs@gmail.com>:
>
> Taking a step back from the correctness conversation, could you
> include in the changelog of the patches and cover letter a realistic
> scenario, along with user space-visible metrics that show (ideally all
> 4, but at least some of the following):
>
> 1. A user problem (that affects performance, or usability, etc.) is happening.
>
> 2. The root cause is what we are trying to fix (for e.g in patch 1, we
> are skipping over memcgs unnecessarily in the global shrinker loop).
>
> 3. The fix alleviates the root cause in b)
>
> 4. The userspace-visible problem goes away or is less serious.
>

Thank you for your suggestions. As a quick response before submitting v2:

1.
The visible issue is that pageout/in operations from active processes are slow when zswap is near its max pool size. This is particularly significant on small-memory systems, where total swap usage exceeds what zswap can store. This means that old pages occupy most of the zswap pool space, and recent pages use the swap disk directly.

2.
This issue is caused by zswap keeping the pool size near 100%. Since the shrinker fails to shrink the pool to accept_thr_percent and zswap rejects incoming pages, rejection occurs more frequently than it should. The rejected pages are written directly to disk while zswap protects old pages from eviction, leading to slow pageout/in performance for recent pages on the swap disk.

3.
If the pool size were shrunk proactively, rejection due to pool limit hits would be less likely. New incoming pages could be accepted as the pool gains some space in advance, while older pages are written back in the background. zswap would then be filled with recent pages, as expected from the LRU logic.

Patches 1 and 2 make the shrinker reduce the pool to accept_thr_percent. Patch 3 makes zswap_store trigger the shrinker before reaching the max pool size. With this series, zswap will prepare some space to reduce the probability of the problematic pool_limit_hit situation, thus reducing slow reclaim and the page priority inversion against the LRU.

4.
Once proactive shrinking reduces the pool size, pageouts complete instantly as long as the space prepared by shrinking can store the direct reclaim. If an admin sees a large pool_limit_hit, lowering accept_threshold_percent will improve active process performance.

> I have already hinted in a previous response, but global shrinker is
> rarely triggered in production. There are lots of factors that would
> prevent this from triggering:
>
> 1. Max zswap pool size 20% of memory by default, which is a lot.
>
> 2. Swapfile size also limits the size of the amount of data storable
> in the zswap pool.
>
> 3. Other cgroup constraints (memory.max, memory.zswap.max,
> memory.swap.max) also limit a cgroup's zswap usage.
>
> I do agree that patch 1 at least is fixing a problem, and probably
> patch 2 too but please justify why we are investing in the extra
> complexity to fix this problem in the first place.

Regarding the production workload, I believe we are facing different situations.

My intended workload is low-activity services distributed on small systems like t2.nano, with 0.5G to 1G of RAM. There are a significant number of pool_limit_hits, and the zswap pool usage sometimes stays near 100%, filled by background service processes.

When I evaluated zswap and zramswap, zswap performed well. I suppose this was because of its LRU. Once old pages occupied zramswap, there was no way to move pages from zramswap to the swap disk.
zswap could adapt to this situation by writing back old pages in LRU order. We could benefit from compressed swap storing recent pages and also utilize overcommit backed by a large swap on disk.

I think if a system has never seen a pool limit hit, there is no need to manage the costly LRU. In such cases, I will try zramswap.

I am developing these patches by testing with artificially allocated large memory to fill the zswap pool (emulating background services), and then invoking another allocation to trigger direct reclaim (emulating short-lived active processes). On real devices, I am seeing far fewer pool_limit_hits with a lower accept_thr_percent=50 and a higher max_pool_percent=25. Note that before patch 3, zswap_store() did not count some of the rejections as pool_limit_hit properly (it was underestimated), so we cannot compare the counter directly.

I will try to create a more reproducible, realistic benchmark, but it might not be realistic for your workload.
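For reference, a minimal sketch of the kind of artificial test described above, assuming nothing beyond standard C: one large allocation is touched page by page to fill memory and push cold data into zswap (the "background service"), then a second allocation forces direct reclaim (the "short-lived active process"). The sizes are illustrative and should be chosen relative to the machine's RAM; on a small instance this can trigger the OOM killer.

#include <stdio.h>
#include <stdlib.h>

/*
 * Touch every page of a freshly allocated region so the kernel must
 * actually back it with memory, eventually pushing cold pages into
 * zswap (and, past the pool limit, out to the swap disk).
 */
static char *alloc_and_touch(size_t bytes)
{
        char *buf = malloc(bytes);

        if (!buf)
                return NULL;
        for (size_t off = 0; off < bytes; off += 4096)  /* one write per 4 KiB page */
                buf[off] = (char)off;
        return buf;
}

int main(void)
{
        /* Illustrative sizes; pick them relative to the machine's RAM. */
        size_t background = 768UL << 20;        /* ~768 MiB of "cold" data */
        size_t burst = 128UL << 20;             /* short-lived allocation */

        char *cold = alloc_and_touch(background);
        char *hot = alloc_and_touch(burst);     /* should force direct reclaim */

        if (!cold || !hot) {
                fprintf(stderr, "allocation failed\n");
                return 1;
        }
        printf("allocations done; now inspect the zswap counters\n");
        free(hot);
        free(cold);
        return 0;
}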
On Thu, Jun 13, 2024 at 9:09 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> 2024年6月14日(金) 0:22 Nhat Pham <nphamcs@gmail.com>:
> >
> > Taking a step back from the correctness conversation, could you
> > include in the changelog of the patches and cover letter a realistic
> > scenario, along with user space-visible metrics that show (ideally all
> > 4, but at least some of the following):
> >
> > 1. A user problem (that affects performance, or usability, etc.) is happening.
> >
> > 2. The root cause is what we are trying to fix (for e.g in patch 1, we
> > are skipping over memcgs unnecessarily in the global shrinker loop).
> >
> > 3. The fix alleviates the root cause in b)
> >
> > 4. The userspace-visible problem goes away or is less serious.
> >
>
> Thank you for your suggestions.
> For quick response before submitting v2,

Thanks for all the info; this should be in the cover letter or commit messages in some shape or form.

> 1.
> The visible issue is that pageout/in operations from active processes
> are slow when zswap is near its max pool size. This is particularly
> significant on small memory systems, where total swap usage exceeds
> what zswap can store. This means that old pages occupy most of the
> zswap pool space, and recent pages use swap disk directly.

This should be a transient state though, right? Once the shrinker kicks in, it should write back the old pages and make space for the hot ones. Which takes us to our next point.

> 2.
> This issue is caused by zswap keeping the pool size near 100%. Since
> the shrinker fails to shrink the pool to accept_thr_percent and zswap
> rejects incoming pages, rejection occurs more frequently than it
> should. The rejected pages are directly written to disk while zswap
> protects old pages from eviction, leading to slow pageout/in
> performance for recent pages on the swap disk.

Why is the shrinker failing? IIUC the first two patches fix two cases where the shrinker stumbles upon offline memcgs, or memcgs with no zswapped pages. Are these cases common enough in your use case that every single time the shrinker runs it hits MAX_RECLAIM_RETRIES before putting the zswap usage below accept_thr_percent?

This would be surprising given that we should be restarting the shrinker with every swapout attempt until we can accept pages again.

I guess one could construct a malicious case where there are some sticky offline memcgs, and all the memcgs that actually have zswap pages come after them in the iteration order.

Could you shed more light on this? What does the setup look like? How many memcgs are there, how many of them use zswap, and how many offline memcgs are you observing?

I am not saying we shouldn't fix these problems anyway, I am just trying to understand how we got into this situation to begin with.

> 3.
> If the pool size were shrunk proactively, rejection by pool limit hits
> would be less likely. New incoming pages could be accepted as the pool
> gains some space in advance, while older pages are written back in the
> background. zswap would then be filled with recent pages, as expected
> in the LRU logic.

I suspect that if patches 1 and 2 fix your problem, the shrinker invoked from reclaim should be doing this sort of "proactive shrinking".

I agree that the current hysteresis around accept_thr_percent is not good enough, but I am surprised you are hitting the pool limit if the shrinker is being run during reclaim.

> Patch 1 and 2 make the shrinker reduce the pool to accept_thr_percent.
> Patch 3 makes zswap_store trigger the shrinker before reaching the max
> pool size. With this series, zswap will prepare some space to reduce
> the probability of problematic pool_limit_hit situation, thus reducing
> slow reclaim and the page priority inversion against LRU.
>
> 4.
> Once proactive shrinking reduces the pool size, pageouts complete
> instantly as long as the space prepared by shrinking can store the
> direct reclaim. If an admin sees a large pool_limit_hit, lowering
> accept_threshold_percent will improve active process performance.

I agree that proactive shrinking is preferable to waiting until we hit the pool limit, then stopping taking in pages until the acceptance threshold. I am just trying to understand whether such a proactive shrinking mechanism will be needed if the reclaim shrinker for zswap is being used, and how the two would work together.
Hello,
Sorry for the late reply. I am currently investigating a
responsiveness issue I found while benchmarking with this series,
possibly related to concurrent zswap writeback and pageouts.
This series cannot be applied until the root cause is identified,
unfortunately. Thank you all for taking the time to review.
The responsiveness issue was confirmed with 6.10-rc2 with all 3
patches applied. Without patch 3, it still happens but is less likely.
When allocating much more memory than zswap can buffer, and writeback and rejection due to pool_limit_hit happen simultaneously, the system stops responding. I do not see this freeze when zswap is disabled or when there is no pool_limit_hit. The proactive shrinking itself seems to work as expected as long as writeback and pageout do not occur simultaneously.

I suspect this issue exists in the current code but was not visible without this series, since the global shrinker did not write back a considerable amount of pages.
2024年6月15日(土) 7:48 Nhat Pham <nphamcs@gmail.com>:
>
> BTW, I'm curious. Have you experimented with increasing the pool size?
> That 20% number is plenty for our use cases, but maybe yours need a
> different cap?
>
We could probably allocate a somewhat larger zswap pool, but that would keep more old pages once the pool limit is hit. If we can ensure no pool limit hits and zero writeback by allocating more memory, I will try the same amount of zramswap.
> Also, have you experimented with the dynamic zswap shrinker? :) I'm
> actually curious how it works out in the small machine regime, with
> whatever workload you are running.
>
It seems the dynamic shrinker is trying to evict all pages. That does not fit my use case, which prefers balanced swap-in and swap-out performance.
2024年6月15日(土) 9:20 Yosry Ahmed <yosryahmed@google.com>:
> >
> > 1.
> > The visible issue is that pageout/in operations from active processes
> > are slow when zswap is near its max pool size. This is particularly
> > significant on small memory systems, where total swap usage exceeds
> > what zswap can store. This means that old pages occupy most of the
> > zswap pool space, and recent pages use swap disk directly.
>
> This should be a transient state though, right? Once the shrinker
> kicks in it should writeback the old pages and make space for the hot
> ones. Which takes us to our next point.
>
> >
> > 2.
> > This issue is caused by zswap keeping the pool size near 100%. Since
> > the shrinker fails to shrink the pool to accept_thr_percent and zswap
> > rejects incoming pages, rejection occurs more frequently than it
> > should. The rejected pages are directly written to disk while zswap
> > protects old pages from eviction, leading to slow pageout/in
> > performance for recent pages on the swap disk.
>
> Why is the shrinker failing? IIUC the first two patches fixes two
> cases where the shrinker stumbles upon offline memcgs, or memcgs with
> no zswapped pages. Are these cases common enough in your use case that
> every single time the shrinker runs it hits MAX_RECLAIM_RETRIES before
> putting the zswap usage below accept_thr_percent?
>
> This would be surprising given that we should be restarting the
> shrinker with every swapout attempt until we can accept pages again.
>
> I guess one could construct a malicious case where there are some
> sticky offline memcgs, and all the memcgs that actually have zswap
> pages come after it in the iteration order.
>
> Could you shed more light about this? What does the setup look like?
> How many memcgs there are, how many of them use zswap, and how many
> offline memcgs are you observing?
>
Example from ubuntu 22.04 using zswap:
root@ctl:~# find /sys/fs/cgroup/ -wholename
\*service/memory.zswap.current | xargs grep . | wc
31 31 2557
root@ctl:~# find /sys/fs/cgroup/ -wholename
\*service/memory.zswap.current | xargs grep ^0 | wc
11 11 911
This indicates 11 out of 31 services have no pages in zswap. Without patch 2, shrink_worker() aborts shrinking in the second tree walk, before evicting about 40 pages from those services. The number varies, but I think it is common to see a few memcgs that have no zswap pages.
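As a rough user-space model of this failure mode (not the actual shrink_worker() code), the sketch below shows why counting "this memcg has nothing in zswap" as a reclaim failure matters: a handful of empty memcgs can exhaust the retry budget before the walk reaches the memcgs that do hold pages, whereas skipping them lets eviction proceed. The retry limit and page counts are illustrative.

#include <stdio.h>

#define MAX_RECLAIM_RETRIES 4   /* illustrative retry budget */

/* Each entry models the number of zswap pages held by one memcg. */
static int memcg_zswap_pages[] = { 0, 0, 0, 0, 40, 25 };
#define NR_MEMCGS (int)(sizeof(memcg_zswap_pages) / sizeof(memcg_zswap_pages[0]))

/*
 * Walk the memcgs once, evicting one page per visited memcg.
 * If count_empty_as_failure is set (modeling the behavior before the
 * fix), empty memcgs burn the retry budget and the walk aborts before
 * reaching the memcgs that actually hold pages.
 */
static int shrink_once(int count_empty_as_failure)
{
        int failures = 0, evicted = 0;

        for (int i = 0; i < NR_MEMCGS; i++) {
                if (memcg_zswap_pages[i] == 0) {
                        if (count_empty_as_failure && ++failures >= MAX_RECLAIM_RETRIES)
                                break;  /* abort: budget spent on empty memcgs */
                        continue;       /* fixed behavior: just skip it */
                }
                memcg_zswap_pages[i]--;
                evicted++;
        }
        return evicted;
}

int main(void)
{
        printf("evicted (abort on empty memcgs): %d\n", shrink_once(1));
        printf("evicted (skip empty memcgs):     %d\n", shrink_once(0));
        return 0;
}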
> I am not saying we shouldn't fix these problems anyway, I am just
> trying to understand how we got into this situation to begin with.
>
> >
> > 3.
> > If the pool size were shrunk proactively, rejection by pool limit hits
> > would be less likely. New incoming pages could be accepted as the pool
> > gains some space in advance, while older pages are written back in the
> > background. zswap would then be filled with recent pages, as expected
> > in the LRU logic.
>
> I suspect if patches 1 and 2 fix your problem, the shrinker invoked
> from reclaim should be doing this sort of "proactive shrinking".
>
> I agree that the current hysteresis around accept_thr_percent is not
> good enough, but I am surprised you are hitting the pool limit if the
> shrinker is being run during reclaim.
>
> >
> > Patch 1 and 2 make the shrinker reduce the pool to accept_thr_percent.
> > Patch 3 makes zswap_store trigger the shrinker before reaching the max
> > pool size. With this series, zswap will prepare some space to reduce
> > the probability of problematic pool_limit_hit situation, thus reducing
> > slow reclaim and the page priority inversion against LRU.
> >
> > 4.
> > Once proactive shrinking reduces the pool size, pageouts complete
> > instantly as long as the space prepared by shrinking can store the
> > direct reclaim. If an admin sees a large pool_limit_hit, lowering
> > accept_threshold_percent will improve active process performance.
>
> I agree that proactive shrinking is preferable to waiting until we hit
> pool limit, then stop taking in pages until the acceptance threshold.
> I am just trying to understand whether such a proactive shrinking
> mechanism will be needed if the reclaim shrinker for zswap is being
> used, how the two would work together.
For my workload, the dynamic shrinker (reclaim shrinker) is disabled.
The proposed global shrinker and the existing dynamic shrinker are
both proactive, but their goals are different.
The global shrinker starts shrinking when the zswap pool exceeds
accept_thr_percent + 1%, then stops when it reaches
accept_thr_percent. Pages below accept_thr_percent are protected from
shrinking.
The dynamic shrinker starts shrinking based on memory pressure, regardless of the zswap pool size, and stops when the LRU size has been reduced to 1/4. Its goal is to wipe out all pages from zswap; it optimizes for swapout performance only.
I think the current LRU logic decreases nr_zswap_protected too quickly
for my workload. In zswap_lru_add(), nr_zswap_protected is reduced to
between 1/4 and 1/8 of the LRU size. Although zswap_folio_swapin()
increments nr_zswap_protected when page-ins of evicted pages occur
later, this technically has no effect while reclaim is in progress.
While zswap_store() and zswap_lru_add() are called, the dynamic
shrinker is likely running due to the pressure. The dynamic shrinker
reduces the LRU size to 1/4, and then a few subsequent zswap_store()
calls reduce the protected count to 1/4 of the LRU size. The stored
pages will be reduced to zero through a few shrinker_scan calls.
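To illustrate the decay described above, here is a small user-space model (illustrative only, not the kernel's zswap_lru_add()): the protected count is incremented per stored entry and halved whenever it exceeds a quarter of the LRU size, so under a steady stream of stores with no swapins it settles between roughly 1/8 and 1/4 of the LRU size.

#include <stdio.h>

/*
 * Illustrative model of the protection decay: the protected count is
 * incremented per stored entry and halved whenever it exceeds 1/4 of
 * the LRU size, so it settles between 1/8 and 1/4 of the LRU size.
 */
static long protect_on_store(long nr_protected, long lru_size)
{
        nr_protected++;
        if (nr_protected > lru_size / 4)
                nr_protected /= 2;
        return nr_protected;
}

int main(void)
{
        long lru_size = 0, nr_protected = 0;

        /* simulate a burst of zswap stores with no swapins in between */
        for (int i = 0; i < 10000; i++) {
                lru_size++;
                nr_protected = protect_on_store(nr_protected, lru_size);
        }
        printf("lru_size=%ld nr_protected=%ld (~%.1f%% of LRU)\n",
               lru_size, nr_protected, 100.0 * nr_protected / lru_size);
        return 0;
}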
On Wed, Jun 19, 2024 at 6:03 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> Hello,
>
> Sorry for the late reply. I am currently investigating a responsiveness issue I found while benchmarking with this series, possibly related to concurrent zswap writeback and pageouts.
>
> This series cannot be applied until the root cause is identified, unfortunately. Thank you all for taking the time to review.
>
> The responsiveness issue was confirmed with 6.10-rc2 with all 3 patches applied. Without patch 3, it still happens but is less likely.
>
> When allocating much larger memory than zswap can buffer, and writeback and rejection by pool_limit_hit happen simultaneously, the system stops responding. I do not see this freeze when zswap is disabled or when there is no pool_limit_hit. The proactive shrinking itself seems to work as expected as long as the writeback and pageout do not occur simultaneously.
>
> I suspect this issue exists in current code but was not visible without this series since the global shrinker did not writeback considerable amount of pages.
>
> 2024年6月15日(土) 7:48 Nhat Pham <nphamcs@gmail.com>:
> >
> > BTW, I'm curious. Have you experimented with increasing the pool size?
> > That 20% number is plenty for our use cases, but maybe yours need a
> > different cap?
> >
>
> Probably we can allocate a bit more zswap pool size. But that will keep more old pages once the pool limit is hit. If we can ensure no pool limit hits and zero writeback by allocating more memory, I will try the same amount of zramswap.
>
> > Also, have you experimented with the dynamic zswap shrinker? :) I'm
> > actually curious how it works out in the small machine regime, with
> > whatever workload you are running.
> >
>
> It seems the dynamic shrinker is trying to evict all pages. That does not fit to my use case that prefer balanced swapin and swapout performance

Hmm, not quite. As you have noted earlier, it (tries to) shrink the unprotected pages only.

> 2024年6月15日(土) 9:20 Yosry Ahmed <yosryahmed@google.com>:
> > >
> > > 1.
> > > The visible issue is that pageout/in operations from active processes
> > > are slow when zswap is near its max pool size. This is particularly
> > > significant on small memory systems, where total swap usage exceeds
> > > what zswap can store. This means that old pages occupy most of the
> > > zswap pool space, and recent pages use swap disk directly.
> >
> > This should be a transient state though, right? Once the shrinker
> > kicks in it should writeback the old pages and make space for the hot
> > ones. Which takes us to our next point.
> >
> > >
> > > 2.
> > > This issue is caused by zswap keeping the pool size near 100%. Since
> > > the shrinker fails to shrink the pool to accept_thr_percent and zswap
> > > rejects incoming pages, rejection occurs more frequently than it
> > > should. The rejected pages are directly written to disk while zswap
> > > protects old pages from eviction, leading to slow pageout/in
> > > performance for recent pages on the swap disk.
> >
> > Why is the shrinker failing? IIUC the first two patches fixes two
> > cases where the shrinker stumbles upon offline memcgs, or memcgs with
> > no zswapped pages. Are these cases common enough in your use case that
> > every single time the shrinker runs it hits MAX_RECLAIM_RETRIES before
> > putting the zswap usage below accept_thr_percent?
> >
> > This would be surprising given that we should be restarting the
> > shrinker with every swapout attempt until we can accept pages again.
> >
> > I guess one could construct a malicious case where there are some
> > sticky offline memcgs, and all the memcgs that actually have zswap
> > pages come after it in the iteration order.
> >
> > Could you shed more light about this? What does the setup look like?
> > How many memcgs there are, how many of them use zswap, and how many
> > offline memcgs are you observing?
> >
>
> Example from ubuntu 22.04 using zswap:
>
> root@ctl:~# find /sys/fs/cgroup/ -wholename \*service/memory.zswap.current | xargs grep . | wc
>      31      31    2557
> root@ctl:~# find /sys/fs/cgroup/ -wholename \*service/memory.zswap.current | xargs grep ^0 | wc
>      11      11     911
>
> This indicates 11 out of 31 services have no pages in zswap. Without patch 2, shrink_worker() aborts shrinking in the second tree walk, before evicting about 40 pages from the services. The number varies, but I think it is common to see a few memcg that has no zswap pages
>
> > I am not saying we shouldn't fix these problems anyway, I am just
> > trying to understand how we got into this situation to begin with.
> >
> > >
> > > 3.
> > > If the pool size were shrunk proactively, rejection by pool limit hits
> > > would be less likely. New incoming pages could be accepted as the pool
> > > gains some space in advance, while older pages are written back in the
> > > background. zswap would then be filled with recent pages, as expected
> > > in the LRU logic.
> >
> > I suspect if patches 1 and 2 fix your problem, the shrinker invoked
> > from reclaim should be doing this sort of "proactive shrinking".
> >
> > I agree that the current hysteresis around accept_thr_percent is not
> > good enough, but I am surprised you are hitting the pool limit if the
> > shrinker is being run during reclaim.
> >
> > >
> > > Patch 1 and 2 make the shrinker reduce the pool to accept_thr_percent.
> > > Patch 3 makes zswap_store trigger the shrinker before reaching the max
> > > pool size. With this series, zswap will prepare some space to reduce
> > > the probability of problematic pool_limit_hit situation, thus reducing
> > > slow reclaim and the page priority inversion against LRU.
> > >
> > > 4.
> > > Once proactive shrinking reduces the pool size, pageouts complete
> > > instantly as long as the space prepared by shrinking can store the
> > > direct reclaim. If an admin sees a large pool_limit_hit, lowering
> > > accept_threshold_percent will improve active process performance.
> >
> > I agree that proactive shrinking is preferable to waiting until we hit
> > pool limit, then stop taking in pages until the acceptance threshold.
> > I am just trying to understand whether such a proactive shrinking
> > mechanism will be needed if the reclaim shrinker for zswap is being
> > used, how the two would work together.
>
> For my workload, the dynamic shrinker (reclaim shrinker) is disabled. The proposed global shrinker and the existing dynamic shrinker are both proactive, but their goals are different.
>
> The global shrinker starts shrinking when the zswap pool exceeds accept_thr_percent + 1%, then stops when it reaches accept_thr_percent. Pages below accept_thr_percent are protected from shrinking.
>
> The dynamic shrinker starts shrinking based on memory pressure regardless of the zswap pool size, and stops when the LRU size is reduced to 1/4. Its goal is to wipe out all pages from zswap. It prefers swapout performance only.
>
> I think the current LRU logic decreases nr_zswap_protected too quickly for my workload. In zswap_lru_add(), nr_zswap_protected is reduced to between 1/4 and 1/8 of the LRU size. Although zswap_folio_swapin() increments nr_zswap_protected when page-ins of evicted pages occur later, this technically has no effect while reclaim is in progress.
>
> While zswap_store() and zswap_lru_add() are called, the dynamic shrinker is likely running due to the pressure. The dynamic shrinker reduces the LRU size to 1/4, and then a few subsequent zswap_store() calls reduce the protected count to 1/4 of the LRU size. The stored pages will be reduced to zero through a few shrinker_scan calls.

Ah, this is a fair point. We've been observing this in production/experiments as well - there seems to be a positive correlation between the zswpout rate and the zswap_written_back rate. Whenever there's a spike in zswpout, you also see a spike in written-back pages - it looks like the flood of zswpouts weakens zswap's LRU protection, which is not quite the intended effect.

We're working to improve this situation. We have a couple of ideas floating around, none of which are too complicated to implement, but they need experiments to validate before sending upstream :)
> My intended workload is low-activity services distributed on small
> system like t2.nano, with 0.5G to 1G of RAM. There are a significant
> number of pool_limit_hits and the zswap pool usage sometimes stays
> near 100% filled by background service processes.
>

BTW, I'm curious. Have you experimented with increasing the pool size? That 20% number is plenty for our use cases, but maybe yours needs a different cap?

Also, have you experimented with the dynamic zswap shrinker? :) I'm actually curious how it works out in the small-machine regime, with whatever workload you are running.

But yeah, I'd imagine either way zswap would be better than zram for this particular scenario, as long as you have a swapfile, some proportion of your anonymous memory is cold, and/or you can tolerate a certain amount of latency.
On Thu, Jun 13, 2024 at 9:09 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> 2024年6月14日(金) 0:22 Nhat Pham <nphamcs@gmail.com>:
> >
> > Taking a step back from the correctness conversation, could you
> > include in the changelog of the patches and cover letter a realistic
> > scenario, along with user space-visible metrics that show (ideally all
> > 4, but at least some of the following):
> >
> > 1. A user problem (that affects performance, or usability, etc.) is happening.
> >
> > 2. The root cause is what we are trying to fix (for e.g in patch 1, we
> > are skipping over memcgs unnecessarily in the global shrinker loop).
> >
> > 3. The fix alleviates the root cause in b)
> >
> > 4. The userspace-visible problem goes away or is less serious.
> >
>
> Thank you for your suggestions.
> For quick response before submitting v2,
>
> 1.
> The visible issue is that pageout/in operations from active processes
> are slow when zswap is near its max pool size. This is particularly
> significant on small memory systems, where total swap usage exceeds
> what zswap can store. This means that old pages occupy most of the
> zswap pool space, and recent pages use swap disk directly.

Makes sense. You could probably check pswpin stats etc. to verify this, or maybe use some sort of tracing to measure the performance impact?

> 2.
> This issue is caused by zswap keeping the pool size near 100%. Since
> the shrinker fails to shrink the pool to accept_thr_percent and zswap
> rejects incoming pages, rejection occurs more frequently than it
> should. The rejected pages are directly written to disk while zswap
> protects old pages from eviction, leading to slow pageout/in
> performance for recent pages on the swap disk.

Yeah, this makes sense. The hysteresis heuristic is broken.

> 3.
> If the pool size were shrunk proactively, rejection by pool limit hits
> would be less likely. New incoming pages could be accepted as the pool
> gains some space in advance, while older pages are written back in the
> background. zswap would then be filled with recent pages, as expected
> in the LRU logic.
>
> Patch 1 and 2 make the shrinker reduce the pool to accept_thr_percent.
> Patch 3 makes zswap_store trigger the shrinker before reaching the max
> pool size. With this series, zswap will prepare some space to reduce
> the probability of problematic pool_limit_hit situation, thus reducing
> slow reclaim and the page priority inversion against LRU.

Makes sense, but do include numbers to back up your claims if you have them!

> 4.
> Once proactive shrinking reduces the pool size, pageouts complete
> instantly as long as the space prepared by shrinking can store the
> direct reclaim. If an admin sees a large pool_limit_hit, lowering
> accept_threshold_percent will improve active process performance.
>

This makes sense at a glance. Thank you for writing all of this, and please include actual numbers from a benchmark to show these in your v2, if you have them :)

> Regarding the production workload, I believe we are facing different situations.
>
> My intended workload is low-activity services distributed on small
> system like t2.nano, with 0.5G to 1G of RAM. There are a significant
> number of pool_limit_hits and the zswap pool usage sometimes stays
> near 100% filled by background service processes.
>
> When I evaluated zswap and zramswap, zswap performed well. I suppose
> this was because of its LRU. Once old pages occupied zramswap, there
> was no way to move pages from zramswap to the swap disk. zswap could
> adapt to this situation by writing back old pages in LRU order. We
> could benefit from compressed swap storing recent pages and also
> utilize overcommit backed by a large swap on disk.
>
> I think if a system has never seen a pool limit hit, there is no need
> to manage the costly LRU. In such cases, I will try zramswap.
>
> I am developing these patches by testing with artificially allocating
> large memory to fill the zswap pool (emulating background services),
> and then invoking another allocation to trigger direct reclaim
> (emulating short-live active process).
> On real devices, seeing much less pool_limit_hit by lower
> accept_thr_percent=50 and higher max_pool_percent=25.
> Note that before patch 3, zswap_store() did not count some of the
> rejection as pool_limit_hit properly (underestimated). we cannot
> compare the counter directly.

Sounds good! Did you see a performance improvement from the lower pool_limit_hit etc.?

> I will try to create a more reproducible realistic benchmark, but it
> might not be realistic for your workload.

Your approach is not interfering with our workload - we don't encounter the global shrinker in production, ever. That does not mean I will not consider your code though - this is everyone's kernel, not just Meta's kernel :)

What I was concerned with was whether there were any real workloads at all that benefit from it - as long as there are (which there seem to be, based on your description), and the approach does not hurt other cases, I'm happy to review it for merging :)

And as I have stated before, I personally believe the global shrinker needs to change. I just do not have the means to verify that any ideas/improvements actually work in a realistic scenario (since our production system does not encounter the global shrinker).

Thank you for picking up the work!
On Sat, Jun 8, 2024 at 8:53 AM Takero Funaki <flintglass@gmail.com> wrote:
>
> This series addresses two issues and introduces a minor improvement in
> zswap global shrinker:

By the way, what is your current setup? This global shrinker loop should only be run when the global pool limit is hit. That *never* happens to us in production, even with the zswap shrinker disabled.

The default pool limit is 20% of memory, which is quite a lot, especially if anonymous memory is well-compressed and/or has a lot of zero pages (which do not count towards the limit).

> 1. Fix the memcg iteration logic that breaks iteration on offline memcgs.
> 2. Fix the error path that aborts on expected error codes.
> 3. Add proactive shrinking at 91% full, for 90% accept threshold.
>
> These patches need to be applied in this order to avoid potential loops
> caused by the first issue. Patch 3 can be applied independently, but the
> two issues must be resolved to ensure the shrinker can evict pages.
>
> Previously, the zswap pool could be filled with old pages that the
> shrinker failed to evict, leading to zswap rejecting new pages. With
> this series applied, the shrinker will continue to evict pages until the
> pool reaches the accept_thr_percent threshold proactively, as
> documented, and maintain the pool to keep recent pages.
>
> As a side effect of changes in the hysteresis logic, zswap will no
> longer reject pages under the max pool limit.
>
> With this series, reclaims smaller than the proative shrinking amount
> finish instantly and trigger background shrinking. Admins can check if
> new pages are buffered by zswap by monitoring the pool_limit_hit
> counter.
>
> Changes since v0:
> mm: zswap: fix global shrinker memcg iteration
> - Drop and reacquire spinlock before skipping a memcg.
> - Add some comment to clarify the locking mechanism.
> mm: zswap: proactive shrinking before pool size limit is hit
> - Remove unneeded check before scheduling work.
> - Change shrink start threshold to accept_thr_percent + 1%.
>
> Now it starts shrinking at accept_thr_percent + 1%. Previously, the
> threshold was at the midpoint of 100% to accept_threshold.
>
> If a workload needs 10% space to buffer the average reclaim amount, with
> the previous patch, it required setting the accept_thr_percent to 80%.
> For 50%, it became 0%, which is not acceptable and unclear for admins.
> We can use the accept percent as the shrink threshold directly but that
> sounds shrinker is called too frequently around the accept threshold. I
> added 1% as a minimum gap to the shrink threshold.
>
> ----
>
> Takero Funaki (3):
>   mm: zswap: fix global shrinker memcg iteration
>   mm: zswap: fix global shrinker error handling logic
>   mm: zswap: proactive shrinking before pool size limit is hit
>
>  Documentation/admin-guide/mm/zswap.rst |  17 ++-
>  mm/zswap.c                             | 172 ++++++++++++++++++-------
>  2 files changed, 136 insertions(+), 53 deletions(-)
>
> --
> 2.43.0
>