Documentation/admin-guide/cgroup-v2.rst | 17 +++++++++++------ Documentation/admin-guide/mm/zswap.rst | 9 ++++++++- include/linux/memcontrol.h | 3 ++- include/linux/zswap.h | 6 ++++++ mm/memcontrol.c | 24 +++++++++++++++++------- mm/zswap.c | 9 +++++++++ 6 files changed, 53 insertions(+), 15 deletions(-)
Improve the inheritance behavior of the `memory.zswap.writeback` cgroup
attribute introduced during the 6.11 cycle. Specifically, in 6.11 we
walk the parent cgroups until we find a _disabled_ writeback, which does
not allow the user to selectively enable zswap writeback while having it
disabled by default.
Instead, introduce a third value for the `memory.zswap.writeback` cgroup
attribute meaning "follow the parent", and use it as the default value
for all cgroups. Additionally, introduce a `zswap.writeback_disable`
module parameter which is used as the "parent" of the root cgroup,
thus making it the global default.
Reads from `memory.zswap.writeback` cgroup attribute will be coerced to
0 or 1 to maintain backwards compatilibity. However, it is possible to
write -1 to that attribute to make the cgroup follow the parent again.
Fixes: e39925734909 ("mm/memcontrol: respect zswap.writeback setting from parent cg too")
Fixes: 501a06fe8e4c ("zswap: memcontrol: implement zswap writeback disabling")
Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name>
---
Documentation/admin-guide/cgroup-v2.rst | 17 +++++++++++------
Documentation/admin-guide/mm/zswap.rst | 9 ++++++++-
include/linux/memcontrol.h | 3 ++-
include/linux/zswap.h | 6 ++++++
mm/memcontrol.c | 24 +++++++++++++++++-------
mm/zswap.c | 9 +++++++++
6 files changed, 53 insertions(+), 15 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 95c18bc17083..eea580490679 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1717,10 +1717,12 @@ The following nested keys are defined.
entries fault back in or are written out to disk.
memory.zswap.writeback
- A read-write single value file. The default value is "1".
- Note that this setting is hierarchical, i.e. the writeback would be
- implicitly disabled for child cgroups if the upper hierarchy
- does so.
+ A read-write single value file. The default is to follow the parent
+ cgroup configuration, and the root cgroup follows the global
+ ``zswap.writeback_enabled`` module parameter (which is 1 by default).
+ Thus, this setting is hierarchical, i.e. the writeback setting for
+ a child cgroup can be implicitly controlled at runtime by changing
+ any parent value or the global module parameter.
When this is set to 0, all swapping attempts to swapping devices
are disabled. This included both zswap writebacks, and swapping due
@@ -1729,8 +1731,11 @@ The following nested keys are defined.
reclaim inefficiency after disabling writeback (because the same
pages might be rejected again and again).
- Note that this is subtly different from setting memory.swap.max to
- 0, as it still allows for pages to be written to the zswap pool.
+ Note that this is different from setting memory.swap.max to 0,
+ as it still allows for pages to be written to the zswap pool.
+
+ This can also be set to -1, which would make the cgroup (and its
+ future children) follow the parent/global value again.
memory.pressure
A read-only nested-keyed file.
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 3598dcd7dbe7..20267b8893db 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -126,10 +126,17 @@ Setting this parameter to 100 will disable the hysteresis.
Some users cannot tolerate the swapping that comes with zswap store failures
and zswap writebacks. Swapping can be disabled entirely (without disabling
-zswap itself) on a cgroup-basis as follows::
+zswap itself) either globally, using the ``writeback_enabled`` sysfs attribute,
+or on a per-cgroup basis, e.g.::
+
+ echo 0 > /sys/module/zswap/parameters/writeback_enabled
echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback
+All cgroups follow (i.e. dynamically inherit) the parent configuration
+by default, and the root cgroup follows the module parameter (which can
+thus be considered the global default).
+
Note that if the store failures are recurring (for e.g if the pages are
incompressible), users can observe reclaim inefficiency after disabling
writeback (because the same pages might be rejected again and again).
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0e5bf25d324f..ca0510057040 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -202,8 +202,9 @@ struct mem_cgroup {
/*
* Prevent pages from this memcg from being written back from zswap to
* swap, and from being swapped out on zswap store failures.
+ * (< 0: follow the parent/global default)
*/
- bool zswap_writeback;
+ int zswap_writeback;
#endif
/* vmpressure notifications */
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 6cecb4a4f68b..7d121fdb3521 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -37,6 +37,7 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
void zswap_folio_swapin(struct folio *folio);
bool zswap_is_enabled(void);
bool zswap_never_enabled(void);
+bool zswap_writeback_is_enabled(void);
#else
struct zswap_lruvec_state {};
@@ -71,6 +72,11 @@ static inline bool zswap_never_enabled(void)
return true;
}
+static inline bool zswap_writeback_is_enabled(void)
+{
+ return true;
+}
+
#endif
#endif /* _LINUX_ZSWAP_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d563fb515766..1e0aca42e5a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3613,7 +3613,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
memcg1_soft_limit_reset(memcg);
#ifdef CONFIG_ZSWAP
memcg->zswap_max = PAGE_COUNTER_MAX;
- WRITE_ONCE(memcg->zswap_writeback, true);
+ WRITE_ONCE(memcg->zswap_writeback, -1);
#endif
page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
if (parent) {
@@ -5318,15 +5318,25 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
{
+ int memcg_zswap_writeback;
+
/* if zswap is disabled, do not block pages going to the swapping device */
if (!zswap_is_enabled())
return true;
- for (; memcg; memcg = parent_mem_cgroup(memcg))
- if (!READ_ONCE(memcg->zswap_writeback))
- return false;
+ /*
+ * -1 means "follow the parent" (root cgroup follows the global default).
+ * Walk cgroups until we find something, otherwise return the global default.
+ */
+ for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+ memcg_zswap_writeback = READ_ONCE(memcg->zswap_writeback);
+ if (memcg_zswap_writeback >= 0)
+ goto found;
+ }
+ return zswap_writeback_is_enabled();
- return true;
+found:
+ return !!memcg_zswap_writeback;
}
static u64 zswap_current_read(struct cgroup_subsys_state *css,
@@ -5365,7 +5375,7 @@ static int zswap_writeback_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
- seq_printf(m, "%d\n", READ_ONCE(memcg->zswap_writeback));
+ seq_printf(m, "%d\n", mem_cgroup_zswap_writeback_enabled(memcg));
return 0;
}
@@ -5379,7 +5389,7 @@ static ssize_t zswap_writeback_write(struct kernfs_open_file *of,
if (parse_ret)
return parse_ret;
- if (zswap_writeback != 0 && zswap_writeback != 1)
+ if (zswap_writeback < -1 || zswap_writeback > 1)
return -EINVAL;
WRITE_ONCE(memcg->zswap_writeback, zswap_writeback);
diff --git a/mm/zswap.c b/mm/zswap.c
index adeaf9c97fde..724d8a02d61c 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -129,6 +129,10 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
+/* Enable/disable zswap writeback globally */
+static bool zswap_writeback_enabled = true;
+module_param_named(writeback_enabled, zswap_writeback_enabled, bool, 0644);
+
bool zswap_is_enabled(void)
{
return zswap_enabled;
@@ -139,6 +143,11 @@ bool zswap_never_enabled(void)
return !static_branch_maybe(CONFIG_ZSWAP_DEFAULT_ON, &zswap_ever_enabled);
}
+bool zswap_writeback_is_enabled(void)
+{
+ return zswap_writeback_enabled;
+}
+
/*********************************
* data structures
**********************************/
--
2.46.1.5.g6e13a7cc9d
On Thu, Sep 26, 2024 at 3:55 PM Ivan Shapovalov <intelfx@intelfx.name> wrote: > > Improve the inheritance behavior of the `memory.zswap.writeback` cgroup > attribute introduced during the 6.11 cycle. Specifically, in 6.11 we > walk the parent cgroups until we find a _disabled_ writeback, which does > not allow the user to selectively enable zswap writeback while having it > disabled by default. > > Instead, introduce a third value for the `memory.zswap.writeback` cgroup > attribute meaning "follow the parent", and use it as the default value > for all cgroups. Additionally, introduce a `zswap.writeback_disable` > module parameter which is used as the "parent" of the root cgroup, > thus making it the global default. > > Reads from `memory.zswap.writeback` cgroup attribute will be coerced to > 0 or 1 to maintain backwards compatilibity. However, it is possible to > write -1 to that attribute to make the cgroup follow the parent again. > > Fixes: e39925734909 ("mm/memcontrol: respect zswap.writeback setting from parent cg too") > Fixes: 501a06fe8e4c ("zswap: memcontrol: implement zswap writeback disabling") > Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name> > --- > Documentation/admin-guide/cgroup-v2.rst | 17 +++++++++++------ > Documentation/admin-guide/mm/zswap.rst | 9 ++++++++- > include/linux/memcontrol.h | 3 ++- > include/linux/zswap.h | 6 ++++++ > mm/memcontrol.c | 24 +++++++++++++++++------- > mm/zswap.c | 9 +++++++++ > 6 files changed, 53 insertions(+), 15 deletions(-) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index 95c18bc17083..eea580490679 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1717,10 +1717,12 @@ The following nested keys are defined. > entries fault back in or are written out to disk. > > memory.zswap.writeback > - A read-write single value file. The default value is "1". > - Note that this setting is hierarchical, i.e. the writeback would be > - implicitly disabled for child cgroups if the upper hierarchy > - does so. > + A read-write single value file. The default is to follow the parent > + cgroup configuration, and the root cgroup follows the global > + ``zswap.writeback_enabled`` module parameter (which is 1 by default). > + Thus, this setting is hierarchical, i.e. the writeback setting for > + a child cgroup can be implicitly controlled at runtime by changing > + any parent value or the global module parameter. > > When this is set to 0, all swapping attempts to swapping devices > are disabled. This included both zswap writebacks, and swapping due > @@ -1729,8 +1731,11 @@ The following nested keys are defined. > reclaim inefficiency after disabling writeback (because the same > pages might be rejected again and again). > > - Note that this is subtly different from setting memory.swap.max to > - 0, as it still allows for pages to be written to the zswap pool. > + Note that this is different from setting memory.swap.max to 0, > + as it still allows for pages to be written to the zswap pool. > + > + This can also be set to -1, which would make the cgroup (and its > + future children) follow the parent/global value again. > API-design-wise, this seems a bit confusing... Using the value -1 to indicate the cgroup should follow ancestor is not quite semantically meaningful. What does "follow the parent" here mean? Reading your code, it seems to be "find the first non negative from this memcg to root and just use it", but it's neither intuitive, nor deducable from the documentation. There are also ill-defined/poorly-defined behavior. What if we set -1 to all cgroups? Should writeback be enabled or disabled? The implementation also contradicts the "writeback setting for a child cgroup can be implicitly controlled at runtime by chaging any parent value or the global module parameter" part. What happens if we have the following sequence of zswap.writeback values, from root to current memcg: root.memcg.zswap.writeback = 0, 1, -1, -1, -1 = cur_memcg.zswap.writeback Should we disable or enable cur_memcg's zswap.writeback? It's disabled at root, but the first non-negative ancestral value is 1, so this cgroup will follow it (even though that ancestor itself has zswap writeback disabled!!!) I think we should separate the values itself from the decision to check ancestors. The former should be indicated per-memcg, and the latter decision should only come into play when the memcg itself does prevent zswap.writeback. It might be less confusing to make the latter decision globally - perhaps through a mount option, analogous to memory_recursiveprot?
Hello. On Thu, Sep 26, 2024 at 07:28:08PM GMT, Nhat Pham <nphamcs@gmail.com> wrote: > API-design-wise, this seems a bit confusing... Using the value -1 to > indicate the cgroup should follow ancestor is not quite semantically > meaningful. What about assigning this semantic to an empty string ("")? That would be the default behavior and also the value shown when reading the file (to distinguish this for explicitly configured values). (The weirdness of 0, 1, -1, -1, -1 would remain. Maybe switching this via the mount option could satisfy any user. Admittedly, I tend to confuse this knob with swap.max.) Michal
On Fri, Sep 27, 2024 at 8:01 AM Michal Koutný <mkoutny@suse.com> wrote: > > > What about assigning this semantic to an empty string ("")? > That would be the default behavior and also the value shown when reading > the file (to distinguish this for explicitly configured values). Yeah that's better than -1, I agree. Still a bit confusing, but at least the semantic is "we are not making a choice at the memcg". > > (The weirdness of 0, 1, -1, -1, -1 would remain. Maybe switching this > via the mount option could satisfy any user. Admittedly, I tend to > confuse this knob with swap.max.) Yeah a mount option, or in general some sort of global knob (with proper documentation) would be preferable. And yeah, I hear you with the swap.max confusion. That's why I tried to make it explicit in the documentation, because the difference is subtle and can trip up users. Hopefully, when zswap and swap are decoupled, users can conceptualize them as two separate tiers, and the confusion will lessen...
On Thu, Sep 26, 2024 at 3:55 PM Ivan Shapovalov <intelfx@intelfx.name> wrote: > > Improve the inheritance behavior of the `memory.zswap.writeback` cgroup > attribute introduced during the 6.11 cycle. Specifically, in 6.11 we > walk the parent cgroups until we find a _disabled_ writeback, which does > not allow the user to selectively enable zswap writeback while having it > disabled by default. Is there an actual need for this? This is a theoretical use case I thought of (and raised), but I don't think anybody actually wants this...? Besides, most people who want this can just: 1. Enable zswap writeback on root cgroup (and all non-leaf cgroups). 2. Disable zswap writeback on leaf cgroups on creation by default. 3. Selectively enable zswap writeback for the leaf cgroups. All of this is quite doable in userspace. It's not even _that_ racy - just do this before adding tasks etc. to the cgroup?
On 2024-09-26 at 16:12 -0700, Nhat Pham wrote: > On Thu, Sep 26, 2024 at 3:55 PM Ivan Shapovalov <intelfx@intelfx.name> wrote: > > > > Improve the inheritance behavior of the `memory.zswap.writeback` cgroup > > attribute introduced during the 6.11 cycle. Specifically, in 6.11 we > > walk the parent cgroups until we find a _disabled_ writeback, which does > > not allow the user to selectively enable zswap writeback while having it > > disabled by default. > > Is there an actual need for this? This is a theoretical use case I > thought of (and raised), but I don't think anybody actually wants > this...? This is of course anecdata, but yes, it does solve a real use-case that I'm having right now, as well as a bunch of my colleagues who recently complained to me (in private) about pretty much the same use-case. The use-case is following: it turns out that it could be beneficial for desktop systems to run with a pretty high swappiness and zswap writeback globally disabled, to nudge the system to compress cold pages but not actually write them back to the disk (which would happen pretty aggressively if it was not disabled) to reduce I/O and latencies. However, under this setup it is sometimes needed to re-enable zswap writeback for specific memory-heavy applications that allocate a lot of cold pages, to "allow" the kernel to actually swap those programs out. > > Besides, most people who want this can just: > > 1. Enable zswap writeback on root cgroup (and all non-leaf cgroups). > > 2. Disable zswap writeback on leaf cgroups on creation by default. > > 3. Selectively enable zswap writeback for the leaf cgroups. > > All of this is quite doable in userspace. It's not even _that_ racy - > just do this before adding tasks etc. to the cgroup? Well, yes, it is technically doable from userspace, just like it was technically doable prior to commit e39925734909 to have userspace explicitly control the entire hierarchy of writeback settings. However it _is_ pretty painful, and the flow you described would essentially negate any benefits of that patch (it would require userspace to, once again, manage the entire hierarchy explicitly without any help from the kernel). IOWs, per the above commit I concluded that reducing pain levels for userspace implementations was an acceptable design goal :-) Thanks, -- Ivan Shapovalov / intelfx /
On Thu, Sep 26, 2024 at 4:40 PM Ivan Shapovalov <intelfx@intelfx.name> wrote: > > On 2024-09-26 at 16:12 -0700, Nhat Pham wrote: > > On Thu, Sep 26, 2024 at 3:55 PM Ivan Shapovalov <intelfx@intelfx.name> wrote: > > > > > > Improve the inheritance behavior of the `memory.zswap.writeback` cgroup > > > attribute introduced during the 6.11 cycle. Specifically, in 6.11 we > > > walk the parent cgroups until we find a _disabled_ writeback, which does > > > not allow the user to selectively enable zswap writeback while having it > > > disabled by default. > > > > Is there an actual need for this? This is a theoretical use case I > > thought of (and raised), but I don't think anybody actually wants > > this...? > > This is of course anecdata, but yes, it does solve a real use-case that > I'm having right now, as well as a bunch of my colleagues who recently > complained to me (in private) about pretty much the same use-case. > > The use-case is following: it turns out that it could be beneficial for > desktop systems to run with a pretty high swappiness and zswap > writeback globally disabled, to nudge the system to compress cold pages > but not actually write them back to the disk (which would happen pretty > aggressively if it was not disabled) to reduce I/O and latencies. > However, under this setup it is sometimes needed to re-enable zswap > writeback for specific memory-heavy applications that allocate a lot of > cold pages, to "allow" the kernel to actually swap those programs out. Out of pure curiosity (and to make sure I fully grasp the problem at hand), is this the capacity-based zswap writeback (i.e the one triggered by limits), or the memory pressure based dynamic shrinker? If you disable the latter and only rely on the former, it should not "write pages aggressively". Limits are rarely reached (IIUC, zswap.max are frequently used as binary knobs, and global limits are hard to reach), so usually pages that are going to disk swap are just pages zswap reject (i.e mostly just pages that fail to compress). This should be a very small category. You will still see "swap" usage due to a quirk in zswap architecture (which I'm working to fix), but there should rarely be any IOs. So the setup itself is not super necessary. If it's the latter then yeah I can kinda see the need for the setup. > > > > > Besides, most people who want this can just: > > > > 1. Enable zswap writeback on root cgroup (and all non-leaf cgroups). > > > > 2. Disable zswap writeback on leaf cgroups on creation by default. > > > > 3. Selectively enable zswap writeback for the leaf cgroups. > > > > All of this is quite doable in userspace. It's not even _that_ racy - > > just do this before adding tasks etc. to the cgroup? > > Well, yes, it is technically doable from userspace, just like it was > technically doable prior to commit e39925734909 to have userspace > explicitly control the entire hierarchy of writeback settings. > However it _is_ pretty painful, and the flow you described would > essentially negate any benefits of that patch (it would require > userspace to, once again, manage the entire hierarchy explicitly > without any help from the kernel). I think it's a tad different. In the case of the commit e39925734909, the hierarchical behavior of zswap.writeback knob is quite semantically confusing, almost counter-intuitive (and does not conform to the convention of other cgroup knobs, which use the most restrictive limit - check out zswap.max limit checking for example). That commit arguably fixes it for the "common" case (i.e you want the hierarchical enforcement to hold for the most part). That's why there were even conversations about cc-ing the stable mailing list for backporting it to older kernels. This is more of a "new use case" patch. It complicates the API, for something readily doable in userspace - the kernel does not do anything that userspace cannot achieve. So it should undergo stricter scrutiny. :) Anyway, Yosry, Johannes, how do you feel about this?
© 2016 - 2024 Red Hat, Inc.