Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++++ include/linux/vm_event_item.h | 2 ++ mm/memcontrol.c | 4 ++++ mm/page_io.c | 16 ++++++++++++++++ mm/vmstat.c | 2 ++ 5 files changed, 34 insertions(+)
From: Barry Song <v-songbaohua@oppo.com>
When the proportion of folios from the zero map is small, missing their
accounting may not significantly impact profiling. However, it’s easy
to construct a scenario where this becomes an issue—for example,
allocating 1 GB of memory, writing zeros from userspace, followed by
MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out
and swap-in counts seem to vanish into a black hole, potentially
causing semantic ambiguity.
We have two ways to address this:
1. Add a separate counter specifically for the zero map.
2. Continue using the current accounting, treating the zero map like
a normal backend. (This aligns with the current behavior of zRAM
when supporting same-page fills at the device level.)
This patch adopts option 1 as pswpin/pswpout counters are that they
only apply to IO done directly to the backend device (as noted by
Nhat Pham).
We can find these counters from /proc/vmstat (counters for the whole
system) and memcg's memory.stat (counters for the interested memcg).
For example:
$ grep -E 'swpin_zero|swpout_zero' /proc/vmstat
swpin_zero 1648
swpout_zero 33536
$ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat
swpin_zero 3905
swpout_zero 3985
Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
-v2:
* add separate counters rather than using pswpin/out; thanks
for the comments from Usama, David, Yosry and Nhat;
* Usama also suggested a new counter like swapped_zero, I
prefer that one be separated as an enhancement patch not
a hotfix. will probably handle it later on.
Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++++
include/linux/vm_event_item.h | 2 ++
mm/memcontrol.c | 4 ++++
mm/page_io.c | 16 ++++++++++++++++
mm/vmstat.c | 2 ++
5 files changed, 34 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index db3799f1483e..984eb3c9d05b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1599,6 +1599,16 @@ The following nested keys are defined.
pglazyfreed (npn)
Amount of reclaimed lazyfree pages
+ swpin_zero
+ Number of pages moved into memory with zero content, meaning no
+ copy exists in the backend swapfile, allowing swap-in to avoid
+ I/O read overhead.
+
+ swpout_zero
+ Number of pages moved out of memory with zero content, meaning no
+ copy is needed in the backend swapfile, allowing swap-out to avoid
+ I/O write overhead.
+
zswpin
Number of pages moved in to memory from zswap.
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index aed952d04132..f70d0958095c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -134,6 +134,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_SWAP
SWAP_RA,
SWAP_RA_HIT,
+ SWPIN_ZERO,
+ SWPOUT_ZERO,
#ifdef CONFIG_KSM
KSM_SWPIN_COPY,
#endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5e44d6e7591e..7b3503d12aaf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -441,6 +441,10 @@ static const unsigned int memcg_vm_event_stat[] = {
PGDEACTIVATE,
PGLAZYFREE,
PGLAZYFREED,
+#ifdef CONFIG_SWAP
+ SWPIN_ZERO,
+ SWPOUT_ZERO,
+#endif
#ifdef CONFIG_ZSWAP
ZSWPIN,
ZSWPOUT,
diff --git a/mm/page_io.c b/mm/page_io.c
index 5d9b6e6cf96c..4b4ea8e49cf6 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -204,7 +204,9 @@ static bool is_folio_zero_filled(struct folio *folio)
static void swap_zeromap_folio_set(struct folio *folio)
{
+ struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ int nr_pages = folio_nr_pages(folio);
swp_entry_t entry;
unsigned int i;
@@ -212,6 +214,12 @@ static void swap_zeromap_folio_set(struct folio *folio)
entry = page_swap_entry(folio_page(folio, i));
set_bit(swp_offset(entry), sis->zeromap);
}
+
+ count_vm_events(SWPOUT_ZERO, nr_pages);
+ if (objcg) {
+ count_objcg_events(objcg, SWPOUT_ZERO, nr_pages);
+ obj_cgroup_put(objcg);
+ }
}
static void swap_zeromap_folio_clear(struct folio *folio)
@@ -507,6 +515,7 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
static bool swap_read_folio_zeromap(struct folio *folio)
{
int nr_pages = folio_nr_pages(folio);
+ struct obj_cgroup *objcg;
bool is_zeromap;
/*
@@ -521,6 +530,13 @@ static bool swap_read_folio_zeromap(struct folio *folio)
if (!is_zeromap)
return false;
+ objcg = get_obj_cgroup_from_folio(folio);
+ count_vm_events(SWPIN_ZERO, nr_pages);
+ if (objcg) {
+ count_objcg_events(objcg, SWPIN_ZERO, nr_pages);
+ obj_cgroup_put(objcg);
+ }
+
folio_zero_range(folio, 0, folio_size(folio));
folio_mark_uptodate(folio);
return true;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 22a294556b58..c8ef7352f9ed 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1418,6 +1418,8 @@ const char * const vmstat_text[] = {
#ifdef CONFIG_SWAP
"swap_ra",
"swap_ra_hit",
+ "swpin_zero",
+ "swpout_zero",
#ifdef CONFIG_KSM
"ksm_swpin_copy",
#endif
--
2.39.3 (Apple Git-146)
On Sat, Nov 2, 2024 at 3:12 AM Barry Song <21cnbao@gmail.com> wrote: > > From: Barry Song <v-songbaohua@oppo.com> > > When the proportion of folios from the zero map is small, missing their > accounting may not significantly impact profiling. However, it’s easy > to construct a scenario where this becomes an issue—for example, > allocating 1 GB of memory, writing zeros from userspace, followed by > MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out > and swap-in counts seem to vanish into a black hole, potentially > causing semantic ambiguity. > > We have two ways to address this: > > 1. Add a separate counter specifically for the zero map. > 2. Continue using the current accounting, treating the zero map like > a normal backend. (This aligns with the current behavior of zRAM > when supporting same-page fills at the device level.) > > This patch adopts option 1 as pswpin/pswpout counters are that they > only apply to IO done directly to the backend device (as noted by > Nhat Pham). > > We can find these counters from /proc/vmstat (counters for the whole > system) and memcg's memory.stat (counters for the interested memcg). > > For example: > > $ grep -E 'swpin_zero|swpout_zero' /proc/vmstat > swpin_zero 1648 > swpout_zero 33536 > > $ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat > swpin_zero 3905 > swpout_zero 3985 > LGTM FWIW, so I'll leave my review tag here: Reviewed-by: Nhat Pham <nphamcs@gmail.com> Too many emails in this thread, but my opinions is: 1. A fix tag is appropriate. It's not a kernel bug per se, but it's incredibly confusing, and can potentially throw off user space agents who rely on the rate of change of these counters as signals. 2. I do think we should use a separate set of counters for this optimization. No strong opinions regarding combining this with the zswap counters, but it can get confusing for users when they enable/disable zswap. If we are to combine, I'd be much more comfortable if we have a generic name, like the one David suggested in v1 ("swpin_skip" / "swpout_skip"). This would still require some API change tho, so not sure if this is the best approach? :) It would also be appropriate if we bring back the same-filled optimization (which should be doable in the swap ID world, but I digress).
On 02.11.24 11:12, Barry Song wrote: > From: Barry Song <v-songbaohua@oppo.com> > > When the proportion of folios from the zero map is small, missing their > accounting may not significantly impact profiling. However, it’s easy > to construct a scenario where this becomes an issue—for example, > allocating 1 GB of memory, writing zeros from userspace, followed by > MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out > and swap-in counts seem to vanish into a black hole, potentially > causing semantic ambiguity. > > We have two ways to address this: > > 1. Add a separate counter specifically for the zero map. > 2. Continue using the current accounting, treating the zero map like > a normal backend. (This aligns with the current behavior of zRAM > when supporting same-page fills at the device level.) > > This patch adopts option 1 as pswpin/pswpout counters are that they > only apply to IO done directly to the backend device (as noted by > Nhat Pham). > > We can find these counters from /proc/vmstat (counters for the whole > system) and memcg's memory.stat (counters for the interested memcg). > > For example: > > $ grep -E 'swpin_zero|swpout_zero' /proc/vmstat > swpin_zero 1648 > swpout_zero 33536 > > $ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat > swpin_zero 3905 > swpout_zero 3985 > > Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap") > Cc: Usama Arif <usamaarif642@gmail.com> > Cc: Chengming Zhou <chengming.zhou@linux.dev> > Cc: Yosry Ahmed <yosryahmed@google.com> > Cc: Nhat Pham <nphamcs@gmail.com> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: David Hildenbrand <david@redhat.com> > Cc: Hugh Dickins <hughd@google.com> > Cc: Matthew Wilcox (Oracle) <willy@infradead.org> > Cc: Shakeel Butt <shakeel.butt@linux.dev> > Cc: Andi Kleen <ak@linux.intel.com> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > Cc: Chris Li <chrisl@kernel.org> > Cc: "Huang, Ying" <ying.huang@intel.com> > Cc: Kairui Song <kasong@tencent.com> > Cc: Ryan Roberts <ryan.roberts@arm.com> > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > -v2: > * add separate counters rather than using pswpin/out; thanks > for the comments from Usama, David, Yosry and Nhat; > * Usama also suggested a new counter like swapped_zero, I > prefer that one be separated as an enhancement patch not > a hotfix. will probably handle it later on. > > Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++++ > include/linux/vm_event_item.h | 2 ++ > mm/memcontrol.c | 4 ++++ > mm/page_io.c | 16 ++++++++++++++++ > mm/vmstat.c | 2 ++ > 5 files changed, 34 insertions(+) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index db3799f1483e..984eb3c9d05b 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1599,6 +1599,16 @@ The following nested keys are defined. > pglazyfreed (npn) > Amount of reclaimed lazyfree pages > > + swpin_zero > + Number of pages moved into memory with zero content, meaning no > + copy exists in the backend swapfile, allowing swap-in to avoid > + I/O read overhead. > + > + swpout_zero > + Number of pages moved out of memory with zero content, meaning no > + copy is needed in the backend swapfile, allowing swap-out to avoid > + I/O write overhead. Hm, can make it a bit clearer that this is a pure optimization and refer to the other counters? swpin_zero Portion of "pswpin" pages for which I/O was optimized out because the page content was detected to be zero during swapout. swpout_zero Portion of "pswout" pages for which I/O was optimized out because the page content was detected to be zero. -- Cheers, David / dhildenb
On Mon, Nov 04, 2024 at 01:42:08PM +0100, David Hildenbrand wrote: > On 02.11.24 11:12, Barry Song wrote: > > @@ -1599,6 +1599,16 @@ The following nested keys are defined. > > pglazyfreed (npn) > > Amount of reclaimed lazyfree pages > > > > + swpin_zero > > + Number of pages moved into memory with zero content, meaning no > > + copy exists in the backend swapfile, allowing swap-in to avoid > > + I/O read overhead. > > + > > + swpout_zero > > + Number of pages moved out of memory with zero content, meaning no > > + copy is needed in the backend swapfile, allowing swap-out to avoid > > + I/O write overhead. > > Hm, can make it a bit clearer that this is a pure optimization and refer > to the other counters? > > swpin_zero > Portion of "pswpin" pages for which I/O was optimized out > because the page content was detected to be zero during swapout. AFAICS the zeropages currently don't show up in pswpin/pswpout, so these are independent counters, not subsets. I'm leaning towards Barry's side on the fixes tag. When zswap handled the same-filled pages, we would count them in zswpin/out. From a user POV, especially one using zswap, the behavior didn't change, but the counts giving insight into this (potentially significant) VM activity disappeared. This is arguably a regression. > swpout_zero > Portion of "pswout" pages for which I/O was optimized out > because the page content was detected to be zero. Are we sure we want to commit to the "zero" in the name here? Until very recently, zswap optimized all same-filled pages. It's possible somebody might want to bring that back down the line. In reference to the above, I'd actually prefer putting them back into zswpin/zswpout. Sure, they're not handled by zswap.c proper, but this is arguably just an implementation detail; from a user POV this is still just (a form of) compression in lieu of IO to the swap backend. IMO there is no need for coming up with a separate category. Just add them to zswpin/zswpout and remove the CONFIG_ZSWAP guards from them?
On 04.11.24 17:34, Johannes Weiner wrote: > On Mon, Nov 04, 2024 at 01:42:08PM +0100, David Hildenbrand wrote: >> On 02.11.24 11:12, Barry Song wrote: >>> @@ -1599,6 +1599,16 @@ The following nested keys are defined. >>> pglazyfreed (npn) >>> Amount of reclaimed lazyfree pages >>> >>> + swpin_zero >>> + Number of pages moved into memory with zero content, meaning no >>> + copy exists in the backend swapfile, allowing swap-in to avoid >>> + I/O read overhead. >>> + >>> + swpout_zero >>> + Number of pages moved out of memory with zero content, meaning no >>> + copy is needed in the backend swapfile, allowing swap-out to avoid >>> + I/O write overhead. >> >> Hm, can make it a bit clearer that this is a pure optimization and refer >> to the other counters? >> >> swpin_zero >> Portion of "pswpin" pages for which I/O was optimized out >> because the page content was detected to be zero during swapout. > > AFAICS the zeropages currently don't show up in pswpin/pswpout, so > these are independent counters, not subsets. Ah. now I understand the problem. The whole "move out of memory" "move into memory" here is quite confusing TBH. We're not moving anything, we're optimizing out the move completely ... yes, you could call it compression (below). > > I'm leaning towards Barry's side on the fixes tag. I think the documentation when to use the Fixes: tag is pretty clear. Introducing new counters can hardly be considered a bugfix. Missing to adjust some counters that *existing tools* would know/use might be IMO (below). > When zswap handled > the same-filled pages, we would count them in zswpin/out. From a user > POV, especially one using zswap, the behavior didn't change, but the > counts giving insight into this (potentially significant) VM activity > disappeared. This is arguably a regression. > >> swpout_zero >> Portion of "pswout" pages for which I/O was optimized out >> because the page content was detected to be zero. > > Are we sure we want to commit to the "zero" in the name here? Until > very recently, zswap optimized all same-filled pages. It's possible > somebody might want to bring that back down the line. Agreed. > > In reference to the above, I'd actually prefer putting them back into > zswpin/zswpout. Sure, they're not handled by zswap.c proper, but this > is arguably just an implementation detail; from a user POV this is > still just (a form of) compression in lieu of IO to the swap backend. > > IMO there is no need for coming up with a separate category. Just add > them to zswpin/zswpout and remove the CONFIG_ZSWAP guards from them? Would work for me, although it might be a bit confusing if no zswap is configured. Maybe just needs to be documented. -- Cheers, David / dhildenb
On 04/11/2024 17:10, David Hildenbrand wrote: > On 04.11.24 17:34, Johannes Weiner wrote: >> On Mon, Nov 04, 2024 at 01:42:08PM +0100, David Hildenbrand wrote: >>> On 02.11.24 11:12, Barry Song wrote: >>>> @@ -1599,6 +1599,16 @@ The following nested keys are defined. >>>> pglazyfreed (npn) >>>> Amount of reclaimed lazyfree pages >>>> + swpin_zero >>>> + Number of pages moved into memory with zero content, meaning no >>>> + copy exists in the backend swapfile, allowing swap-in to avoid >>>> + I/O read overhead. >>>> + >>>> + swpout_zero >>>> + Number of pages moved out of memory with zero content, meaning no >>>> + copy is needed in the backend swapfile, allowing swap-out to avoid >>>> + I/O write overhead. >>> >>> Hm, can make it a bit clearer that this is a pure optimization and refer >>> to the other counters? >>> >>> swpin_zero >>> Portion of "pswpin" pages for which I/O was optimized out >>> because the page content was detected to be zero during swapout. >> >> AFAICS the zeropages currently don't show up in pswpin/pswpout, so >> these are independent counters, not subsets. > > Ah. now I understand the problem. The whole "move out of memory" "move into memory" here is quite confusing TBH. We're not moving anything, we're optimizing out the move completely ... yes, you could call it compression (below). > >> >> I'm leaning towards Barry's side on the fixes tag. > > I think the documentation when to use the Fixes: tag is pretty clear. > > Introducing new counters can hardly be considered a bugfix. Missing to adjust some counters that *existing tools* would know/use might be IMO (below). > >> When zswap handled >> the same-filled pages, we would count them in zswpin/out. From a user >> POV, especially one using zswap, the behavior didn't change, but the >> counts giving insight into this (potentially significant) VM activity >> disappeared. This is arguably a regression. >> >> swpout_zero >>> Portion of "pswout" pages for which I/O was optimized out >>> because the page content was detected to be zero. >> >> Are we sure we want to commit to the "zero" in the name here? Until >> very recently, zswap optimized all same-filled pages. It's possible >> somebody might want to bring that back down the line. > > Agreed. > >> >> In reference to the above, I'd actually prefer putting them back into >> zswpin/zswpout. Sure, they're not handled by zswap.c proper, but this >> is arguably just an implementation detail; from a user POV this is >> still just (a form of) compression in lieu of IO to the swap backend. >> >> IMO there is no need for coming up with a separate category. Just add >> them to zswpin/zswpout and remove the CONFIG_ZSWAP guards from them? > hmm, I actually don't like the idea of using zswpin/zswpout. Its a bit confusing if zswap is disabled and zswap counters are incrementing? Also, it means that when zswap is enabled, you won't be able to distinguish between zswap and zeropage optimization. > Would work for me, although it might be a bit confusing if no zswap is configured. Maybe just needs to be documented. >
On 04.11.24 19:48, Usama Arif wrote: > > > On 04/11/2024 17:10, David Hildenbrand wrote: >> On 04.11.24 17:34, Johannes Weiner wrote: >>> On Mon, Nov 04, 2024 at 01:42:08PM +0100, David Hildenbrand wrote: >>>> On 02.11.24 11:12, Barry Song wrote: >>>>> @@ -1599,6 +1599,16 @@ The following nested keys are defined. >>>>> pglazyfreed (npn) >>>>> Amount of reclaimed lazyfree pages >>>>> + swpin_zero >>>>> + Number of pages moved into memory with zero content, meaning no >>>>> + copy exists in the backend swapfile, allowing swap-in to avoid >>>>> + I/O read overhead. >>>>> + >>>>> + swpout_zero >>>>> + Number of pages moved out of memory with zero content, meaning no >>>>> + copy is needed in the backend swapfile, allowing swap-out to avoid >>>>> + I/O write overhead. >>>> >>>> Hm, can make it a bit clearer that this is a pure optimization and refer >>>> to the other counters? >>>> >>>> swpin_zero >>>> Portion of "pswpin" pages for which I/O was optimized out >>>> because the page content was detected to be zero during swapout. >>> >>> AFAICS the zeropages currently don't show up in pswpin/pswpout, so >>> these are independent counters, not subsets. >> >> Ah. now I understand the problem. The whole "move out of memory" "move into memory" here is quite confusing TBH. We're not moving anything, we're optimizing out the move completely ... yes, you could call it compression (below). >> >>> >>> I'm leaning towards Barry's side on the fixes tag. >> >> I think the documentation when to use the Fixes: tag is pretty clear. >> >> Introducing new counters can hardly be considered a bugfix. Missing to adjust some counters that *existing tools* would know/use might be IMO (below). >> >>> When zswap handled >>> the same-filled pages, we would count them in zswpin/out. From a user >>> POV, especially one using zswap, the behavior didn't change, but the >>> counts giving insight into this (potentially significant) VM activity >>> disappeared. This is arguably a regression. >>>>> swpout_zero >>>> Portion of "pswout" pages for which I/O was optimized out >>>> because the page content was detected to be zero. >>> >>> Are we sure we want to commit to the "zero" in the name here? Until >>> very recently, zswap optimized all same-filled pages. It's possible >>> somebody might want to bring that back down the line. >> >> Agreed. >> >>> >>> In reference to the above, I'd actually prefer putting them back into >>> zswpin/zswpout. Sure, they're not handled by zswap.c proper, but this >>> is arguably just an implementation detail; from a user POV this is >>> still just (a form of) compression in lieu of IO to the swap backend. >>> >>> IMO there is no need for coming up with a separate category. Just add >>> them to zswpin/zswpout and remove the CONFIG_ZSWAP guards from them? >> > > hmm, I actually don't like the idea of using zswpin/zswpout. Its a > bit confusing if zswap is disabled and zswap counters are incrementing? > > Also, it means that when zswap is enabled, you won't be able to distinguish > between zswap and zeropage optimization. Does it matter? Because in the past the same would have happened, no (back when this was done in zswap code)? -- Cheers, David / dhildenb
On 04/11/2024 20:56, David Hildenbrand wrote: > On 04.11.24 19:48, Usama Arif wrote: >> >> >> On 04/11/2024 17:10, David Hildenbrand wrote: >>> On 04.11.24 17:34, Johannes Weiner wrote: >>>> On Mon, Nov 04, 2024 at 01:42:08PM +0100, David Hildenbrand wrote: >>>>> On 02.11.24 11:12, Barry Song wrote: >>>>>> @@ -1599,6 +1599,16 @@ The following nested keys are defined. >>>>>> pglazyfreed (npn) >>>>>> Amount of reclaimed lazyfree pages >>>>>> + swpin_zero >>>>>> + Number of pages moved into memory with zero content, meaning no >>>>>> + copy exists in the backend swapfile, allowing swap-in to avoid >>>>>> + I/O read overhead. >>>>>> + >>>>>> + swpout_zero >>>>>> + Number of pages moved out of memory with zero content, meaning no >>>>>> + copy is needed in the backend swapfile, allowing swap-out to avoid >>>>>> + I/O write overhead. >>>>> >>>>> Hm, can make it a bit clearer that this is a pure optimization and refer >>>>> to the other counters? >>>>> >>>>> swpin_zero >>>>> Portion of "pswpin" pages for which I/O was optimized out >>>>> because the page content was detected to be zero during swapout. >>>> >>>> AFAICS the zeropages currently don't show up in pswpin/pswpout, so >>>> these are independent counters, not subsets. >>> >>> Ah. now I understand the problem. The whole "move out of memory" "move into memory" here is quite confusing TBH. We're not moving anything, we're optimizing out the move completely ... yes, you could call it compression (below). >>> >>>> >>>> I'm leaning towards Barry's side on the fixes tag. >>> >>> I think the documentation when to use the Fixes: tag is pretty clear. >>> >>> Introducing new counters can hardly be considered a bugfix. Missing to adjust some counters that *existing tools* would know/use might be IMO (below). >>> >>>> When zswap handled >>>> the same-filled pages, we would count them in zswpin/out. From a user >>>> POV, especially one using zswap, the behavior didn't change, but the >>>> counts giving insight into this (potentially significant) VM activity >>>> disappeared. This is arguably a regression. >>>>>> swpout_zero >>>>> Portion of "pswout" pages for which I/O was optimized out >>>>> because the page content was detected to be zero. >>>> >>>> Are we sure we want to commit to the "zero" in the name here? Until >>>> very recently, zswap optimized all same-filled pages. It's possible >>>> somebody might want to bring that back down the line. >>> >>> Agreed. >>> >>>> >>>> In reference to the above, I'd actually prefer putting them back into >>>> zswpin/zswpout. Sure, they're not handled by zswap.c proper, but this >>>> is arguably just an implementation detail; from a user POV this is >>>> still just (a form of) compression in lieu of IO to the swap backend. >>>> >>>> IMO there is no need for coming up with a separate category. Just add >>>> them to zswpin/zswpout and remove the CONFIG_ZSWAP guards from them? >>> >> >> hmm, I actually don't like the idea of using zswpin/zswpout. Its a >> bit confusing if zswap is disabled and zswap counters are incrementing? >> >> Also, it means that when zswap is enabled, you won't be able to distinguish >> between zswap and zeropage optimization. > > Does it matter? Because in the past the same would have happened, no (back when this was done in zswap code)? > When it was in zswap code, there was zswap_same_filled_pages stat as well to see how many zero-filled pages were part of zswap. (Not the same as counter, but you could still get a good idea about same filled page usage). The other thing is it affects zram as well.. Maybe We could have a hybrid approach? i.e. have the zswpin/zswpout counter incremented at zero filled pages as suggested, but then also have a zero_swapped stat that tells how much of the zeromap is currently set (similar to zswapped).
On Tue, Nov 5, 2024 at 10:24 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 04/11/2024 20:56, David Hildenbrand wrote: > > On 04.11.24 19:48, Usama Arif wrote: > >> > >> > >> On 04/11/2024 17:10, David Hildenbrand wrote: > >>> On 04.11.24 17:34, Johannes Weiner wrote: > >>>> On Mon, Nov 04, 2024 at 01:42:08PM +0100, David Hildenbrand wrote: > >>>>> On 02.11.24 11:12, Barry Song wrote: > >>>>>> @@ -1599,6 +1599,16 @@ The following nested keys are defined. > >>>>>> pglazyfreed (npn) > >>>>>> Amount of reclaimed lazyfree pages > >>>>>> + swpin_zero > >>>>>> + Number of pages moved into memory with zero content, meaning no > >>>>>> + copy exists in the backend swapfile, allowing swap-in to avoid > >>>>>> + I/O read overhead. > >>>>>> + > >>>>>> + swpout_zero > >>>>>> + Number of pages moved out of memory with zero content, meaning no > >>>>>> + copy is needed in the backend swapfile, allowing swap-out to avoid > >>>>>> + I/O write overhead. > >>>>> > >>>>> Hm, can make it a bit clearer that this is a pure optimization and refer > >>>>> to the other counters? > >>>>> > >>>>> swpin_zero > >>>>> Portion of "pswpin" pages for which I/O was optimized out > >>>>> because the page content was detected to be zero during swapout. > >>>> > >>>> AFAICS the zeropages currently don't show up in pswpin/pswpout, so > >>>> these are independent counters, not subsets. > >>> > >>> Ah. now I understand the problem. The whole "move out of memory" "move into memory" here is quite confusing TBH. We're not moving anything, we're optimizing out the move completely ... yes, you could call it compression (below). > >>> > >>>> > >>>> I'm leaning towards Barry's side on the fixes tag. > >>> > >>> I think the documentation when to use the Fixes: tag is pretty clear. > >>> > >>> Introducing new counters can hardly be considered a bugfix. Missing to adjust some counters that *existing tools* would know/use might be IMO (below). > >>> > >>>> When zswap handled > >>>> the same-filled pages, we would count them in zswpin/out. From a user > >>>> POV, especially one using zswap, the behavior didn't change, but the > >>>> counts giving insight into this (potentially significant) VM activity > >>>> disappeared. This is arguably a regression. > >>>>>> swpout_zero > >>>>> Portion of "pswout" pages for which I/O was optimized out > >>>>> because the page content was detected to be zero. > >>>> > >>>> Are we sure we want to commit to the "zero" in the name here? Until > >>>> very recently, zswap optimized all same-filled pages. It's possible > >>>> somebody might want to bring that back down the line. > >>> > >>> Agreed. > >>> > >>>> > >>>> In reference to the above, I'd actually prefer putting them back into > >>>> zswpin/zswpout. Sure, they're not handled by zswap.c proper, but this > >>>> is arguably just an implementation detail; from a user POV this is > >>>> still just (a form of) compression in lieu of IO to the swap backend. > >>>> > >>>> IMO there is no need for coming up with a separate category. Just add > >>>> them to zswpin/zswpout and remove the CONFIG_ZSWAP guards from them? > >>> > >> > >> hmm, I actually don't like the idea of using zswpin/zswpout. Its a > >> bit confusing if zswap is disabled and zswap counters are incrementing? > >> > >> Also, it means that when zswap is enabled, you won't be able to distinguish > >> between zswap and zeropage optimization. > > > > Does it matter? Because in the past the same would have happened, no (back when this was done in zswap code)? > > > > When it was in zswap code, there was zswap_same_filled_pages stat as well to see > how many zero-filled pages were part of zswap. (Not the same as counter, but you > could still get a good idea about same filled page usage). > > The other thing is it affects zram as well.. > > Maybe We could have a hybrid approach? > i.e. have the zswpin/zswpout counter incremented at zero filled pages as suggested, > but then also have a zero_swapped stat that tells how much of the zeromap is > currently set (similar to zswapped). I still think we should keep zswap and zeromap separate. On a system without zswap, zero-page swap-in and swap-out are included in pswpin and pswpout counts. Although zram has same_page_filled, it's still treated as a block device after the swap layer. Thanks Barry
On 02/11/2024 10:12, Barry Song wrote: > From: Barry Song <v-songbaohua@oppo.com> > > When the proportion of folios from the zero map is small, missing their > accounting may not significantly impact profiling. However, it’s easy > to construct a scenario where this becomes an issue—for example, > allocating 1 GB of memory, writing zeros from userspace, followed by > MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out > and swap-in counts seem to vanish into a black hole, potentially > causing semantic ambiguity. > > We have two ways to address this: > > 1. Add a separate counter specifically for the zero map. > 2. Continue using the current accounting, treating the zero map like > a normal backend. (This aligns with the current behavior of zRAM > when supporting same-page fills at the device level.) > > This patch adopts option 1 as pswpin/pswpout counters are that they > only apply to IO done directly to the backend device (as noted by > Nhat Pham). > > We can find these counters from /proc/vmstat (counters for the whole > system) and memcg's memory.stat (counters for the interested memcg). > > For example: > > $ grep -E 'swpin_zero|swpout_zero' /proc/vmstat > swpin_zero 1648 > swpout_zero 33536 > > $ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat > swpin_zero 3905 > swpout_zero 3985 > > Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap") I don't think its a hotfix (or even a fix). It was discussed in the initial series to add these as a follow up and Joshua was going to do this soon. Its not fixing any bug in the initial series. > Cc: Usama Arif <usamaarif642@gmail.com> > Cc: Chengming Zhou <chengming.zhou@linux.dev> > Cc: Yosry Ahmed <yosryahmed@google.com> > Cc: Nhat Pham <nphamcs@gmail.com> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: David Hildenbrand <david@redhat.com> > Cc: Hugh Dickins <hughd@google.com> > Cc: Matthew Wilcox (Oracle) <willy@infradead.org> > Cc: Shakeel Butt <shakeel.butt@linux.dev> > Cc: Andi Kleen <ak@linux.intel.com> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > Cc: Chris Li <chrisl@kernel.org> > Cc: "Huang, Ying" <ying.huang@intel.com> > Cc: Kairui Song <kasong@tencent.com> > Cc: Ryan Roberts <ryan.roberts@arm.com> > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > -v2: > * add separate counters rather than using pswpin/out; thanks > for the comments from Usama, David, Yosry and Nhat; > * Usama also suggested a new counter like swapped_zero, I > prefer that one be separated as an enhancement patch not > a hotfix. will probably handle it later on. > I dont think either of them would be a hotfix. > Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++++ > include/linux/vm_event_item.h | 2 ++ > mm/memcontrol.c | 4 ++++ > mm/page_io.c | 16 ++++++++++++++++ > mm/vmstat.c | 2 ++ > 5 files changed, 34 insertions(+) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index db3799f1483e..984eb3c9d05b 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1599,6 +1599,16 @@ The following nested keys are defined. > pglazyfreed (npn) > Amount of reclaimed lazyfree pages > > + swpin_zero > + Number of pages moved into memory with zero content, meaning no > + copy exists in the backend swapfile, allowing swap-in to avoid > + I/O read overhead. > + > + swpout_zero > + Number of pages moved out of memory with zero content, meaning no > + copy is needed in the backend swapfile, allowing swap-out to avoid > + I/O write overhead. > + Maybe zero-filled pages might be a better term in both. > zswpin > Number of pages moved in to memory from zswap. > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index aed952d04132..f70d0958095c 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -134,6 +134,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > #ifdef CONFIG_SWAP > SWAP_RA, > SWAP_RA_HIT, > + SWPIN_ZERO, > + SWPOUT_ZERO, > #ifdef CONFIG_KSM > KSM_SWPIN_COPY, > #endif > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 5e44d6e7591e..7b3503d12aaf 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -441,6 +441,10 @@ static const unsigned int memcg_vm_event_stat[] = { > PGDEACTIVATE, > PGLAZYFREE, > PGLAZYFREED, > +#ifdef CONFIG_SWAP > + SWPIN_ZERO, > + SWPOUT_ZERO, > +#endif > #ifdef CONFIG_ZSWAP > ZSWPIN, > ZSWPOUT, > diff --git a/mm/page_io.c b/mm/page_io.c > index 5d9b6e6cf96c..4b4ea8e49cf6 100644 > --- a/mm/page_io.c > +++ b/mm/page_io.c > @@ -204,7 +204,9 @@ static bool is_folio_zero_filled(struct folio *folio) > > static void swap_zeromap_folio_set(struct folio *folio) > { > + struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio); > struct swap_info_struct *sis = swp_swap_info(folio->swap); > + int nr_pages = folio_nr_pages(folio); > swp_entry_t entry; > unsigned int i; > > @@ -212,6 +214,12 @@ static void swap_zeromap_folio_set(struct folio *folio) > entry = page_swap_entry(folio_page(folio, i)); > set_bit(swp_offset(entry), sis->zeromap); > } > + > + count_vm_events(SWPOUT_ZERO, nr_pages); > + if (objcg) { > + count_objcg_events(objcg, SWPOUT_ZERO, nr_pages); > + obj_cgroup_put(objcg); > + } > } > > static void swap_zeromap_folio_clear(struct folio *folio) > @@ -507,6 +515,7 @@ static void sio_read_complete(struct kiocb *iocb, long ret) > static bool swap_read_folio_zeromap(struct folio *folio) > { > int nr_pages = folio_nr_pages(folio); > + struct obj_cgroup *objcg; > bool is_zeromap; > > /* > @@ -521,6 +530,13 @@ static bool swap_read_folio_zeromap(struct folio *folio) > if (!is_zeromap) > return false; > > + objcg = get_obj_cgroup_from_folio(folio); > + count_vm_events(SWPIN_ZERO, nr_pages); > + if (objcg) { > + count_objcg_events(objcg, SWPIN_ZERO, nr_pages); > + obj_cgroup_put(objcg); > + } > + > folio_zero_range(folio, 0, folio_size(folio)); > folio_mark_uptodate(folio); > return true; > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 22a294556b58..c8ef7352f9ed 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1418,6 +1418,8 @@ const char * const vmstat_text[] = { > #ifdef CONFIG_SWAP > "swap_ra", > "swap_ra_hit", > + "swpin_zero", > + "swpout_zero", > #ifdef CONFIG_KSM > "ksm_swpin_copy", > #endif
On Sat, Nov 2, 2024 at 8:32 PM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 02/11/2024 10:12, Barry Song wrote: > > From: Barry Song <v-songbaohua@oppo.com> > > > > When the proportion of folios from the zero map is small, missing their > > accounting may not significantly impact profiling. However, it’s easy > > to construct a scenario where this becomes an issue—for example, > > allocating 1 GB of memory, writing zeros from userspace, followed by > > MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out > > and swap-in counts seem to vanish into a black hole, potentially > > causing semantic ambiguity. > > > > We have two ways to address this: > > > > 1. Add a separate counter specifically for the zero map. > > 2. Continue using the current accounting, treating the zero map like > > a normal backend. (This aligns with the current behavior of zRAM > > when supporting same-page fills at the device level.) > > > > This patch adopts option 1 as pswpin/pswpout counters are that they > > only apply to IO done directly to the backend device (as noted by > > Nhat Pham). > > > > We can find these counters from /proc/vmstat (counters for the whole > > system) and memcg's memory.stat (counters for the interested memcg). > > > > For example: > > > > $ grep -E 'swpin_zero|swpout_zero' /proc/vmstat > > swpin_zero 1648 > > swpout_zero 33536 > > > > $ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat > > swpin_zero 3905 > > swpout_zero 3985 > > > > Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap") > I don't think its a hotfix (or even a fix). It was discussed in the initial > series to add these as a follow up and Joshua was going to do this soon. > Its not fixing any bug in the initial series. I would prefer that all kernel versions with zeromap include this counter; otherwise, it could be confusing to determine where swap-in and swap-out have occurred, as shown by the small program below: p =malloc(1g); write p to zero madvise_pageout read p; Previously, there was 1GB of swap-in and swap-out activity reported, but now nothing is shown. I don't mean to suggest that there's a bug in the zeromap code; rather, having this counter would help clear up any confusion. I didn't realize Joshua was handling it. Is he still planning to? If so, I can leave it with Joshua if that was the plan :-) > > > Cc: Usama Arif <usamaarif642@gmail.com> > > Cc: Chengming Zhou <chengming.zhou@linux.dev> > > Cc: Yosry Ahmed <yosryahmed@google.com> > > Cc: Nhat Pham <nphamcs@gmail.com> > > Cc: Johannes Weiner <hannes@cmpxchg.org> > > Cc: David Hildenbrand <david@redhat.com> > > Cc: Hugh Dickins <hughd@google.com> > > Cc: Matthew Wilcox (Oracle) <willy@infradead.org> > > Cc: Shakeel Butt <shakeel.butt@linux.dev> > > Cc: Andi Kleen <ak@linux.intel.com> > > Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > > Cc: Chris Li <chrisl@kernel.org> > > Cc: "Huang, Ying" <ying.huang@intel.com> > > Cc: Kairui Song <kasong@tencent.com> > > Cc: Ryan Roberts <ryan.roberts@arm.com> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > --- > > -v2: > > * add separate counters rather than using pswpin/out; thanks > > for the comments from Usama, David, Yosry and Nhat; > > * Usama also suggested a new counter like swapped_zero, I > > prefer that one be separated as an enhancement patch not > > a hotfix. will probably handle it later on. > > > I dont think either of them would be a hotfix. As mentioned above, this isn't about fixing a bug; it's simply to ensure that swap-related metrics don't disappear. > > > Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++++ > > include/linux/vm_event_item.h | 2 ++ > > mm/memcontrol.c | 4 ++++ > > mm/page_io.c | 16 ++++++++++++++++ > > mm/vmstat.c | 2 ++ > > 5 files changed, 34 insertions(+) > > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > > index db3799f1483e..984eb3c9d05b 100644 > > --- a/Documentation/admin-guide/cgroup-v2.rst > > +++ b/Documentation/admin-guide/cgroup-v2.rst > > @@ -1599,6 +1599,16 @@ The following nested keys are defined. > > pglazyfreed (npn) > > Amount of reclaimed lazyfree pages > > > > + swpin_zero > > + Number of pages moved into memory with zero content, meaning no > > + copy exists in the backend swapfile, allowing swap-in to avoid > > + I/O read overhead. > > + > > + swpout_zero > > + Number of pages moved out of memory with zero content, meaning no > > + copy is needed in the backend swapfile, allowing swap-out to avoid > > + I/O write overhead. > > + > > Maybe zero-filled pages might be a better term in both. Do you mean dropping "with zero content" and replacing it by Number of zero-filled pages moved out of memory ? I'm fine with the change. > > > zswpin > > Number of pages moved in to memory from zswap. > > > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > > index aed952d04132..f70d0958095c 100644 > > --- a/include/linux/vm_event_item.h > > +++ b/include/linux/vm_event_item.h > > @@ -134,6 +134,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > > #ifdef CONFIG_SWAP > > SWAP_RA, > > SWAP_RA_HIT, > > + SWPIN_ZERO, > > + SWPOUT_ZERO, > > #ifdef CONFIG_KSM > > KSM_SWPIN_COPY, > > #endif > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 5e44d6e7591e..7b3503d12aaf 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -441,6 +441,10 @@ static const unsigned int memcg_vm_event_stat[] = { > > PGDEACTIVATE, > > PGLAZYFREE, > > PGLAZYFREED, > > +#ifdef CONFIG_SWAP > > + SWPIN_ZERO, > > + SWPOUT_ZERO, > > +#endif > > #ifdef CONFIG_ZSWAP > > ZSWPIN, > > ZSWPOUT, > > diff --git a/mm/page_io.c b/mm/page_io.c > > index 5d9b6e6cf96c..4b4ea8e49cf6 100644 > > --- a/mm/page_io.c > > +++ b/mm/page_io.c > > @@ -204,7 +204,9 @@ static bool is_folio_zero_filled(struct folio *folio) > > > > static void swap_zeromap_folio_set(struct folio *folio) > > { > > + struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio); > > struct swap_info_struct *sis = swp_swap_info(folio->swap); > > + int nr_pages = folio_nr_pages(folio); > > swp_entry_t entry; > > unsigned int i; > > > > @@ -212,6 +214,12 @@ static void swap_zeromap_folio_set(struct folio *folio) > > entry = page_swap_entry(folio_page(folio, i)); > > set_bit(swp_offset(entry), sis->zeromap); > > } > > + > > + count_vm_events(SWPOUT_ZERO, nr_pages); > > + if (objcg) { > > + count_objcg_events(objcg, SWPOUT_ZERO, nr_pages); > > + obj_cgroup_put(objcg); > > + } > > } > > > > static void swap_zeromap_folio_clear(struct folio *folio) > > @@ -507,6 +515,7 @@ static void sio_read_complete(struct kiocb *iocb, long ret) > > static bool swap_read_folio_zeromap(struct folio *folio) > > { > > int nr_pages = folio_nr_pages(folio); > > + struct obj_cgroup *objcg; > > bool is_zeromap; > > > > /* > > @@ -521,6 +530,13 @@ static bool swap_read_folio_zeromap(struct folio *folio) > > if (!is_zeromap) > > return false; > > > > + objcg = get_obj_cgroup_from_folio(folio); > > + count_vm_events(SWPIN_ZERO, nr_pages); > > + if (objcg) { > > + count_objcg_events(objcg, SWPIN_ZERO, nr_pages); > > + obj_cgroup_put(objcg); > > + } > > + > > folio_zero_range(folio, 0, folio_size(folio)); > > folio_mark_uptodate(folio); > > return true; > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > index 22a294556b58..c8ef7352f9ed 100644 > > --- a/mm/vmstat.c > > +++ b/mm/vmstat.c > > @@ -1418,6 +1418,8 @@ const char * const vmstat_text[] = { > > #ifdef CONFIG_SWAP > > "swap_ra", > > "swap_ra_hit", > > + "swpin_zero", > > + "swpout_zero", > > #ifdef CONFIG_KSM > > "ksm_swpin_copy", > > #endif > Thanks Barry
On 02.11.24 13:59, Barry Song wrote: > On Sat, Nov 2, 2024 at 8:32 PM Usama Arif <usamaarif642@gmail.com> wrote: >> >> >> >> On 02/11/2024 10:12, Barry Song wrote: >>> From: Barry Song <v-songbaohua@oppo.com> >>> >>> When the proportion of folios from the zero map is small, missing their >>> accounting may not significantly impact profiling. However, it’s easy >>> to construct a scenario where this becomes an issue—for example, >>> allocating 1 GB of memory, writing zeros from userspace, followed by >>> MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out >>> and swap-in counts seem to vanish into a black hole, potentially >>> causing semantic ambiguity. >>> >>> We have two ways to address this: >>> >>> 1. Add a separate counter specifically for the zero map. >>> 2. Continue using the current accounting, treating the zero map like >>> a normal backend. (This aligns with the current behavior of zRAM >>> when supporting same-page fills at the device level.) >>> >>> This patch adopts option 1 as pswpin/pswpout counters are that they >>> only apply to IO done directly to the backend device (as noted by >>> Nhat Pham). >>> >>> We can find these counters from /proc/vmstat (counters for the whole >>> system) and memcg's memory.stat (counters for the interested memcg). >>> >>> For example: >>> >>> $ grep -E 'swpin_zero|swpout_zero' /proc/vmstat >>> swpin_zero 1648 >>> swpout_zero 33536 >>> >>> $ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat >>> swpin_zero 3905 >>> swpout_zero 3985 >>> >>> Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap") >> I don't think its a hotfix (or even a fix). It was discussed in the initial >> series to add these as a follow up and Joshua was going to do this soon. >> Its not fixing any bug in the initial series. > > I would prefer that all kernel versions with zeromap include this > counter; otherwise, > it could be confusing to determine where swap-in and swap-out have occurred, > as shown by the small program below: > > p =malloc(1g); > write p to zero > madvise_pageout > read p; > > Previously, there was 1GB of swap-in and swap-out activity reported, but > now nothing is shown. > > I don't mean to suggest that there's a bug in the zeromap code; rather, > having this counter would help clear up any confusion. > > I didn't realize Joshua was handling it. Is he still planning to? If > so, I can leave it > with Joshua if that was the plan :-) > >> >>> Cc: Usama Arif <usamaarif642@gmail.com> >>> Cc: Chengming Zhou <chengming.zhou@linux.dev> >>> Cc: Yosry Ahmed <yosryahmed@google.com> >>> Cc: Nhat Pham <nphamcs@gmail.com> >>> Cc: Johannes Weiner <hannes@cmpxchg.org> >>> Cc: David Hildenbrand <david@redhat.com> >>> Cc: Hugh Dickins <hughd@google.com> >>> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> >>> Cc: Shakeel Butt <shakeel.butt@linux.dev> >>> Cc: Andi Kleen <ak@linux.intel.com> >>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >>> Cc: Chris Li <chrisl@kernel.org> >>> Cc: "Huang, Ying" <ying.huang@intel.com> >>> Cc: Kairui Song <kasong@tencent.com> >>> Cc: Ryan Roberts <ryan.roberts@arm.com> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> >>> --- >>> -v2: >>> * add separate counters rather than using pswpin/out; thanks >>> for the comments from Usama, David, Yosry and Nhat; >>> * Usama also suggested a new counter like swapped_zero, I >>> prefer that one be separated as an enhancement patch not >>> a hotfix. will probably handle it later on. >>> >> I dont think either of them would be a hotfix. > > As mentioned above, this isn't about fixing a bug; it's simply to ensure > that swap-related metrics don't disappear. Documentation/process/submitting-patches.rst: "A Fixes: tag indicates that the patch fixes an issue in a previous commit. It is used to make it easy to determine where a bug originated, which can help review a bug fix." If there is no BUG, I'm afraid you are abusing that tag. Also, I don't really understand the problem? We added an optimization, great. Who will be complaining about that? Above you write "it could be confusing to determine where swap-in and swap-out have occurred" -- when is that confusion supposed to happen in practice? How will the confused individuals know that they must take a look at that new metric, even if it is in place? I think we should just add the new stats and call it a day. -- Cheers, David / dhildenb
On Mon, 4 Nov 2024 13:32:55 +0100 David Hildenbrand <david@redhat.com> wrote: > > As mentioned above, this isn't about fixing a bug; it's simply to ensure > > that swap-related metrics don't disappear. > > Documentation/process/submitting-patches.rst: > > "A Fixes: tag indicates that the patch fixes an issue in a previous > commit. It is used to make it easy to determine where a bug originated, > which can help review a bug fix." > > If there is no BUG, I'm afraid you are abusing that tag. I think the abuse is reasonable. We have no Should-be-included-with:. 0ca0c24e3211 is only in 6.12-rcX so this is the time to make userspace-visible tweaks, so the 6.12 interfaces are the same as the 6.13+ interfaces (which is what I think is happening here?) And including the Fixes in this patch might be useful to someone who is backporting 0ca0c24e3211 into some earlier kernel for their own purposes.
On 05.11.24 04:40, Andrew Morton wrote: > On Mon, 4 Nov 2024 13:32:55 +0100 David Hildenbrand <david@redhat.com> wrote: > >>> As mentioned above, this isn't about fixing a bug; it's simply to ensure >>> that swap-related metrics don't disappear. >> >> Documentation/process/submitting-patches.rst: >> >> "A Fixes: tag indicates that the patch fixes an issue in a previous >> commit. It is used to make it easy to determine where a bug originated, >> which can help review a bug fix." >> >> If there is no BUG, I'm afraid you are abusing that tag. > > I think the abuse is reasonable. We have no Should-be-included-with:. A "Belongs-to:" might make sense, for this kind of stuff that is still only in an RFC. Or we update the doc to explicitly spell out this special case of using "Fixes" to sort out something into the RC. Because if this would be already in a released kernel, it would get a bit trickier: stable rules explicitly spell out "fix a real bug". > > 0ca0c24e3211 is only in 6.12-rcX so this is the time to make > userspace-visible tweaks, so the 6.12 interfaces are the same as the > 6.13+ interfaces (which is what I think is happening here?) > > And including the Fixes in this patch might be useful to someone who is > backporting 0ca0c24e3211 into some earlier kernel for their own > purposes. Just to be clear: adding new counters would hardly be fixing existing tools that perform calculations based on existing counters. So we are already changing the "userspace-visible" portion in some way, and I have no idea what in vmstat we consider "stable". But I still don't think it's all that big of a deal except in some handcrafted scenarios hardly anybody cares about; the cover letter is also pretty clear on that. So I'll shut up now and let people figure out the naming first, and if a new counter is required at all :) -- Cheers, David / dhildenb
On Tue, Nov 5, 2024 at 9:23 PM David Hildenbrand <david@redhat.com> wrote: > > On 05.11.24 04:40, Andrew Morton wrote: > > On Mon, 4 Nov 2024 13:32:55 +0100 David Hildenbrand <david@redhat.com> wrote: > > > >>> As mentioned above, this isn't about fixing a bug; it's simply to ensure > >>> that swap-related metrics don't disappear. > >> > >> Documentation/process/submitting-patches.rst: > >> > >> "A Fixes: tag indicates that the patch fixes an issue in a previous > >> commit. It is used to make it easy to determine where a bug originated, > >> which can help review a bug fix." > >> > >> If there is no BUG, I'm afraid you are abusing that tag. > > > > I think the abuse is reasonable. We have no Should-be-included-with:. > > A "Belongs-to:" might make sense, for this kind of stuff that is still > only in an RFC. Or we update the doc to explicitly spell out this > special case of using "Fixes" to sort out something into the RC. > > Because if this would be already in a released kernel, it would get a > bit trickier: stable rules explicitly spell out "fix a real bug". > > > > > 0ca0c24e3211 is only in 6.12-rcX so this is the time to make > > userspace-visible tweaks, so the 6.12 interfaces are the same as the > > 6.13+ interfaces (which is what I think is happening here?) > > > And including the Fixes in this patch might be useful to someone who is > > backporting 0ca0c24e3211 into some earlier kernel for their own > > purposes. > > Just to be clear: adding new counters would hardly be fixing existing > tools that perform calculations based on existing counters. So we are > already changing the "userspace-visible" portion in some way, and I have > no idea what in vmstat we consider "stable". > > But I still don't think it's all that big of a deal except in some > handcrafted scenarios hardly anybody cares about; the cover letter is > also pretty clear on that. I may have been mistaken in the cover letter. According to the zswap data Usama provided for servers, zero-filled pages accounted for about 1%. So really doesn't matter too much, but I just checked with Hailong from our team—he has data on same-page-filled usage in Android apps, which on average show 3-4% same-page-filled, with over 85% being zero-filled. Some apps even reach 6-7% zero-filled pages. We previously used these counters to profile optimizations, but with zeromap now serving as the frontend for swap files, those counters will disappear entirely from both zRAM and pswpin/pswpout metrics, as folios are filtered earlier. Hailong essentially has a table that looks like the below which could be collected from the existing counters: com.xxx.app 5% same-page-filled. 88% zero com.yyy.app 6% same-page-filled. 85% zero com.zzz.map 6.7 same-page-filled. 88% zero .... Anyone on 6.12 will be unable to track zero-filled pages unless they backport this patch from a newer kernel version if it doesn’t make it into 6.12. Whether it's marked as 'Belongs-to:' or 'Fixes:', I'd prefer we aim to land it in 6.12 :-) > > So I'll shut up now and let people figure out the naming first, and if a > new counter is required at all :) > > -- > Cheers, > > David / dhildenb > Thanks Barry
On 05.11.24 10:15, Barry Song wrote: > On Tue, Nov 5, 2024 at 9:23 PM David Hildenbrand <david@redhat.com> wrote: >> >> On 05.11.24 04:40, Andrew Morton wrote: >>> On Mon, 4 Nov 2024 13:32:55 +0100 David Hildenbrand <david@redhat.com> wrote: >>> >>>>> As mentioned above, this isn't about fixing a bug; it's simply to ensure >>>>> that swap-related metrics don't disappear. >>>> >>>> Documentation/process/submitting-patches.rst: >>>> >>>> "A Fixes: tag indicates that the patch fixes an issue in a previous >>>> commit. It is used to make it easy to determine where a bug originated, >>>> which can help review a bug fix." >>>> >>>> If there is no BUG, I'm afraid you are abusing that tag. >>> >>> I think the abuse is reasonable. We have no Should-be-included-with:. >> >> A "Belongs-to:" might make sense, for this kind of stuff that is still >> only in an RFC. Or we update the doc to explicitly spell out this >> special case of using "Fixes" to sort out something into the RC. >> >> Because if this would be already in a released kernel, it would get a >> bit trickier: stable rules explicitly spell out "fix a real bug". >> >>> >>> 0ca0c24e3211 is only in 6.12-rcX so this is the time to make >>> userspace-visible tweaks, so the 6.12 interfaces are the same as the >>> 6.13+ interfaces (which is what I think is happening here?) >> > > And including the Fixes in this patch might be useful to someone who is >>> backporting 0ca0c24e3211 into some earlier kernel for their own >>> purposes. >> >> Just to be clear: adding new counters would hardly be fixing existing >> tools that perform calculations based on existing counters. So we are >> already changing the "userspace-visible" portion in some way, and I have >> no idea what in vmstat we consider "stable". >> >> But I still don't think it's all that big of a deal except in some >> handcrafted scenarios hardly anybody cares about; the cover letter is >> also pretty clear on that. > > I may have been mistaken in the cover letter. According to the zswap data Usama > provided for servers, zero-filled pages accounted for about 1%. So > really doesn't > matter too much, but I just checked with Hailong from our team—he has data > on same-page-filled usage in Android apps, which on average show 3-4% > same-page-filled, with over 85% being zero-filled. Some apps even reach > 6-7% zero-filled pages. We previously used these counters to profile > optimizations, but with zeromap now serving as the frontend for swap files, > those counters will disappear entirely from both zRAM and pswpin/pswpout > metrics, as folios are filtered earlier. > > Hailong essentially has a table that looks like the below which could be > collected from the existing counters: > > com.xxx.app 5% same-page-filled. 88% zero > com.yyy.app 6% same-page-filled. 85% zero > com.zzz.map 6.7 same-page-filled. 88% zero > .... > > Anyone on 6.12 will be unable to track zero-filled pages unless they > backport this patch from a newer kernel version if it doesn’t make it > into 6.12. > > Whether it's marked as 'Belongs-to:' or 'Fixes:', I'd prefer we aim to > land it in > 6.12 :-) Thanks for sharing these numbers. Just for the records, I think the optimization is quite powerful. What I am not convinced is that the missing stat is a big deal. If I upgrade my car and it suddenly consumes 5% less fuel/energy I (a) either won't notice at all or (b) am happy and don't care why; I won't start digging through the updated manual, looking for some hidden gages that might explain why :) But I might be missing something important, and I do understand why it can be helpful to have it black-on-white how much this optimization gives you. So feel free to move forward with this as you like, having it in 6.12 sounds very reasonable to me. -- Cheers, David / dhildenb
On 05/11/2024 09:15, Barry Song wrote: > On Tue, Nov 5, 2024 at 9:23 PM David Hildenbrand <david@redhat.com> wrote: >> >> On 05.11.24 04:40, Andrew Morton wrote: >>> On Mon, 4 Nov 2024 13:32:55 +0100 David Hildenbrand <david@redhat.com> wrote: >>> >>>>> As mentioned above, this isn't about fixing a bug; it's simply to ensure >>>>> that swap-related metrics don't disappear. >>>> >>>> Documentation/process/submitting-patches.rst: >>>> >>>> "A Fixes: tag indicates that the patch fixes an issue in a previous >>>> commit. It is used to make it easy to determine where a bug originated, >>>> which can help review a bug fix." >>>> >>>> If there is no BUG, I'm afraid you are abusing that tag. >>> >>> I think the abuse is reasonable. We have no Should-be-included-with:. >> >> A "Belongs-to:" might make sense, for this kind of stuff that is still >> only in an RFC. Or we update the doc to explicitly spell out this >> special case of using "Fixes" to sort out something into the RC. >> >> Because if this would be already in a released kernel, it would get a >> bit trickier: stable rules explicitly spell out "fix a real bug". >> >>> >>> 0ca0c24e3211 is only in 6.12-rcX so this is the time to make >>> userspace-visible tweaks, so the 6.12 interfaces are the same as the >>> 6.13+ interfaces (which is what I think is happening here?) >> > > And including the Fixes in this patch might be useful to someone who is >>> backporting 0ca0c24e3211 into some earlier kernel for their own >>> purposes. >> >> Just to be clear: adding new counters would hardly be fixing existing >> tools that perform calculations based on existing counters. So we are >> already changing the "userspace-visible" portion in some way, and I have >> no idea what in vmstat we consider "stable". >> >> But I still don't think it's all that big of a deal except in some >> handcrafted scenarios hardly anybody cares about; the cover letter is >> also pretty clear on that. > > I may have been mistaken in the cover letter. According to the zswap data Usama > provided for servers, zero-filled pages accounted for about 1%. 10% not 1% (https://lore.kernel.org/all/20240612124750.2220726-1-usamaarif642@gmail.com/). > So > really doesn't > matter too much, but I just checked with Hailong from our team—he has data > on same-page-filled usage in Android apps, which on average show 3-4% > same-page-filled, with over 85% being zero-filled. Some apps even reach > 6-7% zero-filled pages. We previously used these counters to profile > optimizations, but with zeromap now serving as the frontend for swap files, > those counters will disappear entirely from both zRAM and pswpin/pswpout > metrics, as folios are filtered earlier. > This is what I meant in https://lore.kernel.org/all/79deed1a-9b0e-42e0-be2f-f0c3ef5fee11@gmail.com/ when I said it affects zram as well! I am happy with the current version of the patch, just need the change in Documentation/admin-guide/cgroup-v2.rst. Thanks, Usama > Hailong essentially has a table that looks like the below which could be > collected from the existing counters: > > com.xxx.app 5% same-page-filled. 88% zero > com.yyy.app 6% same-page-filled. 85% zero > com.zzz.map 6.7 same-page-filled. 88% zero > .... > > Anyone on 6.12 will be unable to track zero-filled pages unless they > backport this patch from a newer kernel version if it doesn’t make it > into 6.12. > > Whether it's marked as 'Belongs-to:' or 'Fixes:', I'd prefer we aim to > land it in > 6.12 :-) > >> >> So I'll shut up now and let people figure out the naming first, and if a >> new counter is required at all :) >> >> -- >> Cheers, >> >> David / dhildenb >> > > Thanks > Barry
On Tue, Nov 5, 2024 at 11:44 PM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 05/11/2024 09:15, Barry Song wrote: > > On Tue, Nov 5, 2024 at 9:23 PM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 05.11.24 04:40, Andrew Morton wrote: > >>> On Mon, 4 Nov 2024 13:32:55 +0100 David Hildenbrand <david@redhat.com> wrote: > >>> > >>>>> As mentioned above, this isn't about fixing a bug; it's simply to ensure > >>>>> that swap-related metrics don't disappear. > >>>> > >>>> Documentation/process/submitting-patches.rst: > >>>> > >>>> "A Fixes: tag indicates that the patch fixes an issue in a previous > >>>> commit. It is used to make it easy to determine where a bug originated, > >>>> which can help review a bug fix." > >>>> > >>>> If there is no BUG, I'm afraid you are abusing that tag. > >>> > >>> I think the abuse is reasonable. We have no Should-be-included-with:. > >> > >> A "Belongs-to:" might make sense, for this kind of stuff that is still > >> only in an RFC. Or we update the doc to explicitly spell out this > >> special case of using "Fixes" to sort out something into the RC. > >> > >> Because if this would be already in a released kernel, it would get a > >> bit trickier: stable rules explicitly spell out "fix a real bug". > >> > >>> > >>> 0ca0c24e3211 is only in 6.12-rcX so this is the time to make > >>> userspace-visible tweaks, so the 6.12 interfaces are the same as the > >>> 6.13+ interfaces (which is what I think is happening here?) > >> > > And including the Fixes in this patch might be useful to someone who is > >>> backporting 0ca0c24e3211 into some earlier kernel for their own > >>> purposes. > >> > >> Just to be clear: adding new counters would hardly be fixing existing > >> tools that perform calculations based on existing counters. So we are > >> already changing the "userspace-visible" portion in some way, and I have > >> no idea what in vmstat we consider "stable". > >> > >> But I still don't think it's all that big of a deal except in some > >> handcrafted scenarios hardly anybody cares about; the cover letter is > >> also pretty clear on that. > > > > I may have been mistaken in the cover letter. According to the zswap data Usama > > provided for servers, zero-filled pages accounted for about 1%. > > 10% not 1% (https://lore.kernel.org/all/20240612124750.2220726-1-usamaarif642@gmail.com/). > Sorry. My memory must have faded; my mind is confused by the 1% non-zero same-page data and the 10% same-page data. Getting old :-) > > So > > really doesn't > > matter too much, but I just checked with Hailong from our team—he has data > > on same-page-filled usage in Android apps, which on average show 3-4% > > same-page-filled, with over 85% being zero-filled. Some apps even reach > > 6-7% zero-filled pages. We previously used these counters to profile > > optimizations, but with zeromap now serving as the frontend for swap files, > > those counters will disappear entirely from both zRAM and pswpin/pswpout > > metrics, as folios are filtered earlier. > > > This is what I meant in https://lore.kernel.org/all/79deed1a-9b0e-42e0-be2f-f0c3ef5fee11@gmail.com/ > when I said it affects zram as well! > > I am happy with the current version of the patch, just need the change > in Documentation/admin-guide/cgroup-v2.rst. I will update the document and send version 3 tomorrow, incorporating both your comments on "zero-filled" and David's suggestion regarding "move out of memory". > > Thanks, > Usama > > > Hailong essentially has a table that looks like the below which could be > > collected from the existing counters: > > > > com.xxx.app 5% same-page-filled. 88% zero > > com.yyy.app 6% same-page-filled. 85% zero > > com.zzz.map 6.7 same-page-filled. 88% zero > > .... > > > > Anyone on 6.12 will be unable to track zero-filled pages unless they > > backport this patch from a newer kernel version if it doesn’t make it > > into 6.12. > > > > Whether it's marked as 'Belongs-to:' or 'Fixes:', I'd prefer we aim to > > land it in > > 6.12 :-) > > > >> > >> So I'll shut up now and let people figure out the naming first, and if a > >> new counter is required at all :) > >> > >> -- > >> Cheers, > >> > >> David / dhildenb > >> > > Thanks Barry
On 05.11.24 09:23, David Hildenbrand wrote: > On 05.11.24 04:40, Andrew Morton wrote: >> On Mon, 4 Nov 2024 13:32:55 +0100 David Hildenbrand <david@redhat.com> wrote: >> >>>> As mentioned above, this isn't about fixing a bug; it's simply to ensure >>>> that swap-related metrics don't disappear. >>> >>> Documentation/process/submitting-patches.rst: >>> >>> "A Fixes: tag indicates that the patch fixes an issue in a previous >>> commit. It is used to make it easy to determine where a bug originated, >>> which can help review a bug fix." >>> >>> If there is no BUG, I'm afraid you are abusing that tag. >> >> I think the abuse is reasonable. We have no Should-be-included-with:. > > A "Belongs-to:" might make sense, for this kind of stuff that is still > only in an RFC. s/RFC/RC/ -- Cheers, David / dhildenb
On 02/11/2024 12:59, Barry Song wrote: > On Sat, Nov 2, 2024 at 8:32 PM Usama Arif <usamaarif642@gmail.com> wrote: >> >> >> >> On 02/11/2024 10:12, Barry Song wrote: >>> From: Barry Song <v-songbaohua@oppo.com> >>> >>> When the proportion of folios from the zero map is small, missing their >>> accounting may not significantly impact profiling. However, it’s easy >>> to construct a scenario where this becomes an issue—for example, >>> allocating 1 GB of memory, writing zeros from userspace, followed by >>> MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out >>> and swap-in counts seem to vanish into a black hole, potentially >>> causing semantic ambiguity. >>> >>> We have two ways to address this: >>> >>> 1. Add a separate counter specifically for the zero map. >>> 2. Continue using the current accounting, treating the zero map like >>> a normal backend. (This aligns with the current behavior of zRAM >>> when supporting same-page fills at the device level.) >>> >>> This patch adopts option 1 as pswpin/pswpout counters are that they >>> only apply to IO done directly to the backend device (as noted by >>> Nhat Pham). >>> >>> We can find these counters from /proc/vmstat (counters for the whole >>> system) and memcg's memory.stat (counters for the interested memcg). >>> >>> For example: >>> >>> $ grep -E 'swpin_zero|swpout_zero' /proc/vmstat >>> swpin_zero 1648 >>> swpout_zero 33536 >>> >>> $ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat >>> swpin_zero 3905 >>> swpout_zero 3985 >>> >>> Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap") >> I don't think its a hotfix (or even a fix). It was discussed in the initial >> series to add these as a follow up and Joshua was going to do this soon. >> Its not fixing any bug in the initial series. > > I would prefer that all kernel versions with zeromap include this > counter; otherwise, > it could be confusing to determine where swap-in and swap-out have occurred, > as shown by the small program below: > > p =malloc(1g); > write p to zero > madvise_pageout > read p; > > Previously, there was 1GB of swap-in and swap-out activity reported, but > now nothing is shown. > > I don't mean to suggest that there's a bug in the zeromap code; rather, > having this counter would help clear up any confusion. > > I didn't realize Joshua was handling it. Is he still planning to? If > so, I can leave it > with Joshua if that was the plan :-) > Please do continue with this patch, I think he was going to look at the swapped_zero version that we discussed earlier anyways. Will let Joshua comment on it. >> >>> Cc: Usama Arif <usamaarif642@gmail.com> >>> Cc: Chengming Zhou <chengming.zhou@linux.dev> >>> Cc: Yosry Ahmed <yosryahmed@google.com> >>> Cc: Nhat Pham <nphamcs@gmail.com> >>> Cc: Johannes Weiner <hannes@cmpxchg.org> >>> Cc: David Hildenbrand <david@redhat.com> >>> Cc: Hugh Dickins <hughd@google.com> >>> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> >>> Cc: Shakeel Butt <shakeel.butt@linux.dev> >>> Cc: Andi Kleen <ak@linux.intel.com> >>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >>> Cc: Chris Li <chrisl@kernel.org> >>> Cc: "Huang, Ying" <ying.huang@intel.com> >>> Cc: Kairui Song <kasong@tencent.com> >>> Cc: Ryan Roberts <ryan.roberts@arm.com> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> >>> --- >>> -v2: >>> * add separate counters rather than using pswpin/out; thanks >>> for the comments from Usama, David, Yosry and Nhat; >>> * Usama also suggested a new counter like swapped_zero, I >>> prefer that one be separated as an enhancement patch not >>> a hotfix. will probably handle it later on. >>> >> I dont think either of them would be a hotfix. > > As mentioned above, this isn't about fixing a bug; it's simply to ensure > that swap-related metrics don't disappear. > >> >>> Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++++ >>> include/linux/vm_event_item.h | 2 ++ >>> mm/memcontrol.c | 4 ++++ >>> mm/page_io.c | 16 ++++++++++++++++ >>> mm/vmstat.c | 2 ++ >>> 5 files changed, 34 insertions(+) >>> >>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst >>> index db3799f1483e..984eb3c9d05b 100644 >>> --- a/Documentation/admin-guide/cgroup-v2.rst >>> +++ b/Documentation/admin-guide/cgroup-v2.rst >>> @@ -1599,6 +1599,16 @@ The following nested keys are defined. >>> pglazyfreed (npn) >>> Amount of reclaimed lazyfree pages >>> >>> + swpin_zero >>> + Number of pages moved into memory with zero content, meaning no >>> + copy exists in the backend swapfile, allowing swap-in to avoid >>> + I/O read overhead. >>> + >>> + swpout_zero >>> + Number of pages moved out of memory with zero content, meaning no >>> + copy is needed in the backend swapfile, allowing swap-out to avoid >>> + I/O write overhead. >>> + >> >> Maybe zero-filled pages might be a better term in both. > > Do you mean dropping "with zero content" and replacing it by > Number of zero-filled pages moved out of memory ? I'm fine > with the change. Yes, mainly because if you do swapout of memory that was memset 0 its still content, just zero-filled. Thanks, Usama > >> >>> zswpin >>> Number of pages moved in to memory from zswap. >>> >>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h >>> index aed952d04132..f70d0958095c 100644 >>> --- a/include/linux/vm_event_item.h >>> +++ b/include/linux/vm_event_item.h >>> @@ -134,6 +134,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, >>> #ifdef CONFIG_SWAP >>> SWAP_RA, >>> SWAP_RA_HIT, >>> + SWPIN_ZERO, >>> + SWPOUT_ZERO, >>> #ifdef CONFIG_KSM >>> KSM_SWPIN_COPY, >>> #endif >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >>> index 5e44d6e7591e..7b3503d12aaf 100644 >>> --- a/mm/memcontrol.c >>> +++ b/mm/memcontrol.c >>> @@ -441,6 +441,10 @@ static const unsigned int memcg_vm_event_stat[] = { >>> PGDEACTIVATE, >>> PGLAZYFREE, >>> PGLAZYFREED, >>> +#ifdef CONFIG_SWAP >>> + SWPIN_ZERO, >>> + SWPOUT_ZERO, >>> +#endif >>> #ifdef CONFIG_ZSWAP >>> ZSWPIN, >>> ZSWPOUT, >>> diff --git a/mm/page_io.c b/mm/page_io.c >>> index 5d9b6e6cf96c..4b4ea8e49cf6 100644 >>> --- a/mm/page_io.c >>> +++ b/mm/page_io.c >>> @@ -204,7 +204,9 @@ static bool is_folio_zero_filled(struct folio *folio) >>> >>> static void swap_zeromap_folio_set(struct folio *folio) >>> { >>> + struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio); >>> struct swap_info_struct *sis = swp_swap_info(folio->swap); >>> + int nr_pages = folio_nr_pages(folio); >>> swp_entry_t entry; >>> unsigned int i; >>> >>> @@ -212,6 +214,12 @@ static void swap_zeromap_folio_set(struct folio *folio) >>> entry = page_swap_entry(folio_page(folio, i)); >>> set_bit(swp_offset(entry), sis->zeromap); >>> } >>> + >>> + count_vm_events(SWPOUT_ZERO, nr_pages); >>> + if (objcg) { >>> + count_objcg_events(objcg, SWPOUT_ZERO, nr_pages); >>> + obj_cgroup_put(objcg); >>> + } >>> } >>> >>> static void swap_zeromap_folio_clear(struct folio *folio) >>> @@ -507,6 +515,7 @@ static void sio_read_complete(struct kiocb *iocb, long ret) >>> static bool swap_read_folio_zeromap(struct folio *folio) >>> { >>> int nr_pages = folio_nr_pages(folio); >>> + struct obj_cgroup *objcg; >>> bool is_zeromap; >>> >>> /* >>> @@ -521,6 +530,13 @@ static bool swap_read_folio_zeromap(struct folio *folio) >>> if (!is_zeromap) >>> return false; >>> >>> + objcg = get_obj_cgroup_from_folio(folio); >>> + count_vm_events(SWPIN_ZERO, nr_pages); >>> + if (objcg) { >>> + count_objcg_events(objcg, SWPIN_ZERO, nr_pages); >>> + obj_cgroup_put(objcg); >>> + } >>> + >>> folio_zero_range(folio, 0, folio_size(folio)); >>> folio_mark_uptodate(folio); >>> return true; >>> diff --git a/mm/vmstat.c b/mm/vmstat.c >>> index 22a294556b58..c8ef7352f9ed 100644 >>> --- a/mm/vmstat.c >>> +++ b/mm/vmstat.c >>> @@ -1418,6 +1418,8 @@ const char * const vmstat_text[] = { >>> #ifdef CONFIG_SWAP >>> "swap_ra", >>> "swap_ra_hit", >>> + "swpin_zero", >>> + "swpout_zero", >>> #ifdef CONFIG_KSM >>> "ksm_swpin_copy", >>> #endif >> > > Thanks > Barry
On 02/11/2024 14:43:07, Usama Arif <usamaarif642@gmail.com> wrote: > On 02/11/2024 12:59, Barry Song wrote: >> On Sat, Nov 2, 2024 at 8:32 PM Usama Arif <usamaarif642@gmail.com> wrote: >>> >>> On 02/11/2024 10:12, Barry Song wrote: >>>> From: Barry Song <v-songbaohua@oppo.com> >>>> >>>> When the proportion of folios from the zero map is small, missing their >>>> accounting may not significantly impact profiling. However, it’s easy >>>> to construct a scenario where this becomes an issue—for example, >>>> allocating 1 GB of memory, writing zeros from userspace, followed by >>>> MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out >>>> and swap-in counts seem to vanish into a black hole, potentially >>>> causing semantic ambiguity. >>>> >>>> This patch adopts option 1 as pswpin/pswpout counters are that they >>>> only apply to IO done directly to the backend device (as noted by >>>> Nhat Pham). >>>> >>>> Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap") >>> I don't think its a hotfix (or even a fix). It was discussed in the initial >>> series to add these as a follow up and Joshua was going to do this soon. >>> Its not fixing any bug in the initial series. >> >> I didn't realize Joshua was handling it. Is he still planning to? If >> so, I can leave it >> with Joshua if that was the plan :-) >> > > Please do continue with this patch, I think he was going to look at the > swapped_zero version that we discussed earlier anyways. Will let Joshua > comment on it. Hi Usama and Barry, First of all, I am sorry for not participating in the previous conversation about this, it is my fault for the lack of communication on my end on the status of zero_swapped (name pending). Sorry for the confusion! I am hoping to pick this up in a few days (I have been working on a few patches in different subsystems, but I will wrap up the work on these very soon). As far as I can tell, zero_swapped and swp{in,out}zero seem to be orthogonal, and as Nhat pointed out [1] I think there is need for both. Thank you for this patch Barry, and thank you Usama for keeping me in the loop! Have a great day! Joshua [1] https://lore.kernel.org/linux-mm/882008b6-13e0-41d8-91fa-f26c585120d8@gmail.com/T/#m7c5017bc97d56843242d3e006cd7e1f0fd0f1a38
© 2016 - 2024 Red Hat, Inc.