[v7] mm: ZSWAP swap-out of mTHP folios

[PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio.

Posted by Kanchana P Sridhar 1 year, 4 months ago

For zswap_store() to handle mTHP folios, we need to iterate through each
page in the mTHP, compress it and store it in the zswap pool. This patch
introduces an auxiliary function zswap_store_page() that provides this
functionality.

The function signature reflects the design intent, namely, for it
to be invoked by zswap_store() per-page in an mTHP. Hence, the folio's
objcg and the zswap_pool to use are input parameters for sake of
efficiency and consistency.

The functionality in zswap_store_page() is reused and adapted from
Ryan Roberts' RFC patch [1]:

  "[RFC,v1] mm: zswap: Store large folios without splitting"

  [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

Co-developed-by: Ryan Roberts
Signed-off-by:
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 88 insertions(+)

diff --git a/mm/zswap.c b/mm/zswap.c
index 9bea948d653e..8f2e0ab34c84 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1463,6 +1463,94 @@ static void zswap_delete_stored_offsets(struct xarray *tree,
 	}
 }
 
+/*
+ * Stores the page at specified "index" in a folio.
+ *
+ * @folio: The folio to store in zswap.
+ * @index: Index into the page in the folio that this function will store.
+ * @objcg: The folio's objcg.
+ * @pool:  The zswap_pool to store the compressed data for the page.
+ */
+static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
+					    struct obj_cgroup *objcg,
+					    struct zswap_pool *pool)
+{
+	swp_entry_t swp = folio->swap;
+	int type = swp_type(swp);
+	pgoff_t offset = swp_offset(swp) + index;
+	struct page *page = folio_page(folio, index);
+	struct xarray *tree = swap_zswap_tree(swp);
+	struct zswap_entry *entry;
+
+	if (objcg)
+		obj_cgroup_get(objcg);
+
+	if (zswap_check_limits())
+		goto reject;
+
+	/* allocate entry */
+	entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
+	if (!entry) {
+		zswap_reject_kmemcache_fail++;
+		goto reject;
+	}
+
+	/* if entry is successfully added, it keeps the reference */
+	if (!zswap_pool_get(pool))
+		goto freepage;
+
+	entry->pool = pool;
+
+	if (!zswap_compress(page, entry))
+		goto put_pool;
+
+	entry->swpentry = swp_entry(type, offset);
+	entry->objcg = objcg;
+	entry->referenced = true;
+
+	if (!zswap_store_entry(tree, entry))
+		goto store_failed;
+
+	if (objcg) {
+		obj_cgroup_charge_zswap(objcg, entry->length);
+		count_objcg_event(objcg, ZSWPOUT);
+	}
+
+	/*
+	 * We finish initializing the entry while it's already in xarray.
+	 * This is safe because:
+	 *
+	 * 1. Concurrent stores and invalidations are excluded by folio lock.
+	 *
+	 * 2. Writeback is excluded by the entry not being on the LRU yet.
+	 *    The publishing order matters to prevent writeback from seeing
+	 *    an incoherent entry.
+	 */
+	if (entry->length) {
+		INIT_LIST_HEAD(&entry->lru);
+		zswap_lru_add(&zswap_list_lru, entry);
+	}
+
+	/* update stats */
+	atomic_inc(&zswap_stored_pages);
+	count_vm_event(ZSWPOUT);
+
+	return true;
+
+store_failed:
+	zpool_free(entry->pool->zpool, entry->handle);
+put_pool:
+	zswap_pool_put(pool);
+freepage:
+	zswap_entry_cache_free(entry);
+reject:
+	obj_cgroup_put(objcg);
+	if (zswap_pool_reached_full)
+		queue_work(shrink_wq, &zswap_shrink_work);
+
+	return false;
+}
+
 bool zswap_store(struct folio *folio)
 {
 	long nr_pages = folio_nr_pages(folio);
-- 
2.27.0

Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio.

Posted by Yosry Ahmed 1 year, 4 months ago

On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> For zswap_store() to handle mTHP folios, we need to iterate through each
> page in the mTHP, compress it and store it in the zswap pool. This patch
> introduces an auxiliary function zswap_store_page() that provides this
> functionality.
>
> The function signature reflects the design intent, namely, for it
> to be invoked by zswap_store() per-page in an mTHP. Hence, the folio's
> objcg and the zswap_pool to use are input parameters for sake of
> efficiency and consistency.
>
> The functionality in zswap_store_page() is reused and adapted from
> Ryan Roberts' RFC patch [1]:
>
>   "[RFC,v1] mm: zswap: Store large folios without splitting"
>
>   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Co-developed-by: Ryan Roberts
> Signed-off-by:
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 88 insertions(+)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 9bea948d653e..8f2e0ab34c84 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1463,6 +1463,94 @@ static void zswap_delete_stored_offsets(struct xarray *tree,
>         }
>  }
>
> +/*
> + * Stores the page at specified "index" in a folio.
> + *
> + * @folio: The folio to store in zswap.
> + * @index: Index into the page in the folio that this function will store.
> + * @objcg: The folio's objcg.
> + * @pool:  The zswap_pool to store the compressed data for the page.
> + */
> +static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
> +                                           struct obj_cgroup *objcg,
> +                                           struct zswap_pool *pool)

Why are we adding an unused function that duplicates code in
zswap_store(), then using it in the following patch? This makes it
difficult to see that the function does the same thing. This patch
should be refactoring the per-page code out of zswap_store() into
zswap_store_page(), and directly calling zswap_store_page() from
zswap_store().

> +{
> +       swp_entry_t swp = folio->swap;
> +       int type = swp_type(swp);
> +       pgoff_t offset = swp_offset(swp) + index;
> +       struct page *page = folio_page(folio, index);
> +       struct xarray *tree = swap_zswap_tree(swp);
> +       struct zswap_entry *entry;
> +
> +       if (objcg)
> +               obj_cgroup_get(objcg);
> +
> +       if (zswap_check_limits())
> +               goto reject;
> +
> +       /* allocate entry */
> +       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> +       if (!entry) {
> +               zswap_reject_kmemcache_fail++;
> +               goto reject;
> +       }
> +
> +       /* if entry is successfully added, it keeps the reference */
> +       if (!zswap_pool_get(pool))
> +               goto freepage;

I think we can batch this for all pages in zswap_store(), maybe first
add zswap_pool_get_many().

I am also wondering if it would be better to batch the limit checking
and allocating the entries, to front load any failures before we start
compression. Not sure if that's overall better though.

To batch allocate entries we will have to also allocate an array to
hold them. To batch the limit checking we will have to either allow
going further over limit for mTHPs, or check if there is enough
clearance to allow for compressing all the pages. Using the
uncompressed size will lead to false negatives though, so maybe we can
start tracking the average compression ratio for better limit
checking.

Nhat, Johannes, any thoughts here? I need someone to tell me if I am
overthinking this :)

> +
> +       entry->pool = pool;
> +
> +       if (!zswap_compress(page, entry))
> +               goto put_pool;
> +
> +       entry->swpentry = swp_entry(type, offset);
> +       entry->objcg = objcg;
> +       entry->referenced = true;
> +
> +       if (!zswap_store_entry(tree, entry))
> +               goto store_failed;
> +
> +       if (objcg) {
> +               obj_cgroup_charge_zswap(objcg, entry->length);
> +               count_objcg_event(objcg, ZSWPOUT);
> +       }
> +
> +       /*
> +        * We finish initializing the entry while it's already in xarray.
> +        * This is safe because:
> +        *
> +        * 1. Concurrent stores and invalidations are excluded by folio lock.
> +        *
> +        * 2. Writeback is excluded by the entry not being on the LRU yet.
> +        *    The publishing order matters to prevent writeback from seeing
> +        *    an incoherent entry.
> +        */
> +       if (entry->length) {
> +               INIT_LIST_HEAD(&entry->lru);
> +               zswap_lru_add(&zswap_list_lru, entry);
> +       }
> +
> +       /* update stats */
> +       atomic_inc(&zswap_stored_pages);
> +       count_vm_event(ZSWPOUT);

We should probably also batch updating the stats. It actually seems
like now we don't handle rolling them back upon failure.


> +
> +       return true;
> +
> +store_failed:
> +       zpool_free(entry->pool->zpool, entry->handle);
> +put_pool:
> +       zswap_pool_put(pool);
> +freepage:
> +       zswap_entry_cache_free(entry);
> +reject:
> +       obj_cgroup_put(objcg);
> +       if (zswap_pool_reached_full)
> +               queue_work(shrink_wq, &zswap_shrink_work);
> +
> +       return false;
> +}
> +
>  bool zswap_store(struct folio *folio)
>  {
>         long nr_pages = folio_nr_pages(folio);
> --
> 2.27.0
>

RE: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio.

Posted by Sridhar, Kanchana P 1 year, 4 months ago

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, September 24, 2024 12:29 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page
> in a folio.
> 
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > For zswap_store() to handle mTHP folios, we need to iterate through each
> > page in the mTHP, compress it and store it in the zswap pool. This patch
> > introduces an auxiliary function zswap_store_page() that provides this
> > functionality.
> >
> > The function signature reflects the design intent, namely, for it
> > to be invoked by zswap_store() per-page in an mTHP. Hence, the folio's
> > objcg and the zswap_pool to use are input parameters for sake of
> > efficiency and consistency.
> >
> > The functionality in zswap_store_page() is reused and adapted from
> > Ryan Roberts' RFC patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> ryan.roberts@arm.com/T/#u
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 88
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 88 insertions(+)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 9bea948d653e..8f2e0ab34c84 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1463,6 +1463,94 @@ static void zswap_delete_stored_offsets(struct
> xarray *tree,
> >         }
> >  }
> >
> > +/*
> > + * Stores the page at specified "index" in a folio.
> > + *
> > + * @folio: The folio to store in zswap.
> > + * @index: Index into the page in the folio that this function will store.
> > + * @objcg: The folio's objcg.
> > + * @pool:  The zswap_pool to store the compressed data for the page.
> > + */
> > +static bool __maybe_unused zswap_store_page(struct folio *folio, long
> index,
> > +                                           struct obj_cgroup *objcg,
> > +                                           struct zswap_pool *pool)
> 
> Why are we adding an unused function that duplicates code in
> zswap_store(), then using it in the following patch? This makes it
> difficult to see that the function does the same thing. This patch
> should be refactoring the per-page code out of zswap_store() into
> zswap_store_page(), and directly calling zswap_store_page() from
> zswap_store().

Sure, thanks Yosry for this suggestion. Will fix in v8.

> 
> > +{
> > +       swp_entry_t swp = folio->swap;
> > +       int type = swp_type(swp);
> > +       pgoff_t offset = swp_offset(swp) + index;
> > +       struct page *page = folio_page(folio, index);
> > +       struct xarray *tree = swap_zswap_tree(swp);
> > +       struct zswap_entry *entry;
> > +
> > +       if (objcg)
> > +               obj_cgroup_get(objcg);
> > +
> > +       if (zswap_check_limits())
> > +               goto reject;
> > +
> > +       /* allocate entry */
> > +       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > +       if (!entry) {
> > +               zswap_reject_kmemcache_fail++;
> > +               goto reject;
> > +       }
> > +
> > +       /* if entry is successfully added, it keeps the reference */
> > +       if (!zswap_pool_get(pool))
> > +               goto freepage;
> 
> I think we can batch this for all pages in zswap_store(), maybe first
> add zswap_pool_get_many().
> 
> I am also wondering if it would be better to batch the limit checking
> and allocating the entries, to front load any failures before we start
> compression. Not sure if that's overall better though.
> 
> To batch allocate entries we will have to also allocate an array to
> hold them. To batch the limit checking we will have to either allow
> going further over limit for mTHPs, or check if there is enough
> clearance to allow for compressing all the pages. Using the
> uncompressed size will lead to false negatives though, so maybe we can
> start tracking the average compression ratio for better limit
> checking.
> 
> Nhat, Johannes, any thoughts here? I need someone to tell me if I am
> overthinking this :)

These are all good points. I suppose I was thinking along the same lines
of what Nhat mentioned in an earlier comment. I was trying the
incremental zswap_pool_get() and limit checks and shrinker invocations
in case of (all) error conditions to allow different concurrent stores to make
progress, without favoring only one process's mTHP store. I was thinking
this would have minimal impact on the process(es) that see the zswap
limit being exceeded, and that this would be better than preemptively
checking for the entire mTHP and failing (this could also complicate things
where no one makes progress because multiple processes run the batch
checks and fail, when realistically one/many could have triggered
the shrinker before erroring out, and at least one could have made
progress).

Would appreciate your perspectives on how this should be handled,
and will implement a solution in v8 accordingly.

Thanks,
Kanchana

> 
> > +
> > +       entry->pool = pool;
> > +
> > +       if (!zswap_compress(page, entry))
> > +               goto put_pool;
> > +
> > +       entry->swpentry = swp_entry(type, offset);
> > +       entry->objcg = objcg;
> > +       entry->referenced = true;
> > +
> > +       if (!zswap_store_entry(tree, entry))
> > +               goto store_failed;
> > +
> > +       if (objcg) {
> > +               obj_cgroup_charge_zswap(objcg, entry->length);
> > +               count_objcg_event(objcg, ZSWPOUT);
> > +       }
> > +
> > +       /*
> > +        * We finish initializing the entry while it's already in xarray.
> > +        * This is safe because:
> > +        *
> > +        * 1. Concurrent stores and invalidations are excluded by folio lock.
> > +        *
> > +        * 2. Writeback is excluded by the entry not being on the LRU yet.
> > +        *    The publishing order matters to prevent writeback from seeing
> > +        *    an incoherent entry.
> > +        */
> > +       if (entry->length) {
> > +               INIT_LIST_HEAD(&entry->lru);
> > +               zswap_lru_add(&zswap_list_lru, entry);
> > +       }
> > +
> > +       /* update stats */
> > +       atomic_inc(&zswap_stored_pages);
> > +       count_vm_event(ZSWPOUT);
> 
> We should probably also batch updating the stats. It actually seems
> like now we don't handle rolling them back upon failure.

Good point! I assume you are referring only to the "ZSWPOUT" vm event stats
updates and not the "zswap_stored_pages" (since latter is used in limit checking)?

I will fix this in v8.

Thanks,
Kanchana

> 
> 
> > +
> > +       return true;
> > +
> > +store_failed:
> > +       zpool_free(entry->pool->zpool, entry->handle);
> > +put_pool:
> > +       zswap_pool_put(pool);
> > +freepage:
> > +       zswap_entry_cache_free(entry);
> > +reject:
> > +       obj_cgroup_put(objcg);
> > +       if (zswap_pool_reached_full)
> > +               queue_work(shrink_wq, &zswap_shrink_work);
> > +
> > +       return false;
> > +}
> > +
> >  bool zswap_store(struct folio *folio)
> >  {
> >         long nr_pages = folio_nr_pages(folio);
> > --
> > 2.27.0
> >

Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio.

Posted by Yosry Ahmed 1 year, 4 months ago

[..]
> >
> > > +{
> > > +       swp_entry_t swp = folio->swap;
> > > +       int type = swp_type(swp);
> > > +       pgoff_t offset = swp_offset(swp) + index;
> > > +       struct page *page = folio_page(folio, index);
> > > +       struct xarray *tree = swap_zswap_tree(swp);
> > > +       struct zswap_entry *entry;
> > > +
> > > +       if (objcg)
> > > +               obj_cgroup_get(objcg);
> > > +
> > > +       if (zswap_check_limits())
> > > +               goto reject;
> > > +
> > > +       /* allocate entry */
> > > +       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > > +       if (!entry) {
> > > +               zswap_reject_kmemcache_fail++;
> > > +               goto reject;
> > > +       }
> > > +
> > > +       /* if entry is successfully added, it keeps the reference */
> > > +       if (!zswap_pool_get(pool))
> > > +               goto freepage;
> >
> > I think we can batch this for all pages in zswap_store(), maybe first
> > add zswap_pool_get_many().
> >
> > I am also wondering if it would be better to batch the limit checking
> > and allocating the entries, to front load any failures before we start
> > compression. Not sure if that's overall better though.
> >
> > To batch allocate entries we will have to also allocate an array to
> > hold them. To batch the limit checking we will have to either allow
> > going further over limit for mTHPs, or check if there is enough
> > clearance to allow for compressing all the pages. Using the
> > uncompressed size will lead to false negatives though, so maybe we can
> > start tracking the average compression ratio for better limit
> > checking.
> >
> > Nhat, Johannes, any thoughts here? I need someone to tell me if I am
> > overthinking this :)
>
> These are all good points. I suppose I was thinking along the same lines
> of what Nhat mentioned in an earlier comment. I was trying the
> incremental zswap_pool_get() and limit checks and shrinker invocations
> in case of (all) error conditions to allow different concurrent stores to make
> progress, without favoring only one process's mTHP store. I was thinking
> this would have minimal impact on the process(es) that see the zswap
> limit being exceeded, and that this would be better than preemptively
> checking for the entire mTHP and failing (this could also complicate things
> where no one makes progress because multiple processes run the batch
> checks and fail, when realistically one/many could have triggered
> the shrinker before erroring out, and at least one could have made
> progress).

On the other hand, if we allow concurrent mTHP swapouts to do limit
checks incrementally, they may all fail at the last page. While if
they all do limit checks beforehand, one of them can proceed.

I think we need to agree on a higher-level strategy for limit
checking, both global and per-memcg. The per-memcg limit should be
stricter though, so we may end up having different policies.

>
> Would appreciate your perspectives on how this should be handled,
> and will implement a solution in v8 accordingly.
>
> Thanks,
> Kanchana
>
> >
> > > +
> > > +       entry->pool = pool;
> > > +
> > > +       if (!zswap_compress(page, entry))
> > > +               goto put_pool;
> > > +
> > > +       entry->swpentry = swp_entry(type, offset);
> > > +       entry->objcg = objcg;
> > > +       entry->referenced = true;
> > > +
> > > +       if (!zswap_store_entry(tree, entry))
> > > +               goto store_failed;
> > > +
> > > +       if (objcg) {
> > > +               obj_cgroup_charge_zswap(objcg, entry->length);
> > > +               count_objcg_event(objcg, ZSWPOUT);
> > > +       }
> > > +
> > > +       /*
> > > +        * We finish initializing the entry while it's already in xarray.
> > > +        * This is safe because:
> > > +        *
> > > +        * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > +        *
> > > +        * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > +        *    The publishing order matters to prevent writeback from seeing
> > > +        *    an incoherent entry.
> > > +        */
> > > +       if (entry->length) {
> > > +               INIT_LIST_HEAD(&entry->lru);
> > > +               zswap_lru_add(&zswap_list_lru, entry);
> > > +       }
> > > +
> > > +       /* update stats */
> > > +       atomic_inc(&zswap_stored_pages);
> > > +       count_vm_event(ZSWPOUT);
> >
> > We should probably also batch updating the stats. It actually seems
> > like now we don't handle rolling them back upon failure.
>
> Good point! I assume you are referring only to the "ZSWPOUT" vm event stats
> updates and not the "zswap_stored_pages" (since latter is used in limit checking)?

I actually meant both. Do we rollback changes to zswap_stored_pages
when some stores succeed and some of them fail?

I think it's more correct and efficient to update the atomic once
after all the pages are successfully compressed and stored.

RE: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio.

Posted by Sridhar, Kanchana P 1 year, 4 months ago

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, September 24, 2024 5:47 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page
> in a folio.
> 
> [..]
> > >
> > > > +{
> > > > +       swp_entry_t swp = folio->swap;
> > > > +       int type = swp_type(swp);
> > > > +       pgoff_t offset = swp_offset(swp) + index;
> > > > +       struct page *page = folio_page(folio, index);
> > > > +       struct xarray *tree = swap_zswap_tree(swp);
> > > > +       struct zswap_entry *entry;
> > > > +
> > > > +       if (objcg)
> > > > +               obj_cgroup_get(objcg);
> > > > +
> > > > +       if (zswap_check_limits())
> > > > +               goto reject;
> > > > +
> > > > +       /* allocate entry */
> > > > +       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > > > +       if (!entry) {
> > > > +               zswap_reject_kmemcache_fail++;
> > > > +               goto reject;
> > > > +       }
> > > > +
> > > > +       /* if entry is successfully added, it keeps the reference */
> > > > +       if (!zswap_pool_get(pool))
> > > > +               goto freepage;
> > >
> > > I think we can batch this for all pages in zswap_store(), maybe first
> > > add zswap_pool_get_many().
> > >
> > > I am also wondering if it would be better to batch the limit checking
> > > and allocating the entries, to front load any failures before we start
> > > compression. Not sure if that's overall better though.
> > >
> > > To batch allocate entries we will have to also allocate an array to
> > > hold them. To batch the limit checking we will have to either allow
> > > going further over limit for mTHPs, or check if there is enough
> > > clearance to allow for compressing all the pages. Using the
> > > uncompressed size will lead to false negatives though, so maybe we can
> > > start tracking the average compression ratio for better limit
> > > checking.
> > >
> > > Nhat, Johannes, any thoughts here? I need someone to tell me if I am
> > > overthinking this :)
> >
> > These are all good points. I suppose I was thinking along the same lines
> > of what Nhat mentioned in an earlier comment. I was trying the
> > incremental zswap_pool_get() and limit checks and shrinker invocations
> > in case of (all) error conditions to allow different concurrent stores to make
> > progress, without favoring only one process's mTHP store. I was thinking
> > this would have minimal impact on the process(es) that see the zswap
> > limit being exceeded, and that this would be better than preemptively
> > checking for the entire mTHP and failing (this could also complicate things
> > where no one makes progress because multiple processes run the batch
> > checks and fail, when realistically one/many could have triggered
> > the shrinker before erroring out, and at least one could have made
> > progress).
> 
> On the other hand, if we allow concurrent mTHP swapouts to do limit
> checks incrementally, they may all fail at the last page. While if
> they all do limit checks beforehand, one of them can proceed.

Yes, this is possible too. Although, given the dynamic nature of the usage,
even with a check-before-store strategy for mTHP we could end up in a
similar situation as the optimistic approach in which we allowed progress
until there really was a reason to fail. 

> 
> I think we need to agree on a higher-level strategy for limit
> checking, both global and per-memcg. The per-memcg limit should be
> stricter though, so we may end up having different policies.

Sure, this makes sense. One possibility is we could allow zswap to
follow the "optimistic approach" used currently, while we manage
the limits checking at the memcg level? Something along the lines
of mem_cgroup_handle_over_high() that gets called every time after
a page-fault is handled; instead checks the cgroup's zswap usage and
triggers writeback? This seems like one way of not adding overhead to
the reclaim path (zswap will store mTHP until the limit checking causes
error and unwinding state), while triggering zswap-LRU based writeback
at a higher level to manage the limit.

> 
> >
> > Would appreciate your perspectives on how this should be handled,
> > and will implement a solution in v8 accordingly.
> >
> > Thanks,
> > Kanchana
> >
> > >
> > > > +
> > > > +       entry->pool = pool;
> > > > +
> > > > +       if (!zswap_compress(page, entry))
> > > > +               goto put_pool;
> > > > +
> > > > +       entry->swpentry = swp_entry(type, offset);
> > > > +       entry->objcg = objcg;
> > > > +       entry->referenced = true;
> > > > +
> > > > +       if (!zswap_store_entry(tree, entry))
> > > > +               goto store_failed;
> > > > +
> > > > +       if (objcg) {
> > > > +               obj_cgroup_charge_zswap(objcg, entry->length);
> > > > +               count_objcg_event(objcg, ZSWPOUT);
> > > > +       }
> > > > +
> > > > +       /*
> > > > +        * We finish initializing the entry while it's already in xarray.
> > > > +        * This is safe because:
> > > > +        *
> > > > +        * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > +        *
> > > > +        * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > +        *    The publishing order matters to prevent writeback from seeing
> > > > +        *    an incoherent entry.
> > > > +        */
> > > > +       if (entry->length) {
> > > > +               INIT_LIST_HEAD(&entry->lru);
> > > > +               zswap_lru_add(&zswap_list_lru, entry);
> > > > +       }
> > > > +
> > > > +       /* update stats */
> > > > +       atomic_inc(&zswap_stored_pages);
> > > > +       count_vm_event(ZSWPOUT);
> > >
> > > We should probably also batch updating the stats. It actually seems
> > > like now we don't handle rolling them back upon failure.
> >
> > Good point! I assume you are referring only to the "ZSWPOUT" vm event
> stats
> > updates and not the "zswap_stored_pages" (since latter is used in limit
> checking)?
> 
> I actually meant both. Do we rollback changes to zswap_stored_pages
> when some stores succeed and some of them fail?

Yes we do. zswap_tree_delete() calls zswap_entry_free() which will
decrement zswap_stored_pages. The only stat that is left in an incorrect
state in this case is the vmstat 'zswpout'.

> 
> I think it's more correct and efficient to update the atomic once
> after all the pages are successfully compressed and stored.

Actually this would need to co-relate with the limits checking strategy,
because the atomic is used there and needs to be as accurate as possible.

As far as the vmstat 'zswpout', the reason I left it as-is in my patchset
was to be more indicative of the actual zswpout compute events that
occurred (for things like getting the compressions count), regardless
of whether or not the overall mTHP store was successful. If this vmstat
needs to reflect only successful zswpout events (i.e., represent the zswap
usage), I can fix it by updating it once only if the mTHP is stored successfully.

Thanks,
Kanchana

Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio.

Posted by Johannes Weiner 1 year, 4 months ago

On Wed, Sep 25, 2024 at 01:49:03AM +0000, Sridhar, Kanchana P wrote:
> > From: Yosry Ahmed <yosryahmed@google.com>
> > I think it's more correct and efficient to update the atomic once
> > after all the pages are successfully compressed and stored.
> 
> Actually this would need to co-relate with the limits checking strategy,
> because the atomic is used there and needs to be as accurate as possible.

For the limit checks, we use the zpool counters, not zswap_stored_pages.

zswap_stored_pages is used in the zswap shrinker to guesstimate
pressure, so it's likely a good thing to only count entries that are
expected to stay, and not account the ones that might fail just yet.

> As far as the vmstat 'zswpout', the reason I left it as-is in my patchset
> was to be more indicative of the actual zswpout compute events that
> occurred (for things like getting the compressions count), regardless
> of whether or not the overall mTHP store was successful. If this vmstat
> needs to reflect only successful zswpout events (i.e., represent the zswap
> usage), I can fix it by updating it once only if the mTHP is stored successfully.

Yeah, that's fine as well.

I would suggest batching them both at the end of zswap_store().

RE: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio.

Posted by Sridhar, Kanchana P 1 year, 4 months ago

> -----Original Message-----
> From: Johannes Weiner <hannes@cmpxchg.org>
> Sent: Wednesday, September 25, 2024 6:54 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Yosry Ahmed <yosryahmed@google.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page
> in a folio.
> 
> On Wed, Sep 25, 2024 at 01:49:03AM +0000, Sridhar, Kanchana P wrote:
> > > From: Yosry Ahmed <yosryahmed@google.com>
> > > I think it's more correct and efficient to update the atomic once
> > > after all the pages are successfully compressed and stored.
> >
> > Actually this would need to co-relate with the limits checking strategy,
> > because the atomic is used there and needs to be as accurate as possible.
> 
> For the limit checks, we use the zpool counters, not zswap_stored_pages.

Thanks Johannes for your insights and comments. Yes, you are absolutely
right. My apologies.

> 
> zswap_stored_pages is used in the zswap shrinker to guesstimate
> pressure, so it's likely a good thing to only count entries that are
> expected to stay, and not account the ones that might fail just yet.

Sure, makes sense.

> 
> > As far as the vmstat 'zswpout', the reason I left it as-is in my patchset
> > was to be more indicative of the actual zswpout compute events that
> > occurred (for things like getting the compressions count), regardless
> > of whether or not the overall mTHP store was successful. If this vmstat
> > needs to reflect only successful zswpout events (i.e., represent the zswap
> > usage), I can fix it by updating it once only if the mTHP is stored successfully.
> 
> Yeah, that's fine as well.
> 
> I would suggest batching them both at the end of zswap_store().

Ok, will do so in v8.

Thanks,
Kanchana