[PATCH 1/4] mm, swap: do not perform synchronous discard during allocation

Posted by Kairui Song 2 months ago
From: Kairui Song <kasong@tencent.com>

Since commit 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation
fast path"), swap allocation is protected by a local lock, which means
we can't do any sleeping calls during allocation.

However, the discard routine is not well taken care of. When the swap
allocator fails to find any usable cluster, it looks at the pending
discard clusters and tries to issue some blocking discards. It may
not necessarily sleep, but the cond_resched in the bio layer indicates
that doing this while holding a local lock is wrong. The GFP flag used
for the discard bio is also wrong (not atomic).

It's arguable whether this synchronous discard is helpful at all. In
most cases, the async discard is good enough. And the swap allocator
organizes clusters very differently since the recent change, so it is
very rare to see pending discard clusters piling up.

So far, no issues have been observed or reported with typical SSD setups
under months of high pressure. This issue was found during my code
review. But by hacking the kernel a bit, adding an mdelay(100) in the
async discard path, the issue becomes observable, with a WARNING
triggered by the wrong GFP flag and the cond_resched in the bio layer.
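
A rough sketch of that hack, for reference (illustrative only; the exact
placement inside the async discard worker may differ):

    /* hypothetical debug hack, not part of this patch */
    /* needs <linux/delay.h> for mdelay() */
    static void swap_discard_work(struct work_struct *work)
    {
        struct swap_info_struct *si = container_of(work,
                struct swap_info_struct, discard_work);

        mdelay(100); /* slow the async discard so pending clusters pile up */
        /* ... existing body: call swap_do_scheduled_discard(si) ... */
    }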

So let's fix this issue in a safe way: remove the synchronous discard
from the swap allocation path. And when an order 0 allocation fails with
the cluster lists drained on all swap devices, try a discard following
the swap device priority list. If any discard released a cluster, try
the allocation again. This way, we can still avoid OOM due to swap
failure if the hardware is very slow and memory pressure is extremely
high.

Cc: <stable@vger.kernel.org>
Fixes: 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation fast path")
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 40 +++++++++++++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index cb2392ed8e0e..0d1924f6f495 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 			goto done;
 	}
 
-	/*
-	 * We don't have free cluster but have some clusters in discarding,
-	 * do discard now and reclaim them.
-	 */
-	if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
-		goto new_cluster;
-
 	if (order)
 		goto done;
 
@@ -1394,6 +1387,33 @@ static bool swap_alloc_slow(swp_entry_t *entry,
 	return false;
 }
 
+/*
+ * Discard pending clusters synchronously when under high pressure.
+ * Return: true if any cluster is discarded.
+ */
+static bool swap_sync_discard(void)
+{
+	bool ret = false;
+	int nid = numa_node_id();
+	struct swap_info_struct *si, *next;
+
+	spin_lock(&swap_avail_lock);
+	plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
+		spin_unlock(&swap_avail_lock);
+		if (get_swap_device_info(si)) {
+			if (si->flags & SWP_PAGE_DISCARD)
+				ret = swap_do_scheduled_discard(si);
+			put_swap_device(si);
+		}
+		if (ret)
+			break;
+		spin_lock(&swap_avail_lock);
+	}
+	spin_unlock(&swap_avail_lock);
+
+	return ret;
+}
+
 /**
  * folio_alloc_swap - allocate swap space for a folio
  * @folio: folio we want to move to swap
@@ -1432,11 +1452,17 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
 		}
 	}
 
+again:
 	local_lock(&percpu_swap_cluster.lock);
 	if (!swap_alloc_fast(&entry, order))
 		swap_alloc_slow(&entry, order);
 	local_unlock(&percpu_swap_cluster.lock);
 
+	if (unlikely(!order && !entry.val)) {
+		if (swap_sync_discard())
+			goto again;
+	}
+
 	/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
 	if (mem_cgroup_try_charge_swap(folio, entry))
 		goto out_free;

-- 
2.51.0
Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
Posted by Chris Li 1 month, 4 weeks ago
Hi Kairui,

First of all, your title is a bit misleading:
"do not perform synchronous discard during allocation"

You still do the synchronous discard, just limited to the case where an
order 0 allocation fails.

Also, your commit message does not describe the behavior change of this
patch. The behavior change is that it now prefers to allocate from the
fragment list before waiting for the discard, which I feel is not
justified.

After reading your patch, I feel that you still do the synchronous
discard, just now without holding the local lock. I suggest you just fix
the lock-holding issue without changing the discard ordering behavior.

On Mon, Oct 6, 2025 at 1:03 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Since commit 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation
> fast path"), swap allocation is protected by a local lock, which means
> we can't do any sleeping calls during allocation.
>
> However, the discard routine is not taken well care of. When the swap
> allocator failed to find any usable cluster, it would look at the
> pending discard cluster and try to issue some blocking discards. It may
> not necessarily sleep, but the cond_resched at the bio layer indicates
> this is wrong when combined with a local lock. And the bio GFP flag used
> for discard bio is also wrong (not atomic).

If the lock is the issue, let's fix the lock issue.

> It's arguable whether this synchronous discard is helpful at all. In
> most cases, the async discard is good enough. And the swap allocator is
> doing very differently at organizing the clusters since the recent
> change, so it is very rare to see discard clusters piling up.

Very rare does not mean it never happens. If you have a cluster on the
discard queue, I think it is better to wait for the discard to complete
before using the fragment list, to reduce fragmentation. So it seems the
real issue is holding a lock while doing the block discard?

> So far, no issues have been observed or reported with typical SSD setups
> under months of high pressure. This issue was found during my code
> review. But by hacking the kernel a bit: adding a mdelay(100) in the
> async discard path, this issue will be observable with WARNING triggered
> by the wrong GFP and cond_resched in the bio layer.

I think that makes an assumption about how slow SSD discard is. Some
SSDs can be really slow. We want our kernel to work for those
slow-discarding SSDs as well.

> So let's fix this issue in a safe way: remove the synchronous discard in
> the swap allocation path. And when order 0 is failing with all cluster
> list drained on all swap devices, try to do a discard following the swap

I don't feel that changing the discard behavior is justified here; the
real fix is discarding without the lock held. Am I missing something?
If I understand correctly, we should be able to keep the current discard
ordering behavior, discarding before using the fragment list, but with
less lock held, as your current patch does.

I suggest the allocation path here detect that a discard is pending and
that it is running out of free blocks, return at that point, and
indicate the need to discard. The caller then performs the discard
without holding the lock, similar to what you do in the order == 0 case.
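
For illustration, a rough sketch of the caller side (the discard_si
output argument and its handling are hypothetical; swap_alloc_fast/slow,
get_swap_device_info and the "again" label are from the patch above):

    struct swap_info_struct *discard_si = NULL;

again:
    local_lock(&percpu_swap_cluster.lock);
    if (!swap_alloc_fast(&entry, order))
        swap_alloc_slow(&entry, order, &discard_si);
    local_unlock(&percpu_swap_cluster.lock);

    /* the local lock is no longer held, blocking on the discard is fine */
    if (discard_si) {
        if (get_swap_device_info(discard_si)) {
            swap_do_scheduled_discard(discard_si);
            put_swap_device(discard_si);
        }
        discard_si = NULL;
        goto again; /* retry now that clusters may have been freed */
    }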

> device priority list. If any discards released some cluster, try the
> allocation again. This way, we can still avoid OOM due to swap failure
> if the hardware is very slow and memory pressure is extremely high.
>
> Cc: <stable@vger.kernel.org>
> Fixes: 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation fast path")
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swapfile.c | 40 +++++++++++++++++++++++++++++++++-------
>  1 file changed, 33 insertions(+), 7 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index cb2392ed8e0e..0d1924f6f495 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>                         goto done;
>         }
>
> -       /*
> -        * We don't have free cluster but have some clusters in discarding,
> -        * do discard now and reclaim them.
> -        */
> -       if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> -               goto new_cluster;

Assuming you follow my suggestion: change this to some function that
detects whether there is a pending discard on this device, and return to
the caller indicating that a discard is needed on that device. Add an
output argument, say "discard", to report the device if needed.
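
Roughly, something like this in place of the removed hunk (the
need_discard output argument is hypothetical):

    /*
     * We don't have a free cluster but have some clusters being
     * discarded; ask the caller to discard them outside the lock.
     */
    if ((si->flags & SWP_PAGE_DISCARD) && !list_empty(&si->discard_clusters)) {
        *need_discard = true;
        goto done;
    }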

> -
>         if (order)
>                 goto done;
>
> @@ -1394,6 +1387,33 @@ static bool swap_alloc_slow(swp_entry_t *entry,
>         return false;
>  }
>
> +/*
> + * Discard pending clusters in a synchronized way when under high pressure.
> + * Return: true if any cluster is discarded.
> + */
> +static bool swap_sync_discard(void)
> +{

This function discards on all swap devices. I am wondering if we should
just discard on the current working device instead. Discard on any other
device is presumably already ongoing via the work queue; we don't have
to wait for that.

To unblock the current swap allocation, we only need to wait for the
discard on the current swap device, the one that indicated it needs to
wait for a discard. Assume you take my above suggestion.

> +       bool ret = false;
> +       int nid = numa_node_id();
> +       struct swap_info_struct *si, *next;
> +
> +       spin_lock(&swap_avail_lock);
> +       plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
> +               spin_unlock(&swap_avail_lock);
> +               if (get_swap_device_info(si)) {
> +                       if (si->flags & SWP_PAGE_DISCARD)
> +                               ret = swap_do_scheduled_discard(si);
> +                       put_swap_device(si);
> +               }
> +               if (ret)
> +                       break;
> +               spin_lock(&swap_avail_lock);
> +       }
> +       spin_unlock(&swap_avail_lock);
> +
> +       return ret;
> +}
> +
>  /**
>   * folio_alloc_swap - allocate swap space for a folio
>   * @folio: folio we want to move to swap
> @@ -1432,11 +1452,17 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
>                 }
>         }
>
> +again:
>         local_lock(&percpu_swap_cluster.lock);
>         if (!swap_alloc_fast(&entry, order))
>                 swap_alloc_slow(&entry, order);

Here we can have a "discard" output argument to indicate which swap
device needs to be discarded.

>         local_unlock(&percpu_swap_cluster.lock);
>
> +       if (unlikely(!order && !entry.val)) {

If you take the above suggestion, this becomes: just check whether the
"discard" device is not NULL, perform the discard on that device, and we
are done.

> +               if (swap_sync_discard())
> +                       goto again;
> +       }
> +
>         /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
>         if (mem_cgroup_try_charge_swap(folio, entry))
>                 goto out_free;

Chris
Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
Posted by Kairui Song 1 month, 3 weeks ago
On Thu, Oct 9, 2025 at 5:10 AM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Kairui,
>
> First of all, your title is a bit misleading:
> "do not perform synchronous discard during allocation"
>
> You still do the synchronous discard, just limited to order 0 failing.
>
> Also your commit did not describe the behavior change of this patch.
> The behavior change is that, it now prefers to allocate from the
> fragment list before waiting for the discard. Which I feel is not
> justified.
>
> After reading your patch, I feel that you still do the synchronous
> discard, just now you do it with less lock held.
> I suggest you just fix the lock held issue without changing the
> discard ordering behavior.
>
> On Mon, Oct 6, 2025 at 1:03 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Since commit 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation
> > fast path"), swap allocation is protected by a local lock, which means
> > we can't do any sleeping calls during allocation.
> >
> > However, the discard routine is not taken well care of. When the swap
> > allocator failed to find any usable cluster, it would look at the
> > pending discard cluster and try to issue some blocking discards. It may
> > not necessarily sleep, but the cond_resched at the bio layer indicates
> > this is wrong when combined with a local lock. And the bio GFP flag used
> > for discard bio is also wrong (not atomic).
>
> If lock is the issue, let's fix the lock issue.
>
> > It's arguable whether this synchronous discard is helpful at all. In
> > most cases, the async discard is good enough. And the swap allocator is
> > doing very differently at organizing the clusters since the recent
> > change, so it is very rare to see discard clusters piling up.
>
> Very rare does not mean this never happens. If you have a cluster on
> the discarding queue, I think it is better to wait for the discard to
> complete before using the fragmented list, to reduce the
> fragmentation. So it seems the real issue is holding a lock while
> doing the block discard?
>
> > So far, no issues have been observed or reported with typical SSD setups
> > under months of high pressure. This issue was found during my code
> > review. But by hacking the kernel a bit: adding a mdelay(100) in the
> > async discard path, this issue will be observable with WARNING triggered
> > by the wrong GFP and cond_resched in the bio layer.
>
> I think that makes an assumption on how slow the SSD discard is. Some
> SSD can be really slow. We want our kernel to work for those slow
> discard SSD cases as well.
>
> > So let's fix this issue in a safe way: remove the synchronous discard in
> > the swap allocation path. And when order 0 is failing with all cluster
> > list drained on all swap devices, try to do a discard following the swap
>
> I don't feel that changing the discard behavior is justified here, the
> real fix is discarding with less lock held. Am I missing something?
> If I understand correctly, we should be able to keep the current
> discard ordering behavior, discard before the fragment list. But with
> less lock held as your current patch does.
>
> I suggest the allocation here detects there is a discard pending and
> running out of free blocks. Return there and indicate the need to
> discard. The caller performs the discard without holding the lock,
> similar to what you do with the order == 0 case.
>
> > device priority list. If any discards released some cluster, try the
> > allocation again. This way, we can still avoid OOM due to swap failure
> > if the hardware is very slow and memory pressure is extremely high.
> >
> > Cc: <stable@vger.kernel.org>
> > Fixes: 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation fast path")
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swapfile.c | 40 +++++++++++++++++++++++++++++++++-------
> >  1 file changed, 33 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index cb2392ed8e0e..0d1924f6f495 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >                         goto done;
> >         }
> >
> > -       /*
> > -        * We don't have free cluster but have some clusters in discarding,
> > -        * do discard now and reclaim them.
> > -        */
> > -       if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > -               goto new_cluster;
>
> Assume you follow my suggestion.
> Change this to some function to detect if there is a pending discard
> on this device. Return to the caller indicating that you need a
> discard for this device that has a pending discard.
> Add an output argument to indicate the discard device "discard" if needed.

The problem I just realized is that if we just bail out here, we forbid
order 0 from stealing whenever there is any cluster being discarded. We
just return here to let the caller handle the discard outside the lock.

The caller may discard the cluster just fine, then retry from free
clusters. Then everything is fine; that's the easy part.

But it might also fail, and interestingly, in the failure case we need
to try again as well. It might fail due to a race with another discard,
in which case an order 0 steal is still feasible. Or it might fail in
get_swap_device_info (we have to release the device to return here), in
which case we should go back to the plist and try other devices.

This is doable but seems kind of fragile; we'll have something like
this in folio_alloc_swap():

local_lock(&percpu_swap_cluster.lock);
if (!swap_alloc_fast(&entry, order))
    swap_alloc_slow(&entry, order, &discard_si);
local_unlock(&percpu_swap_cluster.lock);

+if (discard_si) {
+    if (get_swap_device_info(discard_si)) {
+        swap_do_scheduled_discard(discard_si);
+        put_swap_device(discard_si);
+        /*
+         * Ignoring the return value, since we need to try
+         * again even if the discard failed. If failed due to
+         * race with another discard, we should still try
+         * order 0 steal.
+         */
+    } else {
+        discard_si = NULL;
+        /*
+         * If raced with swapoff, we should try again too but
+         * not using the discard device anymore.
+         */
+    }
+    goto again;
+}

And on the `again` retry we'll have to always start from free_clusters
again, unless we add another parameter just to indicate that we want to
skip everything and jump to stealing, or pass and reuse the discard_si
pointer as a return argument of cluster_alloc_swap_entry as well, as the
indicator to jump to stealing directly.

It looks kind of strange. So far swap_do_scheduled_discard can only
fail due to a race with another successful discard, so retrying is safe
and won't run into an endless loop. But it seems easy to break, e.g. if
we ever handle bio allocation failure for the discard request in the
future. And trying again when get_swap_device_info fails makes no sense
if there is only one device, but it has to be done here to cover the
multi-device case, or we have to add more special checks.

swap_alloc_slow will be a bit longer too if we want to avoid touching
the plist again:
+/*
+ * Resuming after trying to discard cluster on a swap device,
+ * try the discarded device first.
+ */
+si = *discard_si;
+if (unlikely(si)) {
+    *discard_si = NULL;
+    if (get_swap_device_info(si)) {
+        offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE,
+                                          &need_discard);
+        put_swap_device(si);
+        if (offset) {
+            *entry = swp_entry(si->type, offset);
+            return true;
+        }
+        if (need_discard) {
+            *discard_si = si;
+            return false;
+        }
+    }
+}

The logic of the workflow jumping between several functions might also
be kind of hard to follow. Some cleanup can be done later, though.

Considering the discard issue is really rare, I'm not sure if this is
the right way to go. What do you think?

BTW: the logic of V1 can be optimized a little to let discards also
happen in the order > 0 cases. That seems closer to what the current
upstream kernel is doing, except that the allocator prefers to try
another device instead of waiting for the discard, which seems OK?
And the order 0 steal can happen without waiting for the discard.
Fragmentation under extreme pressure might not be that serious an issue
if we are dealing with really slow SSDs, and it might even stop being an
issue once we have a generic solution for fragmentation.
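
Roughly, reusing the V1 names above (untested, just to show the idea):

    if (unlikely(!entry.val)) {
        /* order restriction dropped: order > 0 also tries a discard */
        if (swap_sync_discard())
            goto again;
    }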
Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
Posted by Kairui Song 1 month, 4 weeks ago
On Thu, Oct 9, 2025 at 5:10 AM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Kairui,
>
> First of all, your title is a bit misleading:
> "do not perform synchronous discard during allocation"
>
> You still do the synchronous discard, just limited to order 0 failing.
>
> Also your commit did not describe the behavior change of this patch.
> The behavior change is that, it now prefers to allocate from the
> fragment list before waiting for the discard. Which I feel is not
> justified.
>
> After reading your patch, I feel that you still do the synchronous
> discard, just now you do it with less lock held.
> I suggest you just fix the lock held issue without changing the
> discard ordering behavior.
>
> On Mon, Oct 6, 2025 at 1:03 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Since commit 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation
> > fast path"), swap allocation is protected by a local lock, which means
> > we can't do any sleeping calls during allocation.
> >
> > However, the discard routine is not taken well care of. When the swap
> > allocator failed to find any usable cluster, it would look at the
> > pending discard cluster and try to issue some blocking discards. It may
> > not necessarily sleep, but the cond_resched at the bio layer indicates
> > this is wrong when combined with a local lock. And the bio GFP flag used
> > for discard bio is also wrong (not atomic).
>
> If lock is the issue, let's fix the lock issue.
>
> > It's arguable whether this synchronous discard is helpful at all. In
> > most cases, the async discard is good enough. And the swap allocator is
> > doing very differently at organizing the clusters since the recent
> > change, so it is very rare to see discard clusters piling up.
>
> Very rare does not mean this never happens. If you have a cluster on
> the discarding queue, I think it is better to wait for the discard to
> complete before using the fragmented list, to reduce the
> fragmentation. So it seems the real issue is holding a lock while
> doing the block discard?
>
> > So far, no issues have been observed or reported with typical SSD setups
> > under months of high pressure. This issue was found during my code
> > review. But by hacking the kernel a bit: adding a mdelay(100) in the
> > async discard path, this issue will be observable with WARNING triggered
> > by the wrong GFP and cond_resched in the bio layer.
>
> I think that makes an assumption on how slow the SSD discard is. Some
> SSD can be really slow. We want our kernel to work for those slow
> discard SSD cases as well.
>
> > So let's fix this issue in a safe way: remove the synchronous discard in
> > the swap allocation path. And when order 0 is failing with all cluster
> > list drained on all swap devices, try to do a discard following the swap
>
> I don't feel that changing the discard behavior is justified here, the
> real fix is discarding with less lock held. Am I missing something?
> If I understand correctly, we should be able to keep the current
> discard ordering behavior, discard before the fragment list. But with
> less lock held as your current patch does.
>
> I suggest the allocation here detects there is a discard pending and
> running out of free blocks. Return there and indicate the need to
> discard. The caller performs the discard without holding the lock,
> similar to what you do with the order == 0 case.

Thanks for the suggestion. Right, that sounds even better. My initial
thought was that maybe we could just remove this discard completely
since it rarely helps, and if the SSD is really that slow, OOM under
heavy pressure might even be acceptable behaviour. But to make it safer,
I made it do the discard only when order 0 is failing, so the code is
simpler.

Let me send a V2 that handles the discard carefully to reduce the
potential impact.

> > device priority list. If any discards released some cluster, try the
> > allocation again. This way, we can still avoid OOM due to swap failure
> > if the hardware is very slow and memory pressure is extremely high.
> >
> > Cc: <stable@vger.kernel.org>
> > Fixes: 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation fast path")
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swapfile.c | 40 +++++++++++++++++++++++++++++++++-------
> >  1 file changed, 33 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index cb2392ed8e0e..0d1924f6f495 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >                         goto done;
> >         }
> >
> > -       /*
> > -        * We don't have free cluster but have some clusters in discarding,
> > -        * do discard now and reclaim them.
> > -        */
> > -       if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > -               goto new_cluster;
>
> Assume you follow my suggestion.
> Change this to some function to detect if there is a pending discard
> on this device. Return to the caller indicating that you need a
> discard for this device that has a pending discard.

Checking `!list_empty(&si->discard_clusters)` should be good enough.
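
For example, a tiny helper (the name is made up, just to illustrate the
check):

    static bool swap_has_pending_discard(struct swap_info_struct *si)
    {
        return (si->flags & SWP_PAGE_DISCARD) &&
               !list_empty(&si->discard_clusters);
    }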
Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
Posted by Chris Li 1 month, 4 weeks ago
On Thu, Oct 9, 2025 at 8:33 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Oct 9, 2025 at 5:10 AM Chris Li <chrisl@kernel.org> wrote:
> > I suggest the allocation here detects there is a discard pending and
> > running out of free blocks. Return there and indicate the need to
> > discard. The caller performs the discard without holding the lock,
> > similar to what you do with the order == 0 case.
>
> Thanks for the suggestion. Right, that sounds even better. My initial
> though was that maybe we can just remove this discard completely since
> it rarely helps, and if the SSD is really that slow, OOM under heavy

Your argument is that this case happens very rarely. I agree with you
on that. The follow-up question is: if that rare case does happen, are
we doing the best we can in that situation? The V1 patch is not doing
the best we can; it is pretty much "I don't care about the discard much,
just ignore it unless an order 0 failure forces our hand". As far as I
can tell, properly handling the pending-discard condition is not much
more complicated than your V1 patch; it might even be simpler because
you don't need that order 0 failure logic any more.

> pressure might even be an acceptable behaviour. But to make it safer,
> I made it do discard only when order 0 is failing so the code is
> simpler.
>
> Let me sent a V2 to handle the discard carefully to reduce potential impact.

Great. Looking forward to it.

BTW, in the caller retry loop, the caller can retry the very swap
device that the discard was just performed on; it does not need to retry
from the very first swap device. In that regard, it is also better
behavior than V1 or even the existing discard behavior, which waits for
all devices to discard.

Chris
Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
Posted by Nhat Pham 2 months ago
On Mon, Oct 6, 2025 at 1:03 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Since commit 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation
> fast path"), swap allocation is protected by a local lock, which means
> we can't do any sleeping calls during allocation.
>
> However, the discard routine is not taken well care of. When the swap
> allocator failed to find any usable cluster, it would look at the
> pending discard cluster and try to issue some blocking discards. It may
> not necessarily sleep, but the cond_resched at the bio layer indicates
> this is wrong when combined with a local lock. And the bio GFP flag used
> for discard bio is also wrong (not atomic).
>
> It's arguable whether this synchronous discard is helpful at all. In
> most cases, the async discard is good enough. And the swap allocator is
> doing very differently at organizing the clusters since the recent
> change, so it is very rare to see discard clusters piling up.
>
> So far, no issues have been observed or reported with typical SSD setups
> under months of high pressure. This issue was found during my code
> review. But by hacking the kernel a bit: adding a mdelay(100) in the
> async discard path, this issue will be observable with WARNING triggered
> by the wrong GFP and cond_resched in the bio layer.
>
> So let's fix this issue in a safe way: remove the synchronous discard in
> the swap allocation path. And when order 0 is failing with all cluster
> list drained on all swap devices, try to do a discard following the swap
> device priority list. If any discards released some cluster, try the
> allocation again. This way, we can still avoid OOM due to swap failure
> if the hardware is very slow and memory pressure is extremely high.
>
> Cc: <stable@vger.kernel.org>
> Fixes: 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation fast path")
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---

Seems reasonable to me.

Acked-by: Nhat Pham <nphamcs@gmail.com>