mm/swap_state.c | 2 -- 1 file changed, 2 deletions(-)
swap_cluster_readahead() and swap_vma_readahead() end the readahead
loop with an explicit lru_add_drain() call. That drain is a leftover
from 2.6.12 era code and serves no functional purpose for the callers:
- do_swap_page() ignores LRU residency for the readahead folios;
it only needs the target folio it called swapin_readahead() for,
and if the write-fault path needs the target folio on the LRU to count
references accurately, it runs its own lru_add_drain() at the
wp_can_reuse_anon_folio() and do_swap_page() sites.
- shmem_swapin_cluster() immediately locks the returned folio, waits
for writeback, then operates on it - LRU residency of either the target
or the readahead folios is irrelevant.
- try_to_unuse() likewise locks the folio and calls unuse_pte() without
depending on LRU presence.
Folios newly added to the swap cache by the readahead loop sit in
the per-CPU LRU folio_batch and will be drained naturally as the
batch fills (FOLIO_BATCH_SIZE),by the next reclaim/compaction
lru_add_drain_all() and so on. The unconditional drain only
synchronously flushes a partial batch and forces contention on
lruvec_lock.
On a 176-CPU production host running a memory-pressured workload, this
path was observed to call folio_batch_move_lru() from
swap_cluster_readahead() ~28K/min, a very large source of LRU lock
traffic.
This is a direct continuation of the cleanup started in commit
1aa43598c03b ("mm: remove unnecessary calls to lru_add_drain") which
removed the equivalent drain from free_pages_and_swap_cache() with
the same rationale. A detailed reasoning for this is present in [1].
Remove both drains.
[1] https://lore.kernel.org/all/dca2824e8e88e826c6b260a831d79089b5b9c79d.camel@surriel.com/T/#u
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/swap_state.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 9c3a5cf99778..6fd6e3415b71 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -836,7 +836,6 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
}
blk_finish_plug(&plug);
swap_read_unplug(splug);
- lru_add_drain(); /* Push any new pages onto the LRU now */
skip:
/* The page was likely read above, so no need for plugging here */
return swap_cache_read_folio(entry, gfp_mask, mpol, ilx, NULL, false);
@@ -951,7 +950,6 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
pte_unmap(pte);
blk_finish_plug(&plug);
swap_read_unplug(splug);
- lru_add_drain();
skip:
/* The folio was likely read above, so no need for plugging here */
folio = swap_cache_read_folio(targ_entry, gfp_mask, mpol, targ_ilx,
--
2.52.0
On Mon, Jun 08, 2026 at 07:32:42AM -0700, Usama Arif wrote:
> swap_cluster_readahead() and swap_vma_readahead() end the readahead
> loop with an explicit lru_add_drain() call. That drain is a leftover
> from 2.6.12 era code and serves no functional purpose for the callers:
>
> - do_swap_page() ignores LRU residency for the readahead folios;
> it only needs the target folio it called swapin_readahead() for,
> and if the write-fault path needs the target folio on the LRU to count
> references accurately, it runs its own lru_add_drain() at the
> wp_can_reuse_anon_folio() and do_swap_page() sites.
>
> - shmem_swapin_cluster() immediately locks the returned folio, waits
> for writeback, then operates on it - LRU residency of either the target
> or the readahead folios is irrelevant.
>
> - try_to_unuse() likewise locks the folio and calls unuse_pte() without
> depending on LRU presence.
>
> Folios newly added to the swap cache by the readahead loop sit in
> the per-CPU LRU folio_batch and will be drained naturally as the
> batch fills (FOLIO_BATCH_SIZE),by the next reclaim/compaction
> lru_add_drain_all() and so on. The unconditional drain only
> synchronously flushes a partial batch and forces contention on
> lruvec_lock.
>
> On a 176-CPU production host running a memory-pressured workload, this
> path was observed to call folio_batch_move_lru() from
> swap_cluster_readahead() ~28K/min, a very large source of LRU lock
> traffic.
>
> This is a direct continuation of the cleanup started in commit
> 1aa43598c03b ("mm: remove unnecessary calls to lru_add_drain") which
> removed the equivalent drain from free_pages_and_swap_cache() with
> the same rationale. A detailed reasoning for this is present in [1].
>
> Remove both drains.
>
> [1] https://lore.kernel.org/all/dca2824e8e88e826c6b260a831d79089b5b9c79d.camel@surriel.com/T/#u
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
On Mon, Jun 8, 2026 at 10:45 PM Usama Arif <usama.arif@linux.dev> wrote:
>
> swap_cluster_readahead() and swap_vma_readahead() end the readahead
> loop with an explicit lru_add_drain() call. That drain is a leftover
> from 2.6.12 era code and serves no functional purpose for the callers:
>
> - do_swap_page() ignores LRU residency for the readahead folios;
> it only needs the target folio it called swapin_readahead() for,
> and if the write-fault path needs the target folio on the LRU to count
> references accurately, it runs its own lru_add_drain() at the
> wp_can_reuse_anon_folio() and do_swap_page() sites.
>
> - shmem_swapin_cluster() immediately locks the returned folio, waits
> for writeback, then operates on it - LRU residency of either the target
> or the readahead folios is irrelevant.
>`
> - try_to_unuse() likewise locks the folio and calls unuse_pte() without
> depending on LRU presence.
>
> Folios newly added to the swap cache by the readahead loop sit in
> the per-CPU LRU folio_batch and will be drained naturally as the
> batch fills (FOLIO_BATCH_SIZE),by the next reclaim/compaction
> lru_add_drain_all() and so on. The unconditional drain only
> synchronously flushes a partial batch and forces contention on
> lruvec_lock.
>
> On a 176-CPU production host running a memory-pressured workload, this
> path was observed to call folio_batch_move_lru() from
> swap_cluster_readahead() ~28K/min, a very large source of LRU lock
> traffic.
>
> This is a direct continuation of the cleanup started in commit
> 1aa43598c03b ("mm: remove unnecessary calls to lru_add_drain") which
> removed the equivalent drain from free_pages_and_swap_cache() with
> the same rationale. A detailed reasoning for this is present in [1].
>
> Remove both drains.
>
> [1] https://lore.kernel.org/all/dca2824e8e88e826c6b260a831d79089b5b9c79d.camel@surriel.com/T/#u
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
> mm/swap_state.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 9c3a5cf99778..6fd6e3415b71 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -836,7 +836,6 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> }
> blk_finish_plug(&plug);
> swap_read_unplug(splug);
> - lru_add_drain(); /* Push any new pages onto the LRU now */
> skip:
> /* The page was likely read above, so no need for plugging here */
> return swap_cache_read_folio(entry, gfp_mask, mpol, ilx, NULL, false);
> @@ -951,7 +950,6 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> pte_unmap(pte);
> blk_finish_plug(&plug);
> swap_read_unplug(splug);
> - lru_add_drain();
> skip:
> /* The folio was likely read above, so no need for plugging here */
> folio = swap_cache_read_folio(targ_entry, gfp_mask, mpol, targ_ilx,
> --
> 2.52.0
Thanks!
Reviewed-by: Kairui Song <kasong@tencent.com>
On Mon, Jun 8, 2026 at 10:33 PM Usama Arif <usama.arif@linux.dev> wrote:
>
> swap_cluster_readahead() and swap_vma_readahead() end the readahead
> loop with an explicit lru_add_drain() call. That drain is a leftover
> from 2.6.12 era code and serves no functional purpose for the callers:
>
> - do_swap_page() ignores LRU residency for the readahead folios;
> it only needs the target folio it called swapin_readahead() for,
> and if the write-fault path needs the target folio on the LRU to count
> references accurately, it runs its own lru_add_drain() at the
> wp_can_reuse_anon_folio() and do_swap_page() sites.
right. as i can see the below in do_swap_page():
/*
* If we want to map a page that's in the swapcache writable, we
* have to detect via the refcount if we're really the exclusive
* owner. Try removing the extra reference from the local LRU
* caches if required.
*/
if ((vmf->flags & FAULT_FLAG_WRITE) &&
!folio_test_ksm(folio) && !folio_test_lru(folio))
lru_add_drain();
and the below in wp_can_reuse_anon_folio():
if (!folio_test_lru(folio))
/*
* We cannot easily detect+handle references from
* remote LRU caches or references to LRU folios.
*/
lru_add_drain();
>
> - shmem_swapin_cluster() immediately locks the returned folio, waits
> for writeback, then operates on it - LRU residency of either the target
> or the readahead folios is irrelevant.
>
> - try_to_unuse() likewise locks the folio and calls unuse_pte() without
> depending on LRU presence.
>
> Folios newly added to the swap cache by the readahead loop sit in
> the per-CPU LRU folio_batch and will be drained naturally as the
> batch fills (FOLIO_BATCH_SIZE),by the next reclaim/compaction
> lru_add_drain_all() and so on. The unconditional drain only
> synchronously flushes a partial batch and forces contention on
> lruvec_lock.
>
> On a 176-CPU production host running a memory-pressured workload, this
> path was observed to call folio_batch_move_lru() from
> swap_cluster_readahead() ~28K/min, a very large source of LRU lock
> traffic.
>
Do we see a workload improvement? If yes, can we put the data?
Thanks
Barry
On 09/06/2026 09:01, Barry Song wrote: > On Mon, Jun 8, 2026 at 10:33 PM Usama Arif <usama.arif@linux.dev> wrote: >> >> swap_cluster_readahead() and swap_vma_readahead() end the readahead >> loop with an explicit lru_add_drain() call. That drain is a leftover >> from 2.6.12 era code and serves no functional purpose for the callers: >> >> - do_swap_page() ignores LRU residency for the readahead folios; >> it only needs the target folio it called swapin_readahead() for, >> and if the write-fault path needs the target folio on the LRU to count >> references accurately, it runs its own lru_add_drain() at the >> wp_can_reuse_anon_folio() and do_swap_page() sites. > > right. as i can see the below in do_swap_page(): > > /* > * If we want to map a page that's in the swapcache writable, we > * have to detect via the refcount if we're really the exclusive > * owner. Try removing the extra reference from the local LRU > * caches if required. > */ > if ((vmf->flags & FAULT_FLAG_WRITE) && > !folio_test_ksm(folio) && !folio_test_lru(folio)) > lru_add_drain(); > > and the below in wp_can_reuse_anon_folio(): > > if (!folio_test_lru(folio)) > /* > * We cannot easily detect+handle references from > * remote LRU caches or references to LRU folios. > */ > lru_add_drain(); > >> >> - shmem_swapin_cluster() immediately locks the returned folio, waits >> for writeback, then operates on it - LRU residency of either the target >> or the readahead folios is irrelevant. >> >> - try_to_unuse() likewise locks the folio and calls unuse_pte() without >> depending on LRU presence. >> >> Folios newly added to the swap cache by the readahead loop sit in >> the per-CPU LRU folio_batch and will be drained naturally as the >> batch fills (FOLIO_BATCH_SIZE),by the next reclaim/compaction >> lru_add_drain_all() and so on. The unconditional drain only >> synchronously flushes a partial batch and forces contention on >> lruvec_lock. >> >> On a 176-CPU production host running a memory-pressured workload, this >> path was observed to call folio_batch_move_lru() from >> swap_cluster_readahead() ~28K/min, a very large source of LRU lock >> traffic. >> > > Do we see a workload improvement? If yes, can we put the data? > Hello Barry! So lru lock contention is a source of issue in the meta fleet. This problem was specifically seen when I ran `perf lock contention -a -b` in production on a workload that has a really big anon heap and heavy swap activity. When I tried to trace with bpftrace who was the biggest consumer, it was readahead. It is easy to run perf and bpftrace on prod on this specific workload, but more difficult to flash a new kernel and see results. The easiest would be when kernel upgrade happens and this patch lands to see the difference and I can report back. Thanks, Usama > Thanks > Barry
On Tue, Jun 9, 2026 at 5:30 PM Usama Arif <usama.arif@linux.dev> wrote: > > > > On 09/06/2026 09:01, Barry Song wrote: > > On Mon, Jun 8, 2026 at 10:33 PM Usama Arif <usama.arif@linux.dev> wrote: > >> > >> swap_cluster_readahead() and swap_vma_readahead() end the readahead > >> loop with an explicit lru_add_drain() call. That drain is a leftover > >> from 2.6.12 era code and serves no functional purpose for the callers: > >> > >> - do_swap_page() ignores LRU residency for the readahead folios; > >> it only needs the target folio it called swapin_readahead() for, > >> and if the write-fault path needs the target folio on the LRU to count > >> references accurately, it runs its own lru_add_drain() at the > >> wp_can_reuse_anon_folio() and do_swap_page() sites. > > > > right. as i can see the below in do_swap_page(): > > > > /* > > * If we want to map a page that's in the swapcache writable, we > > * have to detect via the refcount if we're really the exclusive > > * owner. Try removing the extra reference from the local LRU > > * caches if required. > > */ > > if ((vmf->flags & FAULT_FLAG_WRITE) && > > !folio_test_ksm(folio) && !folio_test_lru(folio)) > > lru_add_drain(); > > > > and the below in wp_can_reuse_anon_folio(): > > > > if (!folio_test_lru(folio)) > > /* > > * We cannot easily detect+handle references from > > * remote LRU caches or references to LRU folios. > > */ > > lru_add_drain(); > > > >> > >> - shmem_swapin_cluster() immediately locks the returned folio, waits > >> for writeback, then operates on it - LRU residency of either the target > >> or the readahead folios is irrelevant. > >> > >> - try_to_unuse() likewise locks the folio and calls unuse_pte() without > >> depending on LRU presence. > >> > >> Folios newly added to the swap cache by the readahead loop sit in > >> the per-CPU LRU folio_batch and will be drained naturally as the > >> batch fills (FOLIO_BATCH_SIZE),by the next reclaim/compaction > >> lru_add_drain_all() and so on. The unconditional drain only > >> synchronously flushes a partial batch and forces contention on > >> lruvec_lock. > >> > >> On a 176-CPU production host running a memory-pressured workload, this > >> path was observed to call folio_batch_move_lru() from > >> swap_cluster_readahead() ~28K/min, a very large source of LRU lock > >> traffic. > >> > > > > Do we see a workload improvement? If yes, can we put the data? > > > > Hello Barry! > > So lru lock contention is a source of issue in the meta fleet. > > This problem was specifically seen when I ran `perf lock contention -a -b` > in production on a workload that has a really big anon heap and heavy swap > activity. > > When I tried to trace with bpftrace who was the biggest consumer, it was > readahead. > > It is easy to run perf and bpftrace on prod on this specific workload, but > more difficult to flash a new kernel and see results. The easiest would be > when kernel upgrade happens and this patch lands to see the difference and > I can report back. Yes, it seems fairly straightforward to reduce LRU lock contention. From the review, the patch looks good. I’m just not certain whether we might be missing any corner cases. I would be happy to see it queued for testing once the mm tree is ready to accept new patches. Is it too late at the moment? Feel free to add: Reviewed-by: Barry Song <baohua@kernel.org> Thanks, Barry
On Mon, Jun 08, 2026 at 07:32:42AM -0700, Usama Arif wrote:
> swap_cluster_readahead() and swap_vma_readahead() end the readahead
> loop with an explicit lru_add_drain() call. That drain is a leftover
> from 2.6.12 era code and serves no functional purpose for the callers:
>
> - do_swap_page() ignores LRU residency for the readahead folios;
> it only needs the target folio it called swapin_readahead() for,
> and if the write-fault path needs the target folio on the LRU to count
> references accurately, it runs its own lru_add_drain() at the
> wp_can_reuse_anon_folio() and do_swap_page() sites.
>
> - shmem_swapin_cluster() immediately locks the returned folio, waits
> for writeback, then operates on it - LRU residency of either the target
> or the readahead folios is irrelevant.
>
> - try_to_unuse() likewise locks the folio and calls unuse_pte() without
> depending on LRU presence.
>
> Folios newly added to the swap cache by the readahead loop sit in
> the per-CPU LRU folio_batch and will be drained naturally as the
> batch fills (FOLIO_BATCH_SIZE),by the next reclaim/compaction
> lru_add_drain_all() and so on. The unconditional drain only
> synchronously flushes a partial batch and forces contention on
> lruvec_lock.
>
> On a 176-CPU production host running a memory-pressured workload, this
> path was observed to call folio_batch_move_lru() from
> swap_cluster_readahead() ~28K/min, a very large source of LRU lock
> traffic.
>
> This is a direct continuation of the cleanup started in commit
> 1aa43598c03b ("mm: remove unnecessary calls to lru_add_drain") which
> removed the equivalent drain from free_pages_and_swap_cache() with
> the same rationale. A detailed reasoning for this is present in [1].
>
> Remove both drains.
>
> [1] https://lore.kernel.org/all/dca2824e8e88e826c6b260a831d79089b5b9c79d.camel@surriel.com/T/#u
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Thanks for pushing this. JP was also looking into LRU lock contention sources.
Particularly we lack visibiluty into the lru_add_drain_all() callers. The idea
was to add tracepoints to tracks such callers. (Just nudging you towards it :P)
On 6/8/26 9:06 AM, Shakeel Butt wrote:
> On Mon, Jun 08, 2026 at 07:32:42AM -0700, Usama Arif wrote:
>> swap_cluster_readahead() and swap_vma_readahead() end the readahead
>> loop with an explicit lru_add_drain() call. That drain is a leftover
>> from 2.6.12 era code and serves no functional purpose for the callers:
>>
>> - do_swap_page() ignores LRU residency for the readahead folios;
>> it only needs the target folio it called swapin_readahead() for,
>> and if the write-fault path needs the target folio on the LRU to count
>> references accurately, it runs its own lru_add_drain() at the
>> wp_can_reuse_anon_folio() and do_swap_page() sites.
>>
>> - shmem_swapin_cluster() immediately locks the returned folio, waits
>> for writeback, then operates on it - LRU residency of either the target
>> or the readahead folios is irrelevant.
>>
>> - try_to_unuse() likewise locks the folio and calls unuse_pte() without
>> depending on LRU presence.
>>
>> Folios newly added to the swap cache by the readahead loop sit in
>> the per-CPU LRU folio_batch and will be drained naturally as the
>> batch fills (FOLIO_BATCH_SIZE),by the next reclaim/compaction
>> lru_add_drain_all() and so on. The unconditional drain only
>> synchronously flushes a partial batch and forces contention on
>> lruvec_lock.
>>
>> On a 176-CPU production host running a memory-pressured workload, this
>> path was observed to call folio_batch_move_lru() from
>> swap_cluster_readahead() ~28K/min, a very large source of LRU lock
>> traffic.
>>
>> This is a direct continuation of the cleanup started in commit
>> 1aa43598c03b ("mm: remove unnecessary calls to lru_add_drain") which
>> removed the equivalent drain from free_pages_and_swap_cache() with
>> the same rationale. A detailed reasoning for this is present in [1].
>>
>> Remove both drains.
>>
>> [1] https://lore.kernel.org/all/dca2824e8e88e826c6b260a831d79089b5b9c79d.camel@surriel.com/T/#u
>>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>
> Thanks for pushing this. JP was also looking into LRU lock contention sources.
> Particularly we lack visibiluty into the lru_add_drain_all() callers. The idea
> was to add tracepoints to tracks such callers. (Just nudging you towards it :P)
Yup, I'll send a patch for that one.
© 2016 - 2026 Red Hat, Inc.