__alloc_pages_slowpath() has no change detection for ac->nodemask
in its retry path, while cpuset can modify it in parallel. For
processes whose mempolicy is MPOL_BIND, this means ac->nodemask can
change under us: should_reclaim_retry() then judges based on the
latest nodemask and jumps to retry, while get_page_from_freelist()
still traverses the zonelist from ac->preferred_zoneref, which was
selected using the now-stale nodemask. This can cause infinite
retries in some cases.
cpu 64:
__alloc_pages_slowpath {
/* ..... */
retry:
/* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
if (alloc_flags & ALLOC_KSWAPD)
wake_all_kswapds(order, gfp_mask, ac);
/* cpu 1:
cpuset_write_resmask
update_nodemask
update_nodemasks_hier
update_tasks_nodemask
mpol_rebind_task
mpol_rebind_policy
mpol_rebind_nodemask
// mempolicy->nodes has been modified,
// which ac->nodemask point to
*/
/* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
did_some_progress > 0, &no_progress_loops))
goto retry;
}
Simultaneously starting multiple cpuset01 instances from LTP can
quickly reproduce this issue on a multi-node server once maximum
memory pressure is reached and swap is enabled.
Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
---
mm/page_alloc.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd6b865cb1ab..1e82f5214a42 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
}
retry:
+ /*
+ * Deal with possible cpuset update races or zonelist updates to avoid
+ * infinite retries.
+ */
+ if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+ check_retry_zonelist(zonelist_iter_cookie))
+ goto restart;
+
/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
if (alloc_flags & ALLOC_KSWAPD)
wake_all_kswapds(order, gfp_mask, ac);
--
2.20.1
On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
> __alloc_pages_slowpath has no change detection for ac->nodemask
> in the part of retry path, while cpuset can modify it in parallel.
> For some processes that set mempolicy as MPOL_BIND, this results
> ac->nodemask changes, and then the should_reclaim_retry will
> judge based on the latest nodemask and jump to retry, while the
> get_page_from_freelist only traverses the zonelist from
> ac->preferred_zoneref, which selected by a expired nodemask
> and may cause infinite retries in some cases
>
> cpu 64:
> __alloc_pages_slowpath {
> /* ..... */
> retry:
> /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
> if (alloc_flags & ALLOC_KSWAPD)
> wake_all_kswapds(order, gfp_mask, ac);
> /* cpu 1:
> cpuset_write_resmask
> update_nodemask
> update_nodemasks_hier
> update_tasks_nodemask
> mpol_rebind_task
> mpol_rebind_policy
> mpol_rebind_nodemask
> // mempolicy->nodes has been modified,
> // which ac->nodemask point to
>
> */
> /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
> if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> did_some_progress > 0, &no_progress_loops))
> goto retry;
> }
>
> Simultaneously starting multiple cpuset01 from LTP can quickly
> reproduce this issue on a multi node server when the maximum
> memory pressure is reached and the swap is enabled
>
> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> ---
What commit does it fix and should it be backported to -stable?
There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
Andrew's mm.git repository now).
Let's Cc the page allocator folks here!
--
Cheers,
Harry / Hyeonggon
> mm/page_alloc.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fd6b865cb1ab..1e82f5214a42 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> }
>
> retry:
> + /*
> + * Deal with possible cpuset update races or zonelist updates to avoid
> + * infinite retries.
> + */
> + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> + check_retry_zonelist(zonelist_iter_cookie))
> + goto restart;
> +
> /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
> if (alloc_flags & ALLOC_KSWAPD)
> wake_all_kswapds(order, gfp_mask, ac);
> --
> 2.20.1
>
>
Hi.
On 2025/4/21 6:00 PM, Harry Yoo wrote:
> On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
>> __alloc_pages_slowpath has no change detection for ac->nodemask
>> in the part of retry path, while cpuset can modify it in parallel.
>> For some processes that set mempolicy as MPOL_BIND, this results
>> ac->nodemask changes, and then the should_reclaim_retry will
>> judge based on the latest nodemask and jump to retry, while the
>> get_page_from_freelist only traverses the zonelist from
>> ac->preferred_zoneref, which selected by a expired nodemask
>> and may cause infinite retries in some cases
>>
>> cpu 64:
>> __alloc_pages_slowpath {
>> /* ..... */
>> retry:
>> /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>> if (alloc_flags & ALLOC_KSWAPD)
>> wake_all_kswapds(order, gfp_mask, ac);
>> /* cpu 1:
>> cpuset_write_resmask
>> update_nodemask
>> update_nodemasks_hier
>> update_tasks_nodemask
>> mpol_rebind_task
>> mpol_rebind_policy
>> mpol_rebind_nodemask
>> // mempolicy->nodes has been modified,
>> // which ac->nodemask point to
>>
>> */
>> /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>> if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>> did_some_progress > 0, &no_progress_loops))
>> goto retry;
>> }
>>
>> Simultaneously starting multiple cpuset01 from LTP can quickly
>> reproduce this issue on a multi node server when the maximum
>> memory pressure is reached and the swap is enabled
>>
>> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
>> ---
> What commit does it fix and should it be backported to -stable?
>
> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> Andrew's mm.git repository now).
>
> Let's Cc the page allocator folks here!
We first identified this issue in 6.6.52-stable, and through root-cause
analysis it appears the problem may have existed for a significant
period. We recommend that the fix be backported at least to 6.6-stable
and later kernel versions.
On Tue, 22 Apr 2025 20:10:06 +0800 Tianyang Zhang <zhangtianyang@loongson.cn> wrote:

>
> ...
>
> >>
> >> Simultaneously starting multiple cpuset01 from LTP can quickly
> >> reproduce this issue on a multi node server when the maximum
> >> memory pressure is reached and the swap is enabled
> >>
> >> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> >> ---
> > What commit does it fix and should it be backported to -stable?
> >
> > There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> > Andrew's mm.git repository now).
> >
> > Let's Cc the page allocator folks here!
>
> We first identified this issue in 6.6.52-stable , and through root cause
> analysis,
>
> it appears the problem may have existed for a significant period.
>
> However It is recommended that the fix should be backported to at least
> Linux kernel versions after 6.6-stable

OK, thanks,

This has been in mm-hotfixes-unstable for six days. Hopefully we'll
see some review activity soon (please).
On Tue, Apr 22, 2025 at 5:11 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 22 Apr 2025 20:10:06 +0800 Tianyang Zhang <zhangtianyang@loongson.cn> wrote:
>
> >
> > ...
> >
> > >>
> > >> Simultaneously starting multiple cpuset01 from LTP can quickly
> > >> reproduce this issue on a multi node server when the maximum
> > >> memory pressure is reached and the swap is enabled
> > >>
> > >> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> > >> ---
> > > What commit does it fix and should it be backported to -stable?
> > >
> > > There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> > > Andrew's mm.git repository now).
> > >
> > > Let's Cc the page allocator folks here!
> >
> > We first identified this issue in 6.6.52-stable , and through root cause
> > analysis,
> >
> > it appears the problem may have existed for a significant period.
> >
> > However It is recommended that the fix should be backported to at least
> > Linux kernel versions after 6.6-stable
>
> OK, thanks,
>
> This has been in mm-hotfixes-unstable for six days. Hopefully we'll
> see some review activity soon (please).

I reviewed and provided my feedback but saw neither a reply nor a
respin with proposed changes.

>
On Tue, 22 Apr 2025 17:22:04 -0700 Suren Baghdasaryan <surenb@google.com> wrote:

> On Tue, Apr 22, 2025 at 5:11 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Tue, 22 Apr 2025 20:10:06 +0800 Tianyang Zhang <zhangtianyang@loongson.cn> wrote:
> >
> > >
> > > ...
> > >
> > > >>
> > > >> Simultaneously starting multiple cpuset01 from LTP can quickly
> > > >> reproduce this issue on a multi node server when the maximum
> > > >> memory pressure is reached and the swap is enabled
> > > >>
> > > >> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> > > >> ---
> > > > What commit does it fix and should it be backported to -stable?
> > > >
> > > > There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> > > > Andrew's mm.git repository now).
> > > >
> > > > Let's Cc the page allocator folks here!
> > >
> > > We first identified this issue in 6.6.52-stable , and through root cause
> > > analysis,
> > >
> > > it appears the problem may have existed for a significant period.
> > >
> > > However It is recommended that the fix should be backported to at least
> > > Linux kernel versions after 6.6-stable
> >
> > OK, thanks,
> >
> > This has been in mm-hotfixes-unstable for six days. Hopefully we'll
> > see some review activity soon (please).
>
> I reviewed and provided my feedback but saw neither a reply nor a
> respin with proposed changes.

OK, thanks. Do you have time to put together a modified version of this?
On Sat, May 10, 2025 at 8:07 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 22 Apr 2025 17:22:04 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > On Tue, Apr 22, 2025 at 5:11 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Tue, 22 Apr 2025 20:10:06 +0800 Tianyang Zhang <zhangtianyang@loongson.cn> wrote:
> > >
> > > >
> > > > ...
> > > >
> > > > >>
> > > > >> Simultaneously starting multiple cpuset01 from LTP can quickly
> > > > >> reproduce this issue on a multi node server when the maximum
> > > > >> memory pressure is reached and the swap is enabled
> > > > >>
> > > > >> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> > > > >> ---
> > > > > What commit does it fix and should it be backported to -stable?
> > > > >
> > > > > There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> > > > > Andrew's mm.git repository now).
> > > > >
> > > > > Let's Cc the page allocator folks here!
> > > >
> > > > We first identified this issue in 6.6.52-stable , and through root cause
> > > > analysis,
> > > >
> > > > it appears the problem may have existed for a significant period.
> > > >
> > > > However It is recommended that the fix should be backported to at least
> > > > Linux kernel versions after 6.6-stable
> > >
> > > OK, thanks,
> > >
> > > This has been in mm-hotfixes-unstable for six days. Hopefully we'll
> > > see some review activity soon (please).
> >
> > I reviewed and provided my feedback but saw neither a reply nor a
> > respin with proposed changes.
>
> OK, thanks. Do you have time to put together a modified version of this?

I think the code is fine as is. Would be good to add Fixes: tag but it
will require some investigation to find the appropriate patch to
reference here.
On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> > > > This has been in mm-hotfixes-unstable for six days. Hopefully we'll
> > > > see some review activity soon (please).
> > >
> > > I reviewed and provided my feedback but saw neither a reply nor a
> > > respin with proposed changes.
> >
> > OK, thanks. Do you have time to put together a modified version of this?
>
> I think the code is fine as is. Would be good to add Fixes: tag but it
> will require some investigation to find the appropriate patch to
> reference here.
Below is what is in mm-hotfixes. It doesn't actually have any
acked-by's or reviewed-by's.
So... final call for review, please.
From: Tianyang Zhang <zhangtianyang@loongson.cn>
Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
Date: Wed, 16 Apr 2025 16:24:05 +0800
__alloc_pages_slowpath() has no change detection for ac->nodemask in its
retry path, while cpuset can modify it in parallel. For processes whose
mempolicy is MPOL_BIND, this means ac->nodemask can change under us:
should_reclaim_retry() then judges based on the latest nodemask and jumps
to retry, while get_page_from_freelist() still traverses the zonelist from
ac->preferred_zoneref, which was selected using the now-stale nodemask.
This can cause infinite retries in some cases.
cpu 64:
__alloc_pages_slowpath {
/* ..... */
retry:
/* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
if (alloc_flags & ALLOC_KSWAPD)
wake_all_kswapds(order, gfp_mask, ac);
/* cpu 1:
cpuset_write_resmask
update_nodemask
update_nodemasks_hier
update_tasks_nodemask
mpol_rebind_task
mpol_rebind_policy
mpol_rebind_nodemask
// mempolicy->nodes has been modified,
// which ac->nodemask point to
*/
/* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
did_some_progress > 0, &no_progress_loops))
goto retry;
}
Simultaneously starting multiple cpuset01 instances from LTP can quickly
reproduce this issue on a multi-node server once maximum memory pressure
is reached and swap is enabled.
Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
Fixes: 902b62810a57 ("mm, page_alloc: fix more premature OOM due to race with cpuset update").
Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/page_alloc.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/mm/page_alloc.c~mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race
+++ a/mm/page_alloc.c
@@ -4562,6 +4562,14 @@ restart:
}
retry:
+ /*
+ * Deal with possible cpuset update races or zonelist updates to avoid
+ * infinite retries.
+ */
+ if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+ check_retry_zonelist(zonelist_iter_cookie))
+ goto restart;
+
/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
if (alloc_flags & ALLOC_KSWAPD)
wake_all_kswapds(order, gfp_mask, ac);
_
On 5/13/25 21:16, Andrew Morton wrote:
> On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
>> > > > This has been in mm-hotfixes-unstable for six days. Hopefully we'll
>> > > > see some review activity soon (please).
>> > >
>> > > I reviewed and provided my feedback but saw neither a reply nor a
>> > > respin with proposed changes.
>> >
>> > OK, thanks. Do you have time to put together a modified version of this?
>>
>> I think the code is fine as is. Would be good to add Fixes: tag but it
>> will require some investigation to find the appropriate patch to
>> reference here.
>
> Below is what is in mm-hotfixes. It doesn't actually have any
> acked-by's or reviewed-by's.
>
> So... final call for review, please.
>
>
> From: Tianyang Zhang <zhangtianyang@loongson.cn>
> Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
> Date: Wed, 16 Apr 2025 16:24:05 +0800
>
> __alloc_pages_slowpath has no change detection for ac->nodemask in the
> part of retry path, while cpuset can modify it in parallel. For some
> processes that set mempolicy as MPOL_BIND, this results ac->nodemask
> changes, and then the should_reclaim_retry will judge based on the latest
> nodemask and jump to retry, while the get_page_from_freelist only
> traverses the zonelist from ac->preferred_zoneref, which selected by a
> expired nodemask and may cause infinite retries in some cases
>
> cpu 64:
> __alloc_pages_slowpath {
> /* ..... */
> retry:
> /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
> if (alloc_flags & ALLOC_KSWAPD)
> wake_all_kswapds(order, gfp_mask, ac);
> /* cpu 1:
> cpuset_write_resmask
> update_nodemask
> update_nodemasks_hier
> update_tasks_nodemask
> mpol_rebind_task
> mpol_rebind_policy
> mpol_rebind_nodemask
> // mempolicy->nodes has been modified,
> // which ac->nodemask point to
>
> */
> /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
> if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> did_some_progress > 0, &no_progress_loops))
> goto retry;
> }
>
> Simultaneously starting multiple cpuset01 from LTP can quickly reproduce
> this issue on a multi node server when the maximum memory pressure is
> reached and the swap is enabled
>
> Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
> Fixes: 902b62810a57 ("mm, page_alloc: fix more premature OOM due to race with cpuset update").
After the discussion in this thread, Suren retracted this Fixes: suggestion.
I think it actually goes back to this one which introduced the
preferred_zoneref caching.
Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a
zonelist twice")
> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Brendan Jackman <jackmanb@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
I would have placed the check a bit further down, just above the
should_reclaim_retry() call, but that's not important enough to hold up a
fix and can be done later.
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>
> mm/page_alloc.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> --- a/mm/page_alloc.c~mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race
> +++ a/mm/page_alloc.c
> @@ -4562,6 +4562,14 @@ restart:
> }
>
> retry:
> + /*
> + * Deal with possible cpuset update races or zonelist updates to avoid
> + * infinite retries.
> + */
> + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> + check_retry_zonelist(zonelist_iter_cookie))
> + goto restart;
> +
> /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
> if (alloc_flags & ALLOC_KSWAPD)
> wake_all_kswapds(order, gfp_mask, ac);
> _
>
Hi,
On 2025/5/14 3:34 PM, Vlastimil Babka wrote:
> On 5/13/25 21:16, Andrew Morton wrote:
>> On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>>
>>>>>> This has been in mm-hotfixes-unstable for six days. Hopefully we'll
>>>>>> see some review activity soon (please).
>>>>> I reviewed and provided my feedback but saw neither a reply nor a
>>>>> respin with proposed changes.
>>>> OK, thanks. Do you have time to put together a modified version of this?
>>> I think the code is fine as is. Would be good to add Fixes: tag but it
>>> will require some investigation to find the appropriate patch to
>>> reference here.
>> Below is what is in mm-hotfixes. It doesn't actually have any
>> acked-by's or reviewed-by's.
>>
>> So... final call for review, please.
>>
>>
>> From: Tianyang Zhang <zhangtianyang@loongson.cn>
>> Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
>> Date: Wed, 16 Apr 2025 16:24:05 +0800
>>
>> __alloc_pages_slowpath has no change detection for ac->nodemask in the
>> part of retry path, while cpuset can modify it in parallel. For some
>> processes that set mempolicy as MPOL_BIND, this results ac->nodemask
>> changes, and then the should_reclaim_retry will judge based on the latest
>> nodemask and jump to retry, while the get_page_from_freelist only
>> traverses the zonelist from ac->preferred_zoneref, which selected by a
>> expired nodemask and may cause infinite retries in some cases
>>
>> cpu 64:
>> __alloc_pages_slowpath {
>> /* ..... */
>> retry:
>> /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>> if (alloc_flags & ALLOC_KSWAPD)
>> wake_all_kswapds(order, gfp_mask, ac);
>> /* cpu 1:
>> cpuset_write_resmask
>> update_nodemask
>> update_nodemasks_hier
>> update_tasks_nodemask
>> mpol_rebind_task
>> mpol_rebind_policy
>> mpol_rebind_nodemask
>> // mempolicy->nodes has been modified,
>> // which ac->nodemask point to
>>
>> */
>> /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>> if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>> did_some_progress > 0, &no_progress_loops))
>> goto retry;
>> }
>>
>> Simultaneously starting multiple cpuset01 from LTP can quickly reproduce
>> this issue on a multi node server when the maximum memory pressure is
>> reached and the swap is enabled
>>
>> Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
>> Fixes: 902b62810a57 ("mm, page_alloc: fix more premature OOM due to race with cpuset update").
> After the discussion in this thread, Suren retracted this Fixes: suggestion.
> I think it actually goes back to this one which introduced the
> preferred_zoneref caching.
>
> Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a
> zonelist twice")
Yes, the problem was likely introduced by that patch, thank you.
>
>> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Suren Baghdasaryan <surenb@google.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Brendan Jackman <jackmanb@google.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> I would have placed the check bit further down, just above the
> should_reclaim_retry() call, but it's not that important to hold up a fix
> and can be done later.
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
>> ---
>>
>> mm/page_alloc.c | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> --- a/mm/page_alloc.c~mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race
>> +++ a/mm/page_alloc.c
>> @@ -4562,6 +4562,14 @@ restart:
>> }
>>
>> retry:
>> + /*
>> + * Deal with possible cpuset update races or zonelist updates to avoid
>> + * infinite retries.
>> + */
>> + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
>> + check_retry_zonelist(zonelist_iter_cookie))
>> + goto restart;
>> +
>> /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>> if (alloc_flags & ALLOC_KSWAPD)
>> wake_all_kswapds(order, gfp_mask, ac);
>> _
>>
On Wed, 14 May 2025 09:34:53 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> On 5/13/25 21:16, Andrew Morton wrote:
> > On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> >
> >> > > > This has been in mm-hotfixes-unstable for six days. Hopefully we'll
> >> > > > see some review activity soon (please).
> >> > >
> >> > > I reviewed and provided my feedback but saw neither a reply nor a
> >> > > respin with proposed changes.
> >> >
> >> > OK, thanks. Do you have time to put together a modified version of this?
> >>
> >> I think the code is fine as is. Would be good to add Fixes: tag but it
> >> will require some investigation to find the appropriate patch to
> >> reference here.
> >
> > Below is what is in mm-hotfixes. It doesn't actually have any
> > acked-by's or reviewed-by's.
> >
> > So... final call for review, please.
> >
>
> ...
>
> After the discussion in this thread, Suren retracted this Fixes: suggestion.
> I think it actually goes back to this one which introduced the
> preferred_zoneref caching.
>
> Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a
> zonelist twice")
Updated.
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Thanks.
On Tue, May 13, 2025 at 12:16 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > > > > This has been in mm-hotfixes-unstable for six days. Hopefully we'll
> > > > > see some review activity soon (please).
> > > >
> > > > I reviewed and provided my feedback but saw neither a reply nor a
> > > > respin with proposed changes.
> > >
> > > OK, thanks. Do you have time to put together a modified version of this?
> >
> > I think the code is fine as is. Would be good to add Fixes: tag but it
> > will require some investigation to find the appropriate patch to
> > reference here.
>
> Below is what is in mm-hotfixes. It doesn't actually have any
> acked-by's or reviewed-by's.
>
> So... final call for review, please.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
>
> From: Tianyang Zhang <zhangtianyang@loongson.cn>
> Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
> Date: Wed, 16 Apr 2025 16:24:05 +0800
>
> __alloc_pages_slowpath has no change detection for ac->nodemask in the
> part of retry path, while cpuset can modify it in parallel. For some
> processes that set mempolicy as MPOL_BIND, this results ac->nodemask
> changes, and then the should_reclaim_retry will judge based on the latest
> nodemask and jump to retry, while the get_page_from_freelist only
> traverses the zonelist from ac->preferred_zoneref, which selected by a
> expired nodemask and may cause infinite retries in some cases
>
> cpu 64:
> __alloc_pages_slowpath {
> /* ..... */
> retry:
> /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
> if (alloc_flags & ALLOC_KSWAPD)
> wake_all_kswapds(order, gfp_mask, ac);
> /* cpu 1:
> cpuset_write_resmask
> update_nodemask
> update_nodemasks_hier
> update_tasks_nodemask
> mpol_rebind_task
> mpol_rebind_policy
> mpol_rebind_nodemask
> // mempolicy->nodes has been modified,
> // which ac->nodemask point to
>
> */
> /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
> if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> did_some_progress > 0, &no_progress_loops))
> goto retry;
> }
>
> Simultaneously starting multiple cpuset01 from LTP can quickly reproduce
> this issue on a multi node server when the maximum memory pressure is
> reached and the swap is enabled
>
> Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
> Fixes: 902b62810a57 ("mm, page_alloc: fix more premature OOM due to race with cpuset update").
> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Brendan Jackman <jackmanb@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
> mm/page_alloc.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> --- a/mm/page_alloc.c~mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race
> +++ a/mm/page_alloc.c
> @@ -4562,6 +4562,14 @@ restart:
> }
>
> retry:
> + /*
> + * Deal with possible cpuset update races or zonelist updates to avoid
> + * infinite retries.
> + */
> + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> + check_retry_zonelist(zonelist_iter_cookie))
> + goto restart;
> +
> /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
> if (alloc_flags & ALLOC_KSWAPD)
> wake_all_kswapds(order, gfp_mask, ac);
> _
>
On Mon, Apr 21, 2025 at 3:00 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
> > __alloc_pages_slowpath has no change detection for ac->nodemask
> > in the part of retry path, while cpuset can modify it in parallel.
> > For some processes that set mempolicy as MPOL_BIND, this results
> > ac->nodemask changes, and then the should_reclaim_retry will
> > judge based on the latest nodemask and jump to retry, while the
> > get_page_from_freelist only traverses the zonelist from
> > ac->preferred_zoneref, which selected by a expired nodemask
> > and may cause infinite retries in some cases
> >
> > cpu 64:
> > __alloc_pages_slowpath {
> > /* ..... */
> > retry:
> > /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
> > if (alloc_flags & ALLOC_KSWAPD)
> > wake_all_kswapds(order, gfp_mask, ac);
> > /* cpu 1:
> > cpuset_write_resmask
> > update_nodemask
> > update_nodemasks_hier
> > update_tasks_nodemask
> > mpol_rebind_task
> > mpol_rebind_policy
> > mpol_rebind_nodemask
> > // mempolicy->nodes has been modified,
> > // which ac->nodemask point to
> >
> > */
> > /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
> > if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> > did_some_progress > 0, &no_progress_loops))
> > goto retry;
> > }
> >
> > Simultaneously starting multiple cpuset01 from LTP can quickly
> > reproduce this issue on a multi node server when the maximum
> > memory pressure is reached and the swap is enabled
> >
> > Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> > ---
>
> What commit does it fix and should it be backported to -stable?
I think it fixes 902b62810a57 ("mm, page_alloc: fix more premature OOM
due to race with cpuset update").
>
> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> Andrew's mm.git repository now).
>
> Let's Cc the page allocator folks here!
>
> --
> Cheers,
> Harry / Hyeonggon
>
> > mm/page_alloc.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index fd6b865cb1ab..1e82f5214a42 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > }
> >
> > retry:
> > + /*
> > + * Deal with possible cpuset update races or zonelist updates to avoid
> > + * infinite retries.
> > + */
> > + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> > + check_retry_zonelist(zonelist_iter_cookie))
> > + goto restart;
> > +
We have this check later in this block:
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652,
so IIUC you effectively are moving it to be called before
should_reclaim_retry(). If so, I think you should remove the old one
(the one I linked earlier) as it seems to be unnecessary duplication
at this point.
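Suren's point, sketched schematically (abbreviated pseudocode, not the
actual mm/page_alloc.c source): the posted patch leaves two copies of
the cookie check in the retry block, and the suggestion is to keep only
the one that runs before should_reclaim_retry().

```c
/* As posted: two cookie checks per retry pass. */
retry:
	if (check_retry_cpuset(...) || check_retry_zonelist(...))  /* new */
		goto restart;
	/* ... reclaim, compaction ... */
	if (should_reclaim_retry(...))
		goto retry;
	/* ... */
	if (check_retry_cpuset(...) || check_retry_zonelist(...))  /* old */
		goto restart;

/* Suggested: a single check, evaluated before should_reclaim_retry(). */
retry:
	if (check_retry_cpuset(...) || check_retry_zonelist(...))
		goto restart;
	/* ... reclaim, compaction ... */
	if (should_reclaim_retry(...))
		goto retry;
```

Either placement closes the race; the single-check form just avoids
re-reading the cookies twice per loop iteration.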
> > /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
> > if (alloc_flags & ALLOC_KSWAPD)
> > wake_all_kswapds(order, gfp_mask, ac);
> > --
> > 2.20.1
> >
> >
Hi, Suren
On 2025/4/22 4:28 AM, Suren Baghdasaryan wrote:
> On Mon, Apr 21, 2025 at 3:00 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>> On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
>>> __alloc_pages_slowpath has no change detection for ac->nodemask
>>> in the part of retry path, while cpuset can modify it in parallel.
>>> For some processes that set mempolicy as MPOL_BIND, this results
>>> ac->nodemask changes, and then the should_reclaim_retry will
>>> judge based on the latest nodemask and jump to retry, while the
>>> get_page_from_freelist only traverses the zonelist from
>>> ac->preferred_zoneref, which selected by a expired nodemask
>>> and may cause infinite retries in some cases
>>>
>>> cpu 64:
>>> __alloc_pages_slowpath {
>>> /* ..... */
>>> retry:
>>> /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>>> if (alloc_flags & ALLOC_KSWAPD)
>>> wake_all_kswapds(order, gfp_mask, ac);
>>> /* cpu 1:
>>> cpuset_write_resmask
>>> update_nodemask
>>> update_nodemasks_hier
>>> update_tasks_nodemask
>>> mpol_rebind_task
>>> mpol_rebind_policy
>>> mpol_rebind_nodemask
>>> // mempolicy->nodes has been modified,
>>> // which ac->nodemask point to
>>>
>>> */
>>> /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>>> if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>>> did_some_progress > 0, &no_progress_loops))
>>> goto retry;
>>> }
>>>
>>> Simultaneously starting multiple cpuset01 tests from LTP can
>>> quickly reproduce this issue on a multi-node server once maximum
>>> memory pressure is reached and swap is enabled.
>>>
>>> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
>>> ---
>> What commit does it fix and should it be backported to -stable?
> I think it fixes 902b62810a57 ("mm, page_alloc: fix more premature OOM
> due to race with cpuset update").
I think this issue is unlikely to have been introduced by patch
902b62810a57, as the infinite-retries section from
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4568
to
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4628,
where the cpuset race condition occurs, is unmodified by that patch.
>> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
>> Andrew's mm.git repository now).
>>
>> Let's Cc the page allocator folks here!
>>
>> --
>> Cheers,
>> Harry / Hyeonggon
>>
>>> mm/page_alloc.c | 8 ++++++++
>>> 1 file changed, 8 insertions(+)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index fd6b865cb1ab..1e82f5214a42 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>> }
>>>
>>> retry:
>>> + /*
>>> + * Deal with possible cpuset update races or zonelist updates to avoid
>>> + * infinite retries.
>>> + */
>>> + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
>>> + check_retry_zonelist(zonelist_iter_cookie))
>>> + goto restart;
>>> +
> We have this check later in this block:
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652,
> so IIUC you effectively are moving it to be called before
> should_reclaim_retry(). If so, I think you should remove the old one
> (the one I linked earlier) as it seems to be unnecessary duplication
> at this point.
In my understanding, the code at
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
was introduced to prevent unnecessary OOM (Out-of-Memory) conditions
in __alloc_pages_may_oom.
If that old check is removed, the newly added check (on retry-loop
entry) cannot guarantee that the cpuset is still valid by the time the
flow reaches __alloc_pages_may_oom, especially if the task is
rescheduled in between.
Therefore, I think retaining the original code logic is necessary to
ensure correctness under concurrency.
>
>
>>> /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>>> if (alloc_flags & ALLOC_KSWAPD)
>>> wake_all_kswapds(order, gfp_mask, ac);
>>> --
>>> 2.20.1
>>>
>>>
Thanks
On Tue, Apr 22, 2025 at 7:39 PM Tianyang Zhang
<zhangtianyang@loongson.cn> wrote:
>
> Hi, Suren
>
> On 2025/4/22 4:28 AM, Suren Baghdasaryan wrote:
> > On Mon, Apr 21, 2025 at 3:00 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> >> On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
> >>> __alloc_pages_slowpath has no change detection for ac->nodemask
> >>> in the retry path, while cpuset can modify it in parallel. For
> >>> processes whose mempolicy is MPOL_BIND, this results in
> >>> ac->nodemask changing; should_reclaim_retry then judges based on
> >>> the latest nodemask and jumps to retry, while
> >>> get_page_from_freelist only traverses the zonelist from
> >>> ac->preferred_zoneref, which was selected with an expired
> >>> nodemask, and this may cause infinite retries in some cases.
> >>>
> >>> cpu 64:
> >>> __alloc_pages_slowpath {
> >>> /* ..... */
> >>> retry:
> >>> /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
> >>> if (alloc_flags & ALLOC_KSWAPD)
> >>> wake_all_kswapds(order, gfp_mask, ac);
> >>> /* cpu 1:
> >>> cpuset_write_resmask
> >>> update_nodemask
> >>> update_nodemasks_hier
> >>> update_tasks_nodemask
> >>> mpol_rebind_task
> >>> mpol_rebind_policy
> >>> mpol_rebind_nodemask
> >>> // mempolicy->nodes has been modified,
> >>> // which ac->nodemask point to
> >>>
> >>> */
> >>> /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
> >>> if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> >>> did_some_progress > 0, &no_progress_loops))
> >>> goto retry;
> >>> }
> >>>
> >>> Simultaneously starting multiple cpuset01 tests from LTP can
> >>> quickly reproduce this issue on a multi-node server once maximum
> >>> memory pressure is reached and swap is enabled.
> >>>
> >>> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> >>> ---
> >> What commit does it fix and should it be backported to -stable?
> > I think it fixes 902b62810a57 ("mm, page_alloc: fix more premature OOM
> > due to race with cpuset update").
>
> I think this issue is unlikely to have been introduced by patch
> 902b62810a57, as the infinite-retries section from
>
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4568
> to
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4628,
>
> where the cpuset race condition occurs, is unmodified by that patch.
Yeah, you are right. After looking into it some more, 902b62810a57 is
the wrong patch to blame for this infinite loop.
>
> >> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> >> Andrew's mm.git repository now).
> >>
> >> Let's Cc the page allocator folks here!
> >>
> >> --
> >> Cheers,
> >> Harry / Hyeonggon
> >>
> >>> mm/page_alloc.c | 8 ++++++++
> >>> 1 file changed, 8 insertions(+)
> >>>
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index fd6b865cb1ab..1e82f5214a42 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >>> }
> >>>
> >>> retry:
> >>> + /*
> >>> + * Deal with possible cpuset update races or zonelist updates to avoid
> >>> + * infinite retries.
> >>> + */
> >>> + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> >>> + check_retry_zonelist(zonelist_iter_cookie))
> >>> + goto restart;
> >>> +
> > We have this check later in this block:
> > https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652,
> > so IIUC you effectively are moving it to be called before
> > should_reclaim_retry(). If so, I think you should remove the old one
> > (the one I linked earlier) as it seems to be unnecessary duplication
> > at this point.
> In my understanding, the code at
>
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
>
> was introduced to prevent unnecessary OOM (Out-of-Memory) conditions
> in __alloc_pages_may_oom.
>
> If that old check is removed, the newly added check (on retry-loop
> entry) cannot guarantee that the cpuset is still valid by the time
> the flow reaches __alloc_pages_may_oom, especially if the task is
> rescheduled in between.
Well, rescheduling can happen even between
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
and https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4657
but I see your point. Also should_reclaim_retry() does not include
zonelist change detection, so keeping the checks at
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
sounds like a good idea.
>
> Therefore, I think retaining the original code logic is necessary to
> ensure correctness under concurrency.
>
> >
> >
> >>> /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
> >>> if (alloc_flags & ALLOC_KSWAPD)
> >>> wake_all_kswapds(order, gfp_mask, ac);
> >>> --
> >>> 2.20.1
> >>>
> >>>
> Thanks
>
On 4/23/25 17:35, Suren Baghdasaryan wrote:
>> >> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
>> >> Andrew's mm.git repository now).
>> >>
>> >> Let's Cc the page allocator folks here!
>> >>
>> >> --
>> >> Cheers,
>> >> Harry / Hyeonggon
>> >>
>> >>> mm/page_alloc.c | 8 ++++++++
>> >>> 1 file changed, 8 insertions(+)
>> >>>
>> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> >>> index fd6b865cb1ab..1e82f5214a42 100644
>> >>> --- a/mm/page_alloc.c
>> >>> +++ b/mm/page_alloc.c
>> >>> @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>> >>> }
>> >>>
>> >>> retry:
>> >>> + /*
>> >>> + * Deal with possible cpuset update races or zonelist updates to avoid
>> >>> + * infinite retries.
>> >>> + */
>> >>> + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
>> >>> + check_retry_zonelist(zonelist_iter_cookie))
>> >>> + goto restart;
>> >>> +
>> > We have this check later in this block:
>> > https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652,
>> > so IIUC you effectively are moving it to be called before
>> > should_reclaim_retry(). If so, I think you should remove the old one
>> > (the one I linked earlier) as it seems to be unnecessary duplication
>> > at this point.
>> In my understanding, the code in
>>
>> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
>>
>> was introduced to prevent unnecessary OOM (Out-of-Memory) conditions
>> in __alloc_pages_may_oom.
>>
>> If old code is removed, the newly added code (on retry loop entry)
>> cannot guarantee that the cpuset
>>
>> remains valid when the flow reaches __alloc_pages_may_oom, especially
>> if scheduling occurs during this section.
>
> Well, rescheduling can happen even between
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
> and https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4657
> but I see your point. Also should_reclaim_retry() does not include

I think the rescheduling isn't a problem because what we're testing is
"we are about to oom, could it have been because we raced?" and the race
would have affected the code before #L4652. If we didn't race and yet
determined it's time for oom, a race between #L4652 and #L4657 shouldn't
matter. The get_page_from_freelist() in __alloc_pages_may_oom() isn't
that important for preventing premature oom AFAICS, given it uses high
wmark.

That said, I think the newly added check could be more logically placed
above the call to should_reclaim_retry() instead of right after the
retry: label, but it's not critical.

> zonelist change detection, so keeping the checks at
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
> sounds like a good idea.
>
>>
>> Therefore, I think retaining the original code logic is necessary to
>> ensure correctness under concurrency.
>>
>> >
>> >
>> >>> /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>> >>> if (alloc_flags & ALLOC_KSWAPD)
>> >>> wake_all_kswapds(order, gfp_mask, ac);
>> >>> --
>> >>> 2.20.1
>> >>>
>> >>>
>> Thanks
>>