[PATCH mm-unstable v1] mm/page_alloc: try not to overestimate free highatomic

Posted by Yu Zhao 1 month ago
OOM kills due to vastly overestimated free highatomic reserves were
observed:

  ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
  Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
  Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB

The second line above shows that the OOM kill was due to the following
condition:

  free (1482936kB) - reserved_highatomic (1073152kB) = 409784kB < min (410416kB)

And the third line shows there were no free pages in any
MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
'H'. Therefore __zone_watermark_unusable_free() overestimated free
highatomic reserves. IOW, it underestimated the usable free memory by
over 1GB, which resulted in the unnecessary OOM kill.

The estimation can be made less crude by quickly checking whether there
are any free highatomic reserves at all. If there are none, do not deduct
the entire highatomic reserves when calculating usable free memory.

Reported-by: Link Lin <linkl@google.com>
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 mm/page_alloc.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bc55d39eb372..ee1ce19925ad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3110,6 +3110,25 @@ struct page *rmqueue(struct zone *preferred_zone,
 	return page;
 }
 
+static unsigned long get_max_free_highatomic(struct zone *zone)
+{
+	int order;
+	unsigned long free = 0;
+	unsigned long reserved = zone->nr_reserved_highatomic;
+
+	if (reserved <= pageblock_nr_pages)
+		return reserved;
+
+	for (order = 0; order <= MAX_PAGE_ORDER; order++) {
+		struct free_area *area = &zone->free_area[order];
+
+		if (!list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
+			free += READ_ONCE(area->nr_free) << order;
+	}
+
+	return min(reserved, free);
+}
+
 static inline long __zone_watermark_unusable_free(struct zone *z,
 				unsigned int order, unsigned int alloc_flags)
 {
@@ -3117,11 +3136,11 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
 
 	/*
 	 * If the caller does not have rights to reserves below the min
-	 * watermark then subtract the high-atomic reserves. This will
-	 * over-estimate the size of the atomic reserve but it avoids a search.
+	 * watermark then subtract the high-atomic reserves. This can
+	 * overestimate the size of free high-atomic reserves.
 	 */
 	if (likely(!(alloc_flags & ALLOC_RESERVES)))
-		unusable_free += z->nr_reserved_highatomic;
+		unusable_free += get_max_free_highatomic(z);
 
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
-- 
2.47.0.rc1.288.g06298d1525-goog
Re: [PATCH mm-unstable v1] mm/page_alloc: try not to overestimate free highatomic
Posted by Michal Hocko 1 month ago
On Sat 19-10-24 23:13:15, Yu Zhao wrote:
> OOM kills due to vastly overestimated free highatomic reserves were
> observed:
> 
>   ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
>   Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
>   Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB
> 
> The second line above shows that the OOM kill was due to the following
> condition:
> 
>   free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
> 
> And the third line shows there were no free pages in any
> MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
> 'H'. Therefore __zone_watermark_unusable_free() overestimated free
> highatomic reserves. IOW, it underestimated the usable free memory by
> over 1GB, which resulted in the unnecessary OOM kill.

Why doesn't unreserve_highatomic_pageblock deal with this situation?

-- 
Michal Hocko
SUSE Labs
Re: [PATCH mm-unstable v1] mm/page_alloc: try not to overestimate free highatomic
Posted by Yu Zhao 1 month ago
On Mon, Oct 21, 2024 at 2:13 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sat 19-10-24 23:13:15, Yu Zhao wrote:
> > OOM kills due to vastly overestimated free highatomic reserves were
> > observed:
> >
> >   ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
> >   Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
> >   Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB
> >
> > The second line above shows that the OOM kill was due to the following
> > condition:
> >
> >   free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
> >
> > And the third line shows there were no free pages in any
> > MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
> > 'H'. Therefore __zone_watermark_unusable_free() overestimated free
> > highatomic reserves. IOW, it underestimated the usable free memory by
> > over 1GB, which resulted in the unnecessary OOM kill.
>
> Why doesn't unreserve_highatomic_pageblock deal with this situation?

The current behavior of unreserve_highatomic_pageblock() seems WAI to
me: it unreserves highatomic pageblocks that contain *free* pages so
that those pages can become usable to others. There is nothing to
unreserve when they have no free pages.
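
For reference, a heavily simplified sketch of the per-zone scan that
unreserve_highatomic_pageblock() does today (the wrapper name below is made
up, and locking, zonelist iteration and the keep-one-pageblock handling are
omitted) -- the point being that it can only reach a highatomic pageblock
through the free lists:

static bool unreserve_one_highatomic_pageblock(struct zone *zone)
{
	int order;

	for (order = 0; order <= MAX_PAGE_ORDER; order++) {
		struct free_area *area = &zone->free_area[order];
		struct page *page;

		/*
		 * A fully used pageblock has no pages on any free list,
		 * so this scan can never find it.
		 */
		page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
		if (!page)
			continue;

		zone->nr_reserved_highatomic -= min(pageblock_nr_pages,
						    zone->nr_reserved_highatomic);
		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC,
				     MIGRATE_MOVABLE);
		return true;
	}

	return false;
}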
Re: [PATCH mm-unstable v1] mm/page_alloc: try not to overestimate free highatomic
Posted by Michal Hocko 1 month ago
On Mon 21-10-24 11:10:50, Yu Zhao wrote:
> On Mon, Oct 21, 2024 at 2:13 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Sat 19-10-24 23:13:15, Yu Zhao wrote:
> > > OOM kills due to vastly overestimated free highatomic reserves were
> > > observed:
> > >
> > >   ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
> > >   Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
> > >   Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB
> > >
> > > The second line above shows that the OOM kill was due to the following
> > > condition:
> > >
> > >   free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
> > >
> > > And the third line shows there were no free pages in any
> > > MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
> > > 'H'. Therefore __zone_watermark_unusable_free() overestimated free
> > > highatomic reserves. IOW, it underestimated the usable free memory by
> > > over 1GB, which resulted in the unnecessary OOM kill.
> >
> > Why doesn't unreserve_highatomic_pageblock deal with this situation?
> 
> The current behavior of unreserve_highatomic_pageblock() seems WAI to
> me: it unreserves highatomic pageblocks that contain *free* pages so
> that those pages can become usable to others. There is nothing to
> unreserve when they have no free pages.

I do not follow. How can you have reserved highatomic pages of that size
without having page blocks with free memory? In other words, is this an
accounting problem or a reserves problem? This is not really clear from
your description.

-- 
Michal Hocko
SUSE Labs
Re: [PATCH mm-unstable v1] mm/page_alloc: try not to overestimate free highatomic
Posted by Vlastimil Babka 1 month ago
+Cc Mel and Matt

On 10/21/24 19:25, Michal Hocko wrote:
> On Mon 21-10-24 11:10:50, Yu Zhao wrote:
>> On Mon, Oct 21, 2024 at 2:13 AM Michal Hocko <mhocko@suse.com> wrote:
>> >
>> > On Sat 19-10-24 23:13:15, Yu Zhao wrote:
>> > > OOM kills due to vastly overestimated free highatomic reserves were
>> > > observed:
>> > >
>> > >   ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
>> > >   Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
>> > >   Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB
>> > >
>> > > The second line above shows that the OOM kill was due to the following
>> > > condition:
>> > >
>> > >   free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
>> > >
>> > > And the third line shows there were no free pages in any
>> > > MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
>> > > 'H'. Therefore __zone_watermark_unusable_free() overestimated free
>> > > highatomic reserves. IOW, it underestimated the usable free memory by
>> > > over 1GB, which resulted in the unnecessary OOM kill.
>> >
>> > Why doesn't unreserve_highatomic_pageblock deal with this situation?
>> 
>> The current behavior of unreserve_highatomic_pageblock() seems WAI to
>> me: it unreserves highatomic pageblocks that contain *free* pages so

Hm I don't think it's completely WAI. The intention is that we should be
able to unreserve the highatomic pageblocks before going OOM, and there
seems to be an unintended corner case: if the pageblocks are fully
exhausted, they are not reachable for unreserving. nr_reserved_highatomic is then
also fully misleading, as it prevents allocations due to a limit that does
not reflect reality. Your patch addresses the second issue, but there's a
cost to it when calculating the watermarks, and it would be better to
address the root issue instead.

>> that those pages can become usable to others. There is nothing to
>> unreserve when they have no free pages.

Yeah there are no actual free pages to unreserve, but unreserving would fix
the nr_highatomic overestimate and thus allow allocations to proceed.

> I do not follow. How can you have reserved highatomic pages of that size
> without having page blocks with free memory. In other words is this an
> accounting problem or reserves problem? This is not really clear from
> your description.

I think it's the problem of finding the highatomic pageblocks for
unreserving them once they become full. The proper fix is not exactly
trivial though. Either we'll have to scan for highatomic pageblocks in the
pageblock bitmap, or track them using an additional data structure.
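
For illustration, the bitmap-scan variant would look roughly like this
(hypothetical helper, ignoring zone holes, locking and races with the
migratetype changing under us):

/* Walk every pageblock in the zone and count the ones currently marked
 * MIGRATE_HIGHATOMIC in the pageblock bitmap. */
static unsigned long count_highatomic_pageblocks(struct zone *zone)
{
	unsigned long pfn, nr = 0;

	for (pfn = zone->zone_start_pfn; pfn < zone_end_pfn(zone);
	     pfn += pageblock_nr_pages) {
		struct page *page = pfn_to_online_page(pfn);

		if (page && is_migrate_highatomic(get_pageblock_migratetype(page)))
			nr++;
	}

	return nr;
}

On a large zone that can easily mean scanning hundreds of thousands of
pageblocks per attempt, which is why neither option is trivial.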
Re: [PATCH mm-unstable v1] mm/page_alloc: try not to overestimate free highatomic
Posted by Yu Zhao 1 month ago
On Tue, Oct 22, 2024 at 4:53 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> +Cc Mel and Matt
>
> On 10/21/24 19:25, Michal Hocko wrote:
> > On Mon 21-10-24 11:10:50, Yu Zhao wrote:
> >> On Mon, Oct 21, 2024 at 2:13 AM Michal Hocko <mhocko@suse.com> wrote:
> >> >
> >> > On Sat 19-10-24 23:13:15, Yu Zhao wrote:
> >> > > OOM kills due to vastly overestimated free highatomic reserves were
> >> > > observed:
> >> > >
> >> > >   ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
> >> > >   Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
> >> > >   Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB
> >> > >
> >> > > The second line above shows that the OOM kill was due to the following
> >> > > condition:
> >> > >
> >> > >   free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
> >> > >
> >> > > And the third line shows there were no free pages in any
> >> > > MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
> >> > > 'H'. Therefore __zone_watermark_unusable_free() overestimated free
> >> > > highatomic reserves. IOW, it underestimated the usable free memory by
> >> > > over 1GB, which resulted in the unnecessary OOM kill.
> >> >
> >> > Why doesn't unreserve_highatomic_pageblock deal with this situation?
> >>
> >> The current behavior of unreserve_highatomic_pageblock() seems WAI to
> >> me: it unreserves highatomic pageblocks that contain *free* pages so
>
> Hm I don't think it's completely WAI. The intention is that we should be
> able to unreserve the highatomic pageblocks before going OOM, and there
> seems to be an unintended corner case that if the pageblocks are fully
> exhausted, they are not reachable for unreserving.

I still think unreserving should only apply to highatomic PBs that
contain free pages. Otherwise, it seems to me that it'd be
self-defeating because:
1. Unreserving fully used highatomic PBs can't fulfill the alloc
demand immediately.
2. More importantly, it only takes one alloc failure in
__alloc_pages_direct_reclaim() to reset nr_reserved_highatomic to 2MB,
from as high as 1% of a zone (in this case 1GB). IOW, it makes more
sense to me that highatomic only unreserves what it doesn't fully use
each time unreserve_highatomic_pageblock() is called, not everything
it got (except the last PB).

Also, not being reachable from free_area[] isn't really a big problem. There
are ways to solve this without scanning the PB bitmap.

> The nr_highatomic is then
> also fully misleading as it prevents allocations due to a limit that does
> not reflect reality.

Right, and the comments warn about this.

> Your patch addresses the second issue, but there's a
> cost to it when calculating the watermarks, and it would be better to
> address the root issue instead.

Theoretically, yes. And I don't think it's actually measurable
considering the paths (alloc/reclaim) we are in -- all the data
structures this patch accesses should already have been cache-hot, due
to unreserve_highatomic_pageblock(), etc.

Also, we have not agreed on the root cause yet.

> >> that those pages can become usable to others. There is nothing to
> >> unreserve when they have no free pages.
>
> Yeah there are no actual free pages to unreserve, but unreserving would fix
> the nr_highatomic overestimate and thus allow allocations to proceed.

Yes, but honestly, I think this is going to cause a regression in
highatomic allocs.

> > I do not follow. How can you have reserved highatomic pages of that size
> > without having page blocks with free memory. In other words is this an
> > accounting problem or reserves problem? This is not really clear from
> > your description.
>
> I think it's the problem of finding the highatomic pageblocks for
> unreserving them once they become full. The proper fix is not exactly
> trivial though. Either we'll have to scan for highatomic pageblocks in the
> pageblock bitmap, or track them using an additional data structure.

Assuming we want to unreserve fully used highatomic PBs, we wouldn't
need to scan for them or track them. We'd only need to track the delta
between how many we want to unreserve (full or not) and how many we
are able to do so. The first page freed in a PB that's highatomic
would need to try to reduce the delta by changing the MT.
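
Something along these lines, purely as a sketch: "nr_highatomic_deferred"
would be a new zone field counting pageblocks we wanted to unreserve but
could not because they had no free pages, and the hook would sit somewhere
in the free path where the pageblock's migratetype is already known (exact
placement glossed over here):

/* Sketch only: called when freeing a page that belongs to a
 * MIGRATE_HIGHATOMIC pageblock. */
static void try_deferred_highatomic_unreserve(struct zone *zone,
					      struct page *page)
{
	if (!zone->nr_highatomic_deferred)
		return;

	zone->nr_highatomic_deferred--;
	zone->nr_reserved_highatomic -= min(pageblock_nr_pages,
					    zone->nr_reserved_highatomic);
	move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, MIGRATE_MOVABLE);
}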

To summarize, I think this is an estimation problem, which I would
categorize as a lesser problem than accounting problems. But it sounds
to me that you think it's a policy problem, i.e., the highatomic
unreserving policy is wrong or not properly implemented?
Re: [PATCH mm-unstable v1] mm/page_alloc: try not to overestimate free highatomic
Posted by Vlastimil Babka 1 month ago
On 10/23/24 08:36, Yu Zhao wrote:
> On Tue, Oct 22, 2024 at 4:53 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> +Cc Mel and Matt
>>
>> On 10/21/24 19:25, Michal Hocko wrote:
>> > On Mon 21-10-24 11:10:50, Yu Zhao wrote:
>> >> On Mon, Oct 21, 2024 at 2:13 AM Michal Hocko <mhocko@suse.com> wrote:
>> >> >
>> >> > On Sat 19-10-24 23:13:15, Yu Zhao wrote:
>> >> > > OOM kills due to vastly overestimated free highatomic reserves were
>> >> > > observed:
>> >> > >
>> >> > >   ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
>> >> > >   Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
>> >> > >   Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB
>> >> > >
>> >> > > The second line above shows that the OOM kill was due to the following
>> >> > > condition:
>> >> > >
>> >> > >   free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
>> >> > >
>> >> > > And the third line shows there were no free pages in any
>> >> > > MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
>> >> > > 'H'. Therefore __zone_watermark_unusable_free() overestimated free
>> >> > > highatomic reserves. IOW, it underestimated the usable free memory by
>> >> > > over 1GB, which resulted in the unnecessary OOM kill.
>> >> >
>> >> > Why doesn't unreserve_highatomic_pageblock deal with this situation?
>> >>
>> >> The current behavior of unreserve_highatomic_pageblock() seems WAI to
>> >> me: it unreserves highatomic pageblocks that contain *free* pages so
>>
>> Hm I don't think it's completely WAI. The intention is that we should be
>> able to unreserve the highatomic pageblocks before going OOM, and there
>> seems to be an unintended corner case that if the pageblocks are fully
>> exhausted, they are not reachable for unreserving.
> 
> I still think unreserving should only apply to highatomic PBs that
> contain free pages. Otherwise, it seems to me that it'd be
> self-defecting because:
> 1. Unreserving fully used hightatomic PBs can't fulfill the alloc
> demand immediately.

I thought the alloc demand is only blocked on the pessimistic watermark
calculation. Usable free pages exist, but the allocation is not allowed to
use them.

> 2. More importantly, it only takes one alloc failure in
> __alloc_pages_direct_reclaim() to reset nr_reserved_highatomic to 2MB,
> from as high as 1% of a zone (in this case 1GB). IOW, it makes more
> sense to me that highatomic only unreserves what it doesn't fully use
> each time unreserve_highatomic_pageblock() is called, not everything
> it got (except the last PB).

But if the highatomic pageblocks are already full, we are not really
removing any actual highatomic reserves just by changing the migratetype and
decreasing nr_reserved_highatomic? In fact that would allow the reserves
to grow with some actual free pages in the future.

> Also not reachable from free_area[] isn't really a big problem. There
> are ways to solve this without scanning the PB bitmap.

Sure, if we agree it's the way to go.

>> The nr_highatomic is then
>> also fully misleading as it prevents allocations due to a limit that does
>> not reflect reality.
> 
> Right, and the comments warn about this.

Yes, and it explains that it's to avoid the cost of searching free lists. Your fix
introduces that cost and that's not really great for a watermark check fast
path. I'd rather move the cost to highatomic unreserve which is not a fast path.

>> Your patch addresses the second issue, but there's a
>> cost to it when calculating the watermarks, and it would be better to
>> address the root issue instead.
> 
> Theoretically, yes. And I don't think it's actually measurable
> considering the paths (alloc/reclaim) we are in -- all the data
> structures this patch accesses should already have been cache-hot, due
> to unreserve_highatomic_pageblock(), etc.

__zone_watermark_unusable_free() will be executed from every allocation's
fast path, and not only after we recently did
unreserve_highatomic_pageblock(). AFAICS as soon as nr_reserved_highatomic
is over pageblock_nr_pages we'll unconditionally start counting precisely
and the design wanted to avoid this.

> Also, we have not agreed on the root cause yet.
> 
>> >> that those pages can become usable to others. There is nothing to
>> >> unreserve when they have no free pages.
>>
>> Yeah there are no actual free pages to unreserve, but unreserving would fix
>> the nr_highatomic overestimate and thus allow allocations to proceed.
> 
> Yes, but honestly, I think this is going to cause regression in
> highatomic allocs.

I think not, as having a more realistic counter of what's actually reserved
(and not already used up) can also allow reserving new pageblocks.

>> > I do not follow. How can you have reserved highatomic pages of that size
>> > without having page blocks with free memory. In other words is this an
>> > accounting problem or reserves problem? This is not really clear from
>> > your description.
>>
>> I think it's the problem of finding the highatomic pageblocks for
>> unreserving them once they become full. The proper fix is not exactly
>> trivial though. Either we'll have to scan for highatomic pageblocks in the
>> pageblock bitmap, or track them using an additional data structure.
> 
> Assuming we want to unreserve fully used hightatomic PBs, we wouldn't
> need to scan for them or track them. We'd only need to track the delta
> between how many we want to unreserve (full or not) and how many we
> are able to do so. The first page freed in a PB that's highatomic
> would need to try to reduce the delta by changing the MT.

Hm that assumes we're adding some checks in the free fastpath, and for that to
work also that there will be a freed page in a highatomic PB soon enough
after the decision that we need to unreserve something. Which is not so
much different from the current assumption we'll find such a free page
already in the free list immediately.

> To summarize, I think this is an estimation problem, which I would
> categorize as a lesser problem than accounting problems. But it sounds
> to me that you think it's a policy problem, i.e., the highatomic
> unreserving policy is wrong or not properly implemented?

Yeah I'd say not properly implemented, but that sounds like a mechanism, not
policy problem to me :)
Re: [PATCH mm-unstable v1] mm/page_alloc: try not to overestimate free highatomic
Posted by Yu Zhao 1 month ago
On Wed, Oct 23, 2024 at 1:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/23/24 08:36, Yu Zhao wrote:
> > On Tue, Oct 22, 2024 at 4:53 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> +Cc Mel and Matt
> >>
> >> On 10/21/24 19:25, Michal Hocko wrote:
> >> > On Mon 21-10-24 11:10:50, Yu Zhao wrote:
> >> >> On Mon, Oct 21, 2024 at 2:13 AM Michal Hocko <mhocko@suse.com> wrote:
> >> >> >
> >> >> > On Sat 19-10-24 23:13:15, Yu Zhao wrote:
> >> >> > > OOM kills due to vastly overestimated free highatomic reserves were
> >> >> > > observed:
> >> >> > >
> >> >> > >   ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
> >> >> > >   Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
> >> >> > >   Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB
> >> >> > >
> >> >> > > The second line above shows that the OOM kill was due to the following
> >> >> > > condition:
> >> >> > >
> >> >> > >   free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
> >> >> > >
> >> >> > > And the third line shows there were no free pages in any
> >> >> > > MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
> >> >> > > 'H'. Therefore __zone_watermark_unusable_free() overestimated free
> >> >> > > highatomic reserves. IOW, it underestimated the usable free memory by
> >> >> > > over 1GB, which resulted in the unnecessary OOM kill.
> >> >> >
> >> >> > Why doesn't unreserve_highatomic_pageblock deal with this situation?
> >> >>
> >> >> The current behavior of unreserve_highatomic_pageblock() seems WAI to
> >> >> me: it unreserves highatomic pageblocks that contain *free* pages so
> >>
> >> Hm I don't think it's completely WAI. The intention is that we should be
> >> able to unreserve the highatomic pageblocks before going OOM, and there
> >> seems to be an unintended corner case that if the pageblocks are fully
> >> exhausted, they are not reachable for unreserving.
> >
> > I still think unreserving should only apply to highatomic PBs that
> > contain free pages. Otherwise, it seems to me that it'd be
> > self-defecting because:
> > 1. Unreserving fully used hightatomic PBs can't fulfill the alloc
> > demand immediately.
>
> I thought the alloc demand is only blocked on the pessimistic watermark
> calculation. Usable free pages exist, but the allocation is not allowed to
> use them.

I think we are talking about two different problems here:
1. The estimation problem.
2. The unreserving policy problem.

What you said here is correct w.r.t. the first problem, and I was
talking about the second problem.

> > 2. More importantly, it only takes one alloc failure in
> > __alloc_pages_direct_reclaim() to reset nr_reserved_highatomic to 2MB,
> > from as high as 1% of a zone (in this case 1GB). IOW, it makes more
> > sense to me that highatomic only unreserves what it doesn't fully use
> > each time unreserve_highatomic_pageblock() is called, not everything
> > it got (except the last PB).
>
> But if the highatomic pageblocks are already full, we are not really
> removing any actual highatomic reserves just by changing the migratetype and
> decreasing nr_reserved_highatomic?

If we change the MT, they can be fragmented a lot sooner, i.e., as soon as
pages in them become free rather than only at the next near-OOM condition.
Trying to persist
over time is what actually makes those PBs more fragmentation
resistant.

> In fact that would allow the reserves
> grow with some actual free pages in the future.

Good point. I think I can explain it better along this line.

If highatomic is under the limit, both your proposal and the current
implementation would try to grow, making not much difference. However,
the current implementation can also reuse previously full PBs when
they become available. So there is a clear winner here: the current
implementation.

If highatomic has reached the limit, with your proposal, the growth
can only happen after unreserve, and unreserve only happens under
memory pressure. This means it's likely that it tries to grow under
memory pressure, which is more difficult than the condition where
there is plenty of memory. For the current implementation, it doesn't
try to grow; rather, it keeps what it already has, betting on those full
PBs becoming available for reuse. So I don't see a clear winner
between trying to grow under memory pressure and betting on becoming
available for reuse.

> > Also not reachable from free_area[] isn't really a big problem. There
> > are ways to solve this without scanning the PB bitmap.
>
> Sure, if we agree it's the way to go.
>
> >> The nr_highatomic is then
> >> also fully misleading as it prevents allocations due to a limit that does
> >> not reflect reality.
> >
> > Right, and the comments warn about this.
>
> Yes and explains it's to avoid the cost of searching free lists. Your fix
> introduces that cost and that's not really great for a watermark check fast
> path. I'd rather move the cost to highatomic unreserve which is not a fast path.
>
> >> Your patch addresses the second issue, but there's a
> >> cost to it when calculating the watermarks, and it would be better to
> >> address the root issue instead.
> >
> > Theoretically, yes. And I don't think it's actually measurable
> > considering the paths (alloc/reclaim) we are in -- all the data
> > structures this patch accesses should already have been cache-hot, due
> > to unreserve_highatomic_pageblock(), etc.
>
> __zone_watermark_unusable_free() will be executed from every allocation's
> fast path, and not only after we recently did
> unreserve_highatomic_pageblock(). AFAICS as soon as nr_reserved_highatomic
> is over pageblock_nr_pages we'll unconditionally start counting precisely
> and the design wanted to avoid this.
>
> > Also, we have not agreed on the root cause yet.
> >
> >> >> that those pages can become usable to others. There is nothing to
> >> >> unreserve when they have no free pages.
> >>
> >> Yeah there are no actual free pages to unreserve, but unreserving would fix
> >> the nr_highatomic overestimate and thus allow allocations to proceed.
> >
> > Yes, but honestly, I think this is going to cause regression in
> > highatomic allocs.
>
> I think not as having more realistic counter of what's actually reserved
> (and not already used up) can also allow reserving new pageblocks.
>
> >> > I do not follow. How can you have reserved highatomic pages of that size
> >> > without having page blocks with free memory. In other words is this an
> >> > accounting problem or reserves problem? This is not really clear from
> >> > your description.
> >>
> >> I think it's the problem of finding the highatomic pageblocks for
> >> unreserving them once they become full. The proper fix is not exactly
> >> trivial though. Either we'll have to scan for highatomic pageblocks in the
> >> pageblock bitmap, or track them using an additional data structure.
> >
> > Assuming we want to unreserve fully used hightatomic PBs, we wouldn't
> > need to scan for them or track them. We'd only need to track the delta
> > between how many we want to unreserve (full or not) and how many we
> > are able to do so. The first page freed in a PB that's highatomic
> > would need to try to reduce the delta by changing the MT.
>
> Hm that assumes we're adding some checks in free fastpath, and for that to
> work also that there will be a freed page in highatomic PC in near enough
> future from the decision we need to unreserve something. Which is not so
> much different from the current assumption we'll find such a free page
> already in the free list immediately.
>
> > To summarize, I think this is an estimation problem, which I would
> > categorize as a lesser problem than accounting problems. But it sounds
> > to me that you think it's a policy problem, i.e., the highatomic
> > unreserving policy is wrong or not properly implemented?
>
> Yeah I'd say not properly implemented, but that sounds like a mechanism, not
> policy problem to me :)

What about adding a new counter to keep track of the size of free
pages reserved for highatomic?

Mel?
Re: [PATCH mm-unstable v1] mm/page_alloc: try not to overestimate free highatomic
Posted by Vlastimil Babka 1 month ago
On 10/24/24 06:35, Yu Zhao wrote:
> On Wed, Oct 23, 2024 at 1:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 10/23/24 08:36, Yu Zhao wrote:
>> > On Tue, Oct 22, 2024 at 4:53 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> >>
>> >> +Cc Mel and Matt
>> >>
>> >> On 10/21/24 19:25, Michal Hocko wrote:
>> >>
>> >> Hm I don't think it's completely WAI. The intention is that we should be
>> >> able to unreserve the highatomic pageblocks before going OOM, and there
>> >> seems to be an unintended corner case that if the pageblocks are fully
>> >> exhausted, they are not reachable for unreserving.
>> >
>> > I still think unreserving should only apply to highatomic PBs that
>> > contain free pages. Otherwise, it seems to me that it'd be
>> > self-defecting because:
>> > 1. Unreserving fully used hightatomic PBs can't fulfill the alloc
>> > demand immediately.
>>
>> I thought the alloc demand is only blocked on the pessimistic watermark
>> calculation. Usable free pages exist, but the allocation is not allowed to
>> use them.
> 
> I think we are talking about two different problems here:
> 1. The estimation problem.
> 2. The unreserving policy problem.
> 
> What you said here is correct w.r.t. the first problem, and I was
> talking about the second problem.

OK but the problem with unreserving currently makes the problem of
estimation worse and unfixable.

>> > 2. More importantly, it only takes one alloc failure in
>> > __alloc_pages_direct_reclaim() to reset nr_reserved_highatomic to 2MB,
>> > from as high as 1% of a zone (in this case 1GB). IOW, it makes more
>> > sense to me that highatomic only unreserves what it doesn't fully use
>> > each time unreserve_highatomic_pageblock() is called, not everything
>> > it got (except the last PB).
>>
>> But if the highatomic pageblocks are already full, we are not really
>> removing any actual highatomic reserves just by changing the migratetype and
>> decreasing nr_reserved_highatomic?
> 
> If we change the MT, they can be fragmented a lot faster, i.e., from
> the next near OOM condition to upon becoming free. Trying to persist
> over time is what actually makes those PBs more fragmentation
> resistant.

If we assume the allocations there have similar sizes and lifetimes, then I
guess yeah.

>> In fact that would allow the reserves
>> grow with some actual free pages in the future.
> 
> Good point. I think I can explain it better along this line.
> 
> If highatomic is under the limit, both your proposal and the current
> implementation would try to grow, making not much difference. However,
> the current implementation can also reuse previously full PBs when
> they become available. So there is a clear winner here: the current
> implementation.

I'd say it depends on the user of the highatomic blocks (the workload),
which way ends up better.

> If highatomic has reached the limit, with your proposal, the growth
> can only happen after unreserve, and unreserve only happens under
> memory pressure. This means it's likely that it tries to grow under
> memory pressure, which is more difficult than the condition where
> there is plenty of memory. For the current implementation, it doesn't
> try to grow, rather, it keeps what it already has, betting those full
> PBs becoming available for reuse. So I don't see a clear winner
> between trying to grow under memory pressure and betting on becoming
> available for reuse.

Understood. But also note there are many conditions where the current
implementation and my proposal behave the same. If highatomic pageblocks
become full and then only one or a few pages from each are freed, it suddenly
becomes possible to unreserve them due to memory pressure, and there is no
reuse for those highatomic allocations anymore. This very different outcome
only depends on whether a single page is free for the unreserve to work, but
in terms of the pageblock reuse efficiency you describe above, a single page
makes only a minor difference. My proposal would at least remove the sudden change
of behavior when going from a single free page to no free page.

>> Hm that assumes we're adding some checks in free fastpath, and for that to
>> work also that there will be a freed page in highatomic PC in near enough
>> future from the decision we need to unreserve something. Which is not so
>> much different from the current assumption we'll find such a free page
>> already in the free list immediately.
>>
>> > To summarize, I think this is an estimation problem, which I would
>> > categorize as a lesser problem than accounting problems. But it sounds
>> > to me that you think it's a policy problem, i.e., the highatomic
>> > unreserving policy is wrong or not properly implemented?
>>
>> Yeah I'd say not properly implemented, but that sounds like a mechanism, not
>> policy problem to me :)
> 
> What about adding a new counter to keep track of the size of free
> pages reserved for highatomic?

That's doable but not so trivial and means starting to handle the highatomic
pageblocks much more carefully, like we do with CMA pageblocks and the
NR_FREE_CMA_PAGES counter (roughly sketched below); otherwise we risk the
counter drifting unrecoverably.
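
For comparison, the CMA case is handled centrally in account_freepages(),
roughly like this (simplified; every change in the number of free pages of a
given migratetype goes through it under the zone lock):

static inline void account_freepages(struct zone *zone, int nr_pages,
				     int migratetype)
{
	if (is_migrate_isolate(migratetype))
		return;

	__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);

	/* CMA keeps its own counter of free pages in CMA pageblocks */
	if (is_migrate_cma(migratetype))
		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
}

A highatomic counter would need the same discipline at every place where
free pages change migratetype, otherwise it drifts.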

> Mel?

Re: [PATCH mm-unstable v1] mm/page_alloc: try not to overestimate free highatomic
Posted by Yu Zhao 1 month ago
On Thu, Oct 24, 2024 at 2:16 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/24/24 06:35, Yu Zhao wrote:
> > On Wed, Oct 23, 2024 at 1:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 10/23/24 08:36, Yu Zhao wrote:
> >> > On Tue, Oct 22, 2024 at 4:53 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> >>
> >> >> +Cc Mel and Matt
> >> >>
> >> >> On 10/21/24 19:25, Michal Hocko wrote:
> >> >>
> >> >> Hm I don't think it's completely WAI. The intention is that we should be
> >> >> able to unreserve the highatomic pageblocks before going OOM, and there
> >> >> seems to be an unintended corner case that if the pageblocks are fully
> >> >> exhausted, they are not reachable for unreserving.
> >> >
> >> > I still think unreserving should only apply to highatomic PBs that
> >> > contain free pages. Otherwise, it seems to me that it'd be
> >> > self-defecting because:
> >> > 1. Unreserving fully used hightatomic PBs can't fulfill the alloc
> >> > demand immediately.
> >>
> >> I thought the alloc demand is only blocked on the pessimistic watermark
> >> calculation. Usable free pages exist, but the allocation is not allowed to
> >> use them.
> >
> > I think we are talking about two different problems here:
> > 1. The estimation problem.
> > 2. The unreserving policy problem.
> >
> > What you said here is correct w.r.t. the first problem, and I was
> > talking about the second problem.
>
> OK but the problem with unreserving currently makes the problem of
> estimation worse and unfixable.
>
> >> > 2. More importantly, it only takes one alloc failure in
> >> > __alloc_pages_direct_reclaim() to reset nr_reserved_highatomic to 2MB,
> >> > from as high as 1% of a zone (in this case 1GB). IOW, it makes more
> >> > sense to me that highatomic only unreserves what it doesn't fully use
> >> > each time unreserve_highatomic_pageblock() is called, not everything
> >> > it got (except the last PB).
> >>
> >> But if the highatomic pageblocks are already full, we are not really
> >> removing any actual highatomic reserves just by changing the migratetype and
> >> decreasing nr_reserved_highatomic?
> >
> > If we change the MT, they can be fragmented a lot faster, i.e., from
> > the next near OOM condition to upon becoming free. Trying to persist
> > over time is what actually makes those PBs more fragmentation
> > resistant.
>
> If we assume the allocations there have similar sizes and lifetimes, then I
> guess yeah.
>
> >> In fact that would allow the reserves
> >> grow with some actual free pages in the future.
> >
> > Good point. I think I can explain it better along this line.
> >
> > If highatomic is under the limit, both your proposal and the current
> > implementation would try to grow, making not much difference. However,
> > the current implementation can also reuse previously full PBs when
> > they become available. So there is a clear winner here: the current
> > implementation.
>
> I'd say it depends on the user of the highatomic blocks (the workload),
> which way ends up better.
>
> > If highatomic has reached the limit, with your proposal, the growth
> > can only happen after unreserve, and unreserve only happens under
> > memory pressure. This means it's likely that it tries to grow under
> > memory pressure, which is more difficult than the condition where
> > there is plenty of memory. For the current implementation, it doesn't
> > try to grow, rather, it keeps what it already has, betting those full
> > PBs becoming available for reuse. So I don't see a clear winner
> > between trying to grow under memory pressure and betting on becoming
> > available for reuse.
>
> Understood. But also note there are many conditions where the current
> implementation and my proposal behave the same. If highatomic pageblocks
> become full and then only one or few pages from each is freed, it suddenly
> becomes possible to unreserve them due to memory pressure, and there is no
> reuse for those highatomic allocations anymore. This very different outcome
> only depends on whether a single page is free for the unreserve to work, but
> from the efficiency of pageblock reusal you describe above a single page is
> only a minor difference. My proposal would at least remove the sudden change
> of behavior when going from a single free page to no free page.
>
> >> Hm that assumes we're adding some checks in free fastpath, and for that to
> >> work also that there will be a freed page in highatomic PC in near enough
> >> future from the decision we need to unreserve something. Which is not so
> >> much different from the current assumption we'll find such a free page
> >> already in the free list immediately.
> >>
> >> > To summarize, I think this is an estimation problem, which I would
> >> > categorize as a lesser problem than accounting problems. But it sounds
> >> > to me that you think it's a policy problem, i.e., the highatomic
> >> > unreserving policy is wrong or not properly implemented?
> >>
> >> Yeah I'd say not properly implemented, but that sounds like a mechanism, not
> >> policy problem to me :)
> >
> > What about adding a new counter to keep track of the size of free
> > pages reserved for highatomic?
>
> That's doable but not so trivial and means starting to handle the highatomic
> pageblocks much more carefully, like we do with CMA pageblocks and
> NR_FREE_CMA_PAGES counter, otherwise we risk drifting the counter unrecoverably.

The counter would be protected by the zone lock:

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 17506e4a2835..86c63d48c08e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -824,6 +824,7 @@ struct zone {
 	unsigned long watermark_boost;
 
 	unsigned long nr_reserved_highatomic;
+	unsigned long nr_free_highatomic;
 
 	/*
 	 * We don't know if the memory that we're going to allocate will be
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8afab64814dc..4d8031817c59 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -644,6 +644,17 @@ static inline void account_freepages(struct zone *zone, int nr_pages,
 		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
 }
 
+static void account_highatomic_freepages(struct zone *zone, unsigned int order, int old_mt, int new_mt)
+{
+	int nr_pages = 1 << order;
+
+	if (is_migrate_highatomic(old_mt))
+		zone->nr_free_highatomic -= nr_pages;
+
+	if (is_migrate_highatomic(new_mt))
+		zone->nr_free_highatomic += nr_pages;
+}
+
 /* Used for pages not on another list */
 static inline void __add_to_free_list(struct page *page, struct zone *zone,
 				      unsigned int order, int migratetype,
@@ -660,6 +671,8 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
 	else
 		list_add(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
+
+	account_highatomic_freepages(zone, order, -1, migratetype);
 }
 
 /*
@@ -681,6 +694,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 
 	account_freepages(zone, -(1 << order), old_mt);
 	account_freepages(zone, 1 << order, new_mt);
+
+	account_highatomic_freepages(zone, order, old_mt, new_mt);
 }
 
 static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
@@ -698,6 +713,8 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 	zone->free_area[order].nr_free--;
+
+	account_highatomic_freepages(zone, order, migratetype, -1);
 }
 
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
@@ -3085,7 +3102,7 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
 	 * over-estimate the size of the atomic reserve but it avoids a search.
 	 */
 	if (likely(!(alloc_flags & ALLOC_RESERVES)))
-		unusable_free += z->nr_reserved_highatomic;
+		unusable_free += z->nr_free_highatomic;
 
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
Re: [PATCH mm-unstable v1] mm/page_alloc: try not to overestimate free highatomic
Posted by Matt Fleming 1 month ago
On Wed, Oct 23, 2024 at 8:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> I thought the alloc demand is only blocked on the pessimistic watermark
> calculation. Usable free pages exist, but the allocation is not allowed to
> use them.

I'm confused -- I thought the problem was the inverse of your
statement: the allocation is attempted because
__zone_watermark_unusable_free() claims the highatomic pages are free
but they're not?
Re: [PATCH mm-unstable v1] mm/page_alloc: try not to overestimate free highatomic
Posted by Vlastimil Babka 1 month ago
On 10/23/24 11:25, Matt Fleming wrote:
> On Wed, Oct 23, 2024 at 8:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> I thought the alloc demand is only blocked on the pessimistic watermark
>> calculation. Usable free pages exist, but the allocation is not allowed to
>> use them.
> 
> I'm confused -- I thought the problem was the inverse of your
> statement: the allocation is attempted because
> __zone_watermark_unusable_free() claims the highatomic pages are free
> but they're not?

AFAICS the fix is about a GFP_HIGHUSER_MOVABLE allocation, so not eligible for
highatomic reserves. Thus the watermark check in
__zone_watermark_unusable_free() will add z->nr_reserved_highatomic as
unusable_free, which is then subtracted from actual NR_FREE_PAGES. But since
there are few or no actual free highatomic pages within the
NR_FREE_PAGES, we're subtracting more than we should and this makes the
watermark check very pessimistic and likely to fail. So the allocation is
denied even if it would find many non-highatomic pages to allocate, and
above the watermark.
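
Schematically (simplified from the __zone_watermark_ok() logic, not
verbatim; "min" is the zone's min watermark):

	/* ~1GB counted as unusable here, although almost none of it is free */
	long unusable_free = z->nr_reserved_highatomic;
	long usable_free = zone_page_state(z, NR_FREE_PAGES) - unusable_free;

	/* order-0 check fails despite plenty of free non-highatomic pages */
	if (usable_free <= min + z->lowmem_reserve[highest_zoneidx])
		return false;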

The problem you describe would apply to a highatomic allocation, which would
then try to reserve more, but maybe conclude we already have too many
reserved, and not reserve anything. But highatomic pageblocks that are
already full don't really contribute to that reserve anymore, so it would be
better to stop marking and counting them as highatomic, and instead allow
new ones to be reserved. So I think both kinds of allocations (highatomic or
not) are losing here due to full highatomic pageblocks.
Re: [PATCH mm-unstable v1] mm/page_alloc: try not to overestimate free highatomic
Posted by Yu Zhao 1 month ago
On Mon, Oct 21, 2024 at 11:26 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 21-10-24 11:10:50, Yu Zhao wrote:
> > On Mon, Oct 21, 2024 at 2:13 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Sat 19-10-24 23:13:15, Yu Zhao wrote:
> > > > OOM kills due to vastly overestimated free highatomic reserves were
> > > > observed:
> > > >
> > > >   ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
> > > >   Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
> > > >   Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB
> > > >
> > > > The second line above shows that the OOM kill was due to the following
> > > > condition:
> > > >
> > > >   free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
> > > >
> > > > And the third line shows there were no free pages in any
> > > > MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
> > > > 'H'. Therefore __zone_watermark_unusable_free() overestimated free
> > > > highatomic reserves. IOW, it underestimated the usable free memory by
> > > > over 1GB, which resulted in the unnecessary OOM kill.
> > >
> > > Why doesn't unreserve_highatomic_pageblock deal with this situation?
> >
> > The current behavior of unreserve_highatomic_pageblock() seems WAI to
> > me: it unreserves highatomic pageblocks that contain *free* pages so
> > that those pages can become usable to others. There is nothing to
> > unreserve when they have no free pages.
>
> I do not follow. How can you have reserved highatomic pages of that size
> without having page blocks with free memory.

Sorry I might still not get your question: are you saying it's not
possible for 524 pageblocks (reserved_highatomic=1073152kB) not to
have free pages? It might be uncommon but I don't think it's
impossible.

> In other words is this an
> accounting problem or reserves problem?

I don't follow here: why does it need to be one of the two?
reserved_highatomic can go up to 1% of the zone, and all of the reserves
can be used by highatomic allocs, leaving no free pages in the
MIGRATE_HIGHATOMIC pageblocks.
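
For context, the reservation side looks roughly like this (a simplified
sketch of the core of reserve_highatomic_pageblock(); locking, the
mergeability check and the order >= pageblock_order case are omitted, and
"mt" is the pageblock's current migratetype):

	unsigned long max_managed;
	int mt = get_pageblock_migratetype(page);

	/* Cap the reservation at roughly 1% of the zone's managed pages. */
	max_managed = ALIGN(zone_managed_pages(zone) / 100, pageblock_nr_pages);
	if (zone->nr_reserved_highatomic >= max_managed)
		return;

	/* Retype the whole pageblock; nothing requires it to keep free pages. */
	if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) != -1)
		zone->nr_reserved_highatomic += pageblock_nr_pages;

Once those pageblocks are filled by highatomic allocations, nothing brings
nr_reserved_highatomic back down until some page in them is freed.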