From: Baokun Li <libaokun1@huawei.com>
In __alloc_pages_slowpath(), a __GFP_NOFAIL allocation larger than
order-1 may trigger an unexpected WARN_ON. To avoid this, handle the
case separately in grow_dev_folio(). This ensures that buffer_head-based
filesystems will not encounter the warning when using __GFP_NOFAIL to
read metadata once BS > PS (block size larger than page size) support
is enabled.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/buffer.c | 33 +++++++++++++++++++++++++++++++--
1 file changed, 31 insertions(+), 2 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 6a8752f7bbed..2f5a7dd199b2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1031,6 +1031,35 @@ static sector_t folio_init_buffers(struct folio *folio,
return end_block;
}
+static struct folio *blkdev_get_folio(struct address_space *mapping,
+ pgoff_t index, fgf_t fgp_flags, gfp_t gfp)
+{
+ struct folio *folio;
+ unsigned int min_order = mapping_min_folio_order(mapping);
+
+ /*
+ * Allocating page units greater than order-1 with __GFP_NOFAIL in
+ * __alloc_pages_slowpath() can trigger an unexpected WARN_ON.
+ * Handle this case separately to suppress the warning.
+ */
+ if (min_order <= 1)
+ return __filemap_get_folio(mapping, index, fgp_flags, gfp);
+
+ while (1) {
+ folio = __filemap_get_folio(mapping, index, fgp_flags,
+ gfp & ~__GFP_NOFAIL);
+ if (!IS_ERR(folio) || !(gfp & __GFP_NOFAIL))
+ return folio;
+
+ if (PTR_ERR(folio) != -ENOMEM && PTR_ERR(folio) != -EAGAIN)
+ return folio;
+
+ memalloc_retry_wait(gfp);
+ }
+
+ return folio;
+}
+
/*
* Create the page-cache folio that contains the requested block.
*
@@ -1047,8 +1076,8 @@ static bool grow_dev_folio(struct block_device *bdev, sector_t block,
struct buffer_head *bh;
sector_t end_block = 0;
- folio = __filemap_get_folio(mapping, index,
- FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp);
+ folio = blkdev_get_folio(mapping, index,
+ FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp);
if (IS_ERR(folio))
return false;
--
2.46.1
On Sat, Oct 25, 2025 at 11:22:18AM +0800, libaokun@huaweicloud.com wrote:
> + while (1) {
> + folio = __filemap_get_folio(mapping, index, fgp_flags,
> + gfp & ~__GFP_NOFAIL);
> + if (!IS_ERR(folio) || !(gfp & __GFP_NOFAIL))
> + return folio;
> +
> + if (PTR_ERR(folio) != -ENOMEM && PTR_ERR(folio) != -EAGAIN)
> + return folio;
> +
> + memalloc_retry_wait(gfp);
> + }
No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
The right way forward is for ext4 to use iomap, not for buffer heads
to support large block sizes.
On 2025-10-25 12:45, Matthew Wilcox wrote:
> On Sat, Oct 25, 2025 at 11:22:18AM +0800, libaokun@huaweicloud.com wrote:
>> + while (1) {
>> + folio = __filemap_get_folio(mapping, index, fgp_flags,
>> + gfp & ~__GFP_NOFAIL);
>> + if (!IS_ERR(folio) || !(gfp & __GFP_NOFAIL))
>> + return folio;
>> +
>> + if (PTR_ERR(folio) != -ENOMEM && PTR_ERR(folio) != -EAGAIN)
>> + return folio;
>> +
>> + memalloc_retry_wait(gfp);
>> + }
> No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
> The right way forward is for ext4 to use iomap, not for buffer heads
> to support large block sizes.
ext4 only calls getblk_unmovable or __getblk when reading critical
metadata. Both of these functions set __GFP_NOFAIL to ensure that
metadata reads do not fail due to memory pressure.
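For reference, recent kernels define these helpers roughly as follows
(a paraphrased sketch of include/linux/buffer_head.h; the exact gfp
plumbing may differ between kernel versions):

static inline struct buffer_head *__getblk(struct block_device *bdev,
		sector_t block, unsigned size)
{
	/* Constrain to the bdev mapping's gfp mask, minus __GFP_FS. */
	gfp_t gfp = mapping_gfp_constraint(bdev->bd_mapping, ~__GFP_FS);

	gfp |= __GFP_MOVABLE | __GFP_NOFAIL;
	return bdev_getblk(bdev, block, size, gfp);
}

static inline struct buffer_head *getblk_unmovable(struct block_device *bdev,
		sector_t block, unsigned size)
{
	gfp_t gfp = mapping_gfp_constraint(bdev->bd_mapping, ~__GFP_FS);

	gfp |= __GFP_NOFAIL;	/* unmovable: no __GFP_MOVABLE here */
	return bdev_getblk(bdev, block, size, gfp);
}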
Both functions eventually call grow_dev_folio(), which is why we
handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
has similar logic, but XFS manages its own metadata, allowing it
to use vmalloc for memory allocation.
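The shape of that fallback is roughly the following (a sketch of the
pattern only, not the actual xfs_buf_alloc_backing_mem() code; the
helper name is made up):

/*
 * Try a physically contiguous allocation without insisting on it,
 * then fall back to vmalloc, which only needs order-0 pages and can
 * therefore honour NOFAIL semantics without high-order allocations.
 */
static void *alloc_backing_mem(size_t size, gfp_t gfp)
{
	void *mem;

	/* Opportunistic contiguous attempt; allowed to fail quietly. */
	mem = kmalloc(size, (gfp & ~__GFP_NOFAIL) | __GFP_NOWARN);
	if (mem)
		return mem;

	/* Virtually contiguous fallback built from order-0 pages. */
	return __vmalloc(size, gfp);
}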
ext4 Direct I/O has already switched to iomap, and patches to support
iomap for Buffered I/O are currently being iterated on.
But as far as I know, iomap does not support metadata, and XFS does not
use iomap to read metadata either.
Am I missing something here?
--
With Best Regards,
Baokun Li
On Sat, Oct 25, 2025 at 02:32:45PM +0800, Baokun Li wrote:
> On 2025-10-25 12:45, Matthew Wilcox wrote:
> > No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
> > The right way forward is for ext4 to use iomap, not for buffer heads
> > to support large block sizes.
>
> ext4 only calls getblk_unmovable or __getblk when reading critical
> metadata. Both of these functions set __GFP_NOFAIL to ensure that
> metadata reads do not fail due to memory pressure.
>
> Both functions eventually call grow_dev_folio(), which is why we
> handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
> has similar logic, but XFS manages its own metadata, allowing it
> to use vmalloc for memory allocation.

In today's ext4 call, we discussed various options:

1. Change folios to be potentially fragmented. This change would be
ridiculously large and nobody thinks this is a good idea. Included here
for completeness.

2. Separate the buffer cache from the page cache again. They were
unified about 25 years ago, and this also feels like a very big job.

3. Duplicate the buffer cache into ext4/jbd2, remove the functionality
not needed and make _this_ version of the buffer cache allocate
its own memory instead of aliasing into the page cache. More feasible
than 1 or 2; still quite a big job.

4. Pick up Catherine's work and make ext4/jbd2 use it. Seems to be
about an equivalent amount of work to option 3.

5. Make __GFP_NOFAIL work for allocations up to 64KiB (we decided this was
probably the practical limit of sector sizes that people actually want).
In terms of programming, it's a one-line change. But we need to sell
this change to the MM people. I think it's doable because if we have
a filesystem with 64KiB sectors, there will be many clean folios in the
pagecache which are 64KiB or larger.

So, we liked option 5 best.
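For concreteness, the one-line change would be something like this
(sketch only; it assumes the warning still lives in the __GFP_NOFAIL
branch of __alloc_pages_slowpath(), and the get_order(SZ_64K) spelling
is illustrative):

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
-		WARN_ON_ONCE_GFP(order > 1, gfp_mask);
+		WARN_ON_ONCE_GFP(order > get_order(SZ_64K), gfp_mask);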
On 2025-10-31 05:25, Matthew Wilcox wrote:
> On Sat, Oct 25, 2025 at 02:32:45PM +0800, Baokun Li wrote:
>> On 2025-10-25 12:45, Matthew Wilcox wrote:
>>> No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
>>> The right way forward is for ext4 to use iomap, not for buffer heads
>>> to support large block sizes.
>> ext4 only calls getblk_unmovable or __getblk when reading critical
>> metadata. Both of these functions set __GFP_NOFAIL to ensure that
>> metadata reads do not fail due to memory pressure.
>>
>> Both functions eventually call grow_dev_folio(), which is why we
>> handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
>> has similar logic, but XFS manages its own metadata, allowing it
>> to use vmalloc for memory allocation.
> In today's ext4 call, we discussed various options:
>
> 1. Change folios to be potentially fragmented. This change would be
> ridiculously large and nobody thinks this is a good idea. Included here
> for completeness.
>
> 2. Separate the buffer cache from the page cache again. They were
> unified about 25 years ago, and this also feels like a very big job.
>
> 3. Duplicate the buffer cache into ext4/jbd2, remove the functionality
> not needed and make _this_ version of the buffer cache allocate
> its own memory instead of aliasing into the page cache. More feasible
> than 1 or 2; still quite a big job.
>
> 4. Pick up Catherine's work and make ext4/jbd2 use it. Seems to be
> about an equivalent amount of work to option 3.
>
> 5. Make __GFP_NOFAIL work for allocations up to 64KiB (we decided this was
> probably the practical limit of sector sizes that people actually want).
> In terms of programming, it's a one-line change. But we need to sell
> this change to the MM people. I think it's doable because if we have
> a filesystem with 64KiB sectors, there will be many clean folios in the
> pagecache which are 64KiB or larger.
>
> So, we liked option 5 best.

Thank you for your suggestions!

Yes, options 1 and 2 don’t seem very feasible, and options 3 and 4 would
involve a significant amount of work.

Option 5 is indeed the simplest and most general solution at this point,
and it makes a lot of sense. I will send a separate RFC patch to the MM
list to gather feedback from the MM people.

If this approach is accepted, we can drop patches 22 and 23 from the
current series.

Cheers,
Baokun
Hi!

On 10/31/2025 5:25 AM, Matthew Wilcox wrote:
> On Sat, Oct 25, 2025 at 02:32:45PM +0800, Baokun Li wrote:
>> On 2025-10-25 12:45, Matthew Wilcox wrote:
>>> No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
>>> The right way forward is for ext4 to use iomap, not for buffer heads
>>> to support large block sizes.
>>
>> ext4 only calls getblk_unmovable or __getblk when reading critical
>> metadata. Both of these functions set __GFP_NOFAIL to ensure that
>> metadata reads do not fail due to memory pressure.
>>
>> Both functions eventually call grow_dev_folio(), which is why we
>> handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
>> has similar logic, but XFS manages its own metadata, allowing it
>> to use vmalloc for memory allocation.
>
> In today's ext4 call, we discussed various options:
>
> 1. Change folios to be potentially fragmented. This change would be
> ridiculously large and nobody thinks this is a good idea. Included here
> for completeness.
>
> 2. Separate the buffer cache from the page cache again. They were
> unified about 25 years ago, and this also feels like a very big job.
>
> 3. Duplicate the buffer cache into ext4/jbd2, remove the functionality
> not needed and make _this_ version of the buffer cache allocate
> its own memory instead of aliasing into the page cache. More feasible
> than 1 or 2; still quite a big job.
>
> 4. Pick up Catherine's work and make ext4/jbd2 use it. Seems to be
> about an equivalent amount of work to option 3.
>

Regarding these two proposals, would you consider them for the long term?
Besides the currently discussed case, they offer additional benefits, such
as making ext4's metadata management more flexible and secure, as well as
enabling more robust error handling.

Thanks,
Yi.

> 5. Make __GFP_NOFAIL work for allocations up to 64KiB (we decided this was
> probably the practical limit of sector sizes that people actually want).
> In terms of programming, it's a one-line change. But we need to sell
> this change to the MM people. I think it's doable because if we have
> a filesystem with 64KiB sectors, there will be many clean folios in the
> pagecache which are 64KiB or larger.
>
> So, we liked option 5 best.
>
On Sat, Oct 25, 2025 at 02:32:45PM +0800, Baokun Li wrote:
> On 2025-10-25 12:45, Matthew Wilcox wrote:
> > On Sat, Oct 25, 2025 at 11:22:18AM +0800, libaokun@huaweicloud.com wrote:
> >> + while (1) {
> >> + folio = __filemap_get_folio(mapping, index, fgp_flags,
> >> + gfp & ~__GFP_NOFAIL);
> >> + if (!IS_ERR(folio) || !(gfp & __GFP_NOFAIL))
> >> + return folio;
> >> +
> >> + if (PTR_ERR(folio) != -ENOMEM && PTR_ERR(folio) != -EAGAIN)
> >> + return folio;
> >> +
> >> + memalloc_retry_wait(gfp);
> >> + }
> > No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
> > The right way forward is for ext4 to use iomap, not for buffer heads
> > to support large block sizes.
>
> ext4 only calls getblk_unmovable or __getblk when reading critical
> metadata. Both of these functions set __GFP_NOFAIL to ensure that
> metadata reads do not fail due to memory pressure.
If filesystems actually require __GFP_NOFAIL for high-order allocations,
then this is a new requirement that needs to be communicated to the MM
developers, not hacked around in filesystems (or the VFS). And that
communication needs to be a separate thread with a clear subject line
to attract the right attention, not buried in patch 26/28.
For what it's worth, I think you have a good case. This really is
a new requirement (bs>PS) and in this scenario, we should be able to
reclaim page cache memory of the appropriate order to satisfy the NOFAIL
requirement. There will be concerns that other users will now be able to
use it without warning, but I think eventually this use case will prevail.
> Both functions eventually call grow_dev_folio(), which is why we
> handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
> has similar logic, but XFS manages its own metadata, allowing it
> to use vmalloc for memory allocation.
The other possibility is that we switch ext4 away from the buffer cache
entirely. This is a big job! I know Catherine has been working on
a generic replacement for the buffer cache, but I'm not sure if it's
ready yet.
On Sat, Oct 25, 2025 at 06:56:57PM +0100, Matthew Wilcox wrote:
> If filesystems actually require __GFP_NOFAIL for high-order allocations,
> then this is a new requirement that needs to be communicated to the MM
> developers, not hacked around in filesystems (or the VFS). And that
> communication needs to be a separate thread with a clear subject line
> to attract the right attention, not buried in patch 26/28.

It's not really new. XFS has had this basically since day 1, but with
Linus having a religious aversion to __GFP_NOFAIL, most folks have given
up on trying to improve it, as it just ends up in shouting matches on
political grounds. XFS just ends up with its own fallback in
xfs_buf_alloc_backing_mem, which has survived the various rounds of
refactoring since XFS was merged.

Given the weird behavior in some of the memory allocators, where
GFP_NOFAIL is simply ignored for too-large allocations, that seems like
by far the sanest option in the current Linux environment, unfortunately.
On 2025-10-26 01:56, Matthew Wilcox wrote:
> On Sat, Oct 25, 2025 at 02:32:45PM +0800, Baokun Li wrote:
>> On 2025-10-25 12:45, Matthew Wilcox wrote:
>>> On Sat, Oct 25, 2025 at 11:22:18AM +0800, libaokun@huaweicloud.com wrote:
>>>> + while (1) {
>>>> + folio = __filemap_get_folio(mapping, index, fgp_flags,
>>>> + gfp & ~__GFP_NOFAIL);
>>>> + if (!IS_ERR(folio) || !(gfp & __GFP_NOFAIL))
>>>> + return folio;
>>>> +
>>>> + if (PTR_ERR(folio) != -ENOMEM && PTR_ERR(folio) != -EAGAIN)
>>>> + return folio;
>>>> +
>>>> + memalloc_retry_wait(gfp);
>>>> + }
>>> No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
>>> The right way forward is for ext4 to use iomap, not for buffer heads
>>> to support large block sizes.
>> ext4 only calls getblk_unmovable or __getblk when reading critical
>> metadata. Both of these functions set __GFP_NOFAIL to ensure that
>> metadata reads do not fail due to memory pressure.
> If filesystems actually require __GFP_NOFAIL for high-order allocations,
> then this is a new requirement that needs to be communicated to the MM
> developers, not hacked around in filesystems (or the VFS). And that
> communication needs to be a separate thread with a clear subject line
> to attract the right attention, not buried in patch 26/28.
EXT4 is not the first filesystem to support LBS. I believe other
filesystems that already support LBS, even if they manage their own
metadata, have similar requirements. A filesystem cannot afford to become
read-only, shut down, or enter an inconsistent state due to memory
allocation failures in critical paths. Large folios have been around for
some time, and the fact that this warning still exists shows that the
problem is not trivial to solve.
Therefore, following the approach of filesystems that already support LBS,
such as XFS and the soon-to-be-removed bcachefs, I avoid adding
__GFP_NOFAIL for large allocations and instead retry internally to prevent
failures.
I do not intend to hide this issue in Patch 22/25. I cc’d linux-mm@kvack.org
precisely to invite memory management experts to share their thoughts on
the current situation.
Here is my limited understanding of the history of __GFP_NOFAIL:
Originally, in commit 4923abf9f1a4 ("Don't warn about order-1 allocations
with __GFP_NOFAIL"), Linus Torvalds raised the warning order from 0 to 1,
and commented,
"Maybe we should remove this warning entirely."
We had considered removing this warning, but then saw the discussion below.
Previously we used WARN_ON_ONCE_GFP, which meant the warning could be
suppressed with __GFP_NOWARN. But with the introduction of large folios,
memory allocation and reclaim have become much more challenging.
__GFP_NOFAIL can still fail, and many callers do not check the return
value, leading to potential NULL pointer dereferences.
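A contrived illustration of that hazard (hypothetical code, not taken
from any real caller):

	/* Hypothetical illustration: if the allocator rejects the
	 * request rather than looping (as happens today for large
	 * enough orders), kmalloc() returns NULL and the unchecked
	 * use below is a NULL pointer dereference.
	 */
	void *buf = kmalloc(size, GFP_KERNEL | __GFP_NOFAIL);

	memset(buf, 0, size);	/* crashes if the allocation was refused */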
Linus also noted that __GFP_NOFAIL is heavily abused, and even said in [1]:
“Honestly, I'm perfectly fine with just removing that stupid useless flag
entirely.”
"Because the blame should go *there*, and it should not even remotely look
like "oh, the MM code failed". No. The caller was garbage."
[1]:
https://lore.kernel.org/linux-mm/CAHk-=wgv2-=Bm16Gtn5XHWj9J6xiqriV56yamU+iG07YrN28SQ@mail.gmail.com/
From this, my understanding is that handling or retrying large allocation
failures in the caller is the direction going forward.
As for why retries are done in the VFS, there are two reasons: first, both
ext4 and jbd2 read metadata through blkdev, so a unified change is simpler.
Second, retrying here allows other buffer-head-based filesystems to support
LBS more easily.
For now, until large memory allocation and reclaim are properly handled,
this approach serves as a practical workaround.
> For what it's worth, I think you have a good case. This really is
> a new requirement (bs>PS) and in this scenario, we should be able to
> reclaim page cache memory of the appropriate order to satisfy the NOFAIL
> requirement. There will be concerns that other users will now be able to
> use it without warning, but I think eventually this use case will prevail.
Yeah, it would be best if the memory subsystem could add a flag like
__GFP_LBS to suppress these warnings and guide allocation and reclaim to
perform optimizations suited for this scenario.
>> Both functions eventually call grow_dev_folio(), which is why we
>> handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
>> has similar logic, but XFS manages its own metadata, allowing it
>> to use vmalloc for memory allocation.
> The other possibility is that we switch ext4 away from the buffer cache
> entirely. This is a big job! I know Catherine has been working on
> a generic replacement for the buffer cache, but I'm not sure if it's
> ready yet.
>
The key issue is not whether ext4 uses buffer heads; even using vmalloc
with __GFP_NOFAIL for large allocations faces the same problem.
As Linus also mentioned in the link[1] above:
"It has then expanded and is now a problem. The cases using GFP_NOFAIL
for things like vmalloc() - which is by definition not a small
allocation - should be just removed as outright bugs."
Thanks,
Baokun
On 10/25/2025 2:32 PM, Baokun Li wrote:
> On 2025-10-25 12:45, Matthew Wilcox wrote:
>> On Sat, Oct 25, 2025 at 11:22:18AM +0800, libaokun@huaweicloud.com wrote:
>>> + while (1) {
>>> + folio = __filemap_get_folio(mapping, index, fgp_flags,
>>> + gfp & ~__GFP_NOFAIL);
>>> + if (!IS_ERR(folio) || !(gfp & __GFP_NOFAIL))
>>> + return folio;
>>> +
>>> + if (PTR_ERR(folio) != -ENOMEM && PTR_ERR(folio) != -EAGAIN)
>>> + return folio;
>>> +
>>> + memalloc_retry_wait(gfp);
>>> + }
>> No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
>> The right way forward is for ext4 to use iomap, not for buffer heads
>> to support large block sizes.
>
> ext4 only calls getblk_unmovable or __getblk when reading critical
> metadata. Both of these functions set __GFP_NOFAIL to ensure that
> metadata reads do not fail due to memory pressure.
>
> Both functions eventually call grow_dev_folio(), which is why we
> handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
> has similar logic, but XFS manages its own metadata, allowing it
> to use vmalloc for memory allocation.
>
> ext4 Direct I/O has already switched to iomap, and patches to
> support iomap for Buffered I/O are currently under iteration.
>
> But as far as I know, iomap does not support metadata, and XFS does not
> use iomap to read metadata either.
>
> Am I missing something here?
>
AFAIK, no. The only alternative would be for ext4 to manage its metadata
on its own, like XFS does, instead of using the bdev buffer head
interface. However, this is currently difficult to achieve.
Best Regards,
Yi.
On Sat, Oct 25, 2025 at 05:45:29AM +0100, Matthew Wilcox wrote:
> On Sat, Oct 25, 2025 at 11:22:18AM +0800, libaokun@huaweicloud.com wrote:
> > + while (1) {
> > + folio = __filemap_get_folio(mapping, index, fgp_flags,
> > + gfp & ~__GFP_NOFAIL);
> > + if (!IS_ERR(folio) || !(gfp & __GFP_NOFAIL))
> > + return folio;
> > +
> > + if (PTR_ERR(folio) != -ENOMEM && PTR_ERR(folio) != -EAGAIN)
> > + return folio;
> > +
> > + memalloc_retry_wait(gfp);
> > + }
>
> No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
> The right way forward is for ext4 to use iomap, not for buffer heads
> to support large block sizes.
Seconded.
--D