[RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
Posted by Kiryl Shutsemau 3 months, 2 weeks ago
From: Kiryl Shutsemau <kas@kernel.org>

I do NOT want the patches in this patchset to be applied. Instead, I
would like to discuss the semantics of large folios versus SIGBUS.

## Background

Accessing memory within a VMA, but beyond i_size rounded up to the next
page size, is supposed to generate SIGBUS.

This definition is simple if all pages are PAGE_SIZE in size, but with
large folios in the picture, it is no longer the case.
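
For illustration, here is a minimal userspace check of the expected
semantics (file name and sizes are arbitrary): a byte past i_size but
within the page containing EOF reads back as zero, while touching the
next whole page is supposed to deliver SIGBUS.

#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf env;

static void bus_handler(int sig)
{
	(void)sig;
	siglongjmp(env, 1);
}

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	int fd = open("sigbus-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0 || ftruncate(fd, 1))		/* i_size = 1 byte */
		return 1;

	char *p = mmap(NULL, 2 * psz, PROT_READ, MAP_SHARED, fd, 0);

	if (p == MAP_FAILED)
		return 1;

	signal(SIGBUS, bus_handler);

	/* Beyond i_size but within the page containing EOF: reads zero. */
	printf("p[10] = %d\n", p[10]);

	/* A whole page beyond EOF: supposed to deliver SIGBUS. */
	if (sigsetjmp(env, 1) == 0)
		printf("p[psz] = %d (no SIGBUS)\n", p[psz]);
	else
		printf("got SIGBUS, as expected\n");

	munmap(p, 2 * psz);
	close(fd);
	unlink("sigbus-demo");
	return 0;
}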

## Problem

Darrick reported[1] an xfstests regression in v6.18-rc1. generic/749
failed due to missing SIGBUS. This was caused by my recent changes that
try to fault in the whole folio where possible:

	19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
	357b92761d94 ("mm/filemap: map entire large folio faultaround")

These changes did not consider i_size when setting up PTEs, leading to
xfstest breakage.
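
To sketch the direction of a fix (not the actual patches; the helper
below is made up for illustration), the fault path needs to clamp the
number of pages of a folio it maps against i_size:

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define PAGE_SHIFT	12

/*
 * How many pages of a folio that starts at file offset 'pos' and spans
 * 'nr' pages can be mapped without creating PTEs for whole pages past
 * i_size? Illustrative only, not kernel code.
 */
static unsigned long mappable_pages(unsigned long pos, unsigned long nr,
				    unsigned long i_size)
{
	unsigned long last;

	if (i_size <= pos)
		return 0;	/* folio lies entirely beyond EOF */

	/* Pages up to and including the one that contains EOF. */
	last = (i_size - pos + PAGE_SIZE - 1) >> PAGE_SHIFT;

	return last < nr ? last : nr;
}

int main(void)
{
	/* 2M folio at offset 0, i_size of 5000 bytes: map 2 of 512 pages. */
	printf("%lu\n", mappable_pages(0, 512, 5000));
	/* i_size ends exactly at the folio boundary: map all 512 pages. */
	printf("%lu\n", mappable_pages(0, 512, 512 * PAGE_SIZE));
	return 0;
}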

However, the problem has been present in the kernel for a long time -
since huge tmpfs was introduced in 2016. The kernel happily maps
PMD-sized folios with a PMD without checking i_size, and huge=always
tmpfs allocates PMD-sized folios on any write.

I considered this corner case when I implemented huge tmpfs, and my
conclusion was that no one in their right mind should rely on receiving
a SIGBUS signal when accessing beyond i_size. I cannot imagine how it
could be useful to a workload.

generic/749 was introduced last year[2] with reference to POSIX, but no
real workloads were mentioned. It also acknowledged that tmpfs deviates
from the tested behaviour.

POSIX indeed says[3]:

	References within the address range starting at pa and
	continuing for len bytes to whole pages following the end of an
	object shall result in delivery of a SIGBUS signal.

Do we care about adhering strictly to this in the absence of real
workloads that rely on these semantics?

I think it is valuable to allow the kernel to map memory in larger
chunks -- whole folios -- to get TLB benefits (from both huge pages and
TLB coalescing). I value TLB hit rate over POSIX wording.

Any opinions?

See also discussion in the thread[1] with the report.

[1] https://lore.kernel.org/all/20251014175214.GW6188@frogsfrogsfrogs
[2] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/tests/generic/749?h=for-next&id=e4a6b119e5229599eac96235fb7e683b8a8bdc53
[3] https://pubs.opengroup.org/onlinepubs/9799919799/

Kiryl Shutsemau (2):
  mm/memory: Do not populate page table entries beyond i_size.
  mm/truncate: Unmap large folio on split failure

 mm/filemap.c  | 18 ++++++++++--------
 mm/memory.c   | 12 ++++++++++--
 mm/truncate.c | 29 ++++++++++++++++++++++++++---
 3 files changed, 46 insertions(+), 13 deletions(-)

-- 
2.50.1
Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
Posted by Dave Chinner 3 months, 2 weeks ago
On Mon, Oct 20, 2025 at 05:30:52PM +0100, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
> 
> I do NOT want the patches in this patchset to be applied. Instead, I
> would like to discuss the semantics of large folios versus SIGBUS.
> 
> ## Background
> 
> Accessing memory within a VMA, but beyond i_size rounded up to the next
> page size, is supposed to generate SIGBUS.
> 
> This definition is simple if all pages are PAGE_SIZE in size, but with
> large folios in the picture, it is no longer the case.
> 
> ## Problem
> 
> Darrick reported[1] an xfstests regression in v6.18-rc1. generic/749
> failed due to missing SIGBUS. This was caused by my recent changes that
> try to fault in the whole folio where possible:
> 
> 	19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
> 	357b92761d94 ("mm/filemap: map entire large folio faultaround")
> 
> These changes did not consider i_size when setting up PTEs, leading to
> xfstest breakage.
> 
> However, the problem has been present in the kernel for a long time -
> since huge tmpfs was introduced in 2016. The kernel happily maps
> PMD-sized folios with a PMD without checking i_size, and huge=always
> tmpfs allocates PMD-sized folios on any write.

The tmpfs huge=always behaviour is not how regular filesystems have
behaved. It is niche, special-case functionality with weird behaviours
and, as such, it most definitely does not set the standard for how all
other filesystems should behave.

> I considered this corner case when I implemented huge tmpfs, and my
> conclusion was that no one in their right mind should rely on receiving
> a SIGBUS signal when accessing beyond i_size. I cannot imagine how it
> could be useful to a workload.

Lacking the imagination or knowledge to understand why a behaviour
exists does not mean that behaviour is unnecessary or that it should
be removed. It just means you didn't ask the people who knew why the
functionality exists...

> generic/749 was introduced last year[2] with reference to POSIX, but no
> real workloads were mentioned. It also acknowledged that tmpfs deviates
> from the tested behaviour.
> 
> POSIX indeed says[3]:
> 
> 	References within the address range starting at pa and
> 	continuing for len bytes to whole pages following the end of an
> 	object shall result in delivery of a SIGBUS signal.
> 
> Do we care about adhering strictly to this in the absence of real
> workloads that rely on these semantics?

We've already told you that we do, because mapping beyond EOF has
implications for how much stale data exposure occurs when the next set
of truncate/mmap() bugs is introduced.

> I think it is valuable to allow the kernel to map memory in larger
> chunks -- whole folios -- to get TLB benefits (from both huge pages and
> TLB coalescing). I value TLB hit rate over POSIX wording.

Feel free to do that for tmpfs, but for persistent filesystems the
existing POSIX SIGBUS behaviour needs to remain.

> Any opinions?

There are solid historic reasons for the existing behaviour and for
keeping it unchanged.  You aren't allowed to handwave them away
because you don't understand or care about them.

In critical paths like truncate, correctness and safety come first.
Performance is only a secondary consideration.  The overlap of
mmap() and truncate() is an area where we have had many, many bugs
and, at minimum, the current POSIX behaviour largely shields us from
serious stale data exposure events when those bugs (inevitably)
occur.

Fundamentally, we really don't care about the mapping/tlb
performance of the PTE fragments at EOF. Anyone using files large
enough to notice the TLB overhead improvements from mapping large
folios is not going to notice that the EOF mapping has a slightly
higher TLB miss overhead than everywhere else in the file.

Please just fix the regression.

-Dave.
-- 
Dave Chinner
david@fromorbit.com
Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
Posted by Kiryl Shutsemau 3 months, 2 weeks ago
On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
> In critical paths like truncate, correctness and safety come first.
> Performance is only a secondary consideration.  The overlap of
> mmap() and truncate() is an area where we have had many, many bugs
> and, at minimum, the current POSIX behaviour largely shields us from
> serious stale data exposure events when those bugs (inevitably)
> occur.

How do you prevent writes via GUP racing with truncate()?

Something like this:

	CPU0				CPU1
fd = open("file")
p = mmap(fd)
whatever_syscall(p)
  get_user_pages(p, &page)
  				truncate("file");
  <write to page>
  put_page(page);

The GUP can pin a page in the middle of a large folio well beyond the
truncation point. The folio will not be split on truncation due to the
elevated pin.
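
To make "whatever_syscall" concrete, here is a rough userspace sketch of
the pattern: an O_DIRECT read into a MAP_SHARED file mapping makes the
kernel pin the file-backed pages for the duration of the I/O while the
other thread truncates the file. File names are arbitrary, error
handling is omitted, and it assumes a filesystem that supports O_DIRECT;
it shows the shape of the race, not a reliable reproducer.

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN	(2UL << 20)	/* 2M: room for a PMD-sized folio */

static int dst_fd;

/* CPU1: shrink the mapped file while the DIO may be in flight. */
static void *truncator(void *arg)
{
	(void)arg;
	ftruncate(dst_fd, 4096);
	return NULL;
}

int main(void)
{
	pthread_t t;
	char *buf = malloc(LEN);

	memset(buf, 'x', LEN);

	/* Source data, re-opened O_DIRECT so that read() below pins the
	 * destination pages via GUP. */
	int src_fd = open("gup-src", O_RDWR | O_CREAT | O_TRUNC, 0600);
	write(src_fd, buf, LEN);
	close(src_fd);
	src_fd = open("gup-src", O_RDONLY | O_DIRECT);

	/* The "file" from the diagram, mapped shared and writable. */
	dst_fd = open("gup-dst", O_RDWR | O_CREAT | O_TRUNC, 0600);
	ftruncate(dst_fd, LEN);
	char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED,
		       dst_fd, 0);

	pthread_create(&t, NULL, truncator, NULL);

	/* CPU0: whatever_syscall(p) -- the kernel pins the file-backed
	 * pages at p and writes into them, possibly after the truncate
	 * on the other thread has already completed. */
	read(src_fd, p, LEN);

	pthread_join(t, NULL);
	munmap(p, LEN);
	close(src_fd);
	close(dst_fd);
	unlink("gup-src");
	unlink("gup-dst");
	free(buf);
	return 0;
}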

I don't think this issue can be fundamentally fixed as long as we allow
GUP for file-backed memory.

If the filesystem side cannot handle a non-zeroed tail of a large folio,
this SIGBUS semantics only hides the issue instead of addressing it.

And the race above does not seem to be far-fetched to me.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
Posted by Dave Chinner 3 months, 2 weeks ago
On Tue, Oct 21, 2025 at 07:16:26AM +0100, Kiryl Shutsemau wrote:
> On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
> > In critical paths like truncate, correctness and safety come first.
> > Performance is only a secondary consideration.  The overlap of
> > mmap() and truncate() is an area where we have had many, many bugs
> > and, at minimum, the current POSIX behaviour largely shields us from
> > serious stale data exposure events when those bugs (inevitably)
> > occur.
> 
> How do you prevent writes via GUP racing with truncate()?
> 
> Something like this:
> 
> 	CPU0				CPU1
> fd = open("file")
> p = mmap(fd)
> whatever_syscall(p)
>   get_user_pages(p, &page)
>   				truncate("file");
>   <write to page>
>   put_page(page);

Forget about truncate, go look at the comment above
writable_file_mapping_allowed() about using GUP this way.

i.e. file-backed mmap/GUP is a known broken anti-pattern. We've
spent the past 15+ years telling people that it is unfixably broken
and they will crash their kernel or corrupt their data if they do
this.

This is not supported functionality because real world production
use ends up exposing problems with sync and background writeback
races, truncate races, fallocate() races, writes into holes, writes
into preallocated regions, writes over shared extents that require
copy-on-write, etc, etc, ad nauseam.

If anyone is using file-backed mappings like this, then when it
breaks they get to keep all the broken pieces to themselves.

> The GUP can pin a page in the middle of a large folio well beyond the
> truncation point. The folio will not be split on truncation due to the
> elevated pin.
> 
> I don't think this issue can be fundamentally fixed as long as we allow
> GUP for file-backed memory.

Yup, but that's the least of the problems with GUP on file-backed
pages...

> If the filesystem side cannot handle a non-zeroed tail of a large folio,
> this SIGBUS semantics only hides the issue instead of addressing it.

The objections raised have not related to whether a filesystem
"cannot handle" this case or not. The concerns are about a change of
behaviour in a well known, widely documented API, as well as the
significant increase in surface area of potential data exposure it
would enable should there be Yet Another Truncate Bug Again Once
More.

-Dave.
-- 
Dave Chinner
david@fromorbit.com
Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
Posted by Andreas Dilger 3 months, 2 weeks ago
> On Oct 23, 2025, at 5:38 AM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Oct 21, 2025 at 07:16:26AM +0100, Kiryl Shutsemau wrote:
>> On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
>>> In critical paths like truncate, correctness and safety come first.
>>> Performance is only a secondary consideration.  The overlap of
>>> mmap() and truncate() is an area where we have had many, many bugs
>>> and, at minimum, the current POSIX behaviour largely shields us from
>>> serious stale data exposure events when those bugs (inevitably)
>>> occur.
>> 
>> How do you prevent writes via GUP racing with truncate()?
>> 
>> Something like this:
>> 
>> 	CPU0				CPU1
>> fd = open("file")
>> p = mmap(fd)
>> whatever_syscall(p)
>>  get_user_pages(p, &page)
>>  				truncate("file");
>>  <write to page>
>>  put_page(page);
> 
> Forget about truncate, go look at the comment above
> writable_file_mapping_allowed() about using GUP this way.
> 
> i.e. file-backed mmap/GUP is a known broken anti-pattern. We've
> spent the past 15+ years telling people that it is unfixably broken
> and they will crash their kernel or corrupt their data if they do
> this.
> 
> This is not supported functionality because real world production
> use ends up exposing problems with sync and background writeback
> races, truncate races, fallocate() races, writes into holes, writes
> into preallocated regions, writes over shared extents that require
> copy-on-write, etc, etc, ad nauseam.
> 
> If anyone is using file-backed mappings like this, then when it
> breaks they get to keep all the broken pieces to themselves.

Should ftruncate("file") return ETXTBUSY in this case, so that users
and applications know this doesn't work/isn't safe?  Unfortunately,
today's application developers barely even know how IO is done, so
there is little chance that they would understand subtleties like this.

Cheers, Andreas

>> The GUP can pin a page in the middle of a large folio well beyond the
>> truncation point. The folio will not be split on truncation due to the
>> elevated pin.
>> 
>> I don't think this issue can be fundamentally fixed as long as we allow
>> GUP for file-backed memory.
> 
> Yup, but that's the least of the problems with GUP on file-backed
> pages...
> 
>> If the filesystem side cannot handle a non-zeroed tail of a large folio,
>> this SIGBUS semantics only hides the issue instead of addressing it.
> 
> The objections raised have not related to whether a filesystem
> "cannot handle" this case or not. The concerns are about a change of
> behaviour in a well known, widely documented API, as well as the
> significant increase in surface area of potential data exposure it
> would enable should there be Yet Another Truncate Bug Again Once
> More.
> 
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
> 


Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
Posted by Kiryl Shutsemau 3 months, 2 weeks ago
On Tue, Oct 21, 2025 at 07:16:33AM +0100, Kiryl Shutsemau wrote:
> On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
> > In critical paths like truncate, correctness and safety come first.
> > Performance is only a secondary consideration.  The overlap of
> > mmap() and truncate() is an area where we have had many, many bugs
> > and, at minimum, the current POSIX behaviour largely shields us from
> > serious stale data exposure events when those bugs (inevitably)
> > occur.
> 
> How do you prevent writes via GUP racing with truncate()?
> 
> Something like this:
> 
> 	CPU0				CPU1
> fd = open("file")
> p = mmap(fd)
> whatever_syscall(p)
>   get_user_pages(p, &page)
>   				truncate("file");
>   <write to page>
>   put_page(page);
> 
> The GUP can pin a page in the middle of a large folio well beyond the
> truncation point. The folio will not be split on truncation due to the
> elevated pin.
> 
> I don't think this issue can be fundamentally fixed as long as we allow
> GUP for file-backed memory.
> 
> If the filesystem side cannot handle a non-zeroed tail of a large folio,
> this SIGBUS semantics only hides the issue instead of addressing it.
> 
> And the race above does not seem to be far-fetched to me.

Any comments?

Jan, I remember you worked a lot on making GUP semantics sane-ish for
file pages. Any clues as to whether I am imagining a problem here?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
Posted by Christoph Hellwig 3 months, 2 weeks ago
On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
> Fundamentally, we really don't care about the mapping/tlb
> performance of the PTE fragments at EOF. Anyone using files large
> enough to notice the TLB overhead improvements from mapping large
> folios is not going to notice that the EOF mapping has a slightly
> higher TLB miss overhead than everywhere else in the file.
> 
> Please just fix the regression.

Yeah.  I'm not even sure why we're having this discussion.  The
behavior is mandated, we have test cases for it and there is
literally no practical upside in changing the behavior from what
we've done forever and what is mandated in Posix.
Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
Posted by Kiryl Shutsemau 3 months, 2 weeks ago
On Mon, Oct 20, 2025 at 11:12:40PM -0700, Christoph Hellwig wrote:
> On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
> > Fundamentally, we really don't care about the mapping/tlb
> > performance of the PTE fragments at EOF. Anyone using files large
> > enough to notice the TLB overhead improvements from mapping large
> > folios is not going to notice that the EOF mapping has a slightly
> > higher TLB miss overhead than everywhere else in the file.
> > 
> > Please just fix the regression.
> 
> Yeah.  I'm not even sure why we're having this discussion.  The
> behavior is mandated, we have test cases for it and there is
> literally no practical upside in changing the behavior from what
> we've done forever and what is mandated in Posix.

Okay, will fix.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov