[v3] Fix SIGBUS semantics with large folios

[PATCHv3 1/2] mm/memory: Do not populate page table entries beyond i_size

Posted by Kiryl Shutsemau 3 months, 2 weeks ago

From: Kiryl Shutsemau <kas@kernel.org>

Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
supposed to generate SIGBUS.

Recent changes attempted to fault in full folio where possible. They did
not respect i_size, which led to populating PTEs beyond i_size and
breaking SIGBUS semantics.

Darrick reported generic/749 breakage because of this.

However, the problem existed before the recent changes. With huge=always
tmpfs, any write to a file leads to a PMD-size allocation. A subsequent
fault-in of the folio will install a PMD mapping regardless of i_size.

Fix filemap_map_pages() and finish_fault() to not install:
  - PTEs beyond i_size;
  - PMD mappings across i_size;

Make an exception for shmem/tmpfs, which has for a long time
intentionally mapped with PMDs across i_size.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Fixes: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
Fixes: 357b92761d94 ("mm/filemap: map entire large folio faultaround")
Fixes: 01c70267053d ("fs: add a filesystem flag for THPs")
Reported-by: "Darrick J. Wong" <djwong@kernel.org>
---
 mm/filemap.c | 28 ++++++++++++++++++++--------
 mm/memory.c  | 20 +++++++++++++++++++-
 2 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index b7b297c1ad4f..ff75bd89b68c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3690,7 +3690,8 @@ static struct folio *next_uptodate_folio(struct xa_state *xas,
 static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 			struct folio *folio, unsigned long start,
 			unsigned long addr, unsigned int nr_pages,
-			unsigned long *rss, unsigned short *mmap_miss)
+			unsigned long *rss, unsigned short *mmap_miss,
+			bool can_map_large)
 {
 	unsigned int ref_from_caller = 1;
 	vm_fault_t ret = 0;
@@ -3705,7 +3706,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 	 * The folio must not cross VMA or page table boundary.
 	 */
 	addr0 = addr - start * PAGE_SIZE;
-	if (folio_within_vma(folio, vmf->vma) &&
+	if (can_map_large && folio_within_vma(folio, vmf->vma) &&
 	    (addr0 & PMD_MASK) == ((addr0 + folio_size(folio) - 1) & PMD_MASK)) {
 		vmf->pte -= start;
 		page -= start;
@@ -3820,13 +3821,27 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	unsigned long rss = 0;
 	unsigned int nr_pages = 0, folio_type;
 	unsigned short mmap_miss = 0, mmap_miss_saved;
+	bool can_map_large;
 
 	rcu_read_lock();
 	folio = next_uptodate_folio(&xas, mapping, end_pgoff);
 	if (!folio)
 		goto out;
 
-	if (filemap_map_pmd(vmf, folio, start_pgoff)) {
+	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+	end_pgoff = min(end_pgoff, file_end);
+
+	/*
+	 * Do not allow to map with PTEs beyond i_size and with PMD
+	 * across i_size to preserve SIGBUS semantics.
+	 *
+	 * Make an exception for shmem/tmpfs that for long time
+	 * intentionally mapped with PMDs across i_size.
+	 */
+	can_map_large = shmem_mapping(mapping) ||
+		file_end >= folio_next_index(folio);
+
+	if (can_map_large && filemap_map_pmd(vmf, folio, start_pgoff)) {
 		ret = VM_FAULT_NOPAGE;
 		goto out;
 	}
@@ -3839,10 +3854,6 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		goto out;
 	}
 
-	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
-	if (end_pgoff > file_end)
-		end_pgoff = file_end;
-
 	folio_type = mm_counter_file(folio);
 	do {
 		unsigned long end;
@@ -3859,7 +3870,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		else
 			ret |= filemap_map_folio_range(vmf, folio,
 					xas.xa_index - folio->index, addr,
-					nr_pages, &rss, &mmap_miss);
+					nr_pages, &rss, &mmap_miss,
+					can_map_large);
 
 		folio_unlock(folio);
 	} while ((folio = next_uptodate_folio(&xas, mapping, end_pgoff)) != NULL);
diff --git a/mm/memory.c b/mm/memory.c
index 39e21688e74b..1a3eb070f8df 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -77,6 +77,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/pgalloc.h>
 #include <linux/uaccess.h>
+#include <linux/shmem_fs.h>
 
 #include <trace/events/kmem.h>
 
@@ -5545,8 +5546,25 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 			return ret;
 	}
 
+	if (!needs_fallback && vma->vm_file) {
+		struct address_space *mapping = vma->vm_file->f_mapping;
+		pgoff_t file_end;
+
+		file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
+
+		/*
+		 * Do not allow to map with PTEs beyond i_size and with PMD
+		 * across i_size to preserve SIGBUS semantics.
+		 *
+		 * Make an exception for shmem/tmpfs that for long time
+		 * intentionally mapped with PMDs across i_size.
+		 */
+		needs_fallback = !shmem_mapping(mapping) &&
+			file_end < folio_next_index(folio);
+	}
+
 	if (pmd_none(*vmf->pmd)) {
-		if (folio_test_pmd_mappable(folio)) {
+		if (!needs_fallback && folio_test_pmd_mappable(folio)) {
 			ret = do_set_pmd(vmf, folio, page);
 			if (ret != VM_FAULT_FALLBACK)
 				return ret;
-- 
2.50.1
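
For readers following the index arithmetic in the two hunks above, here
is a small userspace model of the checks. It is only a sketch: the
model_* names and MODEL_PAGE_SIZE are invented for illustration and are
not kernel APIs, a 4 KiB page size is assumed, and folio_index and
folio_nr_pages stand in for folio->index and folio_nr_pages().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for illustration only: 4 KiB pages. */
#define MODEL_PAGE_SIZE 4096ULL

/* Index of the last page that holds file data (i_size must be > 0). */
static uint64_t model_file_end(uint64_t i_size)
{
	assert(i_size > 0);
	return (i_size + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE - 1;
}

/*
 * Mirrors the can_map_large expression in the filemap_map_pages() hunk:
 * shmem is always allowed, otherwise the last file page index must be
 * at or past the index that follows the folio.
 */
static bool model_can_map_large(bool is_shmem, uint64_t i_size,
				uint64_t folio_index, uint64_t folio_nr_pages)
{
	uint64_t folio_next = folio_index + folio_nr_pages;

	return is_shmem || model_file_end(i_size) >= folio_next;
}

/*
 * Mirrors the needs_fallback expression in the finish_fault() hunk.
 * Note that file_end there is computed without the "- 1", so it is a
 * page count rather than the index of the last page.
 */
static bool model_needs_fallback(bool is_shmem, uint64_t i_size,
				 uint64_t folio_index, uint64_t folio_nr_pages)
{
	uint64_t file_pages = (i_size + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;

	return !is_shmem && file_pages < folio_index + folio_nr_pages;
}
```

With this model, a 512-page folio at index 0 in a non-shmem file of
exactly 2 MiB gives model_can_map_large() == false and
model_needs_fallback() == false, because the two expressions as posted
compare against a last-page index and a page count respectively. The
sketch only restates the posted expressions; it does not say which bound
is intended.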

Re: [PATCHv3 1/2] mm/memory: Do not populate page table entries beyond i_size

Posted by Andrew Morton 3 months, 2 weeks ago

On Mon, 27 Oct 2025 11:56:35 +0000 Kiryl Shutsemau <kirill@shutemov.name> wrote:

> From: Kiryl Shutsemau <kas@kernel.org>
> 
> Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
> supposed to generate SIGBUS.
> 
> Recent changes attempted to fault in full folio where possible. They did
> not respect i_size, which led to populating PTEs beyond i_size and
> breaking SIGBUS semantics.
> 
> Darrick reported generic/749 breakage because of this.
> 
> However, the problem existed before the recent changes. With huge=always
> tmpfs, any write to a file leads to PMD-size allocation. Following the
> fault-in of the folio will install PMD mapping regardless of i_size.
> 
> Fix filemap_map_pages() and finish_fault() to not install:
>   - PTEs beyond i_size;
>   - PMD mappings across i_size;
> 
> Make an exception for shmem/tmpfs that for long time intentionally
> mapped with PMDs across i_size.
> 
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> Fixes: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
> Fixes: 357b92761d94 ("mm/filemap: map entire large folio faultaround")
> Fixes: 01c70267053d ("fs: add a filesystem flag for THPs")

Multiple Fixes: are confusing.

We have two 6.18-rcX targets and one from 2020.  Are we asking people
to backport this all the way back to 2020?  If so I'd suggest the
removal of the more recent Fixes: targets.

Also, is [2/2] to be backported?  The changelog makes it sound that way,
but no Fixes: was identified?

Re: [PATCHv3 1/2] mm/memory: Do not populate page table entries beyond i_size

Posted by Kiryl Shutsemau 3 months, 2 weeks ago

On Mon, Oct 27, 2025 at 03:33:23PM -0700, Andrew Morton wrote:
> On Mon, 27 Oct 2025 11:56:35 +0000 Kiryl Shutsemau <kirill@shutemov.name> wrote:
> 
> > From: Kiryl Shutsemau <kas@kernel.org>
> > 
> > Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
> > supposed to generate SIGBUS.
> > 
> > Recent changes attempted to fault in full folio where possible. They did
> > not respect i_size, which led to populating PTEs beyond i_size and
> > breaking SIGBUS semantics.
> > 
> > Darrick reported generic/749 breakage because of this.
> > 
> > However, the problem existed before the recent changes. With huge=always
> > tmpfs, any write to a file leads to PMD-size allocation. Following the
> > fault-in of the folio will install PMD mapping regardless of i_size.
> > 
> > Fix filemap_map_pages() and finish_fault() to not install:
> >   - PTEs beyond i_size;
> >   - PMD mappings across i_size;
> > 
> > Make an exception for shmem/tmpfs that for long time intentionally
> > mapped with PMDs across i_size.
> > 
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> > Fixes: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
> > Fixes: 357b92761d94 ("mm/filemap: map entire large folio faultaround")
> > Fixes: 01c70267053d ("fs: add a filesystem flag for THPs")
> 
> Multiple Fixes: are confusing.
> 
> We have two 6.18-rcX targets and one from 2020.  Are we asking people
> to backport this all the way back to 2020?  If so I'd suggest the
> removal of the more recent Fixes: targets.

Okay, fair enough.

> Also, is [2/2] to be backported?  The changelog makes it sound that way,
> but no Fixes: was identified?

Looking at split-on-truncate history, looks like this is the right
commit to point to:

Fixes: b9a8a4195c7d ("truncate,shmem: Handle truncates that split large folios")

It moves split logic from shmem-specific to generic truncate.

As with the first patch, it will not be a trivial backport, but I am
around to help with this.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCHv3 1/2] mm/memory: Do not populate page table entries beyond i_size

Posted by Hugh Dickins 3 months, 1 week ago

On Tue, 28 Oct 2025, Kiryl Shutsemau wrote:
> On Mon, Oct 27, 2025 at 03:33:23PM -0700, Andrew Morton wrote:
> > On Mon, 27 Oct 2025 11:56:35 +0000 Kiryl Shutsemau <kirill@shutemov.name> wrote:
> > 
> > > From: Kiryl Shutsemau <kas@kernel.org>
> > > 
> > > Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
> > > supposed to generate SIGBUS.
> > > 
> > > Recent changes attempted to fault in full folio where possible. They did
> > > not respect i_size, which led to populating PTEs beyond i_size and
> > > breaking SIGBUS semantics.
> > > 
> > > Darrick reported generic/749 breakage because of this.
> > > 
> > > However, the problem existed before the recent changes. With huge=always
> > > tmpfs, any write to a file leads to PMD-size allocation. Following the
> > > fault-in of the folio will install PMD mapping regardless of i_size.
> > > 
> > > Fix filemap_map_pages() and finish_fault() to not install:
> > >   - PTEs beyond i_size;
> > >   - PMD mappings across i_size;
> > > 
> > > Make an exception for shmem/tmpfs that for long time intentionally
> > > mapped with PMDs across i_size.

Thanks for the v3 patches, which do now suit huge tmpfs.
Not beautiful, but no longer regressing.

> > > 
> > > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> > > Fixes: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
> > > Fixes: 357b92761d94 ("mm/filemap: map entire large folio faultaround")
> > > Fixes: 01c70267053d ("fs: add a filesystem flag for THPs")
> > 
> > Multiple Fixes: are confusing.
> > 
> > We have two 6.18-rcX targets and one from 2020.  Are we asking people
> > to backport this all the way back to 2020?  If so I'd suggest the
> > removal of the more recent Fixes: targets.
> 
> Okay, fair enough.
> 
> > Also, is [2/2] to be backported?  The changelog makes it sound that way,
> > but no Fixes: was identified?
> 
> Looking at split-on-truncate history, looks like this is the right
> commit to point to:
> 
> Fixes: b9a8a4195c7d ("truncate,shmem: Handle truncates that split large folios")

I agree that's the right Fixee for 2/2: the one which introduced
splitting a large folio to non-shmem filesystems in 5.17.

But you're giving yourself too hard a time of backporting with your
5.10 Fixee 01c70267053d for 1/2: the only filesystem which set the
flag then was tmpfs, which you're now excepting.  The flag got
renamed later (in 5.16) and then in 5.17 at last there was another
filesystem to set it.  So, this 1/2 would be

Fixes: 6795801366da ("xfs: Support large folios")

> 
> It moves split logic from shmem-specific to generic truncate.
> 
> As with the first patch, it will not be a trivial backport, but I am
> around to help with this.
> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCHv3 1/2] mm/memory: Do not populate page table entries beyond i_size

Posted by Matthew Wilcox 3 months, 1 week ago

On Wed, Oct 29, 2025 at 02:45:52AM -0700, Hugh Dickins wrote:
> But you're giving yourself too hard a time of backporting with your
> 5.10 Fixee 01c70267053d for 1/2: the only filesystem which set the
> flag then was tmpfs, which you're now excepting.  The flag got
> renamed later (in 5.16) and then in 5.17 at last there was another
> filesystem to set it.  So, this 1/2 would be
> 
> Fixes: 6795801366da ("xfs: Support large folios")

I haven't been able to keep up with this patchset -- sorry.

But this problem didn't exist until bs>PS support was added because we
would never add a folio to the page cache which extended beyond i_size
before.  We'd shrink the folio order allocated in do_page_cache_ra()
(actually, we still do, but page_cache_ra_unbounded() rounds it up
again).  So it doesn't fix that commit at all, but something far more
recent.
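
The order-shrinking and rounding-up described above can be pictured with
a toy userspace model. This is a sketch built only from the description
in this message, not from the readahead code: the model_* helpers are
hypothetical, folios are assumed to be naturally aligned, and min_order
stands in for the minimum folio order a block size > page size (bs>PS)
mapping would require.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Reduce the order until the naturally aligned folio that contains
 * index ends within the file (nr_file_pages pages long).
 */
static unsigned int model_shrink_order(uint64_t index, unsigned int order,
				       uint64_t nr_file_pages)
{
	assert(order < 63);
	while (order > 0) {
		uint64_t nr = 1ULL << order;
		uint64_t start = index & ~(nr - 1);

		if (start + nr <= nr_file_pages)
			break;
		order--;
	}
	return order;
}

/* Round the order back up to the mapping's minimum order. */
static unsigned int model_apply_min_order(unsigned int order,
					  unsigned int min_order)
{
	return order < min_order ? min_order : order;
}

/* True if the aligned folio containing index extends past the file end. */
static bool model_folio_crosses_eof(uint64_t index, unsigned int order,
				    uint64_t nr_file_pages)
{
	uint64_t nr = 1ULL << order;
	uint64_t start = index & ~(nr - 1);

	return start + nr > nr_file_pages;
}
```

For a 5-page file and a fault at index 4, shrinking from order 2 ends at
order 0, which stays within the file. Applying a minimum order of 2 puts
the folio back at pages 4..7, which extends past the end of the file.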

Re: [PATCHv3 1/2] mm/memory: Do not populate page table entries beyond i_size

Posted by Kiryl Shutsemau 3 months, 1 week ago

On Sat, Nov 01, 2025 at 05:00:47AM +0000, Matthew Wilcox wrote:
> On Wed, Oct 29, 2025 at 02:45:52AM -0700, Hugh Dickins wrote:
> > But you're giving yourself too hard a time of backporting with your
> > 5.10 Fixee 01c70267053d for 1/2: the only filesystem which set the
> > flag then was tmpfs, which you're now excepting.  The flag got
> > renamed later (in 5.16) and then in 5.17 at last there was another
> > filesystem to set it.  So, this 1/2 would be
> > 
> > Fixes: 6795801366da ("xfs: Support large folios")
> 
> I haven't been able to keep up with this patchset -- sorry.
> 
> But this problem didn't exist until bs>PS support was added because we
> would never add a folio to the page cache which extended beyond i_size
> before.  We'd shrink the folio order allocated in do_page_cache_ra()
> (actually, we still do, but page_cache_ra_unbounded() rounds it up
> again).  So it doesn't fix that commit at all, but something far more
> recent.

What about the truncate path? We could allocate within i_size at first
and then truncate; if truncation fails to split the folio, the mapping
stays beyond i_size.
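
The userspace-visible side of this scenario can be probed with a small
test along these lines. It is a sketch, not the generic/749 test: the
helper names and the /tmp path are made up, and the result can depend on
the filesystem backing /tmp. It does not force large folios either, so
it only shows the expected SIGBUS behaviour after a shrinking truncate.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf probe_env;

static void on_sigbus(int sig)
{
	(void)sig;
	siglongjmp(probe_env, 1);
}

/* Returns 1 if reading one byte at addr raises SIGBUS, 0 otherwise. */
static int read_raises_sigbus(const volatile char *addr)
{
	struct sigaction sa, old;
	volatile int faulted = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, &old);

	if (sigsetjmp(probe_env, 1) == 0)
		(void)*addr;		/* volatile read, may fault */
	else
		faulted = 1;		/* reached via siglongjmp() */

	sigaction(SIGBUS, &old, NULL);
	return faulted;
}

/*
 * Hypothetical probe: map four pages of a temporary file, touch them
 * all, truncate the file to one page, then read page 0 and page 2.
 * Returns -1 on setup failure; otherwise bit 0 is set if page 0 raised
 * SIGBUS and bit 1 is set if page 2 did.
 */
static int probe_after_truncate(void)
{
	char path[] = "/tmp/sigbus-probe-XXXXXX";
	long ps = sysconf(_SC_PAGESIZE);
	int fd, result = -1;
	char *map;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);

	if (ps <= 0 || ftruncate(fd, 4 * ps) != 0)
		goto out_close;

	map = mmap(NULL, 4 * ps, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	memset(map, 1, 4 * ps);		/* fault in all four pages */

	if (ftruncate(fd, ps) == 0)
		result = read_raises_sigbus(map) |
			 (read_raises_sigbus(map + 2 * ps) << 1);

	munmap(map, 4 * ps);
out_close:
	close(fd);
	return result;
}
```

When SIGBUS semantics are preserved, the probe is expected to return 2:
page 0 is still inside i_size and reads fine, while page 2 lies beyond
i_size rounded up to PAGE_SIZE and raises SIGBUS.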

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCHv3 1/2] mm/memory: Do not populate page table entries beyond i_size

Posted by Matthew Wilcox 3 months, 1 week ago

On Mon, Nov 03, 2025 at 10:59:00AM +0000, Kiryl Shutsemau wrote:
> On Sat, Nov 01, 2025 at 05:00:47AM +0000, Matthew Wilcox wrote:
> > On Wed, Oct 29, 2025 at 02:45:52AM -0700, Hugh Dickins wrote:
> > > But you're giving yourself too hard a time of backporting with your
> > > 5.10 Fixee 01c70267053d for 1/2: the only filesystem which set the
> > > flag then was tmpfs, which you're now excepting.  The flag got
> > > renamed later (in 5.16) and then in 5.17 at last there was another
> > > filesystem to set it.  So, this 1/2 would be
> > > 
> > > Fixes: 6795801366da ("xfs: Support large folios")
> > 
> > I haven't been able to keep up with this patchset -- sorry.
> > 
> > But this problem didn't exist until bs>PS support was added because we
> > would never add a folio to the page cache which extended beyond i_size
> > before.  We'd shrink the folio order allocated in do_page_cache_ra()
> > (actually, we still do, but page_cache_ra_unbounded() rounds it up
> > again).  So it doesn't fix that commit at all, but something far more
> > recent.
> 
> What about truncate path? We could allocate within i_size at first, then
> truncate, if truncation failed to split the folio the mapping stays
> beyond i_size.

Is it worth backporting all this way to solve this niche case?

Re: [PATCHv3 1/2] mm/memory: Do not populate page table entries beyond i_size

Posted by Kiryl Shutsemau 3 months, 1 week ago

On Mon, Nov 03, 2025 at 02:35:57PM +0000, Matthew Wilcox wrote:
> On Mon, Nov 03, 2025 at 10:59:00AM +0000, Kiryl Shutsemau wrote:
> > On Sat, Nov 01, 2025 at 05:00:47AM +0000, Matthew Wilcox wrote:
> > > On Wed, Oct 29, 2025 at 02:45:52AM -0700, Hugh Dickins wrote:
> > > > But you're giving yourself too hard a time of backporting with your
> > > > 5.10 Fixee 01c70267053d for 1/2: the only filesystem which set the
> > > > flag then was tmpfs, which you're now excepting.  The flag got
> > > > renamed later (in 5.16) and then in 5.17 at last there was another
> > > > filesystem to set it.  So, this 1/2 would be
> > > > 
> > > > Fixes: 6795801366da ("xfs: Support large folios")
> > > 
> > > I haven't been able to keep up with this patchset -- sorry.
> > > 
> > > But this problem didn't exist until bs>PS support was added because we
> > > would never add a folio to the page cache which extended beyond i_size
> > > before.  We'd shrink the folio order allocated in do_page_cache_ra()
> > > (actually, we still do, but page_cache_ra_unbounded() rounds it up
> > > again).  So it doesn't fix that commit at all, but something far more
> > > recent.
> > 
> > What about truncate path? We could allocate within i_size at first, then
> > truncate, if truncation failed to split the folio the mapping stays
> > beyond i_size.
> 
> Is it worth backporting all this way to solve this niche case?

Dave says it is a correctness issue, so... yes?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCHv3 1/2] mm/memory: Do not populate page table entries beyond i_size

Posted by Andrew Morton 3 months, 1 week ago

On Wed, 29 Oct 2025 02:45:52 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> > Fixes: b9a8a4195c7d ("truncate,shmem: Handle truncates that split large folios")
> 
> I agree that's the right Fixee for 2/2: the one which introduced
> splitting a large folio to non-shmem filesystems in 5.17.
> 
> But you're giving yourself too hard a time of backporting with your
> 5.10 Fixee 01c70267053d for 1/2: the only filesystem which set the
> flag then was tmpfs, which you're now excepting.  The flag got
> renamed later (in 5.16) and then in 5.17 at last there was another
> filesystem to set it.  So, this 1/2 would be
> 
> Fixes: 6795801366da ("xfs: Support large folios")

I updated the changelog in mm.git's copy of this patch, thanks.

Re: [PATCHv3 1/2] mm/memory: Do not populate page table entries beyond i_size

Posted by Kiryl Shutsemau 3 months, 1 week ago

On Wed, Oct 29, 2025 at 02:45:52AM -0700, Hugh Dickins wrote:
> On Tue, 28 Oct 2025, Kiryl Shutsemau wrote:
> > 
> > > Also, is [2/2] to be backported?  The changelog makes it sound that way,
> > > but no Fixes: was identified?
> > 
> > Looking at split-on-truncate history, looks like this is the right
> > commit to point to:
> > 
> > Fixes: b9a8a4195c7d ("truncate,shmem: Handle truncates that split large folios")
> 
> I agree that's the right Fixee for 2/2: the one which introduced
> splitting a large folio to non-shmem filesystems in 5.17.
> 
> But you're giving yourself too hard a time of backporting with your
> 5.10 Fixee 01c70267053d for 1/2: the only filesystem which set the
> flag then was tmpfs, which you're now excepting.  The flag got
> renamed later (in 5.16) and then in 5.17 at last there was another
> filesystem to set it.  So, this 1/2 would be
> 
> Fixes: 6795801366da ("xfs: Support large folios")

Good point.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov