[v2] ext4: enable block size larger than page size

[PATCH v2 00/24] ext4: enable block size larger than page size

Posted by libaokun@huaweicloud.com 3 months ago

From: Baokun Li <libaokun1@huawei.com>

Changes since v1:
 * Collect RVB from Jan Kara and Zhang Yi. (Thanks for your review!)
 * Patch 4: Just use blocksize in the rounding. (Suggested by Jan Kara)
 * Patch 7: Use kvmalloc() instead of allocating contiguous physical
    pages. (Suggested by Jan Kara)
 * Patch 12: Fix some typos. (Suggested by Jan Kara)
 * Use clearer naming: EXT4_LBLK_TO_PG() and EXT4_PG_TO_LBLK().
    (Suggested by Jan Kara)
 * Patch 21: removed. After rebasing on Ted’s latest dev branch, this
    patch is no longer needed.
 * Patches 22-23: removed. The issue was resolved by removing the WARN_ON
    in the MM code, so we now rely on patch [1]. (Suggested by Matthew)
 * Add new Patch 21 to support data=journal under LBS. (Suggested by
    Jan Kara)
 * Add new Patch 22 to support fs verity under LBS.
 * New Patch 23: add the s_max_folio_order field instead of introducing
    the EXT4_MF_LARGE_FOLIO flag.
 * New Patch 24: rebase adaptation.

[v1]: https://lore.kernel.org/r/20251025032221.2905818-1-libaokun@huaweicloud.com

======

This series enables block size > page size (Large Block Size) in EXT4.

Since large folios are already supported for regular files, the required
changes are not substantial, but they are scattered across the code.
The changes primarily focus on cleaning up potential division-by-zero
errors, resolving negative left/right shifts, and correctly handling
mutually exclusive mount options.
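
To make the "negative shift" and "division by zero" hazards concrete, here is a minimal sketch of block/page index conversion that works whether the block size is smaller or larger than the page size. The function names are hypothetical and only model the purpose of EXT4_LBLK_TO_PG() and EXT4_PG_TO_LBLK(); they are not the kernel macros, and PAGE_SHIFT = 12 is an assumption matching the 4096-byte pages used in the benchmarks.

```python
# Illustrative sketch only: hypothetical helpers, not the kernel macros.

PAGE_SHIFT = 12  # assumption: PAGE_SIZE = 4096


def lblk_to_pg(lblk: int, blkbits: int, page_shift: int = PAGE_SHIFT) -> int:
    """Return the page index holding the first byte of logical block lblk."""
    # Going through the byte offset keeps every shift count non-negative,
    # whichever of the two sizes is larger.
    return (lblk << blkbits) >> page_shift


def pg_to_lblk(pg: int, blkbits: int, page_shift: int = PAGE_SHIFT) -> int:
    """Return the logical block holding the first byte of page pg."""
    return (pg << page_shift) >> blkbits


def naive_lblk_to_pg(lblk: int, blkbits: int, page_shift: int = PAGE_SHIFT) -> int:
    """Older-style conversion that silently assumes blkbits <= page_shift."""
    # With blkbits > page_shift the shift count is negative. Python raises
    # ValueError here; in C the result is undefined behaviour.
    return lblk >> (page_shift - blkbits)


def blocks_per_page(blkbits: int, page_shift: int = PAGE_SHIFT) -> int:
    """Whole blocks per page; this is 0 once the block is larger than a page."""
    return (1 << page_shift) >> blkbits
```

With a 64k block size (blkbits = 16), blocks_per_page() is 0, so any code that divides by it would fault, and naive_lblk_to_pg() fails on the negative shift count.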

One somewhat troublesome issue is that allocating page units greater than
order-1 with __GFP_NOFAIL in __alloc_pages_slowpath() can trigger an
unexpected WARN_ON. With LBS support, EXT4 and jbd2 may use __GFP_NOFAIL
to allocate large folios when reading metadata. The issue was resolved by
removing the WARN_ON in the MM code, so we now rely on patch [1].

[1]: https://lore.kernel.org/r/20251105085652.4081123-1-libaokun@huaweicloud.com

Patch series based on Ted’s latest dev branch.

`kvm-xfstests -c ext4/all -g auto` has been executed with no new failures.
`kvm-xfstests -c ext4/64k -g auto` has been executed and no Oops was
observed, but allocation failures for large folios may trigger warn_alloc()
warnings.

Here is some performance test data for your reference:

Testing EXT4 filesystems with different block sizes, measuring
single-threaded dd bandwidth for BIO/DIO with varying bs values.

Before (PAGE_SIZE=4096):

      BIO     | bs=4k    | bs=8k    | bs=16k   | bs=32k   | bs=64k
--------------|----------|----------|----------|----------|------------
 4k           | 1.5 GB/s | 2.1 GB/s | 2.8 GB/s | 3.4 GB/s | 3.8 GB/s
 8k (bigalloc)| 1.4 GB/s | 2.0 GB/s | 2.6 GB/s | 3.1 GB/s | 3.4 GB/s
 16k(bigalloc)| 1.5 GB/s | 2.0 GB/s | 2.6 GB/s | 3.2 GB/s | 3.6 GB/s
 32k(bigalloc)| 1.5 GB/s | 2.1 GB/s | 2.7 GB/s | 3.3 GB/s | 3.7 GB/s
 64k(bigalloc)| 1.5 GB/s | 2.1 GB/s | 2.8 GB/s | 3.4 GB/s | 3.8 GB/s
              
      DIO     | bs=4k    | bs=8k    | bs=16k   | bs=32k   | bs=64k
--------------|----------|----------|----------|----------|------------
 4k           | 194 MB/s | 366 MB/s | 626 MB/s | 1.0 GB/s | 1.4 GB/s
 8k (bigalloc)| 188 MB/s | 359 MB/s | 612 MB/s | 996 MB/s | 1.4 GB/s
 16k(bigalloc)| 208 MB/s | 378 MB/s | 642 MB/s | 1.0 GB/s | 1.4 GB/s
 32k(bigalloc)| 184 MB/s | 368 MB/s | 637 MB/s | 995 MB/s | 1.4 GB/s
 64k(bigalloc)| 208 MB/s | 389 MB/s | 634 MB/s | 1.0 GB/s | 1.4 GB/s

Patched (PAGE_SIZE=4096):

   BIO   | bs=4k    | bs=8k    | bs=16k   | bs=32k   | bs=64k
---------|----------|----------|----------|----------|------------
 4k      | 1.5 GB/s | 2.1 GB/s | 2.8 GB/s | 3.4 GB/s | 3.8 GB/s
 8k (LBS)| 1.7 GB/s | 2.3 GB/s | 3.2 GB/s | 4.2 GB/s | 4.7 GB/s
 16k(LBS)| 2.0 GB/s | 2.7 GB/s | 3.6 GB/s | 4.7 GB/s | 5.4 GB/s
 32k(LBS)| 2.2 GB/s | 3.1 GB/s | 3.9 GB/s | 4.9 GB/s | 5.7 GB/s
 64k(LBS)| 2.4 GB/s | 3.3 GB/s | 4.2 GB/s | 5.1 GB/s | 6.0 GB/s

   DIO   | bs=4k    | bs=8k    | bs=16k   | bs=32k   | bs=64k
---------|----------|----------|----------|----------|------------
 4k      | 204 MB/s | 355 MB/s | 627 MB/s | 1.0 GB/s | 1.4 GB/s
 8k (LBS)| 210 MB/s | 356 MB/s | 602 MB/s | 997 MB/s | 1.4 GB/s
 16k(LBS)| 191 MB/s | 361 MB/s | 589 MB/s | 981 MB/s | 1.4 GB/s
 32k(LBS)| 181 MB/s | 330 MB/s | 581 MB/s | 951 MB/s | 1.3 GB/s
 64k(LBS)| 148 MB/s | 272 MB/s | 499 MB/s | 840 MB/s | 1.3 GB/s


The results show:

 * The code changes have almost no impact on the original 4k write
   performance of ext4.
 * Compared with bigalloc, LBS improves BIO write performance by about 50%
   on average.
 * Compared with bigalloc, LBS shows degradation in DIO write performance,
   which increases as the filesystem block size grows and the test bs
   decreases, with a maximum degradation of about 30%.
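
The percentages in the list above can be rechecked from the tables. The sketch below copies the bigalloc ("Before") and LBS ("Patched") rows and computes the mean BIO gain and the worst DIO change. It assumes dd's decimal units (1 GB/s = 1000 MB/s) when mixing the two; the variable and function names are illustrative only.

```python
# Figures copied from the tables above. Columns are dd bs = 4k, 8k, 16k,
# 32k, 64k; keys are the filesystem block size.
BS_COLUMNS = ["4k", "8k", "16k", "32k", "64k"]

BIO_BIGALLOC = {  # GB/s
    "8k": [1.4, 2.0, 2.6, 3.1, 3.4],
    "16k": [1.5, 2.0, 2.6, 3.2, 3.6],
    "32k": [1.5, 2.1, 2.7, 3.3, 3.7],
    "64k": [1.5, 2.1, 2.8, 3.4, 3.8],
}
BIO_LBS = {  # GB/s
    "8k": [1.7, 2.3, 3.2, 4.2, 4.7],
    "16k": [2.0, 2.7, 3.6, 4.7, 5.4],
    "32k": [2.2, 3.1, 3.9, 4.9, 5.7],
    "64k": [2.4, 3.3, 4.2, 5.1, 6.0],
}
DIO_BIGALLOC = {  # MB/s, assuming 1 GB/s = 1000 MB/s
    "8k": [188, 359, 612, 996, 1400],
    "16k": [208, 378, 642, 1000, 1400],
    "32k": [184, 368, 637, 995, 1400],
    "64k": [208, 389, 634, 1000, 1400],
}
DIO_LBS = {  # MB/s, assuming 1 GB/s = 1000 MB/s
    "8k": [210, 356, 602, 997, 1400],
    "16k": [191, 361, 589, 981, 1400],
    "32k": [181, 330, 581, 951, 1300],
    "64k": [148, 272, 499, 840, 1300],
}


def relative_changes(before, after):
    """Map (fs block size, dd bs) to (after - before) / before."""
    return {
        (fs_bs, BS_COLUMNS[i]): (after[fs_bs][i] - b) / b
        for fs_bs, row in before.items()
        for i, b in enumerate(row)
    }


bio = relative_changes(BIO_BIGALLOC, BIO_LBS)
dio = relative_changes(DIO_BIGALLOC, DIO_LBS)

mean_bio_gain = sum(bio.values()) / len(bio)
worst_dio_cell = min(dio, key=dio.get)  # most negative change

print(f"mean BIO gain: {mean_bio_gain:.1%}")
print(f"worst DIO change: {dio[worst_dio_cell]:.1%} at {worst_dio_cell}")
```

On the rounded figures quoted in the tables, the mean BIO gain lands between 40% and 45%, and the largest DIO drop is about 30%, at a 64k filesystem block size with bs=8k.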

The DIO regression is primarily due to the increased time spent in
crc32c_arch() within ext4_block_bitmap_csum_set() during block allocation,
as the block size grows larger. This indicates that larger filesystem block
sizes are not always better; please choose an appropriate block size based
on your I/O workload characteristics.

We are also planning further optimizations for block allocation under LBS
in the future.

Comments and questions are, as always, welcome.

Thanks,
Baokun

Baokun Li (21):
  ext4: remove page offset calculation in ext4_block_truncate_page()
  ext4: remove PAGE_SIZE checks for rec_len conversion
  ext4: make ext4_punch_hole() support large block size
  ext4: enable DIOREAD_NOLOCK by default for BS > PS as well
  ext4: introduce s_min_folio_order for future BS > PS support
  ext4: support large block size in ext4_calculate_overhead()
  ext4: support large block size in ext4_readdir()
  ext4: add EXT4_LBLK_TO_B macro for logical block to bytes conversion
  ext4: add EXT4_LBLK_TO_PG and EXT4_PG_TO_LBLK for block/page
    conversion
  ext4: support large block size in ext4_mb_load_buddy_gfp()
  ext4: support large block size in ext4_mb_get_buddy_page_lock()
  ext4: support large block size in ext4_mb_init_cache()
  ext4: prepare buddy cache inode for BS > PS with large folios
  ext4: support large block size in ext4_mpage_readpages()
  ext4: support large block size in ext4_block_write_begin()
  ext4: support large block size in mpage_map_and_submit_buffers()
  ext4: support large block size in mpage_prepare_extent_to_map()
  ext4: make data=journal support large block size
  ext4: support verifying data from large folios with fs-verity
  ext4: add checks for large folio incompatibilities when BS > PS
  ext4: enable block size larger than page size

Zhihao Cheng (3):
  ext4: remove page offset calculation in ext4_block_zero_page_range()
  ext4: rename 'page' references to 'folio' in multi-block allocator
  ext4: support large block size in __ext4_block_zero_page_range()

 fs/ext4/dir.c       |   8 +--
 fs/ext4/ext4.h      |  26 ++++-----
 fs/ext4/ext4_jbd2.c |   3 +-
 fs/ext4/extents.c   |   2 +-
 fs/ext4/inode.c     |  93 ++++++++++++------------------
 fs/ext4/mballoc.c   | 137 +++++++++++++++++++++++---------------------
 fs/ext4/namei.c     |   8 +--
 fs/ext4/readpage.c  |   7 +--
 fs/ext4/super.c     |  66 +++++++++++++++++----
 fs/ext4/verity.c    |   2 +-
 10 files changed, 187 insertions(+), 165 deletions(-)

-- 
2.46.1

Re: [PATCH v2 00/24] ext4: enable block size larger than page size

Posted by Theodore Ts'o 2 months, 4 weeks ago

On Fri, Nov 07, 2025 at 10:42:25PM +0800, libaokun@huaweicloud.com wrote:
> `kvm-xfstests -c ext4/all -g auto` has been executed with no new failures.
> `kvm-xfstests -c ext4/64k -g auto` has been executed and no Oops was
> observed, but allocation failures for large folios may trigger warn_alloc()
> warnings.

I'm seeing some new failures.  ext4/4k -g auto is running without any
failures, but when I tried to run ext4/64k, I got:

ext4/64k: 607 tests, 16 failures, 101 skipped, 7277 seconds
  Failures: ext4/033 generic/472 generic/493 generic/494 generic/495
    generic/496 generic/497 generic/554 generic/569 generic/620
    generic/636 generic/641 generic/643 generic/759 generic/760
  Flaky: generic/251: 80% (4/5)
Totals: 671 tests, 101 skipped, 79 failures, 0 errors, 6782s

Some of the test failures may be because I was only using a 5G test
and scratch device, and with a 64k block size, that might be too small.
But I tried using a 20G test device, and ext4/033 is still failing but
with a different error signature:

    --- tests/ext4/033.out      2025-11-06 22:04:13.000000000 -0500
    +++ /results/ext4/results-64k/ext4/033.out.bad      2025-11-11 17:57:31.149710364 -0500
    @@ -1,6 +1,8 @@
     QA output created by 033
     Figure out block size
     Format huge device
    +mount: /vdf: fsconfig() failed: Structure needs cleaning.
    +       dmesg(1) may have more information after failed mount system call.


I took a look at generic/472, and that appears to be a swap-on-file failure:

root@kvm-xfstests:~# /vtmp/mke2fs.static -t ext4 -b 65536 -Fq /dev/vdc
Warning: blocksize 65536 not usable on most systems.
/dev/vdc contains a ext4 file system
        created on Tue Nov 11 18:02:13 2025
root@kvm-xfstests:~# mount /dev/vdc /vdc
root@kvm-xfstests:~# fallocate -l 1G /vdc/swap
root@kvm-xfstests:~# mkswap /vdc/swap
mkswap: /vdc/swap: insecure permissions 0644, fix with: chmod 0600 /vdc/swap
Setting up swapspace version 1, size = 1024 MiB (1073737728 bytes)
no label, UUID=a6298248-abf1-42a1-b124-2f6b3be7f597
root@kvm-xfstests:~# swapon /vdc/swap
swapon: /vdc/swap: insecure permissions 0644, 0600 suggested.
swapon: /vdc/swap: swapon failed: Invalid argument
root@kvm-xfstests:~# 

A number of the other tests (generic/493, generic/494, generic/495,
generic/496, generic/497, generic/554) are all swapfile tests.

I'm not sure why you're not seeing these issues; what version of
xfstests are you using?  I recently uploaded a new test appliance[1];
can you try rerunning your tests with the latest test appliance for
kvm-xfstests?

[1] https://www.kernel.org/pub/linux/kernel/people/tytso/kvm-xfstests

					- Ted

Re: [PATCH v2 00/24] ext4: enable block size larger than page size

Posted by Baokun Li 2 months, 4 weeks ago

On 2025-11-12 07:54, Theodore Ts'o wrote:
> On Fri, Nov 07, 2025 at 10:42:25PM +0800, libaokun@huaweicloud.com wrote:
>> `kvm-xfstests -c ext4/all -g auto` has been executed with no new failures.
>> `kvm-xfstests -c ext4/64k -g auto` has been executed and no Oops was
>> observed, but allocation failures for large folios may trigger warn_alloc()
>> warnings.
> I'm seeing some new failures.  ext4/4k -g auto is running without any
> failures, but when I tried to run ext4/64k, I got:
>
> ext4/64k: 607 tests, 16 failures, 101 skipped, 7277 seconds
>   Failures: ext4/033 generic/472 generic/493 generic/494 generic/495
>     generic/496 generic/497 generic/554 generic/569 generic/620
>     generic/636 generic/641 generic/643 generic/759 generic/760
>   Flaky: generic/251: 80% (4/5)
> Totals: 671 tests, 101 skipped, 79 failures, 0 errors, 6782s
>
> Some of the test failures may be because I was only using a 5G test
> and scratch device, and with a 64k block size, that might be too small.
> But I tried using a 20G test device, and ext4/033 is still failing but
> with a different error signature:
>
>     --- tests/ext4/033.out      2025-11-06 22:04:13.000000000 -0500
>     +++ /results/ext4/results-64k/ext4/033.out.bad      2025-11-11 17:57:31.149710364 -0500
>     @@ -1,6 +1,8 @@
>      QA output created by 033
>      Figure out block size
>      Format huge device
>     +mount: /vdf: fsconfig() failed: Structure needs cleaning.
>     +       dmesg(1) may have more information after failed mount system call.
>
>
> I took a look at generic/472, and that appears to be a swap-on-file failure:
>
> root@kvm-xfstests:~# /vtmp/mke2fs.static -t ext4 -b 65536 -Fq /dev/vdc
> Warning: blocksize 65536 not usable on most systems.
> /dev/vdc contains a ext4 file system
>         created on Tue Nov 11 18:02:13 2025
> root@kvm-xfstests:~# mount /dev/vdc /vdc
> root@kvm-xfstests:~# fallocate -l 1G /vdc/swap
> root@kvm-xfstests:~# mkswap /vdc/swap
> mkswap: /vdc/swap: insecure permissions 0644, fix with: chmod 0600 /vdc/swap
> Setting up swapspace version 1, size = 1024 MiB (1073737728 bytes)
> no label, UUID=a6298248-abf1-42a1-b124-2f6b3be7f597
> root@kvm-xfstests:~# swapon /vdc/swap
> swapon: /vdc/swap: insecure permissions 0644, 0600 suggested.
> swapon: /vdc/swap: swapon failed: Invalid argument
> root@kvm-xfstests:~# 

I checked the code of the swapon syscall in mm/swapfile.c, and currently
the swapfile does not support LBS. Therefore, some failing test cases can
be filtered out based on this.

         /*
          * The swap subsystem needs a major overhaul to support this.
          * It doesn't work yet so just disable it for now.
          */
         if (mapping_min_folio_order(mapping) > 0) {
                 error = -EINVAL;
                 goto bad_swap_unlock_inode;
         }

Regards,
Baokun

> A number of the other tests (generic/493, generic/494, generic/495,
> generic/496, generic/497, generic/554) are all swapfile tests.
>
> I'm not sure why you're not seeing these issues; what version of
> xfstests are you using?  I recently uploaded a new test appliance[1];
> can you try rerunning your tests with the latest test appliance for
> kvm-xfstests?
>
> [1] https://www.kernel.org/pub/linux/kernel/people/tytso/kvm-xfstests
>
> 					- Ted
>

Re: [PATCH v2 00/24] ext4: enable block size larger than page size

Posted by Theodore Ts'o 2 months, 4 weeks ago

On Wed, Nov 12, 2025 at 10:19:06AM +0800, Baokun Li wrote:
> I am using a slightly older version of xfstests, and when running the
> 64k tests I also encountered similar failures. The cover letter stated
> "no Oops" for the 64k tests rather than "no new failures," meaning that
> some cases did fail, but no severe issues such as BUG_ON or softlock
> were observed.

Sorry, I misread your cover letter.  It's good you are seeing similar
failures.


On Wed, Nov 12, 2025 at 10:49:19AM +0800, Baokun Li wrote:
> I checked the code of the swapon syscall in mm/swapfile.c, and currently
> the swapfile does not support LBS. Therefore, some failing test cases can
> be filtered out based on this.

Ah, OK. What's happening is that with XFS the swap tests are being skipped
automatically if the swapon fails.  From _require_scratch_swapfile:

	*)
		if ! swapon "$SCRATCH_MNT/swap" >/dev/null 2>&1; then
			_scratch_unmount
			_notrun "swapfiles are not supported"
		fi
		;;


But ext4 has different logic:

	# ext* has supported all variants of swap files since their
	# introduction, so swapon should not fail.

<< famous last words >>

	case "$FSTYP" in
	ext2|ext3|ext4)
		if ! swapon "$SCRATCH_MNT/swap" >/dev/null 2>&1; then
			if _check_s_dax "$SCRATCH_MNT/swap" 1 >/dev/null; then
				_scratch_unmount
				_notrun "swapfiles are not supported"
			else
				_scratch_unmount
				_fail "swapon failed for $FSTYP"
			fi
		fi
		;;


I guess we could add logic to _require_scratch_swapfile in common/rc
to also _notrun if swapon fails and block size is greater than page
size.  Or I might just add an exclusion in my test appliance runner
for now for all tests in group swap.
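
The decision being proposed can be written out as a small table. The sketch below is a hypothetical model of the outcome of _require_scratch_swapfile with the extra block-size check added; it is not the common/rc code, and the function and parameter names are made up for illustration.

```python
# Hypothetical model of the proposed skip logic; not the common/rc code.

EXT_FAMILY = ("ext2", "ext3", "ext4")


def swapfile_test_action(fstyp: str, swapon_ok: bool, is_dax: bool,
                         block_size: int, page_size: int) -> str:
    """Return "run", "notrun" or "fail" for a swapfile test."""
    if swapon_ok:
        return "run"
    if fstyp not in EXT_FAMILY:
        # Other filesystems: a failed swapon just means "not supported".
        return "notrun"
    if is_dax:
        # Existing ext* exception: swapfiles are not supported on DAX files.
        return "notrun"
    if block_size > page_size:
        # Proposed addition: swapon rejects block size > page size.
        return "notrun"
    # Any other swapon failure on ext* is still treated as a real bug.
    return "fail"
```

For a 64k ext4 filesystem on 4k pages, swapfile_test_action("ext4", False, False, 65536, 4096) yields "notrun" instead of "fail", while a swapon failure on a 4k filesystem still fails the test.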

						- Ted

Re: [PATCH v2 00/24] ext4: enable block size larger than page size

Posted by Baokun Li 2 months, 3 weeks ago

On 2025-11-12 12:02, Theodore Ts'o wrote:
> On Wed, Nov 12, 2025 at 10:19:06AM +0800, Baokun Li wrote:
>> I am using a slightly older version of xfstests, and when running the
>> 64k tests I also encountered similar failures. The cover letter stated
>> "no Oops" for the 64k tests rather than "no new failures," meaning that
>> some cases did fail, but no severe issues such as BUG_ON or softlock
>> were observed.
> Sorry, I misread your cover letter.  It's good you are seeing similar
> failures.

Sorry, my description wasn’t clear enough.

>
>
> On Wed, Nov 12, 2025 at 10:49:19AM +0800, Baokun Li wrote:
>> I checked the code of the swapon syscall in mm/swapfile.c, and currently
>> the swapfile does not support LBS. Therefore, some failing test cases can
>> be filtered out based on this.
> Ah, OK. What's happening is that with XFS the swap tests are being skipped
> automatically if the swapon fails.  From _require_scratch_swapfile:
>
> 	*)
> 		if ! swapon "$SCRATCH_MNT/swap" >/dev/null 2>&1; then
> 			_scratch_unmount
> 			_notrun "swapfiles are not supported"
> 		fi
> 		;;
>
>
> But ext4 has different logic:
>
> 	# ext* has supported all variants of swap files since their
> 	# introduction, so swapon should not fail.
>
> << famous last words >>
😄
>
> 	case "$FSTYP" in
> 	ext2|ext3|ext4)
> 		if ! swapon "$SCRATCH_MNT/swap" >/dev/null 2>&1; then
> 			if _check_s_dax "$SCRATCH_MNT/swap" 1 >/dev/null; then
> 				_scratch_unmount
> 				_notrun "swapfiles are not supported"
> 			else
> 				_scratch_unmount
> 				_fail "swapon failed for $FSTYP"
> 			fi
> 		fi
> 		;;
>
>
> I guess we could add logic to _require_scratch_swapfile in common/rc
> to also _notrun if swapon fails and block size is greater than page
> size.  Or I might just add an exclusion in my test appliance runner
> for now for all tests in group swap.

Darrick’s reply in another thread has already made a similar change,
so we can apply that patch first for testing.


Cheers,
Baokun

Re: [PATCH v2 00/24] ext4: enable block size larger than page size

Posted by Theodore Ts'o 2 months, 3 weeks ago

On Wed, Nov 12, 2025 at 02:27:19PM +0800, Baokun Li wrote:
> Darrick’s reply in another thread has already made a similar change,
> so we can apply that patch first for testing.

I'll give that a try when I have a chance.  For now, here's a test run
using a version of my test appliance which excludes the swap group for
the config ext4/lbs, and which has a modified e2fsprogs (built from
the latest e2fsprogs git repo) which suppresses both warnings when
using large block sizes if the kernel has the blocksize_gt_pagesize
feature detected.

ext4/lbs: 595 tests, 6 failures, 101 skipped, 6656 seconds
  Failures: ext4/033 generic/620 generic/759 generic/760
  Flaky: generic/251: 60% (3/5)   generic/645: 40% (2/5)
Totals: 619 tests, 101 skipped, 25 failures, 0 errors, 6291s

Fixing all of these failures is not a blocker for getting this patchset
upstream, but it would be nice for us to figure out the root cause for
them, so we can decide whether it's better to exclude the tests for
now, or whether there's an easy fix.

Thanks,

					- Ted

Re: [PATCH v2 00/24] ext4: enable block size larger than page size

Posted by Baokun Li 2 months, 3 weeks ago

On 2025-11-12 23:29, Theodore Ts'o wrote:
> On Wed, Nov 12, 2025 at 02:27:19PM +0800, Baokun Li wrote:
>> Darrick’s reply in another thread has already made a similar change,
>> so we can apply that patch first for testing.
> I'll give that a try when I have a chance.  For now, here's a test run
> using a version of my test appliance which excludes the swap group for
> the config ext4/lbs, and which has a modified e2fsprogs (built from
> the latest e2fsprogs git repo) which suppresses both warnings when
> using large block sizes if the kernel has the blocksize_gt_pagesize
> feature detected.
>
> ext4/lbs: 595 tests, 6 failures, 101 skipped, 6656 seconds
>   Failures: ext4/033 generic/620 generic/759 generic/760
>   Flaky: generic/251: 60% (3/5)   generic/645: 40% (2/5)
> Totals: 619 tests, 101 skipped, 25 failures, 0 errors, 6291s
>
> Fixing all of these failures is not a blocker for getting this patchset
> upstream, but it would be nice for us to figure out the root cause for
> them, so we can decide whether it's better to exclude the tests for
> now, or whether there's an easy fix.

Thank you for your testing! I have analyzed the above failing cases, and
they are basically unrelated to this patch set. My analysis is as follows:

# generic/759 generic/760
Require CONFIG_HUGETLB_PAGE and CONFIG_HUGETLBFS enabled.

# generic/620
vdc needs at least 33G. Passed after replacing with a 2T disk. Suggest
putting this test case into exclude.

# ext4/033
1. With 64k block size, inodes_per_group=$((blksz*8)) does not hold;
2. Creating a 400+T snapshot and formatting it as a 64k ext4 filesystem
    requires more than 1T of disk space just for metadata;
3. With 64k block size ext4, when orphan file is enabled by default,
    it fails because orphan file size exceeds 8 << 20. Fixed in [1].
    [1]:
https://lore.kernel.org/r/20251113090122.2385797-1-libaokun@huaweicloud.com
After resolving the above issues, the test passes with a 2T disk. However,
since the inode number overflow is unrelated to block size, suggest putting
this test case into exclude.

# generic/645
This test checks that idmapped mounts behave correctly with complex user
namespaces. On my side the reproduction rate is very low, about 1/100.
Even before the code was merged, occasional failures also appeared in the
4k tests. Based on the test content, I think it is unrelated to LBS.

Cheers,
Baokun

Re: [PATCH v2 00/24] ext4: enable block size larger than page size

Posted by Darrick J. Wong 2 months, 4 weeks ago

On Tue, Nov 11, 2025 at 11:02:20PM -0500, Theodore Ts'o wrote:
> On Wed, Nov 12, 2025 at 10:19:06AM +0800, Baokun Li wrote:
> > I am using a slightly older version of xfstests, and when running the
> > 64k tests I also encountered similar failures. The cover letter stated
> > "no Oops" for the 64k tests rather than "no new failures," meaning that
> > some cases did fail, but no severe issues such as BUG_ON or softlock
> > were observed.
> 
> Sorry, I misread your cover letter.  It's good you are seeing similar
> failures.
> 
> 
> On Wed, Nov 12, 2025 at 10:49:19AM +0800, Baokun Li wrote:
> > I checked the code of the swapon syscall in mm/swapfile.c, and currently
> > the swapfile does not support LBS. Therefore, some failing test cases can
> > be filtered out based on this.
> 
> Ah, OK. What's happening is that with XFS the swap tests are being skipped
> automatically if the swapon fails.  From _require_scratch_swapfile:
> 
> 	*)
> 		if ! swapon "$SCRATCH_MNT/swap" >/dev/null 2>&1; then
> 			_scratch_unmount
> 			_notrun "swapfiles are not supported"
> 		fi
> 		;;
> 
> 
> But ext4 has different logic:
> 
> 	# ext* has supported all variants of swap files since their
> 	# introduction, so swapon should not fail.
> 
> << famous last words >>
> 
> 	case "$FSTYP" in
> 	ext2|ext3|ext4)
> 		if ! swapon "$SCRATCH_MNT/swap" >/dev/null 2>&1; then
> 			if _check_s_dax "$SCRATCH_MNT/swap" 1 >/dev/null; then
> 				_scratch_unmount
> 				_notrun "swapfiles are not supported"
> 			else
> 				_scratch_unmount
> 				_fail "swapon failed for $FSTYP"
> 			fi
> 		fi
> 		;;
> 
> 
> I guess we could add logic to _require_scratch_swapfile in common/rc
> to also _notrun if swapon fails and block size is greater than page
> size.  Or I might just add an exclusion in my test appliance runner
> for now for all tests in group swap.

https://lore.kernel.org/fstests/176169820051.1433624.4158113392739761085.stgit@frogsfrogsfrogs/T/#u

Hm?

--D

> 
> 						- Ted
>

Re: [PATCH v2 00/24] ext4: enable block size larger than page size

Posted by Baokun Li 2 months, 4 weeks ago

On 2025-11-12 07:54, Theodore Ts'o wrote:
> On Fri, Nov 07, 2025 at 10:42:25PM +0800, libaokun@huaweicloud.com wrote:
>> `kvm-xfstests -c ext4/all -g auto` has been executed with no new failures.
>> `kvm-xfstests -c ext4/64k -g auto` has been executed and no Oops was
>> observed, but allocation failures for large folios may trigger warn_alloc()
>> warnings.
> I'm seeing some new failures.  ext4/4k -g auto is running without any
> failures, but when I tried to run ext4/64k, I got:
>
> ext4/64k: 607 tests, 16 failures, 101 skipped, 7277 seconds
>   Failures: ext4/033 generic/472 generic/493 generic/494 generic/495
>     generic/496 generic/497 generic/554 generic/569 generic/620
>     generic/636 generic/641 generic/643 generic/759 generic/760
>   Flaky: generic/251: 80% (4/5)
> Totals: 671 tests, 101 skipped, 79 failures, 0 errors, 6782s
>
> Some of the test failures may be because I was only using a 5G test
> and scratch device, and with a 64k block size, that might be too small.
> But I tried using a 20G test device, and ext4/033 is still failing but
> with a different error signature:
>
>     --- tests/ext4/033.out      2025-11-06 22:04:13.000000000 -0500
>     +++ /results/ext4/results-64k/ext4/033.out.bad      2025-11-11 17:57:31.149710364 -0500
>     @@ -1,6 +1,8 @@
>      QA output created by 033
>      Figure out block size
>      Format huge device
>     +mount: /vdf: fsconfig() failed: Structure needs cleaning.
>     +       dmesg(1) may have more information after failed mount system call.
>
>
> I took a look at generic/472, and that appears to be a swap-on-file failure:
>
> root@kvm-xfstests:~# /vtmp/mke2fs.static -t ext4 -b 65536 -Fq /dev/vdc
> Warning: blocksize 65536 not usable on most systems.
> /dev/vdc contains a ext4 file system
>         created on Tue Nov 11 18:02:13 2025
> root@kvm-xfstests:~# mount /dev/vdc /vdc
> root@kvm-xfstests:~# fallocate -l 1G /vdc/swap
> root@kvm-xfstests:~# mkswap /vdc/swap
> mkswap: /vdc/swap: insecure permissions 0644, fix with: chmod 0600 /vdc/swap
> Setting up swapspace version 1, size = 1024 MiB (1073737728 bytes)
> no label, UUID=a6298248-abf1-42a1-b124-2f6b3be7f597
> root@kvm-xfstests:~# swapon /vdc/swap
> swapon: /vdc/swap: insecure permissions 0644, 0600 suggested.
> swapon: /vdc/swap: swapon failed: Invalid argument
> root@kvm-xfstests:~# 
>
> A number of the other tests (generic/493, generic/494, generic/495,
> generic/496, generic/497, generic/554) are all swapfile tests.
>
> I'm not sure why you're not seeing these issues; what version of
> xfstests are you using?  I recently uploaded a new test appliance[1];
> can you try rerunning your tests with the latest test appliance for
> kvm-xfstests?
>
> [1] https://www.kernel.org/pub/linux/kernel/people/tytso/kvm-xfstests
>
> 					- Ted
>
I am using a slightly older version of xfstests, and when running the
64k tests I also encountered similar failures. The cover letter stated
"no Oops" for the 64k tests rather than "no new failures," meaning that
some cases did fail, but no severe issues such as BUG_ON or softlock
were observed.

I had been traveling frequently and didn’t have time to analyze them.
In October, Pankaj asked about ext4 LBS progress and offered to help with
testing/review once the patches were out, so I rebased the existing code
and sent it out.

The analysis of the failing cases has been ongoing, but it keeps getting
interrupted by various high‑priority internal tasks. In the next few days
I will make time to analyze the failing cases and optimize the checksum
performance issues introduced by large blocks.

Below are my previous 64k test results:

-------------------- Summary report
KERNEL:    kernel 6.18.0-rc4-xfstests-00041-g13ad1f4f1378 #1007 SMP PREEMPT_DYNAMIC Tue Nov 11 16:55:01 CST 2025 x86_64
CPUS:      2
MEM:       7944.36

ext4/64k: 563 tests, 20 failures, 81 skipped, 4992 seconds
  Failures: ext4/033 ext4/048 generic/219 generic/251 generic/436
    generic/472 generic/493 generic/494 generic/495 generic/496
    generic/497 generic/554 generic/563 generic/569 generic/620
    generic/636 generic/641 generic/643
  Flaky: generic/320: 80% (4/5)   generic/347: 60% (3/5)
Totals: 643 tests, 81 skipped, 97 failures, 0 errors, 4652s

FSTESTVER: blktests 698f1a0 (Mon, 27 May 2024 11:30:36 +0900)
FSTESTVER: fio  fio-3.28 (Wed, 8 Sep 2021 08:59:48 -0600)
FSTESTVER: fsverity v1.6 (Wed, 20 Mar 2024 21:21:46 -0700)
FSTESTVER: libaio   libaio-0.3.108-81-g1b18bfa (Mon, 28 Mar 2022 11:30:33 -0400)
FSTESTVER: quota  v4.05-43-gd2256ac (Fri, 17 Sep 2021 14:04:16 +0200)
FSTESTVER: xfsprogs v5.13.0 (Fri, 20 Aug 2021 12:03:57 -0400)
FSTESTVER: xfstests-bld 1bdd10a-dirty (Fri, 3 May 2024 16:14:41 -0400)
FSTESTVER: xfstests v2024.05.12 (Sun, 12 May 2024 20:28:48 +0800)
FSTESTCFG: ext4/64k
FSTESTSET: -g auto
FSTESTOPT: aex
Truncating test artifacts in /results to 31k


Cheers,
Baokun

Re: [PATCH v2 00/24] ext4: enable block size larger than page size

Posted by Theodore Ts'o 3 months ago

I've started looking at this patch series and playing with it, and one
thing which is worth noting is that CONFIG_TRANSPARENT_HUGEPAGE needs
to be enabled, or else sb_set_blocksize() will fail for block size >
page size.  This isn't specific to ext4, and maybe I'm missing
something, but apparently this isn't documented.  I had to go digging
through the source code to figure out what was needed.

I wonder if we should have some kind of warning in sb_set_blocksize()
where if there is an attempt to set a blocksize > page size and
transparent hugepages is not configured, we issue a printk_once()
giving a hint to the user that the reason that the mount failed was
because transparent hugepages wasn't enabled at compile time.
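
As a rough model of that idea: the check below rejects a block size larger than the page size when large-folio support is absent, and emits the hint only once, in the spirit of printk_once(). This is a simplified, hypothetical sketch. It is not sb_set_blocksize(), it ignores the other validity checks a real implementation performs, and every name in it is made up for illustration.

```python
# Hypothetical sketch of a one-time hint; not the kernel's sb_set_blocksize().

_hint_emitted = False  # module-level flag standing in for printk_once()


def check_blocksize(block_size: int, page_size: int, thp_enabled: bool,
                    emit=print) -> bool:
    """Return True if the block size is acceptable under this simplified model."""
    global _hint_emitted
    if block_size <= page_size:
        return True
    if thp_enabled:
        # Block size > page size relies on large folios, which currently
        # depend on CONFIG_TRANSPARENT_HUGEPAGE.
        return True
    if not _hint_emitted:
        emit(f"block size {block_size} > page size {page_size} requires "
             "CONFIG_TRANSPARENT_HUGEPAGE")
        _hint_emitted = True
    return False
```

Calling check_blocksize(65536, 4096, False) twice returns False both times but prints the hint only on the first call.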

It **really** isn't obvious that large block size support and
transparent hugepages are linked.

					- Ted

Re: [PATCH v2 00/24] ext4: enable block size larger than page size

Posted by Pankaj Raghav 2 months, 4 weeks ago

On 11/10/25 05:32, Theodore Ts'o wrote:
> I've started looking at this patch series and playing with it, and one
> thing which is worth noting is that CONFIG_TRANSPARENT_HUGEPAGE needs
> to be enabled, or else sb_set_blocksize() will fail for block size >
> page size.  This isn't specific to ext4, and maybe I'm missing
> something, but apparently this isn't documented.  I had to go digging
> through the source code to figure out what was needed.
> 
> I wonder if we should have some kind of warning in sb_set_blocksize()
> where if there is an attempt to set a blocksize > page size and
> transparent hugepages is not configured, we issue a printk_once()
> giving a hint to the user that the reason that the mount failed was
> because transparent hugepages wasn't enabled at compile time.
> 

I added something similar for block devices[1]. We probably need something
similar here as well, as a stopgap.

> It **really** isn't obvious that large block size support and
> transparent hugepages are linked.

Funny that you mention this, because I have a talk about this topic at LPC
under the MM MC: Decoupling Large Folios from Transparent Huge Pages [2].
You are more than welcome to come to the talk :)

But just a small summary: When large folios were introduced, they used the
THP infrastructure for splitting the folios (for example when we do a truncate).

I hope we will soon be able to sort it out so that we don't have
to sprinkle CONFIG_THP everywhere.

--
Pankaj

[1] https://lore.kernel.org/all/20250704092134.289491-1-p.raghav@samsung.com/
[2] https://lpc.events/event/19/contributions/2139/

> 					- Ted
>

Re: [PATCH v2 00/24] ext4: enable block size larger than page size

Posted by Theodore Ts'o 2 months, 4 weeks ago

On Mon, Nov 10, 2025 at 04:34:47PM +0100, Pankaj Raghav wrote:
> 
> I added something similar for block devices[1]. We probably need something
> similar here as well, as a stopgap.
>
> [1] https://lore.kernel.org/all/20250704092134.289491-1-p.raghav@samsung.com/

Yeah, this is precisely the code that I ran into; it's good that we're
not triggering a panic if we try mounting a file system with a large
block size, but when trying to mount a file system with a large
blocksize w/o CONFIG_TRANSPARENT_HUGEPAGE, we get:

[   33.211382] XFS (vdc): block size (65536 bytes) not supported; Only block size (4096) or less is supported
mount: /vdc: fsconfig() failed: Function not implemented.
       dmesg(1) may have more information after failed mount system call.

or

[   78.537420] EXT4-fs (vdc): bad block size 65536

Pity the poor user who is trying to use large block sizes, and who
didn't bother to enable transparent hugepages because they didn't need
it.  Fortunately most distributions tend to enable THP.

> Funny that you mention this, because I have a talk about this topic at LPC
> under the MM MC: Decoupling Large Folios from Transparent Huge Pages [2].
> You are more than welcome to come to the talk :)

Cool!  So if we're going to change it, perhaps we should have an
explicit CONFIG option, say, CONFIG_FS_LARGE_BLOCKSIZE which enables
bs > ps.  This might allow us to remove some amount of code for those
embedded applications that don't need large block sizes, but more
importantly, we can have it automatically enable whatever dependencies
are needed --- and if it changes later, we can have the kernel
config DTRT automatically.

						- Ted

Re: [PATCH v2 00/24] ext4: enable block size larger than page size

Posted by Baokun Li 3 months ago

On 2025-11-10 12:32, Theodore Ts'o wrote:
> I've started looking at this patch series and playing with it, and one
> thing which is worth noting is that CONFIG_TRANSPARENT_HUGEPAGE needs
> to be enabled, or else sb_set_blocksize() will fail for block size >
> page size.  This isn't specific to ext4, and maybe I'm missing
> something, but apparently this isn't documented.  I had to go digging
> through the source code to figure out what was needed.
>
> I wonder if we should have some kind of warning in sb_set_blocksize()
> where if there is an attempt to set a blocksize > page size and
> transparent hugepages is not configured, we issue a printk_once()
> giving a hint to the user that the reason that the mount failed was
> because transparent hugepages wasn't enabled at compile time.
>
> It **really** isn't obvious that large block size support and
> transparent hugepages are linked.
>
Thank you for the review!

Yes, supporting block sizes larger than the page size requires large
folios, so it is indeed necessary to enable CONFIG_TRANSPARENT_HUGEPAGE
to support large folios. Because the code is wrapped in multiple layers,
the connection between the two is somewhat hidden, and users may not
notice it or know how to enable LBS.

I will add some hints in sb_set_blocksize to make users aware of this
relationship. Thanks for the suggestion!

Cheers,
Baokun