[v2] mm/shmem: optimize read with reduced xarray lookups and folio batching

[PATCH v2 0/5] mm/shmem: optimize read with reduced xarray lookups and folio batching

Posted by Chi Zhiling 1 week ago

From: Chi Zhiling <chizhiling@kylinos.cn>

This series improves shmem read performance by implementing folio
batching in the read path and reducing unnecessary xarray lookups.

Performance Results:

fio --ioengine=sync --rw=read --bs=$1 --size=1G --runtime=180 --time_based --group_reporting --name=seq_read_test --filename=testfile

| THP disabled in tmpfs  | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
| ---------------------- | ------------ | ----------------- | ----------- |
| 1M + normal file       | bw=11.5GiB/s | bw=12.7GiB/s      | +10.4%      |
| 64k + normal file      | bw=11.0GiB/s | bw=12.3GiB/s      | +11.8%      |
| 4k + normal file       | bw=3826MiB/s | bw=3849MiB/s      | +0.6%       |
| 1M + fallocated file   | bw=23.8GiB/s | bw=28.6GiB/s      | +20.2%      |
| 64k + fallocated file  | bw=22.5GiB/s | bw=27.3GiB/s      | +21.3%      |
| 4k + fallocated file   | bw=4655MiB/s | bw=4680MiB/s      | +0.5%       |
| 1M + hole              | bw=24.2GiB/s | bw=28.6GiB/s      | +18.2%      |
| 64k + hole             | bw=22.6GiB/s | bw=27.6GiB/s      | +22.1%      |
| 4k + hole              | bw=4652MiB/s | bw=4489MiB/s      | -3.5%       |


| THP enabled in tmpfs  | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
| --------------------- | ------------ | ----------------- | ----------- |
| 1M + normal file      | bw=13.7GiB/s | bw=13.9GiB/s      | +1.4%       |
| 64k + normal file     | bw=13.5GiB/s | bw=13.5GiB/s      | +0.0%       |
| 4k + normal file      | bw=3833MiB/s | bw=3859MiB/s      | +0.7%       |
| 1M + fallocated file  | bw=24.9GiB/s | bw=34.2GiB/s      | +37.3%      |
| 64k + fallocated file | bw=23.0GiB/s | bw=31.4GiB/s      | +36.5%      |
| 4k + fallocated file  | bw=4710MiB/s | bw=4655MiB/s      | -1.2%       |
| 1M + hole             | bw=24.3GiB/s | bw=34.5GiB/s      | +42.0%      |
| 64k + hole            | bw=23.5GiB/s | bw=31.1GiB/s      | +32.3%      |
| 4k + hole             | bw=4690MiB/s | bw=4647MiB/s      | -0.9%       |
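
The Improvement column is consistent with the relative change in bandwidth, (patched - baseline) / baseline. Below is a minimal sketch of that calculation, assuming both values in a row use the same unit. The function names are illustrative and not part of this series. Because the inputs are the rounded figures from the tables, a recomputed value can differ from the posted one in the last digit.

```python
def improvement_pct(baseline: float, patched: float) -> float:
    """Relative bandwidth change of patched vs. baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline bandwidth must be positive")
    return (patched - baseline) / baseline * 100.0


def format_improvement(baseline: float, patched: float) -> str:
    """Format the change with an explicit sign and one decimal place."""
    return f"{improvement_pct(baseline, patched):+.1f}%"


if __name__ == "__main__":
    # Rows taken from the tables above; each pair shares one unit.
    rows = [
        ("1M + normal file, THP disabled", 11.5, 12.7),   # GiB/s
        ("4k + hole, THP disabled", 4652, 4489),          # MiB/s
    ]
    for name, base, patched in rows:
        print(f"{name}: {format_improvement(base, patched)}")
```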


v1:
https://lore.kernel.org/linux-mm/20260520101538.58745-1-chizhiling@163.com/#t
rfc:
https://lore.kernel.org/linux-fsdevel/20260515094702.1092355-1-chizhiling@163.com/


Chi Zhiling (5):
  mm/filemap: reduce unnecessary xarray lookups when read cached pages
  mm/filemap: reduce xarray lookups in filemap_get_folios_contig()
  mm/shmem: introduce copy_zero_to_iter() for large zeroing
  mm/shmem: remove page-copy fallback in shmem read path
  mm/shmem: optimize file read with folio batching

 mm/filemap.c |  46 +++++++++++--------
 mm/shmem.c   | 126 +++++++++++++++++++++++++++++++++++----------------
 2 files changed, 113 insertions(+), 59 deletions(-)

-- 
2.43.0

Re: [PATCH v2 0/5] mm/shmem: optimize read with reduced xarray lookups and folio batching

Posted by Chi Zhiling 6 days, 23 hours ago

On 6/1/26 13:56, Chi Zhiling wrote:
> Performance Results:
> 
> fio --ioengine=sync --rw=read --bs=$1 --size=1G --runtime=180 --time_based --group_reporting --name=seq_read_test --filename=testfile
> 
> | THP disabled in tmpfs  | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
> | ---------------------- | ------------ | ----------------- | ----------- |
> | 1M + normal file       | bw=11.5GiB/s | bw=12.7GiB/s      | +10.4%      |
> | 64k + normal file      | bw=11.0GiB/s | bw=12.3GiB/s      | +11.8%      |
> | 4k + normal file       | bw=3826MiB/s | bw=3849MiB/s      | +0.6%       |
> | 1M + fallocated file   | bw=23.8GiB/s | bw=28.6GiB/s      | +20.2%      |
> | 64k + fallocated file  | bw=22.5GiB/s | bw=27.3GiB/s      | +21.3%      |
> | 4k + fallocated file   | bw=4655MiB/s | bw=4680MiB/s      | +0.5%       |
> | 1M + hole              | bw=24.2GiB/s | bw=28.6GiB/s      | +18.2%      |
> | 64k + hole             | bw=22.6GiB/s | bw=27.6GiB/s      | +22.1%      |
> | 4k + hole              | bw=4652MiB/s | bw=4489MiB/s      | -3.5%       |
> 
> 
> | THP enabled in tmpfs  | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
> | --------------------- | ------------ | ----------------- | ----------- |
> | 1M + normal file      | bw=13.7GiB/s | bw=13.9GiB/s      | +1.4%       |
> | 64k + normal file     | bw=13.5GiB/s | bw=13.5GiB/s      | +0.0%       |
> | 4k + normal file      | bw=3833MiB/s | bw=3859MiB/s      | +0.7%       |
> | 1M + fallocated file  | bw=24.9GiB/s | bw=34.2GiB/s      | +37.3%      |
> | 64k + fallocated file | bw=23.0GiB/s | bw=31.4GiB/s      | +36.5%      |
> | 4k + fallocated file  | bw=4710MiB/s | bw=4655MiB/s      | -1.2%       |
> | 1M + hole             | bw=24.3GiB/s | bw=34.5GiB/s      | +42.0%      |
> | 64k + hole            | bw=23.5GiB/s | bw=31.1GiB/s      | +32.3%      |
> | 4k + hole             | bw=4690MiB/s | bw=4647MiB/s      | -0.9%       |
> 
Apologies: due to an oversight on my part, the hole tests were invalid
because the holes were never actually created in the test files.

Below are the corrected results from a retest:

| THP disabled | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
| ------------ | ------------ | ----------------- | ----------- |
| 1M + hole    | bw=27.3GiB/s | bw=23.4GiB/s      | -14.3%      |
| 64k + hole   | bw=27.3GiB/s | bw=23.3GiB/s      | -14.7%      |
| 4k + hole    | bw=4825MiB/s | bw=4624MiB/s      | -4.2%       |


| THP enabled | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
| ----------- | ------------ | ----------------- | ----------- |
| 1M + hole   | bw=27.0GiB/s | bw=23.1GiB/s      | -14.4%      |
| 64k + hole  | bw=27.5GiB/s | bw=23.3GiB/s      | -15.3%      |
| 4k + hole   | bw=4777MiB/s | bw=4640MiB/s      | -2.9%       |


There is a noticeable performance drop when reading holes, because every
such read triggers a fallback. I will address this in the next version.
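
For anyone reproducing the hole case, one way to confirm that a test file really contains a hole before running fio is to ask the kernel via lseek() with SEEK_HOLE. The sketch below is a hypothetical helper, not part of this series or of the test setup used above. It assumes Linux and a filesystem that reports holes; a filesystem without that support reports the whole file as data, so the check returns False there.

```python
import os
import tempfile


def make_sparse_file(path: str, size: int) -> None:
    """Create a file of the given size without writing any data."""
    with open(path, "wb") as f:
        f.truncate(size)


def has_hole(path: str) -> bool:
    """Return True if the file contains a hole before end-of-file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        if size == 0:
            return False
        # SEEK_HOLE yields the offset of the first hole at or after 0.
        # Every file has an implicit hole at EOF, so only an earlier
        # offset indicates a real hole.
        return os.lseek(fd, 0, os.SEEK_HOLE) < size
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sparse = os.path.join(tmp, "sparse")
        written = os.path.join(tmp, "written")
        make_sparse_file(sparse, 1 << 20)
        with open(written, "wb") as f:
            f.write(b"x" * (1 << 20))
        print("sparse file has hole: ", has_hole(sparse))
        print("written file has hole:", has_hole(written))
```

Running the check against the fio target file (on the tmpfs mount under test) before each run would catch the case where the file was unintentionally populated.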

Re: [PATCH v2 0/5] mm/shmem: optimize read with reduced xarray lookups and folio batching

Posted by Andrew Morton 6 days, 5 hours ago

On Mon,  1 Jun 2026 13:56:59 +0800 Chi Zhiling <chizhiling@163.com> wrote:

> From: Chi Zhiling <chizhiling@kylinos.cn>
> 
> This series improves shmem read performance by implementing folio
> batching in the read path and reducing unnecessary xarray lookups.
> 
> Performance Results:
> 
> fio --ioengine=sync --rw=read --bs=$1 --size=1G --runtime=180 --time_based --group_reporting --name=seq_read_test --filename=testfile
> 
> | THP disabled in tmpfs  | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
> | ---------------------- | ------------ | ----------------- | ----------- |
> | 1M + normal file       | bw=11.5GiB/s | bw=12.7GiB/s      | +10.4%      |
> | 64k + normal file      | bw=11.0GiB/s | bw=12.3GiB/s      | +11.8%      |
> | 4k + normal file       | bw=3826MiB/s | bw=3849MiB/s      | +0.6%       |
> | 1M + fallocated file   | bw=23.8GiB/s | bw=28.6GiB/s      | +20.2%      |
> | 64k + fallocated file  | bw=22.5GiB/s | bw=27.3GiB/s      | +21.3%      |
> | 4k + fallocated file   | bw=4655MiB/s | bw=4680MiB/s      | +0.5%       |
> | 1M + hole              | bw=24.2GiB/s | bw=28.6GiB/s      | +18.2%      |
> | 64k + hole             | bw=22.6GiB/s | bw=27.6GiB/s      | +22.1%      |
> | 4k + hole              | bw=4652MiB/s | bw=4489MiB/s      | -3.5%       |
> 
> 
> | THP enabled in tmpfs  | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
> | --------------------- | ------------ | ----------------- | ----------- |
> | 1M + normal file      | bw=13.7GiB/s | bw=13.9GiB/s      | +1.4%       |
> | 64k + normal file     | bw=13.5GiB/s | bw=13.5GiB/s      | +0.0%       |
> | 4k + normal file      | bw=3833MiB/s | bw=3859MiB/s      | +0.7%       |
> | 1M + fallocated file  | bw=24.9GiB/s | bw=34.2GiB/s      | +37.3%      |
> | 64k + fallocated file | bw=23.0GiB/s | bw=31.4GiB/s      | +36.5%      |
> | 4k + fallocated file  | bw=4710MiB/s | bw=4655MiB/s      | -1.2%       |
> | 1M + hole             | bw=24.3GiB/s | bw=34.5GiB/s      | +42.0%      |
> | 64k + hole            | bw=23.5GiB/s | bw=31.1GiB/s      | +32.3%      |
> | 4k + hole             | bw=4690MiB/s | bw=4647MiB/s      | -0.9%       |
> 

That looks nice.

Microbenchmarks are useful, but are you able to help us understand how
much benefit our users might see in real-world workloads?

I'll take no action at this time - it's late in the cycle and reviewers
have yet to participate.

AI review flagged a few possible issues, so please take a look:
	https://sashiko.dev/#/patchset/20260601055704.167436-1-chizhiling@163.com

Re: [PATCH v2 0/5] mm/shmem: optimize read with reduced xarray lookups and folio batching

Posted by Chi Zhiling 6 days, 3 hours ago

On 6/2/26 08:43, Andrew Morton wrote:
> On Mon,  1 Jun 2026 13:56:59 +0800 Chi Zhiling <chizhiling@163.com> wrote:
> 
>> From: Chi Zhiling <chizhiling@kylinos.cn>
>>
>> This series improves shmem read performance by implementing folio
>> batching in the read path and reducing unnecessary xarray lookups.
>>
>> Performance Results:
>>
>> fio --ioengine=sync --rw=read --bs=$1 --size=1G --runtime=180 --time_based --group_reporting --name=seq_read_test --filename=testfile
>>
>> | THP disabled in tmpfs  | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
>> | ---------------------- | ------------ | ----------------- | ----------- |
>> | 1M + normal file       | bw=11.5GiB/s | bw=12.7GiB/s      | +10.4%      |
>> | 64k + normal file      | bw=11.0GiB/s | bw=12.3GiB/s      | +11.8%      |
>> | 4k + normal file       | bw=3826MiB/s | bw=3849MiB/s      | +0.6%       |
>> | 1M + fallocated file   | bw=23.8GiB/s | bw=28.6GiB/s      | +20.2%      |
>> | 64k + fallocated file  | bw=22.5GiB/s | bw=27.3GiB/s      | +21.3%      |
>> | 4k + fallocated file   | bw=4655MiB/s | bw=4680MiB/s      | +0.5%       |
>> | 1M + hole              | bw=24.2GiB/s | bw=28.6GiB/s      | +18.2%      |
>> | 64k + hole             | bw=22.6GiB/s | bw=27.6GiB/s      | +22.1%      |
>> | 4k + hole              | bw=4652MiB/s | bw=4489MiB/s      | -3.5%       |
>>
>>
>> | THP enabled in tmpfs  | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
>> | --------------------- | ------------ | ----------------- | ----------- |
>> | 1M + normal file      | bw=13.7GiB/s | bw=13.9GiB/s      | +1.4%       |
>> | 64k + normal file     | bw=13.5GiB/s | bw=13.5GiB/s      | +0.0%       |
>> | 4k + normal file      | bw=3833MiB/s | bw=3859MiB/s      | +0.7%       |
>> | 1M + fallocated file  | bw=24.9GiB/s | bw=34.2GiB/s      | +37.3%      |
>> | 64k + fallocated file | bw=23.0GiB/s | bw=31.4GiB/s      | +36.5%      |
>> | 4k + fallocated file  | bw=4710MiB/s | bw=4655MiB/s      | -1.2%       |
>> | 1M + hole             | bw=24.3GiB/s | bw=34.5GiB/s      | +42.0%      |
>> | 64k + hole            | bw=23.5GiB/s | bw=31.1GiB/s      | +32.3%      |
>> | 4k + hole             | bw=4690MiB/s | bw=4647MiB/s      | -0.9%       |
>>
> 
> That looks nice.
> 
> Microbenchmarks are useful, but are you able to help us understand how
> much benefit our users might see in real-world workloads?

Hi Andrew,

I don't have real-world performance data yet. I'm working on this simply 
because the patch shows decent gains in microbenchmarks. Even with THP 
enabled, it can still reduce some unnecessary overhead.

> 
> I'll take no action at this time - it's late in the cycle and reviewers
> have yet to participate.

Yes, it's unlikely to land in 7.2, and I still need to resolve some 
performance regressions.

> 
> AI review flagged a few possible issues, so please take a look:
> 	https://sashiko.dev/#/patchset/20260601055704.167436-1-chizhiling@163.com

Okay, I will take a close look.


Thanks!