From: Chi Zhiling <chizhiling@kylinos.cn>

This series improves shmem read performance by implementing folio
batching in the read path and reducing unnecessary xarray lookups.

Performance Results:

fio --ioengine=sync --rw=read --bs=$1 --size=1G --runtime=180 --time_based --group_reporting --name=seq_read_test --filename=testfile

| THP disabled in tmpfs  | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
| ---------------------- | ------------ | ----------------- | ----------- |
| 1M + normal file       | bw=11.5GiB/s | bw=12.7GiB/s      | +10.4%      |
| 64k + normal file      | bw=11.0GiB/s | bw=12.3GiB/s      | +11.8%      |
| 4k + normal file       | bw=3826MiB/s | bw=3849MiB/s      | +0.6%       |
| 1M + fallocated file   | bw=23.8GiB/s | bw=28.6GiB/s      | +20.2%      |
| 64k + fallocated file  | bw=22.5GiB/s | bw=27.3GiB/s      | +21.3%      |
| 4k + fallocated file   | bw=4655MiB/s | bw=4680MiB/s      | +0.5%       |
| 1M + hole              | bw=24.2GiB/s | bw=28.6GiB/s      | +18.2%      |
| 64k + hole             | bw=22.6GiB/s | bw=27.6GiB/s      | +22.1%      |
| 4k + hole              | bw=4652MiB/s | bw=4489MiB/s      | -3.5%       |

| THP enabled in tmpfs  | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
| --------------------- | ------------ | ----------------- | ----------- |
| 1M + normal file      | bw=13.7GiB/s | bw=13.9GiB/s      | +1.4%       |
| 64k + normal file     | bw=13.5GiB/s | bw=13.5GiB/s      | +0.0%       |
| 4k + normal file      | bw=3833MiB/s | bw=3859MiB/s      | +0.7%       |
| 1M + fallocated file  | bw=24.9GiB/s | bw=34.2GiB/s      | +37.3%      |
| 64k + fallocated file | bw=23.0GiB/s | bw=31.4GiB/s      | +36.5%      |
| 4k + fallocated file  | bw=4710MiB/s | bw=4655MiB/s      | -1.2%       |
| 1M + hole             | bw=24.3GiB/s | bw=34.5GiB/s      | +42.0%      |
| 64k + hole            | bw=23.5GiB/s | bw=31.1GiB/s      | +32.3%      |
| 4k + hole             | bw=4690MiB/s | bw=4647MiB/s      | -0.9%       |

v1: https://lore.kernel.org/linux-mm/20260520101538.58745-1-chizhiling@163.com/#t
rfc: https://lore.kernel.org/linux-fsdevel/20260515094702.1092355-1-chizhiling@163.com/

Chi Zhiling (5):
  mm/filemap: reduce unnecessary xarray lookups when read cached pages
  mm/filemap: reduce xarray lookups in filemap_get_folios_contig()
  mm/shmem: introduce copy_zero_to_iter() for large zeroing
  mm/shmem: remove page-copy fallback in shmem read path
  mm/shmem: optimize file read with folio batching

 mm/filemap.c |  46 +++++++++++--------
 mm/shmem.c   | 126 +++++++++++++++++++++++++++++++++++----------------
 2 files changed, 113 insertions(+), 59 deletions(-)

--
2.43.0
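
The Improvement column appears to be the relative change in fio bandwidth against the v7.1-rc5 baseline. A minimal Python sketch of that arithmetic, assuming this reading is correct (the helper name `improvement_pct` is made up for illustration and is not part of the series):

```python
def improvement_pct(baseline: float, patched: float) -> float:
    """Relative bandwidth change of `patched` versus `baseline`, in percent.

    Both values must use the same unit (GiB/s or MiB/s); the unit cancels.
    """
    return (patched - baseline) / baseline * 100.0


if __name__ == "__main__":
    # (baseline, patched) pairs copied from the "THP disabled" table above.
    rows = {
        "1M + normal file": (11.5, 12.7),   # GiB/s
        "4k + hole": (4652.0, 4489.0),      # MiB/s
    }
    for name, (base, new) in rows.items():
        print(f"{name}: {improvement_pct(base, new):+.1f}%")
```

Because the tables list bandwidth already rounded, recomputing from the printed values can differ from the posted column in the last digit.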
On 6/1/26 13:56, Chi Zhiling wrote:
> Performance Results:
>
> fio --ioengine=sync --rw=read --bs=$1 --size=1G --runtime=180 --time_based --group_reporting --name=seq_read_test --filename=testfile
>
> | THP disabled in tmpfs  | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
> | ---------------------- | ------------ | ----------------- | ----------- |
> | 1M + normal file       | bw=11.5GiB/s | bw=12.7GiB/s      | +10.4%      |
> | 64k + normal file      | bw=11.0GiB/s | bw=12.3GiB/s      | +11.8%      |
> | 4k + normal file       | bw=3826MiB/s | bw=3849MiB/s      | +0.6%       |
> | 1M + fallocated file   | bw=23.8GiB/s | bw=28.6GiB/s      | +20.2%      |
> | 64k + fallocated file  | bw=22.5GiB/s | bw=27.3GiB/s      | +21.3%      |
> | 4k + fallocated file   | bw=4655MiB/s | bw=4680MiB/s      | +0.5%       |
> | 1M + hole              | bw=24.2GiB/s | bw=28.6GiB/s      | +18.2%      |
> | 64k + hole             | bw=22.6GiB/s | bw=27.6GiB/s      | +22.1%      |
> | 4k + hole              | bw=4652MiB/s | bw=4489MiB/s      | -3.5%       |
>
>
> | THP enabled in tmpfs  | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
> | --------------------- | ------------ | ----------------- | ----------- |
> | 1M + normal file      | bw=13.7GiB/s | bw=13.9GiB/s      | +1.4%       |
> | 64k + normal file     | bw=13.5GiB/s | bw=13.5GiB/s      | +0.0%       |
> | 4k + normal file      | bw=3833MiB/s | bw=3859MiB/s      | +0.7%       |
> | 1M + fallocated file  | bw=24.9GiB/s | bw=34.2GiB/s      | +37.3%      |
> | 64k + fallocated file | bw=23.0GiB/s | bw=31.4GiB/s      | +36.5%      |
> | 4k + fallocated file  | bw=4710MiB/s | bw=4655MiB/s      | -1.2%       |
> | 1M + hole             | bw=24.3GiB/s | bw=34.5GiB/s      | +42.0%      |
> | 64k + hole            | bw=23.5GiB/s | bw=31.1GiB/s      | +32.3%      |
> | 4k + hole             | bw=4690MiB/s | bw=4647MiB/s      | -0.9%       |
>

Apologies: due to my oversight, the tests involving holes were incorrect.
The holes were not successfully created in the files during testing.

Below are the corrected results from a retest:

| THP disabled | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
| ------------ | ------------ | ----------------- | ----------- |
| 1M + hole    | bw=27.3GiB/s | bw=23.4GiB/s      | -14.3%      |
| 64k + hole   | bw=27.3GiB/s | bw=23.3GiB/s      | -14.7%      |
| 4k + hole    | bw=4825MiB/s | bw=4624MiB/s      | -4.2%       |

| THP enabled | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
| ----------- | ------------ | ----------------- | ----------- |
| 1M + hole   | bw=27.0GiB/s | bw=23.1GiB/s      | -14.4%      |
| 64k + hole  | bw=27.5GiB/s | bw=23.3GiB/s      | -15.3%      |
| 4k + hole   | bw=4777MiB/s | bw=4640MiB/s      | -2.9%       |

There is a noticeable performance drop when accessing holes, as every
read triggers a fallback. I will address this in the next version.
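
The thread does not say how the test files were prepared or why the holes were missing. As a hedged illustration only, the Python sketch below shows one way to create a file that consists entirely of a hole and to check, before benchmarking, that it really holds no data. The helper names `make_hole_file` and `has_data` are hypothetical; truncate-based extension and `lseek(SEEK_DATA)` are standard Linux interfaces, but a filesystem that does not track holes reports the whole file as data.

```python
import errno
import os
import tempfile


def make_hole_file(path: str, size: int) -> None:
    """Create a file of `size` bytes without writing any data to it."""
    with open(path, "wb") as f:
        f.truncate(size)  # extends the file length; nothing is written


def has_data(path: str) -> bool:
    """Return True if lseek(SEEK_DATA) finds any data in the file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, 0, os.SEEK_DATA)
        return True
    except OSError as e:
        if e.errno == errno.ENXIO:  # no data at or after offset 0
            return False
        raise
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Demo in a temporary directory; a real test would use a tmpfs mount.
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "testfile")
        make_hole_file(path, 1 << 30)  # 1 GiB, matching --size=1G
        print("size:", os.path.getsize(path))
        print("has data:", has_data(path))
```

Running such a check against the fio target file before each "hole" run would catch the case where the file was populated instead of left sparse.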
On Mon, 1 Jun 2026 13:56:59 +0800 Chi Zhiling <chizhiling@163.com> wrote:

> From: Chi Zhiling <chizhiling@kylinos.cn>
>
> This series improves shmem read performance by implementing folio
> batching in the read path and reducing unnecessary xarray lookups.
>
> Performance Results:
>
> fio --ioengine=sync --rw=read --bs=$1 --size=1G --runtime=180 --time_based --group_reporting --name=seq_read_test --filename=testfile
>
> | THP disabled in tmpfs  | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
> | ---------------------- | ------------ | ----------------- | ----------- |
> | 1M + normal file       | bw=11.5GiB/s | bw=12.7GiB/s      | +10.4%      |
> | 64k + normal file      | bw=11.0GiB/s | bw=12.3GiB/s      | +11.8%      |
> | 4k + normal file       | bw=3826MiB/s | bw=3849MiB/s      | +0.6%       |
> | 1M + fallocated file   | bw=23.8GiB/s | bw=28.6GiB/s      | +20.2%      |
> | 64k + fallocated file  | bw=22.5GiB/s | bw=27.3GiB/s      | +21.3%      |
> | 4k + fallocated file   | bw=4655MiB/s | bw=4680MiB/s      | +0.5%       |
> | 1M + hole              | bw=24.2GiB/s | bw=28.6GiB/s      | +18.2%      |
> | 64k + hole             | bw=22.6GiB/s | bw=27.6GiB/s      | +22.1%      |
> | 4k + hole              | bw=4652MiB/s | bw=4489MiB/s      | -3.5%       |
>
>
> | THP enabled in tmpfs  | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
> | --------------------- | ------------ | ----------------- | ----------- |
> | 1M + normal file      | bw=13.7GiB/s | bw=13.9GiB/s      | +1.4%       |
> | 64k + normal file     | bw=13.5GiB/s | bw=13.5GiB/s      | +0.0%       |
> | 4k + normal file      | bw=3833MiB/s | bw=3859MiB/s      | +0.7%       |
> | 1M + fallocated file  | bw=24.9GiB/s | bw=34.2GiB/s      | +37.3%      |
> | 64k + fallocated file | bw=23.0GiB/s | bw=31.4GiB/s      | +36.5%      |
> | 4k + fallocated file  | bw=4710MiB/s | bw=4655MiB/s      | -1.2%       |
> | 1M + hole             | bw=24.3GiB/s | bw=34.5GiB/s      | +42.0%      |
> | 64k + hole            | bw=23.5GiB/s | bw=31.1GiB/s      | +32.3%      |
> | 4k + hole             | bw=4690MiB/s | bw=4647MiB/s      | -0.9%       |
>

That looks nice.

Microbenchmarks are useful, but are you able to help us understand how
much benefit our users might see in real-world workloads?

I'll take no action at this time - it's late in the cycle and reviewers
have yet to participate.

AI review flagged a few possible issues, so please take a look:
https://sashiko.dev/#/patchset/20260601055704.167436-1-chizhiling@163.com
On 6/2/26 08:43, Andrew Morton wrote:
> On Mon, 1 Jun 2026 13:56:59 +0800 Chi Zhiling <chizhiling@163.com> wrote:
>
>> From: Chi Zhiling <chizhiling@kylinos.cn>
>>
>> This series improves shmem read performance by implementing folio
>> batching in the read path and reducing unnecessary xarray lookups.
>>
>> Performance Results:
>>
>> fio --ioengine=sync --rw=read --bs=$1 --size=1G --runtime=180 --time_based --group_reporting --name=seq_read_test --filename=testfile
>>
>> | THP disabled in tmpfs  | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
>> | ---------------------- | ------------ | ----------------- | ----------- |
>> | 1M + normal file       | bw=11.5GiB/s | bw=12.7GiB/s      | +10.4%      |
>> | 64k + normal file      | bw=11.0GiB/s | bw=12.3GiB/s      | +11.8%      |
>> | 4k + normal file       | bw=3826MiB/s | bw=3849MiB/s      | +0.6%       |
>> | 1M + fallocated file   | bw=23.8GiB/s | bw=28.6GiB/s      | +20.2%      |
>> | 64k + fallocated file  | bw=22.5GiB/s | bw=27.3GiB/s      | +21.3%      |
>> | 4k + fallocated file   | bw=4655MiB/s | bw=4680MiB/s      | +0.5%       |
>> | 1M + hole              | bw=24.2GiB/s | bw=28.6GiB/s      | +18.2%      |
>> | 64k + hole             | bw=22.6GiB/s | bw=27.6GiB/s      | +22.1%      |
>> | 4k + hole              | bw=4652MiB/s | bw=4489MiB/s      | -3.5%       |
>>
>>
>> | THP enabled in tmpfs  | v7.1-rc5     | v7.1-rc5 + fbatch | Improvement |
>> | --------------------- | ------------ | ----------------- | ----------- |
>> | 1M + normal file      | bw=13.7GiB/s | bw=13.9GiB/s      | +1.4%       |
>> | 64k + normal file     | bw=13.5GiB/s | bw=13.5GiB/s      | +0.0%       |
>> | 4k + normal file      | bw=3833MiB/s | bw=3859MiB/s      | +0.7%       |
>> | 1M + fallocated file  | bw=24.9GiB/s | bw=34.2GiB/s      | +37.3%      |
>> | 64k + fallocated file | bw=23.0GiB/s | bw=31.4GiB/s      | +36.5%      |
>> | 4k + fallocated file  | bw=4710MiB/s | bw=4655MiB/s      | -1.2%       |
>> | 1M + hole             | bw=24.3GiB/s | bw=34.5GiB/s      | +42.0%      |
>> | 64k + hole            | bw=23.5GiB/s | bw=31.1GiB/s      | +32.3%      |
>> | 4k + hole             | bw=4690MiB/s | bw=4647MiB/s      | -0.9%       |
>>
>
> That looks nice.
>
> Microbenchmarks are useful, but are you able to help us understand how
> much benefit our users might see in real-world workloads?

Hi, Andrew

I don't have real-world performance data yet. I'm working on this simply
because the patch shows decent gains in microbenchmarks. Even with THP
enabled, it can still reduce some unnecessary overhead.

>
> I'll take no action at this time - it's late in the cycle and reviewers
> have yet to participate.

Yes, it's unlikely to land in 7.2, and I still need to resolve some
performance regressions.

>
> AI review flagged a few possible issues, so please take a look:
> https://sashiko.dev/#/patchset/20260601055704.167436-1-chizhiling@163.com

Okay, I will take a close look.

Thanks!