mm/shmem: optimize read with reduced xarray lookups and folio batching

[PATCH v1 0/5] mm/shmem: optimize read with reduced xarray lookups and folio batching

Posted by Chi Zhiling 4 days, 13 hours ago

From: Chi Zhiling <chizhiling@kylinos.cn>

This series improves shmem read performance by implementing folio
batching in the read path and reducing unnecessary xarray lookups.


Changes since RFC
=================

The RFC version used xas_for_each() in shmem_get_read_batch(), which
introduced about a 1% regression for 4K read workloads.

This v1 addresses the regression by switching to
filemap_get_folios_contig(), which patch 2 optimizes to avoid the
extra xarray traversal overhead.


Performance Results
===================

Testing was performed with fio sequential read workloads:

  fio --ioengine=sync --rw=read --size=1G --runtime=180


### THP Disabled - Normal Files ###

| Block Size | Baseline  | v1        | Improvement |
| ---------- | --------- | --------- | ----------- |
| 1M         | 11.4GiB/s | 12.7GiB/s | +11.4%      |
| 64k        | 11.2GiB/s | 12.2GiB/s | +8.9%       |
| 4k         | 3809MiB/s | 3838MiB/s | +0.8%       |

### THP Disabled - Fallocated Files ###

| Block Size | Baseline  | v1        | Improvement |
| ---------- | --------- | --------- | ----------- |
| 1M         | 23.7GiB/s | 28.7GiB/s | +21.1%      |
| 64k        | 22.6GiB/s | 27.0GiB/s | +19.5%      |
| 4k         | 4668MiB/s | 4678MiB/s | +0.2%       |

### THP Enabled - Normal Files ###

| Block Size | Baseline  | v1        | Improvement |
| ---------- | --------- | --------- | ----------- |
| 1M         | 13.9GiB/s | 13.9GiB/s | 0%          |
| 64k        | 13.4GiB/s | 13.4GiB/s | 0%          |
| 4k         | 3818MiB/s | 3836MiB/s | +0.5%       |

### THP Enabled - Fallocated Files ###

| Block Size | Baseline  | v1        | Improvement |
| ---------- | --------- | --------- | ----------- |
| 1M         | 24.1GiB/s | 34.9GiB/s | +44.8%      |
| 64k        | 22.9GiB/s | 31.3GiB/s | +36.7%      |
| 4k         | 4721MiB/s | 4708MiB/s | -0.3%       |


RFC:
https://lore.kernel.org/linux-fsdevel/20260515094702.1092355-1-chizhiling@163.com/


Chi Zhiling (5):
  mm/filemap: reduce unnecessary xarray lookups when read cached pages
  mm/filemap: reduce xarray lookups in filemap_get_folios_contig()
  mm/shmem: make SGP_NOALLOC succeed on hole like SGP_READ
  mm/shmem: introduce copy_zero_to_iter() for large zeroing
  mm/shmem: optimize file read with folio batching

 include/linux/shmem_fs.h |  2 +-
 mm/filemap.c             | 34 ++++++++++++------
 mm/khugepaged.c          |  2 +-
 mm/shmem.c               | 75 +++++++++++++++++++++++++++++-----------
 4 files changed, 79 insertions(+), 34 deletions(-)

-- 
2.43.0

Re: [PATCH v1 0/5] mm/shmem: optimize read with reduced xarray lookups and folio batching

Posted by Andrew Morton 2 days, 23 hours ago

On Wed, 20 May 2026 18:15:33 +0800 Chi Zhiling <chizhiling@163.com> wrote:

> From: Chi Zhiling <chizhiling@kylinos.cn>
> 
> This series improves shmem read performance by implementing folio
> batching in the read path and reducing unnecessary xarray lookups.
> 

Thanks.

> Performance Results
> ===================
> 
> Testing was performed with fio sequential read workloads:
> 
>   fio --ioengine=sync --rw=read --size=1G --runtime=180
> 
> 
> ### THP Disabled - Normal Files ###
> 
> | Block Size | Baseline  | v1        | Improvement |
> | ---------- | --------- | --------- | ----------- |
> | 1M         | 11.4GiB/s | 12.7GiB/s | +11.4%      |
> | 64k        | 11.2GiB/s | 12.2GiB/s | +8.9%       |
> | 4k         | 3809MiB/s | 3838MiB/s | +0.8%       |
> 
> ### THP Disabled - Fallocated Files ###
> 
> | Block Size | Baseline  | v1        | Improvement |
> | ---------- | --------- | --------- | ----------- |
> | 1M         | 23.7GiB/s | 28.7GiB/s | +21.1%      |
> | 64k        | 22.6GiB/s | 27.0GiB/s | +19.5%      |
> | 4k         | 4668MiB/s | 4678MiB/s | +0.2%       |
> 
> ### THP Enabled - Normal Files ###
> 
> | Block Size | Baseline  | v1        | Improvement |
> | ---------- | --------- | --------- | ----------- |
> | 1M         | 13.9GiB/s | 13.9GiB/s | 0%          |
> | 64k        | 13.4GiB/s | 13.4GiB/s | 0%          |
> | 4k         | 3818MiB/s | 3836MiB/s | +0.5%       |
> 
> ### THP Enabled - Fallocated Files ###
> 
> | Block Size | Baseline  | v1        | Improvement |
> | ---------- | --------- | --------- | ----------- |
> | 1M         | 24.1GiB/s | 34.9GiB/s | +44.8%      |
> | 64k        | 22.9GiB/s | 31.3GiB/s | +36.7%      |
> | 4k         | 4721MiB/s | 4708MiB/s | -0.3%       |

That looks nice.

AI review might have found a few things:
	https://sashiko.dev/#/patchset/20260520101538.58745-1-chizhiling@163.com

I'll skip the patchset for now (unreviewed v1!).

Re: [PATCH v1 0/5] mm/shmem: optimize read with reduced xarray lookups and folio batching

Posted by Chi Zhiling 2 days, 21 hours ago

On 5/22/26 08:14, Andrew Morton wrote:
> 
> That looks nice.
> 
> AI review might have found a few things:
> 	https://sashiko.dev/#/patchset/20260520101538.58745-1-chizhiling@163.com
> 
> I'll skip the patchset for now (unreviewed v1!).

Okay, I will fix those issues in v2.

Thanks!