[v1] mm/filemap: tighten mmap_miss hit accounting

[RFC PATCH 0/1] mm/filemap: tighten mmap_miss hit accounting
Posted by fujunjie 1 month, 2 weeks ago
Hi,

This RFC explores a narrow mmap readahead accounting issue in
filemap_map_pages().

Today, mmap_miss is increased when synchronous mmap readahead is needed,
and decreased when filemap_map_pages() maps folios that are already in
the page cache.  The decrease side can over-credit hits in two cases:

  - fault-around installs nearby PTEs even though the fault only proves
    that the faulting address was accessed;
  - after synchronous mmap readahead returns VM_FAULT_RETRY, the retry
    can find the folio brought in by the same miss and immediately
    cancel that miss.

This first RFC keeps the scope intentionally conservative:

  - only credit a hit when filemap_map_pages() maps the actual faulting
    address;
  - do not credit FAULT_FLAG_TRIED retries as mmap hits;
  - keep the existing workingset-folio behavior unchanged;
  - do not change async mmap readahead hit accounting.
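
As a rough illustration of the two over-credit cases and the two rules
above, the following Python sketch models only the cold-fault path: a
miss that needs synchronous readahead, followed by the retry that finds
the folio.  It is a toy model, not the kernel code; the function name,
the fault_around_pages value of 16, and the clamp at zero are all
assumptions made for the illustration.

```python
def counter_after_cold_faults(n_faults, p1=False, p2=False,
                              fault_around_pages=16):
    """Toy mmap_miss counter after n cold faults, each followed by a retry.

    p1: credit only the faulting address (not every fault-around PTE).
    p2: do not credit FAULT_FLAG_TRIED retries at all.
    fault_around_pages is an assumed value for this sketch.
    """
    counter = 0
    for _ in range(n_faults):
        counter += 1  # synchronous mmap readahead was needed: one miss
        if p2:
            credit = 0  # the retry of this same miss is not a hit
        elif p1:
            credit = 1  # only the faulting address counts
        else:
            credit = fault_around_pages  # every mapped nearby page counts
        counter = max(0, counter - credit)  # assumed clamp at zero
    return counter


for name, kwargs in [("baseline", {}),
                     ("P1", {"p1": True}),
                     ("P2-only", {"p2": True}),
                     ("P1+P2", {"p1": True, "p2": True})]:
    print(name, counter_after_cold_faults(10, **kwargs))
```

In this simplified model the counter only accumulates when the retry is
not credited.  The sketch deliberately leaves out the other path, where
a later fault maps pages that are already cached, so it should not be
read as a prediction of the measured results.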

The data so far suggests that the change helps sparse random mmap
access and sparse strides that do not geometrically overlap with the
read-around window.  The main data set comes from a local KVM/data-disk
microbenchmark using mmap_miss_probe, with an 8 GiB guest, 2 vCPUs,
8192 KiB read_ahead_kb, and a cold page cache before each run; the
reported values are medians of 3 runs.

mmap_miss_probe is a small userspace benchmark used only for these
measurements.  It mmap()s a prepared file with MADV_NORMAL and then
touches one byte at selected base-page offsets; the access order is
random, sequential, or a fixed page stride.  The harness drops caches
before each run and samples /proc/vmstat around that access loop.
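
For readers who want to reproduce the access patterns, here is a minimal
Python sketch of the same idea.  It is not the mmap_miss_probe source:
the function names, the 4096-byte page size, the wrap-around of strided
indexes at end of file, and random sampling with replacement are
assumptions of this sketch.

```python
import mmap
import random

PAGE_SIZE = 4096  # base-page size assumed for this sketch


def page_indexes(pattern, file_pages, count, stride=1, seed=0):
    """Return `count` page indexes for one access pass."""
    if pattern == "sequential":
        return [i % file_pages for i in range(count)]
    if pattern == "stride":
        # Assumption: strided accesses wrap around at end of file.
        return [(i * stride) % file_pages for i in range(count)]
    if pattern == "random":
        rng = random.Random(seed)
        return [rng.randrange(file_pages) for _ in range(count)]
    raise ValueError(f"unknown pattern: {pattern}")


def touch_pages(path, indexes):
    """mmap() the file read-only and load one byte from each page."""
    total = 0
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # MADV_NORMAL keeps the default readahead behaviour; guarded
        # because madvise() is not available on every platform.
        if hasattr(m, "madvise") and hasattr(mmap, "MADV_NORMAL"):
            m.madvise(mmap.MADV_NORMAL)
        for idx in indexes:
            total += m[idx * PAGE_SIZE]  # one-byte load triggers the fault
    return total
```

A pass like stride2053 over a prepared file would then be
touch_pages(path, page_indexes("stride", file_pages, count, stride=2053)),
with cache dropping and /proc/vmstat sampling done around that call.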

Here "pressure" means file-cache capacity pressure from a 20 GiB file in
an 8 GiB guest.  It is not an extra memhog workload.  The fit-in-memory
case uses a 4 GiB file in the same 8 GiB guest.

Each case used a fresh temporary qcow2 data disk, seen by the guest as
/dev/vda, formatted as ext4 and mounted at /mnt/mmap-matrix.

Each result is "pgpgin GiB / elapsed seconds".  "pgpgin GiB" is the
delta of the guest /proc/vmstat pgpgin counter, converted from KiB to
GiB; I use it as an approximate block input counter, not as resident
memory or exact application IO.  "Elapsed seconds" is the wall-clock
runtime of the whole mmap_miss_probe access pass, not per-access
latency.
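
The pgpgin conversion described above can be sketched as follows.  The
helper names are hypothetical and this is not the harness used for the
measurements; it only assumes the "name value" line format of
/proc/vmstat and the KiB unit stated above.

```python
def read_pgpgin_kib(vmstat_text):
    """Extract the pgpgin counter (KiB) from /proc/vmstat contents."""
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name == "pgpgin":
            return int(value)
    raise KeyError("pgpgin not found")


def pgpgin_delta_gib(before_text, after_text):
    """Delta of pgpgin between two samples, converted from KiB to GiB."""
    delta_kib = read_pgpgin_kib(after_text) - read_pgpgin_kib(before_text)
    return delta_kib / (1024 * 1024)


def sample_vmstat(path="/proc/vmstat"):
    """Read one snapshot; call before and after the access pass."""
    with open(path) as f:
        return f.read()
```

Taking one sample_vmstat() snapshot before the access loop and one after
it, pgpgin_delta_gib(before, after) gives the first number of each
"pgpgin GiB / elapsed seconds" pair.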

For the 20 GiB pressure case with 1% of pages accessed:

        workload       before                after
        random         223.377 GiB/101.293s  1.010 GiB/4.790s
        stride1021     204.214 GiB/97.557s   204.208 GiB/108.086s
        stride2053     409.584 GiB/193.700s  0.970 GiB/3.685s
        stride4099     406.452 GiB/134.241s  0.975 GiB/3.499s
        sequential       0.212 GiB/0.050s    0.212 GiB/0.057s

For the 4 GiB fit-in-memory case in the same 8 GiB guest:

        workload       before              after
        random         3.987 GiB/1.960s    0.980 GiB/1.221s
        stride1021     4.002 GiB/1.838s    4.002 GiB/1.851s
        stride2053     3.991 GiB/1.835s    0.811 GiB/0.985s
        stride4099     4.001 GiB/1.836s    0.819 GiB/1.037s
        sequential     0.056 GiB/0.013s    0.056 GiB/0.018s

The same 20 GiB pressure setup (8 GiB guest) also has an ablation.  P1
applies only the faulting-address hit accounting change.  P2-only
applies only the FAULT_FLAG_TRIED retry filter.  P1+P2 is this RFC.  A
representative subset of that ablation is:

        workload    variant   result
        random      baseline  223.377 GiB/101.293s
        random      P1        223.268 GiB/98.481s
        random      P2-only   223.257 GiB/100.091s
        random      P1+P2     1.010 GiB/4.790s
        stride2053  baseline  409.584 GiB/193.700s
        stride2053  P1        409.584 GiB/197.645s
        stride2053  P2-only   15.722 GiB/5.485s
        stride2053  P1+P2     0.970 GiB/3.685s
        sequential  baseline  0.212 GiB/0.050s
        sequential  P1        0.212 GiB/0.046s
        sequential  P2-only   0.212 GiB/0.050s
        sequential  P1+P2     0.212 GiB/0.057s

This supports keeping the RFC scoped to the two accounting changes:
P1 alone was effectively baseline, while P2-only helped large sparse
strides under pressure but left random access at baseline-level IO.
I also tried variants that changed async mmap readahead and workingset
handling; in this data set they tracked P1+P2 closely, so I left them
out of this RFC.

Current evidence does not establish that this solves every sparse
pattern.  The stride1021 rows above are intentionally included: the
20 GiB run still reads about 204 GiB.

In the tables, strideN means that the benchmark advances by N base pages
between mmap loads.  Thus stride1021 is 1021 * 4 KiB = 4084 KiB.  With
8192 KiB read_ahead_kb, file->f_ra.ra_pages is 2048 base pages, and
synchronous mmap read-around uses a 2048-page window centered around the
fault, i.e. roughly [index - 1024, index + 1023].  A stride1021 access
after a miss at index therefore lands at index + 1021, inside that
read-around window, while the following access at index + 2042 lands
outside it.  So about every other access can be a real faulting-address
page-cache hit, and the other half can each read about 8 MiB.  For about
52k accesses in the 20 GiB/1% run, half of them times 8 MiB is about
205 GiB, which matches the observed 204 GiB.
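
The stride1021 arithmetic can be checked with a short Python sketch.
The helper names are hypothetical, and the model is the simplified one
described here (4 KiB base pages, a centered read-around window, and
every miss reading one full window), not a description of the kernel
implementation.

```python
PAGE_KIB = 4  # base-page size assumed for this sketch


def ra_pages(read_ahead_kb):
    """Read-around window size in base pages."""
    return read_ahead_kb // PAGE_KIB


def lands_in_previous_window(stride_pages, read_ahead_kb):
    """True if index + stride is still inside [index - ra/2, index + ra/2 - 1]."""
    return stride_pages <= ra_pages(read_ahead_kb) // 2 - 1


def estimated_read_gib(file_gib, percent_accessed, read_ahead_kb,
                       miss_fraction):
    """Estimated input if each miss reads one full read-around window."""
    file_pages = file_gib * 1024 * 1024 // PAGE_KIB
    accesses = file_pages * percent_accessed // 100
    misses = accesses * miss_fraction
    return misses * read_ahead_kb / (1024 * 1024)


print(lands_in_previous_window(1021, 8192))  # True
print(lands_in_previous_window(2053, 8192))  # False
print(round(estimated_read_gib(20, 1, 8192, 0.5), 1))  # 204.8
```

With miss_fraction set to 1.0 the same estimate gives about 409.6 GiB,
which is close to the "before" figure for stride2053 in the 20 GiB
table.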

Feedback on the accounting boundary and on suitable test coverage would
be useful.

I will be travelling next week, so I may be slow to reply.

Best regards.

fujunjie

fujunjie (1):
  mm/filemap: tighten mmap_miss hit accounting

 mm/filemap.c | 33 ++++++++++++++++-----------------
 1 file changed, 16 insertions(+), 17 deletions(-)


base-commit: 1b55f8358e35a67bf3969339ea7b86988af92f66
-- 
2.34.1