Linux-mm Archive on lore.kernel.org
* [PATCH v3 0/2] mm/filemap: tighten mmap_miss hit accounting
@ 2026-04-28  1:57 fujunjie
  2026-04-28  1:59 ` [PATCH v3 1/2] mm/filemap: count only the faulting address as a mmap hit fujunjie
  2026-04-28  1:59 ` [PATCH v3 2/2] mm/filemap: do not count FAULT_FLAG_TRIED retries as mmap hits fujunjie
  0 siblings, 2 replies; 7+ messages in thread
From: fujunjie @ 2026-04-28  1:57 UTC (permalink / raw)
  To: Matthew Wilcox, Jan Kara, Andrew Morton
  Cc: linux-fsdevel, linux-mm, linux-kernel, Roman Gushchin, Haoran Zhu


This is v3 of the mmap_miss hit-accounting change.  v1 was sent as an
RFC.  The accounting logic is unchanged from v2, but patch 1 now keeps
the workingset mmap_miss comment near the new accounting block as
Matthew requested.

  - patch 1 limits fault-around hit accounting to the faulting address;
  - patch 2 stops FAULT_FLAG_TRIED retries from decrementing mmap_miss.

Patch 1 also follows Jan's implementation suggestion: the helper
functions no longer propagate a mmap_miss variable, and
filemap_map_pages() updates file->f_ra.mmap_miss based on whether the
helper mapped the actual faulting address.

mmap_miss is increased when synchronous mmap readahead is needed, and
decreased when filemap_map_pages() maps folios that are already in the
page cache.  The decrease side can over-credit hits in two cases (see
the sketch after this list):

  - fault-around installs nearby PTEs even though the fault only proves
    that the faulting address was accessed;
  - after synchronous mmap readahead returns VM_FAULT_RETRY, the retry
    can find the folio brought in by the same miss and immediately
    cancel that miss.
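
Roughly, the intended shape of the accounting after both patches looks
like the sketch below.  This is an illustration only, not the actual
patch hunks; "mapped_fault_addr", reporting whether the helper mapped
the faulting address itself, is a hypothetical stand-in for the
return-value plumbing patch 1 adds:

#include <linux/fs.h>
#include <linux/mm.h>

static void filemap_account_mmap_hit(struct vm_fault *vmf,
                                     bool mapped_fault_addr)
{
        struct file_ra_state *ra = &vmf->vma->vm_file->f_ra;
        unsigned int mmap_miss;

        /* Patch 2: a FAULT_FLAG_TRIED retry may only be finding the
         * folio that the same miss just read in; do not cancel that
         * miss. */
        if (vmf->flags & FAULT_FLAG_TRIED)
                return;

        /* Patch 1: fault-around PTEs are speculative; only the
         * faulting address proves an access. */
        if (!mapped_fault_addr)
                return;

        mmap_miss = READ_ONCE(ra->mmap_miss);
        if (mmap_miss)
                WRITE_ONCE(ra->mmap_miss, mmap_miss - 1);
}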

Current evidence comes from a local KVM/data-disk microbenchmark using
mmap_miss_probe, with an 8 GiB guest, 2 vCPUs, 8192 KiB read_ahead_kb,
cold page cache before each run, 1% of the file accessed, and medians of
3 runs.

mmap_miss_probe mmap()s a prepared file with MADV_NORMAL and then
touches one byte at selected base-page offsets.  The access order is
random, sequential, or a fixed page stride.  The harness drops caches
before each run and samples /proc/vmstat around that access loop.
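
As a concrete (illustrative) picture of that access loop, a minimal
stride variant could look like the sketch below; the names and argument
handling are made up here, and the real harness additionally handles
random/sequential orders, drops caches, and samples /proc/vmstat:

#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        long stride_pages = argc > 2 ? strtol(argv[2], NULL, 0) : 1021;
        long page = sysconf(_SC_PAGESIZE);
        struct stat st;
        volatile unsigned char sink = 0;
        int fd;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0)
                return 1;

        unsigned char *map = mmap(NULL, st.st_size, PROT_READ,
                                  MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
                return 1;
        madvise(map, st.st_size, MADV_NORMAL);

        /* Touch one byte at each selected base-page offset. */
        for (off_t off = 0; off < st.st_size; off += stride_pages * page)
                sink += map[off];

        (void)sink;
        munmap(map, st.st_size);
        close(fd);
        return 0;
}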

The 20 GiB case below uses a larger-than-memory file in the 8 GiB
guest; no separate memory hog was used.  The 4 GiB case uses the same
guest but keeps the file small enough to fit in memory.

Each case used a fresh temporary qcow2 data disk, seen by the guest as
/dev/vda, formatted as ext4 and mounted at /mnt/mmap-matrix.

Each result is "pgpgin GiB / elapsed seconds".  "pgpgin GiB" is the
delta of the guest /proc/vmstat pgpgin counter, converted from KiB to
GiB; it is used here as an approximate block input counter, not as
resident memory or exact application I/O.  "Elapsed seconds" is the
wall-clock runtime of the whole mmap_miss_probe access pass, not
per-access latency.
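
For reference, the sampling side reduces to reading one counter; a
minimal sketch, following the same KiB-to-GiB conversion as above
(error handling trimmed, and the access pass itself elided):

#include <stdio.h>
#include <string.h>

/* Return the pgpgin counter from /proc/vmstat, or -1 on failure. */
static long long read_pgpgin(void)
{
        char key[64];
        long long val;
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f)
                return -1;
        while (fscanf(f, "%63s %lld", key, &val) == 2) {
                if (!strcmp(key, "pgpgin")) {
                        fclose(f);
                        return val;
                }
        }
        fclose(f);
        return -1;
}

int main(void)
{
        long long before = read_pgpgin();
        /* ... run the mmap_miss_probe access pass here ... */
        long long after = read_pgpgin();

        printf("pgpgin delta: %.3f GiB\n",
               (after - before) / (1024.0 * 1024.0));
        return 0;
}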

For the 20 GiB larger-than-memory case:

        workload       before                after
        random         223.377 GiB/101.293s  1.010 GiB/4.790s
        stride1021     204.214 GiB/97.557s   204.208 GiB/108.086s
        stride2053     409.584 GiB/193.700s  0.970 GiB/3.685s
        stride4099     406.452 GiB/134.241s  0.975 GiB/3.499s
        sequential       0.212 GiB/0.050s    0.212 GiB/0.057s

For the 4 GiB fit-in-memory case:

        workload       before              after
        random         3.987 GiB/1.960s    0.980 GiB/1.221s
        stride1021     4.002 GiB/1.838s    4.002 GiB/1.851s
        stride2053     3.991 GiB/1.835s    0.811 GiB/0.985s
        stride4099     4.001 GiB/1.836s    0.819 GiB/1.037s
        sequential     0.056 GiB/0.013s    0.056 GiB/0.018s

The 20 GiB setup also has an ablation.  "P1" applies only the
faulting-address hit-accounting change, "P2-only" applies only the
FAULT_FLAG_TRIED retry filter, and "P1+P2" is the combined series:

        workload    variant   result
        random      baseline  223.377 GiB/101.293s
        random      P1        223.268 GiB/98.481s
        random      P2-only   223.257 GiB/100.091s
        random      P1+P2     1.010 GiB/4.790s
        stride2053  baseline  409.584 GiB/193.700s
        stride2053  P1        409.584 GiB/197.645s
        stride2053  P2-only   15.722 GiB/5.485s
        stride2053  P1+P2     0.970 GiB/3.685s
        sequential  baseline  0.212 GiB/0.050s
        sequential  P1        0.212 GiB/0.046s
        sequential  P2-only   0.212 GiB/0.050s
        sequential  P1+P2     0.212 GiB/0.057s

After the v2 implementation refactor, only the final P1+P2 shape was
rerun in the same setup.  The numbers stayed in line with the v1 P1+P2
rows above:

        workload       larger-than-memory case    fit-in-memory case
                       20 GiB file, 1% access    4 GiB file, 1% access
        random           1.010 GiB/4.383s          0.980 GiB/1.088s
        stride1021     204.216 GiB/105.601s        4.001 GiB/1.783s
        stride2053       0.970 GiB/3.760s          0.810 GiB/0.908s
        stride4099       0.975 GiB/3.410s          0.818 GiB/0.870s
        sequential       0.212 GiB/0.060s          0.056 GiB/0.016s

This does not claim to solve every sparse pattern.  The stride1021 rows
are intentionally shown as a boundary: with 8192 KiB read_ahead_kb,
file->f_ra.ra_pages is 2048 base pages, and synchronous mmap
read-around uses a 2048-page window centered around the fault, roughly
[index - 1024, index + 1023].  stride1021 is 1021 * 4 KiB = 4084 KiB,
so the next access lands inside the previous read-around window.  About
every other access can then be a real faulting-address page-cache hit,
while each of the remaining accesses can trigger about 8 MiB of
read-around I/O.  For the roughly 52k accesses in the 20 GiB/1% run,
half of them times 8 MiB is about 205 GiB, matching the observed
204 GiB.
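
That back-of-envelope arithmetic can be checked directly (the access
count is the assumed ~1% of the file's 4 KiB pages; everything else
comes from the setup above):

#include <stdio.h>

int main(void)
{
        long ra_pages = 8192 / 4;        /* read_ahead_kb -> 2048 pages */
        long half_window = ra_pages / 2; /* ~1024 pages each way */
        long stride = 1021;              /* pages; 1021 * 4 KiB = 4084 KiB */
        long long accesses = (20LL << 30) / 4096 / 100;    /* ~52k */
        double miss_gib = (accesses / 2.0) * 8.0 / 1024.0; /* 8 MiB each */

        printf("stride %ld < half window %ld -> lands in window\n",
               stride, half_window);
        printf("%lld accesses, expected pgpgin ~%.0f GiB\n",
               accesses, miss_gib);
        return 0;
}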

---
v3:
- move the workingset mmap_miss comment to the new accounting block in
  filemap_map_pages().
- no new performance run; v3 only moves a comment and does not change
  executable code from v2.

v2: https://lore.kernel.org/r/tencent_4EDE373816615C46CFD48A6EF3B61E232308@qq.com
v1: https://lore.kernel.org/r/tencent_3F158B17AE85E73945C5F97D8F8A918F9B07@qq.com

v2 changes:
- split the original patch into two patches;
- move mmap_miss updating back into filemap_map_pages();
- drop the mmap_miss argument from filemap_map_order0_folio() and
  filemap_map_folio_range().

fujunjie (2):
  mm/filemap: count only the faulting address as a mmap hit
  mm/filemap: do not count FAULT_FLAG_TRIED retries as mmap hits

 mm/filemap.c | 63 ++++++++++++++++++++++++++++++------------------------------
 1 file changed, 32 insertions(+), 31 deletions(-)


base-commit: 1b55f8358e35a67bf3969339ea7b86988af92f66
-- 
2.34.1



Thread overview: 7+ messages
2026-04-28  1:57 [PATCH v3 0/2] mm/filemap: tighten mmap_miss hit accounting fujunjie
2026-04-28  1:59 ` [PATCH v3 1/2] mm/filemap: count only the faulting address as a mmap hit fujunjie
2026-04-28  9:47   ` Jan Kara
2026-05-10 12:45   ` Vishal Moola
2026-04-28  1:59 ` [PATCH v3 2/2] mm/filemap: do not count FAULT_FLAG_TRIED retries as mmap hits fujunjie
2026-04-28  9:48   ` Jan Kara
2026-05-10 12:46   ` Vishal Moola
