* Re: [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O
  [not found] ` <CAGsJ_4zyZeLtxVe56OSYQx0OcjETw2ru1FjZjBOnTszMe_MW2g@mail.gmail.com>
@ 2025-11-27 19:43 ` Matthew Wilcox
  2025-11-27 20:29   ` Barry Song
  2025-11-30  5:38   ` Shakeel Butt
  0 siblings, 2 replies; 7+ messages in thread

From: Matthew Wilcox @ 2025-11-27 19:43 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, linux-arm-kernel, linux-kernel, loongarch,
    linuxppc-dev, linux-riscv, linux-s390, linux-fsdevel

[dropping individuals, leaving only mailing lists. please don't send
this kind of thing to so many people in future]

On Thu, Nov 27, 2025 at 12:22:16PM +0800, Barry Song wrote:
> On Thu, Nov 27, 2025 at 12:09 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Nov 27, 2025 at 09:14:36AM +0800, Barry Song wrote:
> > > There is no need to always fall back to mmap_lock if the per-VMA
> > > lock was released only to wait for pagecache or swapcache to
> > > become ready.
> >
> > Something I've been wondering about is removing all the "drop the MM
> > locks while we wait for I/O" gunk.  It's a nice amount of code removed:
>
> I think the point is that page fault handlers should avoid holding the VMA
> lock or mmap_lock for too long while waiting for I/O. Otherwise, those
> writers and readers will be stuck for a while.

There's a usecase some of us have been discussing off-list for a few
weeks that our current strategy pessimises. It's a process with
thousands (maybe tens of thousands) of threads. It has many more mapped
files than it has memory that cgroups will allow it to use. So on a
page fault, we drop the vma lock, allocate a page of RAM, kick off the
read, and sleep waiting for the folio to come uptodate; once it is, we
return, expecting the page to still be there when we reenter
filemap_fault. But it's under so much memory pressure that the page has
already been reclaimed by the time we get back to it. So all the
threads just batter the storage re-reading data.

If we don't drop the vma lock, we can insert the pages in the page
table and return, maybe getting some work done before this thread is
descheduled.

This use case also manages to get utterly hung up trying to do reclaim
today with the mmap_lock held. So it manifests somewhat similarly to
your problem (everybody ends up blocked on mmap_lock) but it has a
rather different root cause.

> I agree there’s room for improvement, but merely removing the "drop the MM
> locks while waiting for I/O" code is unlikely to improve performance.

I'm not sure it'd hurt performance. The "drop mmap locks for I/O" code
was written before the VMA locking code was written. I don't know that
it's actually helping these days.

> The change would be much more complex, so I’d prefer to land the current
> patchset first. At least this way, we avoid falling back to mmap_lock and
> causing contention or priority inversion, with minimal changes.

Uh, this is an RFC patchset. I'm giving you my comment, which is that I
don't think this is the right direction to go in. Any talk of "landing"
these patches is extremely premature.

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O
  2025-11-27 19:43 ` [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O Matthew Wilcox
@ 2025-11-27 20:29 ` Barry Song
  2025-11-27 21:52   ` Barry Song
  2025-11-30  0:28   ` Suren Baghdasaryan
  1 sibling, 2 replies; 7+ messages in thread

From: Barry Song @ 2025-11-27 20:29 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, linux-mm, linux-arm-kernel, linux-kernel, loongarch,
    linuxppc-dev, linux-riscv, linux-s390, linux-fsdevel

On Fri, Nov 28, 2025 at 3:43 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> [dropping individuals, leaving only mailing lists. please don't send
> this kind of thing to so many people in future]
>
> On Thu, Nov 27, 2025 at 12:22:16PM +0800, Barry Song wrote:
> > On Thu, Nov 27, 2025 at 12:09 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Nov 27, 2025 at 09:14:36AM +0800, Barry Song wrote:
> > > > There is no need to always fall back to mmap_lock if the per-VMA
> > > > lock was released only to wait for pagecache or swapcache to
> > > > become ready.
> > >
> > > Something I've been wondering about is removing all the "drop the MM
> > > locks while we wait for I/O" gunk.  It's a nice amount of code removed:
> >
> > I think the point is that page fault handlers should avoid holding the VMA
> > lock or mmap_lock for too long while waiting for I/O. Otherwise, those
> > writers and readers will be stuck for a while.
>
> There's a usecase some of us have been discussing off-list for a few
> weeks that our current strategy pessimises. It's a process with
> thousands (maybe tens of thousands) of threads. It has much more mapped
> files than it has memory that cgroups will allow it to use. So on a
> page fault, we drop the vma lock, allocate a page of ram, kick off the
> read, sleep waiting for the folio to come uptodate, once it is return,
> expecting the page to still be there when we reenter filemap_fault.
> But it's under so much memory pressure that it's already been reclaimed
> by the time we get back to it. So all the threads just batter the
> storage re-reading data.

Is this entirely the fault of re-entering the page fault? Under extreme
memory pressure, even if we map the pages, they can still be reclaimed
quickly?

>
> If we don't drop the vma lock, we can insert the pages in the page table
> and return, maybe getting some work done before this thread is
> descheduled.

If we need to protect the page from being reclaimed too early, the fix
should reside within LRU management, not in page fault handling.

Also, I gave an example where we may not drop the VMA lock if the folio is
already up to date. That likely corresponds to waiting for the PTE mapping to
complete.

>
> This use case also manages to get utterly hung-up trying to do reclaim
> today with the mmap_lock held. SO it manifests somewhat similarly to
> your problem (everybody ends up blocked on mmap_lock) but it has a
> rather different root cause.
>
> > I agree there’s room for improvement, but merely removing the "drop the MM
> > locks while waiting for I/O" code is unlikely to improve performance.
>
> I'm not sure it'd hurt performance. The "drop mmap locks for I/O" code
> was written before the VMA locking code was written. I don't know that
> it's actually helping these days.

I am concerned that other write paths may still need to modify the VMA, for
example during splitting. Tail latency has long been a significant issue for
Android users, and we have observed it even with folio_lock, which has much
finer granularity than the VMA lock.

> > The change would be much more complex, so I’d prefer to land the current
> > patchset first. At least this way, we avoid falling back to mmap_lock and
> > causing contention or priority inversion, with minimal changes.
>
> Uh, this is an RFC patchset. I'm giving you my comment, which is that I
> don't think this is the right direction to go in. Any talk of "landing"
> these patches is extremely premature.

While I agree that there are other approaches worth exploring, I
remain entirely unconvinced that this patchset is the wrong
direction. With the current retry logic, it substantially reduces
mmap_lock acquisitions and represents a clear low-hanging fruit.

Also, I am not referring to landing the RFC itself, but to a subsequent formal
patchset that retries using the per-VMA lock.

Thanks
Barry

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O
  2025-11-27 20:29 ` Barry Song
@ 2025-11-27 21:52   ` Barry Song
  2025-11-30  0:28   ` Suren Baghdasaryan
  1 sibling, 0 replies; 7+ messages in thread

From: Barry Song @ 2025-11-27 21:52 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, linux-mm, linux-arm-kernel, linux-kernel, loongarch,
    linuxppc-dev, linux-riscv, linux-s390, linux-fsdevel

On Fri, Nov 28, 2025 at 4:29 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Nov 28, 2025 at 3:43 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > [dropping individuals, leaving only mailing lists. please don't send
> > this kind of thing to so many people in future]

Apologies, I missed this one. The output comes from
./scripts/get_maintainer.pl. If you think the group is too large, I
guess we should at least include Suren, Lorenzo, David, and a few
others in the discussion?

[...]

> > This use case also manages to get utterly hung-up trying to do reclaim
> > today with the mmap_lock held. SO it manifests somewhat similarly to
> > your problem (everybody ends up blocked on mmap_lock) but it has a
> > rather different root cause.

If I understand the use case correctly, I believe retrying with the
per-VMA lock would also be very helpful. Previously, we always retried
using mmap_lock, which can be difficult to acquire under heavy
contention, leading to long latency while the pages might be reclaimed.
With the per-VMA lock, it is much easier to hold and proceed with the
work.

Thanks
Barry

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O
  2025-11-27 20:29 ` Barry Song
  2025-11-27 21:52   ` Barry Song
@ 2025-11-30  0:28   ` Suren Baghdasaryan
  2025-11-30  2:56     ` Barry Song
  1 sibling, 1 reply; 7+ messages in thread

From: Suren Baghdasaryan @ 2025-11-30 0:28 UTC (permalink / raw)
To: Barry Song
Cc: Matthew Wilcox, akpm, linux-mm, linux-arm-kernel, linux-kernel,
    loongarch, linuxppc-dev, linux-riscv, linux-s390, linux-fsdevel

On Thu, Nov 27, 2025 at 2:29 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Nov 28, 2025 at 3:43 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > [dropping individuals, leaving only mailing lists. please don't send
> > this kind of thing to so many people in future]
> >
> > On Thu, Nov 27, 2025 at 12:22:16PM +0800, Barry Song wrote:
> > > On Thu, Nov 27, 2025 at 12:09 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Thu, Nov 27, 2025 at 09:14:36AM +0800, Barry Song wrote:
> > > > > There is no need to always fall back to mmap_lock if the per-VMA
> > > > > lock was released only to wait for pagecache or swapcache to
> > > > > become ready.
> > > >
> > > > Something I've been wondering about is removing all the "drop the MM
> > > > locks while we wait for I/O" gunk.  It's a nice amount of code removed:
> > >
> > > I think the point is that page fault handlers should avoid holding the VMA
> > > lock or mmap_lock for too long while waiting for I/O. Otherwise, those
> > > writers and readers will be stuck for a while.
> >
> > There's a usecase some of us have been discussing off-list for a few
> > weeks that our current strategy pessimises. It's a process with
> > thousands (maybe tens of thousands) of threads. It has much more mapped
> > files than it has memory that cgroups will allow it to use. So on a
> > page fault, we drop the vma lock, allocate a page of ram, kick off the
> > read, sleep waiting for the folio to come uptodate, once it is return,
> > expecting the page to still be there when we reenter filemap_fault.
> > But it's under so much memory pressure that it's already been reclaimed
> > by the time we get back to it. So all the threads just batter the
> > storage re-reading data.
>
> Is this entirely the fault of re-entering the page fault? Under extreme
> memory pressure, even if we map the pages, they can still be reclaimed
> quickly?
>
> >
> > If we don't drop the vma lock, we can insert the pages in the page table
> > and return, maybe getting some work done before this thread is
> > descheduled.
>
> If we need to protect the page from being reclaimed too early, the fix
> should reside within LRU management, not in page fault handling.
>
> Also, I gave an example where we may not drop the VMA lock if the folio is
> already up to date. That likely corresponds to waiting for the PTE mapping to
> complete.
>
> >
> > This use case also manages to get utterly hung-up trying to do reclaim
> > today with the mmap_lock held. SO it manifests somewhat similarly to
> > your problem (everybody ends up blocked on mmap_lock) but it has a
> > rather different root cause.
> >
> > > I agree there’s room for improvement, but merely removing the "drop the MM
> > > locks while waiting for I/O" code is unlikely to improve performance.
> >
> > I'm not sure it'd hurt performance. The "drop mmap locks for I/O" code
> > was written before the VMA locking code was written. I don't know that
> > it's actually helping these days.
>
> I am concerned that other write paths may still need to modify the VMA, for
> example during splitting. Tail latency has long been a significant issue for
> Android users, and we have observed it even with folio_lock, which has much
> finer granularity than the VMA lock.

Another corner case we need to consider is when there is a large VMA
covering most of the address space, so holding a VMA lock during IO
would resemble holding an mmap_lock, leading to the same issue that we
faced before "drop mmap locks for I/O". We discussed this with Matthew
in the context of the problem he mentioned (the page is reclaimed
before page fault retry happens) with no conclusion yet.

> >
> > > The change would be much more complex, so I’d prefer to land the current
> > > patchset first. At least this way, we avoid falling back to mmap_lock and
> > > causing contention or priority inversion, with minimal changes.
> >
> > Uh, this is an RFC patchset. I'm giving you my comment, which is that I
> > don't think this is the right direction to go in. Any talk of "landing"
> > these patches is extremely premature.
>
> While I agree that there are other approaches worth exploring, I
> remain entirely unconvinced that this patchset is the wrong
> direction. With the current retry logic, it substantially reduces
> mmap_lock acquisitions and represents a clear low-hanging fruit.
>
> Also, I am not referring to landing the RFC itself, but to a subsequent formal
> patchset that retries using the per-VMA lock.

I don't know if this direction is the right one but I agree with
Matthew that we should consider alternatives before adopting a new
direction. Hopefully we can find one fix for both problems rather than
fixing each one in isolation.

> Thanks
> Barry

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O
  2025-11-30  0:28 ` Suren Baghdasaryan
@ 2025-11-30  2:56   ` Barry Song
  2026-04-04  9:19     ` wang lian
  0 siblings, 1 reply; 7+ messages in thread

From: Barry Song @ 2025-11-30 2:56 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Matthew Wilcox, akpm, linux-mm, linux-arm-kernel, linux-kernel,
    loongarch, linuxppc-dev, linux-riscv, linux-s390, linux-fsdevel

On Sun, Nov 30, 2025 at 8:28 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Nov 27, 2025 at 2:29 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Fri, Nov 28, 2025 at 3:43 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > [dropping individuals, leaving only mailing lists. please don't send
> > > this kind of thing to so many people in future]
> > >
> > > On Thu, Nov 27, 2025 at 12:22:16PM +0800, Barry Song wrote:
> > > > On Thu, Nov 27, 2025 at 12:09 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Thu, Nov 27, 2025 at 09:14:36AM +0800, Barry Song wrote:
> > > > > > There is no need to always fall back to mmap_lock if the per-VMA
> > > > > > lock was released only to wait for pagecache or swapcache to
> > > > > > become ready.
> > > > >
> > > > > Something I've been wondering about is removing all the "drop the MM
> > > > > locks while we wait for I/O" gunk.  It's a nice amount of code removed:
> > > >
> > > > I think the point is that page fault handlers should avoid holding the VMA
> > > > lock or mmap_lock for too long while waiting for I/O. Otherwise, those
> > > > writers and readers will be stuck for a while.
> > >
> > > There's a usecase some of us have been discussing off-list for a few
> > > weeks that our current strategy pessimises. It's a process with
> > > thousands (maybe tens of thousands) of threads. It has much more mapped
> > > files than it has memory that cgroups will allow it to use. So on a
> > > page fault, we drop the vma lock, allocate a page of ram, kick off the
> > > read, sleep waiting for the folio to come uptodate, once it is return,
> > > expecting the page to still be there when we reenter filemap_fault.
> > > But it's under so much memory pressure that it's already been reclaimed
> > > by the time we get back to it. So all the threads just batter the
> > > storage re-reading data.
> >
> > Is this entirely the fault of re-entering the page fault? Under extreme
> > memory pressure, even if we map the pages, they can still be reclaimed
> > quickly?
> >
> > >
> > > If we don't drop the vma lock, we can insert the pages in the page table
> > > and return, maybe getting some work done before this thread is
> > > descheduled.
> >
> > If we need to protect the page from being reclaimed too early, the fix
> > should reside within LRU management, not in page fault handling.
> >
> > Also, I gave an example where we may not drop the VMA lock if the folio is
> > already up to date. That likely corresponds to waiting for the PTE mapping to
> > complete.
> >
> > >
> > > This use case also manages to get utterly hung-up trying to do reclaim
> > > today with the mmap_lock held. SO it manifests somewhat similarly to
> > > your problem (everybody ends up blocked on mmap_lock) but it has a
> > > rather different root cause.
> > >
> > > > I agree there’s room for improvement, but merely removing the "drop the MM
> > > > locks while waiting for I/O" code is unlikely to improve performance.
> > >
> > > I'm not sure it'd hurt performance. The "drop mmap locks for I/O" code
> > > was written before the VMA locking code was written. I don't know that
> > > it's actually helping these days.
> >
> > I am concerned that other write paths may still need to modify the VMA, for
> > example during splitting. Tail latency has long been a significant issue for
> > Android users, and we have observed it even with folio_lock, which has much
> > finer granularity than the VMA lock.
>
> Another corner case we need to consider is when there is a large VMA
> covering most of the address space, so holding a VMA lock during IO
> would resemble holding an mmap_lock, leading to the same issue that we
> faced before "drop mmap locks for I/O". We discussed this with Matthew
> in the context of the problem he mentioned (the page is reclaimed
> before page fault retry happens) with no conclusion yet.

Suren, thank you very much for your input. Right. I think we may
discover more corner cases on Android in places where we previously saw
VMA merging, such as between two native heap mmap areas. This can
happen fairly often, and we don’t want long BIO queues to block those
writers.

> >
> > >
> > > > The change would be much more complex, so I’d prefer to land the current
> > > > patchset first. At least this way, we avoid falling back to mmap_lock and
> > > > causing contention or priority inversion, with minimal changes.
> > >
> > > Uh, this is an RFC patchset. I'm giving you my comment, which is that I
> > > don't think this is the right direction to go in. Any talk of "landing"
> > > these patches is extremely premature.
> >
> > While I agree that there are other approaches worth exploring, I
> > remain entirely unconvinced that this patchset is the wrong
> > direction. With the current retry logic, it substantially reduces
> > mmap_lock acquisitions and represents a clear low-hanging fruit.
> >
> > Also, I am not referring to landing the RFC itself, but to a subsequent formal
> > patchset that retries using the per-VMA lock.
>
> I don't know if this direction is the right one but I agree with
> Matthew that we should consider alternatives before adopting a new
> direction. Hopefully we can find one fix for both problems rather than
> fixing each one in isolation.

As I mentioned in a follow-up reply to Matthew[1], I think the current
approach also helps in cases where pages are reclaimed during retries.
Previously, we required mmap_lock to retry, so any contention made it
hard to acquire and introduced high latency. During that time, pages
could be reclaimed before mmap_lock was obtained. Now that we only
require the per-VMA lock, retries can proceed much more easily than
before. As long as we replace a big lock with a smaller one, there is
less chance of getting stuck in D state.

If either you or Matthew have a reproducer for this issue, I’d be
happy to try it out.

BTW, we also observed mmap_lock contention during MGLRU aging. TBH, the
non-RMAP clearing of the PTE young bit does not seem helpful on arm64,
which does not support non-leaf young bits at all. After disabling the
feature below, we found that reclamation used less CPU and ran better.

  echo 1 >/sys/kernel/mm/lru_gen/enabled

  0x0002  Clearing the accessed bit in leaf page table entries in large
          batches, when MMU sets it (e.g., on x86). This behavior can
          theoretically worsen lock contention (mmap_lock). If it is
          disabled, the multi-gen LRU will suffer a minor performance
          degradation for workloads that contiguously map hot pages,
          whose accessed bits can be otherwise cleared by fewer larger
          batches.

[1] https://lore.kernel.org/linux-mm/CAGsJ_4wvaieWtTrK+koM3SFu9rDExkVHX5eUwYiEotVqP-ndEQ@mail.gmail.com/

Thanks
Barry

^ permalink raw reply	[flat|nested] 7+ messages in thread
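[Editor's note: a hedged sketch of the MGLRU tuning mentioned above. The
path and the meaning of the feature bits follow the multi-gen LRU admin
guide (Documentation/admin-guide/mm/multigen_lru.rst); this needs root and a
kernel built with CONFIG_LRU_GEN, and the exact bit descriptions may differ
between kernel versions, so treat it as a config fragment, not a recipe.]

```shell
# 'enabled' is a bitmask of MGLRU features: 0x0001 is the multi-gen LRU
# core, 0x0002 the batched leaf accessed-bit clearing quoted above.
cat /sys/kernel/mm/lru_gen/enabled      # e.g. 0x0007 with every feature on

# Keep only the core (0x0001), disabling the batched accessed-bit
# clearing that Barry reports worsening mmap_lock contention:
echo 1 > /sys/kernel/mm/lru_gen/enabled
```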
* Re: [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O
  2025-11-30  2:56 ` Barry Song
@ 2026-04-04  9:19   ` wang lian
  0 siblings, 0 replies; 7+ messages in thread

From: wang lian @ 2026-04-04 9:19 UTC (permalink / raw)
To: 21cnbao
Cc: akpm, linux-arm-kernel, linux-fsdevel, linux-kernel, linux-mm,
    linux-riscv, linux-s390, linuxppc-dev, loongarch, surenb, willy,
    wang lian, Wang Lian, Kunwu Chan, Kunwu Chan

Hi Barry,

> If either you or Matthew have a reproducer for this issue, I’d be
> happy to try it out.

Kunwu and I evaluated this series ("mm: continue using per-VMA lock
when retrying page faults after I/O") under a stress scenario
specifically designed to expose the retry behavior in filemap_fault().
This models the exact situation described by Matthew Wilcox [1], where
retries after I/O fail to make forward progress under memory pressure.

The scenario targets the critical window between I/O completion and
mmap_lock reacquisition. This workload deliberately includes frequent
mmap/munmap operations to simulate a highly contended mmap_lock
environment alongside severe memory pressure (1GB memcg limit). Under
this pressure, folios instantiated by the I/O can be aggressively
reclaimed before the delayed task can re-acquire the lock and install
the PTE, forcing retries to repeat the entire work.
To make this behavior reproducible, we constructed a stress setup that
intentionally extends this interval:

* 256-core x86 system
* 1GB memory cgroup
* 500 threads continuously faulting on a 16MB file

The core reproducer and the execution command are provided below:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <time.h>

#define THREADS 500
#define FILE_SIZE (16 * 1024 * 1024)	/* 16MB */
#define RUN_SECONDS 600

static _Atomic int g_stop = 0;

struct worker_arg {
	long id;
	uint64_t *counts;
};

void *worker(void *arg)
{
	struct worker_arg *wa = (struct worker_arg *)arg;
	long id = wa->id;
	char path[64];
	uint64_t local_rounds = 0;

	snprintf(path, sizeof(path), "./test_file_%d_%ld.dat", getpid(), id);
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
	if (fd < 0)
		return NULL;
	if (ftruncate(fd, FILE_SIZE) < 0) {
		close(fd);
		return NULL;
	}

	while (!atomic_load_explicit(&g_stop, memory_order_relaxed)) {
		char *f_map = mmap(NULL, FILE_SIZE, PROT_READ, MAP_SHARED,
				   fd, 0);
		if (f_map != MAP_FAILED) {
			/* Pure page cache thrashing */
			for (int i = 0; i < FILE_SIZE; i += 4096) {
				volatile unsigned char c =
					(unsigned char)f_map[i];
				(void)c;
			}
			munmap(f_map, FILE_SIZE);
			local_rounds++;
		}
	}
	wa->counts[id] = local_rounds;
	close(fd);
	unlink(path);
	return NULL;
}

int main(void)
{
	printf("Pure File Thrashing Started. PID: %d\n", getpid());

	pthread_t t[THREADS];
	uint64_t local_counts[THREADS];
	struct worker_arg args[THREADS];

	memset(local_counts, 0, sizeof(local_counts));
	for (long i = 0; i < THREADS; i++) {
		args[i].id = i;
		args[i].counts = local_counts;
		pthread_create(&t[i], NULL, worker, &args[i]);
	}

	sleep(RUN_SECONDS);
	atomic_store_explicit(&g_stop, 1, memory_order_relaxed);

	for (int i = 0; i < THREADS; i++)
		pthread_join(t[i], NULL);

	uint64_t total = 0;
	for (int i = 0; i < THREADS; i++)
		total += local_counts[i];

	printf("Total rounds : %llu\n", (unsigned long long)total);
	printf("Throughput   : %.2f rounds/sec\n",
	       (double)total / RUN_SECONDS);
	return 0;
}

Command line used for the test:

systemd-run --scope -p MemoryHigh=1G -p MemoryMax=1.2G -p MemorySwapMax=0 \
    --unit=mmap-thrash-$$ ./mmap_lock &
TEST_PID=$!

We also added temporary counters in page fault retries [2]:

- RETRY_IO_MISS   : folio not present after I/O completion
- RETRY_MMAP_DROP : retry fallback due to waiting for I/O

We report representative runs from our 600-second test iterations
(kernel v7.0-rc3):

| Case                | Total Rounds | Throughput | Miss/Drop(%) | RETRY_MMAP_DROP | RETRY_IO_MISS |
| ------------------- | ------------ | ---------- | ------------ | --------------- | ------------- |
| Baseline (Run 1)    | 22,711       | 37.85 /s   | 45.04        | 970,078         | 436,956       |
| Baseline (Run 2)    | 23,530       | 39.22 /s   | 44.96        | 972,043         | 437,077       |
| With Series (Run A) | 54,428       | 90.71 /s   | 1.69         | 1,204,124       | 20,398        |
| With Series (Run B) | 35,949       | 59.91 /s   | 0.03         | 327,023         | 99            |

Notes:

1. Throughput Improvement: During the 600-second testing window,
   overall workload throughput can more than double (e.g., Run A jumped
   from ~38 to 90.71 rounds/sec).

2. Elimination of Race Condition: Without the patch, ~45% of retries
   were invalid because newly fetched folios were evicted during the
   mmap_lock reacquisition delay. With the per-VMA retry path, the
   invalidation ratio plummeted to near zero (0.03% - 1.69%).

3. Counter Scaling and Variance: In Run A, because the I/O wait
   bottleneck is eliminated, the threads advance much faster. Thus, the
   absolute number of mmap_lock drops naturally scales up with the
   increased throughput. In Run B, the primary bottleneck shifts to the
   mmap write-lock contention (lock convoying), causing throughput and
   total drops to fluctuate. Crucially, the Miss/Drop ratio remains
   near zero regardless of this variance.

Without this series, almost half of the retries fail to observe
completed I/O results, causing severe CPU and I/O waste. With the
finer-grained VMA lock, the faulting threads bypass the heavily
contended mmap_lock entirely during retries, completing the fault
almost instantly.

This scenario perfectly aligns with the exact concern raised, and these
results show that the patch not only successfully eliminates the retry
inefficiency but also tangibly boosts macro-level system throughput.

[1] https://lore.kernel.org/linux-mm/aSip2mWX13sqPW_l@casper.infradead.org/
[2] https://github.com/lianux-mm/ioretry_test/

Tested-by: Wang Lian <wanglian@kylinos.cn>
Tested-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: Wang Lian <lianux.mm@gmail.com>
Reviewed-by: Kunwu Chan <kunwu.chan@gmail.com>

--
Best Regards,
wang lian

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O
  2025-11-27 19:43 ` [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O Matthew Wilcox
  2025-11-27 20:29   ` Barry Song
@ 2025-11-30  5:38   ` Shakeel Butt
  1 sibling, 0 replies; 7+ messages in thread

From: Shakeel Butt @ 2025-11-30 5:38 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Barry Song, akpm, linux-mm, linux-arm-kernel, linux-kernel,
    loongarch, linuxppc-dev, linux-riscv, linux-s390, linux-fsdevel

On Thu, Nov 27, 2025 at 07:43:22PM +0000, Matthew Wilcox wrote:
> [dropping individuals, leaving only mailing lists. please don't send
> this kind of thing to so many people in future]
>
> On Thu, Nov 27, 2025 at 12:22:16PM +0800, Barry Song wrote:
> > On Thu, Nov 27, 2025 at 12:09 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Nov 27, 2025 at 09:14:36AM +0800, Barry Song wrote:
> > > > There is no need to always fall back to mmap_lock if the per-VMA
> > > > lock was released only to wait for pagecache or swapcache to
> > > > become ready.
> > >
> > > Something I've been wondering about is removing all the "drop the MM
> > > locks while we wait for I/O" gunk.  It's a nice amount of code removed:
> >
> > I think the point is that page fault handlers should avoid holding the VMA
> > lock or mmap_lock for too long while waiting for I/O. Otherwise, those
> > writers and readers will be stuck for a while.
>
> There's a usecase some of us have been discussing off-list for a few
> weeks that our current strategy pessimises. It's a process with
> thousands (maybe tens of thousands) of threads. It has much more mapped
> files than it has memory that cgroups will allow it to use. So on a
> page fault, we drop the vma lock, allocate a page of ram, kick off the
> read, sleep waiting for the folio to come uptodate, once it is return,
> expecting the page to still be there when we reenter filemap_fault.
> But it's under so much memory pressure that it's already been reclaimed
> by the time we get back to it. So all the threads just batter the
> storage re-reading data.

I would caution against changing the kernel for such a usecase.
Actually I would call it a misconfigured system instead of a usecase.
If a workload is under so much memory pressure that its refaulted pages
are getting reclaimed, then its workingset is larger than the available
memory and it is thrashing. The only options here are to either
increase the memory limits or kill the workload and reschedule it on a
system with enough memory available.

^ permalink raw reply	[flat|nested] 7+ messages in thread
Thread overview: 7+ messages
-- links below jump to the message on this page --
[not found] <20251127011438.6918-1-21cnbao@gmail.com>
[not found] ` <aSfO7fA-04SBtTug@casper.infradead.org>
[not found] ` <CAGsJ_4zyZeLtxVe56OSYQx0OcjETw2ru1FjZjBOnTszMe_MW2g@mail.gmail.com>
2025-11-27 19:43 ` [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O Matthew Wilcox
2025-11-27 20:29 ` Barry Song
2025-11-27 21:52 ` Barry Song
2025-11-30 0:28 ` Suren Baghdasaryan
2025-11-30 2:56 ` Barry Song
2026-04-04 9:19 ` wang lian
2025-11-30 5:38 ` Shakeel Butt