[PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance

* [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
@ 2026-04-30  4:04 Barry Song (Xiaomi)
  2026-04-30  4:04 ` [PATCH v2 1/5] mm/filemap: Retry fault by VMA lock if the lock was released for I/O Barry Song (Xiaomi)
                   ` (5 more replies)
  0 siblings, 6 replies; 80+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-30  4:04 UTC (permalink / raw)
  To: akpm, linux-mm, willy
  Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, jack, pfalcato,
	wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
	kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song (Xiaomi)

Oven observed that most mmap_lock contention and priority inversion
comes from page fault retries after waiting for I/O completion.
Oven subsequently raised the following idea:

There is no need to always fall back to mmap_lock when the per-VMA lock
is released only to wait for the page cache to become ready. On a page
fault retry, the per-VMA lock can still be reused.

We believe the same should also apply to anonymous folios. However, there
is a case where I/O has completed but we fail to acquire the folio lock
because a concurrent thread may be installing PTEs for the folio. This
is expected to be short-lived, so retrying the page fault is unnecessary.

This patchset handles two cases:

(1) If we need to wait for I/O completion, we still drop the per-VMA lock, as
current page fault handling already does. Holding it for too long may introduce
various priority inversion issues on mobile devices. After I/O completes, we
retry the page fault with the per-VMA lock, rather than falling back to
mmap_lock.

(2) If I/O has already completed and the folio is up to date, the wait is
likely due to a concurrent PTE installation. In this case, we keep the
per-VMA lock and avoid retrying the page fault.
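
The two cases can be summarized as a small decision function. This is a
minimal user-space sketch, not kernel code: the enum names and plan_fault()
are hypothetical and only mirror the description above.

```c
#include <stdbool.h>

/* Hypothetical outcome names for illustration; these are not kernel symbols. */
enum fault_plan {
	PLAN_PROCEED,         /* folio lock acquired immediately */
	PLAN_WAIT_KEEP_LOCK,  /* case (2): sleep on the folio lock, keep the fault lock */
	PLAN_DROP_RETRY_VMA,  /* case (1): drop per-VMA lock, wait, retry under per-VMA lock */
	PLAN_DROP_RETRY_MMAP, /* fault ran under mmap_lock: drop it, wait, retry under mmap_lock */
};

/*
 * trylock_ok: the folio lock was free.
 * uptodate:   I/O has completed and the folio contents are valid.
 * vma_locked: the fault is being handled under the per-VMA lock.
 */
enum fault_plan plan_fault(bool trylock_ok, bool uptodate, bool vma_locked)
{
	if (trylock_ok)
		return PLAN_PROCEED;
	if (uptodate)
		return PLAN_WAIT_KEEP_LOCK;
	return vma_locked ? PLAN_DROP_RETRY_VMA : PLAN_DROP_RETRY_MMAP;
}
```

For example, plan_fault(false, false, true) yields PLAN_DROP_RETRY_VMA, the
path that previously fell back to mmap_lock.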

With (1), the dramatically reduced mmap_lock contention leads to a
significant improvement in Douyin performance. Oven’s data is shown
below.

Douyin (the Chinese version of TikTok) warm start on a smartphone with
8GB RAM.

== mmap_lock Acquisitions And Wait Time ==

Metric                    Before (Avg)    After (Avg)    Change
------------------------------------------------------------------------
Read Lock Count           20,010          5,719          -71.42%
Read Total Wait (us)      10,695,877      408,436        -96.18%
Read Avg Wait (us)        534.00          71.00          -86.70%
Write Lock Count          838             909            +8.47%
Write Total Wait (us)     501,293        97,633          -80.52%
Write Avg Wait (us)       598.00         107.00          -82.11%


== Read Lock Waiting Time Distribution of mmap_lock ==

Range (us)                 Before (Avg)    After (Avg)    Change
------------------------------------------------------------------------
[0, 1)                     9,927           4,286          -56.82%
[1, 10)                    9,179           1,327          -85.54%
[10, 100)                  191             88             -53.93%
[100, 1000)                57              6              -89.47%
[1000, 10000)              328             9              -97.26%
[10000, 100000)            328             6              -98.17%
[100000, 1000000)          0               0              N/A
[1000000, +)               0               0              N/A

== Write Lock Waiting Time Distribution of mmap_lock ==

Range (us)                 Before (Avg)    After (Avg)    Change
------------------------------------------------------------------------
[0, 1)                     250             300            +20.00%
[1, 10)                    483             556            +15.11%
[10, 100)                  52              41             -21.15%
[100, 1000)                12              5              -58.33%
[1000, 10000)              22              4              -81.82%
[10000, 100000)            16              1              -93.75%
[100000, 1000000)          0               0              N/A
[1000000, +)               0               0              N/A

After the optimization, the number of read lock acquisitions is 
significantly reduced, and both lock waiting time and tail latency are 
dramatically improved.
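
For readers checking the tables, the Change column is the relative change
from the Before average to the After average. A minimal sketch of that
arithmetic follows; pct_change() is a hypothetical helper, not part of the
measurement tooling.

```c
/* Relative change from `before` to `after`, in percent. */
double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

With the Read Lock Count row, pct_change(20010, 5719) is about -71.42, and
with the Write Lock Count row, pct_change(838, 909) is about +8.47.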

Kunwu and Lian also developed a model to capture the situation described
by Matthew [1], where a memcg with limited memory may fail to make
progress. This happens because after I/O is initiated on the first page
fault, the folios may be reclaimed by the time of the retry, leaving the
workload with little or no forward progress.

Kunwu and Lian used the following stress setup:
* 256-core x86 system
* 500 threads continuously faulting on 16MB files

The model was running within a memcg with limited memory,
as shown below:

systemd-run --scope -p MemoryHigh=1G -p MemoryMax=1.2G -p MemorySwapMax=0 \
--unit=mmap-thrash-$$ ./mmap_lock & \
TEST_PID=$!

The reproducer code is shown below:

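 #include <fcntl.h>
 #include <pthread.h>
 #include <stdatomic.h>
 #include <stdint.h>
 #include <stdio.h>
 #include <string.h>
 #include <sys/mman.h>
 #include <unistd.h>
 
 #define THREADS 500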
 #define FILE_SIZE (16 * 1024 * 1024) /* 16MB */ 
 static _Atomic int g_stop = 0; 
 #define RUN_SECONDS 600 
 
 struct worker_arg { 
         long id; 
         uint64_t *counts; 
 }; 
 
 void *worker(void *arg) 
 { 
         struct worker_arg *wa = (struct worker_arg *)arg; 
         long id = wa->id; 
         char path[64]; 
         uint64_t local_rounds = 0; 
 
         snprintf(path, sizeof(path), "./test_file_%d_%ld.dat", 
                  getpid(), id); 
         int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666); 
         if (fd < 0) return NULL; 
         if (ftruncate(fd, FILE_SIZE) < 0) { 
                 close(fd); return NULL; 
         } 
 
         while (!atomic_load_explicit(&g_stop, memory_order_relaxed)) { 
                 char *f_map = mmap(NULL, FILE_SIZE, PROT_READ, 
                                    MAP_SHARED, fd, 0); 
                 if (f_map != MAP_FAILED) { 
                         /* Pure page cache thrashing */ 
                         for (int i = 0; i < FILE_SIZE; i += 4096) { 
                                 volatile unsigned char c = 
                                         (unsigned char)f_map[i]; 
                                 (void)c; 
                         } 
                         munmap(f_map, FILE_SIZE); 
                         local_rounds++; 
                 } 
         } 
         wa->counts[id] = local_rounds; 
         close(fd); 
         unlink(path); 
         return NULL; 
 } 
 
 int main(void) 
 { 
         printf("Pure File Thrashing Started. PID: %d\n", getpid()); 
         pthread_t t[THREADS]; 
         uint64_t local_counts[THREADS]; 
         memset(local_counts, 0, sizeof(local_counts)); 
         struct worker_arg args[THREADS]; 
 
         for (long i = 0; i < THREADS; i++) { 
                 args[i].id = i; 
                 args[i].counts = local_counts; 
                 pthread_create(&t[i], NULL, worker, &args[i]); 
         } 
 
         sleep(RUN_SECONDS); 
         atomic_store_explicit(&g_stop, 1, memory_order_relaxed); 
 
         for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL); 
 
         uint64_t total = 0; 
         for (int i = 0; i < THREADS; i++) total += local_counts[i]; 
 
         printf("Total rounds     : %llu\n", (unsigned long long)total); 
         printf("Throughput       : %.2f rounds/sec\n", 
                (double)total / RUN_SECONDS); 
         return 0; 
 }

They also added temporary counters in page fault retries [2]:
- RETRY_IO_MISS   : folio not present after I/O completion
- RETRY_MMAP_DROP : retry fallback due to waiting for I/O

Their results are as follows:

| Case                | Total Rounds | Throughput | Miss/Drop(%) | RETRY_MMAP_DROP | RETRY_IO_MISS |
| ------------------- | ------------ | ---------- | ------------ | --------------- | ------------- |
| Baseline (Run 1)    | 22,711       | 37.85 /s   | 45.04        | 970,078         | 436,956       |
| Baseline (Run 2)    | 23,530       | 39.22 /s   | 44.96        | 972,043         | 437,077       |
| With Series (Run A) | 54,428       | 90.71 /s   | 1.69         | 1,204,124       | 20,398        |
| With Series (Run B) | 35,949       | 59.91 /s   | 0.03         | 327,023         | 99            |

Without this series, nearly half of the retries fail to observe completed
I/O results, leading to significant CPU and I/O waste. With the finer-
grained VMA lock, faulting threads avoid the heavily contended mmap_lock
during retries and are therefore able to complete the page fault.
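
The Miss/Drop(%) column is the share of lock-dropping retries that found the
folio gone again after the I/O completed. A minimal sketch of the ratio,
assuming it is simply RETRY_IO_MISS divided by RETRY_MMAP_DROP;
miss_per_drop_pct() is a hypothetical helper.

```c
/* Percentage of retries that missed the folio after I/O completion. */
double miss_per_drop_pct(unsigned long io_miss, unsigned long mmap_drop)
{
	if (mmap_drop == 0)
		return 0.0;	/* no retries recorded, nothing to report */
	return 100.0 * (double)io_miss / (double)mmap_drop;
}
```

Baseline Run 1 gives miss_per_drop_pct(436956, 970078), about 45.04, while
Run B with the series gives miss_per_drop_pct(99, 327023), about 0.03.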

With (2), swap-in bandwidth improves by roughly 16% (874 vs. 1017 swap-in
rounds in 30 seconds) in a model with five threads issuing MADV_PAGEOUT-based
swap-outs and five threads performing swap-ins on a 100MB anonymous mmap VMA.

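 #include <pthread.h>
 #include <stdatomic.h>
 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/mman.h>
 #include <unistd.h>
 
 #define SIZE (100 * 1024 * 1024)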
 #define PAGE_SIZE 4096
 #define WRITER_THREADS 5
 #define READER_THREADS 5
 #define RUN_SECONDS 30
 
 static uint8_t *buf;
 static atomic_ulong pageout_rounds = 0;
 static atomic_ulong swapin_rounds = 0;
 static atomic_int stop_flag = 0;
 
 static void *pageout_thread(void *arg)
 {
     (void)arg;
     while (!atomic_load(&stop_flag)) {
         if (madvise(buf, SIZE, MADV_PAGEOUT) == 0) {
             atomic_fetch_add(&pageout_rounds, 1);
         }
     }
     return NULL;
 }
 
 static void *reader_thread(void *arg)
 {
     (void)arg;
     volatile uint64_t sum = 0;
 
     while (!atomic_load(&stop_flag)) {
         for (size_t i = 0; i < SIZE; i += PAGE_SIZE) {
             sum += buf[i];
         }
         /* One full pass over 100MB, counted as one swap-in round (approximate) */
         atomic_fetch_add(&swapin_rounds, 1);
     }
     return NULL;
 }
 
 int main(void)
 {
     pthread_t writers[WRITER_THREADS];
     pthread_t readers[READER_THREADS];
 
     buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
     if (buf == MAP_FAILED) {
         exit(EXIT_FAILURE);
     }
     memset(buf, 0, SIZE);
 
     for (int i = 0; i < WRITER_THREADS; i++) {
         if (pthread_create(&writers[i], NULL, pageout_thread, NULL) != 0) {
             perror("pthread_create");
             exit(EXIT_FAILURE);
         }
     }
     for (int i = 0; i < READER_THREADS; i++) {
         if (pthread_create(&readers[i], NULL, reader_thread, NULL) != 0) {
             perror("pthread_create");
             exit(EXIT_FAILURE);
         }
     }
 
     sleep(RUN_SECONDS);
     atomic_store(&stop_flag, 1);
     for (int i = 0; i < WRITER_THREADS; i++)
         pthread_join(writers[i], NULL);
     for (int i = 0; i < READER_THREADS; i++)
         pthread_join(readers[i], NULL);
 
     printf("=== Result (30s) ===\n");
     printf("Pageout rounds: %lu\n", atomic_load(&pageout_rounds));
     printf("Swap-in rounds (approx): %lu\n", atomic_load(&swapin_rounds));
     munmap(buf, SIZE);
     return 0;
 }

Without patches:
=== Result (30s) ===
Pageout rounds: 1324847
Swap-in rounds (approx): 874

With patches:
=== Result (30s) ===
Pageout rounds: 1330550
Swap-in rounds (approx): 1017

[1] https://lore.kernel.org/linux-mm/aSip2mWX13sqPW_l@casper.infradead.org/
[2] https://github.com/lianux-mm/ioretry_test/

-v2:
  * collect tags from Pedro, Kunwu and Lian, thanks!
  * handle case (2), for uptodate folios, don't retry PF
-RFC:
  https://lore.kernel.org/linux-mm/20251127011438.6918-1-21cnbao@gmail.com/

Barry Song (Xiaomi) (4):
  mm/swapin: Retry swapin by VMA lock if the lock was released for I/O
  mm: Move folio_lock_or_retry() and drop __folio_lock_or_retry()
  mm: Don't retry page fault if folio is uptodate during swap-in
  mm/filemap: Avoid retrying page faults on uptodate folios in filemap
    faults

Oven Liyang (1):
  mm/filemap: Retry fault by VMA lock if the lock was released for I/O

 arch/arm/mm/fault.c       |  5 +++
 arch/arm64/mm/fault.c     |  5 +++
 arch/loongarch/mm/fault.c |  4 +++
 arch/powerpc/mm/fault.c   |  5 ++-
 arch/riscv/mm/fault.c     |  4 +++
 arch/s390/mm/fault.c      |  4 +++
 arch/x86/mm/fault.c       |  4 +++
 include/linux/mm_types.h  |  9 ++---
 include/linux/pagemap.h   | 17 ----------
 mm/filemap.c              | 57 ++++++-------------------------
 mm/memory.c               | 70 +++++++++++++++++++++++++++++++++++++--
 11 files changed, 114 insertions(+), 70 deletions(-)

-- 
* The work began during my collaboration with OPPO and has continued through
my current collaboration with Xiaomi. Although the OPPO collaboration has
ended, OPPO still deserves more than half of the credit for this series,
if any credit is to be assigned.

2.39.3 (Apple Git-146)


* [PATCH v2 1/5] mm/filemap: Retry fault by VMA lock if the lock was released for I/O
  2026-04-30  4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
@ 2026-04-30  4:04 ` Barry Song (Xiaomi)
  2026-04-30  4:04 ` [PATCH v2 2/5] mm/swapin: Retry swapin " Barry Song (Xiaomi)
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 80+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-30  4:04 UTC (permalink / raw)
  To: akpm, linux-mm, willy
  Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, jack, pfalcato,
	wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
	kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song

From: Oven Liyang <liyangouwen1@oppo.com>

If the current page fault is using the per-VMA lock, and we only released
the lock to wait for I/O completion (e.g., using folio_lock()), then when
the fault is retried after the I/O completes, it should still qualify for
the per-VMA-lock path.

Acked-by: Pedro Falcato <pfalcato@suse.de>
Tested-by: Wang Lian <wanglian@kylinos.cn>
Tested-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: Wang Lian <lianux.mm@gmail.com>
Reviewed-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Oven Liyang <liyangouwen1@oppo.com>
Co-developed-by: Barry Song <baohua@kernel.org>
Signed-off-by: Barry Song <baohua@kernel.org>
---
 arch/arm/mm/fault.c       | 5 +++++
 arch/arm64/mm/fault.c     | 5 +++++
 arch/loongarch/mm/fault.c | 4 ++++
 arch/powerpc/mm/fault.c   | 5 ++++-
 arch/riscv/mm/fault.c     | 4 ++++
 arch/s390/mm/fault.c      | 4 ++++
 arch/x86/mm/fault.c       | 4 ++++
 include/linux/mm_types.h  | 9 +++++----
 mm/filemap.c              | 5 ++++-
 9 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index e62cc4be5adf..5971e02845f7 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -391,6 +391,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, addr);
 	if (!vma)
 		goto lock_mmap;
@@ -420,6 +421,10 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 			goto no_context;
 		return 0;
 	}
+
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 retry:
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 739800835920..d0362a3e11b7 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -673,6 +673,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	if (!(mm_flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, addr);
 	if (!vma)
 		goto lock_mmap;
@@ -719,6 +720,10 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 			goto no_context;
 		return 0;
 	}
+
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 retry:
diff --git a/arch/loongarch/mm/fault.c b/arch/loongarch/mm/fault.c
index 2c93d33356e5..738f495560c0 100644
--- a/arch/loongarch/mm/fault.c
+++ b/arch/loongarch/mm/fault.c
@@ -219,6 +219,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs,
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, address);
 	if (!vma)
 		goto lock_mmap;
@@ -265,6 +266,9 @@ static void __kprobes __do_page_fault(struct pt_regs *regs,
 			no_context(regs, write, address);
 		return;
 	}
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 retry:
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 806c74e0d5ab..cb7ffc20c760 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -487,6 +487,7 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, address);
 	if (!vma)
 		goto lock_mmap;
@@ -516,7 +517,9 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
 
 	if (fault_signal_pending(fault, regs))
 		return user_mode(regs) ? 0 : SIGBUS;
-
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 	/* When running in the kernel we expect faults to occur only to
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 04ed6f8acae4..b94cf57c2b9a 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -347,6 +347,7 @@ void handle_page_fault(struct pt_regs *regs)
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, addr);
 	if (!vma)
 		goto lock_mmap;
@@ -376,6 +377,9 @@ void handle_page_fault(struct pt_regs *regs)
 			no_context(regs, addr);
 		return;
 	}
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 retry:
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 191cc53caead..e0576e629f65 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -294,6 +294,7 @@ static void do_exception(struct pt_regs *regs, int access)
 		flags |= FAULT_FLAG_WRITE;
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
+retry_vma:
 	vma = lock_vma_under_rcu(mm, address);
 	if (!vma)
 		goto lock_mmap;
@@ -318,6 +319,9 @@ static void do_exception(struct pt_regs *regs, int access)
 			handle_fault_error_nolock(regs, 0);
 		return;
 	}
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 retry:
 	vma = lock_mm_and_find_vma(mm, address, regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f0e77e084482..0589fc693eea 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1322,6 +1322,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, address);
 	if (!vma)
 		goto lock_mmap;
@@ -1351,6 +1352,9 @@ void do_user_addr_fault(struct pt_regs *regs,
 						 ARCH_DEFAULT_PKEY);
 		return;
 	}
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 retry:
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a308e2c23b82..5907200ea587 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1678,10 +1678,11 @@ enum vm_fault_reason {
 	VM_FAULT_NOPAGE         = (__force vm_fault_t)0x000100,
 	VM_FAULT_LOCKED         = (__force vm_fault_t)0x000200,
 	VM_FAULT_RETRY          = (__force vm_fault_t)0x000400,
-	VM_FAULT_FALLBACK       = (__force vm_fault_t)0x000800,
-	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
-	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
-	VM_FAULT_COMPLETED      = (__force vm_fault_t)0x004000,
+	VM_FAULT_RETRY_VMA      = (__force vm_fault_t)0x000800,
+	VM_FAULT_FALLBACK       = (__force vm_fault_t)0x001000,
+	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x002000,
+	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x004000,
+	VM_FAULT_COMPLETED      = (__force vm_fault_t)0x008000,
 	VM_FAULT_HINDEX_MASK    = (__force vm_fault_t)0x0f0000,
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index ab34cab2416a..a045b771e8de 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3525,6 +3525,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	struct folio *folio;
 	vm_fault_t ret = 0;
 	bool mapping_locked = false;
+	bool retry_by_vma_lock = false;
 
 	max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
 	if (unlikely(index >= max_idx))
@@ -3621,6 +3622,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	 */
 	if (fpin) {
 		folio_unlock(folio);
+		if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+			retry_by_vma_lock = true;
 		goto out_retry;
 	}
 	if (mapping_locked)
@@ -3671,7 +3674,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		filemap_invalidate_unlock_shared(mapping);
 	if (fpin)
 		fput(fpin);
-	return ret | VM_FAULT_RETRY;
+	return ret | VM_FAULT_RETRY | (retry_by_vma_lock ? VM_FAULT_RETRY_VMA : 0);
 }
 EXPORT_SYMBOL(filemap_fault);
 
-- 
2.39.3 (Apple Git-146)
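
The control flow that the arch hunks above add can be modeled in user space.
This is a simplified sketch under stated assumptions: fake_handle_fault(),
struct fault_trace and model_page_fault() are hypothetical stand-ins, the
per-VMA lock is assumed to be always obtainable, and only the two VM_FAULT
bits relevant here are modeled.

```c
#include <stdbool.h>

/* Bit values as assigned in the mm_types.h hunk of this patch. */
#define VM_FAULT_RETRY      0x000400u
#define VM_FAULT_RETRY_VMA  0x000800u

struct fault_trace {
	int vma_lock_attempts;   /* attempts made under the per-VMA lock */
	int mmap_lock_attempts;  /* attempts made under mmap_lock */
};

/*
 * Stand-in for the fault handler: the first `io_waits` attempts drop the
 * lock to wait for I/O; later attempts complete. VM_FAULT_RETRY_VMA is
 * reported only when the attempt ran under the per-VMA lock and the
 * behavior of this patch is enabled.
 */
static unsigned int fake_handle_fault(int attempt, int io_waits,
				      bool vma_locked, bool patched)
{
	if (attempt >= io_waits)
		return 0;
	return VM_FAULT_RETRY |
	       ((vma_locked && patched) ? VM_FAULT_RETRY_VMA : 0u);
}

struct fault_trace model_page_fault(int io_waits, bool patched)
{
	struct fault_trace t = { 0, 0 };
	unsigned int fault;
	int attempt = 0;

retry_vma:
	t.vma_lock_attempts++;
	fault = fake_handle_fault(attempt++, io_waits, true, patched);
	if (!(fault & VM_FAULT_RETRY))
		return t;		/* completed under the per-VMA lock */
	if (fault & VM_FAULT_RETRY_VMA)
		goto retry_vma;		/* lock was dropped only to wait for I/O */

	/* Equivalent of falling through to lock_mmap: retry under mmap_lock. */
	do {
		t.mmap_lock_attempts++;
		fault = fake_handle_fault(attempt++, io_waits, false, patched);
	} while (fault & VM_FAULT_RETRY);

	return t;
}
```

With two I/O waits, model_page_fault(2, true) makes three attempts under the
per-VMA lock and none under mmap_lock, while model_page_fault(2, false) makes
one per-VMA attempt and then two under mmap_lock.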



* [PATCH v2 2/5] mm/swapin: Retry swapin by VMA lock if the lock was released for I/O
  2026-04-30  4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
  2026-04-30  4:04 ` [PATCH v2 1/5] mm/filemap: Retry fault by VMA lock if the lock was released for I/O Barry Song (Xiaomi)
@ 2026-04-30  4:04 ` Barry Song (Xiaomi)
  2026-04-30  4:04 ` [PATCH v2 3/5] mm: Move folio_lock_or_retry() and drop __folio_lock_or_retry() Barry Song (Xiaomi)
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 80+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-30  4:04 UTC (permalink / raw)
  To: akpm, linux-mm, willy
  Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, jack, pfalcato,
	wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
	kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song (Xiaomi)

If the current do_swap_page() took the per-VMA lock and we dropped it only
to wait for I/O completion (e.g., using folio_wait_locked()), then when
do_swap_page() is retried after the I/O completes, it should still qualify
for the per-VMA-lock path.

Tested-by: Wang Lian <wanglian@kylinos.cn>
Tested-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: Wang Lian <lianux.mm@gmail.com>
Reviewed-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
 mm/memory.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 199214f8de08..00ee1599d637 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4791,6 +4791,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	unsigned long page_idx;
 	unsigned long address;
 	pte_t *ptep;
+	bool retry_by_vma_lock = false;
 
 	if (!pte_unmap_same(vmf))
 		goto out;
@@ -4896,8 +4897,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 	swapcache = folio;
 	ret |= folio_lock_or_retry(folio, vmf);
-	if (ret & VM_FAULT_RETRY)
+	if (ret & VM_FAULT_RETRY) {
+		if (fault_flag_allow_retry_first(vmf->flags) &&
+		    !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT) &&
+		    (vmf->flags & FAULT_FLAG_VMA_LOCK))
+			retry_by_vma_lock = true;
 		goto out_release;
+	}
 
 	page = folio_file_page(folio, swp_offset(entry));
 	/*
@@ -5182,7 +5188,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 	if (si)
 		put_swap_device(si);
-	return ret;
+	return ret | (retry_by_vma_lock ? VM_FAULT_RETRY_VMA : 0);
 }
 
 static bool pte_range_none(pte_t *pte, int nr_pages)
-- 
2.39.3 (Apple Git-146)
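
The condition added to do_swap_page() above can be read as a predicate on the
fault flags: VM_FAULT_RETRY_VMA is requested only when folio_lock_or_retry()
actually released the per-VMA lock in order to sleep. A minimal user-space
sketch follows; the bit positions are illustrative assumptions (the real
definitions live in include/linux/mm_types.h) and both helper names are
hypothetical.

```c
#include <stdbool.h>

/* Illustrative bit positions only; see include/linux/mm_types.h. */
#define FAULT_FLAG_ALLOW_RETRY   (1u << 2)
#define FAULT_FLAG_RETRY_NOWAIT  (1u << 3)
#define FAULT_FLAG_TRIED         (1u << 5)
#define FAULT_FLAG_VMA_LOCK      (1u << 12)

/* Models fault_flag_allow_retry_first(): retry allowed and not yet tried. */
bool allow_retry_first(unsigned int flags)
{
	return (flags & FAULT_FLAG_ALLOW_RETRY) && !(flags & FAULT_FLAG_TRIED);
}

/* True when the lock was dropped to wait and that lock was the per-VMA lock. */
bool should_retry_by_vma_lock(unsigned int flags)
{
	return allow_retry_first(flags) &&
	       !(flags & FAULT_FLAG_RETRY_NOWAIT) &&	/* NOWAIT keeps the lock held */
	       (flags & FAULT_FLAG_VMA_LOCK);
}
```

The FAULT_FLAG_RETRY_NOWAIT exclusion matters because in that case the helper
returns VM_FAULT_RETRY without releasing the lock or waiting.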



* [PATCH v2 3/5] mm: Move folio_lock_or_retry() and drop __folio_lock_or_retry()
  2026-04-30  4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
  2026-04-30  4:04 ` [PATCH v2 1/5] mm/filemap: Retry fault by VMA lock if the lock was released for I/O Barry Song (Xiaomi)
  2026-04-30  4:04 ` [PATCH v2 2/5] mm/swapin: Retry swapin " Barry Song (Xiaomi)
@ 2026-04-30  4:04 ` Barry Song (Xiaomi)
  2026-04-30  4:04 ` [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in Barry Song (Xiaomi)
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 80+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-30  4:04 UTC (permalink / raw)
  To: akpm, linux-mm, willy
  Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, jack, pfalcato,
	wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
	kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song (Xiaomi)

folio_lock_or_retry() is effectively only used in mm/memory.c,
not in the filemap code. Move it there and make it static.

The helper __folio_lock_or_retry() can be folded into
folio_lock_or_retry(), allowing it to be removed.

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
 include/linux/pagemap.h | 17 -------------
 mm/filemap.c            | 45 ----------------------------------
 mm/memory.c             | 53 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 62 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 1f50991b43e3..500ab783bf70 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1101,7 +1101,6 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
 
 void __folio_lock(struct folio *folio);
 int __folio_lock_killable(struct folio *folio);
-vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf);
 void unlock_page(struct page *page);
 void folio_unlock(struct folio *folio);
 
@@ -1198,22 +1197,6 @@ static inline int folio_lock_killable(struct folio *folio)
 	return 0;
 }
 
-/*
- * folio_lock_or_retry - Lock the folio, unless this would block and the
- * caller indicated that it can handle a retry.
- *
- * Return value and mmap_lock implications depend on flags; see
- * __folio_lock_or_retry().
- */
-static inline vm_fault_t folio_lock_or_retry(struct folio *folio,
-					     struct vm_fault *vmf)
-{
-	might_sleep();
-	if (!folio_trylock(folio))
-		return __folio_lock_or_retry(folio, vmf);
-	return 0;
-}
-
 /*
  * This is exported only for folio_wait_locked/folio_wait_writeback, etc.,
  * and should not be used directly.
diff --git a/mm/filemap.c b/mm/filemap.c
index a045b771e8de..b532d6cbafc8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1740,51 +1740,6 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
 	return ret;
 }
 
-/*
- * Return values:
- * 0 - folio is locked.
- * non-zero - folio is not locked.
- *     mmap_lock or per-VMA lock has been released (mmap_read_unlock() or
- *     vma_end_read()), unless flags had both FAULT_FLAG_ALLOW_RETRY and
- *     FAULT_FLAG_RETRY_NOWAIT set, in which case the lock is still held.
- *
- * If neither ALLOW_RETRY nor KILLABLE are set, will always return 0
- * with the folio locked and the mmap_lock/per-VMA lock is left unperturbed.
- */
-vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
-{
-	unsigned int flags = vmf->flags;
-
-	if (fault_flag_allow_retry_first(flags)) {
-		/*
-		 * CAUTION! In this case, mmap_lock/per-VMA lock is not
-		 * released even though returning VM_FAULT_RETRY.
-		 */
-		if (flags & FAULT_FLAG_RETRY_NOWAIT)
-			return VM_FAULT_RETRY;
-
-		release_fault_lock(vmf);
-		if (flags & FAULT_FLAG_KILLABLE)
-			folio_wait_locked_killable(folio);
-		else
-			folio_wait_locked(folio);
-		return VM_FAULT_RETRY;
-	}
-	if (flags & FAULT_FLAG_KILLABLE) {
-		bool ret;
-
-		ret = __folio_lock_killable(folio);
-		if (ret) {
-			release_fault_lock(vmf);
-			return VM_FAULT_RETRY;
-		}
-	} else {
-		__folio_lock(folio);
-	}
-
-	return 0;
-}
-
 /**
  * page_cache_next_miss() - Find the next gap in the page cache.
  * @mapping: Mapping.
diff --git a/mm/memory.c b/mm/memory.c
index 00ee1599d637..0c740ca363cc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4442,6 +4442,59 @@ void unmap_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+/*
+ * folio_lock_or_retry - Lock the folio, unless this would block and the
+ * caller indicated that it can handle a retry.
+ *
+ * Return values:
+ * 0 - folio is locked.
+ * non-zero - folio is not locked.
+ *     mmap_lock or per-VMA lock has been released (mmap_read_unlock() or
+ *     vma_end_read()), unless flags had both FAULT_FLAG_ALLOW_RETRY and
+ *     FAULT_FLAG_RETRY_NOWAIT set, in which case the lock is still held.
+ *
+ * If neither ALLOW_RETRY nor KILLABLE are set, will always return 0
+ * with the folio locked and the mmap_lock/per-VMA lock is left unperturbed.
+ */
+static inline vm_fault_t folio_lock_or_retry(struct folio *folio,
+					     struct vm_fault *vmf)
+{
+	unsigned int flags = vmf->flags;
+
+	might_sleep();
+	if (folio_trylock(folio))
+		return 0;
+
+	if (fault_flag_allow_retry_first(flags)) {
+		/*
+		 * CAUTION! In this case, mmap_lock/per-VMA lock is not
+		 * released even though returning VM_FAULT_RETRY.
+		 */
+		if (flags & FAULT_FLAG_RETRY_NOWAIT)
+			return VM_FAULT_RETRY;
+
+		release_fault_lock(vmf);
+		if (flags & FAULT_FLAG_KILLABLE)
+			folio_wait_locked_killable(folio);
+		else
+			folio_wait_locked(folio);
+		return VM_FAULT_RETRY;
+	}
+	if (flags & FAULT_FLAG_KILLABLE) {
+		bool ret;
+
+		ret = __folio_lock_killable(folio);
+		if (ret) {
+			release_fault_lock(vmf);
+			return VM_FAULT_RETRY;
+		}
+	} else {
+		__folio_lock(folio);
+	}
+
+	return 0;
+}
+
 /*
  * Restore a potential device exclusive pte to a working pte entry
  */
-- 
2.39.3 (Apple Git-146)



* [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in
  2026-04-30  4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
                   ` (2 preceding siblings ...)
  2026-04-30  4:04 ` [PATCH v2 3/5] mm: Move folio_lock_or_retry() and drop __folio_lock_or_retry() Barry Song (Xiaomi)
@ 2026-04-30  4:04 ` Barry Song (Xiaomi)
  2026-04-30 12:35   ` Matthew Wilcox
  2026-04-30  4:04 ` [PATCH v2 5/5] mm/filemap: Avoid retrying page faults on uptodate folios in filemap faults Barry Song (Xiaomi)
  2026-04-30 12:37 ` [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Matthew Wilcox
  5 siblings, 1 reply; 80+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-30  4:04 UTC (permalink / raw)
  To: akpm, linux-mm, willy
  Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, jack, pfalcato,
	wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
	kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song (Xiaomi)

If we are waiting for long I/O to complete, it makes sense to
avoid holding locks for too long. However, if the folio is
uptodate, we are likely only waiting for a concurrent PTE
update to finish. Retrying the entire page fault seems
excessive.

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
 mm/memory.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 0c740ca363cc..a2e4f2d87ec8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4949,6 +4949,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 
 	swapcache = folio;
+	/*
+	 * If the folio is uptodate, we are likely only waiting for
+	 * another concurrent PTE mapping to complete, which should
+	 * be brief. No need to drop the lock and retry the fault.
+	 */
+	if (folio_test_uptodate(folio))
+		vmf->flags &= ~FAULT_FLAG_ALLOW_RETRY;
 	ret |= folio_lock_or_retry(folio, vmf);
 	if (ret & VM_FAULT_RETRY) {
 		if (fault_flag_allow_retry_first(vmf->flags) &&
-- 
2.39.3 (Apple Git-146)




* Re: [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in
  2026-04-30  4:04 ` [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in Barry Song (Xiaomi)
@ 2026-04-30 12:35   ` Matthew Wilcox
  2026-05-01 16:11     ` Matthew Wilcox
  0 siblings, 1 reply; 80+ messages in thread
From: Matthew Wilcox @ 2026-04-30 12:35 UTC (permalink / raw)
  To: Barry Song (Xiaomi)
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Thu, Apr 30, 2026 at 12:04:26PM +0800, Barry Song (Xiaomi) wrote:
> If we are waiting for long I/O to complete, it makes sense to
> avoid holding locks for too long. However, if the folio is
> uptodate, we are likely only waiting for a concurrent PTE
> update to finish. Retrying the entire page fault seems
> excessive.

I think the idea is good, but the implementation is misplaced.
The check for folio_uptodate() should be inside folio_lock_or_retry()
rather than tampering with FAULT_FLAG_ALLOW_RETRY in its caller.

Similarly for your next patch.

> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
>  mm/memory.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 0c740ca363cc..a2e4f2d87ec8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4949,6 +4949,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	}
>  
>  	swapcache = folio;
> +	/*
> +	 * If the folio is uptodate, we are likely only waiting for
> +	 * another concurrent PTE mapping to complete, which should
> +	 * be brief. No need to drop the lock and retry the fault.
> +	 */
> +	if (folio_test_uptodate(folio))
> +		vmf->flags &= ~FAULT_FLAG_ALLOW_RETRY;
>  	ret |= folio_lock_or_retry(folio, vmf);
>  	if (ret & VM_FAULT_RETRY) {
>  		if (fault_flag_allow_retry_first(vmf->flags) &&
> -- 
> 2.39.3 (Apple Git-146)
> 
> 



* Re: [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in
  2026-04-30 12:35   ` Matthew Wilcox
@ 2026-05-01 16:11     ` Matthew Wilcox
  0 siblings, 0 replies; 80+ messages in thread
From: Matthew Wilcox @ 2026-05-01 16:11 UTC (permalink / raw)
  To: Barry Song (Xiaomi)
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Thu, Apr 30, 2026 at 01:35:30PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 30, 2026 at 12:04:26PM +0800, Barry Song (Xiaomi) wrote:
> > If we are waiting for long I/O to complete, it makes sense to
> > avoid holding locks for too long. However, if the folio is
> > uptodate, we are likely only waiting for a concurrent PTE
> > update to finish. Retrying the entire page fault seems
> > excessive.
> 
> I think the idea is good, but the implementation is misplaced.
> The check for folio_uptodate() should be inside folio_lock_or_retry()
> rather than tampering with FAULT_FLAG_ALLOW_RETRY in its caller.

Actually it needs to be a little more complex than this.  We
sometimes wait for writeback while holding the folio lock, and
that's a similar latency to reads (or with cheap NAND, maybe longer!)

So I think the test needs to be:

	if (folio_test_uptodate(folio) && !folio_test_writeback(folio))
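
As a rough illustration only (a userspace model, not kernel code), the
decision could live entirely in the callee and leave the caller's flags
untouched. The names struct fault_model, enum lock_outcome and
lock_or_retry_model() are hypothetical; only the uptodate/writeback
condition is taken from the test above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome codes for this sketch; not kernel definitions. */
enum lock_outcome {
	LOCK_TAKEN,		/* folio lock was free and was taken at once */
	LOCK_WAIT_INPLACE,	/* sleep on the folio lock, keep the VMA lock */
	LOCK_DROP_RETRY,	/* drop the VMA lock and retry the fault */
};

/* Hypothetical snapshot of the state consulted when the trylock fails. */
struct fault_model {
	bool folio_locked;	/* someone else holds the folio lock */
	bool uptodate;		/* models folio_test_uptodate() */
	bool writeback;		/* models folio_test_writeback() */
	bool allow_retry;	/* models FAULT_FLAG_ALLOW_RETRY, read-only here */
};

static enum lock_outcome lock_or_retry_model(const struct fault_model *m)
{
	if (!m->folio_locked)
		return LOCK_TAKEN;

	/* I/O has finished and no writeback is pending: the holder is likely brief. */
	if (m->uptodate && !m->writeback)
		return LOCK_WAIT_INPLACE;

	/* Otherwise fall back to whatever the caller allowed. */
	return m->allow_retry ? LOCK_DROP_RETRY : LOCK_WAIT_INPLACE;
}
```

With this shape, an uptodate folio under writeback still takes the
drop-and-retry path when the caller allows it, which is the case the
extra condition is meant to cover.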




* [PATCH v2 5/5] mm/filemap: Avoid retrying page faults on uptodate folios in filemap faults
  2026-04-30  4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
                   ` (3 preceding siblings ...)
  2026-04-30  4:04 ` [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in Barry Song (Xiaomi)
@ 2026-04-30  4:04 ` Barry Song (Xiaomi)
  2026-04-30 12:37 ` [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Matthew Wilcox
  5 siblings, 0 replies; 80+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-30  4:04 UTC (permalink / raw)
  To: akpm, linux-mm, willy
  Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, jack, pfalcato,
	wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
	kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song (Xiaomi)

For uptodate folios, we are not waiting on I/O. We should
be able to acquire the folio lock shortly, so there is no
need to drop per-vma locks and perform a full PF retry.

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
 mm/filemap.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index b532d6cbafc8..0d2f6af5d0fe 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3533,6 +3533,13 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		}
 	}
 
+	/*
+	 * If the folio is uptodate, we are likely only waiting for
+	 * another concurrent PTE mapping to complete, which should
+	 * be brief. No need to drop the lock and retry the fault.
+	 */
+	if (folio_test_uptodate(folio))
+		vmf->flags &= ~FAULT_FLAG_ALLOW_RETRY;
 	if (!lock_folio_maybe_drop_mmap(vmf, folio, &fpin))
 		goto out_retry;
 
-- 
2.39.3 (Apple Git-146)




* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-04-30  4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
                   ` (4 preceding siblings ...)
  2026-04-30  4:04 ` [PATCH v2 5/5] mm/filemap: Avoid retrying page faults on uptodate folios in filemap faults Barry Song (Xiaomi)
@ 2026-04-30 12:37 ` Matthew Wilcox
  2026-04-30 22:49   ` Barry Song
  2026-05-01 15:52   ` Lorenzo Stoakes
  5 siblings, 2 replies; 80+ messages in thread
From: Matthew Wilcox @ 2026-04-30 12:37 UTC (permalink / raw)
  To: Barry Song (Xiaomi)
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Thu, Apr 30, 2026 at 12:04:22PM +0800, Barry Song (Xiaomi) wrote:
> (1) If we need to wait for I/O completion, we still drop the per-VMA lock, as
> current page fault handling already does. Holding it for too long may introduce
> various priority inversion issues on mobile devices. After I/O completes, we
> retry the page fault with the per-VMA lock, rather than falling back to
> mmap_lock.

You're going to have to do better than that.  You know I hate the
additional complexity you're adding.  You need to explain why my idea of
ripping out all the complexity now that we have per-VMA locks doesn't
work.



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-04-30 12:37 ` [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Matthew Wilcox
@ 2026-04-30 22:49   ` Barry Song
  2026-05-01 14:56     ` Matthew Wilcox
  2026-05-01 15:52   ` Lorenzo Stoakes
  1 sibling, 1 reply; 80+ messages in thread
From: Barry Song @ 2026-04-30 22:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Thu, Apr 30, 2026 at 8:37 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Apr 30, 2026 at 12:04:22PM +0800, Barry Song (Xiaomi) wrote:
> > (1) If we need to wait for I/O completion, we still drop the per-VMA lock, as
> > current page fault handling already does. Holding it for too long may introduce
> > various priority inversion issues on mobile devices. After I/O completes, we
> > retry the page fault with the per-VMA lock, rather than falling back to
> > mmap_lock.
>
> You're going to have to do better than that.  You know I hate the
> additional complexity you're adding.  You need to explain why my idea of
> ripping out all the complexity now that we have per-VMA locks doesn't
> work.

Yep, I know you don’t like the added complexity, but I would rather prioritize
user experience over simplicity. Let me try to explain in more detail.

1. There is no deterministic latency for I/O completion. It depends on
both the hardware and the software stack (bio/request queues and the
block scheduler). Sometimes the latency is short; at other times it can
be quite long. In such cases, a high-priority thread performing operations
such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
for an unpredictable amount of time. For example, if low-priority tasks
trigger page faults and issue low-priority I/O, a high-priority task
requiring the write lock may end up waiting for an unknown amount of time,
depending on the block layer and filesystem behavior.

As a result, high-priority tasks are exposed to unpredictable I/O latency
introduced by many low-priority tasks that may generate a large number of
page faults.

On Android, latency in certain tasks, such as interactive threads, can
significantly affect user experience. Priority inversion is particularly
problematic and should be avoided, especially since we have no clear bound
on how long we may have to wait for I/O from other tasks.

Meanwhile, priority inversion can propagate through a long chain: a writer
waiting on I/O from multiple concurrent page faults may end up blocking other
writers and readers as well. A long-waiting writer can also amplify
mmap_lock contention, which we still rely on in many cases.

2. VMA sizes can be highly uneven: some VMAs may be very large while others are
small. We used to have many reasons to release mmap_lock when we did not have a
per-VMA lock. Since VMA sizes are not uniform, those same considerations may
still apply to the per-VMA lock when a small number of VMAs account for most
of a process’s address space. I recall that Suren also mentioned this[1].

So I would prefer that we hold only the per-VMA lock and avoid retrying the
page fault when we are reasonably sure that I/O has already completed and we
are only waiting for short-lived conditions. Uncertainties in the block layer,
filesystem, and GC behavior, as well as latency-induced priority inversion
chains and potentially amplified mmap_lock contention, can significantly hurt
Android user experience.
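
The priority-inversion argument in point 1 can be put into numbers with a
toy model. This is only a sketch under stated assumptions: the function
names and the microsecond figures in the usage note are made up for
illustration, and the model ignores scheduling and lock handoff costs.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model (assumption): a VMA writer must wait until every faulting
 * reader of that VMA has released the per-VMA lock. If readers keep the
 * lock across I/O, the wait is the longest remaining I/O; if they drop
 * it before sleeping on I/O, no I/O time is charged to the writer.
 */
static unsigned long writer_wait_us(const unsigned long *io_remaining_us,
				    size_t nr_readers, int hold_across_io)
{
	unsigned long wait = 0;
	size_t i;

	if (!hold_across_io)
		return 0;

	for (i = 0; i < nr_readers; i++)
		if (io_remaining_us[i] > wait)
			wait = io_remaining_us[i];

	return wait;
}

/*
 * Assumption, following the description above: the writer holds
 * mmap_lock for write while it waits, so a task queued behind it on
 * mmap_lock inherits that wait plus the writer's own work.
 */
static unsigned long queued_task_wait_us(unsigned long writer_wait,
					 unsigned long writer_work_us)
{
	return writer_wait + writer_work_us;
}
```

For example, with three low-priority faults that still need 200, 15000
and 900 microseconds of I/O, the model charges the writer the slowest of
the three when the lock is held across I/O, and every task queued behind
that writer pays at least the same amount.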

[1] https://lore.kernel.org/linux-mm/CAJuCfpFVQJtvbj5fV2fmm4APhNZDL1qPg-YExw7gO1pmngC3Rw@mail.gmail.com/

Thanks
Barry


* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-04-30 22:49   ` Barry Song
@ 2026-05-01 14:56     ` Matthew Wilcox
  2026-05-01 17:44       ` Barry Song
  0 siblings, 1 reply; 80+ messages in thread
From: Matthew Wilcox @ 2026-05-01 14:56 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> 1. There is no deterministic latency for I/O completion. It depends on
> both the hardware and the software stack (bio/request queues and the
> block scheduler). Sometimes the latency is short; at other times it can
> be quite long. In such cases, a high-priority thread performing operations
> such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> for an unpredictable amount of time.

But does that actually happen?  I find it hard to believe that thread A
unmaps a VMA while thread B is in the middle of taking a page fault in
that same VMA.  mprotect() and madvise() are more likely to happen, but
it still seems really unlikely to me.




* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 14:56     ` Matthew Wilcox
@ 2026-05-01 17:44       ` Barry Song
  2026-05-01 17:57         ` Matthew Wilcox
  0 siblings, 1 reply; 80+ messages in thread
From: Barry Song @ 2026-05-01 17:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > 1. There is no deterministic latency for I/O completion. It depends on
> > both the hardware and the software stack (bio/request queues and the
> > block scheduler). Sometimes the latency is short; at other times it can
> > be quite long. In such cases, a high-priority thread performing operations
> > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > for an unpredictable amount of time.
>
> But does that actually happen?  I find it hard to believe that thread A
> unmaps a VMA while thread B is in the middle of taking a page fault in
> that same VMA.  mprotect() and madvise() are more likely to happen, but
> it still seems really unlikely to me.

It doesn’t have to involve unmapping or applying mprotect to
the entire VMA—just a portion of it is sufficient.

BTW, the chain can propagate: a page fault occurs, B wants to write this
VMA, and C (a higher-priority task) wants to write another VMA. D may need
to iterate VMAs under mmap_lock, so B can end up blocking both C and D.



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 17:44       ` Barry Song
@ 2026-05-01 17:57         ` Matthew Wilcox
  2026-05-01 18:25           ` Barry Song
                             ` (2 more replies)
  0 siblings, 3 replies; 80+ messages in thread
From: Matthew Wilcox @ 2026-05-01 17:57 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > 1. There is no deterministic latency for I/O completion. It depends on
> > > both the hardware and the software stack (bio/request queues and the
> > > block scheduler). Sometimes the latency is short; at other times it can
> > > be quite long. In such cases, a high-priority thread performing operations
> > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > for an unpredictable amount of time.
> >
> > But does that actually happen?  I find it hard to believe that thread A
> > unmaps a VMA while thread B is in the middle of taking a page fault in
> > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > it still seems really unlikely to me.
> 
> It doesn’t have to involve unmapping or applying mprotect to
> the entire VMA—just a portion of it is sufficient.

Yes, but that still fails to answer "does this actually happen".  How much
performance is all this complexity in the page fault handler buying us?
If you don't answer this question, I'm just going to go in and rip it
all out.

> BTW, the chain can propagate: a page fault occurs, B wants to write this
> VMA, and C (a higher-priority task) wants to write another VMA. D may need
> to iterate VMAs under mmap_lock, so B can end up blocking both C and D.

I know.



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 17:57         ` Matthew Wilcox
@ 2026-05-01 18:25           ` Barry Song
  2026-05-01 19:39             ` Matthew Wilcox
  2026-05-03 13:13           ` Jan Kara
  2026-05-17  8:45           ` Barry Song
  2 siblings, 1 reply; 80+ messages in thread
From: Barry Song @ 2026-05-01 18:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > both the hardware and the software stack (bio/request queues and the
> > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > be quite long. In such cases, a high-priority thread performing operations
> > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > for an unpredictable amount of time.
> > >
> > > But does that actually happen?  I find it hard to believe that thread A
> > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > it still seems really unlikely to me.
> >
> > It doesn’t have to involve unmapping or applying mprotect to
> > the entire VMA—just a portion of it is sufficient.
>
> Yes, but that still fails to answer "does this actually happen".  How much
> performance is all this complexity in the page fault handler buying us?
> If you don't answer this question, I'm just going to go in and rip it
> all out.

I’m getting quite confused. In patch 4/5, you suggest a more
restrictive condition using
if (folio_test_uptodate(folio) && !folio_test_writeback(folio))
rather than if (folio_test_uptodate(folio)), before we decide to skip
retrying the page fault [1].
That seems to suggest we should be more cautious about when we can skip
retrying the page fault.

However, in the cover letter, you suggest removing all retry code entirely.
Does this suggestion apply only to file-backed page faults?

[1] https://lore.kernel.org/linux-mm/afTQl12XcXVnku9J@casper.infradead.org/

Best Regards
Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 18:25           ` Barry Song
@ 2026-05-01 19:39             ` Matthew Wilcox
  2026-05-03 20:39               ` Barry Song
  0 siblings, 1 reply; 80+ messages in thread
From: Matthew Wilcox @ 2026-05-01 19:39 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Sat, May 02, 2026 at 02:25:37AM +0800, Barry Song wrote:
> On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > Yes, but that still fails to answer "does this actually happen".  How much
> > performance is all this complexity in the page fault handler buying us?
> > If you don't answer this question, I'm just going to go in and rip it
> > all out.
> 
> I’m getting quite confused. In patch 4/5, you suggest a more
> restrictive condition using
> if (folio_test_uptodate(folio) && !folio_test_writeback(folio))
> rather than if (folio_test_uptodate(folio)), before we decide to skip
> retrying the page fault [1].
> That seems to suggest we should be more cautious about when we can skip
> retrying the page fault.
> 
> However, in the cover letter, you suggest removing all retry code entirely.
> Does this suggestion apply only to file-backed page faults?

I'm making sure that if Andrew decides to override me he at least sees
that there are other problems with this patchset beyond "I don't like
the additional complexity".

And maybe we decide to do the fallback for anon-mm but not file memory.
Or maybe it's just something somebody happens upon when reading the
mailing list (or more likely it's just grist for an AI).



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 19:39             ` Matthew Wilcox
@ 2026-05-03 20:39               ` Barry Song
  0 siblings, 0 replies; 80+ messages in thread
From: Barry Song @ 2026-05-03 20:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Sat, May 2, 2026 at 3:39 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, May 02, 2026 at 02:25:37AM +0800, Barry Song wrote:
> > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > Yes, but that still fails to answer "does this actually happen".  How much
> > > performance is all this complexity in the page fault handler buying us?
> > > If you don't answer this question, I'm just going to go in and rip it
> > > all out.

I guess the only way to answer this question is to
remove all retry code for file VMA and run a real test.
For defensive programming, I am generally very cautious
about this approach, but if this is the only way to clarify
whether we still need PF retry for file, I can give it a try
and run a complete test on Android phones after lsf/mm/bpf.

> >
> > I’m getting quite confused. In patch 4/5, you suggest a more
> > restrictive condition using
> > if (folio_test_uptodate(folio) && !folio_test_writeback(folio))
> > rather than if (folio_test_uptodate(folio)), before we decide to skip
> > retrying the page fault [1].
> > That seems to suggest we should be more cautious about when we can skip
> > retrying the page fault.
> >
> > However, in the cover letter, you suggest removing all retry code entirely.
> > Does this suggestion apply only to file-backed page faults?
>
> I'm making sure that if Andrew decides to override me he at least sees

No, I don’t want Andrew to override you unless there is a real PI
issue for file, and only if you still insist on “ripping it out”
after a thorough test with it removed.

> that there are other problems with this patchset beyond "I don't like
> the additional complexity".

The other issue you are pointing out is that, for anon, we
should be more cautious before deciding to skip PF retry,
which seems to be the opposite direction of what you expect
for file.

>
> And maybe we decide to do the fallback for anon-mm but not file memory.

I was targeting a unified approach for both file-backed
and anonymous memory. For example, if anon requires retry
under the per-VMA lock, we may already have the necessary
branch in place that file-backed cases can also leverage.
For anon cases, high-level language GCs can operate on a
small portion of a large heap requiring VMA writes, which
is fairly common, as I explained to Jan.

> Or maybe it's just something somebody happens upon when reading the
> mailing list (or more likely it's just grist for an AI).

Maybe one or two years from now. For now, at least, there are still
humans working on the kernel :-)

Best Regards
Barry


* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 17:57         ` Matthew Wilcox
  2026-05-01 18:25           ` Barry Song
@ 2026-05-03 13:13           ` Jan Kara
  2026-05-03 19:55             ` Barry Song
  2026-05-17  8:45           ` Barry Song
  2 siblings, 1 reply; 80+ messages in thread
From: Jan Kara @ 2026-05-03 13:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Barry Song, akpm, linux-mm, david, ljs, liam, vbabka, rppt,
	surenb, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Fri 01-05-26 18:57:52, Matthew Wilcox wrote:
> On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > both the hardware and the software stack (bio/request queues and the
> > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > be quite long. In such cases, a high-priority thread performing operations
> > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > for an unpredictable amount of time.
> > >
> > > But does that actually happen?  I find it hard to believe that thread A
> > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > it still seems really unlikely to me.
> > 
> > It doesn’t have to involve unmapping or applying mprotect to
> > the entire VMA—just a portion of it is sufficient.
> 
> Yes, but that still fails to answer "does this actually happen".  How much
> performance is all this complexity in the page fault handler buying us?
> If you don't answer this question, I'm just going to go in and rip it
> all out.

I fully agree with you that we should verify whether the retry code still
brings a real-world advantage today with VMA locks. After all, the retry
logic was introduced in 2010. That being said, if there are realistic loads
where one thread needs the VMA write lock while another thread is faulting
the VMA, then the latencies can indeed be extreme. For example, things like
cgroup IO throttling happen on the IO path and thus can throttle IO of a
low-priority thread for a long time.

BTW I'm not sure I quite understand Barry's priority inversion problem
since I'd expect all threads of a task to generally be treated with the
same priority...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-03 13:13           ` Jan Kara
@ 2026-05-03 19:55             ` Barry Song
  2026-05-04 13:03               ` Jan Kara
  0 siblings, 1 reply; 80+ messages in thread
From: Barry Song @ 2026-05-03 19:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, akpm, linux-mm, david, ljs, liam, vbabka, rppt,
	surenb, mhocko, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Mon, May 4, 2026 at 2:17 AM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 01-05-26 18:57:52, Matthew Wilcox wrote:
> > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > both the hardware and the software stack (bio/request queues and the
> > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > for an unpredictable amount of time.
> > > >
> > > > But does that actually happen?  I find it hard to believe that thread A
> > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > it still seems really unlikely to me.
> > >
> > > It doesn’t have to involve unmapping or applying mprotect to
> > > the entire VMA—just a portion of it is sufficient.
> >
> > Yes, but that still fails to answer "does this actually happen".  How much
> > performance is all this complexity in the page fault handler buying us?
> > If you don't answer this question, I'm just going to go in and rip it
> > all out.
>
> I fully agree with you that we should verify whether the retry code still
> brings a real-world advantage today with VMA locks. After all, the retry
> logic was introduced in 2010. That being said, if there are realistic loads
> where one thread needs the VMA write lock while another thread is faulting
> the VMA, then the latencies can indeed be extreme. For example, things like
> cgroup IO throttling happen on the IO path and thus can throttle IO of a
> low-priority thread for a long time.

I’m quite sure that swap-in and VMA writes can occur
concurrently, and this is fairly common. For example,
Java GC may use mprotect or userfaultfd on a small
portion of a large Java heap while other portions are
still under do_swap_page().
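
To show the userspace side of this pattern, here is a minimal sketch of
changing protection on one page inside a larger anonymous mapping. The
function name protect_one_page() and the sizes in the usage note are
hypothetical; the calls themselves (mmap, mprotect, munmap, sysconf) are
standard. On Linux, an mprotect() over part of a mapping typically splits
it into separate VMAs, so it is a write-side operation on the VMA even
though only one page changes.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map nr_pages anonymous pages, fault them all in, then make only the
 * page at index `victim` read-only. Returns 0 on success, -1 on failure.
 */
static int protect_one_page(size_t nr_pages, size_t victim)
{
	long psz = sysconf(_SC_PAGESIZE);
	size_t page, len, i;
	char *heap;
	int ret;

	if (psz <= 0 || nr_pages == 0 || victim >= nr_pages)
		return -1;
	page = (size_t)psz;
	len = nr_pages * page;

	heap = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (heap == MAP_FAILED)
		return -1;

	/* Touch every page so each one takes a page fault up front. */
	for (i = 0; i < nr_pages; i++)
		heap[i * page] = 1;

	/* Only a small portion of the "heap" changes protection. */
	ret = mprotect(heap + victim * page, page, PROT_READ);

	/* A different page of the same mapping stays writable. */
	if (ret == 0 && nr_pages > 1)
		heap[((victim + 1) % nr_pages) * page] = 2;

	munmap(heap, len);
	return ret;
}
```

Calling protect_one_page(16, 5) models a collector protecting one page of
a 16-page region; in a real workload other threads could be faulting on
the remaining pages of the same mapping at that moment.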

If we start exploring different approaches for anon and
file, I agree I can revisit this on an Android phone if
there is a real, serious case where a file VMA can be
written and a page fault occurs at the same time.

Please note that, as an Android developer, I am particularly
cautious about priority inversion. A recent issue causing
severe priority inversion is zram attempting to support
preemption[1]. When a task performing compression or
decompression is migrated to another CPU and then preempted
by other tasks, high-priority tasks waiting on the mutex may
be significantly delayed, impacting user experience.

>
> BTW I'm not sure I quite understand Barry's priority inversion problem
> since I'd expect all threads of a task to generally be treated with the
> same priority...

Not quite. Maybe these slides[2] and this project[3] can give
you a hint—they aim to standardize things on Linux by
learning from Apple OS. Basically, tasks are classified
into five types:

USER_INTERACTIVE: Requires immediate response.
USER_INITIATED: Tolerates a short delay, but must still respond quickly.
UTILITY: Tolerates long delays, but not prolonged ones.
BACKGROUND: Doesn’t mind prolonged delays.
DEFAULT: System default behavior.

[1] https://lore.kernel.org/linux-mm/20250303022425.285971-3-senozhatsky@chromium.org/
[2] https://lpc.events/event/19/contributions/2089/attachments/1797/3877/Userspace%20Assisted%20Scheduling%20via%20Sched%20QoS.pdf
[3] https://lore.kernel.org/lkml/20260415000910.2h5misvwc45bdumu@airbuntu/

Thanks
Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-03 19:55             ` Barry Song
@ 2026-05-04 13:03               ` Jan Kara
  2026-05-04 13:35                 ` Barry Song
  2026-05-04 14:15                 ` Barry Song
  0 siblings, 2 replies; 80+ messages in thread
From: Jan Kara @ 2026-05-04 13:03 UTC (permalink / raw)
  To: Barry Song
  Cc: Jan Kara, Matthew Wilcox, akpm, linux-mm, david, ljs, liam,
	vbabka, rppt, surenb, mhocko, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390

On Mon 04-05-26 03:55:43, Barry Song wrote:
> On Mon, May 4, 2026 at 2:17 AM Jan Kara <jack@suse.cz> wrote:
> > On Fri 01-05-26 18:57:52, Matthew Wilcox wrote:
> > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > for an unpredictable amount of time.
> > > > >
> > > > > But does that actually happen?  I find it hard to believe that thread A
> > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > > it still seems really unlikely to me.
> > > >
> > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > the entire VMA—just a portion of it is sufficient.
> > >
> > > Yes, but that still fails to answer "does this actually happen".  How much
> > > performance is all this complexity in the page fault handler buying us?
> > > If you don't answer this question, I'm just going to go in and rip it
> > > all out.
> >
> > I fully agree with you we should verify whether the retry code still brings
> > in real-world advantage today with VMA locks. After all the retry logic has
> > been introduced in 2010. That being said if there are realistic loads where
> > one thread needs VMA write lock while another thread is faulting the VMA,
> > then the latencies can be indeed extreme. For example things like cgroup IO
> > throttling happen on the IO path and thus can throttle IO of a low-priority
> > thread for a long time.
> 
> I’m quite sure that swap-in and VMA writes can occur
> concurrently, and this is fairly common. For example,
> Java GC may use mprotect or userfaultfd on a small
> portion of a large Java heap while other portions are
> still under do_swap_page().

OK, makes sense.

> If we start exploring different approaches for anon and
> file, I agree I can revisit this on an Android phone if
> there is a real, serious case where a file VMA can be
> written and a page fault occurs at the same time.
> 
> Please note that, as an Android developer, I am particularly
> cautious about priority inversion. A recent issue causing
> severe priority inversion is zram attempting to support
> preemption[1]. When a task performing compression or
> decompression is migrated to another CPU and then preempted
> by other tasks, high-priority tasks waiting on the mutex may
> be significantly delayed, impacting user experience.

Well, container people are concerned about priority inversion as well. But
usually this is with coarse locks (such as global filesystem locks), whereas
the VMA lock is specific to a task (and a VMA), so the opportunity for
priority inversion there looks more limited.  But the example with Java, where
the GC thread can presumably have higher priority than ordinary Java threads,
is an interesting one.

> > BTW I'm not sure I quite understand Barry's priority inversion problem
> > since I'd expect all threads of a task to generally be treated with the
> > same priority...
> 
> Exactly not. Maybe these slides[2] and this project[3] can give
> you a hint—they aim to standardize things on Linux by
> learning from Apple OS. Basically, tasks are classified
> into five types:
> 
> USER_INTERACTIVE: Requires immediate response.
> USER_INITIATED: Tolerates a short delay, but must respond quickly still.
> UTILITY: Tolerates long delays, but not prolonged ones.
> BACKGROUND: Doesn’t mind prolonged delays.
> DEFAULT: System default behavior.

Again, this is a classification of tasks but not really of threads in a task,
so at least for the VMA lock there's no inversion to have?

								Honza

> [1] https://lore.kernel.org/linux-mm/20250303022425.285971-3-senozhatsky@chromium.org/
> [2] https://lpc.events/event/19/contributions/2089/attachments/1797/3877/Userspace%20Assisted%20Scheduling%20via%20Sched%20QoS.pdf
> [3] https://lore.kernel.org/lkml/20260415000910.2h5misvwc45bdumu@airbuntu/
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-04 13:03               ` Jan Kara
@ 2026-05-04 13:35                 ` Barry Song
  2026-05-04 14:15                 ` Barry Song
  1 sibling, 0 replies; 80+ messages in thread
From: Barry Song @ 2026-05-04 13:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, akpm, linux-mm, david, ljs, liam, vbabka, rppt,
	surenb, mhocko, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Mon, May 4, 2026 at 9:04 PM Jan Kara <jack@suse.cz> wrote:
[...]
>
> > > BTW I'm not sure I quite understand Barry's priority inversion problem
> > > since I'd expect all threads of a task to generally be treated with the
> > > same priority...
> >
> > Exactly not. Maybe these slides[2] and this project[3] can give
> > you a hint—they aim to standardize things on Linux by
> > learning from Apple OS. Basically, tasks are classified
> > into five types:
> >
> > USER_INTERACTIVE: Requires immediate response.
> > USER_INITIATED: Tolerates a short delay, but must respond quickly still.
> > UTILITY: Tolerates long delays, but not prolonged ones.
> > BACKGROUND: Doesn’t mind prolonged delays.
> > DEFAULT: System default behavior.
>
> Again, this is a classification of tasks but not really of threads in a task,
> so at least for the VMA lock there's no inversion to have?

I’m specifically referring to a task (i.e., a thread) when
discussing scheduler context. It may be clearer to use the
terms process and thread explicitly.

In a typical process sharing an mm_struct, each thread can
have a different priority.

In an Android app, some threads handle the UI and require
higher priority, such as the main thread and RenderThread;
otherwise, frame drops may occur.

The Linux scheduler can control scheduling policy and
priority for each thread.
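
A minimal userspace sketch of that last point, assuming Linux, where
the nice value is a per-thread attribute and `getpriority()` /
`setpriority()` accept a thread ID. The helper names (`nice_of`,
`background_worker`, `demo_per_thread_nice`) are hypothetical and only
serve the illustration. A worker thread lowers its own priority while
the thread that spawned it keeps its nice value, even though both
share one mm_struct.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Read the nice value of one thread (Linux treats the id as a TID). */
static int nice_of(pid_t tid)
{
    errno = 0;
    int prio = getpriority(PRIO_PROCESS, (id_t)tid);
    assert(!(prio == -1 && errno != 0));
    return prio;
}

/* Worker: raise its own nice value by 5 (capped at 19), i.e. lower priority. */
static void *background_worker(void *out)
{
    pid_t tid = (pid_t)syscall(SYS_gettid);
    int target = nice_of(tid) + 5;

    if (target > 19)
        target = 19;
    int rc = setpriority(PRIO_PROCESS, (id_t)tid, target);
    assert(rc == 0);
    *(int *)out = nice_of(tid);
    return NULL;
}

struct nice_pair {
    int main_before;  /* caller's nice value before the worker ran */
    int main_after;   /* caller's nice value after the worker exited */
    int worker;       /* nice value the worker ended up with */
};

static struct nice_pair demo_per_thread_nice(void)
{
    struct nice_pair res;
    pid_t self = (pid_t)syscall(SYS_gettid);
    pthread_t worker;
    int rc;

    res.main_before = nice_of(self);
    rc = pthread_create(&worker, NULL, background_worker, &res.worker);
    assert(rc == 0);
    pthread_join(worker, NULL);
    res.main_after = nice_of(self);
    return res;
}
```

Calling `demo_per_thread_nice()` should leave `main_after` equal to
`main_before`, while `worker` is no lower than either. Real-time
policies set with `sched_setscheduler()` or `pthread_setschedparam()`
are per-thread in the same way, but usually need extra privileges.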

Thanks
Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-04 13:03               ` Jan Kara
  2026-05-04 13:35                 ` Barry Song
@ 2026-05-04 14:15                 ` Barry Song
  1 sibling, 0 replies; 80+ messages in thread
From: Barry Song @ 2026-05-04 14:15 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, akpm, linux-mm, david, ljs, liam, vbabka, rppt,
	surenb, mhocko, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Mon, May 4, 2026 at 9:04 PM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 04-05-26 03:55:43, Barry Song wrote:
> > On Mon, May 4, 2026 at 2:17 AM Jan Kara <jack@suse.cz> wrote:
> > > On Fri 01-05-26 18:57:52, Matthew Wilcox wrote:
> > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > for an unpredictable amount of time.
> > > > > >
> > > > > > But does that actually happen?  I find it hard to believe that thread A
> > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > > > it still seems really unlikely to me.
> > > > >
> > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > the entire VMA—just a portion of it is sufficient.
> > > >
> > > > Yes, but that still fails to answer "does this actually happen".  How much
> > > > performance is all this complexity in the page fault handler buying us?
> > > > If you don't answer this question, I'm just going to go in and rip it
> > > > all out.
> > >
> > > I fully agree with you we should verify whether the retry code still brings
> > > in real-world advantage today with VMA locks. After all the retry logic has
> > > been introduced in 2010. That being said if there are realistic loads where
> > > one thread needs VMA write lock while another thread is faulting the VMA,
> > > then the latencies can be indeed extreme. For example things like cgroup IO
> > > throttling happen on the IO path and thus can throttle IO of a low-priority
> > > thread for a long time.
> >
> > I’m quite sure that swap-in and VMA writes can occur
> > concurrently, and this is fairly common. For example,
> > Java GC may use mprotect or userfaultfd on a small
> > portion of a large Java heap while other portions are
> > still under do_swap_page().
>
> OK, makes sense.
>
> > If we start exploring different approaches for anon and
> > file, I agree I can revisit this on an Android phone if
> > there is a real, serious case where a file VMA can be
> > written and a page fault occurs at the same time.
> >
> > Please note that, as an Android developer, I am particularly
> > cautious about priority inversion. A recent issue causing
> > severe priority inversion is zram attempting to support
> > preemption[1]. When a task performing compression or
> > decompression is migrated to another CPU and then preempted
> > by other tasks, high-priority tasks waiting on the mutex may
> > be significantly delayed, impacting user experience.
>
> Well, container people are concerned about priority inversion as well. But
> usually this is with coarse lock (such as global filesystem locks) but VMA
> lock is specific to a task (and a VMA) so there the opportunity for
> priority inversion looks more limited.  But the example with Java where GC
> thread can presumably have higher priority than ordinary Java threads is an
> interesting one.

A major difference in Android apps is that each thread can
affect user experience differently. And it is not simply a matter
of whether a VMA writer has higher or lower priority than a
page-fault (PF) thread performing I/O.

For example, thread A handles a PF; thread B attempts to
modify the VMA where the PF occurs; thread C tries to modify
another VMA (requiring mmap_lock in write mode) or iterate
VMAs (requiring mmap_lock in read mode). Regardless of
thread B’s priority, it holds mmap_lock in write mode while
waiting for the VMA lock. The usual pattern for a VMA writer
is:

mmap_write_lock()
vma_start_write()

As a result, thread C can be blocked even if it has higher
priority but operates on a different VMA.

In essence, when a PF and a VMA write occur concurrently,
high-priority threads may be blocked even if they operate on
different VMAs, not necessarily the same one.

Thanks
Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 17:57         ` Matthew Wilcox
  2026-05-01 18:25           ` Barry Song
  2026-05-03 13:13           ` Jan Kara
@ 2026-05-17  8:45           ` Barry Song
  2026-05-18  9:46             ` Lorenzo Stoakes
                               ` (2 more replies)
  2 siblings, 3 replies; 80+ messages in thread
From: Barry Song @ 2026-05-17  8:45 UTC (permalink / raw)
  To: Matthew Wilcox, surenb
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, mhocko, jack,
	pfalcato, wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1,
	chrisl, kasong, shikemeng, nphamcs, bhe, youngjun.park,
	linux-arm-kernel, linux-kernel, loongarch, linuxppc-dev,
	linux-riscv, linux-s390, Nanzhe Zhao

On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > both the hardware and the software stack (bio/request queues and the
> > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > be quite long. In such cases, a high-priority thread performing operations
> > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > for an unpredictable amount of time.
> > >
> > > But does that actually happen?  I find it hard to believe that thread A
> > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > it still seems really unlikely to me.
> >
> > It doesn’t have to involve unmapping or applying mprotect to
> > the entire VMA—just a portion of it is sufficient.
>
> Yes, but that still fails to answer "does this actually happen".  How much
> performance is all this complexity in the page fault handler buying us?
> If you don't answer this question, I'm just going to go in and rip it
> all out.
>

Hi Matthew (and Lorenzo, Jan, and anyone else who may be
waiting for answers),

As promised during LSF/MM/BPF, we conducted thorough
testing on Android phones to determine whether performing
I/O in `filemap_fault()` can block `vma_start_write()`.
I wanted to give a quick update on this question.

Nanzhe at Xiaomi created tracing scripts and ran various
applications on Android devices with I/O performed under
the VMA lock in `filemap_fault()`. We found that:

1. There are very few cases where unmap() is blocked by
   page faults. I assume this is due to buggy user code
   or poor synchronization between reads and unmap().
   So I assume it is not a problem.

2. We observed many cases where `vma_start_write()`
   is blocked by page-fault I/O in some applications.
   The blocking occurs in the `dup_mmap()` path during
   fork().

With Suren's commit fb49c455323ff ("fork: lock VMAs of
the parent process when forking"), fork() now always
write-locks each VMA via `vma_start_write()`. Note that the
`mmap_lock` write lock is also held, which could lead to
chained waiting if page-fault I/O is performed without
releasing the VMA lock.

My gut feeling is that Suren's commit may be overshooting,
so my rough idea is that we might want to do something like
the following (we haven't tested it yet and it might be
wrong):

diff --git a/mm/mmap.c b/mm/mmap.c
index 2311ae7c2ff4..5ddaf297f31a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
        for_each_vma(vmi, mpnt) {
                struct file *file;

-               retval = vma_start_write_killable(mpnt);
+               /*
+                * For anonymous or writable private VMAs, prevent
+                * concurrent CoW faults.
+                */
+               if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
+                                       (mpnt->vm_flags & VM_WRITE)))
+                       retval = vma_start_write_killable(mpnt);
                if (retval < 0)
                        goto loop_out;
                if (mpnt->vm_flags & VM_DONTCOPY) {

Based on the above, we may want to re-check whether fork()
can be blocked by page faults. At the same time, if Suren,
you, or anyone else has any comments, please feel free to
share them.

Best Regards
Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-17  8:45           ` Barry Song
@ 2026-05-18  9:46             ` Lorenzo Stoakes
  2026-05-18 11:25               ` Barry Song
  2026-05-18  9:53             ` David Hildenbrand (Arm)
  2026-05-18 21:21             ` Yang Shi
  2 siblings, 1 reply; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-18  9:46 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, liam, vbabka, rppt,
	mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > both the hardware and the software stack (bio/request queues and the
> > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > for an unpredictable amount of time.
> > > >
> > > > But does that actually happen?  I find it hard to believe that thread A
> > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > it still seems really unlikely to me.
> > >
> > > It doesn’t have to involve unmapping or applying mprotect to
> > > the entire VMA—just a portion of it is sufficient.
> >
> > Yes, but that still fails to answer "does this actually happen".  How much
> > performance is all this complexity in the page fault handler buying us?
> > If you don't answer this question, I'm just going to go in and rip it
> > all out.
> >
>
> Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> waiting for answers),
>
> As promised during LSF/MM/BPF, we conducted thorough
> testing on Android phones to determine whether performing
> I/O in `filemap_fault()` can block `vma_start_write()`.
> I wanted to give a quick update on this question.
>
> Nanzhe at Xiaomi created tracing scripts and ran various
> applications on Android devices with I/O performed under
> the VMA lock in `filemap_fault()`. We found that:
>
> 1. There are very few cases where unmap() is blocked by
>    page faults. I assume this is due to buggy user code
>    or poor synchronization between reads and unmap().
> So I assume it is not a problem.
>
> 2. We observed many cases where `vma_start_write()`
>    is blocked by page-fault I/O in some applications.
>    The blocking occurs in the `dup_mmap()` path during
>    fork().
>
> With Suren's commit fb49c455323ff ("fork: lock VMAs of
> the parent process when forking"), we now always hold
> `vma_write_lock()` for each VMA. Note that the
> `mmap_lock` write lock is also held, which could lead to
> chained waiting if page-fault I/O is performed without
> releasing the VMA lock.

Hm but did you observe this 'chained waiting'? And what were the latencies?

>
> My gut feeling is that Suren's commit may be overshooting,
> so my rough idea is that we might want to do something like
> the following (we haven't tested it yet and it might be
> wrong):

Yeah I'm really not sure about that.

Prior to the VMA locks, the mmap write lock would have guaranteed no concurrent
page faults, which is really what fb49c455323ff is about.

So Suren's patch was essentially restoring the _existing_ forking behaviour, and
now you're saying 'let's change the forking behaviour that's been like that for
forever'.

I think you would _really_ have to be sure that's safe. And forking is a very
dangerous time in terms of complexity and sensitivity and 'weird stuff'
happening so I'd tread _very_ carefully here.

>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2311ae7c2ff4..5ddaf297f31a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> *mm, struct mm_struct *oldmm)
>         for_each_vma(vmi, mpnt) {
>                 struct file *file;
>
> -               retval = vma_start_write_killable(mpnt);
> +               /*
> +                * For anonymous or writable private VMAs, prevent
> +                * concurrent CoW faults.
> +                */

To nitpick, I think the comment's confusing, but it also tells you that you
don't need the specific anon check: writable private is sufficient. And it's
not really just CoW that's the issue, it's anon_vma population _at all_ as
well as CoW.

> +               if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> +                                       (mpnt->vm_flags & VM_WRITE)))
> +                       retval = vma_start_write_killable(mpnt);

I think this has to be VM_MAYWRITE, because somebody could otherwise mprotect()
it R/W.

I also don't understand why !mpnt->vm_file for a read-only anon mapping (more
likely PROT_NONE) is here, just do the second check?

(Also please use the new interface, so !vma_test(mpnt, VMA_SHARED_BIT) &&
vma_test(mpnt, VMA_MAYWRITE_BIT))
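
The difference between the two predicates can be sketched in plain
userspace C. Everything here is an assumption made for illustration:
the flag values are copied from include/linux/mm.h as I remember them,
the helper names are hypothetical, and the sketch sticks to raw flag
words rather than the vma_test() interface mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror include/linux/mm.h; illustrative only. */
#define VM_READ     0x00000001UL
#define VM_WRITE    0x00000002UL
#define VM_SHARED   0x00000008UL
#define VM_MAYREAD  0x00000010UL
#define VM_MAYWRITE 0x00000020UL

/* Condition as posted: anonymous, or private and currently writable. */
static bool lock_needed_vm_write(unsigned long flags, bool has_file)
{
    return !has_file || (!(flags & VM_SHARED) && (flags & VM_WRITE));
}

/* Suggested condition: private and allowed to become writable. */
static bool lock_needed_vm_maywrite(unsigned long flags)
{
    return !(flags & VM_SHARED) && (flags & VM_MAYWRITE);
}

/* Walk through the cases discussed in the review comments. */
static void demo_flag_checks(void)
{
    /* Private file mapping, currently read-only, mprotect() to R/W allowed. */
    unsigned long ro_private_file = VM_READ | VM_MAYREAD | VM_MAYWRITE;
    /* Private anonymous PROT_NONE mapping (typically still VM_MAYWRITE). */
    unsigned long none_private_anon = VM_MAYREAD | VM_MAYWRITE;
    /* Shared writable file mapping: no CoW, no anon_vma population. */
    unsigned long rw_shared_file = VM_READ | VM_WRITE | VM_SHARED |
                                   VM_MAYREAD | VM_MAYWRITE;

    /* The VM_WRITE test misses the mapping that can later be made R/W. */
    assert(!lock_needed_vm_write(ro_private_file, true));
    assert(lock_needed_vm_maywrite(ro_private_file));

    /* The VM_MAYWRITE test already covers the anonymous case. */
    assert(lock_needed_vm_write(none_private_anon, false));
    assert(lock_needed_vm_maywrite(none_private_anon));

    /* Both versions skip the shared mapping. */
    assert(!lock_needed_vm_write(rw_shared_file, true));
    assert(!lock_needed_vm_maywrite(rw_shared_file));
}
```

The first pair of assertions is the mprotect() case: the mapping is
skipped by the VM_WRITE test although it can later be made writable.
The second pair shows why the separate anon check adds nothing once
VM_MAYWRITE is used, at least for mappings that carry that flag.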

>                 if (retval < 0)
>                         goto loop_out;
>                 if (mpnt->vm_flags & VM_DONTCOPY) {
>
> Based on the above, we may want to re-check whether fork()
> can be blocked by page faults. At the same time, if Suren,
> you, or anyone else has any comments, please feel free to
> share them.
>
> Best Regards
> Barry

Technical commentary above is sort of 'just cos' :) because I really question
doing this honestly.

I'd also like to get Suren's input, however.

Thanks, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-18  9:46             ` Lorenzo Stoakes
@ 2026-05-18 11:25               ` Barry Song
  2026-05-18 16:17                 ` Matthew Wilcox
                                   ` (2 more replies)
  0 siblings, 3 replies; 80+ messages in thread
From: Barry Song @ 2026-05-18 11:25 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, liam, vbabka, rppt,
	mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > for an unpredictable amount of time.
> > > > >
> > > > > But does that actually happen?  I find it hard to believe that thread A
> > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > > it still seems really unlikely to me.
> > > >
> > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > the entire VMA—just a portion of it is sufficient.
> > >
> > > Yes, but that still fails to answer "does this actually happen".  How much
> > > performance is all this complexity in the page fault handler buying us?
> > > If you don't answer this question, I'm just going to go in and rip it
> > > all out.
> > >
> >
> > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > waiting for answers),
> >
> > As promised during LSF/MM/BPF, we conducted thorough
> > testing on Android phones to determine whether performing
> > I/O in `filemap_fault()` can block `vma_start_write()`.
> > I wanted to give a quick update on this question.
> >
> > Nanzhe at Xiaomi created tracing scripts and ran various
> > applications on Android devices with I/O performed under
> > the VMA lock in `filemap_fault()`. We found that:
> >
> > 1. There are very few cases where unmap() is blocked by
> >    page faults. I assume this is due to buggy user code
> >    or poor synchronization between reads and unmap().
> > So I assume it is not a problem.
> >
> > 2. We observed many cases where `vma_start_write()`
> >    is blocked by page-fault I/O in some applications.
> >    The blocking occurs in the `dup_mmap()` path during
> >    fork().
> >
> > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > the parent process when forking"), we now always hold
> > `vma_write_lock()` for each VMA. Note that the
> > `mmap_lock` write lock is also held, which could lead to
> > chained waiting if page-fault I/O is performed without
> > releasing the VMA lock.
>
> Hm but did you observe this 'chained waiting'? And what were the latencies?

We have clearly observed that the `fork()` operations of many
popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
end up waiting on page-fault (PF) I/O when the VMA lock is
held during I/O operations. This has already become a
practical issue. I also believe this can lead to chained
waiting, since the global `mmap_lock` blocks all threads that
need to acquire it.


>
> >
> > My gut feeling is that Suren's commit may be overshooting,
> > so my rough idea is that we might want to do something like
> > the following (we haven't tested it yet and it might be
> > wrong):
>
> Yeah I'm really not sure about that.
>
> Prior to the VMA locks, the mmap write lock would have guaranteed no concurrent
> page faults, which is really what fb49c455323ff is about.
>
> So Suren's patch was essentially restoring the _existing_ forking behaviour, and
> now you're saying 'let's change the forking behaviour that's been like that for
> forever'.


I am afraid not. Before we introduced the per-VMA lock, we
were not performing I/O while holding `mmap_lock`. A page fault
that needed I/O would drop the `mmap_lock` read lock and allow
`fork()` to proceed.

Now, you are suggesting performing I/O while holding the VMA
lock, which changes the requirements and introduces this
problem.

>
> I think you would _really_ have to be sure that's safe. And forking is a very
> dangerous time in terms of complexity and sensitivity and 'weird stuff'
> happening so I'd tread _very_ carefully here.

Yep. I think my original proposal did not require any changes
to `fork()`, since it simply preserved the current behavior of
dropping the VMA lock before performing I/O. In that model,
`fork()` would not end up waiting on I/O at all.

What you are suggesting now appears to be performing I/O while
holding the VMA lock, which in turn introduces the need to
change `fork()`.

>
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 2311ae7c2ff4..5ddaf297f31a 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > *mm, struct mm_struct *oldmm)
> >         for_each_vma(vmi, mpnt) {
> >                 struct file *file;
> >
> > -               retval = vma_start_write_killable(mpnt);
> > +               /*
> > +                * For anonymous or writable private VMAs, prevent
> > +                * concurrent CoW faults.
> > +                */
>
> To nit pick I think the comment's confusing but also tells you you don't need to
> specific anon check - writable private is sufficient. And it's not really just
> CoW that's the issue, it's anon_vma population _at all_ as well as CoW.
>
> > +               if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > +                                       (mpnt->vm_flags & VM_WRITE)))
> > +                       retval = vma_start_write_killable(mpnt);
>
> I think this has to be VM_MAYWRITE, because somebody could otherwise mprotect()
> it R/W.
>
> I also don't understand why !mpnt->vm_file for a read-only anon mapping (more
> likely PROT_NONE) is here, just do the second check?
>
> (Also please use the new interface, so !vma_test(mpnt, VMA_SHARED_BIT) &&
> vma_test(mpnt, VMA_MAYWRITE_BIT))

Yep, I can definitely refine the check further. But before
doing that, I'd first like to confirm that we are aligned on
the direction.

If you still intend to hold the VMA lock while performing I/O,
then I think we should fix `fork()` to avoid taking
`vma_start_write()`.

>
> >                 if (retval < 0)
> >                         goto loop_out;
> >                 if (mpnt->vm_flags & VM_DONTCOPY) {
> >
> > Based on the above, we may want to re-check whether fork()
> > can be blocked by page faults. At the same time, if Suren,
> > you, or anyone else has any comments, please feel free to
> > share them.
> >
> > Best Regards
> > Barry
>
> Technical commentary above is sort of 'just cos' :) because I really question
> doing this honestly.

I think we either need to fix `fork()`, or keep the current
behavior of dropping the VMA lock before performing I/O.

>
> I'd also like to get Suren's input, however.

Yes, of course.

>
> Thanks, Lorenzo

Best Regards
Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-18 11:25               ` Barry Song
@ 2026-05-18 16:17                 ` Matthew Wilcox
  2026-05-18 20:50                   ` Barry Song
  2026-05-18 19:56                 ` Suren Baghdasaryan
  2026-05-19 12:43                 ` Lorenzo Stoakes
  2 siblings, 1 reply; 80+ messages in thread
From: Matthew Wilcox @ 2026-05-18 16:17 UTC (permalink / raw)
  To: Barry Song
  Cc: Lorenzo Stoakes, surenb, akpm, linux-mm, david, liam, vbabka,
	rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Mon, May 18, 2026 at 07:25:54PM +0800, Barry Song wrote:
> We have clearly observed that the `fork()` operations of many
> popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
> end up waiting on page-fault (PF) I/O when the VMA lock is
> held during I/O operations. This has already become a
> practical issue. I also believe this can lead to chained
> waiting, since the global `mmap_lock` blocks all threads that
> need to acquire it.

It's always been a terrible idea to call fork() from a multithreaded
application.  For example, this question:

https://stackoverflow.com/questions/53601200/calling-fork-on-a-multithreaded-process

or this lwn thread: https://lwn.net/Articles/674660/

Do we have any insight into why these applications are doing this
horrible thing?



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-18 16:17                 ` Matthew Wilcox
@ 2026-05-18 20:50                   ` Barry Song
  0 siblings, 0 replies; 80+ messages in thread
From: Barry Song @ 2026-05-18 20:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Lorenzo Stoakes, surenb, akpm, linux-mm, david, liam, vbabka,
	rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Tue, May 19, 2026 at 12:17 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, May 18, 2026 at 07:25:54PM +0800, Barry Song wrote:
> > We have clearly observed that the `fork()` operations of many
> > popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
> > end up waiting on page-fault (PF) I/O when the VMA lock is
> > held during I/O operations. This has already become a
> > practical issue. I also believe this can lead to chained
> > waiting, since the global `mmap_lock` blocks all threads that
> > need to acquire it.
>
> It's always been a terrible idea to call fork() from a multithreaded
> application.  For example, this question:
>
> https://stackoverflow.com/questions/53601200/calling-fork-on-a-multithreaded-process
>
> or this lwn thread: https://lwn.net/Articles/674660/
>
> Do we have any insight into why these applications are doing this
> horrible thing?

I swear I read the two links you shared. But the reality
is that as long as people use the Android framework,
even the simplest "Hello World" app already runs with
10+ threads :-)


main
RenderThread
ReferenceQueueDaemon
FinalizerDaemon
FinalizerWatchdogDaemon
HeapTaskDaemon
Binder:1234_1
Binder:1234_2
Signal Catcher
JDWP
...

Best Regards
Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-18 11:25               ` Barry Song
  2026-05-18 16:17                 ` Matthew Wilcox
@ 2026-05-18 19:56                 ` Suren Baghdasaryan
  2026-05-18 21:14                   ` Barry Song
  2026-05-19 12:53                   ` Lorenzo Stoakes
  2026-05-19 12:43                 ` Lorenzo Stoakes
  2 siblings, 2 replies; 80+ messages in thread
From: Suren Baghdasaryan @ 2026-05-18 19:56 UTC (permalink / raw)
  To: Barry Song
  Cc: Lorenzo Stoakes, Matthew Wilcox, akpm, linux-mm, david, liam,
	vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Mon, May 18, 2026 at 4:26 AM Barry Song <baohua@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> > > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > for an unpredictable amount of time.
> > > > > >
> > > > > > But does that actually happen?  I find it hard to believe that thread A
> > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > > > it still seems really unlikely to me.
> > > > >
> > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > the entire VMA—just a portion of it is sufficient.
> > > >
> > > > Yes, but that still fails to answer "does this actually happen".  How much
> > > > performance is all this complexity in the page fault handler buying us?
> > > > If you don't answer this question, I'm just going to go in and rip it
> > > > all out.
> > > >
> > >
> > > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > > waiting for answers),
> > >
> > > As promised during LSF/MM/BPF, we conducted thorough
> > > testing on Android phones to determine whether performing
> > > I/O in `filemap_fault()` can block `vma_start_write()`.
> > > I wanted to give a quick update on this question.
> > >
> > > Nanzhe at Xiaomi created tracing scripts and ran various
> > > applications on Android devices with I/O performed under
> > > the VMA lock in `filemap_fault()`. We found that:
> > >
> > > 1. There are very few cases where unmap() is blocked by
> > >    page faults. I assume this is due to buggy user code
> > >    or poor synchronization between reads and unmap().
> > >    So I assume it is not a problem.
> > >
> > > 2. We observed many cases where `vma_start_write()`
> > >    is blocked by page-fault I/O in some applications.
> > >    The blocking occurs in the `dup_mmap()` path during
> > >    fork().
> > >
> > > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > > the parent process when forking"), we now always hold
> > > `vma_write_lock()` for each VMA. Note that the
> > > `mmap_lock` write lock is also held, which could lead to
> > > chained waiting if page-fault I/O is performed without
> > > releasing the VMA lock.
> >
> > Hm but did you observe this 'chained waiting'? And what were the latencies?
>
> We have clearly observed that the `fork()` operations of many
> popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
> end up waiting on page-fault (PF) I/O when the VMA lock is
> held during I/O operations. This has already become a
> practical issue. I also believe this can lead to chained
> waiting, since the global `mmap_lock` blocks all threads that
> need to acquire it.
>
>
> >
> > >
> > > My gut feeling is that Suren's commit may be overshooting,
> > > so my rough idea is that we might want to do something like
> > > the following (we haven't tested it yet and it might be
> > > wrong):
> >
> > Yeah I'm really not sure about that.
> >
> > Prior to the VMA locks, the mmap write lock would have guaranteed no concurrent
> > page faults, which is really what fb49c455323ff is about.
> >
> > So Suren's patch was essentially restoring the _existing_ forking behaviour, and
> > now you're saying 'let's change the forking behaviour that's been like that for
> > forever'.
>
>
> I am afraid not. Before we introduced the per-VMA lock, we
> were not performing I/O while holding `mmap_lock`. A page fault
> that needed I/O would drop the `mmap_lock` read lock and allow
> `fork()` to proceed.
>
> Now, you are suggesting performing I/O while holding the VMA
> lock, which changes the requirements and introduces this
> problem.
>
> >
> > I think you would _really_ have to be sure that's safe. And forking is a very
> > dangerous time in terms of complexity and sensitivity and 'weird stuff'
> > happening so I'd tread _very_ carefully here.
>
> Yep. I think my original proposal did not require any changes
> to `fork()`, since it simply preserved the current behavior of
> dropping the VMA lock before performing I/O. In that model,
> `fork()` would not end up waiting on I/O at all.
>
> What you are suggesting now appears to be performing I/O while
> holding the VMA lock, which in turn introduces the need to
> change `fork()`.
>
> >
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 2311ae7c2ff4..5ddaf297f31a 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > > *mm, struct mm_struct *oldmm)
> > >         for_each_vma(vmi, mpnt) {
> > >                 struct file *file;
> > >
> > > -               retval = vma_start_write_killable(mpnt);
> > > +               /*
> > > +                * For anonymous or writable private VMAs, prevent
> > > +                * concurrent CoW faults.
> > > +                */
> >
> > To nitpick, I think the comment's confusing, but it also tells you that you don't need a
> > specific anon check - writable private is sufficient. And it's not really just
> > CoW that's the issue, it's anon_vma population _at all_ as well as CoW.
> >
> > > +               if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > > +                                       (mpnt->vm_flags & VM_WRITE)))
> > > +                       retval = vma_start_write_killable(mpnt);
> >
> > I think this has to be VM_MAYWRITE, because somebody could otherwise mprotect()
> > it R/W.
> >
> > I also don't understand why !mpnt->vm_file for a read-only anon mapping (more
> > likely PROT_NONE) is here, just do the second check?
> >
> > (Also please use the new interface, so !vma_test(mpnt, VMA_SHARED_BIT) &&
> > vma_test(mpnt, VMA_MAYWRITE_BIT))
>
> Yep, I can definitely refine the check further. But before
> doing that, I'd first like to confirm that we are aligned on
> the direction.
>
> If you still intend to hold the VMA lock while performing I/O,
> then I think we should fix `fork()` to avoid taking
> `vma_start_write()`.
>
> >
> > >                 if (retval < 0)
> > >                         goto loop_out;
> > >                 if (mpnt->vm_flags & VM_DONTCOPY) {
> > >
> > > Based on the above, we may want to re-check whether fork()
> > > can be blocked by page faults. At the same time, if Suren,
> > > you, or anyone else has any comments, please feel free to
> > > share them.
> > >
> > > Best Regards
> > > Barry
> >
> > Technical commentary above is sort of 'just cos' :) because I really question
> > doing this honestly.
>
> I think we either need to fix `fork()`, or keep the current
> behavior of dropping the VMA lock before performing I/O.

I see. So, this problem arises from the fact that we are changing
page faults that require I/O to hold the VMA lock...
And you want to lock the VMA on fork only if vma_is_anonymous(vma) ||
is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
anonymous and COW VMAs only while holding mmap_write_lock, preventing
any VMA modification. On the surface, that looks ok to me but I might
be missing some corner cases. If nobody sees any obvious issues, I
think it's worth a try.
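
For readers following along, the condition described above can be modelled in
userspace roughly as follows. This is only a sketch under stated assumptions:
`struct vma_model`, `model_vma_is_anonymous()`, `model_is_cow_mapping()` and
`fork_needs_vma_write_lock()` are hypothetical names invented for this
illustration, the flag values are copied from include/linux/mm.h as I
understand them, and none of this is the proposed patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values mirroring include/linux/mm.h (assumed, not authoritative). */
#define VM_WRITE    0x00000002UL
#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL

/* Hypothetical stand-in for struct vm_area_struct. */
struct vma_model {
	unsigned long vm_flags;
	const void *vm_ops;	/* NULL models an anonymous VMA */
};

/* Models vma_is_anonymous(): a VMA without vm_ops is anonymous. */
bool model_vma_is_anonymous(const struct vma_model *vma)
{
	return vma->vm_ops == NULL;
}

/* Models is_cow_mapping(): private (not shared) and may become writable. */
bool model_is_cow_mapping(unsigned long flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* Hypothetical predicate: would fork write-lock this VMA under the idea above? */
bool fork_needs_vma_write_lock(const struct vma_model *vma)
{
	return model_vma_is_anonymous(vma) ||
	       model_is_cow_mapping(vma->vm_flags);
}
```

Under this model a shared file-backed mapping is skipped, while anonymous
mappings and private file mappings that may become writable are still locked.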




>
> >
> > I'd also like to get Suren's input, however.
>
> Yes, of course.
>
> >
> > Thanks, Lorenzo
>
> Best Regards
> Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-18 19:56                 ` Suren Baghdasaryan
@ 2026-05-18 21:14                   ` Barry Song
  2026-05-19 12:45                     ` Lorenzo Stoakes
  2026-05-19 14:17                     ` Liam R. Howlett
  2026-05-19 12:53                   ` Lorenzo Stoakes
  1 sibling, 2 replies; 80+ messages in thread
From: Barry Song @ 2026-05-18 21:14 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Lorenzo Stoakes, Matthew Wilcox, akpm, linux-mm, david, liam,
	vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Tue, May 19, 2026 at 3:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, May 18, 2026 at 4:26 AM Barry Song <baohua@kernel.org> wrote:
> >
> > On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > >
> > > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
[...]
> >
> > I think we either need to fix `fork()`, or keep the current
> > behavior of dropping the VMA lock before performing I/O.
>
> I see. So, this problem arises from the fact that we are changing the
> pagefaults requiring I/O operation to hold VMA lock...
> And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> anonymous and COW VMAs only while holding mmap_write_lock, preventing
> any VMA modification. On the surface, that looks ok to me but I might
> be missing some corner cases. If nobody sees any obvious issues, I
> think it's worth a try.
>

Thanks. Besides the creation of processes via fork(), I
am also beginning to worry about the death of processes.

One thing that came to my mind this morning
is that when lowmemorykiller decides to kill an app, we
want the memory to be released as quickly as possible so
the new app or user scenario can get memory sooner.

In that case, if the app being killed is performing I/O
while holding the VMA lock, the unmapping procedure
could end up being blocked as well.

If we release the VMA lock as we currently do, we allow
process exit to proceed.

I haven't thought it through very clearly yet, and I
may be wrong. I'd like to do more investigation. I hope
the apps being killed stay very still, but who knows—we
have so many applications in the market.

Meanwhile, if you have any comments regarding the death
of processes, they would be very welcome.

Best Regards
Barry


* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-18 21:14                   ` Barry Song
@ 2026-05-19 12:45                     ` Lorenzo Stoakes
  2026-05-19 14:17                     ` Liam R. Howlett
  1 sibling, 0 replies; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-19 12:45 UTC (permalink / raw)
  To: Barry Song
  Cc: Suren Baghdasaryan, Matthew Wilcox, akpm, linux-mm, david, liam,
	vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Tue, May 19, 2026 at 05:14:45AM +0800, Barry Song wrote:
> On Tue, May 19, 2026 at 3:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, May 18, 2026 at 4:26 AM Barry Song <baohua@kernel.org> wrote:
> > >
> > > On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > > >
> > > > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> [...]
> > >
> > > I think we either need to fix `fork()`, or keep the current
> > > behavior of dropping the VMA lock before performing I/O.
> >
> > I see. So, this problem arises from the fact that we are changing the
> > pagefaults requiring I/O operation to hold VMA lock...
> > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > any VMA modification. On the surface, that looks ok to me but I might
> > be missing some corner cases. If nobody sees any obvious issues, I
> > think it's worth a try.
> >
>
> Thanks. Besides the creation of processes via fork(), I
> am also beginning to worry about the death of processes.
>
> One thing that came to my mind this morning
> is that when lowmemorykiller decides to kill an app, we

What's the lowmemorykiller? :P you mean the OOM killer?

> want the memory to be released as quickly as possible so
> the new app or user scenario can get memory sooner.
>
> In that case, if the app being killed is performing I/O
> while holding the VMA lock, the unmapping procedure
> could end up being blocked as well.
>
> If we release the VMA lock as we currently do, we allow
> process exit to proceed.
>
> I haven't thought it through very clearly yet, and I
> may be wrong. I'd like to do more investigation. I hope
> the apps being killed stay very still, but who knows—we
> have so many applications in the market.

Yeah let's tread very carefully please, you're picking two of the most fraught
areas of mm, I'm not going to want to see changes there unless they're
substantially more convincingly argued.

>
> Meanwhile, if you have any comments regarding the death
> of processes, they would be very welcome.

As above, leave it alone please :)

>
> Best Regards
> Barry

Thanks, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-18 21:14                   ` Barry Song
  2026-05-19 12:45                     ` Lorenzo Stoakes
@ 2026-05-19 14:17                     ` Liam R. Howlett
  2026-05-19 22:01                       ` Barry Song
  1 sibling, 1 reply; 80+ messages in thread
From: Liam R. Howlett @ 2026-05-19 14:17 UTC (permalink / raw)
  To: Barry Song
  Cc: Suren Baghdasaryan, Lorenzo Stoakes, Matthew Wilcox, akpm,
	linux-mm, david, vbabka, rppt, mhocko, jack, pfalcato, wanglian,
	chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong,
	shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Nanzhe Zhao

On 26/05/19 05:14AM, Barry Song wrote:
> On Tue, May 19, 2026 at 3:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, May 18, 2026 at 4:26 AM Barry Song <baohua@kernel.org> wrote:
> > >
> > > On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > > >
> > > > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> [...]
> > >
> > > I think we either need to fix `fork()`, or keep the current
> > > behavior of dropping the VMA lock before performing I/O.
> >
> > I see. So, this problem arises from the fact that we are changing the
> > pagefaults requiring I/O operation to hold VMA lock...
> > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > any VMA modification. On the surface, that looks ok to me but I might
> > be missing some corner cases. If nobody sees any obvious issues, I
> > think it's worth a try.

From Barry's description, I think what he is saying is that the vma
locking has caused the mmap_lock to become unfair?  I think what is
implied is that the per-vma locking may stall mmap_lock writes for
longer than if the mmap_lock was taken in read mode?  Barry, is that
correct?

Since Android is doing something (according to Barry) that should not be
done (according to Willy), both of these together are causing a slowdown?

> 
> Thanks. Besides the creation of processes via fork(), I
> am also beginning to worry about the death of processes.
> 
> One thing that came to my mind this morning
> is that when lowmemorykiller decides to kill an app, we
> want the memory to be released as quickly as possible so
> the new app or user scenario can get memory sooner.
> 
> In that case, if the app being killed is performing I/O
> while holding the VMA lock, the unmapping procedure
> could end up being blocked as well.
> 
> If we release the VMA lock as we currently do, we allow
> process exit to proceed.
> 
> I haven't thought it through very clearly yet, and I
> may be wrong. I'd like to do more investigation. I hope
> the apps being killed stay very still, but who knows—we
> have so many applications in the market.
> 
> Meanwhile, if you have any comments regarding the death
> of processes, they would be very welcome.

The oom killer only cleans out anon/not shared vmas [1].  So, what this
would hold up would be the actual process exit path.  Although that
would have resources associated with it, the amount of resources should
be relatively low compared to the amount freed by the oom reaper, right?

The other entry point, which is mostly to do with Android,
process_mrelease() [2], will end up in the same __oom_reap_task_mm()
function.

So, for the most part, the memory will be freed while the file backed
vma completes IO and that sounds like the right thing to do anyways.
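
To make the "anon/not shared" filter concrete, here is a rough userspace model
of the check as I read it in __oom_reap_task_mm(). It is a sketch, not the
kernel code: `struct reap_vma_model` and `oom_reaper_would_reap()` are
hypothetical names for this illustration, and the flag values are assumed to
match include/linux/mm.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values mirroring include/linux/mm.h (assumed, not authoritative). */
#define VM_SHARED  0x00000008UL
#define VM_PFNMAP  0x00000400UL
#define VM_HUGETLB 0x00400000UL

/* Hypothetical stand-in for struct vm_area_struct. */
struct reap_vma_model {
	unsigned long vm_flags;
	const void *vm_ops;	/* NULL models an anonymous VMA */
};

/* Hypothetical predicate: would the reaper unmap pages from this VMA? */
bool oom_reaper_would_reap(const struct reap_vma_model *vma)
{
	/* hugetlb and PFN-mapped ranges are skipped outright */
	if (vma->vm_flags & (VM_HUGETLB | VM_PFNMAP))
		return false;

	/* anonymous memory, or private (non-shared) file mappings */
	return vma->vm_ops == NULL || !(vma->vm_flags & VM_SHARED);
}
```

In this model a shared file-backed VMA is left to the normal exit path, which
is the part that could be held up by a fault still doing I/O.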

Thanks,
Liam

[1]. https://elixir.bootlin.com/linux/v7.1-rc4/source/mm/oom_kill.c#L547
[2]. https://elixir.bootlin.com/linux/v6.18.6/source/mm/oom_kill.c#L1210




* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-19 14:17                     ` Liam R. Howlett
@ 2026-05-19 22:01                       ` Barry Song
  2026-05-20 21:04                         ` Matthew Wilcox
  0 siblings, 1 reply; 80+ messages in thread
From: Barry Song @ 2026-05-19 22:01 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: Suren Baghdasaryan, Lorenzo Stoakes, Matthew Wilcox, akpm,
	linux-mm, david, vbabka, rppt, mhocko, jack, pfalcato, wanglian,
	chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong,
	shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Nanzhe Zhao

On Tue, May 19, 2026 at 10:17 PM Liam R. Howlett <liam@infradead.org> wrote:
>
> On 26/05/19 05:14AM, Barry Song wrote:
> > On Tue, May 19, 2026 at 3:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Mon, May 18, 2026 at 4:26 AM Barry Song <baohua@kernel.org> wrote:
> > > >
> > > > On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > > > >
> > > > > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> > [...]
> > > >
> > > > I think we either need to fix `fork()`, or keep the current
> > > > behavior of dropping the VMA lock before performing I/O.
> > >
> > > I see. So, this problem arises from the fact that we are changing the
> > > pagefaults requiring I/O operation to hold VMA lock...
> > > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > > any VMA modification. On the surface, that looks ok to me but I might
> > > be missing some corner cases. If nobody sees any obvious issues, I
> > > think it's worth a try.
>
> From Barry's description, I think what he is saying is that the vma
> locking has caused the mmap_lock to become unfair?  I think what is

For now, we do not have this problem. Before per-VMA
locks, we dropped mmap_lock before doing I/O in the
page-fault path and then retried the page fault. With
per-VMA locks, we drop the VMA lock before doing I/O in
the page-fault path and then retry the page fault.

The problem only starts to exist if we decide to perform
I/O without releasing the VMA lock — which is what Matthew
is suggesting, because it would allow us to rip out a large
amount of page-fault retry code.

> implied is that the per-vma locking may stall mmap_lock writes for
> longer than if the mmap_lock was taken in read mode?  Barry, is that
> correct?

Not the case — the actual situation is (if we modify the
current kernel to perform I/O without releasing VMA read locks):

thread 1 PF:   lock vma1 read ---- IO ---- ;
thread 2 PF:   lock vma2 read ---- IO ---- ;
thread 3 PF:   lock vma3 read ---- IO ---- ;
thread 4 fork: mmap_lock_write ---- lock vma1, vma2, vma3 write ;
thread 5:      take mmap_lock for any read/write reason

Now you can see that thread 4 has to wait for the I/O of
VMA1, VMA2, and VMA3 to complete, and thread 5 then has to
wait for thread 4 to release mmap_lock. Both thread 4 and
thread 5 can become extremely slow, because I/O may be stuck
anywhere in the bio/request queue or filesystem GC.
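
The wait chain above can be put into a back-of-the-envelope model. This is
only a sketch under simplifying assumptions: all page-fault I/Os are already
in flight when fork() arrives, each reader holds its VMA read lock until its
own I/O finishes, and `fork_lock_wait_ms()` and `mmap_lock_waiter_ms()` are
hypothetical helpers invented for this illustration, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * io_remaining_ms[i]: how long the reader of VMA i still holds its read
 * lock. fork() write-locks the VMAs one after another, so it is done
 * waiting only once the slowest in-flight I/O has completed.
 */
unsigned long fork_lock_wait_ms(const unsigned long *io_remaining_ms,
				size_t nr_vmas)
{
	unsigned long waited = 0;

	for (size_t i = 0; i < nr_vmas; i++) {
		if (io_remaining_ms[i] > waited)
			waited = io_remaining_ms[i];
	}
	return waited;
}

/*
 * Any other mmap_lock user (thread 5) queues behind fork(), so it pays
 * fork()'s wait plus the time fork() itself holds mmap_lock for writing.
 */
unsigned long mmap_lock_waiter_ms(const unsigned long *io_remaining_ms,
				  size_t nr_vmas, unsigned long fork_hold_ms)
{
	return fork_lock_wait_ms(io_remaining_ms, nr_vmas) + fork_hold_ms;
}
```

With made-up remaining I/O times of 30, 120 and 45 ms and a 5 ms fork() hold
time, the model gives a 120 ms stall for fork() and 125 ms for thread 5. If
the read locks are dropped before I/O (all entries zero), those figures fall
to 0 ms and 5 ms.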

So now we have two choices:

1. Change fork() to avoid taking the vma write lock for vma1/2/3 where possible;
2. Keep the current kernel behavior and drop the VMA lock before I/O:

thread 1 PF: lock vma1 read; drop vma1 read_lock ---- IO ---- retry PF
thread 2 PF: lock vma2 read; drop vma2 read_lock ---- IO ---- retry PF
thread 3 PF: lock vma3 read; drop vma3 read_lock ---- IO ---- retry PF

Option 2 is what mainline is currently doing, and what this
patchset also follows. The only difference in this patchset is
that page faults are retried under the VMA read lock, rather
than under mmap_lock as in the current kernel, which is causing
mmap_lock contention.

>
> Since Android is doing something (according to Barry) that should not be
> done (according to Willy), both of these together are causing slow down?

The only thing that would cause slowdown is holding the VMA
lock while performing I/O in the page-fault path, which is not
happening today. It would only happen if we insist on doing I/O
under the VMA lock without changing fork().

>
> >
> > Thanks. Besides the creation of processes via fork(), I
> > am also beginning to worry about the death of processes.
> >
> > One thing that came to my mind this morning
> > is that when lowmemorykiller decides to kill an app, we
> > want the memory to be released as quickly as possible so
> > the new app or user scenario can get memory sooner.
> >
> > In that case, if the app being killed is performing I/O
> > while holding the VMA lock, the unmapping procedure
> > could end up being blocked as well.
> >
> > If we release the VMA lock as we currently do, we allow
> > process exit to proceed.
> >
> > I haven't thought it through very clearly yet, and I
> > may be wrong. I'd like to do more investigation. I hope
> > the apps being killed stay very still, but who knows—we
> > have so many applications in the market.
> >
> > Meanwhile, if you have any comments regarding the death
> > of processes, they would be very welcome.
>
> The oom killer only cleans out anon/not shared vmas [1].  So, what this
> would hold up would be the actual process exit path.  Although that
> would have resources associated with it, the amount of resources should
> be relatively low compared to the amount freed by the oom reaper, right?
>
> The other entry point that's mostly to do with android,
> process_mrelease() [2] will end up in the same  __oom_reap_task_mm()
> function.
>
> So, for the most part, the memory will be freed while the file backed
> vma completes IO and that sounds like the right thing to do anyways.

Thanks very much for your valuable input!
I’m going to run more experiments to dig deeper into this.

>
> Thanks,
> Liam
>
> [1]. https://elixir.bootlin.com/linux/v7.1-rc4/source/mm/oom_kill.c#L547
> [2]. https://elixir.bootlin.com/linux/v6.18.6/source/mm/oom_kill.c#L1210
>

Best Regards
Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-19 22:01                       ` Barry Song
@ 2026-05-20 21:04                         ` Matthew Wilcox
  2026-05-20 21:14                           ` Barry Song
  0 siblings, 1 reply; 80+ messages in thread
From: Matthew Wilcox @ 2026-05-20 21:04 UTC (permalink / raw)
  To: Barry Song
  Cc: Liam R. Howlett, Suren Baghdasaryan, Lorenzo Stoakes, akpm,
	linux-mm, david, vbabka, rppt, mhocko, jack, pfalcato, wanglian,
	chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong,
	shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Nanzhe Zhao

On Wed, May 20, 2026 at 06:01:56AM +0800, Barry Song wrote:
> > implied is that the per-vma locking may stall mmap_lock writes for
> > longer than if the mmap_lock was taken in read mode?  Barry, is that
> > correct?
> 
> Not the case — the actual situation is (if we modify the
> current kernel to perform I/O without releasing VMA read locks):
> 
> thread 1 PF: lock vma1 read ----  IO ----- ;
> thread 2 PF: lock vma2 read ----- IO ----- ;
> thread 3 PF:  lock vma3 read ---- IO ----- ;
> thread 4 fork:  mmap_lock_write ---- lock vma1, vma2, vma3 write ;
> thread 5 :  take mmap_lock for any read/write reason
> 
> Now you can see that thread 4 has to wait for the I/O of
> VMA1, VMA2, and VMA3 to complete, and thread 5 then has to
> wait for thread 4 to release mmap_lock. Both thread 4 and
> thread 5 can become extremely slow, because I/O may be stuck
> anywhere in the bio/request queue or filesystem GC.
> 
> So now we have two choices:
> 
> 1. Change fork() to avoid taking the vma write lock for vma1/2/3 where possible;
> 2. Keep the current kernel behavior and drop the VMA lock before I/O:

Option 3: Say that this is a very silly thing to optimise for.  I have a
hard time believing that any application will care about the latency of
fork(), or the latency of page faults while it's in the middle of fork().
Multithreaded applications just don't fork that often!



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-20 21:04                         ` Matthew Wilcox
@ 2026-05-20 21:14                           ` Barry Song
  2026-05-20 21:15                             ` Matthew Wilcox
  0 siblings, 1 reply; 80+ messages in thread
From: Barry Song @ 2026-05-20 21:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Liam R. Howlett, Suren Baghdasaryan, Lorenzo Stoakes, akpm,
	linux-mm, david, vbabka, rppt, mhocko, jack, pfalcato, wanglian,
	chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong,
	shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Nanzhe Zhao

On Thu, May 21, 2026 at 5:05 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, May 20, 2026 at 06:01:56AM +0800, Barry Song wrote:
> > > implied is that the per-vma locking may stall mmap_lock writes for
> > > longer than if the mmap_lock was taken in read mode?  Barry, is that
> > > correct?
> >
> > Not the case — the actual situation is (if we modify the
> > current kernel to perform I/O without releasing VMA read locks):
> >
> > thread 1 PF: lock vma1 read ----  IO ----- ;
> > thread 2 PF: lock vma2 read ----- IO ----- ;
> > thread 3 PF:  lock vma3 read ---- IO ----- ;
> > thread 4 fork:  mmap_lock_write ---- lock vma1, vma2, vma3 write ;
> > thread 5 :  take mmap_lock for any read/write reason
> >
> > Now you can see that thread 4 has to wait for the I/O of
> > VMA1, VMA2, and VMA3 to complete, and thread 5 then has to
> > wait for thread 4 to release mmap_lock. Both thread 4 and
> > thread 5 can become extremely slow, because I/O may be stuck
> > anywhere in the bio/request queue or filesystem GC.
> >
> > So now we have two choices:
> >
> > 1. Change fork() to avoid taking the vma write lock for vma1/2/3 where possible;
> > 2. Keep the current kernel behavior and drop the VMA lock before I/O:
>
> Option 3: Say that this is a very silly thing to optimise for.  I have a
> hard time believing that any application will care about the latency of
> fork(), or the latency of page faults while it's in the middle of fork().
> Multithreaded applications just don't fork that often!

My understanding is that we should not blame applications here. This is 2026:
there are basically only two kinds of applications — single-threaded and
multi-threaded — and single-threaded applications are nearly extinct.



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-20 21:14                           ` Barry Song
@ 2026-05-20 21:15                             ` Matthew Wilcox
  2026-05-20 21:35                               ` David Hildenbrand (Arm)
  2026-05-22  2:33                               ` Barry Song (Xiaomi)
  0 siblings, 2 replies; 80+ messages in thread
From: Matthew Wilcox @ 2026-05-20 21:15 UTC (permalink / raw)
  To: Barry Song
  Cc: Liam R. Howlett, Suren Baghdasaryan, Lorenzo Stoakes, akpm,
	linux-mm, david, vbabka, rppt, mhocko, jack, pfalcato, wanglian,
	chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong,
	shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Nanzhe Zhao

On Thu, May 21, 2026 at 05:14:20AM +0800, Barry Song wrote:
> My understanding is that we should not blame applications here. This is 2026:
> there are basically only two kinds of applications — single-threaded and
> multi-threaded — and single-threaded applications are nearly extinct.

all of the applications i run are either single threaded or don't fork.
what multithreaded applications call fork?



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-20 21:15                             ` Matthew Wilcox
@ 2026-05-20 21:35                               ` David Hildenbrand (Arm)
  2026-05-20 23:37                                 ` Barry Song
  2026-06-23  7:58                                 ` Hongru Zhang
  2026-05-22  2:33                               ` Barry Song (Xiaomi)
  1 sibling, 2 replies; 80+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-20 21:35 UTC (permalink / raw)
  To: Matthew Wilcox, Barry Song
  Cc: Liam R. Howlett, Suren Baghdasaryan, Lorenzo Stoakes, akpm,
	linux-mm, vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On 5/20/26 23:15, Matthew Wilcox wrote:
> On Thu, May 21, 2026 at 05:14:20AM +0800, Barry Song wrote:
>> My understanding is that we should not blame applications here. This is 2026:
>> there are basically only two kinds of applications — single-threaded and
>> multi-threaded — and single-threaded applications are nearly extinct.
> 
> all of the applications i run are either single threaded or don't fork.
> what multithreaded applications call fork?

Traditionally the problem was random libraries using fork+execve to launch other
programs ... instead of using alternatives like posix_spawn (some use cases
require more work done before execve and cannot yet switch to that). I'd hope
that that is less of a problem on Android.
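
For illustration, a minimal userspace sketch of the posix_spawn alternative mentioned above. The helper name spawn_and_wait() is made up for this example and does not come from any library. As far as I know, glibc implements posix_spawn with a vfork-style clone that shares the parent's address space instead of copying it, but treat that as an assumption, not something this thread establishes.

```c
#include <assert.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Hypothetical helper: launch argv[0] (looked up in PATH) with
 * posix_spawnp() and wait for it to finish.
 * Returns the child's exit status, or -1 if spawning or waiting failed
 * or the child did not exit normally.
 */
static int spawn_and_wait(char *const argv[])
{
	pid_t pid;
	int status;

	assert(argv && argv[0]);

	/* No file actions or attributes: the child inherits fds and signal setup. */
	if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ) != 0)
		return -1;

	if (waitpid(pid, &status, 0) < 0)
		return -1;

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass a NULL-terminated argument vector such as { "sh", "-c", "exit 3", NULL }. Work that must happen between fork and execve would have to be expressed through posix_spawn file actions and attributes, which is the limitation noted above.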

I assume Android zygote might be multi-threaded? Maybe sshd as well? Systemd?
But I'd be surprised if there are really performance implications.

Not sure about web browsers ... I think most of them switched to fork servers,
where I would assume fork servers would be single-threaded.

So, yeah, getting a clear understanding of how this ends up being a problem on
Android would be great.

-- 
Cheers,

David


* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-20 21:35                               ` David Hildenbrand (Arm)
@ 2026-05-20 23:37                                 ` Barry Song
  2026-05-22 15:53                                   ` Lorenzo Stoakes
  2026-06-23  7:58                                 ` Hongru Zhang
  1 sibling, 1 reply; 80+ messages in thread
From: Barry Song @ 2026-05-20 23:37 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Matthew Wilcox, Liam R. Howlett, Suren Baghdasaryan,
	Lorenzo Stoakes, akpm, linux-mm, vbabka, rppt, mhocko, jack,
	pfalcato, wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1,
	chrisl, kasong, shikemeng, nphamcs, bhe, youngjun.park,
	linux-arm-kernel, linux-kernel, loongarch, linuxppc-dev,
	linux-riscv, linux-s390, Nanzhe Zhao

On Thu, May 21, 2026 at 5:35 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 5/20/26 23:15, Matthew Wilcox wrote:
> > On Thu, May 21, 2026 at 05:14:20AM +0800, Barry Song wrote:
> >> My understanding is that we should not blame applications here. This is 2026:
> >> there are basically only two kinds of applications — single-threaded and
> >> multi-threaded — and single-threaded applications are nearly extinct.
> >
> > all of the applications i run are either single threaded or don't fork.
> > what multithreaded applications call fork?
>
> Traditionally the problem was random libraries using fork+execve to launch other
> programs ... instead of using alternatives like posix_spawn (some use cases
> require more work done before execve and cannot yet switch to that). I'd hope
> that that is less of a problem on Android.
>
> I assume Android zygote might be multi-threaded? Maybe sshd as well? Systemd?
> But I'd be surprised if there are really performance implications.

I am trying to answer the question above:

1. zygote: multi-threaded on my phone running Android 13.
/ # ls /proc/`pidof zygote64`/task/
1359  22728  22729  22730  22731  22732

/proc/1359/task # cat 22728/comm
Jit thread pool
/proc/1359/task # cat 22730/comm
ReferenceQueueD
/proc/1359/task # cat 22731/comm
FinalizerDaemon
/proc/1359/task # cat 22732/comm
FinalizerWatchd
/proc/1359/task # cat 1359/comm
main

But on another phone of mine running Android 16, zygote64 is
single-threaded.
Not sure if it is due to the Android team making some changes
related to threads from Android 13 to Android 16.
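
The same check can be scripted. Below is a minimal sketch that counts the entries under /proc/<pid>/task, which is what the ls above lists. The function name count_threads() and the convention that pid 0 means the calling process are assumptions of this example only.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: count the threads of a process by counting the
 * numeric entries in /proc/<pid>/task. pid == 0 means the caller itself.
 * Returns -1 if the directory cannot be opened.
 */
static int count_threads(pid_t pid)
{
	char path[64];
	struct dirent *de;
	DIR *dir;
	int len;
	int n = 0;

	if (pid == 0)
		len = snprintf(path, sizeof(path), "/proc/self/task");
	else
		len = snprintf(path, sizeof(path), "/proc/%ld/task", (long)pid);
	assert(len > 0 && (size_t)len < sizeof(path));

	dir = opendir(path);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		/* Each thread shows up as a directory named after its TID. */
		if (isdigit((unsigned char)de->d_name[0]))
			n++;
	}
	closedir(dir);
	return n;
}
```

A result of 1 means the process is single-threaded at the moment of the scan. The count is only a snapshot, since threads can start or exit while the directory is being read.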

2. sshd, multi-processes instead of multi-threads:
$ ps aux | grep sshd
root        1192  0.0  0.0  15444  9032 ?        Ss   09:42   0:00
sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
root        2465  0.0  0.0  17164 10760 ?        Ss   09:42   0:00
sshd: barry [priv]
barry       2632  0.0  0.0  17164  7852 ?        S    09:42   0:00
sshd: barry@pts/0
root        3305  2.5  0.0  17164 10772 ?        Ss   09:44   0:00
sshd: barry [priv]
barry       3406  0.0  0.0  17164  7940 ?        S    09:44   0:00
sshd: barry@pts/1

3. systemd, also multi-processes

$ ps ax | grep systemd
    350 ?        S<s    0:00 /lib/systemd/systemd-journald
    387 ?        Ss     0:00 /lib/systemd/systemd-udevd
    666 ?        Ss     0:00 /lib/systemd/systemd-oomd
    667 ?        Ss     0:00 /lib/systemd/systemd-resolved
    728 ?        Ss     0:00 @dbus-daemon --system --address=systemd:
--nofork --nopidfile --systemd-activation --syslog-only
    751 ?        Ss     0:00 /lib/systemd/systemd-logind
    753 ?        Ssl    0:00 /usr/sbin/thermald --systemd
--dbus-enable --adaptive
   1350 ?        Ss     0:00 /lib/systemd/systemd --user
   1428 ?        Ss     0:00 /usr/bin/dbus-daemon --session
--address=systemd: --nofork --nopidfile --systemd-activation
--syslog-only
   1900 ?        Ssl    0:00 /usr/libexec/gnome-session-binary
--systemd-service --session=ubuntu
   2141 ?        Ssl    0:00 /lib/systemd/systemd-timesyncd

>
> Not sure about web browsers ... I think most of them switched to fork servers,
> where I would assume fork servers would be single-threaded.

On my phone, Chrome is multi-process, but its parent process
chrome_zygote (10774) is single-threaded:

 ps -A | grep chrome
u0_i15        9883 10774 321066464 119452 do_epoll_wait     0 S
com.android.chrome:sandboxed_process0:org.chromium.content.app.SandboxedProcessService0:15
u0_a142      10164  1359 35110548 277640 do_epoll_wait      0 S
com.android.chrome
u0_a278      10724  1359 9779864 104988 do_epoll_wait       0 S
com.google.android.apps.chromecast.app
u0_a142      10774  1359 32803908 64076 do_sys_poll         0 S
com.android.chrome_zygote
u0_a142      11173  1359 34208592 142192 do_epoll_wait      0 S
com.android.chrome:privileged_process0

/proc/10774/task # ls
10774

>
> So, yeah, getting a clear understanding of how this ends up being a problem on
> Android would be great.

I guess the real issue is that in the Android market, there
are so many applications that are out of our control?

Here are some trace examples from Nanzhe:

iQIYI plugin
vma reader thread:
PbMisc-0, pid=27183, tgid=26444

vma writer thread:
i.video:plugin1, pid=27298, tgid=26444
writer blocked: 440394938 ns (440 ms)

reader stack:
vma_start_read
lock_vma_under_rcu
do_page_fault
do_translation_fault
do_mem_abort
el0_da
el0t_64_sync_handler
el0t_64_sync

writer stack:
__vma_start_write
dup_mmap
copy_mm
copy_process
kernel_clone
__arm64_sys_clone
invoke_syscall
el0_svc_common
do_el0_svc
el0_svc


Baidu Tieba
vma reader thread:
elastic_pms_pro, pid=7731, tgid=7575

vma writer thread:
com.baidu.tieba, pid=8005, tgid=7575
writer blocked: 514975545 ns (515 ms)

reader stack:
vma_start_read
lock_vma_under_rcu
do_page_fault
do_translation_fault
do_mem_abort
el0_da
el0t_64_sync_handler
el0t_64_sync

writer stack:
__vma_start_write
dup_mmap
copy_mm
copy_process
kernel_clone
__arm64_sys_clone
invoke_syscall
el0_svc_common
do_el0_svc
el0_svc

Thanks
Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-20 23:37                                 ` Barry Song
@ 2026-05-22 15:53                                   ` Lorenzo Stoakes
  2026-05-22 21:31                                     ` Barry Song
  0 siblings, 1 reply; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-22 15:53 UTC (permalink / raw)
  To: Barry Song
  Cc: David Hildenbrand (Arm), Matthew Wilcox, Liam R. Howlett,
	Suren Baghdasaryan, akpm, linux-mm, vbabka, rppt, mhocko, jack,
	pfalcato, wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1,
	chrisl, kasong, shikemeng, nphamcs, bhe, youngjun.park,
	linux-arm-kernel, linux-kernel, loongarch, linuxppc-dev,
	linux-riscv, linux-s390, Nanzhe Zhao

On Thu, May 21, 2026 at 07:37:58AM +0800, Barry Song wrote:
> On Thu, May 21, 2026 at 5:35 AM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
> >
> > On 5/20/26 23:15, Matthew Wilcox wrote:
> > > On Thu, May 21, 2026 at 05:14:20AM +0800, Barry Song wrote:
> > >> My understanding is that we should not blame applications here. This is 2026:
> > >> there are basically only two kinds of applications — single-threaded and
> > >> multi-threaded — and single-threaded applications are nearly extinct.
> > >
> > > all of the applications i run are either single threaded or don't fork.
> > > what multithreaded applications call fork?
> >
> > Traditionally the problem was random libraries using fork+execve to launch other
> > programs ... instead of using alternatives like posix_spawn (some use cases
> > require more work done before execve and cannot yet switch to that). I'd hope
> > that that is less of a problem on Android.
> >
> > I assume Android zygote might be multi-threaded? Maybe sshd as well? Systemd?
> > But I'd be surprised if there are really performance implications.
>
> I am trying to answer the question above:
>
> 1. zygote: multi-threaded on my phone running Android 13.
> / # ls /proc/`pidof zygote64`/task/
> 1359  22728  22729  22730  22731  22732
>
> /proc/1359/task # cat 22728/comm
> Jit thread pool
> /proc/1359/task # cat 22730/comm
> ReferenceQueueD
> /proc/1359/task # cat 22731/comm
> FinalizerDaemon
> /proc/1359/task # cat 22732/comm
> FinalizerWatchd
> /proc/1359/task # cat 1359/comm
> main
>
> But on another phone of mine running Android 16, zygote64 is
> single-threaded.
> Not sure if it is due to the Android team making some changes
> related to threads from Android 13 to Android 16.
>
> 2. sshd, multi-processes instead of multi-threads:
> $ ps aux | grep sshd
> root        1192  0.0  0.0  15444  9032 ?        Ss   09:42   0:00
> sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
> root        2465  0.0  0.0  17164 10760 ?        Ss   09:42   0:00
> sshd: barry [priv]
> barry       2632  0.0  0.0  17164  7852 ?        S    09:42   0:00
> sshd: barry@pts/0
> root        3305  2.5  0.0  17164 10772 ?        Ss   09:44   0:00
> sshd: barry [priv]
> barry       3406  0.0  0.0  17164  7940 ?        S    09:44   0:00
> sshd: barry@pts/1
>
> 3. systemd, also multi-processes
>
> $ ps ax | grep systemd
>     350 ?        S<s    0:00 /lib/systemd/systemd-journald
>     387 ?        Ss     0:00 /lib/systemd/systemd-udevd
>     666 ?        Ss     0:00 /lib/systemd/systemd-oomd
>     667 ?        Ss     0:00 /lib/systemd/systemd-resolved
>     728 ?        Ss     0:00 @dbus-daemon --system --address=systemd:
> --nofork --nopidfile --systemd-activation --syslog-only
>     751 ?        Ss     0:00 /lib/systemd/systemd-logind
>     753 ?        Ssl    0:00 /usr/sbin/thermald --systemd
> --dbus-enable --adaptive
>    1350 ?        Ss     0:00 /lib/systemd/systemd --user
>    1428 ?        Ss     0:00 /usr/bin/dbus-daemon --session
> --address=systemd: --nofork --nopidfile --systemd-activation
> --syslog-only
>    1900 ?        Ssl    0:00 /usr/libexec/gnome-session-binary
> --systemd-service --session=ubuntu
>    2141 ?        Ssl    0:00 /lib/systemd/systemd-timesyncd
>
> >
> > Not sure about web browsers ... I think most of them switched to fork servers,
> > where I would assume fork servers would be single-threaded.
>
> On my phone, Chrome is multi-process, but its parent process
> chrome_zygote (10774) is single-threaded:
>
>  ps -A | grep chrome
> u0_i15        9883 10774 321066464 119452 do_epoll_wait     0 S
> com.android.chrome:sandboxed_process0:org.chromium.content.app.SandboxedProcessService0:15
> u0_a142      10164  1359 35110548 277640 do_epoll_wait      0 S
> com.android.chrome
> u0_a278      10724  1359 9779864 104988 do_epoll_wait       0 S
> com.google.android.apps.chromecast.app
> u0_a142      10774  1359 32803908 64076 do_sys_poll         0 S
> com.android.chrome_zygote
> u0_a142      11173  1359 34208592 142192 do_epoll_wait      0 S
> com.android.chrome:privileged_process0
>
> /proc/10774/task # ls
> 10774
>
> >
> > So, yeah, getting a clear understanding of how this ends up being a problem on
> > Android would be great.
>
> I guess the real issue is that in the Android market, there
> are so many applications that are out of our control?
>
> Here are some trace examples from Nanzhe:
>
> iQIYI plugin
> vma reader thread:
> PbMisc-0, pid=27183, tgid=26444
>
> vma writer thread:
> i.video:plugin1, pid=27298, tgid=26444
> writer blocked: 440394938 ns (440 ms)
>
> reader stack:
> vma_start_read
> lock_vma_under_rcu
> do_page_fault
> do_translation_fault
> do_mem_abort
> el0_da
> el0t_64_sync_handler
> el0t_64_sync
>
> writer stack:
> __vma_start_write
> dup_mmap
> copy_mm
> copy_process
> kernel_clone
> __arm64_sys_clone
> invoke_syscall
> el0_svc_common
> do_el0_svc
> el0_svc
>
>
> Baidu Tieba
> vma reader thread:
> elastic_pms_pro, pid=7731, tgid=7575
>
> vma writer thread:
> com.baidu.tieba, pid=8005, tgid=7575
> writer blocked: 514975545 ns (515 ms)
>
> reader stack:
> vma_start_read
> lock_vma_under_rcu
> do_page_fault
> do_translation_fault
> do_mem_abort
> el0_da
> el0t_64_sync_handler
> el0t_64_sync
>
> writer stack:
> __vma_start_write
> dup_mmap
> copy_mm
> copy_process
> kernel_clone
> __arm64_sys_clone
> invoke_syscall
> el0_svc_common
> do_el0_svc
> el0_svc
>
> Thanks
> Barry

Again this is making me want to sit outside and sip on some lemonade and
ice :)

Yes - android processes are aggressively multi-threaded, sure of course.

The missing bit here is the forking - what, where, why, when?

And then you say zygote is sometimes multi-threaded but sometimes
single-threaded, which is adding a whole bunch of confusion on top of all
that.

I don't find these stack trace dumps all that useful (though thanks of
course for taking the time to gather them), I think we'd be better off with
specific data on forking, in some _concise_ _summarised_ form, ideally with
numbers.

There's such a thing as too much information :))

Anyway, again, please let's see a new _RFC_ with the approach proposed by
Suren, with some _succinct_ data demonstrating _exactly_ what the problem
is, so we can make some headway here.

And now I'm off for a cornetto! :)

Thanks, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-22 15:53                                   ` Lorenzo Stoakes
@ 2026-05-22 21:31                                     ` Barry Song
  2026-06-20 23:48                                       ` Suren Baghdasaryan
  0 siblings, 1 reply; 80+ messages in thread
From: Barry Song @ 2026-05-22 21:31 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand (Arm), Matthew Wilcox, Liam R. Howlett,
	Suren Baghdasaryan, akpm, linux-mm, vbabka, rppt, mhocko, jack,
	pfalcato, wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1,
	chrisl, kasong, shikemeng, nphamcs, bhe, youngjun.park,
	linux-arm-kernel, linux-kernel, loongarch, linuxppc-dev,
	linux-riscv, linux-s390, Nanzhe Zhao

>
> Again this is making me want to sit outside and sip on some lemonade and
> ice :)
>
> Yes - android processes are aggressively multi-threaded, sure of course.
>
> The missing bit here is the forking - what, where, why, when?
>

I really want to know the what, where, why, and when
as well. But since most applications are not
open-source, it is basically a black hole for anyone
other than the owners of those apps.

Let me try to do more investigation to understand what
is going on, although it is really hard.
To be honest, I would rather the Android framework
completely prohibit apps from calling fork(), if
possible.

> And then you say zygote is sometimes multi-threaded but sometimes
> single-threaded, which is adding a whole bunch of confusion on top of all
> that.
>
> I don't find these stack trace dumps all that useful (though thanks of
> course for taking the time to gather them), I think we'd be better off with
> specific data on forking, in some _concise_ _summarised_ form, ideally with
> numbers.
>
> There's such a thing as too much information :))

This trace shows PF I/O in one thread overlapping
with a fork() call in another thread.
But as I explained, I really do not know what kind of
user behavior is behind it.

>
> Anyway, again, please let's see a new _RFC_ with the approach proposed by
> Suren, with some _succinct_ data demonstrating _exactly_ what the problem
> is, so we can make some headway here.

Okay, sure. Thanks for your patience.

>
> And now I'm off for a cornetto! :)

Sounds good :) Enjoy your cornetto!

Best Regards
Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-22 21:31                                     ` Barry Song
@ 2026-06-20 23:48                                       ` Suren Baghdasaryan
  2026-06-21 20:49                                         ` Matthew Wilcox
  0 siblings, 1 reply; 80+ messages in thread
From: Suren Baghdasaryan @ 2026-06-20 23:48 UTC (permalink / raw)
  To: Barry Song
  Cc: Lorenzo Stoakes, David Hildenbrand (Arm), Matthew Wilcox,
	Liam R. Howlett, akpm, linux-mm, vbabka, rppt, mhocko, jack,
	pfalcato, wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1,
	chrisl, kasong, shikemeng, nphamcs, bhe, youngjun.park,
	linux-arm-kernel, linux-kernel, loongarch, linuxppc-dev,
	linux-riscv, linux-s390, Nanzhe Zhao

On Fri, May 22, 2026 at 2:31 PM Barry Song <baohua@kernel.org> wrote:
>
> >
> > Again this is making me want to sit outside and sip on some lemonade and
> > ice :)
> >
> > Yes - android processes are aggressively multi-threaded, sure of course.
> >
> > The missing bit here is the forking - what, where, why, when?
> >
>
> I really want to know the what, where, why, and when
> as well. But since most applications are not
> open-source, it is basically a black hole for anyone
> other than the owners of those apps.
>
> Let me try to do more investigation to understand what
> is going on, although it is really hard.
> To be honest, I would rather the Android framework
> completely prohibit apps from calling fork(), if
> possible.
>
> > And then you say zygote is sometimes multi-threaded but sometimes
> > single-threaded, which is adding a whole bunch of confusion on top of all
> > that.
> >
> > I don't find these stack trace dumps all that useful (though thanks of
> > course for taking the time to gather them), I think we'd be better off with
> > specific data on forking, in some _concise_ _summarised_ form, ideally with
> > numbers.
> >
> > There's such a thing as too much information :))
>
> This trace shows PF I/O in one thread overlapping
> with a fork() call in another thread.
> But as I explained, I really do not know what kind of
> user behavior is behind it.
>
> >
> > Anyway, again, please let's see a new _RFC_ with the approach proposed by
> > Suren, with some _succinct_ data demonstrating _exactly_ what the problem
> > is, so we can make some headway here.
>
> Okay, sure. Thanks for your patience.

Just checking in on the followup plans. IIUC the RFC mentioned will
try to implement the solution we discussed at LSFMM: splitting
VM_FAULT_RETRY into two flags - one for retrying under per-VMA locks
and another one to fall back to mmap_lock.
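
To make the idea concrete, here is a toy userspace model of a retry loop with two distinct retry outcomes. It is only a sketch: FAULT_RETRY_VMA, FAULT_RETRY_MMAP, handle_fault() and the sample callbacks are hypothetical names invented for this example. None of them are existing kernel symbols, and no real locks are taken.

```c
#include <assert.h>

/* Hypothetical outcomes of one fault attempt; these names are invented here. */
enum fault_result {
	FAULT_DONE,		/* fault handled */
	FAULT_RETRY_VMA,	/* lock dropped to wait for I/O; retry under the per-VMA lock */
	FAULT_RETRY_MMAP,	/* per-VMA path cannot proceed; retry under mmap_lock */
};

enum lock_kind { LOCK_VMA, LOCK_MMAP };

struct fault_stats {
	int vma_attempts;	/* attempts made under the per-VMA lock */
	int mmap_attempts;	/* attempts made under mmap_lock */
};

/* One attempt at handling the fault; 'attempt' counts from 0. */
typedef enum fault_result (*fault_fn)(enum lock_kind lock, int attempt);

/* Retry loop: only FAULT_RETRY_MMAP changes which lock the next attempt uses. */
static void handle_fault(fault_fn fn, struct fault_stats *st)
{
	enum lock_kind lock = LOCK_VMA;
	int attempt = 0;

	assert(fn && st);
	for (;;) {
		enum fault_result r;

		if (lock == LOCK_VMA)
			st->vma_attempts++;
		else
			st->mmap_attempts++;

		r = fn(lock, attempt++);
		if (r == FAULT_DONE)
			return;
		if (r == FAULT_RETRY_MMAP)
			lock = LOCK_MMAP;
	}
}

/* Sample: the first attempt waits for I/O, the second finds the folio ready. */
static enum fault_result io_wait_once(enum lock_kind lock, int attempt)
{
	(void)lock;
	return attempt == 0 ? FAULT_RETRY_VMA : FAULT_DONE;
}

/* Sample: the per-VMA attempt demands the mmap_lock fallback. */
static enum fault_result needs_mmap_lock(enum lock_kind lock, int attempt)
{
	(void)attempt;
	return lock == LOCK_VMA ? FAULT_RETRY_MMAP : FAULT_DONE;
}
```

With io_wait_once, handle_fault() makes two attempts and both stay under the per-VMA lock. With needs_mmap_lock, it makes one per-VMA attempt followed by one under mmap_lock. With a single retry code, the caller could not tell these two cases apart.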

Barry, if you need any help or clarification, please do not hesitate
to contact me.
Thanks,
Suren.

>
> >
> > And now I'm off for a cornetto! :)
>
> Sounds good :) Enjoy your cornetto!
>
> Best Regards
> Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-06-20 23:48                                       ` Suren Baghdasaryan
@ 2026-06-21 20:49                                         ` Matthew Wilcox
  2026-06-22  0:15                                           ` Barry Song
  0 siblings, 1 reply; 80+ messages in thread
From: Matthew Wilcox @ 2026-06-21 20:49 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Barry Song, Lorenzo Stoakes, David Hildenbrand (Arm),
	Liam R. Howlett, akpm, linux-mm, vbabka, rppt, mhocko, jack,
	pfalcato, wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1,
	chrisl, kasong, shikemeng, nphamcs, bhe, youngjun.park,
	linux-arm-kernel, linux-kernel, loongarch, linuxppc-dev,
	linux-riscv, linux-s390, Nanzhe Zhao

On Sat, Jun 20, 2026 at 04:48:57PM -0700, Suren Baghdasaryan wrote:
> Just checking in on the followup plans. IIUC the RFC mentioned will
> try to implement the solution we discussed at LSFMM: splitting
> VM_FAULT_RETRY into two flags - one for retrying under per-VMA locks
> and another one to fall back to mmap_lock.

I continue to hate this idea.  I don't believe that those who were
pushing for it have ever tried to understand the whole fault path.
It's utterly byzantine.

I defy anyone to make sense of this:

        /*
         * NOTE! This will make us return with VM_FAULT_RETRY, but with
         * the fault lock still held. That's how FAULT_FLAG_RETRY_NOWAIT
         * is supposed to work. We have way too many special cases..
         */
        if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
                return 0;

        *fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
        if (vmf->flags & FAULT_FLAG_KILLABLE) {
                if (__folio_lock_killable(folio)) {
                        /*
                         * We didn't have the right flags to drop the
                         * fault lock, but all fault_handlers only check
                         * for fatal signals if we return VM_FAULT_RETRY,
                         * so we need to drop the fault lock here and
                         * return 0 if we don't have a fpin.
                         */
                        if (*fpin == NULL)
                                release_fault_lock(vmf);
                        return 0;
                }

We need to simplify the fault path, not add additional complexity.
Josef has said he wouldn't've done the lock dropping had we had per-VMA
locks.  We should rip it out.



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-06-21 20:49                                         ` Matthew Wilcox
@ 2026-06-22  0:15                                           ` Barry Song
  2026-06-22 14:50                                             ` Liam R. Howlett
  0 siblings, 1 reply; 80+ messages in thread
From: Barry Song @ 2026-06-22  0:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Suren Baghdasaryan, Lorenzo Stoakes, David Hildenbrand (Arm),
	Liam R. Howlett, akpm, linux-mm, vbabka, rppt, mhocko, jack,
	pfalcato, wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1,
	chrisl, kasong, shikemeng, nphamcs, bhe, youngjun.park,
	linux-arm-kernel, linux-kernel, loongarch, linuxppc-dev,
	linux-riscv, linux-s390, Nanzhe Zhao

On Mon, Jun 22, 2026 at 4:49 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, Jun 20, 2026 at 04:48:57PM -0700, Suren Baghdasaryan wrote:
> > Just checking in on the followup plans. IIUC the RFC mentioned will
> > try to implement the solution we discussed at LSFMM: splitting
> > VM_FAULT_RETRY into two flags - one for retrying under per-VMA locks
> > and another one to fall back to mmap_lock.
>
> I continue to hate this idea.  I don't believe that those who were
> pushing for it have ever tried to understand the whole fault path.
> It's utterly byzantine.
>
> I defy anyone to make sense of this:
>
>         /*
>          * NOTE! This will make us return with VM_FAULT_RETRY, but with
>          * the fault lock still held. That's how FAULT_FLAG_RETRY_NOWAIT
>          * is supposed to work. We have way too many special cases..
>          */
>         if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
>                 return 0;
>
>         *fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
>         if (vmf->flags & FAULT_FLAG_KILLABLE) {
>                 if (__folio_lock_killable(folio)) {
>                         /*
>                          * We didn't have the right flags to drop the
>                          * fault lock, but all fault_handlers only check
>                          * for fatal signals if we return VM_FAULT_RETRY,
>                          * so we need to drop the fault lock here and
>                          * return 0 if we don't have a fpin.
>                          */
>                         if (*fpin == NULL)
>                                 release_fault_lock(vmf);
>                         return 0;
>                 }
>
> We need to simplify the fault path, not add additional complexity.
> Josef has said he wouldn't've done the lock dropping had we had per-VMA
> locks.  We should rip it out.

I think you have agreed that, at least for anon vma, we can
keep the current policy, since anon vma is much more volatile
than file vma.
Concurrent page faults and VMA modifications can happen more
often than with file VMAs.

For file vmas, how much code can we actually remove, given that
the first page fault might already be holding mmap_lock?
It could be the case that lock_vma_under_rcu() fails, and then
on the first page fault we end up holding mmap_lock before
retrying. So are we also going to rip out the lock release,
even if it risks holding mmap_lock for a long time?

        vma = lock_vma_under_rcu(mm, addr);
        if (!vma)
                goto lock_mmap;
       ...
lock_mmap:

        vma = lock_mm_and_find_vma(mm, addr, regs);
        if (unlikely(!vma)) {
                fault = 0;
                si_code = SEGV_MAPERR;
                goto bad_area;
        }

If we still need to keep the page fault retry code there, it
doesn't seem like "ripping out" really reduces complexity in
the page fault code?

Best Regards
Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-06-22  0:15                                           ` Barry Song
@ 2026-06-22 14:50                                             ` Liam R. Howlett
  2026-06-22 21:35                                               ` Barry Song
  0 siblings, 1 reply; 80+ messages in thread
From: Liam R. Howlett @ 2026-06-22 14:50 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, Suren Baghdasaryan, Lorenzo Stoakes,
	David Hildenbrand (Arm), akpm, linux-mm, vbabka, rppt, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On 26/06/22 08:15AM, Barry Song wrote:
> On Mon, Jun 22, 2026 at 4:49 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Sat, Jun 20, 2026 at 04:48:57PM -0700, Suren Baghdasaryan wrote:
> > > Just checking in on the followup plans. IIUC the RFC mentioned will
> > > try to implement the solution we discussed at LSFMM: splitting
> > > VM_FAULT_RETRY into two flags - one for retrying under per-VMA locks
> > > and another one to fall back to mmap_lock.
> >
> > I continue to hate this idea.  I don't believe that those who were
> > pushing for it have ever tried to understand the whole fault path.
> > It's utterly byzantine.
> >
> > I defy anyone to make sense of this:
> >
> >         /*
> >          * NOTE! This will make us return with VM_FAULT_RETRY, but with
> >          * the fault lock still held. That's how FAULT_FLAG_RETRY_NOWAIT
> >          * is supposed to work. We have way too many special cases..
> >          */
> >         if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
> >                 return 0;
> >
> >         *fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
> >         if (vmf->flags & FAULT_FLAG_KILLABLE) {
> >                 if (__folio_lock_killable(folio)) {
> >                         /*
> >                          * We didn't have the right flags to drop the
> >                          * fault lock, but all fault_handlers only check
> >                          * for fatal signals if we return VM_FAULT_RETRY,
> >                          * so we need to drop the fault lock here and
> >                          * return 0 if we don't have a fpin.
> >                          */
> >                         if (*fpin == NULL)
> >                                 release_fault_lock(vmf);
> >                         return 0;
> >                 }
> >
> > We need to simplify the fault path, not add additional complexity.
> > Josef has said he wouldn't've done the lock dropping had we had per-VMA
> > locks.  We should rip it out.
> 
> I think you have agreed that, at least for anon vma, we can
> keep the current policy, since anon vma is much more volatile
> than file vma.

I don't think any of the above has to do with anon vmas.  Does any anon
vma handling have anything to do with your problem?

This would be needed if anon vmas were being faulted while being
unmapped or merged?  Do we really need a fast path for that?  Note that
anon vmas cannot be merged if the vma chain... you know what, I wonder
how many people know what I'm talking about here... Let's just say that
they can't be merged if they were around for a fork.

So, then, we're looking at anon vmas taking the mmap lock on:
1. single task anon vmas being expanded and faulted at the same time
2. single task anon vmas being unmapped and faulted at the same time

I think that's it?

But maybe I missed something critical about your use case here?

I don't understand why you are involving anon vmas in this discussion,
so I must have missed something with your IO completion issue.  Is there
an anon vma causing your priority inversion?

> Concurrent page faults and VMA modifications can happen more
> often than with file VMAs.

But it's only a problem for anon vmas with per-vma locking if it's the
same vma (or the vma lock sequence counter overflows, but let's say
that's a statistically insignificant non-zero value).

> 
> For file vmas, how much code can we actually remove, given that
> the first page fault might already be holding mmap_lock?

How much complexity can we remove while maintaining performance might be
a better question.

> It could be the case that lock_vma_under_rcu() fails, and then
> on the first page fault we end up holding mmap_lock before
> retrying. So are we also going to rip out the lock release,
> even if it risks holding mmap_lock for a long time?
> 
>         vma = lock_vma_under_rcu(mm, addr);
>         if (!vma)
>                 goto lock_mmap;
>        ...
> lock_mmap:
> 
>         vma = lock_mm_and_find_vma(mm, addr, regs);
>         if (unlikely(!vma)) {
>                 fault = 0;
>                 si_code = SEGV_MAPERR;
>                 goto bad_area;
>         }
> 
> If we still need to keep the page fault retry code there, it
> doesn't seem like "ripping out" really reduces complexity in
> the page fault code?

This seems unrelated to the above complexity that might be the target of
removal?

Thanks,
Liam



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-06-22 14:50                                             ` Liam R. Howlett
@ 2026-06-22 21:35                                               ` Barry Song
  0 siblings, 0 replies; 80+ messages in thread
From: Barry Song @ 2026-06-22 21:35 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: Matthew Wilcox, Suren Baghdasaryan, Lorenzo Stoakes,
	David Hildenbrand (Arm), akpm, linux-mm, vbabka, rppt, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao, Hongru Zhang

On Mon, Jun 22, 2026 at 10:50 PM Liam R. Howlett <liam@infradead.org> wrote:
>
> On 26/06/22 08:15AM, Barry Song wrote:
> > On Mon, Jun 22, 2026 at 4:49 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Sat, Jun 20, 2026 at 04:48:57PM -0700, Suren Baghdasaryan wrote:
> > > > Just checking in on the followup plans. IIUC the RFC mentioned will
> > > > try to implement the solution we discussed at LSFMM: splitting
> > > > VM_FAULT_RETRY into two flags - one for retrying under per-VMA locks
> > > > and another one to fallback to mmap_lock.
> > >
> > > I continue to hate this idea.  I don't believe that those who were
> > > pushing for it have ever tried to understand the whole fault path.
> > > It's utterly byzantine.
> > >
> > > I defy anyone to make sense of this:
> > >
> > >         /*
> > >          * NOTE! This will make us return with VM_FAULT_RETRY, but with
> > >          * the fault lock still held. That's how FAULT_FLAG_RETRY_NOWAIT
> > >          * is supposed to work. We have way too many special cases..
> > >          */
> > >         if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
> > >                 return 0;
> > >
> > >         *fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
> > >         if (vmf->flags & FAULT_FLAG_KILLABLE) {
> > >                 if (__folio_lock_killable(folio)) {
> > >                         /*
> > >                          * We didn't have the right flags to drop the
> > >                          * fault lock, but all fault_handlers only check
> > >                          * for fatal signals if we return VM_FAULT_RETRY,
> > >                          * so we need to drop the fault lock here and
> > >                          * return 0 if we don't have a fpin.
> > >                          */
> > >                         if (*fpin == NULL)
> > >                                 release_fault_lock(vmf);
> > >                         return 0;
> > >                 }
> > >
> > > We need to simplify the fault path, not add additional complexity.
> > > Josef has said he wouldn't've done the lock dropping had we had per-VMA
> > > locks.  We should rip it out.
> >
> > I think you have agreed that, at least for anon vma, we can
> > keep the current policy, since anon vma is much more volatile
> > than file vma.
>
> I don't think any of the above has to do with anon vmas.  Does any anon
> vma handling have anything to do with your problem?

Hi Liam,

I think there may be a misunderstanding about the motivation behind
this series.

Currently, for both file-backed and anonymous VMAs, when a page fault
cannot lock the required folios—for example, because a folio is under
I/O during a major fault—the fault handler drops any locks it is
holding (either per-VMA locks or the mmap lock) and retries the fault
under the mmap_lock. This page-fault retry pattern requiring the
mmap_lock can lead to significant mmap_lock contention.

The entire purpose of this series is to avoid reacquiring the mmap_lock
where possible, while ensuring that the implementation does not
introduce new priority inversion issues or unnecessary complexity.

We have two possible approaches:

1. Keep the page-fault retry path, but retry under the per-VMA lock
whenever possible. In this case, we would need a flag to indicate
whether the retry should be performed under the per-VMA lock or the
mmap_lock.

2. Remove the page-fault retry path entirely. Instead, wait for the
folio to become lockable while retaining the locks currently held,
and continue the fault handling without retrying the page fault.

Approach 1 is the direction taken by both the current patch and the
RFC that was suggested.
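
As a rough illustration of Approach 1, the retry decision in an arch
fault handler could look like the sketch below. It is standalone
user-space C, not kernel code. The bit values mirror the draft diff
posted in this thread on 2026-05-22, VM_FAULT_RETRY_HARD is only a
proposed name, and enum retry_path and pick_retry_path() are
hypothetical helpers invented for this example.

```c
#include <assert.h>

/* Bit values as in the draft diff; VM_FAULT_RETRY_HARD is a proposal. */
#define VM_FAULT_RETRY      0x000400u
#define VM_FAULT_RETRY_HARD 0x000800u

/* Hypothetical helper type: which lock the retried fault should run under. */
enum retry_path { RETRY_NONE, RETRY_VMA_LOCK, RETRY_MMAP_LOCK };

enum retry_path pick_retry_path(unsigned int fault)
{
	/* In the draft, RETRY_HARD is only ever returned together with RETRY. */
	assert(!(fault & VM_FAULT_RETRY_HARD) || (fault & VM_FAULT_RETRY));

	if (!(fault & VM_FAULT_RETRY))
		return RETRY_NONE;	/* nothing to retry */
	if (fault & VM_FAULT_RETRY_HARD)
		return RETRY_MMAP_LOCK;	/* must fall back to mmap_lock */
	return RETRY_VMA_LOCK;		/* retry under the per-VMA lock */
}
```

With this shape, a plain VM_FAULT_RETRY (for example, the lock was
dropped only to wait for I/O) loops back to lock_vma_under_rcu(), and
only sites that explicitly add the second bit force the mmap_lock path.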

Approach 2 is a potential alternative, but I have never posted an RFC
proposing it.

For Approach 1, the primary concern seems to be the added complexity.

For Approach 2, my concern is the increased risk of priority
inversion. With this approach, we may end up holding a lock while
waiting for I/O completion, potentially for a considerable amount of
time. As a result, a concurrent VMA writer, along with any subsequent
mmap_lock acquirers blocked behind it, could be stalled for an
extended period.

If there is an Approach 3, it could be a hybrid: Approach 2 for file
VMAs and Approach 1 for anonymous VMAs.

>
> This would be needed if anon vmas were being faulted while being
> unmapped or merged?  Do we really need a fast path for that?  Note that
> anon vmas cannot be merged if the vma chain... you know what, I wonder
> how many people know what I'm talking about here... Let's just say that
> they can't be merged if they were around for a fork.

In terms of fork(), this is the concern I raised when considering
approach 2—holding the VMA lock while performing I/O, since a
concurrent fork would need to acquire the VMA write lock.

I had Hongru add some tracing code and run it against the top 200
Android applications in the China market. All of them are heavily
multi-threaded. Unfortunately, we found that 82 of these 200 Android
applications call fork(), and some even call fork() from multiple
threads.

So, although it may be technically a bad idea to call fork() in a
multi-threaded application, it appears that in practice it is still
widely used in real-world applications.

I guess Hongru (Cc-ed) will share his observations later today or
tomorrow.

>
> So, then, we're looking at anon vmas taking the mmap lock on:
> 1. single task anon vmas being expanded and faulted at the same time
> 2. single task anon vmas being unmapped and faulted at the same time
>
> I think that's it?

Yes and no. It could also include mprotect(), UFFDIO_REGISTER,
UFFDIO_UNREGISTER, setting VMA names, and similar VMA updates.

Note that Java GC may also invoke UFFDIO_REGISTER and
UFFDIO_UNREGISTER on Java heaps.

Note that priority inversion can still occur between threads that are
not operating on the same VMA if we take approach 2.

For example:

Thread A: page fault in vma1, holding the VMA lock and waiting for I/O.

Thread B: concurrent write on vma1 (takes mmap_lock and then waits for
the VMA write lock).

Thread C: concurrent write on vma2, or a VMA iteration (blocks on the
mmap_lock held by Thread B).

In this scenario, Thread C may end up indirectly waiting for Thread A.
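
The chain above can be written down as a tiny wait-for graph. This is
only a model of the scenario in standalone C, with no real locking.
struct task, root_blocker() and the three thread_* objects are
hypothetical names made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One node per thread; blocked_on is NULL when it waits only on I/O. */
struct task {
	const char *name;
	const struct task *blocked_on;
};

/* A holds the vma1 read lock across I/O, B holds mmap_lock and wants
 * the vma1 write lock, C wants mmap_lock. */
const struct task thread_a = { "A: fault on vma1, waiting for I/O", NULL };
const struct task thread_b = { "B: writer on vma1", &thread_a };
const struct task thread_c = { "C: writer on vma2", &thread_b };

/* Follow the wait-for edges to the thread everyone is really waiting on. */
const struct task *root_blocker(const struct task *t)
{
	assert(t != NULL);
	while (t->blocked_on != NULL)
		t = t->blocked_on;
	return t;
}
```

Here root_blocker(&thread_c) resolves to thread_a, although Thread C
never touches vma1.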

>
> But maybe I missed something critical about your use case here?
>
> I don't understand why you are involving anon vmas in this discussion,
> so I must have missed something with your IO completion issue.  Is there
> an anon vma causing your priority inversion?

As explained, the primary goal is to reduce mmap_lock contention by
avoiding taking the mmap_lock whenever possible, while ensuring that
the implementation does not introduce new priority inversion issues.

>
> > Concurrent page faults and VMA modifications can happen more
> > often than with file VMAs.
>
> But it's only a problem for anon vmas with per-vma locking if it's the
> same vma (or the vma lock sequence counter overflows, but let's say
> that's a statistically insignificant non-zero value).
>
> >
> > For file vmas, how much code can we actually remove, given that
> > the first page fault might already be holding mmap_lock?
>
> How much complexity can we remove while maintaining performance might be
> a better question.

Right, thanks for improving the question.

>
> > It could be the case that lock_vma_under_rcu() fails, and then
> > on the first page fault we end up holding mmap_lock before
> > retrying. So are we also going to rip out the lock release,
> > even if it risks holding mmap_lock for a long time?
> >
> >         vma = lock_vma_under_rcu(mm, addr);
> >         if (!vma)
> >                 goto lock_mmap;
> >        ...
> > lock_mmap:
> >
> >         vma = lock_mm_and_find_vma(mm, addr, regs);
> >         if (unlikely(!vma)) {
> >                 fault = 0;
> >                 si_code = SEGV_MAPERR;
> >                 goto bad_area;
> >         }
> >
> > If we still need to keep the page fault retry code there, it
> > doesn't seem like "ripping out" really reduces complexity in
> > the page fault code?
>
> This seems unrelated to the above complexity that might be the target of
> removal?

I think it is highly related. If we take approach 2—holding locks to
perform I/O and removing the page-fault retry path—we need to
consider whether the same behavior should also apply when we are
already holding the mmap_lock. We should understand the full picture
before focusing on a specific part in isolation.

Thanks
Barry

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-20 21:35                               ` David Hildenbrand (Arm)
  2026-05-20 23:37                                 ` Barry Song
@ 2026-06-23  7:58                                 ` Hongru Zhang
  2026-06-23  8:02                                   ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 80+ messages in thread
From: Hongru Zhang @ 2026-06-23  7:58 UTC (permalink / raw)
  To: david
  Cc: akpm, baohua, bhe, chentao, chrisl, jack, kasong, kunwu.chan,
	liam, lianux.mm, linux-arm-kernel, linux-kernel, linux-mm,
	linux-riscv, linux-s390, linuxppc-dev, liyangouwen1, ljs,
	loongarch, mhocko, nphamcs, nzzhao, pfalcato, rppt, shikemeng,
	surenb, vbabka, wanglian, willy, youngjun.park, zhanghongru06

> On 5/20/26 23:15, Matthew Wilcox wrote:
> > On Thu, May 21, 2026 at 05:14:20AM +0800, Barry Song wrote:
> >> My understanding is that we should not blame applications here. This is 2026:
> >> there are basically only two kinds of applications — single-threaded and
> >> multi-threaded — and single-threaded applications are nearly extinct.
> > 
> > all of the applications i run are either single threaded or don't fork.
> > what multithreaded applications call fork?
>
> Traditionally the problem was random libraries using fork+execve to launch other
> programs ... instead of using alternatives like posix_spawn (some use cases
> require more work done before execve and cannot yet switch to that). I'd hope
> that that is less of a problem on Android.
>
> I assume Android zygote might be multi threaded? Maybe sshd as well? Systemd?
> But I'd be surprised if there are really performance implications.
>
> Not sure about web browsers ... I think most of them switched to fork servers,
> where I would assume fork servers would be single-threaded.
>
> So, yeah, getting a clear understanding how this ends up being a problem on
> Android would be great.

Barry asked me to share observations on fork() usage across Android
applications.

I wrote a BPF-based tracing tool (kprobe on copy_process, checking
CLONE_VM to distinguish process creation from thread creation) and ran
it against the top 200 Android applications in the China market during
normal usage scenarios.
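
The classification step can be sketched in plain C. This is not the
tracing tool itself, only the flag test it is described as doing. The
SK_ constants are local copies of the Linux UAPI clone flag values
(SK_SIGCHLD is the arm64/x86 value), and is_process_creation() and
count_process_creations() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local copies of Linux clone flag values, for illustration only. */
#define SK_SIGCHLD      17UL
#define SK_CLONE_VM     0x00000100UL
#define SK_CLONE_VFORK  0x00004000UL
#define SK_CLONE_THREAD 0x00010000UL

/* A clone without CLONE_VM gets its own mm, i.e. a fork()-style copy. */
bool is_process_creation(unsigned long clone_flags)
{
	return !(clone_flags & SK_CLONE_VM);
}

/* Count fork()-style events in a recorded list of clone flags. */
size_t count_process_creations(const unsigned long *flags, size_t n)
{
	size_t i, count = 0;

	assert(n == 0 || flags != NULL);
	for (i = 0; i < n; i++)
		if (is_process_creation(flags[i]))
			count++;
	return count;
}
```

One consequence of keying on CLONE_VM alone, assuming the tool does
exactly that: vfork()-style clones share the address space, so they
would be grouped with thread creation rather than with fork().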

Results:
- 82 out of 200 apps (41%) call fork() during normal operation
- Among these, some call fork() from multiple threads

These are not zygote forks — they are fork() calls initiated by app
threads at runtime. Examples by category:

  Browsers:     com.quark.browser, com.UCMobile, com.xunlei.browser
  Shopping:     com.taobao.taobao, com.tmall.wireless, com.achievo.vipshop
  Video:        com.youku.phone, com.qiyi.video, com.hunantv.imgo.activity
  Social/IM:    com.alibaba.android.rimet, com.ss.android.lark
  News:         com.ss.android.article.news, com.ss.android.article.lite
  Navigation:   com.autonavi.minimap, com.sdu.didi.psnger
  Finance:      com.eg.android.AlipayGphone, com.chinamworld.main

This confirms that fork() is widely used in real-world multi-threaded
Android applications. Since dup_mmap() has to call vma_start_write() on
every VMA, holding the VMA lock across I/O would risk blocking fork()
for unpredictable durations in these 82 applications.

Tracing tool (two equivalent implementations):
  bpftrace:         https://gist.github.com/zhr250/bf4384202d598bb4cda71cb9902f15ab
  libbpf-bootstrap: https://gist.github.com/zhr250/76189bdf51bdc8818500e4c8917c6493

Analysis results (top 200 apps):
  https://gist.github.com/zhr250/06f51092c84a49c602a55ac3d186e9ce

Hongru



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-06-23  7:58                                 ` Hongru Zhang
@ 2026-06-23  8:02                                   ` David Hildenbrand (Arm)
  2026-06-23 10:10                                     ` Hongru Zhang
  0 siblings, 1 reply; 80+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-23  8:02 UTC (permalink / raw)
  To: Hongru Zhang
  Cc: akpm, baohua, bhe, chentao, chrisl, jack, kasong, kunwu.chan,
	liam, lianux.mm, linux-arm-kernel, linux-kernel, linux-mm,
	linux-riscv, linux-s390, linuxppc-dev, liyangouwen1, ljs,
	loongarch, mhocko, nphamcs, nzzhao, pfalcato, rppt, shikemeng,
	surenb, vbabka, wanglian, willy, youngjun.park

On 6/23/26 09:58, Hongru Zhang wrote:
>> On 5/20/26 23:15, Matthew Wilcox wrote:
>>>
>>> all of the applications i run are either single threaded or don't fork.
>>> what multithreaded applications call fork?
>>
>> Traditionally the problem was random libraries using fork+execve to launch other
>> programs ... instead of using alternatives like posix_spawn (some use cases
>> require more work done before execve and cannot yet switch to that). I'd hope
>> that that is less of a problem on Android.
>>
>> I assume Android zygote might be multi threaded? Maybe sshd as well? Systemd?
>> But I'd be surprised if there are really performance implications.
>>
>> Not sure about web browsers ... I think most of them switched to fork servers,
>> where I would assume fork servers would be single-threaded.
>>
>> So, yeah, getting a clear understanding how this ends up being a problem on
>> Android would be great.
> 
> Barry asked me to share observations on fork() usage across Android
> applications.
> 
> I wrote a BPF-based tracing tool (kprobe on copy_process, checking
> CLONE_VM to distinguish process creation from thread creation) and ran
> it against the top 200 Android applications in the China market during
> normal usage scenarios.
> 
> Results:
> - 82 out of 200 apps (41%) call fork() during normal operation

Crazy. Thanks for these numbers.

> - Among these, some call fork() from multiple threads
> 
> These are not zygote forks — they are fork() calls initiated by app
> threads at runtime. Examples by category:
> 
>   Browsers:     com.quark.browser, com.UCMobile, com.xunlei.browser
>   Shopping:     com.taobao.taobao, com.tmall.wireless, com.achievo.vipshop
>   Video:        com.youku.phone, com.qiyi.video, com.hunantv.imgo.activity
>   Social/IM:    com.alibaba.android.rimet, com.ss.android.lark
>   News:         com.ss.android.article.news, com.ss.android.article.lite
>   Navigation:   com.autonavi.minimap, com.sdu.didi.psnger
>   Finance:      com.eg.android.AlipayGphone, com.chinamworld.main

I know that browsers in particular usually use fork servers: a tiny
(single-threaded) process just to create new child processes. Any information
regarding the apps above that use fork() on small vs. large processes?

> 
> This confirms that fork() is widely used in real-world multi-threaded
> Android applications.

Above you write "some call fork() from multiple threads". Any further
information on that?

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-06-23  8:02                                   ` David Hildenbrand (Arm)
@ 2026-06-23 10:10                                     ` Hongru Zhang
  0 siblings, 0 replies; 80+ messages in thread
From: Hongru Zhang @ 2026-06-23 10:10 UTC (permalink / raw)
  To: david
  Cc: akpm, baohua, bhe, chentao, chrisl, jack, kasong, kunwu.chan,
	liam, lianux.mm, linux-arm-kernel, linux-kernel, linux-mm,
	linux-riscv, linux-s390, linuxppc-dev, liyangouwen1, ljs,
	loongarch, mhocko, nphamcs, nzzhao, pfalcato, rppt, shikemeng,
	surenb, vbabka, wanglian, willy, youngjun.park, zhanghongru06

On 6/23/26 10:02, David Hildenbrand wrote:
> I know that browsers in particular usually use fork servers: a tiny
> (single-threaded) process just to create new child processes. Any information
> regarding the apps above that use fork() on small vs. large processes?

I wrote a second BPF tool (fork_info) that captures nr_threads and
map_count (VMA count) from the calling process at the exact moment
fork() is triggered. Results from 3 representative apps:

  App (category)          Fork caller        Threads   VMAs
  -----------------------------------------------------------
  Taobao (shopping)       DaemonThread-6        526    8,987
  Amap (navigation)       DaemonThread-6        289    7,120
  UC Browser (browser)    OneNativeThread       350    8,144

These are all heavyweight multi-threaded processes (hundreds of threads,
7,000-9,000 VMAs), not fork servers.

> Above you write "some call fork() from multiple threads". Any further
> information on that?

Xiaohongshu (com.xingin.xhs, social media) is a clear example. In just
tens of seconds of normal usage, fork() was called 22 times from 4
different threads:

  PID     COMM            THREADS    VMAS
  4206    com.xingin.xhs       85    4,140
  4216    Thread-2208          85    4,157
  4208    Thread-2208          90    4,211
  5200    Thread-3200         337    6,519
  5200    Thread-3200         343    6,563
  5200    Thread-3200         361    6,769
  5200    Thread-3200         453    7,793
  5200    Thread-3200         450    7,779
  5202    Thread-2219         459    7,846
  5202    Thread-2219         462    7,875
  5202    Thread-2219         465    7,899
  4219    Thread-2219         465    7,903
  4219    Thread-2219         468    7,922
  5202    Thread-2219         467    7,917
  4219    Thread-2219         467    7,921
  4219    Thread-2219         468    7,929
  5202    Thread-2219         464    7,909
  5202    Thread-2219         460    7,889
  5202    Thread-2219         459    7,884
  4219    Thread-2219         433    7,771
  4219    Thread-2219         433    7,771
  4219    Thread-2219         434    7,778

The process grew from 85 threads / 4,140 VMAs at first fork to
434 threads / 7,778 VMAs at last fork, showing these are long-lived
heavyweight processes that fork repeatedly throughout their lifecycle.

Tracing tool:
  https://gist.github.com/zhr250/ba7725d0ea55594bcafd3cd4806eed98

Hongru


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-20 21:15                             ` Matthew Wilcox
  2026-05-20 21:35                               ` David Hildenbrand (Arm)
@ 2026-05-22  2:33                               ` Barry Song (Xiaomi)
  2026-05-22 13:09                                 ` Matthew Wilcox
  1 sibling, 1 reply; 80+ messages in thread
From: Barry Song (Xiaomi) @ 2026-05-22  2:33 UTC (permalink / raw)
  To: willy
  Cc: akpm, baohua, bhe, chentao, chrisl, david, jack, kasong,
	kunwu.chan, liam, lianux.mm, linux-arm-kernel, linux-kernel,
	linux-mm, linux-riscv, linux-s390, linuxppc-dev, liyangouwen1,
	ljs, loongarch, mhocko, nphamcs, nzzhao, pfalcato, rppt,
	shikemeng, surenb, vbabka, wanglian, youngjun.park

On Thu, May 21, 2026 at 5:16 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, May 21, 2026 at 05:14:20AM +0800, Barry Song wrote:
> > My understanding is that we should not blame applications here. This is 2026:
> > there are basically only two kinds of applications — single-threaded and
> > multi-threaded — and single-threaded applications are nearly extinct.
>
> all of the applications i run are either single threaded or don't fork.
> what multithreaded applications call fork?

As I replied to David [1], we cannot control what those apps do.
Technically, I agree with you that calling fork() within a
multithreaded app may not be a good idea. But in such a complex
ecosystem, we cannot simply say no to those apps.

Especially since we would be shipping this change to improve the
kernel, yet the first thing our customers may notice is that our
phones regress their apps. That feels unfair.

I can offer a two-step plan. For the first step, we keep the
current approach of dropping the VMA lock and retrying page faults,
while trying to make the smallest possible change.
As discussed with Suren, the draft code is being changed from a
whitelist approach to a blacklist approach. This way, we do not
need to touch `filemap.c` at all (probably because you are already
maintaining `filemap.c` perfectly):

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 63de8e8684f2..4101d5fa7a82 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1322,6 +1322,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, address);
 	if (!vma)
 		goto lock_mmap;
@@ -1351,6 +1352,8 @@ void do_user_addr_fault(struct pt_regs *regs,
 						 ARCH_DEFAULT_PKEY);
 		return;
 	}
+	if (!(fault & VM_FAULT_RETRY_HARD))
+		goto retry_vma;
 lock_mmap:
 
 retry:
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a308e2c23b82..eeb7d6091bef 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1659,6 +1659,7 @@ typedef __bitwise unsigned int vm_fault_t;
  * @VM_FAULT_NOPAGE:		->fault installed the pte, not return page
  * @VM_FAULT_LOCKED:		->fault locked the returned page
  * @VM_FAULT_RETRY:		->fault blocked, must retry
+ * @VM_FAULT_RETRY_HARD:	->fault blocked, must retry via mmap_lock
  * @VM_FAULT_FALLBACK:		huge page fault failed, fall back to small
  * @VM_FAULT_DONE_COW:		->fault has fully handled COW
  * @VM_FAULT_NEEDDSYNC:		->fault did not modify page tables and needs
@@ -1678,10 +1679,11 @@ enum vm_fault_reason {
 	VM_FAULT_NOPAGE         = (__force vm_fault_t)0x000100,
 	VM_FAULT_LOCKED         = (__force vm_fault_t)0x000200,
 	VM_FAULT_RETRY          = (__force vm_fault_t)0x000400,
-	VM_FAULT_FALLBACK       = (__force vm_fault_t)0x000800,
-	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
-	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
-	VM_FAULT_COMPLETED      = (__force vm_fault_t)0x004000,
+	VM_FAULT_RETRY_HARD     = (__force vm_fault_t)0x000800,
+	VM_FAULT_FALLBACK       = (__force vm_fault_t)0x001000,
+	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x002000,
+	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x004000,
+	VM_FAULT_COMPLETED      = (__force vm_fault_t)0x008000,
 	VM_FAULT_HINDEX_MASK    = (__force vm_fault_t)0x0f0000,
 };
 
diff --git a/mm/memory.c b/mm/memory.c
index 7c020995eafc..b3e7ffdd83f9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3797,7 +3797,7 @@ static inline vm_fault_t vmf_can_call_fault(const struct vm_fault *vmf)
 	if (vma->vm_ops->map_pages || !(vmf->flags & FAULT_FLAG_VMA_LOCK))
 		return 0;
 	vma_end_read(vma);
-	return VM_FAULT_RETRY;
+	return VM_FAULT_RETRY | VM_FAULT_RETRY_HARD;
 }
 
 /**
@@ -3824,7 +3824,7 @@ vm_fault_t __vmf_anon_prepare(struct vm_fault *vmf)
 		return 0;
 	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
 		if (!mmap_read_trylock(vma->vm_mm))
-			return VM_FAULT_RETRY;
+			return VM_FAULT_RETRY | VM_FAULT_RETRY_HARD;
 	}
 	if (__anon_vma_prepare(vma))
 		ret = VM_FAULT_OOM;
@@ -4778,7 +4778,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				 * under VMA lock.
 				 */
 				vma_end_read(vma);
-				ret = VM_FAULT_RETRY;
+				ret = VM_FAULT_RETRY | VM_FAULT_RETRY_HARD;
 				goto out;
 			}
 

For the second step, we can move forward with your approach of
ripping out the PF retry code, after getting in touch with the
owners of those popular apps one by one to understand why they are
doing this and whether they can find a different approach. In
short, this would allow for a one- or two-year transition period.

What do you think about that?

[1] https://lore.kernel.org/linux-mm/CAGsJ_4xC5LdhuoWV1=tK-RZ5rkjc8aOKOkmb1L_8BG_3gtJhDg@mail.gmail.com/


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-22  2:33                               ` Barry Song (Xiaomi)
@ 2026-05-22 13:09                                 ` Matthew Wilcox
  2026-05-22 13:36                                   ` Barry Song
  0 siblings, 1 reply; 80+ messages in thread
From: Matthew Wilcox @ 2026-05-22 13:09 UTC (permalink / raw)
  To: Barry Song (Xiaomi)
  Cc: akpm, bhe, chentao, chrisl, david, jack, kasong, kunwu.chan, liam,
	lianux.mm, linux-arm-kernel, linux-kernel, linux-mm, linux-riscv,
	linux-s390, linuxppc-dev, liyangouwen1, ljs, loongarch, mhocko,
	nphamcs, nzzhao, pfalcato, rppt, shikemeng, surenb, vbabka,
	wanglian, youngjun.park

On Fri, May 22, 2026 at 10:33:05AM +0800, Barry Song (Xiaomi) wrote:
> need to touch `filemap.c` at all (probably because you are already
> maintaining `filemap.c` perfectly):

I'm going to give you one chance to apologise for that.


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-22 13:09                                 ` Matthew Wilcox
@ 2026-05-22 13:36                                   ` Barry Song
  2026-05-22 13:48                                     ` Barry Song
  0 siblings, 1 reply; 80+ messages in thread
From: Barry Song @ 2026-05-22 13:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, bhe, chentao, chrisl, david, jack, kasong, kunwu.chan, liam,
	lianux.mm, linux-arm-kernel, linux-kernel, linux-mm, linux-riscv,
	linux-s390, linuxppc-dev, liyangouwen1, ljs, loongarch, mhocko,
	nphamcs, nzzhao, pfalcato, rppt, shikemeng, surenb, vbabka,
	wanglian, youngjun.park

On Fri, May 22, 2026 at 9:09 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, May 22, 2026 at 10:33:05AM +0800, Barry Song (Xiaomi) wrote:
> > need to touch `filemap.c` at all (probably because you are already
> > maintaining `filemap.c` perfectly):
>
> I'm going to give you one chance to apologise for that.

Apologies if my wording caused any misunderstanding.
That was not my intention at all.

What I meant is that filemap.c already has a very
solid design.

For memory.c, I had to touch several places for the
blacklist; otherwise, the kernel would hang.

But for filemap.c, I basically didn't need to touch
anything, and preliminary testing shows no issues after
moving it from the whitelist to the blacklist. This is
probably because the current filemap.c design is
already handling some aspects really well.

That is all I meant.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-22 13:36                                   ` Barry Song
@ 2026-05-22 13:48                                     ` Barry Song
  2026-05-22 15:42                                       ` Lorenzo Stoakes
  0 siblings, 1 reply; 80+ messages in thread
From: Barry Song @ 2026-05-22 13:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, bhe, chentao, chrisl, david, jack, kasong, kunwu.chan, liam,
	lianux.mm, linux-arm-kernel, linux-kernel, linux-mm, linux-riscv,
	linux-s390, linuxppc-dev, liyangouwen1, ljs, loongarch, mhocko,
	nphamcs, nzzhao, pfalcato, rppt, shikemeng, surenb, vbabka,
	wanglian, youngjun.park

On Fri, May 22, 2026 at 9:36 PM Barry Song <baohua@kernel.org> wrote:
>
> On Fri, May 22, 2026 at 9:09 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, May 22, 2026 at 10:33:05AM +0800, Barry Song (Xiaomi) wrote:
> > > need to touch `filemap.c` at all (probably because you are already
> > > maintaining `filemap.c` perfectly):
> >
> > I'm going to give you one chance to apologise for that.
>
> Apologies if my wording caused any misunderstanding.
> That was not my intention at all.
>
> What I meant is that filemap.c already has a very
> solid design.
>
> For memory.c, I had to touch several places for the
> blacklist; otherwise, the kernel would hang.
>
> But for filemap.c, I basically didn't need to touch
> anything, and preliminary testing shows no issues after
> moving it from the whitelist to the blacklist. This is

Sorry, I feel I may be causing some misunderstanding
again.

By "whitelist", I mean I used to allow certain cases
to use per-vma retry.

By "blacklist", I mean I am now moving to disallow
certain cases from using per-vma retry.

Right now, I have to add several cases in memory.c
to the blacklist; otherwise, the kernel would hang.

But it seems that everything in filemap.c is fine so
far based on testing.
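
To make the two terms concrete, here is a small standalone C sketch of
the difference. The enums and retries_under_vma_lock() are hypothetical
and exist only for this illustration. In the draft, the marking is done
by returning an extra fault bit from the call site.

```c
#include <assert.h>
#include <stdbool.h>

enum policy { POLICY_WHITELIST, POLICY_BLACKLIST };

/* Whether a given retry site has been explicitly annotated. */
enum site_mark { SITE_UNMARKED, SITE_MARKED };

/* Does a retry from this site stay under the per-VMA lock? */
bool retries_under_vma_lock(enum policy p, enum site_mark m)
{
	assert(p == POLICY_WHITELIST || p == POLICY_BLACKLIST);

	if (p == POLICY_WHITELIST)
		return m == SITE_MARKED;	/* marked means allowed */
	return m != SITE_MARKED;		/* marked means forbidden */
}
```

The two policies differ only for unmarked sites. That is why filemap.c
needs no change under the blacklist: its retry sites stay unmarked and
get the per-VMA retry by default.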

I'm not sure if I've explained things clearly. Please
let me know if anything is still unclear or insufficient.


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-22 13:48                                     ` Barry Song
@ 2026-05-22 15:42                                       ` Lorenzo Stoakes
  0 siblings, 0 replies; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-22 15:42 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, akpm, bhe, chentao, chrisl, david, jack, kasong,
	kunwu.chan, liam, lianux.mm, linux-arm-kernel, linux-kernel,
	linux-mm, linux-riscv, linux-s390, linuxppc-dev, liyangouwen1,
	loongarch, mhocko, nphamcs, nzzhao, pfalcato, rppt, shikemeng,
	surenb, vbabka, wanglian, youngjun.park

On Fri, May 22, 2026 at 09:48:35PM +0800, Barry Song wrote:
> On Fri, May 22, 2026 at 9:36 PM Barry Song <baohua@kernel.org> wrote:
> >
> > On Fri, May 22, 2026 at 9:09 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, May 22, 2026 at 10:33:05AM +0800, Barry Song (Xiaomi) wrote:
> > > > need to touch `filemap.c` at all (probably because you are already
> > > > maintaining `filemap.c` perfectly):
> > >
> > > I'm going to give you one chance to apologise for that.
> >
> > Apologies if my wording caused any misunderstanding.
> > That was not my intention at all.
> >
> > What I meant is that filemap.c already has a very
> > solid design.
> >
> > For memory.c, I had to touch several places for the
> > blacklist; otherwise, the kernel would hang.
> >
> > But for filemap.c, I basically didn't need to touch
> > anything, and preliminary testing shows no issues after
> > moving it from the whitelist to the blacklist. This is
>
> Sorry, I feel I may be causing some misunderstanding
> again.
>
> By "whitelist", I mean I used to allow certain cases
> to use per-vma retry.
>
> By "blacklist", I mean I am now moving to disallow
> certain cases from using per-vma retry.
>
> Right now, I have to add several cases in memory.c
> to the blacklist; otherwise, the kernel would hang.
>
> But it seems that everything in filemap.c is fine so
> far based on testing.
>
> I'm not sure if I've explained things clearly. Please
> let me know if anything is still unclear or insufficient.

Barry - this thread is completely out of hand and getting _rapidly_
unproductive.

It's certainly about as clear as mud where we stand right now, so here's my
suggestion - let's just stop adding to the noise here :) and instead, you
take the approach suggested by Suren at LSF and send that as an _RFC_
series.

That way we can look at that and hopefully actually circle in on a solution
rather than have endless sub threads and sub discussions :) It's far too
sunny out in the UK right now for that ;)

Thanks, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-18 19:56                 ` Suren Baghdasaryan
  2026-05-18 21:14                   ` Barry Song
@ 2026-05-19 12:53                   ` Lorenzo Stoakes
  2026-05-19 21:18                     ` Barry Song
                                       ` (2 more replies)
  1 sibling, 3 replies; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-19 12:53 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Barry Song, Matthew Wilcox, akpm, linux-mm, david, liam, vbabka,
	rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:

> >
> > I think we either need to fix `fork()`, or keep the current
> > behavior of dropping the VMA lock before performing I/O.
>
> I see. So, this problem arises from the fact that we are changing the
> pagefaults requiring I/O operation to hold VMA lock...
> And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> anonymous and COW VMAs only while holding mmap_write_lock, preventing
> any VMA modification. On the surface, that looks ok to me but I might
> be missing some corner cases. If nobody sees any obvious issues, I
> think it's worth a try.

Not sure if you noticed but I did raise concerns ;)

I wonder if you've confused the fault path and fork here, as I think Barry has
been a little unclear on that.

What's being suggested in this thread is to fundamentally change fork behaviour
so it's different from the entire history of the kernel (or - presumably - at
least recent history :) and permit concurrent page faults to occur on a forking
process.

I absolutely object to this for being pretty crazy. I mean I'm not sure we
really want to be simultaneously modifying page tables while invoking
copy_page_range()? No?

OK you cover anon and MAP_PRIVATE file-backed but hang on there's
VM_COPY_ON_FORK too.. so PFN mapped, mixed map and (the accursed) UFFD W/P as
well as possibly-guard region containing VMAs now can have page tables raced.

That's not to mention anything else that relies on serialisation here (this
would be changing how forking has been done in general) that we may or may not
know about.

The risk level is high, for what amounts to a hack to work around the fault
issue.

I suggest that if we have a problem with the fault path, let's look at the fault
path :)

So yeah I'm very opposed to this unless I'm somehow horribly mistaken here or a
very convincing argument is made.

>
>
>
>
> >
> > >
> > > I'd also like to get Suren's input, however.
> >
> > Yes. of course.
> >
> > >
> > > Thanks, Lorenzo
> >
> > Best Regards
> > Barry

Cheers, Lorenzo


* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-19 12:53                   ` Lorenzo Stoakes
@ 2026-05-19 21:18                     ` Barry Song
  2026-05-20  7:50                       ` Lorenzo Stoakes
  2026-05-20  5:51                     ` Suren Baghdasaryan
  2026-05-20 10:33                     ` David Hildenbrand (Arm)
  2 siblings, 1 reply; 80+ messages in thread
From: Barry Song @ 2026-05-19 21:18 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Suren Baghdasaryan, Matthew Wilcox, akpm, linux-mm, david, liam,
	vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Tue, May 19, 2026 at 8:53 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
>
> > >
> > > I think we either need to fix `fork()`, or keep the current
> > > behavior of dropping the VMA lock before performing I/O.
> >
> > I see. So, this problem arises from the fact that we are changing the
> > pagefaults requiring I/O operation to hold VMA lock...
> > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > any VMA modification. On the surface, that looks ok to me but I might
> > be missing some corner cases. If nobody sees any obvious issues, I
> > think it's worth a try.
>
> Not sure if you noticed but I did raise concerns ;)
>
> I wonder if you've confused the fault path and fork here, as I think Barry has
> been a little unclear on that.

I think I’ve been absolutely clear :-)
We should either stick to the current behavior - drop
the VMA lock before doing I/O, or change fork() so that it
does not wait on vma_start_write().

Before per-VMA locks, page faults dropped mmap_lock before
doing I/O. After per-VMA locks, page faults dropped the
VMA lock before doing I/O. In both cases, fork() would not
wait for I/O in the page-fault path.

Now you guys are suggesting performing I/O while holding
the VMA lock, which means fork() must wait for that I/O to
complete. Since an application can have more than 1000
VMAs, and I/O can be stalled for an unpredictable amount
of time in the bio/request queue or filesystem GC, fork()
could end up blocked on multiple VMAs while taking
vma_start_write() for each of them.

As a result, fork() could hold mmap_lock for a very, very,
very long time. fork() itself would become extremely slow,
and any other task needing mmap_lock would also be blocked
behind it.
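
To put rough numbers on this concern, here is a small userspace model of fork() walking the VMAs and taking each write lock in turn. It is only a sketch under stated assumptions: the function name, the millisecond timings, and the idea that each VMA has at most one in-flight fault holding its lock until a known time are all hypothetical, and no new faults arrive during the walk.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model, not kernel code.
 *
 * io_done_at[i] is the time (ms) at which a fault holding VMA i's lock
 * would finish its I/O and release the lock; 0 means no fault in flight.
 * per_vma_cost_ms is the assumed cost of locking one uncontended VMA.
 *
 * Returns the time at which the lock walk completes. mmap_lock is
 * assumed to be held for write during the whole walk.
 */
unsigned long fork_lock_walk_finish_ms(const unsigned long *io_done_at,
				       size_t nr_vmas,
				       unsigned long per_vma_cost_ms)
{
	unsigned long now = 0;

	assert(io_done_at != NULL || nr_vmas == 0);

	for (size_t i = 0; i < nr_vmas; i++) {
		/* Blocked until the faulting reader releases this VMA. */
		if (io_done_at[i] > now)
			now = io_done_at[i];
		now += per_vma_cost_ms;
	}
	return now;
}
```

With four VMAs, a 1 ms per-VMA cost, and one stalled I/O that completes at 500 ms, the walk in this model finishes at 503 ms instead of 4 ms, and every other mmap_lock user waits for the same period.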

>
> What's being suggested in this thread is to fundamentally change fork behaviour
> so it's different from the entire history of the kernel (or - presumably - at
> least recent history :) and permit concurrent page faults to occur on a forking
> process.
>
> I absolutely object to this for being pretty crazy. I mean I'm not sure we
> really want to be simultaneously modifying page tables while invoking
> copy_page_range()? No?

If you object to touching fork(), can you at least accept
keeping the existing behavior of dropping the VMA lock
before doing I/O? If you object to both approaches, then I
really do not know how we can continue :-)

Thanks
Barry


* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-19 21:18                     ` Barry Song
@ 2026-05-20  7:50                       ` Lorenzo Stoakes
  2026-05-20  9:07                         ` Barry Song
  0 siblings, 1 reply; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-20  7:50 UTC (permalink / raw)
  To: Barry Song
  Cc: Suren Baghdasaryan, Matthew Wilcox, akpm, linux-mm, david, liam,
	vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Wed, May 20, 2026 at 05:18:52AM +0800, Barry Song wrote:
> On Tue, May 19, 2026 at 8:53 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
> >
> > > >
> > > > I think we either need to fix `fork()`, or keep the current
> > > > behavior of dropping the VMA lock before performing I/O.
> > >
> > > I see. So, this problem arises from the fact that we are changing the
> > > pagefaults requiring I/O operation to hold VMA lock...
> > > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > > any VMA modification. On the surface, that looks ok to me but I might
> > > be missing some corner cases. If nobody sees any obvious issues, I
> > > think it's worth a try.
> >
> > Not sure if you noticed but I did raise concerns ;)
> >
> > I wonder if you've confused the fault path and fork here, as I think Barry has
> > been a little unclear on that.
>
> I think I’ve been absolutely clear :-)

On this point sure, I would argue less so around the fork stuff but I responded
on that specifically elsewhere so let's keep things moving :>)

> We should either stick to the current behavior - drop
> the VMA lock before doing I/O, or change fork() so that it
> does not wait on vma_start_write().

Again, as I said elsewhere, I think there might be a 3rd way possibly. It's a
big mistake to assume that there are only specific solutions to problems in the
kernel then to present a false dichotomy.

We absolutely hear you on this being a problem and it WILL be addressed one way
or another.

Of the two approaches, as I said elsewhere, I prefer what you've done in this
series to anything touching fork.

But give me time to look through the series please (I'd also suggest RFC'ing
when it's something kinda fundamental that might generate conversation, makes
life a bit easier on the review side :)

>
> Before per-VMA locks, page faults dropped mmap_lock before
> doing I/O. After per-VMA locks, page faults dropped the
> VMA lock before doing I/O. In both cases, fork() would not
> wait for I/O in the page-fault path.
>
> Now you guys are suggesting performing I/O while holding
> the VMA lock, which means fork() must wait for that I/O to
> complete. Since an application can have more than 1000
> VMAs, and I/O can be stalled for an unpredictable amount
> of time in the bio/request queue or filesystem GC, fork()
> could end up blocked on multiple VMAs while taking
> vma_start_write() for each of them.
>
> As a result, fork() could hold mmap_lock for a very, very,
> very long time. fork() itself would become extremely slow,
> and any other task needing mmap_lock would also be blocked
> behind it.

Yep aware, we spoke in Zagreb about this, and on this thread, we know :)

>
> >
> > What's being suggested in this thread is to fundamentally change fork behaviour
> > so it's different from the entire history of the kernel (or - presumably - at
> > least recent history :) and permit concurrent page faults to occur on a forking
> > process.
> >
> > I absolutely object to this for being pretty crazy. I mean I'm not sure we
> > really want to be simultaneously modifying page tables while invoking
> > copy_page_range()? No?
>
> If you object to touching fork(), can you at least accept
> keeping the existing behavior of dropping the VMA lock
> before doing I/O? If you object to both approaches, then I
> really do not know how we can continue :-)

Again as per above, let's not impose a false dichotomy, let's take our time, and
specifically - please give me time to read through the series and think about
this.

>
> Thanks
> Barry

Thanks, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-20  7:50                       ` Lorenzo Stoakes
@ 2026-05-20  9:07                         ` Barry Song
  2026-05-20 10:07                           ` Lorenzo Stoakes
  2026-05-20 16:20                           ` Suren Baghdasaryan
  0 siblings, 2 replies; 80+ messages in thread
From: Barry Song @ 2026-05-20  9:07 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Suren Baghdasaryan, Matthew Wilcox, akpm, linux-mm, david, liam,
	vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Wed, May 20, 2026 at 3:50 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Wed, May 20, 2026 at 05:18:52AM +0800, Barry Song wrote:
> > On Tue, May 19, 2026 at 8:53 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > >
> > > On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
> > >
> > > > >
> > > > > I think we either need to fix `fork()`, or keep the current
> > > > > behavior of dropping the VMA lock before performing I/O.
> > > >
> > > > I see. So, this problem arises from the fact that we are changing the
> > > > pagefaults requiring I/O operation to hold VMA lock...
> > > > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > > > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > > > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > > > any VMA modification. On the surface, that looks ok to me but I might
> > > > be missing some corner cases. If nobody sees any obvious issues, I
> > > > think it's worth a try.
> > >
> > > Not sure if you noticed but I did raise concerns ;)
> > >
> > > I wonder if you've confused the fault path and fork here, as I think Barry has
> > > been a little unclear on that.
> >
> > I think I’ve been absolutely clear :-)
>
> On this point sure, I would argue less so around the fork stuff but I responded
> on that specifically elsewhere so let's keep things moving :>)
>
> > We should either stick to the current behavior - drop
> > the VMA lock before doing I/O, or change fork() so that it
> > does not wait on vma_start_write().
>
> Again, as I said elsewhere, I think there might be a 3rd way possibly. It's a
> big mistake to assume that there are only specific solutions to problems in the
> kernel then to present a false dichotomy.

I recalled that when we discussed this part in my slides:

‘For simplicity, rather than using a whitelist mechanism for
per-VMA retry, we could use a blacklist instead: default to
always retry via the VMA lock, and only allow mmap_lock-based
page-fault retry for specific cases such as
__vmf_anon_prepare().’

Suren mentioned introducing a FALLBACK flag. With the
FALLBACK flag, we would retry via mmap_lock; with the RETRY
flag, we would retry via the VMA lock.

Not sure whether this could really be called a ‘third way,’
but it seems more like a shift from a whitelist model to a
blacklist model, without changing the fundamental design, but
it does change where we would need to touch the source code.
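
To make the two models concrete, here is a minimal userspace sketch. All names below (the fault sites, the retry modes, the two tables) are hypothetical and are not kernel symbols; which sites appear in each table is chosen only for illustration, apart from the __vmf_anon_prepare() case mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical fault sites that may need a page-fault retry. */
enum fault_site {
	SITE_FILEMAP_WAIT_IO,	/* waiting for page cache I/O */
	SITE_SWAPIN_WAIT_IO,	/* waiting for swap-in I/O */
	SITE_ANON_PREPARE,	/* stands in for the __vmf_anon_prepare() case */
	SITE_COUNT
};

enum retry_mode {
	RETRY_VIA_VMA_LOCK,	/* "RETRY": retry under the per-VMA lock */
	RETRY_VIA_MMAP_LOCK	/* "FALLBACK": retry under mmap_lock */
};

/* Whitelist model: only the listed sites may retry via the VMA lock. */
static const bool vma_retry_whitelist[SITE_COUNT] = {
	[SITE_FILEMAP_WAIT_IO] = true,
};

/* Blacklist model: only the listed sites must fall back to mmap_lock. */
static const bool vma_retry_blacklist[SITE_COUNT] = {
	[SITE_ANON_PREPARE] = true,
};

enum retry_mode whitelist_policy(enum fault_site site)
{
	assert((unsigned int)site < SITE_COUNT);
	/* Default is mmap_lock; the whitelist opts a site in. */
	return vma_retry_whitelist[site] ? RETRY_VIA_VMA_LOCK
					 : RETRY_VIA_MMAP_LOCK;
}

enum retry_mode blacklist_policy(enum fault_site site)
{
	assert((unsigned int)site < SITE_COUNT);
	/* Default is the VMA lock; the blacklist opts a site out. */
	return vma_retry_blacklist[site] ? RETRY_VIA_MMAP_LOCK
					 : RETRY_VIA_VMA_LOCK;
}
```

The only difference between the two functions is the default for a site that nobody has listed, which is why the change moves where the source has to be touched rather than the design itself.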

>
> We absolutely hear you on this being a problem and it WILL be addressed one way
> or another.

Thanks. This is a bit of light in what has felt like a fairly
dark situation. I really appreciate your thoughtful and
responsible approach.

>
> Of the two approaches, as I said elsewhere, I prefer what you've done in this
> series to anything touching fork.
>
> But give me time to look through the series please (I'd also suggest RFC'ing
> when it's something kinda fundamental that might generate conversation, makes
> life a bit easier on the review side :)

Thanks! Sure, I’m happy to wait and there’s no urgency.

Last year you made quite a significant contribution to the work
when I tried to remove mmap_lock in madvise. I really
appreciated it. Now we’re back to the same lock again, just in
different places.

Best Regards
Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-20  9:07                         ` Barry Song
@ 2026-05-20 10:07                           ` Lorenzo Stoakes
  2026-05-20 16:20                           ` Suren Baghdasaryan
  1 sibling, 0 replies; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-20 10:07 UTC (permalink / raw)
  To: Barry Song
  Cc: Suren Baghdasaryan, Matthew Wilcox, akpm, linux-mm, david, liam,
	vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Wed, May 20, 2026 at 05:07:16PM +0800, Barry Song wrote:
> On Wed, May 20, 2026 at 3:50 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Wed, May 20, 2026 at 05:18:52AM +0800, Barry Song wrote:
> > > On Tue, May 19, 2026 at 8:53 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > > >
> > > > On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
> > > >
> > > > > >
> > > > > > I think we either need to fix `fork()`, or keep the current
> > > > > > behavior of dropping the VMA lock before performing I/O.
> > > > >
> > > > > I see. So, this problem arises from the fact that we are changing the
> > > > > pagefaults requiring I/O operation to hold VMA lock...
> > > > > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > > > > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > > > > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > > > > any VMA modification. On the surface, that looks ok to me but I might
> > > > > be missing some corner cases. If nobody sees any obvious issues, I
> > > > > think it's worth a try.
> > > >
> > > > Not sure if you noticed but I did raise concerns ;)
> > > >
> > > > I wonder if you've confused the fault path and fork here, as I think Barry has
> > > > been a little unclear on that.
> > >
> > > I think I’ve been absolutely clear :-)
> >
> > On this point sure, I would argue less so around the fork stuff but I responded
> > on that specifically elsewhere so let's keep things moving :>)
> >
> > > We should either stick to the current behavior - drop
> > > the VMA lock before doing I/O, or change fork() so that it
> > > does not wait on vma_start_write().
> >
> > Again, as I said elsewhere, I think there might be a 3rd way possibly. It's a
> > big mistake to assume that there are only specific solutions to problems in the
> > kernel then to present a false dichotomy.
>
> I recalled that when we discussed this part in my slides:
>
> ‘For simplicity, rather than using a whitelist mechanism for
> per-VMA retry, we could use a blacklist instead: default to
> always retry via the VMA lock, and only allow mmap_lock-based
> page-fault retry for specific cases such as
> __vmf_anon_prepare().’

Yeah that's an interesting approach actually, sorry if I missed that.

>
> Suren mentioned introducing a FALLBACK flag. With the
> FALLBACK flag, we would retry via mmap_lock; with the RETRY
> flag, we would retry via the VMA lock.

Yeah, and honestly I'm beginning to wonder if we don't just have to pay the
complexity tax anyway and eat the fact we have to deal with that.

But as per Josef's comment re: this whole mechanism, simply not waiting for
file-backed I think is another option (but I don't recall where we left that
conversation actually?)

Anyway I want to make sure any complexity we add is necessary so will take a
look through patches and have a think (and obviously others will have their own
opinions!)

>
> Not sure whether this could really be called a ‘third way,’
> but it seems more like a shift from a whitelist model to a
> blacklist model, without changing the fundamental design, but
> it does change where we would need to touch the source code.

Right yeah, good to have more options.

>
> >
> > We absolutely hear you on this being a problem and it WILL be addressed one way
> > or another.
>
> Thanks. This is a bit of light in what has felt like a fairly
> dark situation. I really appreciate your thoughtful and
> responsible approach.

Yes, sorry, I maybe was a bit too harsh in my tone here, I didn't really intend
to be negative as to addressing the problem as a whole.

More so I've been concerned about the fork approach, and that is what's led to me
being shall we say 'emphatic' about it :)

But of course I sometimes make mistakes in quite how my tone comes across, so
apologies if it came across overly negatively - I am negative (on a technical
level) about the fork approach, but not the fact we should address this.

To be clear - I'm very glad you've brought this up, it's important, as much as
it's painful that we have this issue in the first place! :)

>
> >
> > Of the two approaches, as I said elsewhere, I prefer what you've done in this
> > series to anything touching fork.
> >
> > But give me time to look through the series please (I'd also suggest RFC'ing
> > when it's something kinda fundamental that might generate conversation, makes
> > life a bit easier on the review side :)
>
> Thanks! Sure, I’m happy to wait and there’s no urgency.
>
> Last year you made quite a significant contribution to the work
> when I tried to remove mmap_lock in madvise. I really
> appreciated it. Now we’re back to the same lock again, just in
> different places.

Yeah :) one day maybe we can get rid of it altogether (maybe I'm dreaming :)

>
> Best Regards
> Barry

Cheers, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-20  9:07                         ` Barry Song
  2026-05-20 10:07                           ` Lorenzo Stoakes
@ 2026-05-20 16:20                           ` Suren Baghdasaryan
  1 sibling, 0 replies; 80+ messages in thread
From: Suren Baghdasaryan @ 2026-05-20 16:20 UTC (permalink / raw)
  To: Barry Song
  Cc: Lorenzo Stoakes, Matthew Wilcox, akpm, linux-mm, david, liam,
	vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Wed, May 20, 2026 at 2:07 AM Barry Song <baohua@kernel.org> wrote:
>
> On Wed, May 20, 2026 at 3:50 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Wed, May 20, 2026 at 05:18:52AM +0800, Barry Song wrote:
> > > On Tue, May 19, 2026 at 8:53 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > > >
> > > > On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
> > > >
> > > > > >
> > > > > > I think we either need to fix `fork()`, or keep the current
> > > > > > behavior of dropping the VMA lock before performing I/O.
> > > > >
> > > > > I see. So, this problem arises from the fact that we are changing the
> > > > > pagefaults requiring I/O operation to hold VMA lock...
> > > > > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > > > > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > > > > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > > > > any VMA modification. On the surface, that looks ok to me but I might
> > > > > be missing some corner cases. If nobody sees any obvious issues, I
> > > > > think it's worth a try.
> > > >
> > > > Not sure if you noticed but I did raise concerns ;)
> > > >
> > > > I wonder if you've confused the fault path and fork here, as I think Barry has
> > > > been a little unclear on that.
> > >
> > > I think I’ve been absolutely clear :-)
> >
> > On this point sure, I would argue less so around the fork stuff but I responded
> > on that specifically elsewhere so let's keep things moving :>)
> >
> > > We should either stick to the current behavior - drop
> > > the VMA lock before doing I/O, or change fork() so that it
> > > does not wait on vma_start_write().
> >
> > Again, as I said elsewhere, I think there might be a 3rd way possibly. It's a
> > big mistake to assume that there are only specific solutions to problems in the
> > kernel then to present a false dichotomy.
>
> I recalled that when we discussed this part in my slides:
>
> ‘For simplicity, rather than using a whitelist mechanism for
> per-VMA retry, we could use a blacklist instead: default to
> always retry via the VMA lock, and only allow mmap_lock-based
> page-fault retry for specific cases such as
> __vmf_anon_prepare().’
>
> Suren mentioned introducing a FALLBACK flag. With the
> FALLBACK flag, we would retry via mmap_lock; with the RETRY
> flag, we would retry via the VMA lock.
>
> Not sure whether this could really be called a ‘third way,’
> but it seems more like a shift from a whitelist model to a
> blacklist model, without changing the fundamental design, but
> it does change where we would need to touch the source code.

I thought the conclusion of the LSFMM discussion was that this is the
direction we would take. Maybe there were followup discussions which I
missed?
This approach still drops the lock before I/O but after I/O completion
it reacquires the same per-VMA lock instead of falling back to
mmap_lock. IMO it's the simplest fix for the issue you brought up.
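
As a rough illustration of that ordering, the following userspace sketch records the lock events for one fault that needs I/O. It is a simplified model, not kernel code: the event names, the trace structure, and the `retry_under_vma_lock` switch are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical events in the life of one fault that needs I/O. */
enum event {
	EV_TAKE_VMA_LOCK,
	EV_DROP_VMA_LOCK,
	EV_WAIT_IO,
	EV_TAKE_MMAP_LOCK,
	EV_DROP_MMAP_LOCK,
	EV_MAP_PAGE
};

#define MAX_EVENTS 8

struct trace {
	enum event ev[MAX_EVENTS];
	size_t n;
};

static void record(struct trace *t, enum event e)
{
	assert(t->n < MAX_EVENTS);
	t->ev[t->n++] = e;
}

/*
 * retry_under_vma_lock == 0 models the existing behaviour (retry under
 * mmap_lock); non-zero models the approach described above. In both
 * cases no lock is held while waiting for I/O.
 */
void model_fault_needing_io(struct trace *t, int retry_under_vma_lock)
{
	t->n = 0;
	record(t, EV_TAKE_VMA_LOCK);	/* first attempt */
	record(t, EV_DROP_VMA_LOCK);	/* dropped before the wait */
	record(t, EV_WAIT_IO);

	if (retry_under_vma_lock) {
		record(t, EV_TAKE_VMA_LOCK);
		record(t, EV_MAP_PAGE);
		record(t, EV_DROP_VMA_LOCK);
	} else {
		record(t, EV_TAKE_MMAP_LOCK);
		record(t, EV_MAP_PAGE);
		record(t, EV_DROP_MMAP_LOCK);
	}
}

/* Returns 1 if the fault ever took mmap_lock, 0 otherwise. */
int trace_uses_mmap_lock(const struct trace *t)
{
	for (size_t i = 0; i < t->n; i++)
		if (t->ev[i] == EV_TAKE_MMAP_LOCK)
			return 1;
	return 0;
}
```

In this model the I/O wait sits between a drop and a take in both variants, so fork() never waits on I/O; the variants differ only in which lock the retry takes.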

>
> >
> > We absolutely hear you on this being a problem and it WILL be addressed one way
> > or another.
>
> Thanks. This is a bit of light in what has felt like a fairly
> dark situation. I really appreciate your thoughtful and
> responsible approach.
>
> >
> > Of the two approaches, as I said elsewhere, I prefer what you've done in this
> > series to anything touching fork.
> >
> > But give me time to look through the series please (I'd also suggest RFC'ing
> > when it's something kinda fundamental that might generate conversation, makes
> > life a bit easier on the review side :)
>
> Thanks! Sure, I’m happy to wait and there’s no urgency.
>
> Last year you made quite a significant contribution to the work
> when I tried to remove mmap_lock in madvise. I really
> appreciated it. Now we’re back to the same lock again, just in
> different places.
>
> Best Regards
> Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-19 12:53                   ` Lorenzo Stoakes
  2026-05-19 21:18                     ` Barry Song
@ 2026-05-20  5:51                     ` Suren Baghdasaryan
  2026-05-22 15:39                       ` Lorenzo Stoakes
  2026-05-20 10:33                     ` David Hildenbrand (Arm)
  2 siblings, 1 reply; 80+ messages in thread
From: Suren Baghdasaryan @ 2026-05-20  5:51 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Barry Song, Matthew Wilcox, akpm, linux-mm, david, liam, vbabka,
	rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Tue, May 19, 2026 at 12:53 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
>
> > >
> > > I think we either need to fix `fork()`, or keep the current
> > > behavior of dropping the VMA lock before performing I/O.
> >
> > I see. So, this problem arises from the fact that we are changing the
> > pagefaults requiring I/O operation to hold VMA lock...
> > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > any VMA modification. On the surface, that looks ok to me but I might
> > be missing some corner cases. If nobody sees any obvious issues, I
> > think it's worth a try.
>
> Not sure if you noticed but I did raise concerns ;)

Sorry, I didn't realize your first comment was a conceptual objection
to this approach of allowing page faults to race with the fork.


>
> I wonder if you've confused the fault path and fork here, as I think Barry has
> been a little unclear on that.
>
> What's being suggested in this thread is to fundamentally change fork behaviour
> so it's different from the entire history of the kernel (or - presumably - at
> least recent history :) and permit concurrent page faults to occur on a forking
> process.
>
> I absolutely object to this for being pretty crazy. I mean I'm not sure we
> really want to be simultaneously modifying page tables while invoking
> copy_page_range()? No?
>
> OK you cover anon and MAP_PRIVATE file-backed but hang on there's
> VM_COPY_ON_FORK too.. so PFN mapped, mixed map and (the accursed) UFFD W/P as
> well as possibly-guard region containing VMAs now can have page tables raced.

Ugh, yeah, I realize now this is a minefield. Resolving all possible
races there would not be trivial and might introduce other performance
issues.

>
> That's not to mention anything else that relies on serialisation here (this
> would be changing how forking has been done in general) that we may or may not
> know about.
>
> The risk level is high, for what amounts to a hack to work around the fault
> issue.
>
> I suggest that if we have a problem with the fault path, let's look at the fault
> path :)
>
> So yeah I'm very opposed to this unless I'm somehow horribly mistaken here or a
> very convincing argument is made.

So, the current approach of dropping locks during I/O still sounds like
the best solution.

>
>
> >
> >
> >
> >
> > >
> > > >
> > > > I'd also like to get Suren's input, however.
> > >
> > > Yes. of course.
> > >
> > > >
> > > > Thanks, Lorenzo
> > >
> > > Best Regards
> > > Barry
>
> Cheers, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-20  5:51                     ` Suren Baghdasaryan
@ 2026-05-22 15:39                       ` Lorenzo Stoakes
  0 siblings, 0 replies; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-22 15:39 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Barry Song, Matthew Wilcox, akpm, linux-mm, david, liam, vbabka,
	rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Wed, May 20, 2026 at 05:51:23AM +0000, Suren Baghdasaryan wrote:
> On Tue, May 19, 2026 at 12:53 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
> >
> > > >
> > > > I think we either need to fix `fork()`, or keep the current
> > > > behavior of dropping the VMA lock before performing I/O.
> > >
> > > I see. So, this problem arises from the fact that we are changing the
> > > pagefaults requiring I/O operation to hold VMA lock...
> > > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > > any VMA modification. On the surface, that looks ok to me but I might
> > > be missing some corner cases. If nobody sees any obvious issues, I
> > > think it's worth a try.
> >
> > Not sure if you noticed but I did raise concerns ;)
>
> Sorry, I didn't realize your first comment was a conceptual objection
> to this approach of allowing page faults to race with the fork.

Ah yeah, it's understandable. I think there have been so many threads in
this conversation that it's easy to get lost :)

>
>
> >
> > I wonder if you've confused the fault path and fork here, as I think Barry has
> > been a little unclear on that.
> >
> > What's being suggested in this thread is to fundamentally change fork behaviour
> > so it's different from the entire history of the kernel (or - presumably - at
> > least recent history :) and permit concurrent page faults to occur on a forking
> > process.
> >
> > I absolutely object to this for being pretty crazy. I mean I'm not sure we
> > really want to be simultaneously modifying page tables while invoking
> > copy_page_range()? No?
> >
> > OK you cover anon and MAP_PRIVATE file-backed but hang on there's
> > VM_COPY_ON_FORK too.. so PFN mapped, mixed map and (the accursed) UFFD W/P as
> > well as possibly-guard region containing VMAs now can have page tables raced.
>
> Ugh, yeah, I realize now this is a minefield. Resolving all possible
> races there would not be trivial and might introduce other performance
> issues.

Yeah, it's dangerous waters :)

>
> >
> > That's not to mention anything else that relies on serialisation here (this
> > would be changing how forking has been done in general) that we may or may not
> > know about.
> >
> > The risk level is high, for what amounts to a hack to work around the fault
> > issue.
> >
> > I suggest that if we have a problem with the fault path, let's look at the fault
> > path :)
> >
> > So yeah I'm very opposed to this unless I'm somehow horribly mistaken here or a
> > very convincing argument is made.
>
> So, current approach of dropping locks during I/O sounds like still
> the best solution.

Yeah, but importantly only _of those proposed_, I think. This doesn't mean
there aren't other potential solutions.

Thanks, Lorenzo

>
> >
> >
> > >
> > >
> > >
> > >
> > > >
> > > > >
> > > > > I'd also like to get Suren's input, however.
> > > >
> > > > Yes. of course.
> > > >
> > > > >
> > > > > Thanks, Lorenzo
> > > >
> > > > Best Regards
> > > > Barry
> >
> > Cheers, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-19 12:53                   ` Lorenzo Stoakes
  2026-05-19 21:18                     ` Barry Song
  2026-05-20  5:51                     ` Suren Baghdasaryan
@ 2026-05-20 10:33                     ` David Hildenbrand (Arm)
  2026-05-20 12:55                       ` Lorenzo Stoakes
  2026-05-20 21:39                       ` Yang Shi
  2 siblings, 2 replies; 80+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-20 10:33 UTC (permalink / raw)
  To: Lorenzo Stoakes, Suren Baghdasaryan
  Cc: Barry Song, Matthew Wilcox, akpm, linux-mm, liam, vbabka, rppt,
	mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On 5/19/26 14:53, Lorenzo Stoakes wrote:
> On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
> 
>>>
>>> I think we either need to fix `fork()`, or keep the current
>>> behavior of dropping the VMA lock before performing I/O.
>>
>> I see. So, this problem arises from the fact that we are changing the
>> pagefaults requiring I/O operation to hold VMA lock...
>> And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
>> is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
>> anonymous and COW VMAs only while holding mmap_write_lock, preventing
>> any VMA modification. On the surface, that looks ok to me but I might
>> be missing some corner cases. If nobody sees any obvious issues, I
>> think it's worth a try.
> 
> Not sure if you noticed but I did raise concerns ;)
> 
> I wonder if you've confused the fault path and fork here, as I think Barry has
> been a little unclear on that.
> 
> What's being suggested in this thread is to fundamentally change fork behaviour
> so it's different from the entire history of the kernel (or - presumably - at
> least recent history :)

I don't want fork() to become different in that regard.

There is already a slight difference with vs. without per-VMA locks, because
there is a window in-between us taking the write mmap_lock and all the per-VMA
locks. I raised that previously [1] and assumed that it is probably fine.

I also raised in the past why I think we must not allow concurrent page faults,
at least as soon as anonymous memory is involved [2].

... and I raised that this is pretty much slower by design right now: "Well, the
design decision that CONFIG_PER_VMA_LOCK made for now to make page faults fast
and to make blocking any page faults from happening to be slower ..." [3]

[1] https://lore.kernel.org/all/970295ab-e85d-7af3-76e6-df53a5c52f8b@redhat.com/
[2] https://lore.kernel.org/all/7e3f35cc-59b9-bf12-b8b1-4ed78223844a@redhat.com/
[3] https://lore.kernel.org/all/2efa2c89-3765-721d-2c3c-00590054aa5b@redhat.com/

-- 
Cheers,

David



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-20 10:33                     ` David Hildenbrand (Arm)
@ 2026-05-20 12:55                       ` Lorenzo Stoakes
  2026-05-20 21:39                       ` Yang Shi
  1 sibling, 0 replies; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-20 12:55 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Suren Baghdasaryan, Barry Song, Matthew Wilcox, akpm, linux-mm,
	liam, vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Wed, May 20, 2026 at 12:33:56PM +0200, David Hildenbrand (Arm) wrote:
> On 5/19/26 14:53, Lorenzo Stoakes wrote:
> > On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
> >
> >>>
> >>> I think we either need to fix `fork()`, or keep the current
> >>> behavior of dropping the VMA lock before performing I/O.
> >>
> >> I see. So, this problem arises from the fact that we are changing the
> >> pagefaults requiring I/O operation to hold VMA lock...
> >> And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> >> is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> >> anonymous and COW VMAs only while holding mmap_write_lock, preventing
> >> any VMA modification. On the surface, that looks ok to me but I might
> >> be missing some corner cases. If nobody sees any obvious issues, I
> >> think it's worth a try.
> >
> > Not sure if you noticed but I did raise concerns ;)
> >
> > I wonder if you've confused the fault path and fork here, as I think Barry has
> > been a little unclear on that.
> >
> > What's being suggested in this thread is to fundamentally change fork behaviour
> > so it's different from the entire history of the kernel (or - presumably - at
> > least recent history :)
> I don't want fork() to become different in that regard.
>
> There is already a slight difference with vs. without per-VMA locks, because
> there is a window in-between us taking the write mmap_lock and all the per-VMA
> locks. I raised that previously [1] and assumed that it is probably fine.
>
> I also raised in the past why I think we must not allow concurrent page faults,
> at least as soon as anonymous memory is involved [2].
>
> ... and I raised that this is pretty much slower by design right now: "Well, the
> design decision that CONFIG_PER_VMA_LOCK made for now to make page faults fast
> and to make blocking any page faults from happening to  be slower ..." [3]

Thanks for the background, will read through! :)

But yeah, I think the transition from !vma->anon_vma -> vma->anon_vma being a
bit slow is kinda OK, as most page faults will of course have anon_vma populated.

It'll be interesting with CoW context, because we won't need to take the mmap
read lock there at all :)

>
> [1] https://lore.kernel.org/all/970295ab-e85d-7af3-76e6-df53a5c52f8b@redhat.com/
> [2] https://lore.kernel.org/all/7e3f35cc-59b9-bf12-b8b1-4ed78223844a@redhat.com/
> [3] https://lore.kernel.org/all/2efa2c89-3765-721d-2c3c-00590054aa5b@redhat.com/
>
> --
> Cheers,
>
> David

Cheers, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-20 10:33                     ` David Hildenbrand (Arm)
  2026-05-20 12:55                       ` Lorenzo Stoakes
@ 2026-05-20 21:39                       ` Yang Shi
  2026-05-22 15:37                         ` Lorenzo Stoakes
  1 sibling, 1 reply; 80+ messages in thread
From: Yang Shi @ 2026-05-20 21:39 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lorenzo Stoakes, Suren Baghdasaryan, Barry Song, Matthew Wilcox,
	akpm, linux-mm, liam, vbabka, rppt, mhocko, jack, pfalcato,
	wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
	kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Nanzhe Zhao

On Wed, May 20, 2026 at 3:34 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 5/19/26 14:53, Lorenzo Stoakes wrote:
> > On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
> >
> >>>
> >>> I think we either need to fix `fork()`, or keep the current
> >>> behavior of dropping the VMA lock before performing I/O.
> >>
> >> I see. So, this problem arises from the fact that we are changing the
> >> pagefaults requiring I/O operation to hold VMA lock...
> >> And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> >> is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> >> anonymous and COW VMAs only while holding mmap_write_lock, preventing
> >> any VMA modification. On the surface, that looks ok to me but I might
> >> be missing some corner cases. If nobody sees any obvious issues, I
> >> think it's worth a try.
> >
> > Not sure if you noticed but I did raise concerns ;)
> >
> > I wonder if you've confused the fault path and fork here, as I think Barry has
> > been a little unclear on that.
> >
> > What's being suggested in this thread is to fundamentally change fork behaviour
> > so it's different from the entire history of the kernel (or - presumably - at
> > least recent history :)
> I don't want fork() to become different in that regard.
>
> There is already a slight difference with vs. without per-VMA locks, because
> there is a window in-between us taking the write mmap_lock and all the per-VMA
> locks. I raised that previously [1] and assumed that it is probably fine.
>
> I also raised in the past why I think we must not allow concurrent page faults,
> at least as soon as anonymous memory is involved [2].

Thanks for sharing the context, it is quite helpful to understand the
race conditions. Because Lorenzo also raised the concern about page
fault race, I will reply to all the concerns regarding page fault race
together in this thread.

IIUC, there is already some sort of race with the per-VMA lock. Before
the per-VMA lock, mmap_lock locked everything, so a page fault happened
either before fork or after fork. With the per-VMA lock, a page fault can
happen during fork on other VMAs which have not been locked yet. For
example, if we have 3 VMAs and we lock the first VMA, page faults can
still happen on the other 2 VMAs during fork if they already have an
anon_vma. This is the status quo now, and it does not seem harmful.

The bad race shared by David is caused by racing with copy page. So it
seems like it will be fine as long as we serialize copy page against
page faults, if I'm not missing anything. Since we decide whether to copy
pages or not by checking vma->anon_vma, it seems fine to not take the
VMA lock if vma->anon_vma is NULL. This will not introduce more races
either, because setting up a new anon_vma in page fault or madvise
requires taking mmap_lock, according to the earlier discussions.

Thanks,
Yang

>
> ... and I raised that this is pretty much slower by design right now: "Well, the
> design decision that CONFIG_PER_VMA_LOCK made for now to make page faults fast
> and to make blocking any page faults from happening to  be slower ..." [3]
>
> [1] https://lore.kernel.org/all/970295ab-e85d-7af3-76e6-df53a5c52f8b@redhat.com/
> [2] https://lore.kernel.org/all/7e3f35cc-59b9-bf12-b8b1-4ed78223844a@redhat.com/
> [3] https://lore.kernel.org/all/2efa2c89-3765-721d-2c3c-00590054aa5b@redhat.com/
>
> --
> Cheers,
>
> David
>



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-20 21:39                       ` Yang Shi
@ 2026-05-22 15:37                         ` Lorenzo Stoakes
  0 siblings, 0 replies; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-22 15:37 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Hildenbrand (Arm), Suren Baghdasaryan, Barry Song,
	Matthew Wilcox, akpm, linux-mm, liam, vbabka, rppt, mhocko, jack,
	pfalcato, wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1,
	chrisl, kasong, shikemeng, nphamcs, bhe, youngjun.park,
	linux-arm-kernel, linux-kernel, loongarch, linuxppc-dev,
	linux-riscv, linux-s390, Nanzhe Zhao

On Wed, May 20, 2026 at 02:39:49PM -0700, Yang Shi wrote:
> On Wed, May 20, 2026 at 3:34 AM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
> >
> > On 5/19/26 14:53, Lorenzo Stoakes wrote:
> > > On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
> > >
> > >>>
> > >>> I think we either need to fix `fork()`, or keep the current
> > >>> behavior of dropping the VMA lock before performing I/O.
> > >>
> > >> I see. So, this problem arises from the fact that we are changing the
> > >> pagefaults requiring I/O operation to hold VMA lock...
> > >> And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > >> is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > >> anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > >> any VMA modification. On the surface, that looks ok to me but I might
> > >> be missing some corner cases. If nobody sees any obvious issues, I
> > >> think it's worth a try.
> > >
> > > Not sure if you noticed but I did raise concerns ;)
> > >
> > > I wonder if you've confused the fault path and fork here, as I think Barry has
> > > been a little unclear on that.
> > >
> > > What's being suggested in this thread is to fundamentally change fork behaviour
> > > so it's different from the entire history of the kernel (or - presumably - at
> > > least recent history :)
> > I don't want fork() to become different in that regard.
> >
> > There is already a slight difference with vs. without per-VMA locks, because
> > there is a window in-between us taking the write mmap_lock and all the per-VMA
> > locks. I raised that previously [1] and assumed that it is probably fine.
> >
> > I also raised in the past why I think we must not allow concurrent page faults,
> > at least as soon as anonymous memory is involved [2].
>
> Thanks for sharing the context, it is quite helpful to understand the
> race conditions. Because Lorenzo also raised the concern about page
> fault race, I will reply to all the concerns regarding page fault race
> together in this thread.
>
> IIUC, there is already some sort of race with per vma lock. Before per
> vma lock, mmap_lock did lock everything. So page fault happened either
> before fork or after fork. But page fault can happen on other VMAs
> which have not been lock'ed yet during fork with per vma lock. For
> example, we have 3 VMAs, we lock the first VMA, but page fault still
> can happen on the other 2 VMAs during fork if they already have
> anon_vma. This is the status quo now, but it seems not harmful.
>
> The bad race shared by David is caused by racing with copy page. So it
> seems like it will be fine as long as we serialize copy page against
> page fault if I don't miss anything. Since we decide whether to copy
> page or not by checking vma->anon_vma, so it seems fine to not take
> vma lock if vma->anon_vma is NULL. This will not introduce more race
> either because setting up a new  anon_vma in page fault or madvise
> requires taking mmap_lock according to the earlier discussions.

NAK. No.

We're not doing this, we're not changing how fork fundamentally behaves because
of concerns about the fault path.

I've delineated exactly why I think this is a problem and you're pressing ahead
without addressing those concerns.

So at this point I'm going to be a grumpy maintainer and just say no, stop
please :)

Let's fix this in the right place. You don't fix a leak in the roof by repairing
a shelf next door :)

Thanks, Lorenzo


>
> Thanks,
> Yang
>
> >
> > ... and I raised that this is pretty much slower by design right now: "Well, the
> > design decision that CONFIG_PER_VMA_LOCK made for now to make page faults fast
> > and to make blocking any page faults from happening to  be slower ..." [3]
> >
> > [1] https://lore.kernel.org/all/970295ab-e85d-7af3-76e6-df53a5c52f8b@redhat.com/
> > [2] https://lore.kernel.org/all/7e3f35cc-59b9-bf12-b8b1-4ed78223844a@redhat.com/
> > [3] https://lore.kernel.org/all/2efa2c89-3765-721d-2c3c-00590054aa5b@redhat.com/
> >
> > --
> > Cheers,
> >
> > David
> >



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-18 11:25               ` Barry Song
  2026-05-18 16:17                 ` Matthew Wilcox
  2026-05-18 19:56                 ` Suren Baghdasaryan
@ 2026-05-19 12:43                 ` Lorenzo Stoakes
  2 siblings, 0 replies; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-19 12:43 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, liam, vbabka, rppt,
	mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Mon, May 18, 2026 at 07:25:54PM +0800, Barry Song wrote:
> On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> > > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > for an unpredictable amount of time.
> > > > > >
> > > > > > But does that actually happen?  I find it hard to believe that thread A
> > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > > > it still seems really unlikely to me.
> > > > >
> > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > the entire VMA—just a portion of it is sufficient.
> > > >
> > > > Yes, but that still fails to answer "does this actually happen".  How much
> > > > performance is all this complexity in the page fault handler buying us?
> > > > If you don't answer this question, I'm just going to go in and rip it
> > > > all out.
> > > >
> > >
> > > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > > waiting for answers),
> > >
> > > As promised during LSF/MM/BPF, we conducted thorough
> > > testing on Android phones to determine whether performing
> > > I/O in `filemap_fault()` can block `vma_start_write()`.
> > > I wanted to give a quick update on this question.
> > >
> > > Nanzhe at Xiaomi created tracing scripts and ran various
> > > applications on Android devices with I/O performed under
> > > the VMA lock in `filemap_fault()`. We found that:
> > >
> > > 1. There are very few cases where unmap() is blocked by
> > >    page faults. I assume this is due to buggy user code
> > >    or poor synchronization between reads and unmap().
> > > So I assume it is not a problem.
> > >
> > > 2. We observed many cases where `vma_start_write()`
> > >    is blocked by page-fault I/O in some applications.
> > >    The blocking occurs in the `dup_mmap()` path during
> > >    fork().
> > >
> > > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > > the parent process when forking"), we now always hold
> > > `vma_write_lock()` for each VMA. Note that the
> > > `mmap_lock` write lock is also held, which could lead to
> > > chained waiting if page-fault I/O is performed without
> > > releasing the VMA lock.
> >
> > Hm but did you observe this 'chained waiting'? And what were the latencies?
>
> We have clearly observed that the `fork()` operations of many
> popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
> end up waiting on page-fault (PF) I/O when the VMA lock is
> held during I/O operations. This has already become a
> practical issue. I also believe this can lead to chained
> waiting, since the global `mmap_lock` blocks all threads that
> need to acquire it.

I asked about the chained waiting :) I'm aware you've observed contention on
the write lock; you said so in your LSF talk.

So have you observed chained waiting, or is this a theory?

>
>
> >
> > >
> > > My gut feeling is that Suren's commit may be overshooting,
> > > so my rough idea is that we might want to do something like
> > > the following (we haven't tested it yet and it might be
> > > wrong):
> >
> > Yeah I'm really not sure about that.
> >
> > Prior to the VMA locks, the mmap write lock would have guaranteed no concurrent
> > page faults, which is really what fb49c455323ff is about.
> >
> > So Suren's patch was essentially restoring the _existing_ forking behaviour, and
> > now you're saying 'let's change the forking behaviour that's been like that for
> > forever'.
>
>
> I am afraid not. Before we introduced the per-VMA lock, we
> were not performing I/O while holding `mmap_lock`. A page fault
> that needed I/O would drop the `mmap_lock` read lock and allow
> `fork()` to proceed.

Err, I'm talking about fork? The patch you reference is a change to fork?

So you're saying that fb49c455323ff, which explicitly takes the VMA write lock
on fork, was somehow an addendum after fork didn't take the mmap write lock?

I must be imagining
https://elixir.bootlin.com/linux/v6.0/source/kernel/fork.c#L590 then in v6.0
pre-vma locks :)

I suspect that's _not_ what you're saying, so what you're now suggesting, as I
stated above, is to fundamentally change fork behaviour to account for the
existing per-VMA lock behaviour on the fault path?

Again I state - are you really sure you want to fundamentally change fork
behaviour for this?

I am extremely concerned about doing that.

>
> Now, you are suggesting performing I/O while holding the VMA
> lock, which changes the requirements and introduces this
> problem.
>
> >
> > I think you would _really_ have to be sure that's safe. And forking is a very
> > dangerous time in terms of complexity and sensitivity and 'weird stuff'
> > happening so I'd tread _very_ carefully here.
>
> Yep. I think my original proposal did not require any changes
> to `fork()`, since it simply preserved the current behavior of
> dropping the VMA lock before performing I/O. In that model,
> `fork()` would not end up waiting on I/O at all.
>
> What you are suggesting now appears to be performing I/O while
> holding the VMA lock, which in turn introduces the need to
> change `fork()`.

Again, you're saying we should fundamentally change the way fork has worked
forever to work around something else.

At LSF I raised the fact that Josef himself suggested we simply drop this I/O
waiting behaviour for file-backed mappings. Isn't there a way forward that way
rather than 'hey let's drop locks and hope for the best!'?

I am really reticent about this because we've seen HORRIBLE bugs come from fork
behaviour, especially edge cases, and mm testing isn't great so I am basically
opposed to this, and you're not really convincing me here.

>
> >
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 2311ae7c2ff4..5ddaf297f31a 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> > >         for_each_vma(vmi, mpnt) {
> > >                 struct file *file;
> > >
> > > -               retval = vma_start_write_killable(mpnt);
> > > +               /*
> > > +                * For anonymous or writable private VMAs, prevent
> > > +                * concurrent CoW faults.
> > > +                */
> >
> > To nitpick, I think the comment's confusing, but it also tells you that you
> > don't need a specific anon check - writable private is sufficient. And it's not
> > really just CoW that's the issue, it's anon_vma population _at all_ as well as CoW.
> >
> > > +               if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > > +                                       (mpnt->vm_flags & VM_WRITE)))
> > > +                       retval = vma_start_write_killable(mpnt);
> >
> > I think this has to be VM_MAYWRITE, because somebody could otherwise mprotect()
> > it R/W.
> >
> > I also don't understand why !mpnt->vm_file for a read-only anon mapping (more
> > likely PROT_NONE) is here, just do the second check?
> >
> > (Also please use the new interface, so !vma_test(mpnt, VMA_SHARED_BIT) &&
> > vma_test(mpnt, VMA_MAYWRITE_BIT))
>
> Yep, I can definitely refine the check further. But before
> doing that, I'd first like to confirm that we are aligned on
> the direction.
>
> If you still intend to hold the VMA lock while performing I/O,
> then I think we should fix `fork()` to avoid taking
> `vma_start_write()`.

Yeah, or we could do something different. It isn't a case of you getting to pick
one of the two options you propose - the maintainers decide which way is
appropriate.

Of the two options, dropping the lock on the fault path rather than this fork
insanity is my preference, but I wonder if we can't find another way.

Let me read through the series and give more thoughts I guess.

>
> >
> > >                 if (retval < 0)
> > >                         goto loop_out;
> > >                 if (mpnt->vm_flags & VM_DONTCOPY) {
> > >
> > > Based on the above, we may want to re-check whether fork()
> > > can be blocked by page faults. At the same time, if Suren,
> > > you, or anyone else has any comments, please feel free to
> > > share them.
> > >
> > > Best Regards
> > > Barry
> >
> > Technical commentary above is sort of 'just cos' :) because I really question
> > doing this honestly.
>
> I think we either need to fix `fork()`, or keep the current
> behavior of dropping the VMA lock before performing I/O.

Yup you said :)

>
> >
> > I'd also like to get Suren's input, however.
>
> Yes. of course.
>
> >
> > Thanks, Lorenzo
>
> Best Regards
> Barry

Thanks, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-17  8:45           ` Barry Song
  2026-05-18  9:46             ` Lorenzo Stoakes
@ 2026-05-18  9:53             ` David Hildenbrand (Arm)
  2026-05-19 13:42               ` Lorenzo Stoakes
  2026-05-18 21:21             ` Yang Shi
  2 siblings, 1 reply; 80+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-18  9:53 UTC (permalink / raw)
  To: Barry Song, Matthew Wilcox, surenb
  Cc: akpm, linux-mm, ljs, liam, vbabka, rppt, mhocko, jack, pfalcato,
	wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
	kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Nanzhe Zhao

On 5/17/26 10:45, Barry Song wrote:
> On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
>>>
>>> It doesn’t have to involve unmapping or applying mprotect to
>>> the entire VMA—just a portion of it is sufficient.
>>
>> Yes, but that still fails to answer "does this actually happen".  How much
>> performance is all this complexity in the page fault handler buying us?
>> If you don't answer this question, I'm just going to go in and rip it
>> all out.
>>
> 
> Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> waiting for answers),
> 
> As promised during LSF/MM/BPF, we conducted thorough
> testing on Android phones to determine whether performing
> I/O in `filemap_fault()` can block `vma_start_write()`.
> I wanted to give a quick update on this question.
> 
> Nanzhe at Xiaomi created tracing scripts and ran various
> applications on Android devices with I/O performed under
> the VMA lock in `filemap_fault()`. We found that:
> 
> 1. There are very few cases where unmap() is blocked by
>    page faults. I assume this is due to buggy user code
>    or poor synchronization between reads and unmap().
> So I assume it is not a problem.
> 
> 2. We observed many cases where `vma_start_write()`
>    is blocked by page-fault I/O in some applications.
>    The blocking occurs in the `dup_mmap()` path during
>    fork().
> 
> With Suren's commit fb49c455323ff ("fork: lock VMAs of
> the parent process when forking"), we now always hold
> `vma_write_lock()` for each VMA. Note that the
> `mmap_lock` write lock is also held, which could lead to
> chained waiting if page-fault I/O is performed without
> releasing the VMA lock.
> 
> My gut feeling is that Suren's commit may be overshooting,
> so my rough idea is that we might want to do something like
> the following (we haven't tested it yet and it might be
> wrong):
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2311ae7c2ff4..5ddaf297f31a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>         for_each_vma(vmi, mpnt) {
>                 struct file *file;
> 
> -               retval = vma_start_write_killable(mpnt);
> +               /*
> +                * For anonymous or writable private VMAs, prevent
> +                * concurrent CoW faults.
> +                */
> +               if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> +                                       (mpnt->vm_flags & VM_WRITE)))
> +                       retval = vma_start_write_killable(mpnt);

Likely is_cow_mapping() is what you would want to check to handle VMAs that
could have anonymous pages in them.
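
A rough standalone illustration, not kernel code: the sketch below puts
the condition from the diff next to an is_cow_mapping()-style test. The
flag values and the one-line test are meant to mirror include/linux/mm.h
but should be treated as assumptions here; patch_would_lock() and its
has_file parameter are hypothetical names made up for this example.

```c
#include <stdbool.h>

/* Illustrative flag values for this sketch; treat them as assumptions. */
#define VM_WRITE    0x00000002UL
#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL

/*
 * A private mapping that is, or may later become, writable. Such a VMA
 * can end up holding anonymous (CoW) pages.
 */
static inline bool is_cow_mapping(unsigned long flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/*
 * Hypothetical helper: the condition used in the diff above.
 * has_file stands in for a non-NULL vm_file.
 */
static inline bool patch_would_lock(bool has_file, unsigned long flags)
{
	return !has_file || (!(flags & VM_SHARED) && (flags & VM_WRITE));
}
```

For a file-backed private mapping that is currently read-only but still
has VM_MAYWRITE set, is_cow_mapping() is true while the condition from
the diff is false; such a mapping could later be made writable with
mprotect().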

-- 
Cheers,

David



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-18  9:53             ` David Hildenbrand (Arm)
@ 2026-05-19 13:42               ` Lorenzo Stoakes
  0 siblings, 0 replies; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-19 13:42 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Barry Song, Matthew Wilcox, surenb, akpm, linux-mm, liam, vbabka,
	rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Mon, May 18, 2026 at 11:53:37AM +0200, David Hildenbrand (Arm) wrote:
> On 5/17/26 10:45, Barry Song wrote:
> > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> >>
> >> On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> >>>
> >>> It doesn’t have to involve unmapping or applying mprotect to
> >>> the entire VMA—just a portion of it is sufficient.
> >>
> >> Yes, but that still fails to answer "does this actually happen".  How much
> >> performance is all this complexity in the page fault handler buying us?
> >> If you don't answer this question, I'm just going to go in and rip it
> >> all out.
> >>
> >
> > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > waiting for answers),
> >
> > As promised during LSF/MM/BPF, we conducted thorough
> > testing on Android phones to determine whether performing
> > I/O in `filemap_fault()` can block `vma_start_write()`.
> > I wanted to give a quick update on this question.
> >
> > Nanzhe at Xiaomi created tracing scripts and ran various
> > applications on Android devices with I/O performed under
> > the VMA lock in `filemap_fault()`. We found that:
> >
> > 1. There are very few cases where unmap() is blocked by
> >    page faults. I assume this is due to buggy user code
> >    or poor synchronization between reads and unmap().
> > So I assume it is not a problem.
> >
> > 2. We observed many cases where `vma_start_write()`
> >    is blocked by page-fault I/O in some applications.
> >    The blocking occurs in the `dup_mmap()` path during
> >    fork().
> >
> > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > the parent process when forking"), we now always hold
> > `vma_write_lock()` for each VMA. Note that the
> > `mmap_lock` write lock is also held, which could lead to
> > chained waiting if page-fault I/O is performed without
> > releasing the VMA lock.
> >
> > My gut feeling is that Suren's commit may be overshooting,
> > so my rough idea is that we might want to do something like
> > the following (we haven't tested it yet and it might be
> > wrong):
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 2311ae7c2ff4..5ddaf297f31a 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > *mm, struct mm_struct *oldmm)
> >         for_each_vma(vmi, mpnt) {
> >                 struct file *file;
> >
> > -               retval = vma_start_write_killable(mpnt);
> > +               /*
> > +                * For anonymous or writable private VMAs, prevent
> > +                * concurrent CoW faults.
> > +                */
> > +               if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > +                                       (mpnt->vm_flags & VM_WRITE)))
> > +                       retval = vma_start_write_killable(mpnt);
>
> Likely is_cow_mapping() is what you would want to check to handle VMAs that
> could have anonymous pages in them.

Yes :) I made pretty much the same comment though I forgot the correct helper :P

>
> --
> Cheers,
>
> David

Cheers, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-17  8:45           ` Barry Song
  2026-05-18  9:46             ` Lorenzo Stoakes
  2026-05-18  9:53             ` David Hildenbrand (Arm)
@ 2026-05-18 21:21             ` Yang Shi
  2026-05-19 11:07               ` Barry Song
  2026-05-19 13:12               ` Lorenzo Stoakes
  2 siblings, 2 replies; 80+ messages in thread
From: Yang Shi @ 2026-05-18 21:21 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, ljs, liam, vbabka,
	rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Sun, May 17, 2026 at 1:45 AM Barry Song <baohua@kernel.org> wrote:
>
> On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > both the hardware and the software stack (bio/request queues and the
> > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > for an unpredictable amount of time.
> > > >
> > > > But does that actually happen?  I find it hard to believe that thread A
> > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > it still seems really unlikely to me.
> > >
> > > It doesn’t have to involve unmapping or applying mprotect to
> > > the entire VMA—just a portion of it is sufficient.
> >
> > Yes, but that still fails to answer "does this actually happen".  How much
> > performance is all this complexity in the page fault handler buying us?
> > If you don't answer this question, I'm just going to go in and rip it
> > all out.
> >
>
> Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> waiting for answers),
>
> As promised during LSF/MM/BPF, we conducted thorough
> testing on Android phones to determine whether performing
> I/O in `filemap_fault()` can block `vma_start_write()`.
> I wanted to give a quick update on this question.
>
> Nanzhe at Xiaomi created tracing scripts and ran various
> applications on Android devices with I/O performed under
> the VMA lock in `filemap_fault()`. We found that:
>
> 1. There are very few cases where unmap() is blocked by
>    page faults. I assume this is due to buggy user code
>    or poor synchronization between reads and unmap().
> So I assume it is not a problem.
>
> 2. We observed many cases where `vma_start_write()`
>    is blocked by page-fault I/O in some applications.
>    The blocking occurs in the `dup_mmap()` path during
>    fork().
>
> With Suren's commit fb49c455323ff ("fork: lock VMAs of
> the parent process when forking"), we now always hold
> `vma_write_lock()` for each VMA. Note that the
> `mmap_lock` write lock is also held, which could lead to
> chained waiting if page-fault I/O is performed without
> releasing the VMA lock.
>
> My gut feeling is that Suren's commit may be overshooting,
> so my rough idea is that we might want to do something like
> the following (we haven't tested it yet and it might be
> wrong):
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2311ae7c2ff4..5ddaf297f31a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> *mm, struct mm_struct *oldmm)
>         for_each_vma(vmi, mpnt) {
>                 struct file *file;
>
> -               retval = vma_start_write_killable(mpnt);
> +               /*
> +                * For anonymous or writable private VMAs, prevent
> +                * concurrent CoW faults.
> +                */
> +               if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> +                                       (mpnt->vm_flags & VM_WRITE)))
> +                       retval = vma_start_write_killable(mpnt);
>                 if (retval < 0)
>                         goto loop_out;
>                 if (mpnt->vm_flags & VM_DONTCOPY) {

Maybe a little bit off topic. This is an interesting idea. It seems
possible we don't have to take vma write lock unconditionally. IIUC
the write lock is mainly used to serialize against page fault and
madvise, right? I got a crazy idea off the top of my head. We may be
able to just take vma write lock iff vma->anon_vma is not NULL.

First of all, the mmap_lock is held for write, so the vma can't go away or
be changed under us.

Secondly, if vma->anon_vma is NULL, it basically means either no page
fault happened or no cow happened, so there is no page table to copy.
This is also what copy_page_range() does currently. So we can shrink
the critical section to:

if (vma->anon_vma) {
    vma_start_write_killable(src_vma);
    anon_vma_fork(dst_vma, src_vma);
    copy_page_range(dst_vma, src_vma);
}

But a page fault can happen before the write mmap_lock is taken, so when
we check vma->anon_vma it may not have been set up yet. That seems to be
equivalent to a page fault after fork, though, and won't break the
semantics.

Anyway, just a crazy idea, I may miss some corner cases.

Thanks,
Yang


>
> Based on the above, we may want to re-check whether fork()
> can be blocked by page faults. At the same time, if Suren,
> you, or anyone else has any comments, please feel free to
> share them.
>
> Best Regards
> Barry
>



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-18 21:21             ` Yang Shi
@ 2026-05-19 11:07               ` Barry Song
  2026-05-19 13:34                 ` Lorenzo Stoakes
  2026-05-19 18:50                 ` Yang Shi
  2026-05-19 13:12               ` Lorenzo Stoakes
  1 sibling, 2 replies; 80+ messages in thread
From: Barry Song @ 2026-05-19 11:07 UTC (permalink / raw)
  To: Yang Shi
  Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, ljs, liam, vbabka,
	rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Tue, May 19, 2026 at 5:21 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Sun, May 17, 2026 at 1:45 AM Barry Song <baohua@kernel.org> wrote:
> >
> > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > for an unpredictable amount of time.
> > > > >
> > > > > But does that actually happen?  I find it hard to believe that thread A
> > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > > it still seems really unlikely to me.
> > > >
> > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > the entire VMA—just a portion of it is sufficient.
> > >
> > > Yes, but that still fails to answer "does this actually happen".  How much
> > > performance is all this complexity in the page fault handler buying us?
> > > If you don't answer this question, I'm just going to go in and rip it
> > > all out.
> > >
> >
> > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > waiting for answers),
> >
> > As promised during LSF/MM/BPF, we conducted thorough
> > testing on Android phones to determine whether performing
> > I/O in `filemap_fault()` can block `vma_start_write()`.
> > I wanted to give a quick update on this question.
> >
> > Nanzhe at Xiaomi created tracing scripts and ran various
> > applications on Android devices with I/O performed under
> > the VMA lock in `filemap_fault()`. We found that:
> >
> > 1. There are very few cases where unmap() is blocked by
> >    page faults. I assume this is due to buggy user code
> >    or poor synchronization between reads and unmap().
> > So I assume it is not a problem.
> >
> > 2. We observed many cases where `vma_start_write()`
> >    is blocked by page-fault I/O in some applications.
> >    The blocking occurs in the `dup_mmap()` path during
> >    fork().
> >
> > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > the parent process when forking"), we now always hold
> > `vma_write_lock()` for each VMA. Note that the
> > `mmap_lock` write lock is also held, which could lead to
> > chained waiting if page-fault I/O is performed without
> > releasing the VMA lock.
> >
> > My gut feeling is that Suren's commit may be overshooting,
> > so my rough idea is that we might want to do something like
> > the following (we haven't tested it yet and it might be
> > wrong):
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 2311ae7c2ff4..5ddaf297f31a 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > *mm, struct mm_struct *oldmm)
> >         for_each_vma(vmi, mpnt) {
> >                 struct file *file;
> >
> > -               retval = vma_start_write_killable(mpnt);
> > +               /*
> > +                * For anonymous or writable private VMAs, prevent
> > +                * concurrent CoW faults.
> > +                */
> > +               if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > +                                       (mpnt->vm_flags & VM_WRITE)))
> > +                       retval = vma_start_write_killable(mpnt);
> >                 if (retval < 0)
> >                         goto loop_out;
> >                 if (mpnt->vm_flags & VM_DONTCOPY) {
>
> Maybe a little bit off topic. This is an interesting idea. It seems
> possible we don't have to take vma write lock unconditionally. IIUC
> the write lock is mainly used to serialize against page fault and
> madvise, right? I got a crazy idea off the top of my head. We may be
> able to just take vma write lock iff vma->anon_vma is not NULL.
>
> First of all, write mmap_lock is held, so the vma can't go or be
> changed under us.
>
> Secondly, if vma->anon_vma is NULL, it basically means either no page
> fault happened or no cow happened, so there is no page table to copy,
> this is also what copy_page_range() does currently. So we can shrink
> the critical section to:
>
> if (vma->anon_vma) {
>     vma_start_write_killable(src_vma);
>     anon_vma_fork(dst_vma, src_vma);
>     copy_page_range(dst_vma, src_vma);
> }
>
> But page fault can happen before write mmap_lock is taken, when we
> check vma->anon_vma, it is possible it has not been set up yet. But it
> seems to be equivalent to page fault after fork and won't break the
> semantic.

Re-reading Suren's commit log for fb49c455323ff8
("fork: lock VMAs of the parent process when forking"),
it seems that vma_start_write() is used to protect
against a race where anon_vma changes from NULL to
non-NULL during fork. In that scenario, we hold the
mmap_lock write lock, but not vma_start_write(), so a
concurrent anon_vma_prepare() could still install an
anon_vma.

"    A concurrent page fault on a page newly marked read-only by the page
    copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the
    source vma, defeating the anon_vma_clone() that wasn't done because the
    parent vma originally didn't have an anon_vma, but we now might end up
    copying a pte entry for a page that has one.
"

If that is the case, then your change does not work.

Nowadays, nobody calls anon_vma_prepare(vma) directly.
Instead, vmf_anon_prepare() is used, and we always
require the mmap_lock read lock before calling
__anon_vma_prepare(). As a result, anon_vma cannot
transition from NULL to non-NULL during fork.

So the original race condition has effectively
disappeared.

You also mentioned the madvise() case. If I understand
correctly, madvise() should take mmap_lock before
modifying anon_vma. Only some parts of madvise() can
support per-VMA locking. Therefore, we probably do not
need:

if (vma->anon_vma) {
    vma_start_write_killable(src_vma);
    ...
}

>
> Anyway, just a crazy idea, I may miss some corner cases.

To me, it seems that we could remove vma_start_write()
entirely now. Or is that an even crazier idea?

Thanks
Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-19 11:07               ` Barry Song
@ 2026-05-19 13:34                 ` Lorenzo Stoakes
  2026-05-19 18:50                 ` Yang Shi
  1 sibling, 0 replies; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-19 13:34 UTC (permalink / raw)
  To: Barry Song
  Cc: Yang Shi, Matthew Wilcox, surenb, akpm, linux-mm, david, liam,
	vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Tue, May 19, 2026 at 07:07:37PM +0800, Barry Song wrote:
> On Tue, May 19, 2026 at 5:21 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Sun, May 17, 2026 at 1:45 AM Barry Song <baohua@kernel.org> wrote:
> > >
> > > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > for an unpredictable amount of time.
> > > > > >
> > > > > > But does that actually happen?  I find it hard to believe that thread A
> > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > > > it still seems really unlikely to me.
> > > > >
> > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > the entire VMA—just a portion of it is sufficient.
> > > >
> > > > Yes, but that still fails to answer "does this actually happen".  How much
> > > > performance is all this complexity in the page fault handler buying us?
> > > > If you don't answer this question, I'm just going to go in and rip it
> > > > all out.
> > > >
> > >
> > > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > > waiting for answers),
> > >
> > > As promised during LSF/MM/BPF, we conducted thorough
> > > testing on Android phones to determine whether performing
> > > I/O in `filemap_fault()` can block `vma_start_write()`.
> > > I wanted to give a quick update on this question.
> > >
> > > Nanzhe at Xiaomi created tracing scripts and ran various
> > > applications on Android devices with I/O performed under
> > > the VMA lock in `filemap_fault()`. We found that:
> > >
> > > 1. There are very few cases where unmap() is blocked by
> > >    page faults. I assume this is due to buggy user code
> > >    or poor synchronization between reads and unmap().
> > > So I assume it is not a problem.
> > >
> > > 2. We observed many cases where `vma_start_write()`
> > >    is blocked by page-fault I/O in some applications.
> > >    The blocking occurs in the `dup_mmap()` path during
> > >    fork().
> > >
> > > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > > the parent process when forking"), we now always hold
> > > `vma_write_lock()` for each VMA. Note that the
> > > `mmap_lock` write lock is also held, which could lead to
> > > chained waiting if page-fault I/O is performed without
> > > releasing the VMA lock.
> > >
> > > My gut feeling is that Suren's commit may be overshooting,
> > > so my rough idea is that we might want to do something like
> > > the following (we haven't tested it yet and it might be
> > > wrong):
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 2311ae7c2ff4..5ddaf297f31a 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > > *mm, struct mm_struct *oldmm)
> > >         for_each_vma(vmi, mpnt) {
> > >                 struct file *file;
> > >
> > > -               retval = vma_start_write_killable(mpnt);
> > > +               /*
> > > +                * For anonymous or writable private VMAs, prevent
> > > +                * concurrent CoW faults.
> > > +                */
> > > +               if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > > +                                       (mpnt->vm_flags & VM_WRITE)))
> > > +                       retval = vma_start_write_killable(mpnt);
> > >                 if (retval < 0)
> > >                         goto loop_out;
> > >                 if (mpnt->vm_flags & VM_DONTCOPY) {
> >
> > Maybe a little bit off topic. This is an interesting idea. It seems
> > possible we don't have to take vma write lock unconditionally. IIUC
> > the write lock is mainly used to serialize against page fault and
> > madvise, right? I got a crazy idea off the top of my head. We may be
> > able to just take vma write lock iff vma->anon_vma is not NULL.
> >
> > First of all, write mmap_lock is held, so the vma can't go or be
> > changed under us.
> >
> > Secondly, if vma->anon_vma is NULL, it basically means either no page
> > fault happened or no cow happened, so there is no page table to copy,
> > this is also what copy_page_range() does currently. So we can shrink
> > the critical section to:
> >
> > if (vma->anon_vma) {
> >     vma_start_write_killable(src_vma);
> >     anon_vma_fork(dst_vma, src_vma);
> >     copy_page_range(dst_vma, src_vma);
> > }
> >
> > But page fault can happen before write mmap_lock is taken, when we
> > check vma->anon_vma, it is possible it has not been set up yet. But it
> > seems to be equivalent to page fault after fork and won't break the
> > semantic.
>
> Re-reading Suren's commit log for fb49c455323ff8
> ("fork: lock VMAs of the parent process when forking"),
> it seems that vma_start_write() is used to protect
> against a race where anon_vma changes from NULL to
> non-NULL during fork. In that scenario, we hold the
> mmap_lock write lock, but not vma_start_write(), so a
> concurrent anon_vma_prepare() could still install an
> anon_vma.
>
> "    A concurrent page fault on a page newly marked read-only by the page
>     copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the
>     source vma, defeating the anon_vma_clone() that wasn't done because the
>     parent vma originally didn't have an anon_vma, but we now might end up
>     copying a pte entry for a page that has one.
> "
>
> If that is the case, then your change does not work.
>
> Nowadays, nobody calls anon_vma_prepare(vma) directly.

I see callers? Am I imagining them? :)
https://elixir.bootlin.com/linux/v7.0.9/A/ident/anon_vma_prepare

> Instead, vmf_anon_prepare() is used, and we always
> require the mmap_lock read lock before calling
> __anon_vma_prepare(). As a result, anon_vma cannot
> transition from NULL to non-NULL during fork.

Right, yes the mmap read lock is required for that.

>
> So the original race condition has effectively
> disappeared.

Err the page tables? All the other cases which require page table copying?

Concurrent faults mean that copy_page_range() can race with faulting, whether
via vma->anon_vma _or_ any of the multiple cases mentioned elsewhere.

And who knows what else serialises on that.

>
> You also mentioned the madvise() case. If I understand
> correctly, madvise() should take mmap_lock before
> modifying anon_vma. Only some parts of madvise() can
> support per-VMA locking. Therefore, we probably do not
> need:
>
> if (vma->anon_vma) {
>     vma_start_write_killable(src_vma);
>     ...
> }

I like how you hand wave the VMA lock operations in madvise() :)

(Maybe) guard regions being present cause page tables to be copied, they're
installed under VMA (read) lock, and can race now.

And it sets traps for future changes - introducing more horrible edge case race
conditions in fork is just a big nope nope nope.

This isn't an area to play around in.

>
> >
> > Anyway, just a crazy idea, I may miss some corner cases.
>
> To me, it seems that we could remove vma_start_write()
> entirely now. Or is that an even crazier idea?

As above that'd be totally broken. NAK.

>
> Thanks
> Barry

Thanks, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-19 11:07               ` Barry Song
  2026-05-19 13:34                 ` Lorenzo Stoakes
@ 2026-05-19 18:50                 ` Yang Shi
  2026-05-19 20:53                   ` Yang Shi
  1 sibling, 1 reply; 80+ messages in thread
From: Yang Shi @ 2026-05-19 18:50 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, ljs, liam, vbabka,
	rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Tue, May 19, 2026 at 4:07 AM Barry Song <baohua@kernel.org> wrote:
>
> On Tue, May 19, 2026 at 5:21 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Sun, May 17, 2026 at 1:45 AM Barry Song <baohua@kernel.org> wrote:
> > >
> > > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > for an unpredictable amount of time.
> > > > > >
> > > > > > But does that actually happen?  I find it hard to believe that thread A
> > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > > > it still seems really unlikely to me.
> > > > >
> > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > the entire VMA—just a portion of it is sufficient.
> > > >
> > > > Yes, but that still fails to answer "does this actually happen".  How much
> > > > performance is all this complexity in the page fault handler buying us?
> > > > If you don't answer this question, I'm just going to go in and rip it
> > > > all out.
> > > >
> > >
> > > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > > waiting for answers),
> > >
> > > As promised during LSF/MM/BPF, we conducted thorough
> > > testing on Android phones to determine whether performing
> > > I/O in `filemap_fault()` can block `vma_start_write()`.
> > > I wanted to give a quick update on this question.
> > >
> > > Nanzhe at Xiaomi created tracing scripts and ran various
> > > applications on Android devices with I/O performed under
> > > the VMA lock in `filemap_fault()`. We found that:
> > >
> > > 1. There are very few cases where unmap() is blocked by
> > >    page faults. I assume this is due to buggy user code
> > >    or poor synchronization between reads and unmap().
> > > So I assume it is not a problem.
> > >
> > > 2. We observed many cases where `vma_start_write()`
> > >    is blocked by page-fault I/O in some applications.
> > >    The blocking occurs in the `dup_mmap()` path during
> > >    fork().
> > >
> > > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > > the parent process when forking"), we now always hold
> > > `vma_write_lock()` for each VMA. Note that the
> > > `mmap_lock` write lock is also held, which could lead to
> > > chained waiting if page-fault I/O is performed without
> > > releasing the VMA lock.
> > >
> > > My gut feeling is that Suren's commit may be overshooting,
> > > so my rough idea is that we might want to do something like
> > > the following (we haven't tested it yet and it might be
> > > wrong):
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 2311ae7c2ff4..5ddaf297f31a 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > > *mm, struct mm_struct *oldmm)
> > >         for_each_vma(vmi, mpnt) {
> > >                 struct file *file;
> > >
> > > -               retval = vma_start_write_killable(mpnt);
> > > +               /*
> > > +                * For anonymous or writable private VMAs, prevent
> > > +                * concurrent CoW faults.
> > > +                */
> > > +               if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > > +                                       (mpnt->vm_flags & VM_WRITE)))
> > > +                       retval = vma_start_write_killable(mpnt);
> > >                 if (retval < 0)
> > >                         goto loop_out;
> > >                 if (mpnt->vm_flags & VM_DONTCOPY) {
> >
> > Maybe a little bit off topic. This is an interesting idea. It seems
> > possible we don't have to take vma write lock unconditionally. IIUC
> > the write lock is mainly used to serialize against page fault and
> > madvise, right? I got a crazy idea off the top of my head. We may be
> > able to just take vma write lock iff vma->anon_vma is not NULL.
> >
> > First of all, write mmap_lock is held, so the vma can't go or be
> > changed under us.
> >
> > Secondly, if vma->anon_vma is NULL, it basically means either no page
> > fault happened or no cow happened, so there is no page table to copy,
> > this is also what copy_page_range() does currently. So we can shrink
> > the critical section to:
> >
> > if (vma->anon_vma) {
> >     vma_start_write_killable(src_vma);
> >     anon_vma_fork(dst_vma, src_vma);
> >     copy_page_range(dst_vma, src_vma);
> > }
> >
> > But page fault can happen before write mmap_lock is taken, when we
> > check vma->anon_vma, it is possible it has not been set up yet. But it
> > seems to be equivalent to page fault after fork and won't break the
> > semantic.
>
> Re-reading Suren's commit log for fb49c455323ff8
> ("fork: lock VMAs of the parent process when forking"),
> it seems that vma_start_write() is used to protect
> against a race where anon_vma changes from NULL to
> non-NULL during fork. In that scenario, we hold the
> mmap_lock write lock, but not vma_start_write(), so a
> concurrent anon_vma_prepare() could still install an
> anon_vma.
>
> "    A concurrent page fault on a page newly marked read-only by the page
>     copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the
>     source vma, defeating the anon_vma_clone() that wasn't done because the
>     parent vma originally didn't have an anon_vma, but we now might end up
>     copying a pte entry for a page that has one.
> "
>
> If that is the case, then your change does not work.
>
> Nowadays, nobody calls anon_vma_prepare(vma) directly.
> Instead, vmf_anon_prepare() is used, and we always
> require the mmap_lock read lock before calling
> __anon_vma_prepare(). As a result, anon_vma cannot
> transition from NULL to non-NULL during fork.
>
> So the original race condition has effectively
> disappeared.

anon_vma_prepare() has some use cases too, but it seems to require
taking read mmap_lock as well, if I read the code correctly.

>
> You also mentioned the madvise() case. If I understand
> correctly, madvise() should take mmap_lock before
> modifying anon_vma. Only some parts of madvise() can
> support per-VMA locking. Therefore, we probably do not
> need:
>
> if (vma->anon_vma) {
> vma_start_write_killable(src_vma);
> ...
> }

I think we still need the write vma lock to serialize the anon_vma fork,
otherwise we may see:

        CPU 0                              CPU 1
fork                                   page fault
   src vma has no anon_vma
       skip vma fork
                                       allocate anon_vma for src vma
vma_needs_copy() sees anon_vma
copy page

Then we may end up with no anon_vma for the dst vma, but with pages mapped in it.

Thanks,
Yang

>
> >
> > Anyway, just a crazy idea, I may miss some corner cases.
>
> To me, it seems that we could remove vma_start_write()
> entirely now. Or is that an even crazier idea?


>
> Thanks
> Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-19 18:50                 ` Yang Shi
@ 2026-05-19 20:53                   ` Yang Shi
  0 siblings, 0 replies; 80+ messages in thread
From: Yang Shi @ 2026-05-19 20:53 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, ljs, liam, vbabka,
	rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Tue, May 19, 2026 at 11:50 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, May 19, 2026 at 4:07 AM Barry Song <baohua@kernel.org> wrote:
> >
> > On Tue, May 19, 2026 at 5:21 AM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Sun, May 17, 2026 at 1:45 AM Barry Song <baohua@kernel.org> wrote:
> > > >
> > > > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > >
> > > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > > for an unpredictable amount of time.
> > > > > > >
> > > > > > > But does that actually happen?  I find it hard to believe that thread A
> > > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > > > > it still seems really unlikely to me.
> > > > > >
> > > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > > the entire VMA—just a portion of it is sufficient.
> > > > >
> > > > > Yes, but that still fails to answer "does this actually happen".  How much
> > > > > performance is all this complexity in the page fault handler buying us?
> > > > > If you don't answer this question, I'm just going to go in and rip it
> > > > > all out.
> > > > >
> > > >
> > > > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > > > waiting for answers),
> > > >
> > > > As promised during LSF/MM/BPF, we conducted thorough
> > > > testing on Android phones to determine whether performing
> > > > I/O in `filemap_fault()` can block `vma_start_write()`.
> > > > I wanted to give a quick update on this question.
> > > >
> > > > Nanzhe at Xiaomi created tracing scripts and ran various
> > > > applications on Android devices with I/O performed under
> > > > the VMA lock in `filemap_fault()`. We found that:
> > > >
> > > > 1. There are very few cases where unmap() is blocked by
> > > >    page faults. I assume this is due to buggy user code
> > > >    or poor synchronization between reads and unmap().
> > > > So I assume it is not a problem.
> > > >
> > > > 2. We observed many cases where `vma_start_write()`
> > > >    is blocked by page-fault I/O in some applications.
> > > >    The blocking occurs in the `dup_mmap()` path during
> > > >    fork().
> > > >
> > > > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > > > the parent process when forking"), we now always hold
> > > > `vma_write_lock()` for each VMA. Note that the
> > > > `mmap_lock` write lock is also held, which could lead to
> > > > chained waiting if page-fault I/O is performed without
> > > > releasing the VMA lock.
> > > >
> > > > My gut feeling is that Suren's commit may be overshooting,
> > > > so my rough idea is that we might want to do something like
> > > > the following (we haven't tested it yet and it might be
> > > > wrong):
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index 2311ae7c2ff4..5ddaf297f31a 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > > > *mm, struct mm_struct *oldmm)
> > > >         for_each_vma(vmi, mpnt) {
> > > >                 struct file *file;
> > > >
> > > > -               retval = vma_start_write_killable(mpnt);
> > > > +               /*
> > > > +                * For anonymous or writable private VMAs, prevent
> > > > +                * concurrent CoW faults.
> > > > +                */
> > > > +               if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > > > +                                       (mpnt->vm_flags & VM_WRITE)))
> > > > +                       retval = vma_start_write_killable(mpnt);
> > > >                 if (retval < 0)
> > > >                         goto loop_out;
> > > >                 if (mpnt->vm_flags & VM_DONTCOPY) {
> > >
> > > Maybe a little bit off topic. This is an interesting idea. It seems
> > > possible we don't have to take vma write lock unconditionally. IIUC
> > > the write lock is mainly used to serialize against page fault and
> > > madvise, right? I got a crazy idea off the top of my head. We may be
> > > able to just take vma write lock iff vma->anon_vma is not NULL.
> > >
> > > First of all, write mmap_lock is held, so the vma can't go or be
> > > changed under us.
> > >
> > > Secondly, if vma->anon_vma is NULL, it basically means either no page
> > > fault happened or no cow happened, so there is no page table to copy,
> > > this is also what copy_page_range() does currently. So we can shrink
> > > the critical section to:
> > >
> > > if (vma->anon_vma) {
> > >     vma_start_write_killable(src_vma);
> > >     anon_vma_fork(dst_vma, src_vma);
> > >     copy_page_range(dst_vma, src_vma);
> > > }
> > >
> > > But page fault can happen before write mmap_lock is taken, when we
> > > check vma->anon_vma, it is possible it has not been set up yet. But it
> > > seems to be equivalent to page fault after fork and won't break the
> > > semantic.
> >
> > Re-reading Suren's commit log for fb49c455323ff8
> > ("fork: lock VMAs of the parent process when forking"),
> > it seems that vma_start_write() is used to protect
> > against a race where anon_vma changes from NULL to
> > non-NULL during fork. In that scenario, we hold the
> > mmap_lock write lock, but not vma_start_write(), so a
> > concurrent anon_vma_prepare() could still install an
> > anon_vma.
> >
> > "    A concurrent page fault on a page newly marked read-only by the page
> >     copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the
> >     source vma, defeating the anon_vma_clone() that wasn't done because the
> >     parent vma originally didn't have an anon_vma, but we now might end up
> >     copying a pte entry for a page that has one.
> > "
> >
> > If that is the case, then your change does not work.
> >
> > Nowadays, nobody calls anon_vma_prepare(vma) directly.
> > Instead, vmf_anon_prepare() is used, and we always
> > require the mmap_lock read lock before calling
> > __anon_vma_prepare(). As a result, anon_vma cannot
> > transition from NULL to non-NULL during fork.
> >
> > So the original race condition has effectively
> > disappeared.
>
> anon_vma_prepare() has some usecases too, but it seems like it
> requires taking read mmap_lock too if I read the code correctly.
>
> >
> > You also mentioned the madvise() case. If I understand
> > correctly, madvise() should take mmap_lock before
> > modifying anon_vma. Only some parts of madvise() can
> > support per-VMA locking. Therefore, we probably do not
> > need:
> >
> > if (vma->anon_vma) {
> > vma_start_write_killable(src_vma);
> > ...
> > }
>
> I think we still need write vma lock to serialize anon_vma fork
> otherwise we may see:
>
>         CPU 0                                                 CPU 1
> fork                                                       page fault
>    src vma has no anon_vma
>        skip vma fork
>
> allocate anon_vma for src vma
> vma_needs_copy() sees anon_vma
> copy page
>
> Then we may end up being no anon_vma for dst vma, but with pages mapped in it.

Sorry, this should not happen because creating anon_vma in page fault
needs to take mmap_lock.
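
A toy userspace model of the locking argument above, not kernel code. All
names here (`toy_mm`, `toy_vma`, `toy_fault_install_anon_vma`) are
hypothetical, and mmap_lock is reduced to a single flag. It assumes, as
stated in this thread, that installing an anon_vma from the fault path needs
mmap_lock held for read, so the install cannot proceed while fork holds
mmap_lock for write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are kernel types. */
struct toy_mm {
	bool mmap_write_locked;	/* true while "fork" holds mmap_lock for write */
};

struct toy_vma {
	struct toy_mm *mm;
	void *anon_vma;		/* NULL until a fault installs one */
};

static int toy_anon_vma_token;	/* its address serves as a dummy anon_vma */

/*
 * Model of a fault that needs to install an anon_vma. Assumption taken
 * from the thread: the install requires the mmap read lock, so it must
 * back off while a writer (fork) holds mmap_lock.
 */
static bool toy_fault_install_anon_vma(struct toy_vma *vma)
{
	assert(vma != NULL && vma->mm != NULL);

	if (vma->anon_vma)
		return true;		/* already installed, nothing to do */
	if (vma->mm->mmap_write_locked)
		return false;		/* would block: read lock unavailable */

	vma->anon_vma = &toy_anon_vma_token;
	return true;
}
```

In this model a VMA whose anon_vma is NULL when fork takes mmap_lock for
write still has a NULL anon_vma when fork looks at it again, which is the
property the interleaving above would need to violate.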

Thanks,
Yang

>
> Thanks,
> Yang
>
> >
> > >
> > > Anyway, just a crazy idea, I may miss some corner cases.
> >
> > To me, it seems that we could remove vma_start_write()
> > entirely now. Or is that an even crazier idea?
>
>
> >
> > Thanks
> > Barry



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-18 21:21             ` Yang Shi
  2026-05-19 11:07               ` Barry Song
@ 2026-05-19 13:12               ` Lorenzo Stoakes
  2026-05-19 13:39                 ` Lorenzo Stoakes
  1 sibling, 1 reply; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-19 13:12 UTC (permalink / raw)
  To: Yang Shi
  Cc: Barry Song, Matthew Wilcox, surenb, akpm, linux-mm, david, liam,
	vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Mon, May 18, 2026 at 02:21:14PM -0700, Yang Shi wrote:
> On Sun, May 17, 2026 at 1:45 AM Barry Song <baohua@kernel.org> wrote:
> >
> > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > for an unpredictable amount of time.
> > > > >
> > > > > But does that actually happen?  I find it hard to believe that thread A
> > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > > it still seems really unlikely to me.
> > > >
> > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > the entire VMA—just a portion of it is sufficient.
> > >
> > > Yes, but that still fails to answer "does this actually happen".  How much
> > > performance is all this complexity in the page fault handler buying us?
> > > If you don't answer this question, I'm just going to go in and rip it
> > > all out.
> > >
> >
> > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > waiting for answers),
> >
> > As promised during LSF/MM/BPF, we conducted thorough
> > testing on Android phones to determine whether performing
> > I/O in `filemap_fault()` can block `vma_start_write()`.
> > I wanted to give a quick update on this question.
> >
> > Nanzhe at Xiaomi created tracing scripts and ran various
> > applications on Android devices with I/O performed under
> > the VMA lock in `filemap_fault()`. We found that:
> >
> > 1. There are very few cases where unmap() is blocked by
> >    page faults. I assume this is due to buggy user code
> >    or poor synchronization between reads and unmap().
> > So I assume it is not a problem.
> >
> > 2. We observed many cases where `vma_start_write()`
> >    is blocked by page-fault I/O in some applications.
> >    The blocking occurs in the `dup_mmap()` path during
> >    fork().
> >
> > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > the parent process when forking"), we now always hold
> > `vma_write_lock()` for each VMA. Note that the
> > `mmap_lock` write lock is also held, which could lead to
> > chained waiting if page-fault I/O is performed without
> > releasing the VMA lock.
> >
> > My gut feeling is that Suren's commit may be overshooting,
> > so my rough idea is that we might want to do something like
> > the following (we haven't tested it yet and it might be
> > wrong):
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 2311ae7c2ff4..5ddaf297f31a 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > *mm, struct mm_struct *oldmm)
> >         for_each_vma(vmi, mpnt) {
> >                 struct file *file;
> >
> > -               retval = vma_start_write_killable(mpnt);
> > +               /*
> > +                * For anonymous or writable private VMAs, prevent
> > +                * concurrent CoW faults.
> > +                */
> > +               if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > +                                       (mpnt->vm_flags & VM_WRITE)))
> > +                       retval = vma_start_write_killable(mpnt);
> >                 if (retval < 0)
> >                         goto loop_out;
> >                 if (mpnt->vm_flags & VM_DONTCOPY) {
>
> Maybe a little bit off topic. This is an interesting idea. It seems
> possible we don't have to take vma write lock unconditionally. IIUC
> the write lock is mainly used to serialize against page fault and
> madvise, right? I got a crazy idea off the top of my head. We may be

Err no, it serialises against literally any modification or read of any
characteristic of VMAs.

> able to just take vma write lock iff vma->anon_vma is not NULL.

Except if we don't take it and vma->anon_vma is NULL, then somebody can
anon_vma_prepare() and change vma->anon_vma midway through a fork and completely
screw up the anon_vma fork hierarchy.

So no.

>
> First of all, write mmap_lock is held, so the vma can't go or be
> changed under us.

vma->anon_vma can be changed.

>
> Secondly, if vma->anon_vma is NULL, it basically means either no page
> fault happened or no cow happened, so there is no page table to copy,
> this is also what copy_page_range() does currently. So we can shrink
> the critical section to:

Firstly, with no VMA write lock, !vma->anon_vma means a fault can race.
Secondly, copy_page_range() checks vma_needs_copy(), which covers other cases
too: PFN maps, mixed maps, UFFD W/P (ugh), guard regions.

So yeah this isn't sufficient.
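
As a rough illustration of the cases listed above, here is a simplified
userspace model of the kind of decision vma_needs_copy() makes. It is a
sketch, not the kernel function: the `MODEL_*` flag bits and `model_vma`
are hypothetical, the guard-region flag is an assumption about how that
case is tracked, and the real code in mm/memory.c differs in detail across
kernel versions (for instance in which VMA it inspects for UFFD W/P).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch; not the kernel's VM_* values. */
#define MODEL_PFNMAP	(1u << 0)
#define MODEL_MIXEDMAP	(1u << 1)
#define MODEL_UFFD_WP	(1u << 2)
#define MODEL_GUARD	(1u << 3)	/* VMA may contain guard regions */

struct model_vma {
	unsigned int flags;
	const void *anon_vma;	/* non-NULL once anonymous pages may exist */
};

/* Return true if fork would have to copy page tables for this VMA. */
static bool model_vma_needs_copy(const struct model_vma *src)
{
	assert(src != NULL);

	/* Cases that a later page fault cannot rebuild on its own. */
	if (src->flags & (MODEL_PFNMAP | MODEL_MIXEDMAP |
			  MODEL_UFFD_WP | MODEL_GUARD))
		return true;

	/* An anon_vma implies page tables that may hold anonymous pages. */
	return src->anon_vma != NULL;
}
```

The point of the model is only that a NULL anon_vma does not by itself
mean "nothing to copy": any of the flag cases forces a copy regardless.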

>
> if (vma->anon_vma) {
>     vma_start_write_killable(src_vma);
>     anon_vma_fork(dst_vma, src_vma);
>     copy_page_range(dst_vma, src_vma);
> }

Yeah that's totally broken for reasons above as I said :)

>
> But page fault can happen before write mmap_lock is taken, when we
> check vma->anon_vma, it is possible it has not been set up yet. But it
> seems to be equivalent to page fault after fork and won't break the
> semantic.

It will totally break how the anon_vma hierarchy works :) See the links at the
top of https://ljs.io/talks for a link to various slides on anon_vma behaviour
(it's really a pain to think about because it's a super broken abstraction).

You could end up with a CoW mapping that's unreachable from rmap and you could
get some nasty issues with page table entries pointing at freed folios :)

>
> Anyway, just a crazy idea, I may miss some corner cases.

Yeah sorry to push back here but this is just not a viable approach.

And this is forgetting that we have relied on page faults being blocked by fork
_forever_, who knows what else has baked in assumptions about that
serialisation.

Forking is one of the nastiest parts of mm and has had multiple, subtle, corner
case breakages that have been a nightmare to deal with.

So I'm very much against changing this behaviour to try to fix something in the
fault path.

We should address the fault path issues in the fault path :)

>
> Thanks,
> Yang
>
> }
>
> >
> > Based on the above, we may want to re-check whether fork()
> > can be blocked by page faults. At the same time, if Suren,
> > you, or anyone else has any comments, please feel free to
> > share them.
> >
> > Best Regards
> > Barry
> >

Cheers, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-19 13:12               ` Lorenzo Stoakes
@ 2026-05-19 13:39                 ` Lorenzo Stoakes
  2026-05-19 18:41                   ` Yang Shi
  0 siblings, 1 reply; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-19 13:39 UTC (permalink / raw)
  To: Yang Shi
  Cc: Barry Song, Matthew Wilcox, surenb, akpm, linux-mm, david, liam,
	vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Tue, May 19, 2026 at 02:12:10PM +0100, Lorenzo Stoakes wrote:
> On Mon, May 18, 2026 at 02:21:14PM -0700, Yang Shi wrote:
> > Maybe a little bit off topic. This is an interesting idea. It seems
> > possible we don't have to take vma write lock unconditionally. IIUC
> > the write lock is mainly used to serialize against page fault and
> > madvise, right? I got a crazy idea off the top of my head. We may be
>
> Err no, it serialises against literally any modification or read of any
> characteristic of VMAs.
>
> > able to just take vma write lock iff vma->anon_vma is not NULL.
>
> Except if we don't take it and vma->anon_vma is NULL, then somebody can
> anon_vma_prepare() and change vma->anon_vma midway through a fork and completely
> screw up the anon_vma fork hierarchy.

correction: this won't happen as per Barry (see - I managed to confuse myself
here :), since for vma->anon_vma install we take the mmap read lock.

BUT we also have to consider other cases.

>
> So no.
>
> >
> > First of all, write mmap_lock is held, so the vma can't go or be
> > changed under us.
>
> vma->anon_vma can be changed.

Correction: no it can't :)

>
> >
> > Secondly, if vma->anon_vma is NULL, it basically means either no page
> > fault happened or no cow happened, so there is no page table to copy,
> > this is also what copy_page_range() does currently. So we can shrink
> > the critical section to:
>
> Firstly, with no VMA write lock, !vma->anon_vma means a fault can race and
> secondly copy_page_range() checks vma_needs_copy(), there are other cases - PFN
> maps, mixed maps, UFFD W/P (ugh), guard regions.
>
> So yeah this isn't sufficient.

However this is true...

>
> >
> > if (vma->anon_vma) {
> >     vma_start_write_killable(src_vma);
> >     anon_vma_fork(dst_vma, src_vma);
> >     copy_page_range(dst_vma, src_vma);
> > }
>
> Yeah that's totally broken for reasons above as I said :)
>
> >
> > But page fault can happen before write mmap_lock is taken, when we
> > check vma->anon_vma, it is possible it has not been set up yet. But it
> > seems to be equivalent to page fault after fork and won't break the
> > semantic.
>
> It will totally break how the anon_vma hierarchy works :) See the links at the
> top of https://ljs.io/talks for a link to various slides on anon_vma behaviour
> (it's really a pain to think about because it's a super broken abstraction).
>
> You could end up with a CoW mapping that's unreachable from rmap and you could
> get some nasty issues with page table entries pointing at freed folios :)

Correction: actually we should be safe given mmap read lock on anon_vma install.

>
> >
> > Anyway, just a crazy idea, I may miss some corner cases.
>
> Yeah sorry to push back here but this is just not a viable approach.
>
> And this is forgetting that we have relied on page faults being blocked by fork
> _forever_, who knows what else has baked in assumptions about that
> serialisation.
>
> Forking is one of the nastiest parts of mm and has had multiple, subtle, corner
> case breakages that have been a nightmare to deal with.
>
> So I'm very much against changing this behaviour to try to fix something in the
> fault path.
>
> We should address the fault path issues in the fault path :)

Above still all true though.

>
> >
> > Thanks,
> > Yang
> >
> > }
> >
> > >
> > > Based on the above, we may want to re-check whether fork()
> > > can be blocked by page faults. At the same time, if Suren,
> > > you, or anyone else has any comments, please feel free to
> > > share them.
> > >
> > > Best Regards
> > > Barry
> > >
>
> Cheers, Lorenzo

So still a nope :)

Cheers, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-19 13:39                 ` Lorenzo Stoakes
@ 2026-05-19 18:41                   ` Yang Shi
  2026-05-19 21:02                     ` Yang Shi
  0 siblings, 1 reply; 80+ messages in thread
From: Yang Shi @ 2026-05-19 18:41 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Barry Song, Matthew Wilcox, surenb, akpm, linux-mm, david, liam,
	vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Tue, May 19, 2026 at 6:39 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Tue, May 19, 2026 at 02:12:10PM +0100, Lorenzo Stoakes wrote:
> > On Mon, May 18, 2026 at 02:21:14PM -0700, Yang Shi wrote:
> > > Maybe a little bit off topic. This is an interesting idea. It seems
> > > possible we don't have to take vma write lock unconditionally. IIUC
> > > the write lock is mainly used to serialize against page fault and
> > > madvise, right? I got a crazy idea off the top of my head. We may be
> >
> > Err no, it serialises against literally any modification or read of any
> > characteristic of VMAs.

If I remember correctly, you are not supposed to change VMA
flags/size/mm pointer/vm_file/pgoff/prot, etc, under read vma lock or
read mmap_lock.

> >
> > > able to just take vma write lock iff vma->anon_vma is not NULL.
> >
> > Except if we don't take it and vma->anon_vma is NULL, then somebody can
> > anon_vma_prepare() and change vma->anon_vma midway through a fork and completely
> > screw up the anon_vma fork hierarchy.
>
> correction: this won't happen as per Barry (see - I managed to confuse myself
> here :), since for vma->anon_vma install we take the mmap read lock.
>
> BUT we also have to consider other cases.
>
> >
> > So no.
> >
> > >
> > > First of all, write mmap_lock is held, so the vma can't go or be
> > > changed under us.
> >
> > vma->anon_vma can be changed.
>
> Correction: no it can't :)

Yes, vma->anon_vma change should require taking read mmap_lock.

>
> >
> > >
> > > Secondly, if vma->anon_vma is NULL, it basically means either no page
> > > fault happened or no cow happened, so there is no page table to copy,
> > > this is also what copy_page_range() does currently. So we can shrink
> > > the critical section to:
> >
> > Firstly, with no VMA write lock, !vma->anon_vma means a fault can race and
> > secondly copy_page_range() checks vma_needs_copy(), there are other cases - PFN
> > maps, mixed maps, UFFD W/P (ugh), guard regions.
> >
> > So yeah this isn't sufficient.
>
> However this is true...

Yes, fault can race with fork. Basically this is actually the purpose
of this idea. We can have improved page fault scalability. In my
proposal (take write vma lock if vma->anon_vma is not NULL), the race
just happens on the VMAs which page fault has not happened on before.
vma_needs_copy() also skips the VMAs which don't have vma->anon_vma.
So there is basically no difference in semantics other than more page
fault races IIUC. It should be safe as long as we can guarantee there
is no writable PTE pointing to a shared page after fork.

For guard regions, it can be serialized by vma write lock if
vma->anon_vma exists. If vma->anon_vma is NULL, it will prepare
anon_vma, which will take read mmap_lock if I read the code correctly.

I have not investigated UFFD yet.

>
> >
> > >
> > > if (vma->anon_vma) {
> > >     vma_start_write_killable(src_vma);
> > >     anon_vma_fork(dst_vma, src_vma);
> > >     copy_page_range(dst_vma, src_vma);
> > > }
> >
> > Yeah that's totally broken for reasons above as I said :)
> >
> > >
> > > But page fault can happen before write mmap_lock is taken, when we
> > > check vma->anon_vma, it is possible it has not been set up yet. But it
> > > seems to be equivalent to page fault after fork and won't break the
> > > semantic.
> >
> > It will totally break how the anon_vma hierarchy works :) See the links at the
> > top of https://ljs.io/talks for a link to various slides on anon_vma behaviour
> > (it's really a pain to think about because it's a super broken abstraction).
> >
> > You could end up with a CoW mapping that's unreachable from rmap and you could
> > get some nasty issues with page table entries pointing at freed folios :)
>
> Correction: actually we should be safe given mmap read lock on anon_vma install.
>
> >
> > >
> > > Anyway, just a crazy idea, I may miss some corner cases.
> >
> > Yeah sorry to push back here but this is just not a viable approach.

No worries. Thanks for all the feedback. Just tried to explore whether
such an idea is feasible or not.

> >
> > And this is forgetting that we have relied on page faults being blocked by fork
> > _forever_, who knows what else has baked in assumptions about that
> > serialisation.
> >
> > Forking is one of the nastiest parts of mm and has had multiple, subtle, corner
> > case breakages that have been a nightmare to deal with.

Yes, this might be the biggest concern. The page fault can race with
fork. If some applications rely on such subtle behavior, it may break,
but such applications are fragile too.

> >
> > So I'm very much against changing this behaviour to try to fix something in the
> > fault path.
> >
> > We should address the fault path issues in the fault path :)

Yeah, this idea was inspired by Barry's "not take vma read lock
unconditionally" idea. Maybe irrelevant to Barry's priority inversion
problem, just an idea for further optimization on page fault
scalability. This probably should be a separate topic.

Thanks,
Yang

>
> Above still all true though.
>
> >
> > >
> > > Thanks,
> > > Yang
> > >
> > > }
> > >
> > > >
> > > > Based on the above, we may want to re-check whether fork()
> > > > can be blocked by page faults. At the same time, if Suren,
> > > > you, or anyone else has any comments, please feel free to
> > > > share them.
> > > >
> > > > Best Regards
> > > > Barry
> > > >
> >
> > Cheers, Lorenzo
>
> So still a nope :)
>
> Cheers, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-19 18:41                   ` Yang Shi
@ 2026-05-19 21:02                     ` Yang Shi
  2026-05-20  8:11                       ` Lorenzo Stoakes
  0 siblings, 1 reply; 80+ messages in thread
From: Yang Shi @ 2026-05-19 21:02 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Barry Song, Matthew Wilcox, surenb, akpm, linux-mm, david, liam,
	vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Tue, May 19, 2026 at 11:41 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, May 19, 2026 at 6:39 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Tue, May 19, 2026 at 02:12:10PM +0100, Lorenzo Stoakes wrote:
> > > On Mon, May 18, 2026 at 02:21:14PM -0700, Yang Shi wrote:
> > > > Maybe a little bit off topic. This is an interesting idea. It seems
> > > > possible we don't have to take vma write lock unconditionally. IIUC
> > > > the write lock is mainly used to serialize against page fault and
> > > > madvise, right? I got a crazy idea off the top of my head. We may be
> > >
> > > Err no, it serialises against literally any modification or read of any
> > > characteristic of VMAs.
>
> If I remember correctly, you are not supposed to change VMA
> flags/size/mm pointer/vm_file/pgoff/prot, etc, under read vma lock or
> read mmap_lock.
>
> > >
> > > > able to just take vma write lock iff vma->anon_vma is not NULL.
> > >
> > > Except if we don't take it and vma->anon_vma is NULL, then somebody can
> > > anon_vma_prepare() and change vma->anon_vma midway through a fork and completely
> > > screw up the anon_vma fork hierarchy.
> >
> > correction: this won't happen as per Barry (see - I managed to confuse myself
> > here :), since for vma->anon_vma install we take the mmap read lock.
> >
> > BUT we also have to consider other cases.
> >
> > >
> > > So no.
> > >
> > > >
> > > > First of all, write mmap_lock is held, so the vma can't go or be
> > > > changed under us.
> > >
> > > vma->anon_vma can be changed.
> >
> > Correction: no it can't :)
>
> Yes, vma->anon_vma change should require taking read mmap_lock.
>
> >
> > >
> > > >
> > > > Secondly, if vma->anon_vma is NULL, it basically means either no page
> > > > fault happened or no cow happened, so there is no page table to copy,
> > > > this is also what copy_page_range() does currently. So we can shrink
> > > > the critical section to:
> > >
> > > Firstly, with no VMA write lock, !vma->anon_vma means a fault can race and
> > > secondly copy_page_range() checks vma_needs_copy(), there are other cases - PFN
> > > maps, mixed maps, UFFD W/P (ugh), guard regions.
> > >
> > > So yeah this isn't sufficient.
> >
> > However this is true...
>
> Yes, fault can race with fork. Basically this is actually the purpose
> of this idea. We can have improved page fault scalability. In my
> proposal (take write vma lock if vma->anon_vma is not NULL), the race
> just happens on the VMAs which page fault has not happened on before.

Sorry, this is incorrect. Page fault can't happen on those VMAs
because page fault needs to create anon_vma, but it requires taking
mmap_lock.
If anon_vma is not NULL, vma write lock will serialize against page
fault. So there should be no race with page fault. Removing vma write
lock suggested by Barry may increase race.
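
To make the argument above concrete, here is a minimal userspace sketch in C. It is a toy model only: `struct toy_vma`, `toy_fault_lock()` and `toy_fork_excludes_fault()` are hypothetical names, not kernel code, and the model assumes (as stated in this thread) that installing vma->anon_vma requires the mmap read lock and that fork already holds the mmap write lock. It covers only the anonymous fault case and ignores the other cases raised earlier (PFN maps, mixed maps, UFFD W/P, guard regions).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a VMA (not the kernel struct). */
struct toy_vma {
	void *anon_vma;	/* NULL until an anon_vma has been installed */
};

enum toy_lock {
	TOY_VMA_READ_LOCK,	/* fault can proceed under the per-VMA read lock */
	TOY_MMAP_READ_LOCK,	/* fault must fall back to the mmap read lock */
};

/*
 * Which lock an anonymous fault needs in this model. Assumption taken from
 * the thread: installing anon_vma requires the mmap read lock.
 */
enum toy_lock toy_fault_lock(const struct toy_vma *vma)
{
	assert(vma != NULL);
	return vma->anon_vma ? TOY_VMA_READ_LOCK : TOY_MMAP_READ_LOCK;
}

/*
 * Models the proposal: fork holds the mmap write lock and takes the VMA
 * write lock only if anon_vma is set. Returns true if the fault is excluded
 * for the duration of fork.
 */
bool toy_fork_excludes_fault(const struct toy_vma *vma)
{
	bool fork_takes_vma_write_lock = vma->anon_vma != NULL;

	switch (toy_fault_lock(vma)) {
	case TOY_MMAP_READ_LOCK:
		/* Blocked by the mmap write lock that fork already holds. */
		return true;
	case TOY_VMA_READ_LOCK:
		/* Blocked only if fork took the VMA write lock. */
		return fork_takes_vma_write_lock;
	}
	return false;
}
```

In this model both branches return true, which is the claim above. Dropping the VMA write lock entirely would make the second branch return false.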

Thanks,
Yang

> vma_needs_copy() also skips the VMAs which don't have vma->anon_vma.
> So there is basically no difference in semantics other than more page
> fault races IIUC. It should be safe as long as we can guarantee there
> is no writable PTE pointing to a shared page after fork.
>
> For guard regions, it can be serialized by vma write lock if
> vma->anon_vma exists. If vma->anon_vma is NULL, it will prepare
> anon_vma, which will take read mmap_lock if I read the code correctly.
>
> I have not investigated UFFD yet.
>
> >
> > >
> > > >
> > > > > if (src_vma->anon_vma) {
> > > >     vma_start_write_killable(src_vma);
> > > >     anon_vma_fork(dst_vma, src_vma);
> > > >     copy_page_range(dst_vma, src_vma);
> > > > }
> > >
> > > > Yeah that's totally broken for reasons above as I said :)
> > >
> > > >
> > > > But page fault can happen before write mmap_lock is taken, when we
> > > > check vma->anon_vma, it is possible it has not been set up yet. But it
> > > > seems to be equivalent to page fault after fork and won't break the
> > > > semantic.
> > >
> > > It will totally break how the anon_vma hierarchy works :) See the links at the
> > > top of https://ljs.io/talks for a link to various slides on anon_vma behaviour
> > > (it's really a pain to think about because it's a super broken abstraction).
> > >
> > > You could end up with a CoW mapping that's unreachable from rmap and you could
> > > get some nasty issues with page table entries pointing at freed folios :)
> >
> > Correction: actually we should be safe given mmap read lock on anon_vma install.
> >
> > >
> > > >
> > > > Anyway, just a crazy idea, I may miss some corner cases.
> > >
> > > Yeah sorry to push back here but this is just not a viable approach.
>
> No worries. Thanks for all the feedback. Just tried to explore whether
> such an idea is feasible or not.
>
> > >
> > > And this is forgetting that we have relied on page faults being blocked by fork
> > > _forever_, who knows what else has baked in assumptions about that
> > > serialisation.
> > >
> > > Forking is one of the nastiest parts of mm and has had multiple, subtle, corner
> > > case breakages that have been a nightmare to deal with.
>
> Yes, this might be the biggest concern. The page fault can race with
> fork. If some applications rely on such subtle behavior, it may break,
> but such applications are fragile too.
>
> > >
> > > So I'm very much against changing this behaviour to try to fix something in the
> > > fault path.
> > >
> > > We should address the fault path issues in the fault path :)
>
> Yeah, this idea was inspired by Barry's "not take vma read lock
> unconditionally" idea. Maybe irrelevant to Barry's priority inversion
> problem, just an idea for further optimization on page fault
> scalability. This probably should be a separate topic.
>
> Thanks,
> Yang
>
> >
> > Above still all true though.
> >
> > >
> > > >
> > > > Thanks,
> > > > Yang
> > > >
> > > > }
> > > >
> > > > >
> > > > > Based on the above, we may want to re-check whether fork()
> > > > > can be blocked by page faults. At the same time, if Suren,
> > > > > you, or anyone else has any comments, please feel free to
> > > > > share them.
> > > > >
> > > > > Best Regards
> > > > > Barry
> > > > >
> > >
> > > Cheers, Lorenzo
> >
> > So still a nope :)
> >
> > Cheers, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-19 21:02                     ` Yang Shi
@ 2026-05-20  8:11                       ` Lorenzo Stoakes
  0 siblings, 0 replies; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-20  8:11 UTC (permalink / raw)
  To: Yang Shi
  Cc: Barry Song, Matthew Wilcox, surenb, akpm, linux-mm, david, liam,
	vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao

On Tue, May 19, 2026 at 02:02:09PM -0700, Yang Shi wrote:
> On Tue, May 19, 2026 at 11:41 AM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > >
> > > > >
> > > > > Secondly, if vma->anon_vma is NULL, it basically means either no page
> > > > > fault happened or no cow happened, so there is no page table to copy,
> > > > > this is also what copy_page_range() does currently. So we can shrink
> > > > > the critical section to:
> > > >
> > > > Firstly, with no VMA write lock, !vma->anon_vma means a fault can race and
> > > > secondly copy_page_range() checks vma_needs_copy(), there are other cases - PFN
> > > > maps, mixed maps, UFFD W/P (ugh), guard regions.
> > > >
> > > > So yeah this isn't sufficient.
> > >
> > > However this is true...
> >
> > Yes, fault can race with fork. Basically this is actually the purpose
> > of this idea. We can have improved page fault scalability. In my
> > proposal (take write vma lock if vma->anon_vma is not NULL), the race
> > just happens on the VMAs which page fault has not happened on before.
>
> Sorry, this is incorrect. Page fault can't happen on those VMAs
> because page fault needs to create anon_vma, but it requires taking
> mmap_lock.
> If anon_vma is not NULL, vma write lock will serialize against page
> fault. So there should be no race with page fault. Removing vma write
> lock suggested by Barry may increase race.

Firstly, let's none of us be worried about making mistakes here, the anon_vma
stuff is confusing, and I've stared at it more than most, and even so I
managed to make mistakes (as corrected here) and forget details :))

It's a sign it all needs simplifying, but hey that's what my scalable CoW
project is (partly) about :)

Removing the VMA write lock would cause races with page fault which can result
in page tables being installed which are then not correctly duplicated for
ranges that must be.

And again, I think the underlying point here overall is:

1. Clearly many cases require serialisation (any that cause copy_page_range() to
   fire).

2. If we were to decide not to take a lock with concurrent page faults, that
   lays a trap for any future change that (reasonably) assumes that page tables
   cannot be simultaneously copied while being accessible to page fault
   handlers, which is bug prone.

3. As per 2, even if we were to only take the lock when we felt we absolutely
   needed to, we still cause risk through adding yet another 'you just have to
   know' risk to this part of mm.

4. The serialisation is quite likely relied upon by other things, this is often
   the case in mm, and we may only realise that such serialisation is critical
   at the point a subtle issue arises out of it.

5. Fork is one of the most sensitive, intuition-defying, complicated, and
   corner-case-problem-baiting areas of mm and I really oppose us changing
   fundamental behaviour here unless incredibly well justified.

On this basis, let's let the sleeping dogs lie and leave fork alone I think :)

I think I am far more inclined to take Barry's fault approach (as I've said to
him) vs. changing fork behaviour.

But I want to make sure there's not a 'third way' that could avoid either!

I am going to have a look through Barry's series in detail so we can have some
movement on this one way or another :)

>
> Thanks,
> Yang
>

Cheers, Lorenzo


* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-04-30 12:37 ` [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Matthew Wilcox
  2026-04-30 22:49   ` Barry Song
@ 2026-05-01 15:52   ` Lorenzo Stoakes
  2026-05-01 16:06     ` Matthew Wilcox
  2026-05-01 17:59     ` Barry Song
  1 sibling, 2 replies; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-01 15:52 UTC (permalink / raw)
  To: Barry Song (Xiaomi)
  Cc: Matthew Wilcox, akpm, linux-mm, david, liam, vbabka, rppt, surenb,
	mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Thu, Apr 30, 2026 at 01:37:14PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 30, 2026 at 12:04:22PM +0800, Barry Song (Xiaomi) wrote:
> > (1) If we need to wait for I/O completion, we still drop the per-VMA lock, as
> > current page fault handling already does. Holding it for too long may introduce
> > various priority inversion issues on mobile devices. After I/O completes, we
> > retry the page fault with the per-VMA lock, rather than falling back to
> > mmap_lock.
>
> You're going to have to do better than that.  You know I hate the
> additional complexity you're adding.  You need to explain why my idea of
> ripping out all the complexity now that we have per-VMA locks doesn't
> work.

After a brief eyeball I share Matthew's assessment, I really don't like this
series, it's piling on complexity for what seem like niche cases.

We already have enough weirdness in fault code honestly.

Let's maybe discuss at LSF if you're attending?

I will try to have a more thorough look through when I get a chance.

Thanks, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 15:52   ` Lorenzo Stoakes
@ 2026-05-01 16:06     ` Matthew Wilcox
  2026-05-01 17:09       ` Lorenzo Stoakes
  2026-05-01 17:59     ` Barry Song
  1 sibling, 1 reply; 80+ messages in thread
From: Matthew Wilcox @ 2026-05-01 16:06 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Barry Song (Xiaomi), akpm, linux-mm, david, liam, vbabka, rppt,
	surenb, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Fri, May 01, 2026 at 04:52:12PM +0100, Lorenzo Stoakes wrote:
> After a brief eyeball I share Matthew's assessment, I really don't like this
> series, it's piling on complexity for what seem like niche cases.

I don't think they're niche cases ... I think it's a real problem.
While our current code performs better for this workload than the
pre-vma-lock code did, it doesn't perform as well as it could.

> We already have enough weirdness in fault code honestly.
> 
> Let's maybe discuss at LSF if you're attending?

Not only is he attending, there's a topic scheduled (currently 10:30 on
Wednesday).



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 16:06     ` Matthew Wilcox
@ 2026-05-01 17:09       ` Lorenzo Stoakes
  0 siblings, 0 replies; 80+ messages in thread
From: Lorenzo Stoakes @ 2026-05-01 17:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Barry Song (Xiaomi), akpm, linux-mm, david, liam, vbabka, rppt,
	surenb, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Fri, May 01, 2026 at 05:06:02PM +0100, Matthew Wilcox wrote:
> On Fri, May 01, 2026 at 04:52:12PM +0100, Lorenzo Stoakes wrote:
> > After a brief eyeball I share Matthew's assessment, I really don't like this
> > series, it's piling on complexity for what seem like niche cases.
>
> I don't think they're niche cases ... I think it's a real problem.
> While our current code performs better for this workload than the
> pre-vma-lock code did, it doesn't perform as well as it could.
>
> > We already have enough weirdness in fault code honestly.
> >
> > Let's maybe discuss at LSF if you're attending?
>
> Not only is he attending, there's a topic scheduled (currently 10:30 on
> Wednesday).

Well then, let's revisit this in person in Zagreb :)

Cheers, Lorenzo



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 15:52   ` Lorenzo Stoakes
  2026-05-01 16:06     ` Matthew Wilcox
@ 2026-05-01 17:59     ` Barry Song
  1 sibling, 0 replies; 80+ messages in thread
From: Barry Song @ 2026-05-01 17:59 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Matthew Wilcox, akpm, linux-mm, david, liam, vbabka, rppt, surenb,
	mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Fri, May 1, 2026 at 11:52 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Thu, Apr 30, 2026 at 01:37:14PM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 30, 2026 at 12:04:22PM +0800, Barry Song (Xiaomi) wrote:
> > > (1) If we need to wait for I/O completion, we still drop the per-VMA lock, as
> > > current page fault handling already does. Holding it for too long may introduce
> > > various priority inversion issues on mobile devices. After I/O completes, we
> > > retry the page fault with the per-VMA lock, rather than falling back to
> > > mmap_lock.
> >
> > You're going to have to do better than that.  You know I hate the
> > additional complexity you're adding.  You need to explain why my idea of
> > ripping out all the complexity now that we have per-VMA locks doesn't
> > work.
>
> After a brief eyeball I share Matthew's assessment, I really don't like this
> series, it's piling on complexity for what seem like niche cases.

I’d really appreciate it if you could point out the specific parts you
dislike, rather than the whole series—I don’t think that’s a fair
assessment.

I’m not sure what you mean by “niche cases.” Do you mean avoiding taking
mmap_lock for major page faults, or releasing the per-VMA lock and retrying
the page fault?

Right now, major page faults always fall back to mmap_lock, which is a
significant source of lock contention. I assume we agree that this fallback
should be eliminated. Or is there still no agreement on this point either?

Where we may differ is whether to hold the per-VMA lock and
avoid retrying the page fault, or to rely on retrying the
fault while using the per-VMA lock (with the current
mainline approach using mmap_lock instead)?
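
As a rough illustration of the difference, here is a minimal userspace sketch in C. It is a toy model, not kernel code: `toy_handle_fault()`, `struct toy_fault_stats` and the `retry_under_vma_lock` flag are hypothetical names, and the model assumes the I/O always completes during the first wait.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_lock_kind { TOY_PER_VMA_LOCK, TOY_MMAP_LOCK };

/* Hypothetical bookkeeping for one simulated fault. */
struct toy_fault_stats {
	int attempts;		/* how many times the fault handler ran */
	int mmap_lock_taken;	/* how many attempts ran under mmap_lock */
};

/*
 * Simulate one page fault. The first attempt always uses the per-VMA lock.
 * If the fault needs I/O, the lock is dropped while waiting and the fault
 * is retried, either under mmap_lock (modelling the current fallback) or
 * under the per-VMA lock again (modelling case (1) of this series).
 */
struct toy_fault_stats toy_handle_fault(bool needs_io, bool retry_under_vma_lock)
{
	struct toy_fault_stats st = { 0, 0 };
	enum toy_lock_kind lock = TOY_PER_VMA_LOCK;
	bool io_done = !needs_io;

	for (;;) {
		st.attempts++;
		if (lock == TOY_MMAP_LOCK)
			st.mmap_lock_taken++;

		if (io_done)
			return st;	/* folio ready: fault completes */

		/* Lock is dropped here; assume the I/O finishes while waiting. */
		io_done = true;
		lock = retry_under_vma_lock ? TOY_PER_VMA_LOCK : TOY_MMAP_LOCK;
	}
}
```

In this model both variants retry once for a major fault, but only the fallback variant ever takes mmap_lock. A fault that needs no I/O completes in a single attempt under the per-VMA lock either way.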

>
> We already have enough weirdness in fault code honestly.
>
> Let's maybe discuss at LSF if you're attending?

Sure :-)

>
> I will try to have a more thorough look through when I get a chance.

Thank you, much appreciated.

Best Regards
Barry



end of thread, other threads:[~2026-06-23 10:10 UTC | newest]

Thread overview: 80+ messages
2026-04-30  4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
2026-04-30  4:04 ` [PATCH v2 1/5] mm/filemap: Retry fault by VMA lock if the lock was released for I/O Barry Song (Xiaomi)
2026-04-30  4:04 ` [PATCH v2 2/5] mm/swapin: Retry swapin " Barry Song (Xiaomi)
2026-04-30  4:04 ` [PATCH v2 3/5] mm: Move folio_lock_or_retry() and drop __folio_lock_or_retry() Barry Song (Xiaomi)
2026-04-30  4:04 ` [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in Barry Song (Xiaomi)
2026-04-30 12:35   ` Matthew Wilcox
2026-05-01 16:11     ` Matthew Wilcox
2026-04-30  4:04 ` [PATCH v2 5/5] mm/filemap: Avoid retrying page faults on uptodate folios in filemap faults Barry Song (Xiaomi)
2026-04-30 12:37 ` [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Matthew Wilcox
2026-04-30 22:49   ` Barry Song
2026-05-01 14:56     ` Matthew Wilcox
2026-05-01 17:44       ` Barry Song
2026-05-01 17:57         ` Matthew Wilcox
2026-05-01 18:25           ` Barry Song
2026-05-01 19:39             ` Matthew Wilcox
2026-05-03 20:39               ` Barry Song
2026-05-03 13:13           ` Jan Kara
2026-05-03 19:55             ` Barry Song
2026-05-04 13:03               ` Jan Kara
2026-05-04 13:35                 ` Barry Song
2026-05-04 14:15                 ` Barry Song
2026-05-17  8:45           ` Barry Song
2026-05-18  9:46             ` Lorenzo Stoakes
2026-05-18 11:25               ` Barry Song
2026-05-18 16:17                 ` Matthew Wilcox
2026-05-18 20:50                   ` Barry Song
2026-05-18 19:56                 ` Suren Baghdasaryan
2026-05-18 21:14                   ` Barry Song
2026-05-19 12:45                     ` Lorenzo Stoakes
2026-05-19 14:17                     ` Liam R. Howlett
2026-05-19 22:01                       ` Barry Song
2026-05-20 21:04                         ` Matthew Wilcox
2026-05-20 21:14                           ` Barry Song
2026-05-20 21:15                             ` Matthew Wilcox
2026-05-20 21:35                               ` David Hildenbrand (Arm)
2026-05-20 23:37                                 ` Barry Song
2026-05-22 15:53                                   ` Lorenzo Stoakes
2026-05-22 21:31                                     ` Barry Song
2026-06-20 23:48                                       ` Suren Baghdasaryan
2026-06-21 20:49                                         ` Matthew Wilcox
2026-06-22  0:15                                           ` Barry Song
2026-06-22 14:50                                             ` Liam R. Howlett
2026-06-22 21:35                                               ` Barry Song
2026-06-23  7:58                                 ` Hongru Zhang
2026-06-23  8:02                                   ` David Hildenbrand (Arm)
2026-06-23 10:10                                     ` Hongru Zhang
2026-05-22  2:33                               ` Barry Song (Xiaomi)
2026-05-22 13:09                                 ` Matthew Wilcox
2026-05-22 13:36                                   ` Barry Song
2026-05-22 13:48                                     ` Barry Song
2026-05-22 15:42                                       ` Lorenzo Stoakes
2026-05-19 12:53                   ` Lorenzo Stoakes
2026-05-19 21:18                     ` Barry Song
2026-05-20  7:50                       ` Lorenzo Stoakes
2026-05-20  9:07                         ` Barry Song
2026-05-20 10:07                           ` Lorenzo Stoakes
2026-05-20 16:20                           ` Suren Baghdasaryan
2026-05-20  5:51                     ` Suren Baghdasaryan
2026-05-22 15:39                       ` Lorenzo Stoakes
2026-05-20 10:33                     ` David Hildenbrand (Arm)
2026-05-20 12:55                       ` Lorenzo Stoakes
2026-05-20 21:39                       ` Yang Shi
2026-05-22 15:37                         ` Lorenzo Stoakes
2026-05-19 12:43                 ` Lorenzo Stoakes
2026-05-18  9:53             ` David Hildenbrand (Arm)
2026-05-19 13:42               ` Lorenzo Stoakes
2026-05-18 21:21             ` Yang Shi
2026-05-19 11:07               ` Barry Song
2026-05-19 13:34                 ` Lorenzo Stoakes
2026-05-19 18:50                 ` Yang Shi
2026-05-19 20:53                   ` Yang Shi
2026-05-19 13:12               ` Lorenzo Stoakes
2026-05-19 13:39                 ` Lorenzo Stoakes
2026-05-19 18:41                   ` Yang Shi
2026-05-19 21:02                     ` Yang Shi
2026-05-20  8:11                       ` Lorenzo Stoakes
2026-05-01 15:52   ` Lorenzo Stoakes
2026-05-01 16:06     ` Matthew Wilcox
2026-05-01 17:09       ` Lorenzo Stoakes
2026-05-01 17:59     ` Barry Song
