public inbox for linux-s390@vger.kernel.org
* [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
@ 2026-04-30  4:04 Barry Song (Xiaomi)
  2026-04-30  4:04 ` [PATCH v2 1/5] mm/filemap: Retry fault by VMA lock if the lock was released for I/O Barry Song (Xiaomi)
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-30  4:04 UTC (permalink / raw)
  To: akpm, linux-mm, willy
  Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, jack, pfalcato,
	wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
	kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song (Xiaomi)

Oven observed that most mmap_lock contention and priority inversion
came from page fault retries after waiting for I/O completion.
Oven subsequently proposed the following idea:

There is no need to always fall back to mmap_lock when the per-VMA lock
is released only to wait for the page cache to become ready. On a page
fault retry, the per-VMA lock can still be reused.

We believe the same should also apply to anonymous folios. However, there
is a case where I/O has completed but we fail to acquire the folio lock
because a concurrent thread may be installing PTEs for the folio. This
is expected to be short-lived, so retrying the page fault is unnecessary.

This patchset handles two cases:

(1) If we need to wait for I/O completion, we still drop the per-VMA lock, as
current page fault handling already does. Holding it for too long may introduce
various priority inversion issues on mobile devices. After I/O completes, we
retry the page fault with the per-VMA lock, rather than falling back to
mmap_lock.

(2) If I/O has already completed and the folio is up to date, the wait is
likely due to a concurrent PTE installation. In this case, we keep the
per-VMA lock and avoid retrying the page fault.

With (1), the dramatically reduced mmap_lock contention leads to a
significant improvement in Douyin performance. Oven’s data is shown
below.

Douyin (the Chinese version of TikTok) warm start on a smartphone with
8GB RAM.

== mmap_lock Acquisitions And Wait Time ==

Metric                    Before (Avg)    After (Avg)    Change
------------------------------------------------------------------------
Read Lock Count           20,010          5,719          -71.42%
Read Total Wait (us)      10,695,877      408,436        -96.18%
Read Avg Wait (us)        534.00          71.00          -86.70%
Write Lock Count          838             909            +8.47%
Write Total Wait (us)     501,293         97,633         -80.52%
Write Avg Wait (us)       598.00          107.00         -82.11%


== Read Lock Waiting Time Distribution of mmap_lock ==

Range (us)                 Before (Avg)    After (Avg)    Change
------------------------------------------------------------------------
[0, 1)                     9,927           4,286          -56.82%
[1, 10)                    9,179           1,327          -85.54%
[10, 100)                  191             88             -53.93%
[100, 1000)                57              6              -89.47%
[1000, 10000)              328             9              -97.26%
[10000, 100000)            328             6              -98.17%
[100000, 1000000)          0               0              N/A
[1000000, +)               0               0              N/A

== Write Lock Waiting Time Distribution of mmap_lock ==

Range (us)                 Before (Avg)    After (Avg)    Change
------------------------------------------------------------------------
[0, 1)                     250             300            +20.00%
[1, 10)                    483             556            +15.11%
[10, 100)                  52              41             -21.15%
[100, 1000)                12              5              -58.33%
[1000, 10000)              22              4              -81.82%
[10000, 100000)            16              1              -93.75%
[100000, 1000000)          0               0              N/A
[1000000, +)               0               0              N/A

After the optimization, the number of read lock acquisitions is 
significantly reduced, and both lock waiting time and tail latency are 
dramatically improved.

Kunwu and Lian also developed a model to capture the situation described
by Matthew [1], where a memcg with limited memory may fail to make
progress. This happens because after I/O is initiated on the first page
fault, the folios may be reclaimed by the time of the retry, leaving the
workload with little or no forward progress.

The stress setup built by Kunwu and Lian is as follows:
* 256-core x86 system
* 500 threads continuously faulting on 16MB files

The model was running within a memcg with limited memory,
as shown below:

systemd-run --scope -p MemoryHigh=1G -p MemoryMax=1.2G -p MemorySwapMax=0 \
--unit=mmap-thrash-$$ ./mmap_lock & \
TEST_PID=$!

The reproducer code is shown below:

 #include <stdio.h>
 #include <stdint.h>
 #include <string.h>
 #include <unistd.h>
 #include <fcntl.h>
 #include <pthread.h>
 #include <stdatomic.h>
 #include <sys/mman.h>

 #define THREADS 500
 #define FILE_SIZE (16 * 1024 * 1024) /* 16MB */
 #define RUN_SECONDS 600

 static _Atomic int g_stop = 0;
 
 struct worker_arg { 
         long id; 
         uint64_t *counts; 
 }; 
 
 void *worker(void *arg) 
 { 
         struct worker_arg *wa = (struct worker_arg *)arg; 
         long id = wa->id; 
         char path[64]; 
         uint64_t local_rounds = 0; 
 
         snprintf(path, sizeof(path), "./test_file_%d_%ld.dat", 
                  getpid(), id); 
         int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666); 
         if (fd < 0) return NULL; 
         if (ftruncate(fd, FILE_SIZE) < 0) { 
                 close(fd); return NULL; 
         } 
 
         while (!atomic_load_explicit(&g_stop, memory_order_relaxed)) { 
                 char *f_map = mmap(NULL, FILE_SIZE, PROT_READ, 
                                    MAP_SHARED, fd, 0); 
                 if (f_map != MAP_FAILED) { 
                         /* Pure page cache thrashing */ 
                         for (int i = 0; i < FILE_SIZE; i += 4096) { 
                                 volatile unsigned char c = 
                                         (unsigned char)f_map[i]; 
                                 (void)c; 
                         } 
                         munmap(f_map, FILE_SIZE); 
                         local_rounds++; 
                 } 
         } 
         wa->counts[id] = local_rounds; 
         close(fd); 
         unlink(path); 
         return NULL; 
 } 
 
 int main(void) 
 { 
         printf("Pure File Thrashing Started. PID: %d\n", getpid()); 
         pthread_t t[THREADS]; 
         uint64_t local_counts[THREADS]; 
         memset(local_counts, 0, sizeof(local_counts)); 
         struct worker_arg args[THREADS]; 
 
         for (long i = 0; i < THREADS; i++) { 
                 args[i].id = i; 
                 args[i].counts = local_counts; 
                 pthread_create(&t[i], NULL, worker, &args[i]); 
         } 
 
         sleep(RUN_SECONDS); 
         atomic_store_explicit(&g_stop, 1, memory_order_relaxed); 
 
         for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL); 
 
         uint64_t total = 0; 
         for (int i = 0; i < THREADS; i++) total += local_counts[i]; 
 
         printf("Total rounds     : %llu\n", (unsigned long long)total); 
         printf("Throughput       : %.2f rounds/sec\n", 
                (double)total / RUN_SECONDS); 
         return 0; 
 }

They also added temporary counters to the page fault retry paths [2]:
- RETRY_IO_MISS   : folio not present after I/O completion
- RETRY_MMAP_DROP : retry fallback due to waiting for I/O

Their results are as follows:

| Case                | Total Rounds | Throughput | Miss/Drop(%) | RETRY_MMAP_DROP | RETRY_IO_MISS |
| ------------------- | ------------ | ---------- | ------------ | --------------- | ------------- |
| Baseline (Run 1)    | 22,711       | 37.85 /s   | 45.04        | 970,078         | 436,956       |
| Baseline (Run 2)    | 23,530       | 39.22 /s   | 44.96        | 972,043         | 437,077       |
| With Series (Run A) | 54,428       | 90.71 /s   | 1.69         | 1,204,124       | 20,398        |
| With Series (Run B) | 35,949       | 59.91 /s   | 0.03         | 327,023         | 99            |

Without this series, nearly half of the retries fail to observe completed
I/O results, leading to significant CPU and I/O waste. With the finer-
grained VMA lock, faulting threads avoid the heavily contended mmap_lock
during retries and are therefore able to complete the page fault.

With (2), there is a clear improvement in swap-in bandwidth in a model
with five threads issuing MADV_PAGEOUT-based swap-outs and five threads
performing swap-ins on a 100MB anonymous mmap VMA.

 #include <stdio.h>
 #include <stdlib.h>
 #include <stdint.h>
 #include <string.h>
 #include <unistd.h>
 #include <pthread.h>
 #include <stdatomic.h>
 #include <sys/mman.h>

 #define SIZE (100 * 1024 * 1024)
 #define PAGE_SIZE 4096
 #define WRITER_THREADS 5
 #define READER_THREADS 5
 #define RUN_SECONDS 30
 
 static uint8_t *buf;
 static atomic_ulong pageout_rounds = 0;
 static atomic_ulong swapin_rounds = 0;
 static atomic_int stop_flag = 0;
 
 static void *pageout_thread(void *arg)
 {
     (void)arg;
     while (!atomic_load(&stop_flag)) {
         if (madvise(buf, SIZE, MADV_PAGEOUT) == 0) {
             atomic_fetch_add(&pageout_rounds, 1);
         }
     }
     return NULL;
 }
 
 static void *reader_thread(void *arg)
 {
     (void)arg;
     volatile uint64_t sum = 0;
 
     while (!atomic_load(&stop_flag)) {
         for (size_t i = 0; i < SIZE; i += PAGE_SIZE) {
             sum += buf[i];
         }
         /* One full pass over 100MB, counted as one swap-in round (approximate) */
         atomic_fetch_add(&swapin_rounds, 1);
     }
     return NULL;
 }
 
 int main(void)
 {
     pthread_t writers[WRITER_THREADS];
     pthread_t readers[READER_THREADS];
 
     buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
     if (buf == MAP_FAILED) {
         exit(EXIT_FAILURE);
     }
     memset(buf, 0, SIZE);
 
     for (int i = 0; i < WRITER_THREADS; i++) {
         if (pthread_create(&writers[i], NULL, pageout_thread, NULL) != 0) {
             perror("pthread_create");
             exit(EXIT_FAILURE);
         }
     }
     for (int i = 0; i < READER_THREADS; i++) {
         if (pthread_create(&readers[i], NULL, reader_thread, NULL) != 0) {
             perror("pthread_create");
             exit(EXIT_FAILURE);
         }
     }
 
     sleep(RUN_SECONDS);
     atomic_store(&stop_flag, 1);
     for (int i = 0; i < WRITER_THREADS; i++)
         pthread_join(writers[i], NULL);
     for (int i = 0; i < READER_THREADS; i++)
         pthread_join(readers[i], NULL);
 
     printf("=== Result (30s) ===\n");
     printf("Pageout rounds: %lu\n", atomic_load(&pageout_rounds));
     printf("Swap-in rounds (approx): %lu\n", atomic_load(&swapin_rounds));
     munmap(buf, SIZE);
     return 0;
 }

W/o patches:
=== Result (30s) ===
Pageout rounds: 1324847
Swap-in rounds (approx): 874

W/patches:
=== Result (30s) ===
Pageout rounds: 1330550
Swap-in rounds (approx): 1017

[1] https://lore.kernel.org/linux-mm/aSip2mWX13sqPW_l@casper.infradead.org/
[2] https://github.com/lianux-mm/ioretry_test/

-v2:
  * collect tags from Pedro, Kunwu and Lian, thanks!
  * handle case (2), for uptodate folios, don't retry PF
-RFC:
  https://lore.kernel.org/linux-mm/20251127011438.6918-1-21cnbao@gmail.com/

Barry Song (Xiaomi) (4):
  mm/swapin: Retry swapin by VMA lock if the lock was released for I/O
  mm: Move folio_lock_or_retry() and drop __folio_lock_or_retry()
  mm: Don't retry page fault if folio is uptodate during swap-in
  mm/filemap: Avoid retrying page faults on uptodate folios in filemap
    faults

Oven Liyang (1):
  mm/filemap: Retry fault by VMA lock if the lock was released for I/O

 arch/arm/mm/fault.c       |  5 +++
 arch/arm64/mm/fault.c     |  5 +++
 arch/loongarch/mm/fault.c |  4 +++
 arch/powerpc/mm/fault.c   |  5 ++-
 arch/riscv/mm/fault.c     |  4 +++
 arch/s390/mm/fault.c      |  4 +++
 arch/x86/mm/fault.c       |  4 +++
 include/linux/mm_types.h  |  9 ++---
 include/linux/pagemap.h   | 17 ----------
 mm/filemap.c              | 57 ++++++-------------------------
 mm/memory.c               | 70 +++++++++++++++++++++++++++++++++++++--
 11 files changed, 114 insertions(+), 70 deletions(-)

-- 
* The work began during my collaboration with OPPO and has continued through
my current collaboration with Xiaomi. Although the OPPO collaboration has
ended, OPPO still deserves more than half of the credit for this series,
if any credit is to be assigned.

2.39.3 (Apple Git-146)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 1/5] mm/filemap: Retry fault by VMA lock if the lock was released for I/O
  2026-04-30  4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
@ 2026-04-30  4:04 ` Barry Song (Xiaomi)
  2026-04-30  4:04 ` [PATCH v2 2/5] mm/swapin: Retry swapin " Barry Song (Xiaomi)
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-30  4:04 UTC (permalink / raw)
  To: akpm, linux-mm, willy
  Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, jack, pfalcato,
	wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
	kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song

From: Oven Liyang <liyangouwen1@oppo.com>

If the current page fault is using the per-VMA lock, and we only released
the lock to wait for I/O completion (e.g., using folio_lock()), then when
the fault is retried after the I/O completes, it should still qualify for
the per-VMA-lock path.

Acked-by: Pedro Falcato <pfalcato@suse.de>
Tested-by: Wang Lian <wanglian@kylinos.cn>
Tested-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: Wang Lian <lianux.mm@gmail.com>
Reviewed-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Oven Liyang <liyangouwen1@oppo.com>
Co-developed-by: Barry Song <baohua@kernel.org>
Signed-off-by: Barry Song <baohua@kernel.org>
---
 arch/arm/mm/fault.c       | 5 +++++
 arch/arm64/mm/fault.c     | 5 +++++
 arch/loongarch/mm/fault.c | 4 ++++
 arch/powerpc/mm/fault.c   | 5 ++++-
 arch/riscv/mm/fault.c     | 4 ++++
 arch/s390/mm/fault.c      | 4 ++++
 arch/x86/mm/fault.c       | 4 ++++
 include/linux/mm_types.h  | 9 +++++----
 mm/filemap.c              | 5 ++++-
 9 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index e62cc4be5adf..5971e02845f7 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -391,6 +391,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, addr);
 	if (!vma)
 		goto lock_mmap;
@@ -420,6 +421,10 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 			goto no_context;
 		return 0;
 	}
+
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 retry:
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 739800835920..d0362a3e11b7 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -673,6 +673,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	if (!(mm_flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, addr);
 	if (!vma)
 		goto lock_mmap;
@@ -719,6 +720,10 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 			goto no_context;
 		return 0;
 	}
+
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 retry:
diff --git a/arch/loongarch/mm/fault.c b/arch/loongarch/mm/fault.c
index 2c93d33356e5..738f495560c0 100644
--- a/arch/loongarch/mm/fault.c
+++ b/arch/loongarch/mm/fault.c
@@ -219,6 +219,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs,
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, address);
 	if (!vma)
 		goto lock_mmap;
@@ -265,6 +266,9 @@ static void __kprobes __do_page_fault(struct pt_regs *regs,
 			no_context(regs, write, address);
 		return;
 	}
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 retry:
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 806c74e0d5ab..cb7ffc20c760 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -487,6 +487,7 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, address);
 	if (!vma)
 		goto lock_mmap;
@@ -516,7 +517,9 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
 
 	if (fault_signal_pending(fault, regs))
 		return user_mode(regs) ? 0 : SIGBUS;
-
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 	/* When running in the kernel we expect faults to occur only to
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 04ed6f8acae4..b94cf57c2b9a 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -347,6 +347,7 @@ void handle_page_fault(struct pt_regs *regs)
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, addr);
 	if (!vma)
 		goto lock_mmap;
@@ -376,6 +377,9 @@ void handle_page_fault(struct pt_regs *regs)
 			no_context(regs, addr);
 		return;
 	}
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 retry:
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 191cc53caead..e0576e629f65 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -294,6 +294,7 @@ static void do_exception(struct pt_regs *regs, int access)
 		flags |= FAULT_FLAG_WRITE;
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
+retry_vma:
 	vma = lock_vma_under_rcu(mm, address);
 	if (!vma)
 		goto lock_mmap;
@@ -318,6 +319,9 @@ static void do_exception(struct pt_regs *regs, int access)
 			handle_fault_error_nolock(regs, 0);
 		return;
 	}
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 retry:
 	vma = lock_mm_and_find_vma(mm, address, regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f0e77e084482..0589fc693eea 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1322,6 +1322,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, address);
 	if (!vma)
 		goto lock_mmap;
@@ -1351,6 +1352,9 @@ void do_user_addr_fault(struct pt_regs *regs,
 						 ARCH_DEFAULT_PKEY);
 		return;
 	}
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 retry:
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a308e2c23b82..5907200ea587 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1678,10 +1678,11 @@ enum vm_fault_reason {
 	VM_FAULT_NOPAGE         = (__force vm_fault_t)0x000100,
 	VM_FAULT_LOCKED         = (__force vm_fault_t)0x000200,
 	VM_FAULT_RETRY          = (__force vm_fault_t)0x000400,
-	VM_FAULT_FALLBACK       = (__force vm_fault_t)0x000800,
-	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
-	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
-	VM_FAULT_COMPLETED      = (__force vm_fault_t)0x004000,
+	VM_FAULT_RETRY_VMA      = (__force vm_fault_t)0x000800,
+	VM_FAULT_FALLBACK       = (__force vm_fault_t)0x001000,
+	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x002000,
+	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x004000,
+	VM_FAULT_COMPLETED      = (__force vm_fault_t)0x008000,
 	VM_FAULT_HINDEX_MASK    = (__force vm_fault_t)0x0f0000,
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index ab34cab2416a..a045b771e8de 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3525,6 +3525,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	struct folio *folio;
 	vm_fault_t ret = 0;
 	bool mapping_locked = false;
+	bool retry_by_vma_lock = false;
 
 	max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
 	if (unlikely(index >= max_idx))
@@ -3621,6 +3622,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	 */
 	if (fpin) {
 		folio_unlock(folio);
+		if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+			retry_by_vma_lock = true;
 		goto out_retry;
 	}
 	if (mapping_locked)
@@ -3671,7 +3674,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		filemap_invalidate_unlock_shared(mapping);
 	if (fpin)
 		fput(fpin);
-	return ret | VM_FAULT_RETRY;
+	return ret | VM_FAULT_RETRY | (retry_by_vma_lock ? VM_FAULT_RETRY_VMA : 0);
 }
 EXPORT_SYMBOL(filemap_fault);
 
-- 
2.39.3 (Apple Git-146)



* [PATCH v2 2/5] mm/swapin: Retry swapin by VMA lock if the lock was released for I/O
  2026-04-30  4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
  2026-04-30  4:04 ` [PATCH v2 1/5] mm/filemap: Retry fault by VMA lock if the lock was released for I/O Barry Song (Xiaomi)
@ 2026-04-30  4:04 ` Barry Song (Xiaomi)
  2026-04-30  4:04 ` [PATCH v2 3/5] mm: Move folio_lock_or_retry() and drop __folio_lock_or_retry() Barry Song (Xiaomi)
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-30  4:04 UTC (permalink / raw)
  To: akpm, linux-mm, willy
  Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, jack, pfalcato,
	wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
	kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song (Xiaomi)

If the current do_swap_page() took the per-VMA lock and we dropped it only
to wait for I/O completion (e.g., via folio_wait_locked()), then when
do_swap_page() is retried after the I/O completes, it should still qualify
for the per-VMA-lock path.

Tested-by: Wang Lian <wanglian@kylinos.cn>
Tested-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: Wang Lian <lianux.mm@gmail.com>
Reviewed-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
 mm/memory.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 199214f8de08..00ee1599d637 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4791,6 +4791,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	unsigned long page_idx;
 	unsigned long address;
 	pte_t *ptep;
+	bool retry_by_vma_lock = false;
 
 	if (!pte_unmap_same(vmf))
 		goto out;
@@ -4896,8 +4897,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 	swapcache = folio;
 	ret |= folio_lock_or_retry(folio, vmf);
-	if (ret & VM_FAULT_RETRY)
+	if (ret & VM_FAULT_RETRY) {
+		if (fault_flag_allow_retry_first(vmf->flags) &&
+		    !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT) &&
+		    (vmf->flags & FAULT_FLAG_VMA_LOCK))
+			retry_by_vma_lock = true;
 		goto out_release;
+	}
 
 	page = folio_file_page(folio, swp_offset(entry));
 	/*
@@ -5182,7 +5188,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 	if (si)
 		put_swap_device(si);
-	return ret;
+	return ret | (retry_by_vma_lock ? VM_FAULT_RETRY_VMA : 0);
 }
 
 static bool pte_range_none(pte_t *pte, int nr_pages)
-- 
2.39.3 (Apple Git-146)



* [PATCH v2 3/5] mm: Move folio_lock_or_retry() and drop __folio_lock_or_retry()
  2026-04-30  4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
  2026-04-30  4:04 ` [PATCH v2 1/5] mm/filemap: Retry fault by VMA lock if the lock was released for I/O Barry Song (Xiaomi)
  2026-04-30  4:04 ` [PATCH v2 2/5] mm/swapin: Retry swapin " Barry Song (Xiaomi)
@ 2026-04-30  4:04 ` Barry Song (Xiaomi)
  2026-04-30  4:04 ` [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in Barry Song (Xiaomi)
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-30  4:04 UTC (permalink / raw)
  To: akpm, linux-mm, willy
  Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, jack, pfalcato,
	wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
	kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song (Xiaomi)

folio_lock_or_retry() is effectively only used in mm/memory.c,
not in the filemap code. Move it there and make it static.

The helper __folio_lock_or_retry() can be folded into
folio_lock_or_retry(), allowing it to be removed.

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
 include/linux/pagemap.h | 17 -------------
 mm/filemap.c            | 45 ----------------------------------
 mm/memory.c             | 53 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 62 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 1f50991b43e3..500ab783bf70 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1101,7 +1101,6 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
 
 void __folio_lock(struct folio *folio);
 int __folio_lock_killable(struct folio *folio);
-vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf);
 void unlock_page(struct page *page);
 void folio_unlock(struct folio *folio);
 
@@ -1198,22 +1197,6 @@ static inline int folio_lock_killable(struct folio *folio)
 	return 0;
 }
 
-/*
- * folio_lock_or_retry - Lock the folio, unless this would block and the
- * caller indicated that it can handle a retry.
- *
- * Return value and mmap_lock implications depend on flags; see
- * __folio_lock_or_retry().
- */
-static inline vm_fault_t folio_lock_or_retry(struct folio *folio,
-					     struct vm_fault *vmf)
-{
-	might_sleep();
-	if (!folio_trylock(folio))
-		return __folio_lock_or_retry(folio, vmf);
-	return 0;
-}
-
 /*
  * This is exported only for folio_wait_locked/folio_wait_writeback, etc.,
  * and should not be used directly.
diff --git a/mm/filemap.c b/mm/filemap.c
index a045b771e8de..b532d6cbafc8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1740,51 +1740,6 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
 	return ret;
 }
 
-/*
- * Return values:
- * 0 - folio is locked.
- * non-zero - folio is not locked.
- *     mmap_lock or per-VMA lock has been released (mmap_read_unlock() or
- *     vma_end_read()), unless flags had both FAULT_FLAG_ALLOW_RETRY and
- *     FAULT_FLAG_RETRY_NOWAIT set, in which case the lock is still held.
- *
- * If neither ALLOW_RETRY nor KILLABLE are set, will always return 0
- * with the folio locked and the mmap_lock/per-VMA lock is left unperturbed.
- */
-vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
-{
-	unsigned int flags = vmf->flags;
-
-	if (fault_flag_allow_retry_first(flags)) {
-		/*
-		 * CAUTION! In this case, mmap_lock/per-VMA lock is not
-		 * released even though returning VM_FAULT_RETRY.
-		 */
-		if (flags & FAULT_FLAG_RETRY_NOWAIT)
-			return VM_FAULT_RETRY;
-
-		release_fault_lock(vmf);
-		if (flags & FAULT_FLAG_KILLABLE)
-			folio_wait_locked_killable(folio);
-		else
-			folio_wait_locked(folio);
-		return VM_FAULT_RETRY;
-	}
-	if (flags & FAULT_FLAG_KILLABLE) {
-		bool ret;
-
-		ret = __folio_lock_killable(folio);
-		if (ret) {
-			release_fault_lock(vmf);
-			return VM_FAULT_RETRY;
-		}
-	} else {
-		__folio_lock(folio);
-	}
-
-	return 0;
-}
-
 /**
  * page_cache_next_miss() - Find the next gap in the page cache.
  * @mapping: Mapping.
diff --git a/mm/memory.c b/mm/memory.c
index 00ee1599d637..0c740ca363cc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4442,6 +4442,59 @@ void unmap_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+/*
+ * folio_lock_or_retry - Lock the folio, unless this would block and the
+ * caller indicated that it can handle a retry.
+ *
+ * Return values:
+ * 0 - folio is locked.
+ * non-zero - folio is not locked.
+ *     mmap_lock or per-VMA lock has been released (mmap_read_unlock() or
+ *     vma_end_read()), unless flags had both FAULT_FLAG_ALLOW_RETRY and
+ *     FAULT_FLAG_RETRY_NOWAIT set, in which case the lock is still held.
+ *
+ * If neither ALLOW_RETRY nor KILLABLE are set, will always return 0
+ * with the folio locked and the mmap_lock/per-VMA lock is left unperturbed.
+ */
+static inline vm_fault_t folio_lock_or_retry(struct folio *folio,
+					     struct vm_fault *vmf)
+{
+	unsigned int flags = vmf->flags;
+
+	might_sleep();
+	if (folio_trylock(folio))
+		return 0;
+
+	if (fault_flag_allow_retry_first(flags)) {
+		/*
+		 * CAUTION! In this case, mmap_lock/per-VMA lock is not
+		 * released even though returning VM_FAULT_RETRY.
+		 */
+		if (flags & FAULT_FLAG_RETRY_NOWAIT)
+			return VM_FAULT_RETRY;
+
+		release_fault_lock(vmf);
+		if (flags & FAULT_FLAG_KILLABLE)
+			folio_wait_locked_killable(folio);
+		else
+			folio_wait_locked(folio);
+		return VM_FAULT_RETRY;
+	}
+	if (flags & FAULT_FLAG_KILLABLE) {
+		bool ret;
+
+		ret = __folio_lock_killable(folio);
+		if (ret) {
+			release_fault_lock(vmf);
+			return VM_FAULT_RETRY;
+		}
+	} else {
+		__folio_lock(folio);
+	}
+
+	return 0;
+}
+
 /*
  * Restore a potential device exclusive pte to a working pte entry
  */
-- 
2.39.3 (Apple Git-146)



* [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in
  2026-04-30  4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
                   ` (2 preceding siblings ...)
  2026-04-30  4:04 ` [PATCH v2 3/5] mm: Move folio_lock_or_retry() and drop __folio_lock_or_retry() Barry Song (Xiaomi)
@ 2026-04-30  4:04 ` Barry Song (Xiaomi)
  2026-04-30 12:35   ` Matthew Wilcox
  2026-04-30  4:04 ` [PATCH v2 5/5] mm/filemap: Avoid retrying page faults on uptodate folios in filemap faults Barry Song (Xiaomi)
  2026-04-30 12:37 ` [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Matthew Wilcox
  5 siblings, 1 reply; 25+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-30  4:04 UTC (permalink / raw)
  To: akpm, linux-mm, willy
  Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, jack, pfalcato,
	wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
	kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song (Xiaomi)

If we are waiting for long I/O to complete, it makes sense to
avoid holding locks for too long. However, if the folio is
uptodate, we are likely only waiting for a concurrent PTE
update to finish. Retrying the entire page fault seems
excessive.

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
 mm/memory.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 0c740ca363cc..a2e4f2d87ec8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4949,6 +4949,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 
 	swapcache = folio;
+	/*
+	 * If the folio is uptodate, we are likely only waiting for
+	 * another concurrent PTE mapping to complete, which should
+	 * be brief. No need to drop the lock and retry the fault.
+	 */
+	if (folio_test_uptodate(folio))
+		vmf->flags &= ~FAULT_FLAG_ALLOW_RETRY;
 	ret |= folio_lock_or_retry(folio, vmf);
 	if (ret & VM_FAULT_RETRY) {
 		if (fault_flag_allow_retry_first(vmf->flags) &&
-- 
2.39.3 (Apple Git-146)


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 5/5] mm/filemap: Avoid retrying page faults on uptodate folios in filemap faults
  2026-04-30  4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
                   ` (3 preceding siblings ...)
  2026-04-30  4:04 ` [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in Barry Song (Xiaomi)
@ 2026-04-30  4:04 ` Barry Song (Xiaomi)
  2026-04-30 12:37 ` [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Matthew Wilcox
  5 siblings, 0 replies; 25+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-30  4:04 UTC (permalink / raw)
  To: akpm, linux-mm, willy
  Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, jack, pfalcato,
	wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
	kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song (Xiaomi)

For uptodate folios, we are not waiting on I/O. We should
be able to acquire the folio lock shortly, so there is no
need to drop the per-VMA lock and perform a full page-fault
retry.

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
 mm/filemap.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index b532d6cbafc8..0d2f6af5d0fe 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3533,6 +3533,13 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		}
 	}
 
+	/*
+	 * If the folio is uptodate, we are likely only waiting for
+	 * another concurrent PTE mapping to complete, which should
+	 * be brief. No need to drop the lock and retry the fault.
+	 */
+	if (folio_test_uptodate(folio))
+		vmf->flags &= ~FAULT_FLAG_ALLOW_RETRY;
 	if (!lock_folio_maybe_drop_mmap(vmf, folio, &fpin))
 		goto out_retry;
 
-- 
2.39.3 (Apple Git-146)


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in
  2026-04-30  4:04 ` [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in Barry Song (Xiaomi)
@ 2026-04-30 12:35   ` Matthew Wilcox
  2026-05-01 16:11     ` Matthew Wilcox
  0 siblings, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2026-04-30 12:35 UTC (permalink / raw)
  To: Barry Song (Xiaomi)
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Thu, Apr 30, 2026 at 12:04:26PM +0800, Barry Song (Xiaomi) wrote:
> If we are waiting for long I/O to complete, it makes sense to
> avoid holding locks for too long. However, if the folio is
> uptodate, we are likely only waiting for a concurrent PTE
> update to finish. Retrying the entire page fault seems
> excessive.

I think the idea is good, but the implementation is misplaced.
The check for folio_test_uptodate() should be inside folio_lock_or_retry()
rather than tampering with FAULT_FLAG_ALLOW_RETRY in its caller.

Similarly for your next patch.

> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
>  mm/memory.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 0c740ca363cc..a2e4f2d87ec8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4949,6 +4949,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	}
>  
>  	swapcache = folio;
> +	/*
> +	 * If the folio is uptodate, we are likely only waiting for
> +	 * another concurrent PTE mapping to complete, which should
> +	 * be brief. No need to drop the lock and retry the fault.
> +	 */
> +	if (folio_test_uptodate(folio))
> +		vmf->flags &= ~FAULT_FLAG_ALLOW_RETRY;
>  	ret |= folio_lock_or_retry(folio, vmf);
>  	if (ret & VM_FAULT_RETRY) {
>  		if (fault_flag_allow_retry_first(vmf->flags) &&
> -- 
> 2.39.3 (Apple Git-146)
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-04-30  4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
                   ` (4 preceding siblings ...)
  2026-04-30  4:04 ` [PATCH v2 5/5] mm/filemap: Avoid retrying page faults on uptodate folios in filemap faults Barry Song (Xiaomi)
@ 2026-04-30 12:37 ` Matthew Wilcox
  2026-04-30 22:49   ` Barry Song
  2026-05-01 15:52   ` Lorenzo Stoakes
  5 siblings, 2 replies; 25+ messages in thread
From: Matthew Wilcox @ 2026-04-30 12:37 UTC (permalink / raw)
  To: Barry Song (Xiaomi)
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Thu, Apr 30, 2026 at 12:04:22PM +0800, Barry Song (Xiaomi) wrote:
> (1) If we need to wait for I/O completion, we still drop the per-VMA lock, as
> current page fault handling already does. Holding it for too long may introduce
> various priority inversion issues on mobile devices. After I/O completes, we
> retry the page fault with the per-VMA lock, rather than falling back to
> mmap_lock.

You're going to have to do better than that.  You know I hate the
additional complexity you're adding.  You need to explain why my idea of
ripping out all the complexity now that we have per-VMA locks doesn't
work.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-04-30 12:37 ` [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Matthew Wilcox
@ 2026-04-30 22:49   ` Barry Song
  2026-05-01 14:56     ` Matthew Wilcox
  2026-05-01 15:52   ` Lorenzo Stoakes
  1 sibling, 1 reply; 25+ messages in thread
From: Barry Song @ 2026-04-30 22:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Thu, Apr 30, 2026 at 8:37 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Apr 30, 2026 at 12:04:22PM +0800, Barry Song (Xiaomi) wrote:
> > (1) If we need to wait for I/O completion, we still drop the per-VMA lock, as
> > current page fault handling already does. Holding it for too long may introduce
> > various priority inversion issues on mobile devices. After I/O completes, we
> > retry the page fault with the per-VMA lock, rather than falling back to
> > mmap_lock.
>
> You're going to have to do better than that.  You know I hate the
> additional complexity you're adding.  You need to explain why my idea of
> ripping out all the complexity now that we have per-VMA locks doesn't
> work.

Yep, I know you don’t like the added complexity, but I would rather prioritize
user experience over simplicity. Let me try to explain in more detail.

1. There is no deterministic latency for I/O completion. It depends on
both the hardware and the software stack (bio/request queues and the
block scheduler). Sometimes the latency is short; at other times it can
be quite long. In such cases, a high-priority thread performing operations
such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
for an unpredictable amount of time. For example, if low-priority tasks
trigger page faults and issue low-priority I/O, a high-priority task
requiring the write lock may end up waiting for an unknown amount of time,
depending on the block layer and filesystem behavior.

As a result, high-priority tasks are exposed to unpredictable I/O latency
introduced by many low-priority tasks that may generate a large number of
page faults.

On Android, latency in certain tasks can significantly affect user experience,
such as interactive threads. Priority inversion is particularly problematic and
should be avoided, especially since we have no clear bound on how long we may
have to wait for I/O from other tasks.

Meanwhile, priority inversion can propagate through a long chain: a writer
waiting on I/O from multiple concurrent page faults may end up blocking other
writers and readers as well. A long-waiting writer can also amplify
mmap_lock contention, which we still rely on in many cases.

2. VMA sizes can be highly uneven: some VMAs may be very large while others are
small. We used to have many reasons to release mmap_lock when we did not have a
per-VMA lock. Since VMA sizes are not uniform, those same considerations may
still apply to the per-VMA lock when a small number of VMAs account for most
of a process’s address space. I recall that Suren also mentioned this[1].

So I would prefer that we hold only the per-VMA lock and avoid retrying the
page fault when we are reasonably sure that I/O has already completed and we
are only waiting for short-lived conditions. Uncertainties in the block layer,
filesystem, and GC behavior, as well as latency-induced priority inversion
chains and potentially amplified mmap_lock contention, can significantly hurt
Android user experience.

[1] https://lore.kernel.org/linux-mm/CAJuCfpFVQJtvbj5fV2fmm4APhNZDL1qPg-YExw7gO1pmngC3Rw@mail.gmail.com/

Thanks
Barry

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-04-30 22:49   ` Barry Song
@ 2026-05-01 14:56     ` Matthew Wilcox
  2026-05-01 17:44       ` Barry Song
  0 siblings, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2026-05-01 14:56 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> 1. There is no deterministic latency for I/O completion. It depends on
> both the hardware and the software stack (bio/request queues and the
> block scheduler). Sometimes the latency is short; at other times it can
> be quite long. In such cases, a high-priority thread performing operations
> such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> for an unpredictable amount of time.

But does that actually happen?  I find it hard to believe that thread A
unmaps a VMA while thread B is in the middle of taking a page fault in
that same VMA.  mprotect() and madvise() are more likely to happen, but
it still seems really unlikely to me.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-04-30 12:37 ` [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Matthew Wilcox
  2026-04-30 22:49   ` Barry Song
@ 2026-05-01 15:52   ` Lorenzo Stoakes
  2026-05-01 16:06     ` Matthew Wilcox
  2026-05-01 17:59     ` Barry Song
  1 sibling, 2 replies; 25+ messages in thread
From: Lorenzo Stoakes @ 2026-05-01 15:52 UTC (permalink / raw)
  To: Barry Song (Xiaomi)
  Cc: Matthew Wilcox, akpm, linux-mm, david, liam, vbabka, rppt, surenb,
	mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Thu, Apr 30, 2026 at 01:37:14PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 30, 2026 at 12:04:22PM +0800, Barry Song (Xiaomi) wrote:
> > (1) If we need to wait for I/O completion, we still drop the per-VMA lock, as
> > current page fault handling already does. Holding it for too long may introduce
> > various priority inversion issues on mobile devices. After I/O completes, we
> > retry the page fault with the per-VMA lock, rather than falling back to
> > mmap_lock.
>
> You're going to have to do better than that.  You know I hate the
> additional complexity you're adding.  You need to explain why my idea of
> ripping out all the complexity now that we have per-VMA locks doesn't
> work.

After a brief eyeball I share Matthew's assessment, I really don't like this
series, it's piling on complexity for what seem like niche cases.

We already have enough weirdness in fault code honestly.

Let's maybe discuss at LSF if you're attending?

I will try to have a more thorough look through when I get a chance.

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 15:52   ` Lorenzo Stoakes
@ 2026-05-01 16:06     ` Matthew Wilcox
  2026-05-01 17:09       ` Lorenzo Stoakes
  2026-05-01 17:59     ` Barry Song
  1 sibling, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2026-05-01 16:06 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Barry Song (Xiaomi), akpm, linux-mm, david, liam, vbabka, rppt,
	surenb, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Fri, May 01, 2026 at 04:52:12PM +0100, Lorenzo Stoakes wrote:
> After a brief eyeball I share Matthew's assessment, I really don't like this
> series, it's piling on complexity for what seem like niche cases.

I don't think they're niche cases ... I think it's a real problem.
While our current code performs better for this workload than the
pre-vma-lock code did, it doesn't perform as well as it could.

> We already have enough weirdness in fault code honestly.
> 
> Let's maybe discuss at LSF if you're attending?

Not only is he attending, there's a topic scheduled (currently 10:30 on
Wednesday).

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in
  2026-04-30 12:35   ` Matthew Wilcox
@ 2026-05-01 16:11     ` Matthew Wilcox
  0 siblings, 0 replies; 25+ messages in thread
From: Matthew Wilcox @ 2026-05-01 16:11 UTC (permalink / raw)
  To: Barry Song (Xiaomi)
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Thu, Apr 30, 2026 at 01:35:30PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 30, 2026 at 12:04:26PM +0800, Barry Song (Xiaomi) wrote:
> > If we are waiting for long I/O to complete, it makes sense to
> > avoid holding locks for too long. However, if the folio is
> > uptodate, we are likely only waiting for a concurrent PTE
> > update to finish. Retrying the entire page fault seems
> > excessive.
> 
> I think the idea is good, but the implementation is misplaced.
> The check for folio_test_uptodate() should be inside folio_lock_or_retry()
> rather than tampering with FAULT_FLAG_ALLOW_RETRY in its caller.

Actually it needs to be a little more complex than this.  We
sometimes wait for writeback while holding the folio lock, and
that's a similar latency to reads (or with cheap NAND, maybe longer!)

So I think the test needs to be:

	if (folio_test_uptodate(folio) && !folio_test_writeback(folio))


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 16:06     ` Matthew Wilcox
@ 2026-05-01 17:09       ` Lorenzo Stoakes
  0 siblings, 0 replies; 25+ messages in thread
From: Lorenzo Stoakes @ 2026-05-01 17:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Barry Song (Xiaomi), akpm, linux-mm, david, liam, vbabka, rppt,
	surenb, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Fri, May 01, 2026 at 05:06:02PM +0100, Matthew Wilcox wrote:
> On Fri, May 01, 2026 at 04:52:12PM +0100, Lorenzo Stoakes wrote:
> > After a brief eyeball I share Matthew's assessment, I really don't like this
> > series, it's piling on complexity for what seem like niche cases.
>
> I don't think they're niche cases ... I think it's a real problem.
> While our current code performs better for this workload than the
> pre-vma-lock code did, it doesn't perform as well as it could.
>
> > We already have enough weirdness in fault code honestly.
> >
> > Let's maybe discuss at LSF if you're attending?
>
> Not only is he attending, there's a topic scheduled (currently 10:30 on
> Wednesday).

Well then, let's revisit this in person in Zagreb :)

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 14:56     ` Matthew Wilcox
@ 2026-05-01 17:44       ` Barry Song
  2026-05-01 17:57         ` Matthew Wilcox
  0 siblings, 1 reply; 25+ messages in thread
From: Barry Song @ 2026-05-01 17:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > 1. There is no deterministic latency for I/O completion. It depends on
> > both the hardware and the software stack (bio/request queues and the
> > block scheduler). Sometimes the latency is short; at other times it can
> > be quite long. In such cases, a high-priority thread performing operations
> > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > for an unpredictable amount of time.
>
> But does that actually happen?  I find it hard to believe that thread A
> unmaps a VMA while thread B is in the middle of taking a page fault in
> that same VMA.  mprotect() and madvise() are more likely to happen, but
> it still seems really unlikely to me.

It doesn’t have to involve unmapping or applying mprotect to
the entire VMA—just a portion of it is sufficient.

BTW, the chain can propagate: thread A takes a page fault, B wants to
write-lock that VMA, and C (a higher-priority task) wants to write-lock
another VMA. D may need to iterate VMAs under mmap_lock, so B can end up
blocking both C and D.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 17:44       ` Barry Song
@ 2026-05-01 17:57         ` Matthew Wilcox
  2026-05-01 18:25           ` Barry Song
  2026-05-03 13:13           ` Jan Kara
  0 siblings, 2 replies; 25+ messages in thread
From: Matthew Wilcox @ 2026-05-01 17:57 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > 1. There is no deterministic latency for I/O completion. It depends on
> > > both the hardware and the software stack (bio/request queues and the
> > > block scheduler). Sometimes the latency is short; at other times it can
> > > be quite long. In such cases, a high-priority thread performing operations
> > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > for an unpredictable amount of time.
> >
> > But does that actually happen?  I find it hard to believe that thread A
> > unmaps a VMA while thread B is in the middle of taking a page fault in
> > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > it still seems really unlikely to me.
> 
> It doesn’t have to involve unmapping or applying mprotect to
> the entire VMA—just a portion of it is sufficient.

Yes, but that still fails to answer "does this actually happen".  How much
performance is all this complexity in the page fault handler buying us?
If you don't answer this question, I'm just going to go in and rip it
all out.

> > BTW, the chain can propagate: thread A takes a page fault, B wants to
> > write-lock that VMA, and C (a higher-priority task) wants to write-lock
> > another VMA. D may need to iterate VMAs under mmap_lock, so B can end up
> > blocking both C and D.

I know.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 15:52   ` Lorenzo Stoakes
  2026-05-01 16:06     ` Matthew Wilcox
@ 2026-05-01 17:59     ` Barry Song
  1 sibling, 0 replies; 25+ messages in thread
From: Barry Song @ 2026-05-01 17:59 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Matthew Wilcox, akpm, linux-mm, david, liam, vbabka, rppt, surenb,
	mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Fri, May 1, 2026 at 11:52 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Thu, Apr 30, 2026 at 01:37:14PM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 30, 2026 at 12:04:22PM +0800, Barry Song (Xiaomi) wrote:
> > > (1) If we need to wait for I/O completion, we still drop the per-VMA lock, as
> > > current page fault handling already does. Holding it for too long may introduce
> > > various priority inversion issues on mobile devices. After I/O completes, we
> > > retry the page fault with the per-VMA lock, rather than falling back to
> > > mmap_lock.
> >
> > You're going to have to do better than that.  You know I hate the
> > additional complexity you're adding.  You need to explain why my idea of
> > ripping out all the complexity now that we have per-VMA locks doesn't
> > work.
>
> After a brief eyeball I share Matthew's assessment, I really don't like this
> series, it's piling on complexity for what seem like niche cases.

I’d really appreciate it if you could point out the specific parts you
dislike, rather than the whole series—I don’t think that’s a fair
assessment.

I’m not sure what you mean by “niche cases.” Do you mean avoiding taking
mmap_lock for major page faults, or releasing the per-VMA lock and retrying
the page fault?

Right now, major page faults always fall back to mmap_lock, which is a
significant source of lock contention. I assume we agree that this fallback
should be eliminated. Or is there still no agreement on this point either?

Where we may differ is whether to hold the per-VMA lock and
avoid retrying the page fault, or to rely on retrying the
fault while holding the per-VMA lock (where current mainline
falls back to mmap_lock instead)?

>
> We already have enough weirdness in fault code honestly.
>
> Let's maybe discuss at LSF if you're attending?

Sure :-)

>
> I will try to have a more thorough look through when I get a chance.

Thank you, much appreciated.

Best Regards
Barry

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 17:57         ` Matthew Wilcox
@ 2026-05-01 18:25           ` Barry Song
  2026-05-01 19:39             ` Matthew Wilcox
  2026-05-03 13:13           ` Jan Kara
  1 sibling, 1 reply; 25+ messages in thread
From: Barry Song @ 2026-05-01 18:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > both the hardware and the software stack (bio/request queues and the
> > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > be quite long. In such cases, a high-priority thread performing operations
> > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > for an unpredictable amount of time.
> > >
> > > But does that actually happen?  I find it hard to believe that thread A
> > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > it still seems really unlikely to me.
> >
> > It doesn’t have to involve unmapping or applying mprotect to
> > the entire VMA—just a portion of it is sufficient.
>
> Yes, but that still fails to answer "does this actually happen".  How much
> performance is all this complexity in the page fault handler buying us?
> If you don't answer this question, I'm just going to go in and rip it
> all out.

I’m getting quite confused. In patch 4/5, you suggest the more
restrictive condition

	if (folio_test_uptodate(folio) && !folio_test_writeback(folio))

rather than plain if (folio_test_uptodate(folio)), before we decide to skip
retrying the page fault [1].
That seems to suggest we should be more cautious about when we can skip
retrying the page fault.

However, in the cover letter, you suggest removing all retry code entirely.
Does this suggestion apply only to file-backed page faults?

[1] https://lore.kernel.org/linux-mm/afTQl12XcXVnku9J@casper.infradead.org/

Best Regards
Barry

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 18:25           ` Barry Song
@ 2026-05-01 19:39             ` Matthew Wilcox
  2026-05-03 20:39               ` Barry Song
  0 siblings, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2026-05-01 19:39 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Sat, May 02, 2026 at 02:25:37AM +0800, Barry Song wrote:
> On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > Yes, but that still fails to answer "does this actually happen".  How much
> > performance is all this complexity in the page fault handler buying us?
> > If you don't answer this question, I'm just going to go in and rip it
> > all out.
> 
> I’m getting quite confused. In patch 4/5, you suggest a more
> restrictive condition using
> if (folio_test_uptodate(folio) && !folio_test_writeback(folio))
> rather than if (folio_test_uptodate(folio)), before we decide to skip
> retrying the page fault [1].
> That seems to suggest we should be more cautious about when we can skip
> retrying the page fault.
> 
> However, in the cover letter, you suggest removing all retry code entirely.
> Does this suggestion apply only to file-backed page faults?

I'm making sure that if Andrew decides to override me he at least sees
that there are other problems with this patchset beyond "I don't like
the additional complexity".

And maybe we decide to do the fallback for anon memory but not file memory.
Or maybe it's just something somebody happens upon when reading the
mailing list (or more likely it's just grist for an AI).

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 17:57         ` Matthew Wilcox
  2026-05-01 18:25           ` Barry Song
@ 2026-05-03 13:13           ` Jan Kara
  2026-05-03 19:55             ` Barry Song
  1 sibling, 1 reply; 25+ messages in thread
From: Jan Kara @ 2026-05-03 13:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Barry Song, akpm, linux-mm, david, ljs, liam, vbabka, rppt,
	surenb, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Fri 01-05-26 18:57:52, Matthew Wilcox wrote:
> On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > both the hardware and the software stack (bio/request queues and the
> > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > be quite long. In such cases, a high-priority thread performing operations
> > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > for an unpredictable amount of time.
> > >
> > > But does that actually happen?  I find it hard to believe that thread A
> > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > it still seems really unlikely to me.
> > 
> > It doesn’t have to involve unmapping or applying mprotect to
> > the entire VMA—just a portion of it is sufficient.
> 
> Yes, but that still fails to answer "does this actually happen".  How much
> performance is all this complexity in the page fault handler buying us?
> If you don't answer this question, I'm just going to go in and rip it
> all out.

I fully agree that we should verify whether the retry code still brings a
real-world advantage today with VMA locks. After all, the retry logic was
introduced in 2010. That being said, if there are realistic loads where one
thread needs the VMA write lock while another thread is faulting in the VMA,
then the latencies can indeed be extreme. For example, things like cgroup IO
throttling happen on the IO path and can thus throttle the IO of a
low-priority thread for a long time.

BTW I'm not sure I quite understand Barry's priority inversion problem
since I'd expect all threads of a task to generally be treated with the
same priority...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-03 13:13           ` Jan Kara
@ 2026-05-03 19:55             ` Barry Song
  2026-05-04 13:03               ` Jan Kara
  0 siblings, 1 reply; 25+ messages in thread
From: Barry Song @ 2026-05-03 19:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, akpm, linux-mm, david, ljs, liam, vbabka, rppt,
	surenb, mhocko, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Mon, May 4, 2026 at 2:17 AM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 01-05-26 18:57:52, Matthew Wilcox wrote:
> > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > both the hardware and the software stack (bio/request queues and the
> > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > for an unpredictable amount of time.
> > > >
> > > > But does that actually happen?  I find it hard to believe that thread A
> > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > it still seems really unlikely to me.
> > >
> > > It doesn’t have to involve unmapping or applying mprotect to
> > > the entire VMA—just a portion of it is sufficient.
> >
> > Yes, but that still fails to answer "does this actually happen".  How much
> > performance is all this complexity in the page fault handler buying us?
> > If you don't answer this question, I'm just going to go in and rip it
> > all out.
>
> I fully agree with you we should verify whether the retry code still brings
> in real-world advantage today with VMA locks. After all the retry logic has
> been introduced in 2010. That being said if there are realistic loads where
> one thread needs VMA write lock while another thread is faulting the VMA,
> then the latencies can be indeed extreme. For example things like cgroup IO
> throttling happen on the IO path and thus can throttle IO of a low-priority
> thread for a long time.

I’m quite sure that swap-in and VMA writes can occur
concurrently, and this is fairly common. For example,
Java GC may use mprotect or userfaultfd on a small
portion of a large Java heap while other portions are
still under do_swap_page().

If we start exploring different approaches for anon and
file, I agree I can revisit this on an Android phone if
there is a real, serious case where a file VMA can be
written and a page fault occurs at the same time.

Please note that, as an Android developer, I am particularly
cautious about priority inversion. A recent issue causing
severe priority inversion is zram attempting to support
preemption[1]. When a task performing compression or
decompression is migrated to another CPU and then preempted
by other tasks, high-priority tasks waiting on the mutex may
be significantly delayed, impacting user experience.

>
> BTW I'm not sure I quite understand Barry's priority inversion problem
> since I'd expect all threads of a task to generally be treated with the
> same priority...

Exactly not. Maybe these slides[2] and this project[3] can give
you a hint—they aim to standardize things on Linux by
learning from Apple OS. Basically, tasks are classified
into five types:

USER_INTERACTIVE: Requires immediate response.
USER_INITIATED: Tolerates a short delay, but must respond quickly still.
UTILITY: Tolerates long delays, but not prolonged ones.
BACKGROUND: Doesn’t mind prolonged delays.
DEFAULT: System default behavior.

[1] https://lore.kernel.org/linux-mm/20250303022425.285971-3-senozhatsky@chromium.org/
[2] https://lpc.events/event/19/contributions/2089/attachments/1797/3877/Userspace%20Assisted%20Scheduling%20via%20Sched%20QoS.pdf
[3] https://lore.kernel.org/lkml/20260415000910.2h5misvwc45bdumu@airbuntu/

Thanks
Barry

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-01 19:39             ` Matthew Wilcox
@ 2026-05-03 20:39               ` Barry Song
  0 siblings, 0 replies; 25+ messages in thread
From: Barry Song @ 2026-05-03 20:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Sat, May 2, 2026 at 3:39 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, May 02, 2026 at 02:25:37AM +0800, Barry Song wrote:
> > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > Yes, but that still fails to answer "does this actually happen".  How much
> > > performance is all this complexity in the page fault handler buying us?
> > > If you don't answer this question, I'm just going to go in and rip it
> > > all out.

I guess the only way to answer this question is to
remove all retry code for file VMAs and run a real test.
As a matter of defensive programming, I am generally very
cautious about this approach, but if this is the only way
to clarify whether we still need PF retry for file, I can
give it a try and run a complete test on Android phones
after LSF/MM/BPF.

> >
> > I’m getting quite confused. In patch 4/5, you suggest a more
> > restrictive condition using
> > if (folio_test_uptodate(folio) && !folio_test_writeback(folio))
> > rather than if (folio_test_uptodate(folio)), before we decide to skip
> > retrying the page fault [1].
> > That seems to suggest we should be more cautious about when we can skip
> > retrying the page fault.
> >
> > However, in the cover letter, you suggest removing all retry code entirely.
> > Does this suggestion apply only to file-backed page faults?
>
> I'm making sure that if Andrew decides to override me he at least sees

No, I don’t want Andrew to override you unless there is a real PI
issue for file, and only if you still insist on “ripping it out”
after a thorough test with it removed.

> that there are other problems with this patchset beyond "I don't like
> the additional complexity".

The other issue you are pointing out is that, for anon, we
should be more cautious before deciding to skip PF retry,
which seems to be the opposite direction of what you expect
for file.

>
> And maybe we decide to do the fallback for anon-mm but not file memory.

I was targeting a unified approach for both file-backed
and anonymous memory. For example, if anon requires retry
under the per-VMA lock, we may already have the necessary
branch in place that file-backed cases can also leverage.
For anon cases, high-level language GCs can operate on a
small portion of a large heap requiring VMA writes, which
is fairly common, as I explained to Jan.

> Or maybe it's just something somebody happens upon when reading the
> mailing list (or more likely it's just grist for an AI).

Maybe one or two years from now. For now, at least, there are still
humans working on the kernel :-)

Best Regards
Barry

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-03 19:55             ` Barry Song
@ 2026-05-04 13:03               ` Jan Kara
  2026-05-04 13:35                 ` Barry Song
  2026-05-04 14:15                 ` Barry Song
  0 siblings, 2 replies; 25+ messages in thread
From: Jan Kara @ 2026-05-04 13:03 UTC (permalink / raw)
  To: Barry Song
  Cc: Jan Kara, Matthew Wilcox, akpm, linux-mm, david, ljs, liam,
	vbabka, rppt, surenb, mhocko, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390

On Mon 04-05-26 03:55:43, Barry Song wrote:
> On Mon, May 4, 2026 at 2:17 AM Jan Kara <jack@suse.cz> wrote:
> > On Fri 01-05-26 18:57:52, Matthew Wilcox wrote:
> > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > for an unpredictable amount of time.
> > > > >
> > > > > But does that actually happen?  I find it hard to believe that thread A
> > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > > it still seems really unlikely to me.
> > > >
> > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > the entire VMA—just a portion of it is sufficient.
> > >
> > > Yes, but that still fails to answer "does this actually happen".  How much
> > > performance is all this complexity in the page fault handler buying us?
> > > If you don't answer this question, I'm just going to go in and rip it
> > > all out.
> >
> > I fully agree with you we should verify whether the retry code still brings
> > in real-world advantage today with VMA locks. After all the retry logic has
> > been introduced in 2010. That being said if there are realistic loads where
> > one thread needs VMA write lock while another thread is faulting the VMA,
> > then the latencies can be indeed extreme. For example things like cgroup IO
> > throttling happen on the IO path and thus can throttle IO of a low-priority
> > thread for a long time.
> 
> I’m quite sure that swap-in and VMA writes can occur
> concurrently, and this is fairly common. For example,
> Java GC may use mprotect or userfaultfd on a small
> portion of a large Java heap while other portions are
> still under do_swap_page().

OK, makes sense.

> If we start exploring different approaches for anon and
> file, I agree I can revisit this on an Android phone if
> there is a real, serious case where a file VMA can be
> written and a page fault occurs at the same time.
> 
> Please note that, as an Android developer, I am particularly
> cautious about priority inversion. A recent issue causing
> severe priority inversion is zram attempting to support
> preemption[1]. When a task performing compression or
> decompression is migrated to another CPU and then preempted
> by other tasks, high-priority tasks waiting on the mutex may
> be significantly delayed, impacting user experience.

Well, container people are concerned about priority inversion as well. But
usually that is with coarse locks (such as global filesystem locks), while
the VMA lock is specific to a task (and a VMA), so there the opportunity
for priority inversion looks more limited. But the example with Java, where
the GC thread can presumably have higher priority than ordinary Java
threads, is an interesting one.

> > BTW I'm not sure I quite understand Barry's priority inversion problem
> > since I'd expect all threads of a task to generally be treated with the
> > same priority...
> 
> Exactly not. Maybe these slides[2] and this project[3] can give
> you a hint—they aim to standardize things on Linux by
> learning from Apple OS. Basically, tasks are classified
> into five types:
> 
> USER_INTERACTIVE: Requires immediate response.
> USER_INITIATED: Tolerates a short delay, but must respond quickly still.
> UTILITY: Tolerates long delays, but not prolonged ones.
> BACKGROUND: Doesn’t mind prolonged delays.
> DEFAULT: System default behavior.

Again, this is a classification of tasks but not really of threads in a
task, so at least for the VMA lock there's no inversion to be had?

								Honza

> [1] https://lore.kernel.org/linux-mm/20250303022425.285971-3-senozhatsky@chromium.org/
> [2] https://lpc.events/event/19/contributions/2089/attachments/1797/3877/Userspace%20Assisted%20Scheduling%20via%20Sched%20QoS.pdf
> [3] https://lore.kernel.org/lkml/20260415000910.2h5misvwc45bdumu@airbuntu/
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-04 13:03               ` Jan Kara
@ 2026-05-04 13:35                 ` Barry Song
  2026-05-04 14:15                 ` Barry Song
  1 sibling, 0 replies; 25+ messages in thread
From: Barry Song @ 2026-05-04 13:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, akpm, linux-mm, david, ljs, liam, vbabka, rppt,
	surenb, mhocko, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Mon, May 4, 2026 at 9:04 PM Jan Kara <jack@suse.cz> wrote:
[...]
>
> > > BTW I'm not sure I quite understand Barry's priority inversion problem
> > > since I'd expect all threads of a task to generally be treated with the
> > > same priority...
> >
> > Exactly not. Maybe these slides[2] and this project[3] can give
> > you a hint—they aim to standardize things on Linux by
> > learning from Apple OS. Basically, tasks are classified
> > into five types:
> >
> > USER_INTERACTIVE: Requires immediate response.
> > USER_INITIATED: Tolerates a short delay, but must respond quickly still.
> > UTILITY: Tolerates long delays, but not prolonged ones.
> > BACKGROUND: Doesn’t mind prolonged delays.
> > DEFAULT: System default behavior.
>
> Again, this is a classification of tasks but not really of threads in a
> task, so at least for the VMA lock there's no inversion to be had?

I’m specifically referring to a task (i.e., a thread) when
discussing scheduler context. It may be clearer to use the
terms process and thread explicitly.

In a typical process sharing an mm_struct, each thread can
have a different priority.

In an Android app, some threads handle the UI and require
higher priority, such as the main thread and RenderThread;
otherwise, frame drops may occur.

The Linux scheduler can control scheduling policy and
priority for each thread.

Thanks
Barry

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
  2026-05-04 13:03               ` Jan Kara
  2026-05-04 13:35                 ` Barry Song
@ 2026-05-04 14:15                 ` Barry Song
  1 sibling, 0 replies; 25+ messages in thread
From: Barry Song @ 2026-05-04 14:15 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, akpm, linux-mm, david, ljs, liam, vbabka, rppt,
	surenb, mhocko, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390

On Mon, May 4, 2026 at 9:04 PM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 04-05-26 03:55:43, Barry Song wrote:
> > On Mon, May 4, 2026 at 2:17 AM Jan Kara <jack@suse.cz> wrote:
> > > On Fri 01-05-26 18:57:52, Matthew Wilcox wrote:
> > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > for an unpredictable amount of time.
> > > > > >
> > > > > > But does that actually happen?  I find it hard to believe that thread A
> > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > > > it still seems really unlikely to me.
> > > > >
> > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > the entire VMA—just a portion of it is sufficient.
> > > >
> > > > Yes, but that still fails to answer "does this actually happen".  How much
> > > > performance is all this complexity in the page fault handler buying us?
> > > > If you don't answer this question, I'm just going to go in and rip it
> > > > all out.
> > >
> > > I fully agree with you we should verify whether the retry code still brings
> > > in real-world advantage today with VMA locks. After all the retry logic has
> > > been introduced in 2010. That being said if there are realistic loads where
> > > one thread needs VMA write lock while another thread is faulting the VMA,
> > > then the latencies can be indeed extreme. For example things like cgroup IO
> > > throttling happen on the IO path and thus can throttle IO of a low-priority
> > > thread for a long time.
> >
> > I’m quite sure that swap-in and VMA writes can occur
> > concurrently, and this is fairly common. For example,
> > Java GC may use mprotect or userfaultfd on a small
> > portion of a large Java heap while other portions are
> > still under do_swap_page().
>
> OK, makes sense.
>
> > If we start exploring different approaches for anon and
> > file, I agree I can revisit this on an Android phone if
> > there is a real, serious case where a file VMA can be
> > written and a page fault occurs at the same time.
> >
> > Please note that, as an Android developer, I am particularly
> > cautious about priority inversion. A recent issue causing
> > severe priority inversion is zram attempting to support
> > preemption[1]. When a task performing compression or
> > decompression is migrated to another CPU and then preempted
> > by other tasks, high-priority tasks waiting on the mutex may
> > be significantly delayed, impacting user experience.
>
> Well, container people are concerned about priority inversion as well. But
> usually this is with coarse lock (such as global filesystem locks) but VMA
> lock is specific to a task (and a VMA) so there the opportunity for
> priority inversion looks more limited.  But the example with Java where GC
> thread can presumably have higher priority than ordinary Java threads is an
> interesting one.

A major difference in Android apps is that each thread can
affect user experience differently. And it is not simply a matter
of whether a VMA writer has higher or lower priority than a
page-fault (PF) thread performing I/O.

For example, thread A handles a PF; thread B attempts to
modify the VMA where the PF occurs; thread C tries to modify
another VMA (requiring mmap_lock in write mode) or iterate
VMAs (requiring mmap_lock in read mode). Regardless of
thread B’s priority, it holds mmap_lock in write mode while
waiting for the VMA lock. The usual pattern for a VMA writer
is:

mmap_write_lock()
vma_start_write()

As a result, thread C can be blocked even if it has higher
priority but operates on a different VMA.

In essence, when a PF and a VMA write occur concurrently,
high-priority threads may be blocked even if they operate on
different VMAs, not necessarily the same one.

Thanks
Barry

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2026-05-04 14:16 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-30  4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
2026-04-30  4:04 ` [PATCH v2 1/5] mm/filemap: Retry fault by VMA lock if the lock was released for I/O Barry Song (Xiaomi)
2026-04-30  4:04 ` [PATCH v2 2/5] mm/swapin: Retry swapin " Barry Song (Xiaomi)
2026-04-30  4:04 ` [PATCH v2 3/5] mm: Move folio_lock_or_retry() and drop __folio_lock_or_retry() Barry Song (Xiaomi)
2026-04-30  4:04 ` [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in Barry Song (Xiaomi)
2026-04-30 12:35   ` Matthew Wilcox
2026-05-01 16:11     ` Matthew Wilcox
2026-04-30  4:04 ` [PATCH v2 5/5] mm/filemap: Avoid retrying page faults on uptodate folios in filemap faults Barry Song (Xiaomi)
2026-04-30 12:37 ` [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Matthew Wilcox
2026-04-30 22:49   ` Barry Song
2026-05-01 14:56     ` Matthew Wilcox
2026-05-01 17:44       ` Barry Song
2026-05-01 17:57         ` Matthew Wilcox
2026-05-01 18:25           ` Barry Song
2026-05-01 19:39             ` Matthew Wilcox
2026-05-03 20:39               ` Barry Song
2026-05-03 13:13           ` Jan Kara
2026-05-03 19:55             ` Barry Song
2026-05-04 13:03               ` Jan Kara
2026-05-04 13:35                 ` Barry Song
2026-05-04 14:15                 ` Barry Song
2026-05-01 15:52   ` Lorenzo Stoakes
2026-05-01 16:06     ` Matthew Wilcox
2026-05-01 17:09       ` Lorenzo Stoakes
2026-05-01 17:59     ` Barry Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox