* [PATCH v2 1/5] mm/filemap: Retry fault by VMA lock if the lock was released for I/O
2026-04-30 4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
@ 2026-04-30 4:04 ` Barry Song (Xiaomi)
2026-04-30 4:04 ` [PATCH v2 2/5] mm/swapin: Retry swapin " Barry Song (Xiaomi)
` (4 subsequent siblings)
5 siblings, 0 replies; 56+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-30 4:04 UTC (permalink / raw)
To: akpm, linux-mm, willy
Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, jack, pfalcato,
wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
Barry Song
From: Oven Liyang <liyangouwen1@oppo.com>
If the current page fault is using the per-VMA lock, and we only released
the lock to wait for I/O completion (e.g., using folio_lock()), then when
the fault is retried after the I/O completes, it should still qualify for
the per-VMA-lock path.
Acked-by: Pedro Falcato <pfalcato@suse.de>
Tested-by: Wang Lian <wanglian@kylinos.cn>
Tested-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: Wang Lian <lianux.mm@gmail.com>
Reviewed-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Oven Liyang <liyangouwen1@oppo.com>
Co-developed-by: Barry Song <baohua@kernel.org>
Signed-off-by: Barry Song <baohua@kernel.org>
---
arch/arm/mm/fault.c | 5 +++++
arch/arm64/mm/fault.c | 5 +++++
arch/loongarch/mm/fault.c | 4 ++++
arch/powerpc/mm/fault.c | 5 ++++-
arch/riscv/mm/fault.c | 4 ++++
arch/s390/mm/fault.c | 4 ++++
arch/x86/mm/fault.c | 4 ++++
include/linux/mm_types.h | 9 +++++----
mm/filemap.c | 5 ++++-
9 files changed, 39 insertions(+), 6 deletions(-)
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index e62cc4be5adf..5971e02845f7 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -391,6 +391,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
if (!(flags & FAULT_FLAG_USER))
goto lock_mmap;
+retry_vma:
vma = lock_vma_under_rcu(mm, addr);
if (!vma)
goto lock_mmap;
@@ -420,6 +421,10 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
goto no_context;
return 0;
}
+
+ /* If the first try is only about waiting for the I/O to complete */
+ if (fault & VM_FAULT_RETRY_VMA)
+ goto retry_vma;
lock_mmap:
retry:
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 739800835920..d0362a3e11b7 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -673,6 +673,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
if (!(mm_flags & FAULT_FLAG_USER))
goto lock_mmap;
+retry_vma:
vma = lock_vma_under_rcu(mm, addr);
if (!vma)
goto lock_mmap;
@@ -719,6 +720,10 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
goto no_context;
return 0;
}
+
+ /* If the first try is only about waiting for the I/O to complete */
+ if (fault & VM_FAULT_RETRY_VMA)
+ goto retry_vma;
lock_mmap:
retry:
diff --git a/arch/loongarch/mm/fault.c b/arch/loongarch/mm/fault.c
index 2c93d33356e5..738f495560c0 100644
--- a/arch/loongarch/mm/fault.c
+++ b/arch/loongarch/mm/fault.c
@@ -219,6 +219,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs,
if (!(flags & FAULT_FLAG_USER))
goto lock_mmap;
+retry_vma:
vma = lock_vma_under_rcu(mm, address);
if (!vma)
goto lock_mmap;
@@ -265,6 +266,9 @@ static void __kprobes __do_page_fault(struct pt_regs *regs,
no_context(regs, write, address);
return;
}
+ /* If the first try is only about waiting for the I/O to complete */
+ if (fault & VM_FAULT_RETRY_VMA)
+ goto retry_vma;
lock_mmap:
retry:
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 806c74e0d5ab..cb7ffc20c760 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -487,6 +487,7 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
if (!(flags & FAULT_FLAG_USER))
goto lock_mmap;
+retry_vma:
vma = lock_vma_under_rcu(mm, address);
if (!vma)
goto lock_mmap;
@@ -516,7 +517,9 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
if (fault_signal_pending(fault, regs))
return user_mode(regs) ? 0 : SIGBUS;
-
+ /* If the first try is only about waiting for the I/O to complete */
+ if (fault & VM_FAULT_RETRY_VMA)
+ goto retry_vma;
lock_mmap:
/* When running in the kernel we expect faults to occur only to
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 04ed6f8acae4..b94cf57c2b9a 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -347,6 +347,7 @@ void handle_page_fault(struct pt_regs *regs)
if (!(flags & FAULT_FLAG_USER))
goto lock_mmap;
+retry_vma:
vma = lock_vma_under_rcu(mm, addr);
if (!vma)
goto lock_mmap;
@@ -376,6 +377,9 @@ void handle_page_fault(struct pt_regs *regs)
no_context(regs, addr);
return;
}
+ /* If the first try is only about waiting for the I/O to complete */
+ if (fault & VM_FAULT_RETRY_VMA)
+ goto retry_vma;
lock_mmap:
retry:
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 191cc53caead..e0576e629f65 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -294,6 +294,7 @@ static void do_exception(struct pt_regs *regs, int access)
flags |= FAULT_FLAG_WRITE;
if (!(flags & FAULT_FLAG_USER))
goto lock_mmap;
+retry_vma:
vma = lock_vma_under_rcu(mm, address);
if (!vma)
goto lock_mmap;
@@ -318,6 +319,9 @@ static void do_exception(struct pt_regs *regs, int access)
handle_fault_error_nolock(regs, 0);
return;
}
+ /* If the first try is only about waiting for the I/O to complete */
+ if (fault & VM_FAULT_RETRY_VMA)
+ goto retry_vma;
lock_mmap:
retry:
vma = lock_mm_and_find_vma(mm, address, regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f0e77e084482..0589fc693eea 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1322,6 +1322,7 @@ void do_user_addr_fault(struct pt_regs *regs,
if (!(flags & FAULT_FLAG_USER))
goto lock_mmap;
+retry_vma:
vma = lock_vma_under_rcu(mm, address);
if (!vma)
goto lock_mmap;
@@ -1351,6 +1352,9 @@ void do_user_addr_fault(struct pt_regs *regs,
ARCH_DEFAULT_PKEY);
return;
}
+ /* If the first try is only about waiting for the I/O to complete */
+ if (fault & VM_FAULT_RETRY_VMA)
+ goto retry_vma;
lock_mmap:
retry:
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a308e2c23b82..5907200ea587 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1678,10 +1678,11 @@ enum vm_fault_reason {
VM_FAULT_NOPAGE = (__force vm_fault_t)0x000100,
VM_FAULT_LOCKED = (__force vm_fault_t)0x000200,
VM_FAULT_RETRY = (__force vm_fault_t)0x000400,
- VM_FAULT_FALLBACK = (__force vm_fault_t)0x000800,
- VM_FAULT_DONE_COW = (__force vm_fault_t)0x001000,
- VM_FAULT_NEEDDSYNC = (__force vm_fault_t)0x002000,
- VM_FAULT_COMPLETED = (__force vm_fault_t)0x004000,
+ VM_FAULT_RETRY_VMA = (__force vm_fault_t)0x000800,
+ VM_FAULT_FALLBACK = (__force vm_fault_t)0x001000,
+ VM_FAULT_DONE_COW = (__force vm_fault_t)0x002000,
+ VM_FAULT_NEEDDSYNC = (__force vm_fault_t)0x004000,
+ VM_FAULT_COMPLETED = (__force vm_fault_t)0x008000,
VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000,
};
diff --git a/mm/filemap.c b/mm/filemap.c
index ab34cab2416a..a045b771e8de 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3525,6 +3525,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
struct folio *folio;
vm_fault_t ret = 0;
bool mapping_locked = false;
+ bool retry_by_vma_lock = false;
max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
if (unlikely(index >= max_idx))
@@ -3621,6 +3622,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
*/
if (fpin) {
folio_unlock(folio);
+ if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+ retry_by_vma_lock = true;
goto out_retry;
}
if (mapping_locked)
@@ -3671,7 +3674,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
filemap_invalidate_unlock_shared(mapping);
if (fpin)
fput(fpin);
- return ret | VM_FAULT_RETRY;
+ return ret | VM_FAULT_RETRY | (retry_by_vma_lock ? VM_FAULT_RETRY_VMA : 0);
}
EXPORT_SYMBOL(filemap_fault);
--
2.39.3 (Apple Git-146)
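The control flow this patch adds to each architecture's fault handler can be modelled in user space. What follows is a minimal sketch, not kernel code. The two VM_FAULT_* bit values are copied from the mm_types.h hunk above; `model_filemap_retry()`, `model_next_step()` and `enum next_step` are hypothetical names introduced only for illustration, and the signal-pending check that the real handlers perform first is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values as they appear in the mm_types.h hunk of this patch. */
#define VM_FAULT_RETRY     0x000400u
#define VM_FAULT_RETRY_VMA 0x000800u

/* Hypothetical: what the arch handler does after the per-VMA-lock attempt. */
enum next_step {
	STEP_DONE,       /* fault handled, no retry requested */
	STEP_RETRY_VMA,  /* goto retry_vma: take the per-VMA lock again */
	STEP_LOCK_MMAP,  /* fall through to lock_mmap */
};

/*
 * Hypothetical model of the value filemap_fault() returns from out_retry:
 * VM_FAULT_RETRY is always set, and VM_FAULT_RETRY_VMA is added only when
 * the lock was dropped for I/O while the fault ran under the per-VMA lock.
 */
static unsigned int model_filemap_retry(bool vma_locked, bool dropped_for_io)
{
	bool retry_by_vma_lock = dropped_for_io && vma_locked;

	return VM_FAULT_RETRY | (retry_by_vma_lock ? VM_FAULT_RETRY_VMA : 0u);
}

/* Hypothetical model of the branch added before the lock_mmap label. */
static enum next_step model_next_step(unsigned int fault)
{
	if (!(fault & VM_FAULT_RETRY))
		return STEP_DONE;
	if (fault & VM_FAULT_RETRY_VMA)
		return STEP_RETRY_VMA;
	return STEP_LOCK_MMAP;
}
```

In this model, a fault that dropped the per-VMA lock only to wait for I/O maps to STEP_RETRY_VMA, while any other retry still falls back to the mmap_lock path.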
* [PATCH v2 2/5] mm/swapin: Retry swapin by VMA lock if the lock was released for I/O
2026-04-30 4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
2026-04-30 4:04 ` [PATCH v2 1/5] mm/filemap: Retry fault by VMA lock if the lock was released for I/O Barry Song (Xiaomi)
@ 2026-04-30 4:04 ` Barry Song (Xiaomi)
2026-04-30 4:04 ` [PATCH v2 3/5] mm: Move folio_lock_or_retry() and drop __folio_lock_or_retry() Barry Song (Xiaomi)
` (3 subsequent siblings)
5 siblings, 0 replies; 56+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-30 4:04 UTC (permalink / raw)
To: akpm, linux-mm, willy
Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, jack, pfalcato,
wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
Barry Song (Xiaomi)
If the current do_swap_page() took the per-VMA lock and we dropped it only
to wait for I/O completion (e.g., via folio_wait_locked()), then when
do_swap_page() is retried after the I/O completes, it should still qualify
for the per-VMA-lock path.
Tested-by: Wang Lian <wanglian@kylinos.cn>
Tested-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: Wang Lian <lianux.mm@gmail.com>
Reviewed-by: Kunwu Chan <kunwu.chan@gmail.com>
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
mm/memory.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 199214f8de08..00ee1599d637 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4791,6 +4791,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
unsigned long page_idx;
unsigned long address;
pte_t *ptep;
+ bool retry_by_vma_lock = false;
if (!pte_unmap_same(vmf))
goto out;
@@ -4896,8 +4897,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
swapcache = folio;
ret |= folio_lock_or_retry(folio, vmf);
- if (ret & VM_FAULT_RETRY)
+ if (ret & VM_FAULT_RETRY) {
+ if (fault_flag_allow_retry_first(vmf->flags) &&
+ !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT) &&
+ (vmf->flags & FAULT_FLAG_VMA_LOCK))
+ retry_by_vma_lock = true;
goto out_release;
+ }
page = folio_file_page(folio, swp_offset(entry));
/*
@@ -5182,7 +5188,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
if (si)
put_swap_device(si);
- return ret;
+ return ret | (retry_by_vma_lock ? VM_FAULT_RETRY_VMA : 0);
}
static bool pte_range_none(pte_t *pte, int nr_pages)
--
2.39.3 (Apple Git-146)
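The three-part condition in the do_swap_page() hunk appears to single out the one branch of folio_lock_or_retry() that releases the fault lock and then waits on the folio. The sketch below models that predicate in user space. The flag bit values are illustrative only and are not claimed to match the kernel's definitions; `model_allow_retry_first()` follows the logic of the kernel's fault_flag_allow_retry_first(), and `model_swapin_retry_by_vma_lock()` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits for this model only (assumed values). */
#define M_ALLOW_RETRY  (1u << 0)
#define M_RETRY_NOWAIT (1u << 1)
#define M_TRIED        (1u << 2)
#define M_VMA_LOCK     (1u << 3)

/* Retry is allowed and this is the first attempt. */
static bool model_allow_retry_first(unsigned int flags)
{
	return (flags & M_ALLOW_RETRY) && !(flags & M_TRIED);
}

/*
 * True only when the lock was dropped to wait for the folio (first
 * attempt, waiting permitted) and the fault ran under the per-VMA lock.
 */
static bool model_swapin_retry_by_vma_lock(unsigned int flags)
{
	return model_allow_retry_first(flags) &&
	       !(flags & M_RETRY_NOWAIT) &&
	       (flags & M_VMA_LOCK);
}
```

With FAULT_FLAG_RETRY_NOWAIT the lock is never released, and on a second attempt there is no wait-and-retry, so neither case sets the new bit in this model.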
* [PATCH v2 3/5] mm: Move folio_lock_or_retry() and drop __folio_lock_or_retry()
2026-04-30 4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
2026-04-30 4:04 ` [PATCH v2 1/5] mm/filemap: Retry fault by VMA lock if the lock was released for I/O Barry Song (Xiaomi)
2026-04-30 4:04 ` [PATCH v2 2/5] mm/swapin: Retry swapin " Barry Song (Xiaomi)
@ 2026-04-30 4:04 ` Barry Song (Xiaomi)
2026-04-30 4:04 ` [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in Barry Song (Xiaomi)
` (2 subsequent siblings)
5 siblings, 0 replies; 56+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-30 4:04 UTC (permalink / raw)
To: akpm, linux-mm, willy
Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, jack, pfalcato,
wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
Barry Song (Xiaomi)
folio_lock_or_retry() is effectively only used in mm/memory.c,
not in the filemap code. Move it there and make it static.
The helper __folio_lock_or_retry() can be folded into
folio_lock_or_retry(), allowing it to be removed.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
include/linux/pagemap.h | 17 -------------
mm/filemap.c | 45 ----------------------------------
mm/memory.c | 53 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 53 insertions(+), 62 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 1f50991b43e3..500ab783bf70 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1101,7 +1101,6 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
void __folio_lock(struct folio *folio);
int __folio_lock_killable(struct folio *folio);
-vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf);
void unlock_page(struct page *page);
void folio_unlock(struct folio *folio);
@@ -1198,22 +1197,6 @@ static inline int folio_lock_killable(struct folio *folio)
return 0;
}
-/*
- * folio_lock_or_retry - Lock the folio, unless this would block and the
- * caller indicated that it can handle a retry.
- *
- * Return value and mmap_lock implications depend on flags; see
- * __folio_lock_or_retry().
- */
-static inline vm_fault_t folio_lock_or_retry(struct folio *folio,
- struct vm_fault *vmf)
-{
- might_sleep();
- if (!folio_trylock(folio))
- return __folio_lock_or_retry(folio, vmf);
- return 0;
-}
-
/*
* This is exported only for folio_wait_locked/folio_wait_writeback, etc.,
* and should not be used directly.
diff --git a/mm/filemap.c b/mm/filemap.c
index a045b771e8de..b532d6cbafc8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1740,51 +1740,6 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
return ret;
}
-/*
- * Return values:
- * 0 - folio is locked.
- * non-zero - folio is not locked.
- * mmap_lock or per-VMA lock has been released (mmap_read_unlock() or
- * vma_end_read()), unless flags had both FAULT_FLAG_ALLOW_RETRY and
- * FAULT_FLAG_RETRY_NOWAIT set, in which case the lock is still held.
- *
- * If neither ALLOW_RETRY nor KILLABLE are set, will always return 0
- * with the folio locked and the mmap_lock/per-VMA lock is left unperturbed.
- */
-vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
-{
- unsigned int flags = vmf->flags;
-
- if (fault_flag_allow_retry_first(flags)) {
- /*
- * CAUTION! In this case, mmap_lock/per-VMA lock is not
- * released even though returning VM_FAULT_RETRY.
- */
- if (flags & FAULT_FLAG_RETRY_NOWAIT)
- return VM_FAULT_RETRY;
-
- release_fault_lock(vmf);
- if (flags & FAULT_FLAG_KILLABLE)
- folio_wait_locked_killable(folio);
- else
- folio_wait_locked(folio);
- return VM_FAULT_RETRY;
- }
- if (flags & FAULT_FLAG_KILLABLE) {
- bool ret;
-
- ret = __folio_lock_killable(folio);
- if (ret) {
- release_fault_lock(vmf);
- return VM_FAULT_RETRY;
- }
- } else {
- __folio_lock(folio);
- }
-
- return 0;
-}
-
/**
* page_cache_next_miss() - Find the next gap in the page cache.
* @mapping: Mapping.
diff --git a/mm/memory.c b/mm/memory.c
index 00ee1599d637..0c740ca363cc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4442,6 +4442,59 @@ void unmap_mapping_range(struct address_space *mapping,
}
EXPORT_SYMBOL(unmap_mapping_range);
+/*
+ * folio_lock_or_retry - Lock the folio, unless this would block and the
+ * caller indicated that it can handle a retry.
+ *
+ * Return values:
+ * 0 - folio is locked.
+ * non-zero - folio is not locked.
+ * mmap_lock or per-VMA lock has been released (mmap_read_unlock() or
+ * vma_end_read()), unless flags had both FAULT_FLAG_ALLOW_RETRY and
+ * FAULT_FLAG_RETRY_NOWAIT set, in which case the lock is still held.
+ *
+ * If neither ALLOW_RETRY nor KILLABLE are set, will always return 0
+ * with the folio locked and the mmap_lock/per-VMA lock is left unperturbed.
+ */
+static inline vm_fault_t folio_lock_or_retry(struct folio *folio,
+ struct vm_fault *vmf)
+{
+ unsigned int flags = vmf->flags;
+
+ might_sleep();
+ if (folio_trylock(folio))
+ return 0;
+
+ if (fault_flag_allow_retry_first(flags)) {
+ /*
+ * CAUTION! In this case, mmap_lock/per-VMA lock is not
+ * released even though returning VM_FAULT_RETRY.
+ */
+ if (flags & FAULT_FLAG_RETRY_NOWAIT)
+ return VM_FAULT_RETRY;
+
+ release_fault_lock(vmf);
+ if (flags & FAULT_FLAG_KILLABLE)
+ folio_wait_locked_killable(folio);
+ else
+ folio_wait_locked(folio);
+ return VM_FAULT_RETRY;
+ }
+ if (flags & FAULT_FLAG_KILLABLE) {
+ bool ret;
+
+ ret = __folio_lock_killable(folio);
+ if (ret) {
+ release_fault_lock(vmf);
+ return VM_FAULT_RETRY;
+ }
+ } else {
+ __folio_lock(folio);
+ }
+
+ return 0;
+}
+
/*
* Restore a potential device exclusive pte to a working pte entry
*/
--
2.39.3 (Apple Git-146)
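The return-value contract in the comment above folio_lock_or_retry() can be written out as a small user-space model. This is a sketch under stated assumptions: the flag bits and the VM_FAULT_RETRY stand-in are illustrative values, `struct outcome` and `model_folio_lock_or_retry()` are hypothetical names, and the `fatal_signal` parameter stands in for __folio_lock_killable() failing.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values for this model only (assumed, not the kernel's). */
#define M_ALLOW_RETRY   (1u << 0)
#define M_RETRY_NOWAIT  (1u << 1)
#define M_KILLABLE      (1u << 2)
#define M_TRIED         (1u << 3)
#define M_FAULT_RETRY   0x400u

struct outcome {
	bool folio_locked;         /* caller now holds the folio lock */
	bool fault_lock_released;  /* mmap_lock or per-VMA lock was dropped */
	unsigned int ret;          /* 0 or M_FAULT_RETRY */
};

static struct outcome model_folio_lock_or_retry(unsigned int flags,
						bool trylock_ok,
						bool fatal_signal)
{
	struct outcome o = { false, false, 0u };

	if (trylock_ok) {
		o.folio_locked = true;
		return o;
	}
	if ((flags & M_ALLOW_RETRY) && !(flags & M_TRIED)) {
		/* NOWAIT returns RETRY with the fault lock still held. */
		o.fault_lock_released = !(flags & M_RETRY_NOWAIT);
		o.ret = M_FAULT_RETRY;
		return o;
	}
	if ((flags & M_KILLABLE) && fatal_signal) {
		o.fault_lock_released = true;
		o.ret = M_FAULT_RETRY;
		return o;
	}
	/* Blocking lock succeeded; the fault lock is untouched. */
	o.folio_locked = true;
	return o;
}
```

The model makes the one asymmetric case visible: with both ALLOW_RETRY and RETRY_NOWAIT set, a retry is returned while the fault lock is still held.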
* [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in
2026-04-30 4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
` (2 preceding siblings ...)
2026-04-30 4:04 ` [PATCH v2 3/5] mm: Move folio_lock_or_retry() and drop __folio_lock_or_retry() Barry Song (Xiaomi)
@ 2026-04-30 4:04 ` Barry Song (Xiaomi)
2026-04-30 12:35 ` Matthew Wilcox
2026-04-30 4:04 ` [PATCH v2 5/5] mm/filemap: Avoid retrying page faults on uptodate folios in filemap faults Barry Song (Xiaomi)
2026-04-30 12:37 ` [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Matthew Wilcox
5 siblings, 1 reply; 56+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-30 4:04 UTC (permalink / raw)
To: akpm, linux-mm, willy
Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, jack, pfalcato,
wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
Barry Song (Xiaomi)
If we are waiting for long I/O to complete, it makes sense to
avoid holding locks for too long. However, if the folio is
uptodate, we are likely only waiting for a concurrent PTE
update to finish. Retrying the entire page fault seems
excessive.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
mm/memory.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index 0c740ca363cc..a2e4f2d87ec8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4949,6 +4949,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
swapcache = folio;
+ /*
+ * If the folio is uptodate, we are likely only waiting for
+ * another concurrent PTE mapping to complete, which should
+ * be brief. No need to drop the lock and retry the fault.
+ */
+ if (folio_test_uptodate(folio))
+ vmf->flags &= ~FAULT_FLAG_ALLOW_RETRY;
ret |= folio_lock_or_retry(folio, vmf);
if (ret & VM_FAULT_RETRY) {
if (fault_flag_allow_retry_first(vmf->flags) &&
--
2.39.3 (Apple Git-146)
* Re: [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in
2026-04-30 4:04 ` [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in Barry Song (Xiaomi)
@ 2026-04-30 12:35 ` Matthew Wilcox
2026-05-01 16:11 ` Matthew Wilcox
0 siblings, 1 reply; 56+ messages in thread
From: Matthew Wilcox @ 2026-04-30 12:35 UTC (permalink / raw)
To: Barry Song (Xiaomi)
Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Thu, Apr 30, 2026 at 12:04:26PM +0800, Barry Song (Xiaomi) wrote:
> If we are waiting for long I/O to complete, it makes sense to
> avoid holding locks for too long. However, if the folio is
> uptodate, we are likely only waiting for a concurrent PTE
> update to finish. Retrying the entire page fault seems
> excessive.
I think the idea is good, but the implementation is misplaced.
The check for folio_uptodate() should be inside folio_lock_or_retry()
rather than tampering with FAULT_FLAG_ALLOW_RETRY in its caller.
Similarly for your next patch.
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
> mm/memory.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0c740ca363cc..a2e4f2d87ec8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4949,6 +4949,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> swapcache = folio;
> + /*
> + * If the folio is uptodate, we are likely only waiting for
> + * another concurrent PTE mapping to complete, which should
> + * be brief. No need to drop the lock and retry the fault.
> + */
> + if (folio_test_uptodate(folio))
> + vmf->flags &= ~FAULT_FLAG_ALLOW_RETRY;
> ret |= folio_lock_or_retry(folio, vmf);
> if (ret & VM_FAULT_RETRY) {
> if (fault_flag_allow_retry_first(vmf->flags) &&
> --
> 2.39.3 (Apple Git-146)
>
>
* Re: [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in
2026-04-30 12:35 ` Matthew Wilcox
@ 2026-05-01 16:11 ` Matthew Wilcox
0 siblings, 0 replies; 56+ messages in thread
From: Matthew Wilcox @ 2026-05-01 16:11 UTC (permalink / raw)
To: Barry Song (Xiaomi)
Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Thu, Apr 30, 2026 at 01:35:30PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 30, 2026 at 12:04:26PM +0800, Barry Song (Xiaomi) wrote:
> > If we are waiting for long I/O to complete, it makes sense to
> > avoid holding locks for too long. However, if the folio is
> > uptodate, we are likely only waiting for a concurrent PTE
> > update to finish. Retrying the entire page fault seems
> > excessive.
>
> I think the idea is good, but the implementation is misplaced.
> The check for folio_uptodate() should be inside folio_lock_or_retry()
> rather than tampering with FAULT_FLAG_ALLOW_RETRY in its caller.
Actually it needs to be a little more complex than this. We
sometimes wait for writeback while holding the folio lock, and
that's a similar latency to reads (or with cheap NAND, maybe longer!)
So I think the test needs to be:
if (folio_test_uptodate(folio) && !folio_test_writeback(folio))
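One possible reading of this suggestion, sketched as a user-space model rather than kernel code: the "short wait" test moves inside the lock helper, so the caller no longer clears FAULT_FLAG_ALLOW_RETRY. How the check would actually be wired into folio_lock_or_retry() is not spelled out in the thread, so `struct folio_state`, `enum lock_action`, `model_wait_is_short()` and `model_lock_action()` are all hypothetical, and the FAULT_FLAG_RETRY_NOWAIT and killable cases are ignored for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the folio bits the test looks at. */
struct folio_state {
	bool locked_by_other;  /* folio_trylock() would fail */
	bool uptodate;         /* stands in for folio_test_uptodate() */
	bool writeback;        /* stands in for folio_test_writeback() */
};

enum lock_action {
	ACT_LOCKED,          /* trylock succeeded */
	ACT_SLEEP_HOLDING,   /* sleep on the folio lock, keep the fault lock */
	ACT_DROP_AND_RETRY,  /* release the fault lock, wait, retry the fault */
};

/* The proposed test: uptodate and not under writeback means a brief wait. */
static bool model_wait_is_short(const struct folio_state *f)
{
	return f->uptodate && !f->writeback;
}

static enum lock_action model_lock_action(const struct folio_state *f,
					  bool allow_retry_first)
{
	if (!f->locked_by_other)
		return ACT_LOCKED;
	if (allow_retry_first && !model_wait_is_short(f))
		return ACT_DROP_AND_RETRY;
	return ACT_SLEEP_HOLDING;
}
```

Under this model an uptodate folio that is still under writeback is treated like a folio waiting for read I/O: the fault lock is dropped and the fault is retried.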
* [PATCH v2 5/5] mm/filemap: Avoid retrying page faults on uptodate folios in filemap faults
2026-04-30 4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
` (3 preceding siblings ...)
2026-04-30 4:04 ` [PATCH v2 4/5] mm: Don't retry page fault if folio is uptodate during swap-in Barry Song (Xiaomi)
@ 2026-04-30 4:04 ` Barry Song (Xiaomi)
2026-04-30 12:37 ` [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Matthew Wilcox
5 siblings, 0 replies; 56+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-30 4:04 UTC (permalink / raw)
To: akpm, linux-mm, willy
Cc: david, ljs, liam, vbabka, rppt, surenb, mhocko, jack, pfalcato,
wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
Barry Song (Xiaomi)
For uptodate folios, we are not waiting on I/O. We should
be able to acquire the folio lock shortly, so there is no
need to drop the per-VMA lock and perform a full page fault retry.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
mm/filemap.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/mm/filemap.c b/mm/filemap.c
index b532d6cbafc8..0d2f6af5d0fe 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3533,6 +3533,13 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
}
}
+ /*
+ * If the folio is uptodate, we are likely only waiting for
+ * another concurrent PTE mapping to complete, which should
+ * be brief. No need to drop the lock and retry the fault.
+ */
+ if (folio_test_uptodate(folio))
+ vmf->flags &= ~FAULT_FLAG_ALLOW_RETRY;
if (!lock_folio_maybe_drop_mmap(vmf, folio, &fpin))
goto out_retry;
--
2.39.3 (Apple Git-146)
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-04-30 4:04 [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Barry Song (Xiaomi)
` (4 preceding siblings ...)
2026-04-30 4:04 ` [PATCH v2 5/5] mm/filemap: Avoid retrying page faults on uptodate folios in filemap faults Barry Song (Xiaomi)
@ 2026-04-30 12:37 ` Matthew Wilcox
2026-04-30 22:49 ` Barry Song
2026-05-01 15:52 ` Lorenzo Stoakes
5 siblings, 2 replies; 56+ messages in thread
From: Matthew Wilcox @ 2026-04-30 12:37 UTC (permalink / raw)
To: Barry Song (Xiaomi)
Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Thu, Apr 30, 2026 at 12:04:22PM +0800, Barry Song (Xiaomi) wrote:
> (1) If we need to wait for I/O completion, we still drop the per-VMA lock, as
> current page fault handling already does. Holding it for too long may introduce
> various priority inversion issues on mobile devices. After I/O completes, we
> retry the page fault with the per-VMA lock, rather than falling back to
> mmap_lock.
You're going to have to do better than that. You know I hate the
additional complexity you're adding. You need to explain why my idea of
ripping out all the complexity now that we have per-VMA locks doesn't
work.
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-04-30 12:37 ` [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Matthew Wilcox
@ 2026-04-30 22:49 ` Barry Song
2026-05-01 14:56 ` Matthew Wilcox
2026-05-01 15:52 ` Lorenzo Stoakes
1 sibling, 1 reply; 56+ messages in thread
From: Barry Song @ 2026-04-30 22:49 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Thu, Apr 30, 2026 at 8:37 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Apr 30, 2026 at 12:04:22PM +0800, Barry Song (Xiaomi) wrote:
> > (1) If we need to wait for I/O completion, we still drop the per-VMA lock, as
> > current page fault handling already does. Holding it for too long may introduce
> > various priority inversion issues on mobile devices. After I/O completes, we
> > retry the page fault with the per-VMA lock, rather than falling back to
> > mmap_lock.
>
> You're going to have to do better than that. You know I hate the
> additional complexity you're adding. You need to explain why my idea of
> ripping out all the complexity now that we have per-VMA locks doesn't
> work.
Yep, I know you don’t like the added complexity, but I would rather prioritize
user experience over simplicity. Let me try to explain in more detail.
1. There is no deterministic latency for I/O completion. It depends on
both the hardware and the software stack (bio/request queues and the
block scheduler). Sometimes the latency is short; at other times it can
be quite long. In such cases, a high-priority thread performing operations
such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
for an unpredictable amount of time. For example, if low-priority tasks
trigger page faults and issue low-priority I/O, a high-priority task
requiring the write lock may end up waiting for an unknown amount of time,
depending on the block layer and filesystem behavior.
As a result, high-priority tasks are exposed to unpredictable I/O latency
introduced by many low-priority tasks that may generate a large number of
page faults.
On Android, latency in certain tasks can significantly affect user experience,
such as interactive threads. Priority inversion is particularly problematic and
should be avoided, especially since we have no clear bound on how long we may
have to wait for I/O from other tasks.
Meanwhile, priority inversion can propagate through a long chain: a writer
waiting on I/O from multiple concurrent page faults may end up blocking other
writers and readers as well. A long-waiting writer can also amplify
mmap_lock contention, which we still rely on in many cases.
2. VMA sizes can be highly uneven: some VMAs may be very large while others are
small. We used to have many reasons to release mmap_lock when we did not have a
per-VMA lock. Since VMA sizes are not uniform, those same considerations may
still apply to the per-VMA lock when a small number of VMAs account for most
of a process’s address space. I recall that Suren also mentioned this[1].
So I would prefer that we hold only the per-VMA lock and avoid retrying the
page fault when we are reasonably sure that I/O has already completed and we
are only waiting for short-lived conditions. Uncertainties in the block layer,
filesystem, and GC behavior, as well as latency-induced priority inversion
chains and potentially amplified mmap_lock contention, can significantly hurt
Android user experience.
[1] https://lore.kernel.org/linux-mm/CAJuCfpFVQJtvbj5fV2fmm4APhNZDL1qPg-YExw7gO1pmngC3Rw@mail.gmail.com/
Thanks
Barry
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-04-30 22:49 ` Barry Song
@ 2026-05-01 14:56 ` Matthew Wilcox
2026-05-01 17:44 ` Barry Song
0 siblings, 1 reply; 56+ messages in thread
From: Matthew Wilcox @ 2026-05-01 14:56 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> 1. There is no deterministic latency for I/O completion. It depends on
> both the hardware and the software stack (bio/request queues and the
> block scheduler). Sometimes the latency is short; at other times it can
> be quite long. In such cases, a high-priority thread performing operations
> such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> for an unpredictable amount of time.
But does that actually happen? I find it hard to believe that thread A
unmaps a VMA while thread B is in the middle of taking a page fault in
that same VMA. mprotect() and madvise() are more likely to happen, but
it still seems really unlikely to me.
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-01 14:56 ` Matthew Wilcox
@ 2026-05-01 17:44 ` Barry Song
2026-05-01 17:57 ` Matthew Wilcox
0 siblings, 1 reply; 56+ messages in thread
From: Barry Song @ 2026-05-01 17:44 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > 1. There is no deterministic latency for I/O completion. It depends on
> > both the hardware and the software stack (bio/request queues and the
> > block scheduler). Sometimes the latency is short; at other times it can
> > be quite long. In such cases, a high-priority thread performing operations
> > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > for an unpredictable amount of time.
>
> But does that actually happen? I find it hard to believe that thread A
> unmaps a VMA while thread B is in the middle of taking a page fault in
> that same VMA. mprotect() and madvise() are more likely to happen, but
> it still seems really unlikely to me.
It doesn’t have to involve unmapping or applying mprotect to
the entire VMA—just a portion of it is sufficient.
BTW, the chain can propagate: a page fault occurs, B wants to write this
VMA, and C (a higher-priority task) wants to write another VMA. D may need
to iterate VMAs under mmap_lock, so B can end up blocking both C and D.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-01 17:44 ` Barry Song
@ 2026-05-01 17:57 ` Matthew Wilcox
2026-05-01 18:25 ` Barry Song
` (2 more replies)
0 siblings, 3 replies; 56+ messages in thread
From: Matthew Wilcox @ 2026-05-01 17:57 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > 1. There is no deterministic latency for I/O completion. It depends on
> > > both the hardware and the software stack (bio/request queues and the
> > > block scheduler). Sometimes the latency is short; at other times it can
> > > be quite long. In such cases, a high-priority thread performing operations
> > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > for an unpredictable amount of time.
> >
> > But does that actually happen? I find it hard to believe that thread A
> > unmaps a VMA while thread B is in the middle of taking a page fault in
> > that same VMA. mprotect() and madvise() are more likely to happen, but
> > it still seems really unlikely to me.
>
> It doesn’t have to involve unmapping or applying mprotect to
> the entire VMA—just a portion of it is sufficient.
Yes, but that still fails to answer "does this actually happen". How much
performance is all this complexity in the page fault handler buying us?
If you don't answer this question, I'm just going to go in and rip it
all out.
> BTW, the chain can propagate: a page fault occurs, B wants to write this
> VMA, and C (a higher-priority task) wants to write another VMA. D may need
> to iterate VMAs under mmap_lock, so B can end up blocking both C and D.
I know.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-01 17:57 ` Matthew Wilcox
@ 2026-05-01 18:25 ` Barry Song
2026-05-01 19:39 ` Matthew Wilcox
2026-05-03 13:13 ` Jan Kara
2026-05-17 8:45 ` Barry Song
2 siblings, 1 reply; 56+ messages in thread
From: Barry Song @ 2026-05-01 18:25 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > both the hardware and the software stack (bio/request queues and the
> > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > be quite long. In such cases, a high-priority thread performing operations
> > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > for an unpredictable amount of time.
> > >
> > > But does that actually happen? I find it hard to believe that thread A
> > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > it still seems really unlikely to me.
> >
> > It doesn’t have to involve unmapping or applying mprotect to
> > the entire VMA—just a portion of it is sufficient.
>
> Yes, but that still fails to answer "does this actually happen". How much
> performance is all this complexity in the page fault handler buying us?
> If you don't answer this question, I'm just going to go in and rip it
> all out.
I’m getting quite confused. In patch 4/5, you suggest a more
restrictive condition using
if (folio_test_uptodate(folio) && !folio_test_writeback(folio))
rather than if (folio_test_uptodate(folio)), before we decide to skip
retrying the page fault [1].
That seems to suggest we should be more cautious about when we can skip
retrying the page fault.
However, in the cover letter, you suggest removing all retry code entirely.
Does this suggestion apply only to file-backed page faults?
[1] https://lore.kernel.org/linux-mm/afTQl12XcXVnku9J@casper.infradead.org/
Best Regards
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-01 18:25 ` Barry Song
@ 2026-05-01 19:39 ` Matthew Wilcox
2026-05-03 20:39 ` Barry Song
0 siblings, 1 reply; 56+ messages in thread
From: Matthew Wilcox @ 2026-05-01 19:39 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Sat, May 02, 2026 at 02:25:37AM +0800, Barry Song wrote:
> On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > Yes, but that still fails to answer "does this actually happen". How much
> > performance is all this complexity in the page fault handler buying us?
> > If you don't answer this question, I'm just going to go in and rip it
> > all out.
>
> I’m getting quite confused. In patch 4/5, you suggest a more
> restrictive condition using
> if (folio_test_uptodate(folio) && !folio_test_writeback(folio))
> rather than if (folio_test_uptodate(folio)), before we decide to skip
> retrying the page fault [1].
> That seems to suggest we should be more cautious about when we can skip
> retrying the page fault.
>
> However, in the cover letter, you suggest removing all retry code entirely.
> Does this suggestion apply only to file-backed page faults?
I'm making sure that if Andrew decides to override me he at least sees
that there are other problems with this patchset beyond "I don't like
the additional complexity".
And maybe we decide to do the fallback for anon-mm but not file memory.
Or maybe it's just something somebody happens upon when reading the
mailing list (or more likely it's just grist for an AI).
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-01 19:39 ` Matthew Wilcox
@ 2026-05-03 20:39 ` Barry Song
0 siblings, 0 replies; 56+ messages in thread
From: Barry Song @ 2026-05-03 20:39 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, surenb, mhocko,
jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Sat, May 2, 2026 at 3:39 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, May 02, 2026 at 02:25:37AM +0800, Barry Song wrote:
> > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > Yes, but that still fails to answer "does this actually happen". How much
> > > performance is all this complexity in the page fault handler buying us?
> > > If you don't answer this question, I'm just going to go in and rip it
> > > all out.
I guess the only way to answer this question is to
remove all retry code for file VMA and run a real test.
For defensive programming, I am generally very cautious
about this approach, but if this is the only way to clarify
whether we still need PF retry for file, I can give it a try
and run a complete test on Android phones after lsf/mm/bpf.
> >
> > I’m getting quite confused. In patch 4/5, you suggest a more
> > restrictive condition using
> > if (folio_test_uptodate(folio) && !folio_test_writeback(folio))
> > rather than if (folio_test_uptodate(folio)), before we decide to skip
> > retrying the page fault [1].
> > That seems to suggest we should be more cautious about when we can skip
> > retrying the page fault.
> >
> > However, in the cover letter, you suggest removing all retry code entirely.
> > Does this suggestion apply only to file-backed page faults?
>
> I'm making sure that if Andrew decides to override me he at least sees
No, I don’t want Andrew to override you unless there is a real PI
issue for file, and only if you still insist on “ripping it out”
after a thorough test with it removed.
> that there are other problems with this patchset beyond "I don't like
> the additional complexity".
The other issue you are pointing out is that, for anon, we
should be more cautious before deciding to skip PF retry,
which seems to be the opposite direction of what you expect
for file.
>
> And maybe we decide to do the fallback for anon-mm but not file memory.
I was targeting a unified approach for both file-backed
and anonymous memory. For example, if anon requires retry
under the per-VMA lock, we may already have the necessary
branch in place that file-backed cases can also leverage.
For anon cases, high-level language GCs can operate on a
small portion of a large heap requiring VMA writes, which
is fairly common, as I explained to Jan.
> Or maybe it's just something somebody happens upon when reading the
> mailing list (or more likely it's just grist for an AI).
Maybe one or two years from now. For now, at least, there are still
humans working on the kernel :-)
Best Regards
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-01 17:57 ` Matthew Wilcox
2026-05-01 18:25 ` Barry Song
@ 2026-05-03 13:13 ` Jan Kara
2026-05-03 19:55 ` Barry Song
2026-05-17 8:45 ` Barry Song
2 siblings, 1 reply; 56+ messages in thread
From: Jan Kara @ 2026-05-03 13:13 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Barry Song, akpm, linux-mm, david, ljs, liam, vbabka, rppt,
surenb, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Fri 01-05-26 18:57:52, Matthew Wilcox wrote:
> On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > both the hardware and the software stack (bio/request queues and the
> > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > be quite long. In such cases, a high-priority thread performing operations
> > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > for an unpredictable amount of time.
> > >
> > > But does that actually happen? I find it hard to believe that thread A
> > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > it still seems really unlikely to me.
> >
> > It doesn’t have to involve unmapping or applying mprotect to
> > the entire VMA—just a portion of it is sufficient.
>
> Yes, but that still fails to answer "does this actually happen". How much
> performance is all this complexity in the page fault handler buying us?
> If you don't answer this question, I'm just going to go in and rip it
> all out.
I fully agree with you we should verify whether the retry code still brings
in real-world advantage today with VMA locks. After all the retry logic has
been introduced in 2010. That being said if there are realistic loads where
one thread needs VMA write lock while another thread is faulting the VMA,
then the latencies can be indeed extreme. For example things like cgroup IO
throttling happen on the IO path and thus can throttle IO of a low-priority
thread for a long time.
BTW I'm not sure I quite understand Barry's priority inversion problem
since I'd expect all threads of a task to generally be treated with the
same priority...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-03 13:13 ` Jan Kara
@ 2026-05-03 19:55 ` Barry Song
2026-05-04 13:03 ` Jan Kara
0 siblings, 1 reply; 56+ messages in thread
From: Barry Song @ 2026-05-03 19:55 UTC (permalink / raw)
To: Jan Kara
Cc: Matthew Wilcox, akpm, linux-mm, david, ljs, liam, vbabka, rppt,
surenb, mhocko, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Mon, May 4, 2026 at 2:17 AM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 01-05-26 18:57:52, Matthew Wilcox wrote:
> > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > both the hardware and the software stack (bio/request queues and the
> > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > for an unpredictable amount of time.
> > > >
> > > > But does that actually happen? I find it hard to believe that thread A
> > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > it still seems really unlikely to me.
> > >
> > > It doesn’t have to involve unmapping or applying mprotect to
> > > the entire VMA—just a portion of it is sufficient.
> >
> > Yes, but that still fails to answer "does this actually happen". How much
> > performance is all this complexity in the page fault handler buying us?
> > If you don't answer this question, I'm just going to go in and rip it
> > all out.
>
> I fully agree with you we should verify whether the retry code still brings
> in real-world advantage today with VMA locks. After all the retry logic has
> been introduced in 2010. That being said if there are realistic loads where
> one thread needs VMA write lock while another thread is faulting the VMA,
> then the latencies can be indeed extreme. For example things like cgroup IO
> throttling happen on the IO path and thus can throttle IO of a low-priority
> thread for a long time.
I’m quite sure that swap-in and VMA writes can occur
concurrently, and this is fairly common. For example,
Java GC may use mprotect or userfaultfd on a small
portion of a large Java heap while other portions are
still under do_swap_page().
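A minimal userspace sketch of that pattern, assuming Linux and a
plain anonymous mapping standing in for the Java heap. The helper
name `toy_protect_one_page()` is hypothetical. It only shows that
a single page inside a large mapping can be re-protected on its
own; on the kernel side that is expected to split the VMA, which
is the kind of VMA write discussed here. It does not measure any
contention.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map an anonymous "heap" of heap_pages pages, then make exactly one
 * page inside it read-only. Returns the result of mprotect() (0 on
 * success), or -1 if the arguments are invalid or the mapping fails.
 */
static int toy_protect_one_page(size_t heap_pages, size_t victim)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t len;
	char *heap;
	int ret;

	assert(page > 0);
	if (victim >= heap_pages)
		return -1;

	len = heap_pages * (size_t)page;
	heap = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (heap == MAP_FAILED)
		return -1;

	/* Only one page changes protection; the rest of the heap is untouched. */
	ret = mprotect(heap + victim * (size_t)page, (size_t)page, PROT_READ);

	munmap(heap, len);
	return ret;
}
```

Calling `toy_protect_one_page(16, 8)` re-protects page 8 of a
16-page mapping, while other threads could still be faulting on
the remaining pages.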
If we start exploring different approaches for anon and
file, I agree I can revisit this on an Android phone if
there is a real, serious case where a file VMA can be
written and a page fault occurs at the same time.
Please note that, as an Android developer, I am particularly
cautious about priority inversion. A recent issue causing
severe priority inversion is zram attempting to support
preemption[1]. When a task performing compression or
decompression is migrated to another CPU and then preempted
by other tasks, high-priority tasks waiting on the mutex may
be significantly delayed, impacting user experience.
>
> BTW I'm not sure I quite understand Barry's priority inversion problem
> since I'd expect all threads of a task to generally be treated with the
> same priority...
Exactly not. Maybe these slides[2] and this project[3] can give
you a hint—they aim to standardize things on Linux by
learning from Apple OS. Basically, tasks are classified
into five types:
USER_INTERACTIVE: Requires immediate response.
USER_INITIATED: Tolerates a short delay, but must respond quickly still.
UTILITY: Tolerates long delays, but not prolonged ones.
BACKGROUND: Doesn’t mind prolonged delays.
DEFAULT: System default behavior.
[1] https://lore.kernel.org/linux-mm/20250303022425.285971-3-senozhatsky@chromium.org/
[2] https://lpc.events/event/19/contributions/2089/attachments/1797/3877/Userspace%20Assisted%20Scheduling%20via%20Sched%20QoS.pdf
[3] https://lore.kernel.org/lkml/20260415000910.2h5misvwc45bdumu@airbuntu/
Thanks
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-03 19:55 ` Barry Song
@ 2026-05-04 13:03 ` Jan Kara
2026-05-04 13:35 ` Barry Song
2026-05-04 14:15 ` Barry Song
0 siblings, 2 replies; 56+ messages in thread
From: Jan Kara @ 2026-05-04 13:03 UTC (permalink / raw)
To: Barry Song
Cc: Jan Kara, Matthew Wilcox, akpm, linux-mm, david, ljs, liam,
vbabka, rppt, surenb, mhocko, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390
On Mon 04-05-26 03:55:43, Barry Song wrote:
> On Mon, May 4, 2026 at 2:17 AM Jan Kara <jack@suse.cz> wrote:
> > On Fri 01-05-26 18:57:52, Matthew Wilcox wrote:
> > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > for an unpredictable amount of time.
> > > > >
> > > > > But does that actually happen? I find it hard to believe that thread A
> > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > > it still seems really unlikely to me.
> > > >
> > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > the entire VMA—just a portion of it is sufficient.
> > >
> > > Yes, but that still fails to answer "does this actually happen". How much
> > > performance is all this complexity in the page fault handler buying us?
> > > If you don't answer this question, I'm just going to go in and rip it
> > > all out.
> >
> > I fully agree with you we should verify whether the retry code still brings
> > in real-world advantage today with VMA locks. After all the retry logic has
> > been introduced in 2010. That being said if there are realistic loads where
> > one thread needs VMA write lock while another thread is faulting the VMA,
> > then the latencies can be indeed extreme. For example things like cgroup IO
> > throttling happen on the IO path and thus can throttle IO of a low-priority
> > thread for a long time.
>
> I’m quite sure that swap-in and VMA writes can occur
> concurrently, and this is fairly common. For example,
> Java GC may use mprotect or userfaultfd on a small
> portion of a large Java heap while other portions are
> still under do_swap_page().
OK, makes sense.
> If we start exploring different approaches for anon and
> file, I agree I can revisit this on an Android phone if
> there is a real, serious case where a file VMA can be
> written and a page fault occurs at the same time.
>
> Please note that, as an Android developer, I am particularly
> cautious about priority inversion. A recent issue causing
> severe priority inversion is zram attempting to support
> preemption[1]. When a task performing compression or
> decompression is migrated to another CPU and then preempted
> by other tasks, high-priority tasks waiting on the mutex may
> be significantly delayed, impacting user experience.
Well, container people are concerned about priority inversion as well. But
usually this is with coarse lock (such as global filesystem locks) but VMA
lock is specific to a task (and a VMA) so there the opportunity for
priority inversion looks more limited. But the example with Java where GC
thread can presumably have higher priority than ordinary Java threads is an
interesting one.
> > BTW I'm not sure I quite understand Barry's priority inversion problem
> > since I'd expect all threads of a task to generally be treated with the
> > same priority...
>
> Exactly not. Maybe these slides[2] and this project[3] can give
> you a hint—they aim to standardize things on Linux by
> learning from Apple OS. Basically, tasks are classified
> into five types:
>
> USER_INTERACTIVE: Requires immediate response.
> USER_INITIATED: Tolerates a short delay, but must respond quickly still.
> UTILITY: Tolerates long delays, but not prolonged ones.
> BACKGROUND: Doesn’t mind prolonged delays.
> DEFAULT: System default behavior.
Again, this is a classification of tasks but not really of threads in a task
so at least for VMA lock there's no inversion to have?
Honza
> [1] https://lore.kernel.org/linux-mm/20250303022425.285971-3-senozhatsky@chromium.org/
> [2] https://lpc.events/event/19/contributions/2089/attachments/1797/3877/Userspace%20Assisted%20Scheduling%20via%20Sched%20QoS.pdf
> [3] https://lore.kernel.org/lkml/20260415000910.2h5misvwc45bdumu@airbuntu/
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-04 13:03 ` Jan Kara
@ 2026-05-04 13:35 ` Barry Song
2026-05-04 14:15 ` Barry Song
1 sibling, 0 replies; 56+ messages in thread
From: Barry Song @ 2026-05-04 13:35 UTC (permalink / raw)
To: Jan Kara
Cc: Matthew Wilcox, akpm, linux-mm, david, ljs, liam, vbabka, rppt,
surenb, mhocko, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Mon, May 4, 2026 at 9:04 PM Jan Kara <jack@suse.cz> wrote:
[...]
>
> > > BTW I'm not sure I quite understand Barry's priority inversion problem
> > > since I'd expect all threads of a task to generally be treated with the
> > > same priority...
> >
> > Exactly not. Maybe these slides[2] and this project[3] can give
> > you a hint—they aim to standardize things on Linux by
> > learning from Apple OS. Basically, tasks are classified
> > into five types:
> >
> > USER_INTERACTIVE: Requires immediate response.
> > USER_INITIATED: Tolerates a short delay, but must respond quickly still.
> > UTILITY: Tolerates long delays, but not prolonged ones.
> > BACKGROUND: Doesn’t mind prolonged delays.
> > DEFAULT: System default behavior.
>
> Again, this is a classification of tasks but not really of threads in a task
> so at least for VMA lock there's no inversion to have?
I’m specifically referring to a task (i.e., a thread) when
discussing scheduler context. It may be clearer to use the
terms process and thread explicitly.
In a typical process sharing an mm_struct, each thread can
have a different priority.
In an Android app, some threads handle the UI and require
higher priority, such as the main thread and RenderThread;
otherwise, frame drops may occur.
The Linux scheduler can control scheduling policy and
priority for each thread.
Thanks
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-04 13:03 ` Jan Kara
2026-05-04 13:35 ` Barry Song
@ 2026-05-04 14:15 ` Barry Song
1 sibling, 0 replies; 56+ messages in thread
From: Barry Song @ 2026-05-04 14:15 UTC (permalink / raw)
To: Jan Kara
Cc: Matthew Wilcox, akpm, linux-mm, david, ljs, liam, vbabka, rppt,
surenb, mhocko, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Mon, May 4, 2026 at 9:04 PM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 04-05-26 03:55:43, Barry Song wrote:
> > On Mon, May 4, 2026 at 2:17 AM Jan Kara <jack@suse.cz> wrote:
> > > On Fri 01-05-26 18:57:52, Matthew Wilcox wrote:
> > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > for an unpredictable amount of time.
> > > > > >
> > > > > > But does that actually happen? I find it hard to believe that thread A
> > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > > > it still seems really unlikely to me.
> > > > >
> > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > the entire VMA—just a portion of it is sufficient.
> > > >
> > > > Yes, but that still fails to answer "does this actually happen". How much
> > > > performance is all this complexity in the page fault handler buying us?
> > > > If you don't answer this question, I'm just going to go in and rip it
> > > > all out.
> > >
> > > I fully agree with you we should verify whether the retry code still brings
> > > in real-world advantage today with VMA locks. After all the retry logic has
> > > been introduced in 2010. That being said if there are realistic loads where
> > > one thread needs VMA write lock while another thread is faulting the VMA,
> > > then the latencies can be indeed extreme. For example things like cgroup IO
> > > throttling happen on the IO path and thus can throttle IO of a low-priority
> > > thread for a long time.
> >
> > I’m quite sure that swap-in and VMA writes can occur
> > concurrently, and this is fairly common. For example,
> > Java GC may use mprotect or userfaultfd on a small
> > portion of a large Java heap while other portions are
> > still under do_swap_page().
>
> OK, makes sense.
>
> > If we start exploring different approaches for anon and
> > file, I agree I can revisit this on an Android phone if
> > there is a real, serious case where a file VMA can be
> > written and a page fault occurs at the same time.
> >
> > Please note that, as an Android developer, I am particularly
> > cautious about priority inversion. A recent issue causing
> > severe priority inversion is zram attempting to support
> > preemption[1]. When a task performing compression or
> > decompression is migrated to another CPU and then preempted
> > by other tasks, high-priority tasks waiting on the mutex may
> > be significantly delayed, impacting user experience.
>
> Well, container people are concerned about priority inversion as well. But
> usually this is with coarse lock (such as global filesystem locks) but VMA
> lock is specific to a task (and a VMA) so there the opportunity for
> priority inversion looks more limited. But the example with Java where GC
> thread can presumably have higher priority than ordinary Java threads is an
> interesting one.
A major difference in Android apps is that each thread can
affect user experience differently. And it is not simply a matter
of whether a VMA writer has higher or lower priority than a
page-fault (PF) thread performing I/O.
For example, thread A handles a PF; thread B attempts to
modify the VMA where the PF occurs; thread C tries to modify
another VMA (requiring mmap_lock in write mode) or iterate
VMAs (requiring mmap_lock in read mode). Regardless of
thread B’s priority, it holds mmap_lock in write mode while
waiting for the VMA lock. The usual pattern for a VMA writer
is:
mmap_write_lock()
vma_start_write()
As a result, thread C can be blocked even if it has higher
priority but operates on a different VMA.
In essence, when a PF and a VMA write occur concurrently,
high-priority threads may be blocked even if they operate on
different VMAs, not necessarily the same one.
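The chain above can be sketched with a toy, single-threaded model.
This is not kernel code: `struct toy_mm` and all `toy_*` helpers
are hypothetical names, and "blocked" is represented by a false
return value instead of a real sleep. The only thing it tries to
capture is that a writer blocked on the per-VMA lock still owns
mmap_lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the two lock levels; not kernel code. */
enum toy_thread { TOY_NONE = 0, TOY_A, TOY_B, TOY_C };

#define TOY_NR_VMAS 2

struct toy_mm {
	enum toy_thread mmap_writer;			/* owner of mmap_lock (write) */
	enum toy_thread vma_reader[TOY_NR_VMAS];	/* fault holding each VMA lock */
};

/* A page fault takes the per-VMA lock in read mode and keeps it across I/O. */
static void toy_fault_begin(struct toy_mm *mm, enum toy_thread who, int vma)
{
	assert(vma >= 0 && vma < TOY_NR_VMAS);
	mm->vma_reader[vma] = who;
}

/* The I/O completed and the fault dropped the per-VMA lock. */
static void toy_fault_end(struct toy_mm *mm, int vma)
{
	assert(vma >= 0 && vma < TOY_NR_VMAS);
	mm->vma_reader[vma] = TOY_NONE;
}

/*
 * Models mmap_write_lock() followed by vma_start_write(). Returns true
 * when both locks are obtained. A false return means the caller is
 * blocked; a caller blocked on the VMA lock still owns mmap_lock.
 */
static bool toy_vma_write_begin(struct toy_mm *mm, enum toy_thread who, int vma)
{
	assert(vma >= 0 && vma < TOY_NR_VMAS);
	if (mm->mmap_writer != TOY_NONE && mm->mmap_writer != who)
		return false;			/* blocked on mmap_lock */
	mm->mmap_writer = who;			/* mmap_write_lock() */
	return mm->vma_reader[vma] == TOY_NONE;	/* vma_start_write() */
}

/* The writer finished and released mmap_lock. */
static void toy_vma_write_end(struct toy_mm *mm)
{
	mm->mmap_writer = TOY_NONE;
}
```

With thread A faulting on VMA 0, `toy_vma_write_begin(&mm, TOY_B, 0)`
fails while leaving B as the mmap_lock owner, so a later
`toy_vma_write_begin(&mm, TOY_C, 1)` also fails even though nobody
holds VMA 1. Both succeed in turn once `toy_fault_end(&mm, 0)` runs.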
Thanks
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-01 17:57 ` Matthew Wilcox
2026-05-01 18:25 ` Barry Song
2026-05-03 13:13 ` Jan Kara
@ 2026-05-17 8:45 ` Barry Song
2026-05-18 9:46 ` Lorenzo Stoakes
` (2 more replies)
2 siblings, 3 replies; 56+ messages in thread
From: Barry Song @ 2026-05-17 8:45 UTC (permalink / raw)
To: Matthew Wilcox, surenb
Cc: akpm, linux-mm, david, ljs, liam, vbabka, rppt, mhocko, jack,
pfalcato, wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1,
chrisl, kasong, shikemeng, nphamcs, bhe, youngjun.park,
linux-arm-kernel, linux-kernel, loongarch, linuxppc-dev,
linux-riscv, linux-s390, Nanzhe Zhao
On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > both the hardware and the software stack (bio/request queues and the
> > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > be quite long. In such cases, a high-priority thread performing operations
> > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > for an unpredictable amount of time.
> > >
> > > But does that actually happen? I find it hard to believe that thread A
> > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > it still seems really unlikely to me.
> >
> > It doesn’t have to involve unmapping or applying mprotect to
> > the entire VMA—just a portion of it is sufficient.
>
> Yes, but that still fails to answer "does this actually happen". How much
> performance is all this complexity in the page fault handler buying us?
> If you don't answer this question, I'm just going to go in and rip it
> all out.
>
Hi Matthew (and Lorenzo, Jan, and anyone else who may be
waiting for answers),
As promised during LSF/MM/BPF, we conducted thorough
testing on Android phones to determine whether performing
I/O in `filemap_fault()` can block `vma_start_write()`.
I wanted to give a quick update on this question.
Nanzhe at Xiaomi created tracing scripts and ran various
applications on Android devices with I/O performed under
the VMA lock in `filemap_fault()`. We found that:
1. There are very few cases where unmap() is blocked by
page faults. I assume this is due to buggy user code
or poor synchronization between reads and unmap().
So I assume it is not a problem.
2. We observed many cases where `vma_start_write()`
is blocked by page-fault I/O in some applications.
The blocking occurs in the `dup_mmap()` path during
fork().
With Suren's commit fb49c455323ff ("fork: lock VMAs of
the parent process when forking"), we now always take the
VMA write lock via `vma_start_write()` for each VMA. Note that the
`mmap_lock` write lock is also held, which could lead to
chained waiting if page-fault I/O is performed without
releasing the VMA lock.
My gut feeling is that Suren's commit may be overshooting,
so my rough idea is that we might want to do something like
the following (we haven't tested it yet and it might be
wrong):
diff --git a/mm/mmap.c b/mm/mmap.c
index 2311ae7c2ff4..5ddaf297f31a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
for_each_vma(vmi, mpnt) {
struct file *file;
- retval = vma_start_write_killable(mpnt);
+ /*
+ * For anonymous or writable private VMAs, prevent
+ * concurrent CoW faults.
+ */
+ if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
+ (mpnt->vm_flags & VM_WRITE)))
+ retval = vma_start_write_killable(mpnt);
if (retval < 0)
goto loop_out;
if (mpnt->vm_flags & VM_DONTCOPY) {
Based on the above, we may want to re-check whether fork()
can be blocked by page faults. At the same time, if Suren,
you, or anyone else has any comments, please feel free to
share them.
Best Regards
Barry
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-17 8:45 ` Barry Song
@ 2026-05-18 9:46 ` Lorenzo Stoakes
2026-05-18 11:25 ` Barry Song
2026-05-18 9:53 ` David Hildenbrand (Arm)
2026-05-18 21:21 ` Yang Shi
2 siblings, 1 reply; 56+ messages in thread
From: Lorenzo Stoakes @ 2026-05-18 9:46 UTC (permalink / raw)
To: Barry Song
Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, liam, vbabka, rppt,
mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > both the hardware and the software stack (bio/request queues and the
> > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > for an unpredictable amount of time.
> > > >
> > > > But does that actually happen? I find it hard to believe that thread A
> > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > it still seems really unlikely to me.
> > >
> > > It doesn’t have to involve unmapping or applying mprotect to
> > > the entire VMA—just a portion of it is sufficient.
> >
> > Yes, but that still fails to answer "does this actually happen". How much
> > performance is all this complexity in the page fault handler buying us?
> > If you don't answer this question, I'm just going to go in and rip it
> > all out.
> >
>
> Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> waiting for answers),
>
> As promised during LSF/MM/BPF, we conducted thorough
> testing on Android phones to determine whether performing
> I/O in `filemap_fault()` can block `vma_start_write()`.
> I wanted to give a quick update on this question.
>
> Nanzhe at Xiaomi created tracing scripts and ran various
> applications on Android devices with I/O performed under
> the VMA lock in `filemap_fault()`. We found that:
>
> 1. There are very few cases where unmap() is blocked by
> page faults. I assume this is due to buggy user code
> or poor synchronization between reads and unmap().
> So I assume it is not a problem.
>
> 2. We observed many cases where `vma_start_write()`
> is blocked by page-fault I/O in some applications.
> The blocking occurs in the `dup_mmap()` path during
> fork().
>
> With Suren's commit fb49c455323ff ("fork: lock VMAs of
> the parent process when forking"), we now always hold
> `vma_write_lock()` for each VMA. Note that the
> `mmap_lock` write lock is also held, which could lead to
> chained waiting if page-fault I/O is performed without
> releasing the VMA lock.
Hm but did you observe this 'chained waiting'? And what were the latencies?
>
> My gut feeling is that Suren's commit may be overshooting,
> so my rough idea is that we might want to do something like
> the following (we haven't tested it yet and it might be
> wrong):
Yeah I'm really not sure about that.
Prior to the VMA locks, the mmap write lock would have guaranteed no concurrent
page faults, which is really what fb49c455323ff is about.
So Suren's patch was essentially restoring the _existing_ forking behaviour, and
now you're saying 'let's change the forking behaviour that's been like that for
forever'.
I think you would _really_ have to be sure that's safe. And forking is a very
dangerous time in terms of complexity and sensitivity and 'weird stuff'
happening so I'd tread _very_ carefully here.
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2311ae7c2ff4..5ddaf297f31a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> *mm, struct mm_struct *oldmm)
> for_each_vma(vmi, mpnt) {
> struct file *file;
>
> - retval = vma_start_write_killable(mpnt);
> + /*
> + * For anonymous or writable private VMAs, prevent
> + * concurrent CoW faults.
> + */
To nitpick, I think the comment's confusing, but it also tells you that you don't
need the specific anon check - writable private is sufficient. And it's not really
just CoW that's the issue, it's anon_vma population _at all_ as well as CoW.
> + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> + (mpnt->vm_flags & VM_WRITE)))
> + retval = vma_start_write_killable(mpnt);
I think this has to be VM_MAYWRITE, because somebody could otherwise mprotect()
it R/W.
I also don't understand why !mpnt->vm_file for a read-only anon mapping (more
likely PROT_NONE) is here, just do the second check?
(Also please use the new interface, so !vma_test(mpnt, VMA_SHARED_BIT) &&
vma_test(mpnt, VMA_MAYWRITE_BIT))
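To make the difference between the two checks concrete, here is a small
userspace sketch. It does not use kernel headers: `struct sketch_vma`, the
`SK_VM_*` bits and both function names are invented for illustration, and the
sketch says nothing about whether skipping the lock is actually safe.

```c
#include <stdbool.h>

/* Stand-in flag bits for illustration only; not taken from kernel headers. */
#define SK_VM_WRITE    0x02UL
#define SK_VM_SHARED   0x08UL
#define SK_VM_MAYWRITE 0x20UL

/* Hypothetical, cut-down view of a VMA. */
struct sketch_vma {
	unsigned long vm_flags;
	bool has_file;		/* stands in for vma->vm_file != NULL */
};

/* The condition as written in the proposed diff. */
bool lock_on_fork_proposed(const struct sketch_vma *vma)
{
	return !vma->has_file ||
	       (!(vma->vm_flags & SK_VM_SHARED) &&
		(vma->vm_flags & SK_VM_WRITE));
}

/* The condition as suggested in review: private and possibly writable later. */
bool lock_on_fork_reviewed(const struct sketch_vma *vma)
{
	return !(vma->vm_flags & SK_VM_SHARED) &&
	       (vma->vm_flags & SK_VM_MAYWRITE);
}
```

The two predicates disagree on a private file mapping that is currently
read-only but still has the may-write bit set: the proposed check skips the
lock, while the reviewed one takes it, which is the mprotect() case raised
above.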
> if (retval < 0)
> goto loop_out;
> if (mpnt->vm_flags & VM_DONTCOPY) {
>
> Based on the above, we may want to re-check whether fork()
> can be blocked by page faults. At the same time, if Suren,
> you, or anyone else has any comments, please feel free to
> share them.
>
> Best Regards
> Barry
Technical commentary above is sort of 'just cos' :) because I really question
doing this honestly.
I'd also like to get Suren's input, however.
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-18 9:46 ` Lorenzo Stoakes
@ 2026-05-18 11:25 ` Barry Song
2026-05-18 16:17 ` Matthew Wilcox
` (2 more replies)
0 siblings, 3 replies; 56+ messages in thread
From: Barry Song @ 2026-05-18 11:25 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, liam, vbabka, rppt,
mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > for an unpredictable amount of time.
> > > > >
> > > > > But does that actually happen? I find it hard to believe that thread A
> > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > > it still seems really unlikely to me.
> > > >
> > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > the entire VMA—just a portion of it is sufficient.
> > >
> > > Yes, but that still fails to answer "does this actually happen". How much
> > > performance is all this complexity in the page fault handler buying us?
> > > If you don't answer this question, I'm just going to go in and rip it
> > > all out.
> > >
> >
> > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > waiting for answers),
> >
> > As promised during LSF/MM/BPF, we conducted thorough
> > testing on Android phones to determine whether performing
> > I/O in `filemap_fault()` can block `vma_start_write()`.
> > I wanted to give a quick update on this question.
> >
> > Nanzhe at Xiaomi created tracing scripts and ran various
> > applications on Android devices with I/O performed under
> > the VMA lock in `filemap_fault()`. We found that:
> >
> > 1. There are very few cases where unmap() is blocked by
> > page faults. I assume this is due to buggy user code
> > or poor synchronization between reads and unmap().
> > So I assume it is not a problem.
> >
> > 2. We observed many cases where `vma_start_write()`
> > is blocked by page-fault I/O in some applications.
> > The blocking occurs in the `dup_mmap()` path during
> > fork().
> >
> > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > the parent process when forking"), we now always hold
> > `vma_write_lock()` for each VMA. Note that the
> > `mmap_lock` write lock is also held, which could lead to
> > chained waiting if page-fault I/O is performed without
> > releasing the VMA lock.
>
> Hm but did you observe this 'chained waiting'? And what were the latencies?
We have clearly observed that the `fork()` operations of many
popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
end up waiting on page-fault (PF) I/O when the VMA lock is
held during I/O operations. This has already become a
practical issue. I also believe this can lead to chained
waiting, since the global `mmap_lock` blocks all threads that
need to acquire it.
>
> >
> > My gut feeling is that Suren's commit may be overshooting,
> > so my rough idea is that we might want to do something like
> > the following (we haven't tested it yet and it might be
> > wrong):
>
> Yeah I'm really not sure about that.
>
> Prior to the VMA locks, the mmap write lock would have guaranteed no concurrent
> page faults, which is really what fb49c455323ff is about.
>
> So Suren's patch was essentially restoring the _existing_ forking behaviour, and
> now you're saying 'let's change the forking behaviour that's been like that for
> forever'.
I am afraid not. Before we introduced the per-VMA lock, we
were not performing I/O while holding `mmap_lock`. A page fault
that needed I/O would drop the `mmap_lock` read lock and allow
`fork()` to proceed.
Now, you are suggesting performing I/O while holding the VMA
lock, which changes the requirements and introduces this
problem.
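The difference between the two models can be pictured with a toy,
single-threaded sketch. Everything in it is hypothetical (`struct toy_vma_lock`
and the `toy_*` helpers are not kernel APIs); it only tracks whether a reader
still holds the lock at the moment a writer such as fork() shows up during
the I/O.

```c
#include <stdbool.h>

/* Toy reader/writer lock: it only counts how many readers hold it. */
struct toy_vma_lock {
	int readers;
};

void toy_read_lock(struct toy_vma_lock *l)   { l->readers++; }
void toy_read_unlock(struct toy_vma_lock *l) { l->readers--; }

/* A writer can only get in while no reader holds the lock. */
bool toy_write_trylock(const struct toy_vma_lock *l)
{
	return l->readers == 0;
}

/*
 * Simulate one fault that needs I/O. Returns true if a writer arriving
 * while the I/O is in flight would have been able to take the lock.
 */
bool writer_can_run_during_io(bool drop_lock_before_io)
{
	struct toy_vma_lock lock = { 0 };
	bool writer_ok;

	toy_read_lock(&lock);			/* fault path takes the lock */
	if (drop_lock_before_io)
		toy_read_unlock(&lock);		/* drop it; the fault is retried later */

	/* ... the I/O would be in flight here ... */
	writer_ok = toy_write_trylock(&lock);

	if (!drop_lock_before_io)
		toy_read_unlock(&lock);		/* held across the I/O, released after */

	return writer_ok;
}
```

In this model `writer_can_run_during_io(true)` lets the writer in and
`writer_can_run_during_io(false)` does not. It ignores retries, fairness and
real scheduling entirely.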
>
> I think you would _really_ have to be sure that's safe. And forking is a very
> dangerous time in terms of complexity and sensitivity and 'weird stuff'
> happening so I'd tread _very_ carefully here.
Yep. I think my original proposal did not require any changes
to `fork()`, since it simply preserved the current behavior of
dropping the VMA lock before performing I/O. In that model,
`fork()` would not end up waiting on I/O at all.
What you are suggesting now appears to be performing I/O while
holding the VMA lock, which in turn introduces the need to
change `fork()`.
>
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 2311ae7c2ff4..5ddaf297f31a 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > *mm, struct mm_struct *oldmm)
> > for_each_vma(vmi, mpnt) {
> > struct file *file;
> >
> > - retval = vma_start_write_killable(mpnt);
> > + /*
> > + * For anonymous or writable private VMAs, prevent
> > + * concurrent CoW faults.
> > + */
>
> To nit pick I think the comment's confusing but also tells you you don't need to
> specific anon check - writable private is sufficient. And it's not really just
> CoW that's the issue, it's anon_vma population _at all_ as well as CoW.
>
> > + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > + (mpnt->vm_flags & VM_WRITE)))
> > + retval = vma_start_write_killable(mpnt);
>
> I think this has to be VM_MAYWRITE, because somebody could otherwise mprotect()
> it R/W.
>
> I also don't understand why !mpnt->vm_file for a read-only anon mapping (more
> likely PROT_NONE) is here, just do the second check?
>
> (Also please use the new interface, so !vma_test(mpnt, VMA_SHARED_BIT) &&
> vma_test(mpnt, VMA_MAYWRITE_BIT))
Yep, I can definitely refine the check further. But before
doing that, I'd first like to confirm that we are aligned on
the direction.
If you still intend to hold the VMA lock while performing I/O,
then I think we should fix `fork()` to avoid taking
`vma_start_write()`.
>
> > if (retval < 0)
> > goto loop_out;
> > if (mpnt->vm_flags & VM_DONTCOPY) {
> >
> > Based on the above, we may want to re-check whether fork()
> > can be blocked by page faults. At the same time, if Suren,
> > you, or anyone else has any comments, please feel free to
> > share them.
> >
> > Best Regards
> > Barry
>
> Technical commentary above is sort of 'just cos' :) because I really question
> doing this honestly.
I think we either need to fix `fork()`, or keep the current
behavior of dropping the VMA lock before performing I/O.
>
> I'd also like to get Suren's input, however.
Yes, of course.
>
> Thanks, Lorenzo
Best Regards
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-18 11:25 ` Barry Song
@ 2026-05-18 16:17 ` Matthew Wilcox
2026-05-18 20:50 ` Barry Song
2026-05-18 19:56 ` Suren Baghdasaryan
2026-05-19 12:43 ` Lorenzo Stoakes
2 siblings, 1 reply; 56+ messages in thread
From: Matthew Wilcox @ 2026-05-18 16:17 UTC (permalink / raw)
To: Barry Song
Cc: Lorenzo Stoakes, surenb, akpm, linux-mm, david, liam, vbabka,
rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Mon, May 18, 2026 at 07:25:54PM +0800, Barry Song wrote:
> We have clearly observed that the `fork()` operations of many
> popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
> end up waiting on page-fault (PF) I/O when the VMA lock is
> held during I/O operations. This has already become a
> practical issue. I also believe this can lead to chained
> waiting, since the global `mmap_lock` blocks all threads that
> need to acquire it.
It's always been a terrible idea to call fork() from a multithreaded
application. For example, this question:
https://stackoverflow.com/questions/53601200/calling-fork-on-a-multithreaded-process
or this lwn thread: https://lwn.net/Articles/674660/
Do we have any insight into why these applications are doing this
horrible thing?
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-18 16:17 ` Matthew Wilcox
@ 2026-05-18 20:50 ` Barry Song
0 siblings, 0 replies; 56+ messages in thread
From: Barry Song @ 2026-05-18 20:50 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Lorenzo Stoakes, surenb, akpm, linux-mm, david, liam, vbabka,
rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Tue, May 19, 2026 at 12:17 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, May 18, 2026 at 07:25:54PM +0800, Barry Song wrote:
> > We have clearly observed that the `fork()` operations of many
> > popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
> > end up waiting on page-fault (PF) I/O when the VMA lock is
> > held during I/O operations. This has already become a
> > practical issue. I also believe this can lead to chained
> > waiting, since the global `mmap_lock` blocks all threads that
> > need to acquire it.
>
> It's always been a terrible idea to call fork() from a multithreaded
> application. For example, this question:
>
> https://stackoverflow.com/questions/53601200/calling-fork-on-a-multithreaded-process
>
> or this lwn thread: https://lwn.net/Articles/674660/
>
> Do we have any insight into why these applications are doing this
> horrible thing?
I swear I read the two links you shared. But the reality
is that as long as people use the Android framework,
even the simplest "Hello World" app already runs with
10+ threads :-)
main
RenderThread
ReferenceQueueDaemon
FinalizerDaemon
FinalizerWatchdogDaemon
HeapTaskDaemon
Binder:1234_1
Binder:1234_2
Signal Catcher
JDWP
...
Best Regards
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-18 11:25 ` Barry Song
2026-05-18 16:17 ` Matthew Wilcox
@ 2026-05-18 19:56 ` Suren Baghdasaryan
2026-05-18 21:14 ` Barry Song
2026-05-19 12:53 ` Lorenzo Stoakes
2026-05-19 12:43 ` Lorenzo Stoakes
2 siblings, 2 replies; 56+ messages in thread
From: Suren Baghdasaryan @ 2026-05-18 19:56 UTC (permalink / raw)
To: Barry Song
Cc: Lorenzo Stoakes, Matthew Wilcox, akpm, linux-mm, david, liam,
vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Mon, May 18, 2026 at 4:26 AM Barry Song <baohua@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> > > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > for an unpredictable amount of time.
> > > > > >
> > > > > > But does that actually happen? I find it hard to believe that thread A
> > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > > > it still seems really unlikely to me.
> > > > >
> > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > the entire VMA—just a portion of it is sufficient.
> > > >
> > > > Yes, but that still fails to answer "does this actually happen". How much
> > > > performance is all this complexity in the page fault handler buying us?
> > > > If you don't answer this question, I'm just going to go in and rip it
> > > > all out.
> > > >
> > >
> > > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > > waiting for answers),
> > >
> > > As promised during LSF/MM/BPF, we conducted thorough
> > > testing on Android phones to determine whether performing
> > > I/O in `filemap_fault()` can block `vma_start_write()`.
> > > I wanted to give a quick update on this question.
> > >
> > > Nanzhe at Xiaomi created tracing scripts and ran various
> > > applications on Android devices with I/O performed under
> > > the VMA lock in `filemap_fault()`. We found that:
> > >
> > > 1. There are very few cases where unmap() is blocked by
> > > page faults. I assume this is due to buggy user code
> > > or poor synchronization between reads and unmap().
> > > So I assume it is not a problem.
> > >
> > > 2. We observed many cases where `vma_start_write()`
> > > is blocked by page-fault I/O in some applications.
> > > The blocking occurs in the `dup_mmap()` path during
> > > fork().
> > >
> > > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > > the parent process when forking"), we now always hold
> > > `vma_write_lock()` for each VMA. Note that the
> > > `mmap_lock` write lock is also held, which could lead to
> > > chained waiting if page-fault I/O is performed without
> > > releasing the VMA lock.
> >
> > Hm but did you observe this 'chained waiting'? And what were the latencies?
>
> We have clearly observed that the `fork()` operations of many
> popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
> end up waiting on page-fault (PF) I/O when the VMA lock is
> held during I/O operations. This has already become a
> practical issue. I also believe this can lead to chained
> waiting, since the global `mmap_lock` blocks all threads that
> need to acquire it.
>
>
> >
> > >
> > > My gut feeling is that Suren's commit may be overshooting,
> > > so my rough idea is that we might want to do something like
> > > the following (we haven't tested it yet and it might be
> > > wrong):
> >
> > Yeah I'm really not sure about that.
> >
> > Prior to the VMA locks, the mmap write lock would have guaranteed no concurrent
> > page faults, which is really what fb49c455323ff is about.
> >
> > So Suren's patch was essentially restoring the _existing_ forking behaviour, and
> > now you're saying 'let's change the forking behaviour that's been like that for
> > forever'.
>
>
> I am afraid not. Before we introduced the per-VMA lock, we
> were not performing I/O while holding `mmap_lock`. A page fault
> that needed I/O would drop the `mmap_lock` read lock and allow
> `fork()` to proceed.
>
> Now, you are suggesting performing I/O while holding the VMA
> lock, which changes the requirements and introduces this
> problem.
>
> >
> > I think you would _really_ have to be sure that's safe. And forking is a very
> > dangerous time in terms of complexity and sensitivity and 'weird stuff'
> > happening so I'd tread _very_ carefully here.
>
> Yep. I think my original proposal did not require any changes
> to `fork()`, since it simply preserved the current behavior of
> dropping the VMA lock before performing I/O. In that model,
> `fork()` would not end up waiting on I/O at all.
>
> What you are suggesting now appears to be performing I/O while
> holding the VMA lock, which in turn introduces the need to
> change `fork()`.
>
> >
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 2311ae7c2ff4..5ddaf297f31a 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > > *mm, struct mm_struct *oldmm)
> > > for_each_vma(vmi, mpnt) {
> > > struct file *file;
> > >
> > > - retval = vma_start_write_killable(mpnt);
> > > + /*
> > > + * For anonymous or writable private VMAs, prevent
> > > + * concurrent CoW faults.
> > > + */
> >
> > To nit pick I think the comment's confusing but also tells you you don't need to
> > specific anon check - writable private is sufficient. And it's not really just
> > CoW that's the issue, it's anon_vma population _at all_ as well as CoW.
> >
> > > + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > > + (mpnt->vm_flags & VM_WRITE)))
> > > + retval = vma_start_write_killable(mpnt);
> >
> > I think this has to be VM_MAYWRITE, because somebody could otherwise mprotect()
> > it R/W.
> >
> > I also don't understand why !mpnt->vm_file for a read-only anon mapping (more
> > likely PROT_NONE) is here, just do the second check?
> >
> > (Also please use the new interface, so !vma_test(mpnt, VMA_SHARED_BIT) &&
> > vma_test(mpnt, VMA_MAYWRITE_BIT))
>
> Yep, I can definitely refine the check further. But before
> doing that, I'd first like to confirm that we are aligned on
> the direction.
>
> If you still intend to hold the VMA lock while performing I/O,
> then I think we should fix `fork()` to avoid taking
> `vma_start_write()`.
>
> >
> > > if (retval < 0)
> > > goto loop_out;
> > > if (mpnt->vm_flags & VM_DONTCOPY) {
> > >
> > > Based on the above, we may want to re-check whether fork()
> > > can be blocked by page faults. At the same time, if Suren,
> > > you, or anyone else has any comments, please feel free to
> > > share them.
> > >
> > > Best Regards
> > > Barry
> >
> > Technical commentary above is sort of 'just cos' :) because I really question
> > doing this honestly.
>
> I think we either need to fix `fork()`, or keep the current
> behavior of dropping the VMA lock before performing I/O.
I see. So, this problem arises from the fact that we are changing
page faults that require I/O to hold the VMA lock...
And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
anonymous and COW VMAs only while holding mmap_write_lock, preventing
any VMA modification. On the surface, that looks ok to me but I might
be missing some corner cases. If nobody sees any obvious issues, I
think it's worth a try.
>
> >
> > I'd also like to get Suren's input, however.
>
> Yes. of course.
>
> >
> > Thanks, Lorenzo
>
> Best Regards
> Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-18 19:56 ` Suren Baghdasaryan
@ 2026-05-18 21:14 ` Barry Song
2026-05-19 12:45 ` Lorenzo Stoakes
2026-05-19 14:17 ` Liam R. Howlett
2026-05-19 12:53 ` Lorenzo Stoakes
1 sibling, 2 replies; 56+ messages in thread
From: Barry Song @ 2026-05-18 21:14 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Lorenzo Stoakes, Matthew Wilcox, akpm, linux-mm, david, liam,
vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Tue, May 19, 2026 at 3:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, May 18, 2026 at 4:26 AM Barry Song <baohua@kernel.org> wrote:
> >
> > On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > >
> > > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
[...]
> >
> > I think we either need to fix `fork()`, or keep the current
> > behavior of dropping the VMA lock before performing I/O.
>
> I see. So, this problem arises from the fact that we are changing the
> pagefaults requiring I/O operation to hold VMA lock...
> And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> anonymous and COW VMAs only while holding mmap_write_lock, preventing
> any VMA modification. On the surface, that looks ok to me but I might
> be missing some corner cases. If nobody sees any obvious issues, I
> think it's worth a try.
>
Thanks. Besides the creation of processes via fork(), I
am also beginning to worry about the death of processes.
One thing that came to my mind this morning
is that when lowmemorykiller decides to kill an app, we
want the memory to be released as quickly as possible so
the new app or user scenario can get memory sooner.
In that case, if the app being killed is performing I/O
while holding the VMA lock, the unmapping procedure
could end up being blocked as well.
If we release the VMA lock as we currently do, we allow
process exit to proceed.
I haven't thought it through very clearly yet, and I
may be wrong. I'd like to do more investigation. I hope
the apps being killed stay very still, but who knows—we
have so many applications in the market.
Meanwhile, if you have any comments regarding the death
of processes, they would be very welcome.
Best Regards
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-18 21:14 ` Barry Song
@ 2026-05-19 12:45 ` Lorenzo Stoakes
2026-05-19 14:17 ` Liam R. Howlett
1 sibling, 0 replies; 56+ messages in thread
From: Lorenzo Stoakes @ 2026-05-19 12:45 UTC (permalink / raw)
To: Barry Song
Cc: Suren Baghdasaryan, Matthew Wilcox, akpm, linux-mm, david, liam,
vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Tue, May 19, 2026 at 05:14:45AM +0800, Barry Song wrote:
> On Tue, May 19, 2026 at 3:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, May 18, 2026 at 4:26 AM Barry Song <baohua@kernel.org> wrote:
> > >
> > > On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > > >
> > > > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> [...]
> > >
> > > I think we either need to fix `fork()`, or keep the current
> > > behavior of dropping the VMA lock before performing I/O.
> >
> > I see. So, this problem arises from the fact that we are changing the
> > pagefaults requiring I/O operation to hold VMA lock...
> > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > any VMA modification. On the surface, that looks ok to me but I might
> > be missing some corner cases. If nobody sees any obvious issues, I
> > think it's worth a try.
> >
>
> Thanks. Besides the creation of processes via fork(), I
> am also beginning to worry about the death of processes.
>
> One thing that came to my mind this morning
> is that when lowmemorykiller decides to kill an app, we
What's the lowmemorykiller? :P you mean the OOM killer?
> want the memory to be released as quickly as possible so
> the new app or user scenario can get memory sooner.
>
> In that case, if the app being killed is performing I/O
> while holding the VMA lock, the unmapping procedure
> could end up being blocked as well.
>
> If we release the VMA lock as we currently do, we allow
> process exit to proceed.
>
> I haven't thought it through very clearly yet, and I
> may be wrong. I'd like to do more investigation. I hope
> the apps being killed stay very still, but who knows—we
> have so many applications in the market.
Yeah let's tread very carefully please, you're picking two of the most fraught
areas of mm, I'm not going to want to see changes there unless they're
substantially more convincingly argued.
>
> Meanwhile, if you have any comments regarding the death
> of processes, they would be very welcome.
As above, leave it alone please :)
>
> Best Regards
> Barry
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-18 21:14 ` Barry Song
2026-05-19 12:45 ` Lorenzo Stoakes
@ 2026-05-19 14:17 ` Liam R. Howlett
2026-05-19 22:01 ` Barry Song
1 sibling, 1 reply; 56+ messages in thread
From: Liam R. Howlett @ 2026-05-19 14:17 UTC (permalink / raw)
To: Barry Song
Cc: Suren Baghdasaryan, Lorenzo Stoakes, Matthew Wilcox, akpm,
linux-mm, david, vbabka, rppt, mhocko, jack, pfalcato, wanglian,
chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong,
shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
Nanzhe Zhao
On 26/05/19 05:14AM, Barry Song wrote:
> On Tue, May 19, 2026 at 3:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, May 18, 2026 at 4:26 AM Barry Song <baohua@kernel.org> wrote:
> > >
> > > On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > > >
> > > > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> [...]
> > >
> > > I think we either need to fix `fork()`, or keep the current
> > > behavior of dropping the VMA lock before performing I/O.
> >
> > I see. So, this problem arises from the fact that we are changing the
> > pagefaults requiring I/O operation to hold VMA lock...
> > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > any VMA modification. On the surface, that looks ok to me but I might
> > be missing some corner cases. If nobody sees any obvious issues, I
> > think it's worth a try.
From Barry's description, I think what he is saying is that the vma
locking has caused the mmap_lock to become unfair? I think what is
implied is that the per-vma locking may stall mmap_lock writes for
longer than if the mmap_lock was taken in read mode? Barry, is that
correct?
Since Android is doing something (according to Barry) that should not be
done (according to Willy), both of these together are causing slow down?
>
> Thanks. Besides the creation of processes via fork(), I
> am also beginning to worry about the death of processes.
>
> One thing that came to my mind this morning
> is that when lowmemorykiller decides to kill an app, we
> want the memory to be released as quickly as possible so
> the new app or user scenario can get memory sooner.
>
> In that case, if the app being killed is performing I/O
> while holding the VMA lock, the unmapping procedure
> could end up being blocked as well.
>
> If we release the VMA lock as we currently do, we allow
> process exit to proceed.
>
> I haven't thought it through very clearly yet, and I
> may be wrong. I'd like to do more investigation. I hope
> the apps being killed stay very still, but who knows—we
> have so many applications in the market.
>
> Meanwhile, if you have any comments regarding the death
> of processes, they would be very welcome.
The oom killer only cleans out anon/not shared vmas [1]. So, what this
would hold up would be the actual process exit path. Although that
would have resources associated with it, the amount of resources should
be relatively low compared to the amount freed by the oom reaper, right?
The other entry point that's mostly to do with android,
process_mrelease() [2] will end up in the same __oom_reap_task_mm()
function.
So, for the most part, the memory will be freed while the file backed
vma completes IO and that sounds like the right thing to do anyways.
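
As a rough illustration of the "anon/not shared" rule above, here is a minimal user-space sketch of the per-VMA decision as I understand it from __oom_reap_task_mm(). The struct, the flag values and the function name are hypothetical stand-ins for this model only; they are not the kernel's definitions, and the real check should be read from the links below.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this model only; not the kernel's values. */
enum {
	REAP_VM_SHARED  = 1 << 0,
	REAP_VM_HUGETLB = 1 << 1,
	REAP_VM_PFNMAP  = 1 << 2,
};

/* Hypothetical stand-in for struct vm_area_struct. */
struct reap_vma {
	unsigned int flags;
	bool anonymous;	/* stands in for vma_is_anonymous() */
};

/*
 * Model of the reaper's per-VMA choice: skip hugetlb and PFN-mapped VMAs,
 * then unmap only anonymous or non-shared (private) mappings.
 */
static bool reap_would_unmap(const struct reap_vma *vma)
{
	if (vma->flags & (REAP_VM_HUGETLB | REAP_VM_PFNMAP))
		return false;

	return vma->anonymous || !(vma->flags & REAP_VM_SHARED);
}
```

In this model a shared file-backed VMA is left for the normal exit path, while anonymous and private mappings are reaped without waiting for it.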
Thanks,
Liam
[1]. https://elixir.bootlin.com/linux/v7.1-rc4/source/mm/oom_kill.c#L547
[2]. https://elixir.bootlin.com/linux/v6.18.6/source/mm/oom_kill.c#L1210
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-19 14:17 ` Liam R. Howlett
@ 2026-05-19 22:01 ` Barry Song
0 siblings, 0 replies; 56+ messages in thread
From: Barry Song @ 2026-05-19 22:01 UTC (permalink / raw)
To: Liam R. Howlett
Cc: Suren Baghdasaryan, Lorenzo Stoakes, Matthew Wilcox, akpm,
linux-mm, david, vbabka, rppt, mhocko, jack, pfalcato, wanglian,
chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong,
shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
Nanzhe Zhao
On Tue, May 19, 2026 at 10:17 PM Liam R. Howlett <liam@infradead.org> wrote:
>
> On 26/05/19 05:14AM, Barry Song wrote:
> > On Tue, May 19, 2026 at 3:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Mon, May 18, 2026 at 4:26 AM Barry Song <baohua@kernel.org> wrote:
> > > >
> > > > On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > > > >
> > > > > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> > [...]
> > > >
> > > > I think we either need to fix `fork()`, or keep the current
> > > > behavior of dropping the VMA lock before performing I/O.
> > >
> > > I see. So, this problem arises from the fact that we are changing the
> > > pagefaults requiring I/O operation to hold VMA lock...
> > > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > > any VMA modification. On the surface, that looks ok to me but I might
> > > be missing some corner cases. If nobody sees any obvious issues, I
> > > think it's worth a try.
>
> From Barry's description, I think what he is saying is that the vma
> locking has caused the mmap_lock to become unfair? I think what is
For now, we do not have this problem. Before per-VMA
locks, we dropped mmap_lock before doing I/O in the
page-fault path and then retried the page fault. With
per-VMA locks, we drop the VMA lock before doing I/O in
the page-fault path and then retry the page fault.
The problem only starts to exist if we decide to perform
I/O without releasing the VMA lock — which is what Matthew
is suggesting, because it would allow us to rip out a large
amount of page-fault retry code.
> implied is that the per-vma locking may stall mmap_lock writes for
> longer than if the mmap_lock was taken in read mode? Barry, is that
> correct?
Not the case — the actual situation is (if we modify the
current kernel to perform I/O without releasing VMA read locks):
thread 1 PF: lock vma1 read ---- IO ----- ;
thread 2 PF: lock vma2 read ----- IO ----- ;
thread 3 PF: lock vma3 read ---- IO ----- ;
thread 4 fork: mmap_lock_write ---- lock vma1, vma2, vma3 write ;
thread 5 : take mmap_lock for any read/write reason
Now you can see that thread 4 has to wait for the I/O of
VMA1, VMA2, and VMA3 to complete, and thread 5 then has to
wait for thread 4 to release mmap_lock. Both thread 4 and
thread 5 can become extremely slow, because I/O may be stuck
anywhere in the bio/request queue or filesystem GC.
So now we have two choices:
1. Change fork() to avoid taking the vma write lock for vma1/2/3 where possible;
2. Keep the current kernel behavior and drop the VMA lock before I/O:
thread 1 PF: lock vma1 read; drop vma1 read_lock ---- IO ----- retry PF
thread 2 PF: lock vma2 read; drop vma2 read_lock ----- IO ----- retry PF
thread 3 PF: lock vma3 read; drop vma3 read_lock ---- IO ----- retry PF
Option 2 is what mainline is currently doing, and what this
patchset also follows. The only difference in this patchset is
that page faults are retried under the VMA read lock, rather
than under mmap_lock as in the current kernel; retrying under
mmap_lock is what causes the mmap_lock contention.
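
The two timelines above can be put into a toy model. This is only a sketch under simplifying assumptions: every name is hypothetical, each VMA has exactly one faulting reader, all I/O runs concurrently, and no new faults arrive after fork() takes mmap_lock for write. It only shows why fork()'s wait tracks the slowest outstanding I/O when the VMA lock is held across I/O, and vanishes when the lock is dropped first.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model. io_left_ms[i] is the time until the fault on VMA i finishes
 * its I/O, measured from the moment fork() takes mmap_lock for write.
 * Returns how long fork() spends waiting to write-lock all VMAs.
 */
static unsigned long model_fork_wait_ms(const unsigned long *io_left_ms,
					size_t nr_vmas,
					bool hold_vma_lock_during_io)
{
	unsigned long now = 0;
	size_t i;

	/* Option 2: readers dropped the VMA lock before I/O, nothing to wait for. */
	if (!hold_vma_lock_during_io)
		return 0;

	/*
	 * fork() write-locks the VMAs one by one. Each write lock blocks until
	 * that VMA's reader finishes its I/O; I/O on the other VMAs keeps
	 * progressing in the meantime, so the total is the latest finish time.
	 */
	for (i = 0; i < nr_vmas; i++) {
		if (io_left_ms[i] > now)
			now = io_left_ms[i];
	}

	return now;
}
```

With outstanding I/O of 30, 500 and 120 ms, the model has fork() waiting 500 ms while holding mmap_lock for write, and any other mmap_lock user (thread 5 above) queues behind that.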
>
> Since Android is doing something (according to Barry) that should not be
> done (according to Willy), both of these together are causing slow down?
The only thing that would cause slowdown is holding the VMA
lock while performing I/O in the page-fault path, which is not
happening today. It would only happen if we insist on doing I/O
under the VMA lock without changing fork().
>
> >
> > Thanks. Besides the creation of processes via fork(), I
> > am also beginning to worry about the death of processes.
> >
> > One thing that came to my mind this morning
> > is that when lowmemorykiller decides to kill an app, we
> > want the memory to be released as quickly as possible so
> > the new app or user scenario can get memory sooner.
> >
> > In that case, if the app being killed is performing I/O
> > while holding the VMA lock, the unmapping procedure
> > could end up being blocked as well.
> >
> > If we release the VMA lock as we currently do, we allow
> > process exit to proceed.
> >
> > I haven't thought it through very clearly yet, and I
> > may be wrong. I'd like to do more investigation. I hope
> > the apps being killed stay very still, but who knows—we
> > have so many applications in the market.
> >
> > Meanwhile, if you have any comments regarding the death
> > of processes, they would be very welcome.
>
> The oom killer only cleans out anon/not shared vmas [1]. So, what this
> would hold up would be the actual process exit path. Although that
> would have resources associated with it, the amount of resources should
> be relatively low compared to the amount freed by the oom reaper, right?
>
> The other entry point that's mostly to do with android,
> process_mrelease() [2] will end up in the same __oom_reap_task_mm()
> function.
>
> So, for the most part, the memory will be freed while the file backed
> vma completes IO and that sounds like the right thing to do anyways.
Thanks very much for your valuable input!
I’m going to run more experiments to dig deeper into this.
>
> Thanks,
> Liam
>
> [1]. https://elixir.bootlin.com/linux/v7.1-rc4/source/mm/oom_kill.c#L547
> [2]. https://elixir.bootlin.com/linux/v6.18.6/source/mm/oom_kill.c#L1210
>
Best Regards
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-18 19:56 ` Suren Baghdasaryan
2026-05-18 21:14 ` Barry Song
@ 2026-05-19 12:53 ` Lorenzo Stoakes
2026-05-19 21:18 ` Barry Song
` (2 more replies)
1 sibling, 3 replies; 56+ messages in thread
From: Lorenzo Stoakes @ 2026-05-19 12:53 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Barry Song, Matthew Wilcox, akpm, linux-mm, david, liam, vbabka,
rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
> >
> > I think we either need to fix `fork()`, or keep the current
> > behavior of dropping the VMA lock before performing I/O.
>
> I see. So, this problem arises from the fact that we are changing the
> pagefaults requiring I/O operation to hold VMA lock...
> And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> anonymous and COW VMAs only while holding mmap_write_lock, preventing
> any VMA modification. On the surface, that looks ok to me but I might
> be missing some corner cases. If nobody sees any obvious issues, I
> think it's worth a try.
Not sure if you noticed but I did raise concerns ;)
I wonder if you've confused the fault path and fork here, as I think Barry has
been a little unclear on that.
What's being suggested in this thread is to fundamentally change fork behaviour
so it's different from the entire history of the kernel (or - presumably - at
least recent history :) and permit concurrent page faults to occur on a forking
process.
I absolutely object to this for being pretty crazy. I mean I'm not sure we
really want to be simultaneously modifying page tables while invoking
copy_page_range()? No?
OK you cover anon and MAP_PRIVATE file-backed but hang on there's
VM_COPY_ON_FORK too.. so PFN mapped, mixed map and (the accursed) UFFD W/P as
well as possibly-guard region containing VMAs now can have page tables raced.
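
To make the gap concrete, here is a small user-space sketch. The flag values, the struct and the function names are hypothetical stand-ins for this model; only the shape of is_cow_mapping() (private and may-write) follows the kernel helper as I understand it, and guard regions are left out for brevity.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this model only; not the kernel's values. */
enum {
	FORK_VM_SHARED   = 1 << 0,
	FORK_VM_MAYWRITE = 1 << 1,
	FORK_VM_PFNMAP   = 1 << 2,
	FORK_VM_MIXEDMAP = 1 << 3,
	FORK_VM_UFFD_WP  = 1 << 4,
};

/* Hypothetical stand-in for struct vm_area_struct. */
struct fork_vma {
	unsigned int flags;
	bool anonymous;	/* stands in for vma_is_anonymous() */
};

/* Same shape as the kernel's is_cow_mapping(): private and may-write. */
static bool fork_is_cow_mapping(unsigned int flags)
{
	return (flags & (FORK_VM_SHARED | FORK_VM_MAYWRITE)) == FORK_VM_MAYWRITE;
}

/* The predicate proposed in this thread for write-locking a VMA at fork. */
static bool fork_proposed_write_lock(const struct fork_vma *vma)
{
	return vma->anonymous || fork_is_cow_mapping(vma->flags);
}

/* VMAs whose page tables are copied at fork regardless of COW status. */
static bool fork_copies_page_tables(const struct fork_vma *vma)
{
	return (vma->flags & (FORK_VM_PFNMAP | FORK_VM_MIXEDMAP |
			      FORK_VM_UFFD_WP)) != 0;
}

/* True if page tables would be copied while faults may still run. */
static bool fork_copy_can_race(const struct fork_vma *vma)
{
	return fork_copies_page_tables(vma) && !fork_proposed_write_lock(vma);
}
```

In this model a shared, PFN-mapped VMA is not write-locked by the proposed predicate, yet its page tables are still copied, which is exactly the kind of race described above.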
That's not to mention anything else that relies on serialisation here (this
would be changing how forking has been done in general) that we may or may not
know about.
The risk level is high, for what amounts to a hack to work around the fault
issue.
I suggest that if we have a problem with the fault path, let's look at the fault
path :)
So yeah I'm very opposed to this unless I'm somehow horribly mistaken here or a
very convincing argument is made.
>
>
>
>
> >
> > >
> > > I'd also like to get Suren's input, however.
> >
> > Yes. of course.
> >
> > >
> > > Thanks, Lorenzo
> >
> > Best Regards
> > Barry
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-19 12:53 ` Lorenzo Stoakes
@ 2026-05-19 21:18 ` Barry Song
2026-05-20 7:50 ` Lorenzo Stoakes
2026-05-20 5:51 ` Suren Baghdasaryan
2026-05-20 10:33 ` David Hildenbrand (Arm)
2 siblings, 1 reply; 56+ messages in thread
From: Barry Song @ 2026-05-19 21:18 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Suren Baghdasaryan, Matthew Wilcox, akpm, linux-mm, david, liam,
vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Tue, May 19, 2026 at 8:53 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
>
> > >
> > > I think we either need to fix `fork()`, or keep the current
> > > behavior of dropping the VMA lock before performing I/O.
> >
> > I see. So, this problem arises from the fact that we are changing the
> > pagefaults requiring I/O operation to hold VMA lock...
> > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > any VMA modification. On the surface, that looks ok to me but I might
> > be missing some corner cases. If nobody sees any obvious issues, I
> > think it's worth a try.
>
> Not sure if you noticed but I did raise concerns ;)
>
> I wonder if you've confused the fault path and fork here, as I think Barry has
> been a little unclear on that.
I think I’ve been absolutely clear :-)
We should either stick to the current behavior - drop
the VMA lock before doing I/O, or change fork() so that it
does not wait on vma_start_write().
Before per-VMA locks, page faults dropped mmap_lock before
doing I/O. After per-VMA locks, page faults dropped the
VMA lock before doing I/O. In both cases, fork() would not
wait for I/O in the page-fault path.
Now you guys are suggesting performing I/O while holding
the VMA lock, which means fork() must wait for that I/O to
complete. Since an application can have more than 1000
VMAs, and I/O can be stalled for an unpredictable amount
of time in the bio/request queue or filesystem GC, fork()
could end up blocked on multiple VMAs while taking
vma_start_write() for each of them.
As a result, fork() could hold mmap_lock for a very, very,
very long time. fork() itself would become extremely slow,
and any other task needing mmap_lock would also be blocked
behind it.
>
> What's being suggested in this thread is to fundamentally change fork behaviour
> so it's different from the entire history of the kernel (or - presumably - at
> least recent history :) and permit concurrent page faults to occur on a forking
> process.
>
> I absolutely object to this for being pretty crazy. I mean I'm not sure we
> really want to be simultaneously modifying page tables while invoking
> copy_page_range()? No?
If you object to touching fork(), can you at least accept
keeping the existing behavior of dropping the VMA lock
before doing I/O? If you object to both approaches, then I
really do not know how we can continue :-)
Thanks
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-19 21:18 ` Barry Song
@ 2026-05-20 7:50 ` Lorenzo Stoakes
2026-05-20 9:07 ` Barry Song
0 siblings, 1 reply; 56+ messages in thread
From: Lorenzo Stoakes @ 2026-05-20 7:50 UTC (permalink / raw)
To: Barry Song
Cc: Suren Baghdasaryan, Matthew Wilcox, akpm, linux-mm, david, liam,
vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Wed, May 20, 2026 at 05:18:52AM +0800, Barry Song wrote:
> On Tue, May 19, 2026 at 8:53 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
> >
> > > >
> > > > I think we either need to fix `fork()`, or keep the current
> > > > behavior of dropping the VMA lock before performing I/O.
> > >
> > > I see. So, this problem arises from the fact that we are changing the
> > > pagefaults requiring I/O operation to hold VMA lock...
> > > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > > any VMA modification. On the surface, that looks ok to me but I might
> > > be missing some corner cases. If nobody sees any obvious issues, I
> > > think it's worth a try.
> >
> > Not sure if you noticed but I did raise concerns ;)
> >
> > I wonder if you've confused the fault path and fork here, as I think Barry has
> > been a little unclear on that.
>
> I think I’ve been absolutely clear :-)
On this point sure, I would argue less so around the fork stuff but I responded
on that specifically elsewhere so let's keep things moving :>)
> We should either stick to the current behavior - drop
> the VMA lock before doing I/O, or change fork() so that it
> does not wait on vma_start_write().
Again, as I said elsewhere, I think there might be a 3rd way possibly. It's a
big mistake to assume that there are only specific solutions to problems in the
kernel then to present a false dichotomy.
We absolutely hear you on this being a problem and it WILL be addressed one way
or another.
Of the two approaches, as I said elsewhere, I prefer what you've done in this
series to anything touching fork.
But give me time to look through the series please (I'd also suggest RFC'ing
when it's something kinda fundamental that might generate conversation, makes
life a bit easier on the review side :)
>
> Before per-VMA locks, page faults dropped mmap_lock before
> doing I/O. After per-VMA locks, page faults dropped the
> VMA lock before doing I/O. In both cases, fork() would not
> wait for I/O in the page-fault path.
>
> Now you guys are suggesting performing I/O while holding
> the VMA lock, which means fork() must wait for that I/O to
> complete. Since an application can have more than 1000
> VMAs, and I/O can be stalled for an unpredictable amount
> of time in the bio/request queue or filesystem GC, fork()
> could end up blocked on multiple VMAs while taking
> vma_start_write() for each of them.
>
> As a result, fork() could hold mmap_lock for a very, very,
> very long time. fork() itself would become extremely slow,
> and any other task needing mmap_lock would also be blocked
> behind it.
Yep aware, we spoke in Zagreb about this, and on this thread, we know :)
>
> >
> > What's being suggested in this thread is to fundamentally change fork behaviour
> > so it's different from the entire history of the kernel (or - presumably - at
> > least recent history :) and permit concurrent page faults to occur on a forking
> > process.
> >
> > I absolutely object to this for being pretty crazy. I mean I'm not sure we
> > really want to be simultaneously modifying page tables while invoking
> > copy_page_range()? No?
>
> If you object to touching fork(), can you at least accept
> keeping the existing behavior of dropping the VMA lock
> before doing I/O? If you object to both approaches, then I
> really do not know how we can continue :-)
Again as per above, let's not impose a false dichotomy, let's take our time, and
specifically - please give me time to read through the series and think about
this.
>
> Thanks
> Barry
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-20 7:50 ` Lorenzo Stoakes
@ 2026-05-20 9:07 ` Barry Song
2026-05-20 10:07 ` Lorenzo Stoakes
0 siblings, 1 reply; 56+ messages in thread
From: Barry Song @ 2026-05-20 9:07 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Suren Baghdasaryan, Matthew Wilcox, akpm, linux-mm, david, liam,
vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Wed, May 20, 2026 at 3:50 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Wed, May 20, 2026 at 05:18:52AM +0800, Barry Song wrote:
> > On Tue, May 19, 2026 at 8:53 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > >
> > > On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
> > >
> > > > >
> > > > > I think we either need to fix `fork()`, or keep the current
> > > > > behavior of dropping the VMA lock before performing I/O.
> > > >
> > > > I see. So, this problem arises from the fact that we are changing the
> > > > pagefaults requiring I/O operation to hold VMA lock...
> > > > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > > > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > > > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > > > any VMA modification. On the surface, that looks ok to me but I might
> > > > be missing some corner cases. If nobody sees any obvious issues, I
> > > > think it's worth a try.
> > >
> > > Not sure if you noticed but I did raise concerns ;)
> > >
> > > I wonder if you've confused the fault path and fork here, as I think Barry has
> > > been a little unclear on that.
> >
> > I think I’ve been absolutely clear :-)
>
> On this point sure, I would argue less so around the fork stuff but I responded
> on that specifically elsewhere so let's keep things moving :>)
>
> > We should either stick to the current behavior - drop
> > the VMA lock before doing I/O, or change fork() so that it
> > does not wait on vma_start_write().
>
> Again, as I said elsewhere, I think there might be a 3rd way possibly. It's a
> big mistake to assume that there are only specific solutions to problems in the
> kernel then to present a false dichotomy.
I recalled that when we discussed this part in my slides:
‘For simplicity, rather than using a whitelist mechanism for
per-VMA retry, we could use a blacklist instead: default to
always retry via the VMA lock, and only allow mmap_lock-based
page-fault retry for specific cases such as
__vmf_anon_prepare().’
Suren mentioned introducing a FALLBACK flag. With the
FALLBACK flag, we would retry via mmap_lock; with the RETRY
flag, we would retry via the VMA lock.
Not sure whether this could really be called a ‘third way,’
but it seems more like a shift from a whitelist model to a
blacklist model, without changing the fundamental design, but
it does change where we would need to touch the source code.
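
A minimal sketch of that dispatch, assuming two distinct results from the fault handler. All names here are hypothetical and exist only for illustration; the FALLBACK flag is just an idea raised in discussion, not an existing kernel API.

```c
/* Hypothetical outcomes of one fault attempt. */
enum model_fault_result {
	MODEL_FAULT_DONE,	/* fault handled, no retry needed */
	MODEL_FAULT_RETRY,	/* lock was dropped (e.g. for I/O); retry under the VMA lock */
	MODEL_FAULT_FALLBACK,	/* blacklisted case; retry under mmap_lock */
};

/* Hypothetical lock choice for the next attempt. */
enum model_retry_lock {
	MODEL_LOCK_NONE,
	MODEL_LOCK_VMA,
	MODEL_LOCK_MMAP,
};

/*
 * Blacklist model: retrying via the VMA lock is the default, and only
 * handlers that explicitly ask for FALLBACK go back to mmap_lock.
 */
static enum model_retry_lock model_next_lock(enum model_fault_result res)
{
	switch (res) {
	case MODEL_FAULT_RETRY:
		return MODEL_LOCK_VMA;
	case MODEL_FAULT_FALLBACK:
		return MODEL_LOCK_MMAP;
	default:
		return MODEL_LOCK_NONE;
	}
}
```

The point of the sketch is that the decision moves into the handler's return value, so the arch fault code would not need a per-case whitelist.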
>
> We absolutely hear you on this being a problem and it WILL be addressed one way
> or another.
Thanks. This is a bit of light in what has felt like a fairly
dark situation. I really appreciate your thoughtful and
responsible approach.
>
> Of the two approaches, as I said elsewhere, I prefer what you've done in this
> series to anything touching fork.
>
> But give me time to look through the series please (I'd also suggest RFC'ing
> when it's something kinda fundamental that might generate conversation, makes
> life a bit easier on the review side :)
Thanks! Sure, I’m happy to wait and there’s no urgency.
Last year you made quite a significant contribution to the work
when I tried to remove mmap_lock in madvise. I really
appreciated it. Now we’re back to the same lock again, just in
different places.
Best Regards
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-20 9:07 ` Barry Song
@ 2026-05-20 10:07 ` Lorenzo Stoakes
0 siblings, 0 replies; 56+ messages in thread
From: Lorenzo Stoakes @ 2026-05-20 10:07 UTC (permalink / raw)
To: Barry Song
Cc: Suren Baghdasaryan, Matthew Wilcox, akpm, linux-mm, david, liam,
vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Wed, May 20, 2026 at 05:07:16PM +0800, Barry Song wrote:
> On Wed, May 20, 2026 at 3:50 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Wed, May 20, 2026 at 05:18:52AM +0800, Barry Song wrote:
> > > On Tue, May 19, 2026 at 8:53 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > > >
> > > > On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
> > > >
> > > > > >
> > > > > > I think we either need to fix `fork()`, or keep the current
> > > > > > behavior of dropping the VMA lock before performing I/O.
> > > > >
> > > > > I see. So, this problem arises from the fact that we are changing the
> > > > > pagefaults requiring I/O operation to hold VMA lock...
> > > > > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > > > > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > > > > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > > > > any VMA modification. On the surface, that looks ok to me but I might
> > > > > be missing some corner cases. If nobody sees any obvious issues, I
> > > > > think it's worth a try.
> > > >
> > > > Not sure if you noticed but I did raise concerns ;)
> > > >
> > > > I wonder if you've confused the fault path and fork here, as I think Barry has
> > > > been a little unclear on that.
> > >
> > > I think I’ve been absolutely clear :-)
> >
> > On this point sure, I would argue less so around the fork stuff but I responded
> > on that specifically elsewhere so let's keep things moving :>)
> >
> > > We should either stick to the current behavior - drop
> > > the VMA lock before doing I/O, or change fork() so that it
> > > does not wait on vma_start_write().
> >
> > Again, as I said elsewhere, I think there might be a 3rd way possibly. It's a
> > big mistake to assume that there are only specific solutions to problems in the
> > kernel then to present a false dichotomy.
>
> I recalled that when we discussed this part in my slides:
>
> ‘For simplicity, rather than using a whitelist mechanism for
> per-VMA retry, we could use a blacklist instead: default to
> always retry via the VMA lock, and only allow mmap_lock-based
> page-fault retry for specific cases such as
> __vmf_anon_prepare().’
Yeah that's an interesting approach actually, sorry if I missed that.
>
> Suren mentioned introducing a FALLBACK flag. With the
> FALLBACK flag, we would retry via mmap_lock; with the RETRY
> flag, we would retry via the VMA lock.
Yeah, and honestly I'm beginning to wonder if we don't just have to pay the
complexity tax anyway and eat the fact we have to deal with that.
But as per Josef's comment re: this whole mechanism, simply not waiting for
file-backed I think is another option (but I don't recall where we left that
conversation actually?)
Anyway I want to make sure any complexity we add is necessary so will take a
look through patches and have a think (and obviously others will have their own
opinions!)
>
> Not sure whether this could really be called a ‘third way,’
> but it seems more like a shift from a whitelist model to a
> blacklist model, without changing the fundamental design, but
> it does change where we would need to touch the source code.
Right yeah, good to have more options.
>
> >
> > We absolutely hear you on this being a problem and it WILL be addressed one way
> > or another.
>
> Thanks. This is a bit of light in what has felt like a fairly
> dark situation. I really appreciate your thoughtful and
> responsible approach.
Yes, sorry, I maybe was a bit too harsh in my tone here, I didn't really intend
to be negative as to addressing the problem as a whole.
More so, I've been concerned about the fork approach, and that is what's led to
me being, shall we say, 'emphatic' about it :)
But of course I sometimes make mistakes in quite how my tone comes across, so
apologies if it came across overly negatively - I am negative (on a technical
level) about the fork approach, but not the fact we should address this.
To be clear - I'm very glad you've brought this up, it's important, as much as
it's painful that we have this issue in the first place! :)
>
> >
> > Of the two approaches, as I said elsewhere, I prefer what you've done in this
> > series to anything touching fork.
> >
> > But give me time to look through the series please (I'd also suggest RFC'ing
> > when it's something kinda fundamental that might generate conversation, makes
> > life a bit easier on the review side :)
>
> Thanks! Sure, I’m happy to wait and there’s no urgency.
>
> Last year you made quite a significant contribution to the work
> when I tried to remove mmap_lock in madvise. I really
> appreciated it. Now we’re back to the same lock again, just in
> different places.
Yeah :) one day maybe we can get rid of it altogether (maybe I'm dreaming :)
>
> Best Regards
> Barry
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-19 12:53 ` Lorenzo Stoakes
2026-05-19 21:18 ` Barry Song
@ 2026-05-20 5:51 ` Suren Baghdasaryan
2026-05-20 10:33 ` David Hildenbrand (Arm)
2 siblings, 0 replies; 56+ messages in thread
From: Suren Baghdasaryan @ 2026-05-20 5:51 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Barry Song, Matthew Wilcox, akpm, linux-mm, david, liam, vbabka,
rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Tue, May 19, 2026 at 12:53 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
>
> > >
> > > I think we either need to fix `fork()`, or keep the current
> > > behavior of dropping the VMA lock before performing I/O.
> >
> > I see. So, this problem arises from the fact that we are changing the
> > pagefaults requiring I/O operation to hold VMA lock...
> > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > any VMA modification. On the surface, that looks ok to me but I might
> > be missing some corner cases. If nobody sees any obvious issues, I
> > think it's worth a try.
>
> Not sure if you noticed but I did raise concerns ;)
Sorry, I didn't realize your first comment was a conceptual objection
to this approach of allowing page faults to race with the fork.
>
> I wonder if you've confused the fault path and fork here, as I think Barry has
> been a little unclear on that.
>
> What's being suggested in this thread is to fundamentally change fork behaviour
> so it's different from the entire history of the kernel (or - presumably - at
> least recent history :) and permit concurrent page faults to occur on a forking
> process.
>
> I absolutely object to this for being pretty crazy. I mean I'm not sure we
> really want to be simultaneously modifying page tables while invoking
> copy_page_range()? No?
>
> OK you cover anon and MAP_PRIVATE file-backed but hang on there's
> VM_COPY_ON_FORK too.. so PFN mapped, mixed map and (the accursed) UFFD W/P as
> well as possibly-guard region containing VMAs now can have page tables raced.
Ugh, yeah, I realize now this is a minefield. Resolving all possible
races there would not be trivial and might introduce other performance
issues.
>
> That's not to mention anything else that relies on serialisation here (this
> would be changing how forking has been done in general) that we may or may not
> know about.
>
> The risk level is high, for what amounts to a hack to work around the fault
> issue.
>
> I suggest that if we have a problem with the fault path, let's look at the fault
> path :)
>
> So yeah I'm very opposed to this unless I'm somehow horribly mistaken here or a
> very convincing argument is made.
So, the current approach of dropping locks during I/O still sounds like
the best solution.
>
>
> >
> >
> >
> >
> > >
> > > >
> > > > I'd also like to get Suren's input, however.
> > >
> > > Yes. of course.
> > >
> > > >
> > > > Thanks, Lorenzo
> > >
> > > Best Regards
> > > Barry
>
> Cheers, Lorenzo
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-19 12:53 ` Lorenzo Stoakes
2026-05-19 21:18 ` Barry Song
2026-05-20 5:51 ` Suren Baghdasaryan
@ 2026-05-20 10:33 ` David Hildenbrand (Arm)
2026-05-20 12:55 ` Lorenzo Stoakes
2 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-20 10:33 UTC (permalink / raw)
To: Lorenzo Stoakes, Suren Baghdasaryan
Cc: Barry Song, Matthew Wilcox, akpm, linux-mm, liam, vbabka, rppt,
mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On 5/19/26 14:53, Lorenzo Stoakes wrote:
> On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
>
>>>
>>> I think we either need to fix `fork()`, or keep the current
>>> behavior of dropping the VMA lock before performing I/O.
>>
>> I see. So, this problem arises from the fact that we are changing the
>> pagefaults requiring I/O operation to hold VMA lock...
>> And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
>> is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
>> anonymous and COW VMAs only while holding mmap_write_lock, preventing
>> any VMA modification. On the surface, that looks ok to me but I might
>> be missing some corner cases. If nobody sees any obvious issues, I
>> think it's worth a try.
>
> Not sure if you noticed but I did raise concerns ;)
>
> I wonder if you've confused the fault path and fork here, as I think Barry has
> been a little unclear on that.
>
> What's being suggested in this thread is to fundamentally change fork behaviour
> so it's different from the entire history of the kernel (or - presumably - at
> least recent history :)
I don't want fork() to become different in that regard.
There is already a slight difference with vs. without per-VMA locks, because
there is a window in-between us taking the write mmap_lock and all the per-VMA
locks. I raised that previously [1] and assumed that it is probably fine.
I also raised in the past why I think we must not allow concurrent page faults,
at least as soon as anonymous memory is involved [2].
... and I raised that this is pretty much slower by design right now: "Well, the
design decision that CONFIG_PER_VMA_LOCK made for now to make page faults fast
and to make blocking any page faults from happening to be slower ..." [3]
[1] https://lore.kernel.org/all/970295ab-e85d-7af3-76e6-df53a5c52f8b@redhat.com/
[2] https://lore.kernel.org/all/7e3f35cc-59b9-bf12-b8b1-4ed78223844a@redhat.com/
[3] https://lore.kernel.org/all/2efa2c89-3765-721d-2c3c-00590054aa5b@redhat.com/
--
Cheers,
David
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-20 10:33 ` David Hildenbrand (Arm)
@ 2026-05-20 12:55 ` Lorenzo Stoakes
0 siblings, 0 replies; 56+ messages in thread
From: Lorenzo Stoakes @ 2026-05-20 12:55 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Suren Baghdasaryan, Barry Song, Matthew Wilcox, akpm, linux-mm,
liam, vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Wed, May 20, 2026 at 12:33:56PM +0200, David Hildenbrand (Arm) wrote:
> On 5/19/26 14:53, Lorenzo Stoakes wrote:
> > On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
> >
> >>>
> >>> I think we either need to fix `fork()`, or keep the current
> >>> behavior of dropping the VMA lock before performing I/O.
> >>
> >> I see. So, this problem arises from the fact that we are changing the
> >> pagefaults requiring I/O operation to hold VMA lock...
> >> And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> >> is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> >> anonymous and COW VMAs only while holding mmap_write_lock, preventing
> >> any VMA modification. On the surface, that looks ok to me but I might
> >> be missing some corner cases. If nobody sees any obvious issues, I
> >> think it's worth a try.
> >
> > Not sure if you noticed but I did raise concerns ;)
> >
> > I wonder if you've confused the fault path and fork here, as I think Barry has
> > been a little unclear on that.
> >
> > What's being suggested in this thread is to fundamentally change fork behaviour
> > so it's different from the entire history of the kernel (or - presumably - at
> > least recent history :)
> I don't want fork() to become different in that regard.
>
> There is already a slight difference with vs. without per-VMA locks, because
> there is a window in-between us taking the write mmap_lock and all the per-VMA
> locks. I raised that previously [1] and assumed that it is probably fine.
>
> I also raised in the past why I think we must not allow concurrent page faults,
> at least as soon as anonymous memory is involved [2].
>
> ... and I raised that this is pretty much slower by design right now: "Well, the
> design decision that CONFIG_PER_VMA_LOCK made for now to make page faults fast
> and to make blocking any page faults from happening to be slower ..." [3]
Thanks for the background, will read through! :)
But yeah I think the transition from !vma->anon_vma -> vma->anon_vma being a bit
slow is kinda ok, as most page faults will of course have anon_vma populated.
Be interesting with CoW context, because we won't need to mmap read lock there
at all :)
>
> [1] https://lore.kernel.org/all/970295ab-e85d-7af3-76e6-df53a5c52f8b@redhat.com/
> [2] https://lore.kernel.org/all/7e3f35cc-59b9-bf12-b8b1-4ed78223844a@redhat.com/
> [3] https://lore.kernel.org/all/2efa2c89-3765-721d-2c3c-00590054aa5b@redhat.com/
>
> --
> Cheers,
>
> David
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-18 11:25 ` Barry Song
2026-05-18 16:17 ` Matthew Wilcox
2026-05-18 19:56 ` Suren Baghdasaryan
@ 2026-05-19 12:43 ` Lorenzo Stoakes
2 siblings, 0 replies; 56+ messages in thread
From: Lorenzo Stoakes @ 2026-05-19 12:43 UTC (permalink / raw)
To: Barry Song
Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, liam, vbabka, rppt,
mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Mon, May 18, 2026 at 07:25:54PM +0800, Barry Song wrote:
> On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> > > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > for an unpredictable amount of time.
> > > > > >
> > > > > > But does that actually happen? I find it hard to believe that thread A
> > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > > > it still seems really unlikely to me.
> > > > >
> > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > the entire VMA—just a portion of it is sufficient.
> > > >
> > > > Yes, but that still fails to answer "does this actually happen". How much
> > > > performance is all this complexity in the page fault handler buying us?
> > > > If you don't answer this question, I'm just going to go in and rip it
> > > > all out.
> > > >
> > >
> > > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > > waiting for answers),
> > >
> > > As promised during LSF/MM/BPF, we conducted thorough
> > > testing on Android phones to determine whether performing
> > > I/O in `filemap_fault()` can block `vma_start_write()`.
> > > I wanted to give a quick update on this question.
> > >
> > > Nanzhe at Xiaomi created tracing scripts and ran various
> > > applications on Android devices with I/O performed under
> > > the VMA lock in `filemap_fault()`. We found that:
> > >
> > > 1. There are very few cases where unmap() is blocked by
> > > page faults. I assume this is due to buggy user code
> > > or poor synchronization between reads and unmap().
> > > So I assume it is not a problem.
> > >
> > > 2. We observed many cases where `vma_start_write()`
> > > is blocked by page-fault I/O in some applications.
> > > The blocking occurs in the `dup_mmap()` path during
> > > fork().
> > >
> > > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > > the parent process when forking"), we now always hold
> > > `vma_write_lock()` for each VMA. Note that the
> > > `mmap_lock` write lock is also held, which could lead to
> > > chained waiting if page-fault I/O is performed without
> > > releasing the VMA lock.
> >
> > Hm but did you observe this 'chained waiting'? And what were the latencies?
>
> We have clearly observed that the `fork()` operations of many
> popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
> end up waiting on page-fault (PF) I/O when the VMA lock is
> held during I/O operations. This has already become a
> practical issue. I also believe this can lead to chained
> waiting, since the global `mmap_lock` blocks all threads that
> need to acquire it.
I asked about the chained waiting :) I'm aware you've observed contention on
write lock, you said so in your LSF talk.
So have you observed that or is this a theory?
>
>
> >
> > >
> > > My gut feeling is that Suren's commit may be overshooting,
> > > so my rough idea is that we might want to do something like
> > > the following (we haven't tested it yet and it might be
> > > wrong):
> >
> > Yeah I'm really not sure about that.
> >
> > Prior to the VMA locks, the mmap write lock would have guaranteed no concurrent
> > page faults, which is really what fb49c455323ff is about.
> >
> > So Suren's patch was essentially restoring the _existing_ forking behaviour, and
> > now you're saying 'let's change the forking behaviour that's been like that for
> > forever'.
>
>
> I am afraid not. Before we introduced the per-VMA lock, we
> were not performing I/O while holding `mmap_lock`. A page fault
> that needed I/O would drop the `mmap_lock` read lock and allow
> `fork()` to proceed.
Err I'm talking about fork? The patch you reference is a change to fork?
So you're saying that fb49c455323ff, which explicitly takes the VMA write lock on
fork, was somehow an addendum after fork didn't take the mmap write lock?
I must be imagining
https://elixir.bootlin.com/linux/v6.0/source/kernel/fork.c#L590 then in v6.0
pre-vma locks :)
I suspect that's _not_ what you're saying, so now what you're suggesting as I
stated above, is to fundamentally change fork behaviour to account for the
existing per-VMA lock behaviour on the fault path?
Again I state - are you really sure you want to fundamentally change fork
behaviour for this?
I am extremely concerned about doing that.
>
> Now, you are suggesting performing I/O while holding the VMA
> lock, which changes the requirements and introduces this
> problem.
>
> >
> > I think you would _really_ have to be sure that's safe. And forking is a very
> > dangerous time in terms of complexity and sensitivity and 'weird stuff'
> > happening so I'd tread _very_ carefully here.
>
> Yep. I think my original proposal did not require any changes
> to `fork()`, since it simply preserved the current behavior of
> dropping the VMA lock before performing I/O. In that model,
> `fork()` would not end up waiting on I/O at all.
>
> What you are suggesting now appears to be performing I/O while
> holding the VMA lock, which in turn introduces the need to
> change `fork()`.
Again, you're saying we should fundamentally change the way fork has worked
forever to work around something else.
At LSF I raised the fact that Josef himself suggested we simply drop this I/O
waiting behaviour for file-backed mappings. Isn't there a way forward that way
rather than 'hey let's drop locks and hope for the best!'?
I am really reticent about this because we've seen HORRIBLE bugs come from fork
behaviour, especially edge cases, and mm testing isn't great so I am basically
opposed to this, and you're not really convincing me here.
>
> >
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 2311ae7c2ff4..5ddaf297f31a 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > > *mm, struct mm_struct *oldmm)
> > > for_each_vma(vmi, mpnt) {
> > > struct file *file;
> > >
> > > - retval = vma_start_write_killable(mpnt);
> > > + /*
> > > + * For anonymous or writable private VMAs, prevent
> > > + * concurrent CoW faults.
> > > + */
> >
> > To nit pick I think the comment's confusing but also tells you you don't need the
> > specific anon check - writable private is sufficient. And it's not really just
> > CoW that's the issue, it's anon_vma population _at all_ as well as CoW.
> >
> > > + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > > + (mpnt->vm_flags & VM_WRITE)))
> > > + retval = vma_start_write_killable(mpnt);
> >
> > I think this has to be VM_MAYWRITE, because somebody could otherwise mprotect()
> > it R/W.
> >
> > I also don't understand why !mpnt->vm_file for a read-only anon mapping (more
> > likely PROT_NONE) is here, just do the second check?
> >
> > (Also please use the new interface, so !vma_test(mpnt, VMA_SHARED_BIT) &&
> > vma_test(mpnt, VMA_MAYWRITE_BIT))
>
> Yep, I can definitely refine the check further. But before
> doing that, I'd first like to confirm that we are aligned on
> the direction.
>
> If you still intend to hold the VMA lock while performing I/O,
> then I think we should fix `fork()` to avoid taking
> `vma_start_write()`.
Yeah or we could do something different, it isn't a case of you get to do one of
two options you propose - the maintainers decide which way is appropriate.
Of the two options dropping the lock on the fault path rather than this fork
insanity is my preference but I wonder if we can't find another way.
Let me read through the series and give more thoughts I guess.
>
> >
> > > if (retval < 0)
> > > goto loop_out;
> > > if (mpnt->vm_flags & VM_DONTCOPY) {
> > >
> > > Based on the above, we may want to re-check whether fork()
> > > can be blocked by page faults. At the same time, if Suren,
> > > you, or anyone else has any comments, please feel free to
> > > share them.
> > >
> > > Best Regards
> > > Barry
> >
> > Technical commentary above is sort of 'just cos' :) because I really question
> > doing this honestly.
>
> I think we either need to fix `fork()`, or keep the current
> behavior of dropping the VMA lock before performing I/O.
Yup you said :)
>
> >
> > I'd also like to get Suren's input, however.
>
> Yes. of course.
>
> >
> > Thanks, Lorenzo
>
> Best Regards
> Barry
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-17 8:45 ` Barry Song
2026-05-18 9:46 ` Lorenzo Stoakes
@ 2026-05-18 9:53 ` David Hildenbrand (Arm)
2026-05-19 13:42 ` Lorenzo Stoakes
2026-05-18 21:21 ` Yang Shi
2 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-18 9:53 UTC (permalink / raw)
To: Barry Song, Matthew Wilcox, surenb
Cc: akpm, linux-mm, ljs, liam, vbabka, rppt, mhocko, jack, pfalcato,
wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
Nanzhe Zhao
On 5/17/26 10:45, Barry Song wrote:
> On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
>>>
>>> It doesn’t have to involve unmapping or applying mprotect to
>>> the entire VMA—just a portion of it is sufficient.
>>
>> Yes, but that still fails to answer "does this actually happen". How much
>> performance is all this complexity in the page fault handler buying us?
>> If you don't answer this question, I'm just going to go in and rip it
>> all out.
>>
>
> Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> waiting for answers),
>
> As promised during LSF/MM/BPF, we conducted thorough
> testing on Android phones to determine whether performing
> I/O in `filemap_fault()` can block `vma_start_write()`.
> I wanted to give a quick update on this question.
>
> Nanzhe at Xiaomi created tracing scripts and ran various
> applications on Android devices with I/O performed under
> the VMA lock in `filemap_fault()`. We found that:
>
> 1. There are very few cases where unmap() is blocked by
> page faults. I assume this is due to buggy user code
> or poor synchronization between reads and unmap().
> So I assume it is not a problem.
>
> 2. We observed many cases where `vma_start_write()`
> is blocked by page-fault I/O in some applications.
> The blocking occurs in the `dup_mmap()` path during
> fork().
>
> With Suren's commit fb49c455323ff ("fork: lock VMAs of
> the parent process when forking"), we now always hold
> `vma_write_lock()` for each VMA. Note that the
> `mmap_lock` write lock is also held, which could lead to
> chained waiting if page-fault I/O is performed without
> releasing the VMA lock.
>
> My gut feeling is that Suren's commit may be overshooting,
> so my rough idea is that we might want to do something like
> the following (we haven't tested it yet and it might be
> wrong):
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2311ae7c2ff4..5ddaf297f31a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> *mm, struct mm_struct *oldmm)
> for_each_vma(vmi, mpnt) {
> struct file *file;
>
> - retval = vma_start_write_killable(mpnt);
> + /*
> + * For anonymous or writable private VMAs, prevent
> + * concurrent CoW faults.
> + */
> + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> + (mpnt->vm_flags & VM_WRITE)))
> + retval = vma_start_write_killable(mpnt);
Likely is_cow_mapping() is what you would want to check to handle VMAs that
could have anonymous pages in them.
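
As a rough userspace illustration of the difference between the check in the
proposed hunk and an is_cow_mapping()-style check, here is a minimal sketch.
The struct, the has_file field and both function names are hypothetical and
exist only for this example; the flag values are assumed to mirror the
kernel's but are defined locally. It is not kernel code.

```c
#include <stdbool.h>

/* Flag values defined locally for this sketch (assumed to mirror the kernel's). */
#define VM_WRITE    0x00000002UL
#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL

/* Hypothetical stand-in for the few VMA fields the checks look at. */
struct vma_model {
	unsigned long vm_flags;
	bool has_file;	/* models vma->vm_file != NULL */
};

/* The condition from the proposed dup_mmap() hunk. */
static bool proposed_check(const struct vma_model *vma)
{
	return !vma->has_file ||
	       (!(vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE));
}

/* Modelled on is_cow_mapping(): private, and may be made writable. */
static bool cow_mapping_check(const struct vma_model *vma)
{
	return (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
```

The case that separates the two is a private file mapping that is currently
read-only but carries VM_MAYWRITE: the proposed check skips it, while the
is_cow_mapping()-style check covers it, since a later mprotect() could make it
writable.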
--
Cheers,
David
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-18 9:53 ` David Hildenbrand (Arm)
@ 2026-05-19 13:42 ` Lorenzo Stoakes
0 siblings, 0 replies; 56+ messages in thread
From: Lorenzo Stoakes @ 2026-05-19 13:42 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Barry Song, Matthew Wilcox, surenb, akpm, linux-mm, liam, vbabka,
rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Mon, May 18, 2026 at 11:53:37AM +0200, David Hildenbrand (Arm) wrote:
> On 5/17/26 10:45, Barry Song wrote:
> > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> >>
> >> On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> >>>
> >>> It doesn’t have to involve unmapping or applying mprotect to
> >>> the entire VMA—just a portion of it is sufficient.
> >>
> >> Yes, but that still fails to answer "does this actually happen". How much
> >> performance is all this complexity in the page fault handler buying us?
> >> If you don't answer this question, I'm just going to go in and rip it
> >> all out.
> >>
> >
> > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > waiting for answers),
> >
> > As promised during LSF/MM/BPF, we conducted thorough
> > testing on Android phones to determine whether performing
> > I/O in `filemap_fault()` can block `vma_start_write()`.
> > I wanted to give a quick update on this question.
> >
> > Nanzhe at Xiaomi created tracing scripts and ran various
> > applications on Android devices with I/O performed under
> > the VMA lock in `filemap_fault()`. We found that:
> >
> > 1. There are very few cases where unmap() is blocked by
> > page faults. I assume this is due to buggy user code
> > or poor synchronization between reads and unmap().
> > So I assume it is not a problem.
> >
> > 2. We observed many cases where `vma_start_write()`
> > is blocked by page-fault I/O in some applications.
> > The blocking occurs in the `dup_mmap()` path during
> > fork().
> >
> > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > the parent process when forking"), we now always hold
> > `vma_write_lock()` for each VMA. Note that the
> > `mmap_lock` write lock is also held, which could lead to
> > chained waiting if page-fault I/O is performed without
> > releasing the VMA lock.
> >
> > My gut feeling is that Suren's commit may be overshooting,
> > so my rough idea is that we might want to do something like
> > the following (we haven't tested it yet and it might be
> > wrong):
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 2311ae7c2ff4..5ddaf297f31a 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > *mm, struct mm_struct *oldmm)
> > for_each_vma(vmi, mpnt) {
> > struct file *file;
> >
> > - retval = vma_start_write_killable(mpnt);
> > + /*
> > + * For anonymous or writable private VMAs, prevent
> > + * concurrent CoW faults.
> > + */
> > + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > + (mpnt->vm_flags & VM_WRITE)))
> > + retval = vma_start_write_killable(mpnt);
>
> Likely is_cow_mapping() is what you would want to check to handle VMAs that
> could have anonymous pages in them.
Yes :) I made pretty much the same comment though I forgot the correct helper :P
>
> --
> Cheers,
>
> David
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-17 8:45 ` Barry Song
2026-05-18 9:46 ` Lorenzo Stoakes
2026-05-18 9:53 ` David Hildenbrand (Arm)
@ 2026-05-18 21:21 ` Yang Shi
2026-05-19 11:07 ` Barry Song
2026-05-19 13:12 ` Lorenzo Stoakes
2 siblings, 2 replies; 56+ messages in thread
From: Yang Shi @ 2026-05-18 21:21 UTC (permalink / raw)
To: Barry Song
Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, ljs, liam, vbabka,
rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Sun, May 17, 2026 at 1:45 AM Barry Song <baohua@kernel.org> wrote:
>
> On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > both the hardware and the software stack (bio/request queues and the
> > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > for an unpredictable amount of time.
> > > >
> > > > But does that actually happen? I find it hard to believe that thread A
> > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > it still seems really unlikely to me.
> > >
> > > It doesn’t have to involve unmapping or applying mprotect to
> > > the entire VMA—just a portion of it is sufficient.
> >
> > Yes, but that still fails to answer "does this actually happen". How much
> > performance is all this complexity in the page fault handler buying us?
> > If you don't answer this question, I'm just going to go in and rip it
> > all out.
> >
>
> Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> waiting for answers),
>
> As promised during LSF/MM/BPF, we conducted thorough
> testing on Android phones to determine whether performing
> I/O in `filemap_fault()` can block `vma_start_write()`.
> I wanted to give a quick update on this question.
>
> Nanzhe at Xiaomi created tracing scripts and ran various
> applications on Android devices with I/O performed under
> the VMA lock in `filemap_fault()`. We found that:
>
> 1. There are very few cases where unmap() is blocked by
> page faults. I assume this is due to buggy user code
> or poor synchronization between reads and unmap().
> So I assume it is not a problem.
>
> 2. We observed many cases where `vma_start_write()`
> is blocked by page-fault I/O in some applications.
> The blocking occurs in the `dup_mmap()` path during
> fork().
>
> With Suren's commit fb49c455323ff ("fork: lock VMAs of
> the parent process when forking"), we now always hold
> `vma_write_lock()` for each VMA. Note that the
> `mmap_lock` write lock is also held, which could lead to
> chained waiting if page-fault I/O is performed without
> releasing the VMA lock.
>
> My gut feeling is that Suren's commit may be overshooting,
> so my rough idea is that we might want to do something like
> the following (we haven't tested it yet and it might be
> wrong):
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2311ae7c2ff4..5ddaf297f31a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> *mm, struct mm_struct *oldmm)
> for_each_vma(vmi, mpnt) {
> struct file *file;
>
> - retval = vma_start_write_killable(mpnt);
> + /*
> + * For anonymous or writable private VMAs, prevent
> + * concurrent CoW faults.
> + */
> + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> + (mpnt->vm_flags & VM_WRITE)))
> + retval = vma_start_write_killable(mpnt);
> if (retval < 0)
> goto loop_out;
> if (mpnt->vm_flags & VM_DONTCOPY) {
Maybe a little bit off topic. This is an interesting idea. It seems
possible we don't have to take vma write lock unconditionally. IIUC
the write lock is mainly used to serialize against page fault and
madvise, right? I got a crazy idea off the top of my head. We may be
able to just take vma write lock iff vma->anon_vma is not NULL.
First of all, write mmap_lock is held, so the vma can't go or be
changed under us.
Secondly, if vma->anon_vma is NULL, it basically means either no page
fault happened or no cow happened, so there is no page table to copy,
this is also what copy_page_range() does currently. So we can shrink
the critical section to:
    if (src_vma->anon_vma) {
        vma_start_write_killable(src_vma);
        anon_vma_fork(dst_vma, src_vma);
        copy_page_range(dst_vma, src_vma);
    }
A page fault can still happen before the write mmap_lock is taken, so when
we check vma->anon_vma it may not have been set up yet. But that seems
to be equivalent to a page fault after fork and shouldn't break the
semantics.
Anyway, just a crazy idea, I may miss some corner cases.
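
To make the selection rule concrete, here is a small userspace model of a fork
pass that write-locks only the VMAs whose anon_vma is already set. The struct
and function names are hypothetical and used only for illustration; this is
not kernel code. Other replies in this thread list cases where page tables are
copied for other reasons (PFN-mapped and mixed-map VMAs, UFFD W/P), which this
model does not cover.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two VMA properties the idea depends on. */
struct fork_vma {
	const void *anon_vma;	/* NULL until anonymous pages are first set up */
	bool write_locked;	/* models the effect of vma_start_write_killable() */
};

/*
 * Walk the parent's VMAs and lock only those with anon_vma set.
 * Returns how many VMAs were write-locked.
 */
static size_t fork_lock_pass(struct fork_vma *vmas, size_t n)
{
	size_t locked = 0;

	for (size_t i = 0; i < n; i++) {
		if (vmas[i].anon_vma) {
			vmas[i].write_locked = true;
			locked++;
		}
	}
	return locked;
}
```

In this model a VMA that has never populated anon_vma is skipped entirely, so
a page fault doing I/O in it would not hold up the pass.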
Thanks,
Yang
>
> Based on the above, we may want to re-check whether fork()
> can be blocked by page faults. At the same time, if Suren,
> you, or anyone else has any comments, please feel free to
> share them.
>
> Best Regards
> Barry
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-18 21:21 ` Yang Shi
@ 2026-05-19 11:07 ` Barry Song
2026-05-19 13:34 ` Lorenzo Stoakes
2026-05-19 18:50 ` Yang Shi
2026-05-19 13:12 ` Lorenzo Stoakes
1 sibling, 2 replies; 56+ messages in thread
From: Barry Song @ 2026-05-19 11:07 UTC (permalink / raw)
To: Yang Shi
Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, ljs, liam, vbabka,
rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Tue, May 19, 2026 at 5:21 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Sun, May 17, 2026 at 1:45 AM Barry Song <baohua@kernel.org> wrote:
> >
> > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > for an unpredictable amount of time.
> > > > >
> > > > > But does that actually happen? I find it hard to believe that thread A
> > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > > it still seems really unlikely to me.
> > > >
> > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > the entire VMA—just a portion of it is sufficient.
> > >
> > > Yes, but that still fails to answer "does this actually happen". How much
> > > performance is all this complexity in the page fault handler buying us?
> > > If you don't answer this question, I'm just going to go in and rip it
> > > all out.
> > >
> >
> > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > waiting for answers),
> >
> > As promised during LSF/MM/BPF, we conducted thorough
> > testing on Android phones to determine whether performing
> > I/O in `filemap_fault()` can block `vma_start_write()`.
> > I wanted to give a quick update on this question.
> >
> > Nanzhe at Xiaomi created tracing scripts and ran various
> > applications on Android devices with I/O performed under
> > the VMA lock in `filemap_fault()`. We found that:
> >
> > 1. There are very few cases where unmap() is blocked by
> > page faults. I assume this is due to buggy user code
> > or poor synchronization between reads and unmap().
> > So I assume it is not a problem.
> >
> > 2. We observed many cases where `vma_start_write()`
> > is blocked by page-fault I/O in some applications.
> > The blocking occurs in the `dup_mmap()` path during
> > fork().
> >
> > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > the parent process when forking"), we now always hold
> > `vma_write_lock()` for each VMA. Note that the
> > `mmap_lock` write lock is also held, which could lead to
> > chained waiting if page-fault I/O is performed without
> > releasing the VMA lock.
> >
> > My gut feeling is that Suren's commit may be overshooting,
> > so my rough idea is that we might want to do something like
> > the following (we haven't tested it yet and it might be
> > wrong):
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 2311ae7c2ff4..5ddaf297f31a 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > *mm, struct mm_struct *oldmm)
> > for_each_vma(vmi, mpnt) {
> > struct file *file;
> >
> > - retval = vma_start_write_killable(mpnt);
> > + /*
> > + * For anonymous or writable private VMAs, prevent
> > + * concurrent CoW faults.
> > + */
> > + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > + (mpnt->vm_flags & VM_WRITE)))
> > + retval = vma_start_write_killable(mpnt);
> > if (retval < 0)
> > goto loop_out;
> > if (mpnt->vm_flags & VM_DONTCOPY) {
>
> Maybe a little bit off topic. This is an interesting idea. It seems
> possible we don't have to take vma write lock unconditionally. IIUC
> the write lock is mainly used to serialize against page fault and
> madvise, right? I got a crazy idea off the top of my head. We may be
> able to just take vma write lock iff vma->anon_vma is not NULL.
>
> First of all, write mmap_lock is held, so the vma can't go or be
> changed under us.
>
> Secondly, if vma->anon_vma is NULL, it basically means either no page
> fault happened or no cow happened, so there is no page table to copy,
> this is also what copy_page_range() does currently. So we can shrink
> the critical section to:
>
> if (vma->anon_vma) {
> vma_start_write_killable(src_vma);
> anon_vma_fork(dst_vma, src_vma);
> copy_page_range(dst_vma, src_vma);
> }
>
> But page fault can happen before write mmap_lock is taken, when we
> check vma->anon_vma, it is possible it has not been set up yet. But it
> seems to be equivalent to page fault after fork and won't break the
> semantic.
Re-reading Suren's commit log for fb49c455323ff8
("fork: lock VMAs of the parent process when forking"),
it seems that vma_start_write() is used to protect
against a race where anon_vma changes from NULL to
non-NULL during fork. In that scenario, we hold the
mmap_lock write lock, but not vma_start_write(), so a
concurrent anon_vma_prepare() could still install an
anon_vma.
" A concurrent page fault on a page newly marked read-only by the page
copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the
source vma, defeating the anon_vma_clone() that wasn't done because the
parent vma originally didn't have an anon_vma, but we now might end up
copying a pte entry for a page that has one.
"
If that is the case, then your change does not work.
Nowadays, nobody calls anon_vma_prepare(vma) directly.
Instead, vmf_anon_prepare() is used, and we always
require the mmap_lock read lock before calling
__anon_vma_prepare(). As a result, anon_vma cannot
transition from NULL to non-NULL during fork.
So the original race condition has effectively
disappeared.
You also mentioned the madvise() case. If I understand
correctly, madvise() should take mmap_lock before
modifying anon_vma. Only some parts of madvise() can
support per-VMA locking. Therefore, we probably do not
need:
if (vma->anon_vma) {
vma_start_write_killable(src_vma);
...
}
>
> Anyway, just a crazy idea, I may miss some corner cases.
To me, it seems that we could remove vma_start_write()
entirely now. Or is that an even crazier idea?
Thanks
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-19 11:07 ` Barry Song
@ 2026-05-19 13:34 ` Lorenzo Stoakes
2026-05-19 18:50 ` Yang Shi
1 sibling, 0 replies; 56+ messages in thread
From: Lorenzo Stoakes @ 2026-05-19 13:34 UTC (permalink / raw)
To: Barry Song
Cc: Yang Shi, Matthew Wilcox, surenb, akpm, linux-mm, david, liam,
vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Tue, May 19, 2026 at 07:07:37PM +0800, Barry Song wrote:
> On Tue, May 19, 2026 at 5:21 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Sun, May 17, 2026 at 1:45 AM Barry Song <baohua@kernel.org> wrote:
> > >
> > > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > for an unpredictable amount of time.
> > > > > >
> > > > > > But does that actually happen? I find it hard to believe that thread A
> > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > > > it still seems really unlikely to me.
> > > > >
> > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > the entire VMA—just a portion of it is sufficient.
> > > >
> > > > Yes, but that still fails to answer "does this actually happen". How much
> > > > performance is all this complexity in the page fault handler buying us?
> > > > If you don't answer this question, I'm just going to go in and rip it
> > > > all out.
> > > >
> > >
> > > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > > waiting for answers),
> > >
> > > As promised during LSF/MM/BPF, we conducted thorough
> > > testing on Android phones to determine whether performing
> > > I/O in `filemap_fault()` can block `vma_start_write()`.
> > > I wanted to give a quick update on this question.
> > >
> > > Nanzhe at Xiaomi created tracing scripts and ran various
> > > applications on Android devices with I/O performed under
> > > the VMA lock in `filemap_fault()`. We found that:
> > >
> > > 1. There are very few cases where unmap() is blocked by
> > > page faults. I assume this is due to buggy user code
> > > or poor synchronization between reads and unmap().
> > > So I assume it is not a problem.
> > >
> > > 2. We observed many cases where `vma_start_write()`
> > > is blocked by page-fault I/O in some applications.
> > > The blocking occurs in the `dup_mmap()` path during
> > > fork().
> > >
> > > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > > the parent process when forking"), we now always hold
> > > `vma_write_lock()` for each VMA. Note that the
> > > `mmap_lock` write lock is also held, which could lead to
> > > chained waiting if page-fault I/O is performed without
> > > releasing the VMA lock.
> > >
> > > My gut feeling is that Suren's commit may be overshooting,
> > > so my rough idea is that we might want to do something like
> > > the following (we haven't tested it yet and it might be
> > > wrong):
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 2311ae7c2ff4..5ddaf297f31a 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > > *mm, struct mm_struct *oldmm)
> > > for_each_vma(vmi, mpnt) {
> > > struct file *file;
> > >
> > > - retval = vma_start_write_killable(mpnt);
> > > + /*
> > > + * For anonymous or writable private VMAs, prevent
> > > + * concurrent CoW faults.
> > > + */
> > > + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > > + (mpnt->vm_flags & VM_WRITE)))
> > > + retval = vma_start_write_killable(mpnt);
> > > if (retval < 0)
> > > goto loop_out;
> > > if (mpnt->vm_flags & VM_DONTCOPY) {
> >
> > Maybe a little bit off topic. This is an interesting idea. It seems
> > possible we don't have to take vma write lock unconditionally. IIUC
> > the write lock is mainly used to serialize against page fault and
> > madvise, right? I got a crazy idea off the top of my head. We may be
> > able to just take vma write lock iff vma->anon_vma is not NULL.
> >
> > First of all, write mmap_lock is held, so the vma can't go or be
> > changed under us.
> >
> > Secondly, if vma->anon_vma is NULL, it basically means either no page
> > fault happened or no cow happened, so there is no page table to copy,
> > this is also what copy_page_range() does currently. So we can shrink
> > the critical section to:
> >
> > if (vma->anon_vma) {
> > vma_start_write_killable(src_vma);
> > anon_vma_fork(dst_vma, src_vma);
> > copy_page_range(dst_vma, src_vma);
> > }
> >
> > But page fault can happen before write mmap_lock is taken, when we
> > check vma->anon_vma, it is possible it has not been set up yet. But it
> > seems to be equivalent to page fault after fork and won't break the
> > semantic.
>
> Re-reading Suren's commit log for fb49c455323ff8
> ("fork: lock VMAs of the parent process when forking"),
> it seems that vma_start_write() is used to protect
> against a race where anon_vma changes from NULL to
> non-NULL during fork. In that scenario, we hold the
> mmap_lock write lock, but not vma_start_write(), so a
> concurrent anon_vma_prepare() could still install an
> anon_vma.
>
> " A concurrent page fault on a page newly marked read-only by the page
> copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the
> source vma, defeating the anon_vma_clone() that wasn't done because the
> parent vma originally didn't have an anon_vma, but we now might end up
> copying a pte entry for a page that has one.
> "
>
> If that is the case, then your change does not work.
>
> Nowadays, nobody calls anon_vma_prepare(vma) directly.
I see callers? Am I imagining them? :)
https://elixir.bootlin.com/linux/v7.0.9/A/ident/anon_vma_prepare
> Instead, vmf_anon_prepare() is used, and we always
> require the mmap_lock read lock before calling
> __anon_vma_prepare(). As a result, anon_vma cannot
> transition from NULL to non-NULL during fork.
Right, yes the mmap read lock is required for that.
>
> So the original race condition has effectively
> disappeared.
Err the page tables? All the other cases which require page table copying?
Concurrent faults mean that copy_page_range() can race with faulting when
vma->anon_vma is set _or_ in any of the multiple cases mentioned elsewhere.
And who knows what else serialises on that.
>
> You also mentioned the madvise() case. If I understand
> correctly, madvise() should take mmap_lock before
> modifying anon_vma. Only some parts of madvise() can
> support per-VMA locking. Therefore, we probably do not
> need:
>
> if (vma->anon_vma) {
> vma_start_write_killable(src_vma);
> ...
> }
I like how you hand wave the VMA lock operations in madvise() :)
(Maybe) guard regions being present cause page tables to be copied, they're
installed under VMA (read) lock, and can race now.
And it sets traps for future changes - introducing more horrible edge case race
conditions in fork is just a big nope nope nope.
This isn't an area to play around in.
>
> >
> > Anyway, just a crazy idea, I may miss some corner cases.
>
> To me, it seems that we could remove vma_start_write()
> entirely now. Or is that an even crazier idea?
As above that'd be totally broken. NAK.
>
> Thanks
> Barry
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-19 11:07 ` Barry Song
2026-05-19 13:34 ` Lorenzo Stoakes
@ 2026-05-19 18:50 ` Yang Shi
2026-05-19 20:53 ` Yang Shi
1 sibling, 1 reply; 56+ messages in thread
From: Yang Shi @ 2026-05-19 18:50 UTC (permalink / raw)
To: Barry Song
Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, ljs, liam, vbabka,
rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Tue, May 19, 2026 at 4:07 AM Barry Song <baohua@kernel.org> wrote:
>
> On Tue, May 19, 2026 at 5:21 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Sun, May 17, 2026 at 1:45 AM Barry Song <baohua@kernel.org> wrote:
> > >
> > > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > for an unpredictable amount of time.
> > > > > >
> > > > > > But does that actually happen? I find it hard to believe that thread A
> > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > > > it still seems really unlikely to me.
> > > > >
> > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > the entire VMA—just a portion of it is sufficient.
> > > >
> > > > Yes, but that still fails to answer "does this actually happen". How much
> > > > performance is all this complexity in the page fault handler buying us?
> > > > If you don't answer this question, I'm just going to go in and rip it
> > > > all out.
> > > >
> > >
> > > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > > waiting for answers),
> > >
> > > As promised during LSF/MM/BPF, we conducted thorough
> > > testing on Android phones to determine whether performing
> > > I/O in `filemap_fault()` can block `vma_start_write()`.
> > > I wanted to give a quick update on this question.
> > >
> > > Nanzhe at Xiaomi created tracing scripts and ran various
> > > applications on Android devices with I/O performed under
> > > the VMA lock in `filemap_fault()`. We found that:
> > >
> > > 1. There are very few cases where unmap() is blocked by
> > > page faults. I assume this is due to buggy user code
> > > or poor synchronization between reads and unmap().
> > > So I assume it is not a problem.
> > >
> > > 2. We observed many cases where `vma_start_write()`
> > > is blocked by page-fault I/O in some applications.
> > > The blocking occurs in the `dup_mmap()` path during
> > > fork().
> > >
> > > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > > the parent process when forking"), we now always hold
> > > `vma_write_lock()` for each VMA. Note that the
> > > `mmap_lock` write lock is also held, which could lead to
> > > chained waiting if page-fault I/O is performed without
> > > releasing the VMA lock.
> > >
> > > My gut feeling is that Suren's commit may be overshooting,
> > > so my rough idea is that we might want to do something like
> > > the following (we haven't tested it yet and it might be
> > > wrong):
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 2311ae7c2ff4..5ddaf297f31a 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > > *mm, struct mm_struct *oldmm)
> > > for_each_vma(vmi, mpnt) {
> > > struct file *file;
> > >
> > > - retval = vma_start_write_killable(mpnt);
> > > + /*
> > > + * For anonymous or writable private VMAs, prevent
> > > + * concurrent CoW faults.
> > > + */
> > > + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > > + (mpnt->vm_flags & VM_WRITE)))
> > > + retval = vma_start_write_killable(mpnt);
> > > if (retval < 0)
> > > goto loop_out;
> > > if (mpnt->vm_flags & VM_DONTCOPY) {
> >
> > Maybe a little bit off topic. This is an interesting idea. It seems
> > possible we don't have to take vma write lock unconditionally. IIUC
> > the write lock is mainly used to serialize against page fault and
> > madvise, right? I got a crazy idea off the top of my head. We may be
> > able to just take vma write lock iff vma->anon_vma is not NULL.
> >
> > First of all, write mmap_lock is held, so the vma can't go or be
> > changed under us.
> >
> > Secondly, if vma->anon_vma is NULL, it basically means either no page
> > fault happened or no cow happened, so there is no page table to copy,
> > this is also what copy_page_range() does currently. So we can shrink
> > the critical section to:
> >
> > if (vma->anon_vma) {
> > vma_start_write_killable(src_vma);
> > anon_vma_fork(dst_vma, src_vma);
> > copy_page_range(dst_vma, src_vma);
> > }
> >
> > But page fault can happen before write mmap_lock is taken, when we
> > check vma->anon_vma, it is possible it has not been set up yet. But it
> > seems to be equivalent to page fault after fork and won't break the
> > semantic.
>
> Re-reading Suren's commit log for fb49c455323ff8
> ("fork: lock VMAs of the parent process when forking"),
> it seems that vma_start_write() is used to protect
> against a race where anon_vma changes from NULL to
> non-NULL during fork. In that scenario, we hold the
> mmap_lock write lock, but not vma_start_write(), so a
> concurrent anon_vma_prepare() could still install an
> anon_vma.
>
> " A concurrent page fault on a page newly marked read-only by the page
> copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the
> source vma, defeating the anon_vma_clone() that wasn't done because the
> parent vma originally didn't have an anon_vma, but we now might end up
> copying a pte entry for a page that has one.
> "
>
> If that is the case, then your change does not work.
>
> Nowadays, nobody calls anon_vma_prepare(vma) directly.
> Instead, vmf_anon_prepare() is used, and we always
> require the mmap_lock read lock before calling
> __anon_vma_prepare(). As a result, anon_vma cannot
> transition from NULL to non-NULL during fork.
>
> So the original race condition has effectively
> disappeared.
anon_vma_prepare() has some use cases too, but it seems like it
requires taking read mmap_lock too if I read the code correctly.
>
> You also mentioned the madvise() case. If I understand
> correctly, madvise() should take mmap_lock before
> modifying anon_vma. Only some parts of madvise() can
> support per-VMA locking. Therefore, we probably do not
> need:
>
> if (vma->anon_vma) {
> vma_start_write_killable(src_vma);
> ...
> }
I think we still need the write vma lock to serialize the anon_vma fork;
otherwise we may see:
CPU 0                              CPU 1
fork                               page fault
src vma has no anon_vma
skip vma fork
                                   allocate anon_vma for src vma
vma_needs_copy() sees anon_vma
copy page
Then we may end up with no anon_vma for the dst vma, but with pages mapped in it.
Thanks,
Yang
>
> >
> > Anyway, just a crazy idea, I may miss some corner cases.
>
> To me, it seems that we could remove vma_start_write()
> entirely now. Or is that an even crazier idea?
>
> Thanks
> Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-19 18:50 ` Yang Shi
@ 2026-05-19 20:53 ` Yang Shi
0 siblings, 0 replies; 56+ messages in thread
From: Yang Shi @ 2026-05-19 20:53 UTC (permalink / raw)
To: Barry Song
Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, ljs, liam, vbabka,
rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Tue, May 19, 2026 at 11:50 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, May 19, 2026 at 4:07 AM Barry Song <baohua@kernel.org> wrote:
> >
> > On Tue, May 19, 2026 at 5:21 AM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Sun, May 17, 2026 at 1:45 AM Barry Song <baohua@kernel.org> wrote:
> > > >
> > > > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > >
> > > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > > for an unpredictable amount of time.
> > > > > > >
> > > > > > > But does that actually happen? I find it hard to believe that thread A
> > > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > > > > it still seems really unlikely to me.
> > > > > >
> > > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > > the entire VMA—just a portion of it is sufficient.
> > > > >
> > > > > Yes, but that still fails to answer "does this actually happen". How much
> > > > > performance is all this complexity in the page fault handler buying us?
> > > > > If you don't answer this question, I'm just going to go in and rip it
> > > > > all out.
> > > > >
> > > >
> > > > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > > > waiting for answers),
> > > >
> > > > As promised during LSF/MM/BPF, we conducted thorough
> > > > testing on Android phones to determine whether performing
> > > > I/O in `filemap_fault()` can block `vma_start_write()`.
> > > > I wanted to give a quick update on this question.
> > > >
> > > > Nanzhe at Xiaomi created tracing scripts and ran various
> > > > applications on Android devices with I/O performed under
> > > > the VMA lock in `filemap_fault()`. We found that:
> > > >
> > > > 1. There are very few cases where unmap() is blocked by
> > > > page faults. I assume this is due to buggy user code
> > > > or poor synchronization between reads and unmap().
> > > > So I assume it is not a problem.
> > > >
> > > > 2. We observed many cases where `vma_start_write()`
> > > > is blocked by page-fault I/O in some applications.
> > > > The blocking occurs in the `dup_mmap()` path during
> > > > fork().
> > > >
> > > > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > > > the parent process when forking"), we now always hold
> > > > `vma_write_lock()` for each VMA. Note that the
> > > > `mmap_lock` write lock is also held, which could lead to
> > > > chained waiting if page-fault I/O is performed without
> > > > releasing the VMA lock.
> > > >
> > > > My gut feeling is that Suren's commit may be overshooting,
> > > > so my rough idea is that we might want to do something like
> > > > the following (we haven't tested it yet and it might be
> > > > wrong):
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index 2311ae7c2ff4..5ddaf297f31a 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > > > *mm, struct mm_struct *oldmm)
> > > > for_each_vma(vmi, mpnt) {
> > > > struct file *file;
> > > >
> > > > - retval = vma_start_write_killable(mpnt);
> > > > + /*
> > > > + * For anonymous or writable private VMAs, prevent
> > > > + * concurrent CoW faults.
> > > > + */
> > > > + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > > > + (mpnt->vm_flags & VM_WRITE)))
> > > > + retval = vma_start_write_killable(mpnt);
> > > > if (retval < 0)
> > > > goto loop_out;
> > > > if (mpnt->vm_flags & VM_DONTCOPY) {
> > >
> > > Maybe a little bit off topic. This is an interesting idea. It seems
> > > possible we don't have to take vma write lock unconditionally. IIUC
> > > the write lock is mainly used to serialize against page fault and
> > > madvise, right? I got a crazy idea off the top of my head. We may be
> > > able to just take vma write lock iff vma->anon_vma is not NULL.
> > >
> > > First of all, write mmap_lock is held, so the vma can't go or be
> > > changed under us.
> > >
> > > Secondly, if vma->anon_vma is NULL, it basically means either no page
> > > fault happened or no cow happened, so there is no page table to copy,
> > > this is also what copy_page_range() does currently. So we can shrink
> > > the critical section to:
> > >
> > > if (vma->anon_vma) {
> > > vma_start_write_killable(src_vma);
> > > anon_vma_fork(dst_vma, src_vma);
> > > copy_page_range(dst_vma, src_vma);
> > > }
> > >
> > > But page fault can happen before write mmap_lock is taken, when we
> > > check vma->anon_vma, it is possible it has not been set up yet. But it
> > > seems to be equivalent to page fault after fork and won't break the
> > > semantic.
> >
> > Re-reading Suren's commit log for fb49c455323ff8
> > ("fork: lock VMAs of the parent process when forking"),
> > it seems that vma_start_write() is used to protect
> > against a race where anon_vma changes from NULL to
> > non-NULL during fork. In that scenario, we hold the
> > mmap_lock write lock, but not vma_start_write(), so a
> > concurrent anon_vma_prepare() could still install an
> > anon_vma.
> >
> > " A concurrent page fault on a page newly marked read-only by the page
> > copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the
> > source vma, defeating the anon_vma_clone() that wasn't done because the
> > parent vma originally didn't have an anon_vma, but we now might end up
> > copying a pte entry for a page that has one.
> > "
> >
> > If that is the case, then your change does not work.
> >
> > Nowadays, nobody calls anon_vma_prepare(vma) directly.
> > Instead, vmf_anon_prepare() is used, and we always
> > require the mmap_lock read lock before calling
> > __anon_vma_prepare(). As a result, anon_vma cannot
> > transition from NULL to non-NULL during fork.
> >
> > So the original race condition has effectively
> > disappeared.
>
> anon_vma_prepare() has some use cases too, but it seems like it
> requires taking read mmap_lock too if I read the code correctly.
>
> >
> > You also mentioned the madvise() case. If I understand
> > correctly, madvise() should take mmap_lock before
> > modifying anon_vma. Only some parts of madvise() can
> > support per-VMA locking. Therefore, we probably do not
> > need:
> >
> > if (vma->anon_vma) {
> > vma_start_write_killable(src_vma);
> > ...
> > }
>
> I think we still need write vma lock to serialize anon_vma fork
> otherwise we may see:
>
> CPU 0                              CPU 1
> fork                               page fault
> src vma has no anon_vma
> skip vma fork
>
>                                    allocate anon_vma for src vma
> vma_needs_copy() sees anon_vma
> copy page
>
> Then we may end up with no anon_vma for the dst vma, but with pages mapped in it.
Sorry, this should not happen because creating anon_vma in page fault
needs to take mmap_lock.
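
To make the point concrete, here is a toy userspace sketch of the
interleaving quoted above. It is not kernel code: every toy_* name is
hypothetical, and the locking is reduced to a single flag under the
assumption that fork holds mmap_lock for write for the whole sequence.
It only illustrates why the child can end up inconsistent if the fault
path could install an anon_vma without mmap_lock, and why it cannot
when that path has to take mmap_lock first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; these are NOT the kernel's types. */
struct toy_vma {
	bool has_anon_vma;	/* models vma->anon_vma != NULL */
	int mapped_pages;	/* models populated page table entries */
};

struct toy_mm {
	bool mmap_write_locked;	/* models fork holding mmap_lock for write */
};

/*
 * Models a page fault that wants to install an anon_vma on the source VMA.
 * If needs_mmap_lock is true, the fault must first take mmap_lock for read,
 * which cannot succeed while fork holds it for write, so the fault is
 * deferred (returns false) and leaves the VMA untouched.
 */
static bool toy_fault_install_anon_vma(const struct toy_mm *mm,
				       struct toy_vma *src,
				       bool needs_mmap_lock)
{
	assert(mm != NULL && src != NULL);

	if (needs_mmap_lock && mm->mmap_write_locked)
		return false;

	src->has_anon_vma = true;
	src->mapped_pages++;
	return true;
}

/*
 * Replays the quoted interleaving in a fixed order and reports whether the
 * child VMA ends up with pages mapped but no anon_vma.
 */
static bool toy_fork_is_inconsistent(bool needs_mmap_lock)
{
	struct toy_mm mm = { .mmap_write_locked = true };
	struct toy_vma src = { .has_anon_vma = false, .mapped_pages = 0 };
	struct toy_vma dst = { .has_anon_vma = false, .mapped_pages = 0 };

	/* CPU 0: src has no anon_vma yet, so the anon_vma fork is skipped. */
	if (src.has_anon_vma)
		dst.has_anon_vma = true;

	/* CPU 1: a concurrent fault tries to allocate an anon_vma for src. */
	(void)toy_fault_install_anon_vma(&mm, &src, needs_mmap_lock);

	/* CPU 0: the copy decision is made against src's current state. */
	if (src.has_anon_vma)
		dst.mapped_pages = src.mapped_pages;

	return dst.mapped_pages > 0 && !dst.has_anon_vma;
}
```

Calling toy_fork_is_inconsistent(false) models a fault path that needs no
mmap_lock and reports the inconsistent child; passing true models the
current requirement and the fault is simply deferred until fork is done.
It says nothing about the other users of the VMA write lock in fork.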
Thanks,
Yang
>
> Thanks,
> Yang
>
> >
> > >
> > > Anyway, just a crazy idea, I may miss some corner cases.
> >
> > To me, it seems that we could remove vma_start_write()
> > entirely now. Or is that an even crazier idea?
>
>
> >
> > Thanks
> > Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-18 21:21 ` Yang Shi
2026-05-19 11:07 ` Barry Song
@ 2026-05-19 13:12 ` Lorenzo Stoakes
2026-05-19 13:39 ` Lorenzo Stoakes
1 sibling, 1 reply; 56+ messages in thread
From: Lorenzo Stoakes @ 2026-05-19 13:12 UTC (permalink / raw)
To: Yang Shi
Cc: Barry Song, Matthew Wilcox, surenb, akpm, linux-mm, david, liam,
vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Mon, May 18, 2026 at 02:21:14PM -0700, Yang Shi wrote:
> On Sun, May 17, 2026 at 1:45 AM Barry Song <baohua@kernel.org> wrote:
> >
> > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > for an unpredictable amount of time.
> > > > >
> > > > > But does that actually happen? I find it hard to believe that thread A
> > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > > it still seems really unlikely to me.
> > > >
> > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > the entire VMA—just a portion of it is sufficient.
> > >
> > > Yes, but that still fails to answer "does this actually happen". How much
> > > performance is all this complexity in the page fault handler buying us?
> > > If you don't answer this question, I'm just going to go in and rip it
> > > all out.
> > >
> >
> > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > waiting for answers),
> >
> > As promised during LSF/MM/BPF, we conducted thorough
> > testing on Android phones to determine whether performing
> > I/O in `filemap_fault()` can block `vma_start_write()`.
> > I wanted to give a quick update on this question.
> >
> > Nanzhe at Xiaomi created tracing scripts and ran various
> > applications on Android devices with I/O performed under
> > the VMA lock in `filemap_fault()`. We found that:
> >
> > 1. There are very few cases where unmap() is blocked by
> > page faults. I assume this is due to buggy user code
> > or poor synchronization between reads and unmap().
> > So I assume it is not a problem.
> >
> > 2. We observed many cases where `vma_start_write()`
> > is blocked by page-fault I/O in some applications.
> > The blocking occurs in the `dup_mmap()` path during
> > fork().
> >
> > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > the parent process when forking"), we now always hold
> > `vma_write_lock()` for each VMA. Note that the
> > `mmap_lock` write lock is also held, which could lead to
> > chained waiting if page-fault I/O is performed without
> > releasing the VMA lock.
> >
> > My gut feeling is that Suren's commit may be overshooting,
> > so my rough idea is that we might want to do something like
> > the following (we haven't tested it yet and it might be
> > wrong):
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 2311ae7c2ff4..5ddaf297f31a 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > *mm, struct mm_struct *oldmm)
> > for_each_vma(vmi, mpnt) {
> > struct file *file;
> >
> > - retval = vma_start_write_killable(mpnt);
> > + /*
> > + * For anonymous or writable private VMAs, prevent
> > + * concurrent CoW faults.
> > + */
> > + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > + (mpnt->vm_flags & VM_WRITE)))
> > + retval = vma_start_write_killable(mpnt);
> > if (retval < 0)
> > goto loop_out;
> > if (mpnt->vm_flags & VM_DONTCOPY) {
>
> Maybe a little bit off topic. This is an interesting idea. It seems
> possible we don't have to take vma write lock unconditionally. IIUC
> the write lock is mainly used to serialize against page fault and
> madvise, right? I got a crazy idea off the top of my head. We may be
Err no, it serialises against literally any modification or read of any
characteristic of VMAs.
> able to just take vma write lock iff vma->anon_vma is not NULL.
Except if we don't take it and vma->anon_vma is NULL, then somebody can
anon_vma_prepare() and change vma->anon_vma midway through a fork and completely
screw up the anon_vma fork hierarchy.
So no.
>
> First of all, write mmap_lock is held, so the vma can't go or be
> changed under us.
vma->anon_vma can be changed.
>
> Secondly, if vma->anon_vma is NULL, it basically means either no page
> fault happened or no cow happened, so there is no page table to copy,
> this is also what copy_page_range() does currently. So we can shrink
> the critical section to:
Firstly, with no VMA write lock, !vma->anon_vma means a fault can race, and
secondly copy_page_range() checks vma_needs_copy(), which covers other cases:
PFN maps, mixed maps, UFFD W/P (ugh), guard regions.
So yeah this isn't sufficient.
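To make the gap concrete, here is a minimal userspace sketch, not kernel code.
It assumes the copy decision depends only on the cases listed above; the
fork_model_* names and flag values are hypothetical and exist only for this
illustration. It compares "page tables must be copied" against "the proposal
would take the VMA write lock":

```c
#include <stdbool.h>

/* Hypothetical flag bits for this model only; not the kernel's values. */
#define FORK_MODEL_PFNMAP   (1UL << 0)
#define FORK_MODEL_MIXEDMAP (1UL << 1)
#define FORK_MODEL_UFFD_WP  (1UL << 2)
#define FORK_MODEL_GUARD    (1UL << 3) /* stand-in for "has guard regions" */

struct fork_model_vma {
	unsigned long flags;
	bool has_anon_vma; /* models vma->anon_vma != NULL */
};

/* Model of the decision: must fork copy this VMA's page tables? */
static inline bool fork_model_needs_copy(const struct fork_model_vma *dst,
					 const struct fork_model_vma *src)
{
	if (dst->flags & FORK_MODEL_UFFD_WP)
		return true;
	if (src->flags & (FORK_MODEL_PFNMAP | FORK_MODEL_MIXEDMAP))
		return true;
	if (src->flags & FORK_MODEL_GUARD)
		return true;
	return src->has_anon_vma;
}

/* The proposal under discussion: lock only when anon_vma is set. */
static inline bool fork_model_proposed_lock(const struct fork_model_vma *src)
{
	return src->has_anon_vma;
}

/* True when page tables would be copied without the VMA write lock held. */
static inline bool fork_model_gap(unsigned long src_flags,
				  unsigned long dst_flags, bool has_anon_vma)
{
	struct fork_model_vma src = { src_flags, has_anon_vma };
	struct fork_model_vma dst = { dst_flags, false };

	return fork_model_needs_copy(&dst, &src) &&
	       !fork_model_proposed_lock(&src);
}
```

In this model, a PFN-mapped VMA with no anon_vma is exactly such a gap: it
needs copying, but the proposed condition would skip the lock.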
>
> if (vma->anon_vma) {
> vma_start_write_killable(src_vma);
> anon_vma_fork(dst_vma, src_vma);
> copy_page_range(dst_vma, src_vma);
> }
Yeah that's totally broken for reasons above as I said :)
>
> But page fault can happen before write mmap_lock is taken, when we
> check vma->anon_vma, it is possible it has not been set up yet. But it
> seems to be equivalent to page fault after fork and won't break the
> semantic.
It will totally break how the anon_vma hierarchy works :) See the links at the
top of https://ljs.io/talks for a link to various slides on anon_vma behaviour
(it's really a pain to think about because it's a super broken abstraction).
You could end up with a CoW mapping that's unreachable from rmap and you could
get some nasty issues with page table entries pointing at freed folios :)
>
> Anyway, just a crazy idea, I may miss some corner cases.
Yeah sorry to push back here but this is just not a viable approach.
And this is forgetting that we have relied on page faults being blocked by fork
_forever_, who knows what else has baked in assumptions about that
serialisation.
Forking is one of the nastiest parts of mm and has had multiple, subtle, corner
case breakages that have been a nightmare to deal with.
So I'm very much against changing this behaviour to try to fix something in the
fault path.
We should address the fault path issues in the fault path :)
>
> Thanks,
> Yang
>
> }
>
> >
> > Based on the above, we may want to re-check whether fork()
> > can be blocked by page faults. At the same time, if Suren,
> > you, or anyone else has any comments, please feel free to
> > share them.
> >
> > Best Regards
> > Barry
> >
Cheers, Lorenzo
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-19 13:12 ` Lorenzo Stoakes
@ 2026-05-19 13:39 ` Lorenzo Stoakes
2026-05-19 18:41 ` Yang Shi
0 siblings, 1 reply; 56+ messages in thread
From: Lorenzo Stoakes @ 2026-05-19 13:39 UTC (permalink / raw)
To: Yang Shi
Cc: Barry Song, Matthew Wilcox, surenb, akpm, linux-mm, david, liam,
vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Tue, May 19, 2026 at 02:12:10PM +0100, Lorenzo Stoakes wrote:
> On Mon, May 18, 2026 at 02:21:14PM -0700, Yang Shi wrote:
> > Maybe a little bit off topic. This is an interesting idea. It seems
> > possible we don't have to take vma write lock unconditionally. IIUC
> > the write lock is mainly used to serialize against page fault and
> > madvise, right? I got a crazy idea off the top of my head. We may be
>
> Err no, it serialises against literally any modification or read of any
> characteristic of VMAs.
>
> > able to just take vma write lock iff vma->anon_vma is not NULL.
>
> Except if we don't take it and vma->anon_vma is NULL, then somebody can
> anon_vma_prepare() and change vma->anon_vma midway through a fork and completely
> screw up the anon_vma fork hierarchy.
Correction: this won't happen, as per Barry (see, I managed to confuse myself
here :), since installing vma->anon_vma takes the mmap read lock.
BUT we also have to consider other cases.
>
> So no.
>
> >
> > First of all, write mmap_lock is held, so the vma can't go or be
> > changed under us.
>
> vma->anon_vma can be changed.
Correction: no it can't :)
>
> >
> > Secondly, if vma->anon_vma is NULL, it basically means either no page
> > fault happened or no cow happened, so there is no page table to copy,
> > this is also what copy_page_range() does currently. So we can shrink
> > the critical section to:
>
> Firstly, with no VMA write lock, !vma->anon_vma means a fault can race and
> secondly copy_page_range() checks vma_needs_copy(), there are other cases - PFN
> maps, mixed maps, UFFD W/P (ugh), guard regions.
>
> So yeah this isn't sufficient.
However this is true...
>
> >
> > if (vma->anon_vma) {
> > vma_start_write_killable(src_vma);
> > anon_vma_fork(dst_vma, src_vma);
> > copy_page_range(dst_vma, src_vma);
> > }
>
> Yeah that's totally broken for reasons above as I said :)
>
> >
> > But page fault can happen before write mmap_lock is taken, when we
> > check vma->anon_vma, it is possible it has not been set up yet. But it
> > seems to be equivalent to page fault after fork and won't break the
> > semantic.
>
> It will totally break how the anon_vma hierarchy works :) See the links at the
> top of https://ljs.io/talks for a link to various slides on anon_vma behaviour
> (it's really a pain to think about because it's a super broken abstraction).
>
> You could end up with a CoW mapping that's unreachable from rmap and you could
> get some nasty issues with page table entries pointing at freed folios :)
Correction: actually we should be safe given mmap read lock on anon_vma install.
>
> >
> > Anyway, just a crazy idea, I may miss some corner cases.
>
> Yeah sorry to push back here but this is just not a viable approach.
>
> And this is forgetting that we have relied on page faults being blocked by fork
> _forever_, who knows what else has baked in assumptions about that
> serialisation.
>
> Forking is one of the nastiest parts of mm and has had multiple, subtle, corner
> case breakages that have been a nightmare to deal with.
>
> So I'm very much against changing this behaviour to try to fix something in the
> fault path.
>
> We should address the fault path issues in the fault path :)
Above still all true though.
>
> >
> > Thanks,
> > Yang
> >
> > }
> >
> > >
> > > Based on the above, we may want to re-check whether fork()
> > > can be blocked by page faults. At the same time, if Suren,
> > > you, or anyone else has any comments, please feel free to
> > > share them.
> > >
> > > Best Regards
> > > Barry
> > >
>
> Cheers, Lorenzo
So still a nope :)
Cheers, Lorenzo
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-19 13:39 ` Lorenzo Stoakes
@ 2026-05-19 18:41 ` Yang Shi
2026-05-19 21:02 ` Yang Shi
0 siblings, 1 reply; 56+ messages in thread
From: Yang Shi @ 2026-05-19 18:41 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Barry Song, Matthew Wilcox, surenb, akpm, linux-mm, david, liam,
vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Tue, May 19, 2026 at 6:39 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Tue, May 19, 2026 at 02:12:10PM +0100, Lorenzo Stoakes wrote:
> > On Mon, May 18, 2026 at 02:21:14PM -0700, Yang Shi wrote:
> > > Maybe a little bit off topic. This is an interesting idea. It seems
> > > possible we don't have to take vma write lock unconditionally. IIUC
> > > the write lock is mainly used to serialize against page fault and
> > > madvise, right? I got a crazy idea off the top of my head. We may be
> >
> > Err no, it serialises against literally any modification or read of any
> > characteristic of VMAs.
If I remember correctly, you are not supposed to change VMA
flags/size/mm pointer/vm_file/pgoff/prot, etc, under read vma lock or
read mmap_lock.
> >
> > > able to just take vma write lock iff vma->anon_vma is not NULL.
> >
> > Except if we don't take it and vma->anon_vma is NULL, then somebody can
> > anon_vma_prepare() and change vma->anon_vma midway through a fork and completely
> > screw up the anon_vma fork hierarchy.
>
> correction: this won't happen as per Barry (see - I managed to confuse myself
> here :), since for vma->anon_vma install we take the mmap read lock.
>
> BUT we also have to consider other cases.
>
> >
> > So no.
> >
> > >
> > > First of all, write mmap_lock is held, so the vma can't go or be
> > > changed under us.
> >
> > vma->anon_vma can be changed.
>
> Correction: no it can't :)
Yes, vma->anon_vma change should require taking read mmap_lock.
>
> >
> > >
> > > Secondly, if vma->anon_vma is NULL, it basically means either no page
> > > fault happened or no cow happened, so there is no page table to copy,
> > > this is also what copy_page_range() does currently. So we can shrink
> > > the critical section to:
> >
> > Firstly, with no VMA write lock, !vma->anon_vma means a fault can race and
> > secondly copy_page_range() checks vma_needs_copy(), there are other cases - PFN
> > maps, mixed maps, UFFD W/P (ugh), guard regions.
> >
> > So yeah this isn't sufficient.
>
> However this is true...
Yes, fault can race with fork. Basically this is actually the purpose
of this idea. We can have improved page fault scalability. In my
proposal (take write vma lock if vma->anon_vma is not NULL), the race
just happens on the VMAs which page fault has not happened on before.
vma_needs_copy() also skips the VMAs which don't have vma->anon_vma.
So there is basically no difference in semantics other than more page
fault races IIUC. It should be safe as long as we can guarantee there
is no writable PTE pointing to a shared page after fork.
For guard regions, it can be serialized by vma write lock if
vma->anon_vma exists. If vma->anon_vma is NULL, it will prepare
anon_vma, which will take read mmap_lock if I read the code correctly.
I have not investigated UFFD yet.
>
> >
> > >
> > > if (vma->anon_vma) {
> > > vma_start_write_killable(src_vma);
> > > anon_vma_fork(dst_vma, src_vma);
> > > copy_page_range(dst_vma, src_vma);
> > > }
> >
> > Yeah that's totally broken for reasons above as I said :)
> >
> > >
> > > But page fault can happen before write mmap_lock is taken, when we
> > > check vma->anon_vma, it is possible it has not been set up yet. But it
> > > seems to be equivalent to page fault after fork and won't break the
> > > semantic.
> >
> > It will totally break how the anon_vma hierarchy works :) See the links at the
> > top of https://ljs.io/talks for a link to various slides on anon_vma behaviour
> > (it's really a pain to think about because it's a super broken abstraction).
> >
> > You could end up with a CoW mapping that's unreachable from rmap and you could
> > get some nasty issues with page table entries pointing at freed folios :)
>
> Correction: actually we should be safe given mmap read lock on anon_vma install.
>
> >
> > >
> > > Anyway, just a crazy idea, I may miss some corner cases.
> >
> > Yeah sorry to push back here but this is just not a viable approach.
No worries. Thanks for all the feedback. Just tried to explore whether
such an idea is feasible or not.
> >
> > And this is forgetting that we have relied on page faults being blocked by fork
> > _forever_, who knows what else has baked in assumptions about that
> > serialisation.
> >
> > Forking is one of the nastiest parts of mm and has had multiple, subtle, corner
> > case breakages that have been a nightmare to deal with.
Yes, this might be the biggest concern. The page fault can race with
fork. If some applications rely on such subtle behavior, it may break,
but such applications are fragile too.
> >
> > So I'm very much against changing this behaviour to try to fix something in the
> > fault path.
> >
> > We should address the fault path issues in the fault path :)
Yeah, this idea was inspired by Barry's "not take vma read lock
unconditionally" idea. Maybe irrelevant to Barry's priority inversion
problem, just an idea for further optimization on page fault
scalability. This probably should be a separate topic.
Thanks,
Yang
>
> Above still all true though.
>
> >
> > >
> > > Thanks,
> > > Yang
> > >
> > > }
> > >
> > > >
> > > > Based on the above, we may want to re-check whether fork()
> > > > can be blocked by page faults. At the same time, if Suren,
> > > > you, or anyone else has any comments, please feel free to
> > > > share them.
> > > >
> > > > Best Regards
> > > > Barry
> > > >
> >
> > Cheers, Lorenzo
>
> So still a nope :)
>
> Cheers, Lorenzo
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-19 18:41 ` Yang Shi
@ 2026-05-19 21:02 ` Yang Shi
2026-05-20 8:11 ` Lorenzo Stoakes
0 siblings, 1 reply; 56+ messages in thread
From: Yang Shi @ 2026-05-19 21:02 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Barry Song, Matthew Wilcox, surenb, akpm, linux-mm, david, liam,
vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Tue, May 19, 2026 at 11:41 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, May 19, 2026 at 6:39 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Tue, May 19, 2026 at 02:12:10PM +0100, Lorenzo Stoakes wrote:
> > > On Mon, May 18, 2026 at 02:21:14PM -0700, Yang Shi wrote:
> > > > Maybe a little bit off topic. This is an interesting idea. It seems
> > > > possible we don't have to take vma write lock unconditionally. IIUC
> > > > the write lock is mainly used to serialize against page fault and
> > > > madvise, right? I got a crazy idea off the top of my head. We may be
> > >
> > > Err no, it serialises against literally any modification or read of any
> > > characteristic of VMAs.
>
> If I remember correctly, you are not supposed to change VMA
> flags/size/mm pointer/vm_file/pgoff/prot, etc, under read vma lock or
> read mmap_lock.
>
> > >
> > > > able to just take vma write lock iff vma->anon_vma is not NULL.
> > >
> > > Except if we don't take it and vma->anon_vma is NULL, then somebody can
> > > anon_vma_prepare() and change vma->anon_vma midway through a fork and completely
> > > screw up the anon_vma fork hierarchy.
> >
> > correction: this won't happen as per Barry (see - I managed to confuse myself
> > here :), since for vma->anon_vma install we take the mmap read lock.
> >
> > BUT we also have to consider other cases.
> >
> > >
> > > So no.
> > >
> > > >
> > > > First of all, write mmap_lock is held, so the vma can't go or be
> > > > changed under us.
> > >
> > > vma->anon_vma can be changed.
> >
> > Correction: no it can't :)
>
> Yes, vma->anon_vma change should require taking read mmap_lock.
>
> >
> > >
> > > >
> > > > Secondly, if vma->anon_vma is NULL, it basically means either no page
> > > > fault happened or no cow happened, so there is no page table to copy,
> > > > this is also what copy_page_range() does currently. So we can shrink
> > > > the critical section to:
> > >
> > > Firstly, with no VMA write lock, !vma->anon_vma means a fault can race and
> > > secondly copy_page_range() checks vma_needs_copy(), there are other cases - PFN
> > > maps, mixed maps, UFFD W/P (ugh), guard regions.
> > >
> > > So yeah this isn't sufficient.
> >
> > However this is true...
>
> Yes, fault can race with fork. Basically this is actually the purpose
> of this idea. We can have improved page fault scalability. In my
> proposal (take write vma lock if vma->anon_vma is not NULL), the race
> just happens on the VMAs which page fault has not happened on before.
Sorry, this is incorrect. Page fault can't happen on those VMAs
because page fault needs to create anon_vma, but it requires taking
mmap_lock.
If anon_vma is not NULL, vma write lock will serialize against page
fault. So there should be no race with page fault. Removing vma write
lock suggested by Barry may increase race.
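A minimal userspace sketch of this reasoning, not kernel code. It assumes a
per-VMA-lock fault backs off with a retry whenever the VMA is write-locked, or
when it must install an anon_vma but cannot take mmap_lock for read. The
fault_model_* names are hypothetical, and only faults that need an anon_vma
are modelled:

```c
#include <stdbool.h>

enum fault_model_result { FAULT_MODEL_OK, FAULT_MODEL_RETRY };

struct fault_model_vma {
	bool mmap_write_locked; /* fork holds mmap_lock for write */
	bool vma_write_locked;  /* fork holds the VMA write lock */
	bool has_anon_vma;      /* models vma->anon_vma != NULL */
};

/* One anonymous fault attempt on the per-VMA-lock path. */
static inline enum fault_model_result
fault_model_anon_fault(struct fault_model_vma *vma)
{
	/* A write-locked VMA cannot be read-locked by the fault path. */
	if (vma->vma_write_locked)
		return FAULT_MODEL_RETRY;

	if (!vma->has_anon_vma) {
		/* Installing anon_vma needs mmap_lock for read; a writer blocks it. */
		if (vma->mmap_write_locked)
			return FAULT_MODEL_RETRY;
		vma->has_anon_vma = true;
	}
	return FAULT_MODEL_OK;
}

/* Convenience wrapper: a fault attempted while fork holds mmap_lock for write. */
static inline enum fault_model_result
fault_model_during_fork(bool has_anon_vma, bool vma_write_locked)
{
	struct fault_model_vma vma = {
		.mmap_write_locked = true,
		.vma_write_locked = vma_write_locked,
		.has_anon_vma = has_anon_vma,
	};

	return fault_model_anon_fault(&vma);
}
```

In this model the only combination that lets a fault proceed during fork is an
existing anon_vma with no VMA write lock taken, which is the case the
unconditional lock exists to close.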
Thanks,
Yang
> vma_needs_copy() also skips the VMAs which don't have vma->anon_vma.
> So there is basically no difference in semantics other than more page
> fault races IIUC. It should be safe as long as we can guarantee there
> is no writable PTE point to a shared page after fork.
>
> For guard regions, it can be serialized by vma write lock if
> vma->anon_vma exists. If vma->anon_vma is NULL, it will prepare
> anon_vma, which will take read mmap_lock if I read the code correctly.
>
> I have not investigated UFFD yet.
>
> >
> > >
> > > >
> > > > if (vma->anon_vma) {
> > > > vma_start_write_killable(src_vma);
> > > > anon_vma_fork(dst_vma, src_vma);
> > > > copy_page_range(dst_vma, src_vma);
> > > > }
> > >
> > > > Yeah that's totally broken for reasons above as I said :)
> > >
> > > >
> > > > But page fault can happen before write mmap_lock is taken, when we
> > > > check vma->anon_vma, it is possible it has not been set up yet. But it
> > > > seems to be equivalent to page fault after fork and won't break the
> > > > semantic.
> > >
> > > It will totally break how the anon_vma hierarchy works :) See the links at the
> > > top of https://ljs.io/talks for a link to various slides on anon_vma behaviour
> > > (it's really a pain to think about because it's a super broken abstraction).
> > >
> > > You could end up with a CoW mapping that's unreachable from rmap and you could
> > > get some nasty issues with page table entries pointing at freed folios :)
> >
> > Correction: actually we should be safe given mmap read lock on anon_vma install.
> >
> > >
> > > >
> > > > Anyway, just a crazy idea, I may miss some corner cases.
> > >
> > > Yeah sorry to push back here but this is just not a viable approach.
>
> No worries. Thanks for all the feedback. Just tried to explore whether
> such an idea is feasible or not.
>
> > >
> > > And this is forgetting that we have relied on page faults being blocked by fork
> > > _forever_, who knows what else has baked in assumptions about that
> > > serialisation.
> > >
> > > Forking is one of the nastiest parts of mm and has had multiple, subtle, corner
> > > case breakages that have been a nightmare to deal with.
>
> Yes, this might be the biggest concern. The page fault can race with
> fork. If some applications rely on such subtle behavior, it may break,
> but such applications are fragile too.
>
> > >
> > > So I'm very much against changing this behaviour to try to fix something in the
> > > fault path.
> > >
> > > We should address the fault path issues in the fault path :)
>
> Yeah, this idea was inspired by Barry's "not take vma read lock
> unconditionally" idea. Maybe irrelevant to Barry's priority inversion
> problem, just an idea for further optimization on page fault
> scalability. This probably should be a separate topic.
>
> Thanks,
> Yang
>
> >
> > Above still all true though.
> >
> > >
> > > >
> > > > Thanks,
> > > > Yang
> > > >
> > > > }
> > > >
> > > > >
> > > > > Based on the above, we may want to re-check whether fork()
> > > > > can be blocked by page faults. At the same time, if Suren,
> > > > > you, or anyone else has any comments, please feel free to
> > > > > share them.
> > > > >
> > > > > Best Regards
> > > > > Barry
> > > > >
> > >
> > > Cheers, Lorenzo
> >
> > So still a nope :)
> >
> > Cheers, Lorenzo
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-19 21:02 ` Yang Shi
@ 2026-05-20 8:11 ` Lorenzo Stoakes
0 siblings, 0 replies; 56+ messages in thread
From: Lorenzo Stoakes @ 2026-05-20 8:11 UTC (permalink / raw)
To: Yang Shi
Cc: Barry Song, Matthew Wilcox, surenb, akpm, linux-mm, david, liam,
vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
On Tue, May 19, 2026 at 02:02:09PM -0700, Yang Shi wrote:
> On Tue, May 19, 2026 at 11:41 AM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > >
> > > > >
> > > > > Secondly, if vma->anon_vma is NULL, it basically means either no page
> > > > > fault happened or no cow happened, so there is no page table to copy,
> > > > > this is also what copy_page_range() does currently. So we can shrink
> > > > > the critical section to:
> > > >
> > > > Firstly, with no VMA write lock, !vma->anon_vma means a fault can race and
> > > > secondly copy_page_range() checks vma_needs_copy(), there are other cases - PFN
> > > > maps, mixed maps, UFFD W/P (ugh), guard regions.
> > > >
> > > > So yeah this isn't sufficient.
> > >
> > > However this is true...
> >
> > Yes, fault can race with fork. Basically this is actually the purpose
> > of this idea. We can have improved page fault scalability. In my
> > proposal (take write vma lock if vma->anon_vma is not NULL), the race
> > just happens on the VMAs which page fault has not happened on before.
>
> Sorry, this is incorrect. Page fault can't happen on those VMAs
> because page fault needs to create anon_vma, but it requires taking
> mmap_lock.
> If anon_vma is not NULL, vma write lock will serialize against page
> fault. So there should be no race with page fault. Removing vma write
> lock suggested by Barry may increase race.
Firstly, let's none of us be worried about making mistakes here, the anon_vma
stuff is confusing, and I've stared at it more than most, and even so I
managed to make mistakes (as corrected here) and forget details :))
It's a sign it all needs simplifying, but hey that's what my scalable CoW
project is (partly) about :)
Removing the VMA write lock would cause races with page fault which can result
in page tables being installed which are then not correctly duplicated for
ranges that must be.
And again, I think the underlying thing here overall is:
1. Clearly many cases require serialisation (any that cause copy_page_range() to
fire).
2. If we were to decide not to take a lock with concurrent page faults, that
lays a trap for any future change that (reasonably) assumes that page tables
cannot be simultaneously copied while being accessible to page fault
handlers, which is bug prone.
3. As per 2, even if we were to only take the lock when we felt we absolutely
needed to, we still cause risk through adding yet another 'you just have to
know' risk to this part of mm.
4. The serialisation is quite likely relied upon by other things, this is often
the case in mm, and we may only realise that such serialisation is critical
at the point a subtle issue arises out of it.
5. Fork is one of the most sensitive, intuition-defying, complicated, and
corner-case-problem-baiting areas of mm and I really oppose us changing
fundamental behaviour here unless incredibly well justified.
On this basis, let's let the sleeping dogs lie and leave fork alone I think :)
I think I am far more inclined to take Barry's fault approach (as I've said to
him) vs. changing fork behaviour.
But I want to make sure there's not a 'third way' that could avoid either!
I am going to have a look through Barry's series in detail so we can have some
movement on this one way or another :)
>
> Thanks,
> Yang
>
Cheers, Lorenzo
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-04-30 12:37 ` [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance Matthew Wilcox
2026-04-30 22:49 ` Barry Song
@ 2026-05-01 15:52 ` Lorenzo Stoakes
2026-05-01 16:06 ` Matthew Wilcox
2026-05-01 17:59 ` Barry Song
1 sibling, 2 replies; 56+ messages in thread
From: Lorenzo Stoakes @ 2026-05-01 15:52 UTC (permalink / raw)
To: Barry Song (Xiaomi)
Cc: Matthew Wilcox, akpm, linux-mm, david, liam, vbabka, rppt, surenb,
mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Thu, Apr 30, 2026 at 01:37:14PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 30, 2026 at 12:04:22PM +0800, Barry Song (Xiaomi) wrote:
> > (1) If we need to wait for I/O completion, we still drop the per-VMA lock, as
> > current page fault handling already does. Holding it for too long may introduce
> > various priority inversion issues on mobile devices. After I/O completes, we
> > retry the page fault with the per-VMA lock, rather than falling back to
> > mmap_lock.
>
> You're going to have to do better than that. You know I hate the
> additional complexity you're adding. You need to explain why my idea of
> ripping out all the complexity now that we have per-VMA locks doesn't
> work.
After a brief eyeball I share Matthew's assessment, I really don't like this
series, it's piling on complexity for what seem like niche cases.
We already have enough weirdness in fault code honestly.
Let's maybe discuss at LSF if you're attending?
I will try to have a more thorough look through when I get a chance.
Thanks, Lorenzo
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-01 15:52 ` Lorenzo Stoakes
@ 2026-05-01 16:06 ` Matthew Wilcox
2026-05-01 17:09 ` Lorenzo Stoakes
2026-05-01 17:59 ` Barry Song
1 sibling, 1 reply; 56+ messages in thread
From: Matthew Wilcox @ 2026-05-01 16:06 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Barry Song (Xiaomi), akpm, linux-mm, david, liam, vbabka, rppt,
surenb, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Fri, May 01, 2026 at 04:52:12PM +0100, Lorenzo Stoakes wrote:
> After a brief eyeball I share Matthew's assessment, I really don't like this
> series, it's piling on complexity for what seem like niche cases.
I don't think they're niche cases ... I think it's a real problem.
While our current code performs better for this workload than the
pre-vma-lock code did, it doesn't perform as well as it could.
> We already have enough weirdness in fault code honestly.
>
> Let's maybe discuss at LSF if you're attending?
Not only is he attending, there's a topic scheduled (currently 10:30 on
Wednesday).
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-01 16:06 ` Matthew Wilcox
@ 2026-05-01 17:09 ` Lorenzo Stoakes
0 siblings, 0 replies; 56+ messages in thread
From: Lorenzo Stoakes @ 2026-05-01 17:09 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Barry Song (Xiaomi), akpm, linux-mm, david, liam, vbabka, rppt,
surenb, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Fri, May 01, 2026 at 05:06:02PM +0100, Matthew Wilcox wrote:
> On Fri, May 01, 2026 at 04:52:12PM +0100, Lorenzo Stoakes wrote:
> > After a brief eyeball I share Matthew's assessment, I really don't like this
> > series, it's piling on complexity for what seem like niche cases.
>
> I don't think they're niche cases ... I think it's a real problem.
> While our current code performs better for this workload than the
> pre-vma-lock code did, it doesn't perform as well as it could.
>
> > We already have enough weirdness in fault code honestly.
> >
> > Let's maybe discuss at LSF if you're attending?
>
> Not only is he attending, there's a topic scheduled (currently 10:30 on
> Wednesday).
Well then, let's revisit this in person in Zagreb :)
Cheers, Lorenzo
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
2026-05-01 15:52 ` Lorenzo Stoakes
2026-05-01 16:06 ` Matthew Wilcox
@ 2026-05-01 17:59 ` Barry Song
1 sibling, 0 replies; 56+ messages in thread
From: Barry Song @ 2026-05-01 17:59 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Matthew Wilcox, akpm, linux-mm, david, liam, vbabka, rppt, surenb,
mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390
On Fri, May 1, 2026 at 11:52 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Thu, Apr 30, 2026 at 01:37:14PM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 30, 2026 at 12:04:22PM +0800, Barry Song (Xiaomi) wrote:
> > > (1) If we need to wait for I/O completion, we still drop the per-VMA lock, as
> > > current page fault handling already does. Holding it for too long may introduce
> > > various priority inversion issues on mobile devices. After I/O completes, we
> > > retry the page fault with the per-VMA lock, rather than falling back to
> > > mmap_lock.
> >
> > You're going to have to do better than that. You know I hate the
> > additional complexity you're adding. You need to explain why my idea of
> > ripping out all the complexity now that we have per-VMA locks doesn't
> > work.
>
> After a brief eyeball I share Matthew's assessment, I really don't like this
> series, it's piling on complexity for what seem like niche cases.
I’d really appreciate it if you could point out the specific parts you
dislike, rather than the whole series—I don’t think that’s a fair
assessment.
I’m not sure what you mean by “niche cases.” Do you mean avoiding taking
mmap_lock for major page faults, or releasing the per-VMA lock and retrying
the page fault?
Right now, major page faults always fall back to mmap_lock, which is a
significant source of lock contention. I assume we agree that this fallback
should be eliminated. Or is there still no agreement on this point either?
Where we may differ is whether to hold the per-VMA lock and
avoid retrying the page fault, or to rely on retrying the
fault while using the per-VMA lock (with the current
mainline approach using mmap_lock instead)?
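To spell out the three options being compared, here is a small userspace
sketch, not kernel code. The retry_model_* names and RETRY_* constants are
hypothetical labels for this illustration; the model records only whether the
per-VMA lock is dropped for I/O and which lock the retried fault takes:

```c
#include <stdbool.h>

enum retry_model_lock {
	RETRY_LOCK_NONE,    /* no retry happens */
	RETRY_LOCK_PER_VMA, /* retry under the per-VMA lock */
	RETRY_LOCK_MMAP,    /* retry falls back to mmap_lock */
};

enum retry_model_policy {
	RETRY_POLICY_MAINLINE, /* drop the lock for I/O, retry under mmap_lock */
	RETRY_POLICY_SERIES,   /* drop the lock for I/O, retry under per-VMA lock */
	RETRY_POLICY_HOLD,     /* keep the per-VMA lock across I/O, no retry */
};

struct retry_model_outcome {
	bool dropped_lock_for_io;
	enum retry_model_lock retry_lock;
};

/* What a major fault that must wait for I/O does under each policy. */
static inline struct retry_model_outcome
retry_model_major_fault(enum retry_model_policy policy)
{
	struct retry_model_outcome out = { false, RETRY_LOCK_NONE };

	switch (policy) {
	case RETRY_POLICY_MAINLINE:
		out.dropped_lock_for_io = true;
		out.retry_lock = RETRY_LOCK_MMAP;
		break;
	case RETRY_POLICY_SERIES:
		out.dropped_lock_for_io = true;
		out.retry_lock = RETRY_LOCK_PER_VMA;
		break;
	case RETRY_POLICY_HOLD:
		/* Lock stays held during I/O, so writers wait on it. */
		break;
	}
	return out;
}
```

Only the mainline policy ever touches mmap_lock here; the other two differ in
whether writers can be blocked for the duration of the I/O.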
>
> We already have enough weirdness in fault code honestly.
>
> Let's maybe discuss at LSF if you're attending?
Sure :-)
>
> I will try to have a more thorough look through when I get a chance.
Thank you, much appreciated.
Best Regards
Barry