* [PATCH v6 0/6] Use killable vma write locking in most places
@ 2026-03-27 20:54 Suren Baghdasaryan
2026-03-27 20:54 ` [PATCH v6 1/6] mm/vma: cleanup error handling path in vma_expand() Suren Baghdasaryan
0 siblings, 7 replies; 23+ messages in thread
From: Suren Baghdasaryan @ 2026-03-27 20:54 UTC (permalink / raw)
To: akpm
Cc: willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, ljs, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe,
chleroy, borntraeger, frankja, imbrenda, hca, gor, agordeev,
svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390, surenb
Now that we have vma_start_write_killable(), we can replace most
vma_start_write() calls with it, improving reaction time to a pending
fatal signal.
There are several places which are left untouched by this patchset:
1. free_pgtables(), because the function should free page tables even if
a fatal signal is pending.
2. userfaultfd code, where some paths calling vma_start_write() can
handle EINTR and some can't without deeper code refactoring.
3. mpol_rebind_mm(), which is used by the cpuset controller for migrations
and operates on a remote mm. Incomplete operations here would result
in an inconsistent cgroup state.
4. vm_flags_{set|mod|clear}, which require refactoring that moves
vma_start_write() out of these functions and replaces it with
vma_assert_write_locked(); callers of these functions would then
lock the vma themselves using vma_start_write_killable() whenever
possible.
Changes since v5 [1]:
- Added Reviewed-by for unchanged patches, per Lorenzo Stoakes
Patch#2:
- Fixed locked_vm counter if mlock_vma_pages_range() fails in
mlock_fixup(), per Sashiko
- Avoid VMA re-locking in madvise_update_vma(), mprotect_fixup() and
mseal_apply() when vma_modify_XXX creates a new VMA, as it will already
be locked. This prevents an incomplete operation if a signal arrives
after a successful vma_modify_XXX has modified the vma tree,
per Sashiko
- Removed obsolete comment in madvise_update_vma() and mprotect_fixup()
Patch#4:
- Added a clarifying comment for vma_start_write_killable() when locking
a detached VMA
- Override VMA_MERGE_NOMERGE in vma_expand() to prevent callers from
falling back to a new VMA allocation, per Sashiko
- Added a note in the changelog about the temporary workaround of using
ENOMEM to propagate the error in vma_merge_existing_range() and
vma_expand()
Patch#5:
- Added a fatal_signal_pending() check in do_mbind() to detect
queue_pages_range() failures due to a pending fatal signal, per Sashiko
[1] https://lore.kernel.org/all/20260326080836.695207-1-surenb@google.com/
Suren Baghdasaryan (6):
mm/vma: cleanup error handling path in vma_expand()
mm: use vma_start_write_killable() in mm syscalls
mm/khugepaged: use vma_start_write_killable() in collapse_huge_page()
mm/vma: use vma_start_write_killable() in vma operations
mm: use vma_start_write_killable() in process_vma_walk_lock()
KVM: PPC: use vma_start_write_killable() in
kvmppc_memslot_page_merge()
arch/powerpc/kvm/book3s_hv_uvmem.c | 5 +-
fs/proc/task_mmu.c | 12 +--
mm/khugepaged.c | 5 +-
mm/madvise.c | 13 ++-
mm/memory.c | 2 +
mm/mempolicy.c | 21 +++-
mm/mlock.c | 30 ++++--
mm/mprotect.c | 25 +++--
mm/mremap.c | 8 +-
mm/mseal.c | 24 ++++-
mm/pagewalk.c | 22 ++--
mm/vma.c | 162 ++++++++++++++++++++++-------
mm/vma_exec.c | 6 +-
13 files changed, 251 insertions(+), 84 deletions(-)
base-commit: e53c9040ab1b738dd2c83b57558f141902caaf4f
--
2.53.0.1018.g2bb0e51243-goog
* [PATCH v6 1/6] mm/vma: cleanup error handling path in vma_expand()
2026-03-27 20:54 [PATCH v6 0/6] Use killable vma write locking in most places Suren Baghdasaryan
@ 2026-03-27 20:54 ` Suren Baghdasaryan
2026-03-27 20:54 ` [PATCH v6 2/6] mm: use vma_start_write_killable() in mm syscalls Suren Baghdasaryan
6 siblings, 0 replies; 23+ messages in thread
From: Suren Baghdasaryan @ 2026-03-27 20:54 UTC (permalink / raw)
To: akpm
Cc: willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, ljs, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe,
chleroy, borntraeger, frankja, imbrenda, hca, gor, agordeev,
svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390, surenb, Lorenzo Stoakes
vma_expand() error handling is a bit confusing, with "if (ret) return ret;"
mixed with "if (!ret && ...) ret = ...;". Simplify the code to check
for errors and return immediately after each operation that might fail.
This also makes later changes to this function more readable.
Rename the variable storing the error code from "ret" to "err".
No functional change intended.
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Barry Song <baohua@kernel.org>
---
mm/vma.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/mm/vma.c b/mm/vma.c
index a43f3c5d4b3d..ba78ab1f397a 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -1170,7 +1170,7 @@ int vma_expand(struct vma_merge_struct *vmg)
vma_flags_t sticky_flags =
vma_flags_and_mask(&vmg->vma_flags, VMA_STICKY_FLAGS);
vma_flags_t target_sticky;
- int ret = 0;
+ int err = 0;
mmap_assert_write_locked(vmg->mm);
vma_start_write(target);
@@ -1200,12 +1200,16 @@ int vma_expand(struct vma_merge_struct *vmg)
* Note that, by convention, callers ignore OOM for this case, so
* we don't need to account for vmg->give_up_on_mm here.
*/
- if (remove_next)
- ret = dup_anon_vma(target, next, &anon_dup);
- if (!ret && vmg->copied_from)
- ret = dup_anon_vma(target, vmg->copied_from, &anon_dup);
- if (ret)
- return ret;
+ if (remove_next) {
+ err = dup_anon_vma(target, next, &anon_dup);
+ if (err)
+ return err;
+ }
+ if (vmg->copied_from) {
+ err = dup_anon_vma(target, vmg->copied_from, &anon_dup);
+ if (err)
+ return err;
+ }
if (remove_next) {
vma_flags_t next_sticky;
--
2.53.0.1018.g2bb0e51243-goog
* [PATCH v6 2/6] mm: use vma_start_write_killable() in mm syscalls
2026-03-27 20:54 [PATCH v6 0/6] Use killable vma write locking in most places Suren Baghdasaryan
2026-03-27 20:54 ` [PATCH v6 1/6] mm/vma: cleanup error handling path in vma_expand() Suren Baghdasaryan
@ 2026-03-27 20:54 ` Suren Baghdasaryan
2026-03-31 9:35 ` Lorenzo Stoakes (Oracle)
2026-03-27 20:54 ` [PATCH v6 3/6] mm/khugepaged: use vma_start_write_killable() in collapse_huge_page() Suren Baghdasaryan
6 siblings, 1 reply; 23+ messages in thread
From: Suren Baghdasaryan @ 2026-03-27 20:54 UTC (permalink / raw)
To: akpm
Cc: willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, ljs, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe,
chleroy, borntraeger, frankja, imbrenda, hca, gor, agordeev,
svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390, surenb
Replace vma_start_write() with vma_start_write_killable() in mm syscalls,
improving reaction time to a pending fatal signal.
In a number of places we now lock the VMA earlier than before, to avoid
doing work that would have to be undone if a fatal signal is pending.
This is safe because the moves happen within sections where we already
hold the mmap_write_lock, so they do not change the locking order
relative to other kernel locks.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
mm/madvise.c | 13 ++++++++++---
mm/memory.c | 2 ++
mm/mempolicy.c | 11 +++++++++--
mm/mlock.c | 30 ++++++++++++++++++++++++------
mm/mprotect.c | 25 +++++++++++++++++--------
mm/mremap.c | 8 +++++---
mm/mseal.c | 24 +++++++++++++++++++-----
7 files changed, 86 insertions(+), 27 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 69708e953cf5..f2c7b0512cdf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -172,10 +172,17 @@ static int madvise_update_vma(vm_flags_t new_flags,
if (IS_ERR(vma))
return PTR_ERR(vma);
- madv_behavior->vma = vma;
+ /*
+ * If a new vma was created during vma_modify_XXX, the resulting
+ * vma is already locked. Skip re-locking new vma in this case.
+ */
+ if (vma == madv_behavior->vma) {
+ if (vma_start_write_killable(vma))
+ return -EINTR;
+ } else {
+ madv_behavior->vma = vma;
+ }
- /* vm_flags is protected by the mmap_lock held in write mode. */
- vma_start_write(vma);
vma->flags = new_vma_flags;
if (set_new_anon_name)
return replace_anon_vma_name(vma, anon_name);
diff --git a/mm/memory.c b/mm/memory.c
index e44469f9cf65..9f99ec634831 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -366,6 +366,8 @@ void free_pgd_range(struct mmu_gather *tlb,
* page tables that should be removed. This can differ from the vma mappings on
* some archs that may have mappings that need to be removed outside the vmas.
* Note that the prev->vm_end and next->vm_start are often used.
+ * We don't use vma_start_write_killable() because page tables should be freed
+ * even if the task is being killed.
*
* The vma_end differs from the pg_end when a dup_mmap() failed and the tree has
* unrelated data to the mm_struct being torn down.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index fd08771e2057..c38a90487531 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1784,7 +1784,8 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
return -EINVAL;
if (end == start)
return 0;
- mmap_write_lock(mm);
+ if (mmap_write_lock_killable(mm))
+ return -EINTR;
prev = vma_prev(&vmi);
for_each_vma_range(vmi, vma, end) {
/*
@@ -1801,13 +1802,19 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
err = -EOPNOTSUPP;
break;
}
+ /*
+ * Lock the VMA early to avoid extra work if fatal signal
+ * is pending.
+ */
+ err = vma_start_write_killable(vma);
+ if (err)
+ break;
new = mpol_dup(old);
if (IS_ERR(new)) {
err = PTR_ERR(new);
break;
}
- vma_start_write(vma);
new->home_node = home_node;
err = mbind_range(&vmi, vma, &prev, start, end, new);
mpol_put(new);
diff --git a/mm/mlock.c b/mm/mlock.c
index 8c227fefa2df..2ed454db7cf7 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -419,8 +419,10 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
*
* Called for mlock(), mlock2() and mlockall(), to set @vma VM_LOCKED;
* called for munlock() and munlockall(), to clear VM_LOCKED from @vma.
+ *
+ * Return: 0 on success, -EINTR if fatal signal is pending.
*/
-static void mlock_vma_pages_range(struct vm_area_struct *vma,
+static int mlock_vma_pages_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
vma_flags_t *new_vma_flags)
{
@@ -442,7 +444,9 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
*/
if (vma_flags_test(new_vma_flags, VMA_LOCKED_BIT))
vma_flags_set(new_vma_flags, VMA_IO_BIT);
- vma_start_write(vma);
+ if (vma_start_write_killable(vma))
+ return -EINTR;
+
vma_flags_reset_once(vma, new_vma_flags);
lru_add_drain();
@@ -453,6 +457,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
vma_flags_clear(new_vma_flags, VMA_IO_BIT);
vma_flags_reset_once(vma, new_vma_flags);
}
+ return 0;
}
/*
@@ -506,11 +511,15 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
*/
if (vma_flags_test(&new_vma_flags, VMA_LOCKED_BIT) &&
vma_flags_test(&old_vma_flags, VMA_LOCKED_BIT)) {
+ ret = vma_start_write_killable(vma);
+ if (ret)
+ goto out; /* mm->locked_vm is fine as nr_pages == 0 */
/* No work to do, and mlocking twice would be wrong */
- vma_start_write(vma);
vma->flags = new_vma_flags;
} else {
- mlock_vma_pages_range(vma, start, end, &new_vma_flags);
+ ret = mlock_vma_pages_range(vma, start, end, &new_vma_flags);
+ if (ret)
+ mm->locked_vm -= nr_pages;
}
out:
*prev = vma;
@@ -739,9 +748,18 @@ static int apply_mlockall_flags(int flags)
error = mlock_fixup(&vmi, vma, &prev, vma->vm_start, vma->vm_end,
newflags);
- /* Ignore errors, but prev needs fixing up. */
- if (error)
+ if (error) {
+ /*
+ * If we failed due to a pending fatal signal, return
+ * now. If we locked the vma before signal arrived, it
+ * will be unlocked when we drop mmap_write_lock.
+ */
+ if (fatal_signal_pending(current))
+ return -EINTR;
+
+ /* Ignore errors, but prev needs fixing up. */
prev = vma;
+ }
cond_resched();
}
out:
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 110d47a36d4b..d6227877465f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -700,6 +700,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
const vma_flags_t old_vma_flags = READ_ONCE(vma->flags);
vma_flags_t new_vma_flags = legacy_to_vma_flags(newflags);
long nrpages = (end - start) >> PAGE_SHIFT;
+ struct vm_area_struct *new_vma;
unsigned int mm_cp_flags = 0;
unsigned long charged = 0;
int error;
@@ -756,19 +757,27 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
vma_flags_clear(&new_vma_flags, VMA_ACCOUNT_BIT);
}
- vma = vma_modify_flags(vmi, *pprev, vma, start, end, &new_vma_flags);
- if (IS_ERR(vma)) {
- error = PTR_ERR(vma);
+ new_vma = vma_modify_flags(vmi, *pprev, vma, start, end,
+ &new_vma_flags);
+ if (IS_ERR(new_vma)) {
+ error = PTR_ERR(new_vma);
goto fail;
}
- *pprev = vma;
-
/*
- * vm_flags and vm_page_prot are protected by the mmap_lock
- * held in write mode.
+ * If a new vma was created during vma_modify_flags, the resulting
+ * vma is already locked. Skip re-locking new vma in this case.
*/
- vma_start_write(vma);
+ if (new_vma == vma) {
+ error = vma_start_write_killable(vma);
+ if (error)
+ goto fail;
+ } else {
+ vma = new_vma;
+ }
+
+ *pprev = vma;
+
vma_flags_reset_once(vma, &new_vma_flags);
if (vma_wants_manual_pte_write_upgrade(vma))
mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
diff --git a/mm/mremap.c b/mm/mremap.c
index e9c8b1d05832..0860102bddab 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1348,6 +1348,11 @@ static unsigned long move_vma(struct vma_remap_struct *vrm)
if (err)
return err;
+ /* We don't want racing faults. */
+ err = vma_start_write_killable(vrm->vma);
+ if (err)
+ return err;
+
/*
* If accounted, determine the number of bytes the operation will
* charge.
@@ -1355,9 +1360,6 @@ static unsigned long move_vma(struct vma_remap_struct *vrm)
if (!vrm_calc_charge(vrm))
return -ENOMEM;
- /* We don't want racing faults. */
- vma_start_write(vrm->vma);
-
/* Perform copy step. */
err = copy_vma_and_data(vrm, &new_vma);
/*
diff --git a/mm/mseal.c b/mm/mseal.c
index 603df53ad267..1ea19fd3d384 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -70,14 +70,28 @@ static int mseal_apply(struct mm_struct *mm,
if (!vma_test(vma, VMA_SEALED_BIT)) {
vma_flags_t vma_flags = vma->flags;
+ struct vm_area_struct *new_vma;
vma_flags_set(&vma_flags, VMA_SEALED_BIT);
- vma = vma_modify_flags(&vmi, prev, vma, curr_start,
- curr_end, &vma_flags);
- if (IS_ERR(vma))
- return PTR_ERR(vma);
- vma_start_write(vma);
+ new_vma = vma_modify_flags(&vmi, prev, vma, curr_start,
+ curr_end, &vma_flags);
+ if (IS_ERR(new_vma))
+ return PTR_ERR(new_vma);
+
+ /*
+ * If a new vma was created during vma_modify_flags,
+ * the resulting vma is already locked.
+ * Skip re-locking new vma in this case.
+ */
+ if (new_vma == vma) {
+ int err = vma_start_write_killable(vma);
+ if (err)
+ return err;
+ } else {
+ vma = new_vma;
+ }
+
vma_set_flags(vma, VMA_SEALED_BIT);
}
--
2.53.0.1018.g2bb0e51243-goog
* [PATCH v6 3/6] mm/khugepaged: use vma_start_write_killable() in collapse_huge_page()
2026-03-27 20:54 [PATCH v6 0/6] Use killable vma write locking in most places Suren Baghdasaryan
2026-03-27 20:54 ` [PATCH v6 1/6] mm/vma: cleanup error handling path in vma_expand() Suren Baghdasaryan
2026-03-27 20:54 ` [PATCH v6 2/6] mm: use vma_start_write_killable() in mm syscalls Suren Baghdasaryan
@ 2026-03-27 20:54 ` Suren Baghdasaryan
2026-03-27 20:54 ` [PATCH v6 4/6] mm/vma: use vma_start_write_killable() in vma operations Suren Baghdasaryan
6 siblings, 0 replies; 23+ messages in thread
From: Suren Baghdasaryan @ 2026-03-27 20:54 UTC (permalink / raw)
To: akpm
Cc: willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, ljs, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe,
chleroy, borntraeger, frankja, imbrenda, hca, gor, agordeev,
svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390, surenb
Replace vma_start_write() with vma_start_write_killable() in
collapse_huge_page(), improving reaction time to a pending fatal signal.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
---
mm/khugepaged.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d06d84219e1b..a1825a4dec8b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1166,7 +1166,10 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
if (result != SCAN_SUCCEED)
goto out_up_write;
/* check if the pmd is still valid */
- vma_start_write(vma);
+ if (vma_start_write_killable(vma)) {
+ result = SCAN_FAIL;
+ goto out_up_write;
+ }
result = check_pmd_still_valid(mm, address, pmd);
if (result != SCAN_SUCCEED)
goto out_up_write;
--
2.53.0.1018.g2bb0e51243-goog
* [PATCH v6 4/6] mm/vma: use vma_start_write_killable() in vma operations
2026-03-27 20:54 [PATCH v6 0/6] Use killable vma write locking in most places Suren Baghdasaryan
2026-03-27 20:54 ` [PATCH v6 3/6] mm/khugepaged: use vma_start_write_killable() in collapse_huge_page() Suren Baghdasaryan
@ 2026-03-27 20:54 ` Suren Baghdasaryan
2026-03-31 10:24 ` Lorenzo Stoakes (Oracle)
2026-03-27 20:54 ` [PATCH v6 5/6] mm: use vma_start_write_killable() in process_vma_walk_lock() Suren Baghdasaryan
6 siblings, 1 reply; 23+ messages in thread
From: Suren Baghdasaryan @ 2026-03-27 20:54 UTC (permalink / raw)
To: akpm
Cc: willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, ljs, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe,
chleroy, borntraeger, frankja, imbrenda, hca, gor, agordeev,
svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390, surenb
Replace vma_start_write() with vma_start_write_killable() in the VMA
operations themselves, improving reaction time to a pending fatal signal.
To propagate errors from vma_merge_existing_range() and vma_expand(),
we fake an ENOMEM error when we fail due to a pending fatal signal.
This is a temporary workaround; fixing it properly requires some
refactoring and will be done separately in the future.
In a number of places we now lock the VMA earlier than before, to avoid
doing work that would have to be undone if a fatal signal is pending.
This is safe because the moves happen within sections where we already
hold the mmap_write_lock, so they do not change the locking order
relative to other kernel locks.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
mm/vma.c | 146 ++++++++++++++++++++++++++++++++++++++------------
mm/vma_exec.c | 6 ++-
2 files changed, 117 insertions(+), 35 deletions(-)
diff --git a/mm/vma.c b/mm/vma.c
index ba78ab1f397a..cc382217f730 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -524,6 +524,21 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT);
}
+ /*
+ * Lock VMAs before cloning to avoid extra work if fatal signal
+ * is pending.
+ */
+ err = vma_start_write_killable(vma);
+ if (err)
+ goto out_free_vma;
+ /*
+ * Locking a new detached VMA will always succeed but it's just a
+ * detail of the current implementation, so handle it all the same.
+ */
+ err = vma_start_write_killable(new);
+ if (err)
+ goto out_free_vma;
+
err = -ENOMEM;
vma_iter_config(vmi, new->vm_start, new->vm_end);
if (vma_iter_prealloc(vmi, new))
@@ -543,9 +558,6 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (new->vm_ops && new->vm_ops->open)
new->vm_ops->open(new);
- vma_start_write(vma);
- vma_start_write(new);
-
init_vma_prep(&vp, vma);
vp.insert = new;
vma_prepare(&vp);
@@ -900,12 +912,22 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
}
/* No matter what happens, we will be adjusting middle. */
- vma_start_write(middle);
+ err = vma_start_write_killable(middle);
+ if (err) {
+ /* Ensure error propagates. */
+ vmg->give_up_on_oom = false;
+ goto abort;
+ }
if (merge_right) {
vma_flags_t next_sticky;
- vma_start_write(next);
+ err = vma_start_write_killable(next);
+ if (err) {
+ /* Ensure error propagates. */
+ vmg->give_up_on_oom = false;
+ goto abort;
+ }
vmg->target = next;
next_sticky = vma_flags_and_mask(&next->flags, VMA_STICKY_FLAGS);
vma_flags_set_mask(&sticky_flags, next_sticky);
@@ -914,7 +936,12 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
if (merge_left) {
vma_flags_t prev_sticky;
- vma_start_write(prev);
+ err = vma_start_write_killable(prev);
+ if (err) {
+ /* Ensure error propagates. */
+ vmg->give_up_on_oom = false;
+ goto abort;
+ }
vmg->target = prev;
prev_sticky = vma_flags_and_mask(&prev->flags, VMA_STICKY_FLAGS);
@@ -1170,10 +1197,18 @@ int vma_expand(struct vma_merge_struct *vmg)
vma_flags_t sticky_flags =
vma_flags_and_mask(&vmg->vma_flags, VMA_STICKY_FLAGS);
vma_flags_t target_sticky;
- int err = 0;
+ int err;
mmap_assert_write_locked(vmg->mm);
- vma_start_write(target);
+ err = vma_start_write_killable(target);
+ if (err) {
+ /*
+ * Override VMA_MERGE_NOMERGE to prevent callers from
+ * falling back to a new VMA allocation.
+ */
+ vmg->state = VMA_MERGE_ERROR_NOMEM;
+ return err;
+ }
target_sticky = vma_flags_and_mask(&target->flags, VMA_STICKY_FLAGS);
@@ -1201,6 +1236,19 @@ int vma_expand(struct vma_merge_struct *vmg)
* we don't need to account for vmg->give_up_on_mm here.
*/
if (remove_next) {
+ /*
+ * Lock the VMA early to avoid extra work if fatal signal
+ * is pending.
+ */
+ err = vma_start_write_killable(next);
+ if (err) {
+ /*
+ * Override VMA_MERGE_NOMERGE to prevent callers from
+ * falling back to a new VMA allocation.
+ */
+ vmg->state = VMA_MERGE_ERROR_NOMEM;
+ return err;
+ }
err = dup_anon_vma(target, next, &anon_dup);
if (err)
return err;
@@ -1214,7 +1262,6 @@ int vma_expand(struct vma_merge_struct *vmg)
if (remove_next) {
vma_flags_t next_sticky;
- vma_start_write(next);
vmg->__remove_next = true;
next_sticky = vma_flags_and_mask(&next->flags, VMA_STICKY_FLAGS);
@@ -1252,9 +1299,14 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
unsigned long start, unsigned long end, pgoff_t pgoff)
{
struct vma_prepare vp;
+ int err;
WARN_ON((vma->vm_start != start) && (vma->vm_end != end));
+ err = vma_start_write_killable(vma);
+ if (err)
+ return err;
+
if (vma->vm_start < start)
vma_iter_config(vmi, vma->vm_start, start);
else
@@ -1263,8 +1315,6 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (vma_iter_prealloc(vmi, NULL))
return -ENOMEM;
- vma_start_write(vma);
-
init_vma_prep(&vp, vma);
vma_prepare(&vp);
vma_adjust_trans_huge(vma, start, end, NULL);
@@ -1453,7 +1503,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
if (error)
goto end_split_failed;
}
- vma_start_write(next);
+ error = vma_start_write_killable(next);
+ if (error)
+ goto munmap_gather_failed;
mas_set(mas_detach, vms->vma_count++);
error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
if (error)
@@ -1848,12 +1900,16 @@ static void vma_link_file(struct vm_area_struct *vma, bool hold_rmap_lock)
static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
{
VMA_ITERATOR(vmi, mm, 0);
+ int err;
+
+ err = vma_start_write_killable(vma);
+ if (err)
+ return err;
vma_iter_config(&vmi, vma->vm_start, vma->vm_end);
if (vma_iter_prealloc(&vmi, vma))
return -ENOMEM;
- vma_start_write(vma);
vma_iter_store_new(&vmi, vma);
vma_link_file(vma, /* hold_rmap_lock= */false);
mm->map_count++;
@@ -2239,9 +2295,8 @@ int mm_take_all_locks(struct mm_struct *mm)
* is reached.
*/
for_each_vma(vmi, vma) {
- if (signal_pending(current))
+ if (signal_pending(current) || vma_start_write_killable(vma))
goto out_unlock;
- vma_start_write(vma);
}
vma_iter_init(&vmi, mm, 0);
@@ -2540,8 +2595,8 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
struct mmap_action *action)
{
struct vma_iterator *vmi = map->vmi;
- int error = 0;
struct vm_area_struct *vma;
+ int error;
/*
* Determine the object being mapped and call the appropriate
@@ -2552,6 +2607,14 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
if (!vma)
return -ENOMEM;
+ /*
+ * Lock the VMA early to avoid extra work if fatal signal
+ * is pending.
+ */
+ error = vma_start_write_killable(vma);
+ if (error)
+ goto free_vma;
+
vma_iter_config(vmi, map->addr, map->end);
vma_set_range(vma, map->addr, map->end, map->pgoff);
vma->flags = map->vma_flags;
@@ -2582,8 +2645,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
WARN_ON_ONCE(!arch_validate_flags(map->vm_flags));
#endif
- /* Lock the VMA since it is modified after insertion into VMA tree */
- vma_start_write(vma);
vma_iter_store_new(vmi, vma);
map->mm->map_count++;
vma_link_file(vma, action->hide_from_rmap_until_complete);
@@ -2878,6 +2939,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
unsigned long addr, unsigned long len, vma_flags_t vma_flags)
{
struct mm_struct *mm = current->mm;
+ int err;
/*
* Check against address space limits by the changed size
@@ -2910,24 +2972,33 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (vma_merge_new_range(&vmg))
goto out;
- else if (vmg_nomem(&vmg))
+ if (vmg_nomem(&vmg)) {
+ err = -ENOMEM;
goto unacct_fail;
+ }
}
if (vma)
vma_iter_next_range(vmi);
/* create a vma struct for an anonymous mapping */
vma = vm_area_alloc(mm);
- if (!vma)
+ if (!vma) {
+ err = -ENOMEM;
goto unacct_fail;
+ }
vma_set_anonymous(vma);
vma_set_range(vma, addr, addr + len, addr >> PAGE_SHIFT);
vma->flags = vma_flags;
vma->vm_page_prot = vm_get_page_prot(vma_flags_to_legacy(vma_flags));
- vma_start_write(vma);
- if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
+ if (vma_start_write_killable(vma)) {
+ err = -EINTR;
+ goto vma_lock_fail;
+ }
+ if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL)) {
+ err = -ENOMEM;
goto mas_store_fail;
+ }
mm->map_count++;
validate_mm(mm);
@@ -2942,10 +3013,11 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
return 0;
mas_store_fail:
+vma_lock_fail:
vm_area_free(vma);
unacct_fail:
vm_unacct_memory(len >> PAGE_SHIFT);
- return -ENOMEM;
+ return err;
}
/**
@@ -3112,8 +3184,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *next;
unsigned long gap_addr;
- int error = 0;
VMA_ITERATOR(vmi, mm, vma->vm_start);
+ int error;
if (!vma_test(vma, VMA_GROWSUP_BIT))
return -EFAULT;
@@ -3149,12 +3221,14 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
/* We must make sure the anon_vma is allocated. */
if (unlikely(anon_vma_prepare(vma))) {
- vma_iter_free(&vmi);
- return -ENOMEM;
+ error = -ENOMEM;
+ goto vma_prep_fail;
}
/* Lock the VMA before expanding to prevent concurrent page faults */
- vma_start_write(vma);
+ error = vma_start_write_killable(vma);
+ if (error)
+ goto vma_lock_fail;
/* We update the anon VMA tree. */
anon_vma_lock_write(vma->anon_vma);
@@ -3183,8 +3257,10 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
}
}
anon_vma_unlock_write(vma->anon_vma);
- vma_iter_free(&vmi);
validate_mm(mm);
+vma_lock_fail:
+vma_prep_fail:
+ vma_iter_free(&vmi);
return error;
}
#endif /* CONFIG_STACK_GROWSUP */
@@ -3197,8 +3273,8 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
{
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *prev;
- int error = 0;
VMA_ITERATOR(vmi, mm, vma->vm_start);
+ int error;
if (!vma_test(vma, VMA_GROWSDOWN_BIT))
return -EFAULT;
@@ -3228,12 +3304,14 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
/* We must make sure the anon_vma is allocated. */
if (unlikely(anon_vma_prepare(vma))) {
- vma_iter_free(&vmi);
- return -ENOMEM;
+ error = -ENOMEM;
+ goto vma_prep_fail;
}
/* Lock the VMA before expanding to prevent concurrent page faults */
- vma_start_write(vma);
+ error = vma_start_write_killable(vma);
+ if (error)
+ goto vma_lock_fail;
/* We update the anon VMA tree. */
anon_vma_lock_write(vma->anon_vma);
@@ -3263,8 +3341,10 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
}
}
anon_vma_unlock_write(vma->anon_vma);
- vma_iter_free(&vmi);
validate_mm(mm);
+vma_lock_fail:
+vma_prep_fail:
+ vma_iter_free(&vmi);
return error;
}
diff --git a/mm/vma_exec.c b/mm/vma_exec.c
index 5cee8b7efa0f..8ddcc791d828 100644
--- a/mm/vma_exec.c
+++ b/mm/vma_exec.c
@@ -41,6 +41,7 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
struct vm_area_struct *next;
struct mmu_gather tlb;
PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length);
+ int err;
BUG_ON(new_start > new_end);
@@ -56,8 +57,9 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
* cover the whole range: [new_start, old_end)
*/
vmg.target = vma;
- if (vma_expand(&vmg))
- return -ENOMEM;
+ err = vma_expand(&vmg);
+ if (err)
+ return err;
/*
* move the page tables downwards, on failure we rely on
--
2.53.0.1018.g2bb0e51243-goog
* [PATCH v6 5/6] mm: use vma_start_write_killable() in process_vma_walk_lock()
2026-03-27 20:54 [PATCH v6 0/6] Use killable vma write locking in most places Suren Baghdasaryan
2026-03-27 20:54 ` [PATCH v6 4/6] mm/vma: use vma_start_write_killable() in vma operations Suren Baghdasaryan
@ 2026-03-27 20:54 ` Suren Baghdasaryan
2026-03-31 10:38 ` Lorenzo Stoakes (Oracle)
2026-03-27 20:54 ` [PATCH v6 6/6] KVM: PPC: use vma_start_write_killable() in kvmppc_memslot_page_merge() Suren Baghdasaryan
2026-03-27 23:12 ` [PATCH v6 0/6] Use killable vma write locking in most places Andrew Morton
6 siblings, 1 reply; 23+ messages in thread
From: Suren Baghdasaryan @ 2026-03-27 20:54 UTC (permalink / raw)
To: akpm
Cc: willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, ljs, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe,
chleroy, borntraeger, frankja, imbrenda, hca, gor, agordeev,
svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390, surenb
Replace vma_start_write() with vma_start_write_killable() when
process_vma_walk_lock() is used with the PGWALK_WRLOCK option.
Adjust its direct and indirect users to check for a possible error
and handle it, ensuring EINTR is handled correctly rather than
ignored. When queue_pages_range() fails, check whether it failed due
to a fatal signal or some other reason and return the appropriate
error.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
fs/proc/task_mmu.c | 12 ++++++------
mm/mempolicy.c | 10 +++++++++-
mm/pagewalk.c | 22 +++++++++++++++-------
3 files changed, 30 insertions(+), 14 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e091931d7ca1..33e5094a7842 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1774,15 +1774,15 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
struct vm_area_struct *vma;
enum clear_refs_types type;
int itype;
- int rv;
+ int err;
if (count > sizeof(buffer) - 1)
count = sizeof(buffer) - 1;
if (copy_from_user(buffer, buf, count))
return -EFAULT;
- rv = kstrtoint(strstrip(buffer), 10, &itype);
- if (rv < 0)
- return rv;
+ err = kstrtoint(strstrip(buffer), 10, &itype);
+ if (err)
+ return err;
type = (enum clear_refs_types)itype;
if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
return -EINVAL;
@@ -1824,7 +1824,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
0, mm, 0, -1UL);
mmu_notifier_invalidate_range_start(&range);
}
- walk_page_range(mm, 0, -1, &clear_refs_walk_ops, &cp);
+ err = walk_page_range(mm, 0, -1, &clear_refs_walk_ops, &cp);
if (type == CLEAR_REFS_SOFT_DIRTY) {
mmu_notifier_invalidate_range_end(&range);
flush_tlb_mm(mm);
@@ -1837,7 +1837,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
}
put_task_struct(task);
- return count;
+ return err ? : count;
}
const struct file_operations proc_clear_refs_operations = {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c38a90487531..51f298cfc33b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -969,6 +969,7 @@ static const struct mm_walk_ops queue_pages_lock_vma_walk_ops = {
* (a hugetlbfs page or a transparent huge page being counted as 1).
* -EIO - a misplaced page found, when MPOL_MF_STRICT specified without MOVEs.
* -EFAULT - a hole in the memory range, when MPOL_MF_DISCONTIG_OK unspecified.
+ * -EINTR - walk was terminated due to a pending fatal signal.
*/
static long
queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
@@ -1545,7 +1546,14 @@ static long do_mbind(unsigned long start, unsigned long len,
flags | MPOL_MF_INVERT | MPOL_MF_WRLOCK, &pagelist);
if (nr_failed < 0) {
- err = nr_failed;
+ /*
+ * queue_pages_range() might override the original error with -EFAULT.
+ * Confirm that fatal signals are still treated correctly.
+ */
+ if (fatal_signal_pending(current))
+ err = -EINTR;
+ else
+ err = nr_failed;
nr_failed = 0;
} else {
vma_iter_init(&vmi, mm, start);
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b..eca7bc711617 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -443,14 +443,13 @@ static inline void process_mm_walk_lock(struct mm_struct *mm,
mmap_assert_write_locked(mm);
}
-static inline void process_vma_walk_lock(struct vm_area_struct *vma,
- enum page_walk_lock walk_lock)
+static int process_vma_walk_lock(struct vm_area_struct *vma,
+ enum page_walk_lock walk_lock)
{
#ifdef CONFIG_PER_VMA_LOCK
switch (walk_lock) {
case PGWALK_WRLOCK:
- vma_start_write(vma);
- break;
+ return vma_start_write_killable(vma);
case PGWALK_WRLOCK_VERIFY:
vma_assert_write_locked(vma);
break;
@@ -462,6 +461,7 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
break;
}
#endif
+ return 0;
}
/*
@@ -505,7 +505,9 @@ int walk_page_range_mm_unsafe(struct mm_struct *mm, unsigned long start,
if (ops->pte_hole)
err = ops->pte_hole(start, next, -1, &walk);
} else { /* inside vma */
- process_vma_walk_lock(vma, ops->walk_lock);
+ err = process_vma_walk_lock(vma, ops->walk_lock);
+ if (err)
+ break;
walk.vma = vma;
next = min(end, vma->vm_end);
vma = find_vma(mm, vma->vm_end);
@@ -722,6 +724,7 @@ int walk_page_range_vma_unsafe(struct vm_area_struct *vma, unsigned long start,
.vma = vma,
.private = private,
};
+ int err;
if (start >= end || !walk.mm)
return -EINVAL;
@@ -729,7 +732,9 @@ int walk_page_range_vma_unsafe(struct vm_area_struct *vma, unsigned long start,
return -EINVAL;
process_mm_walk_lock(walk.mm, ops->walk_lock);
- process_vma_walk_lock(vma, ops->walk_lock);
+ err = process_vma_walk_lock(vma, ops->walk_lock);
+ if (err)
+ return err;
return __walk_page_range(start, end, &walk);
}
@@ -752,6 +757,7 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
.vma = vma,
.private = private,
};
+ int err;
if (!walk.mm)
return -EINVAL;
@@ -759,7 +765,9 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
return -EINVAL;
process_mm_walk_lock(walk.mm, ops->walk_lock);
- process_vma_walk_lock(vma, ops->walk_lock);
+ err = process_vma_walk_lock(vma, ops->walk_lock);
+ if (err)
+ return err;
return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
}
--
2.53.0.1018.g2bb0e51243-goog
* [PATCH v6 6/6] KVM: PPC: use vma_start_write_killable() in kvmppc_memslot_page_merge()
2026-03-27 20:54 [PATCH v6 0/6] Use killable vma write locking in most places Suren Baghdasaryan
` (4 preceding siblings ...)
2026-03-27 20:54 ` [PATCH v6 5/6] mm: use vma_start_write_killable() in process_vma_walk_lock() Suren Baghdasaryan
@ 2026-03-27 20:54 ` Suren Baghdasaryan
2026-03-27 23:12 ` [PATCH v6 0/6] Use killable vma write locking in most places Andrew Morton
6 siblings, 0 replies; 23+ messages in thread
From: Suren Baghdasaryan @ 2026-03-27 20:54 UTC (permalink / raw)
To: akpm
Cc: willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, ljs, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe,
chleroy, borntraeger, frankja, imbrenda, hca, gor, agordeev,
svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390, surenb, Ritesh Harjani (IBM)
Replace vma_start_write() with vma_start_write_killable() in
kvmppc_memslot_page_merge(), improving reaction time to the kill
signal.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
---
arch/powerpc/kvm/book3s_hv_uvmem.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 5fbb95d90e99..0a28b48a46b8 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -410,7 +410,10 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
ret = H_STATE;
break;
}
- vma_start_write(vma);
+ if (vma_start_write_killable(vma)) {
+ ret = H_STATE;
+ break;
+ }
/* Copy vm_flags to avoid partial modifications in ksm_madvise */
vm_flags = vma->vm_flags;
ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
--
2.53.0.1018.g2bb0e51243-goog
* Re: [PATCH v6 0/6] Use killable vma write locking in most places
2026-03-27 20:54 [PATCH v6 0/6] Use killable vma write locking in most places Suren Baghdasaryan
` (5 preceding siblings ...)
2026-03-27 20:54 ` [PATCH v6 6/6] KVM: PPC: use vma_start_write_killable() in kvmppc_memslot_page_merge() Suren Baghdasaryan
@ 2026-03-27 23:12 ` Andrew Morton
2026-03-31 9:51 ` Lorenzo Stoakes (Oracle)
6 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2026-03-27 23:12 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, ljs, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe,
chleroy, borntraeger, frankja, imbrenda, hca, gor, agordeev,
svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390
On Fri, 27 Mar 2026 13:54:51 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> Now that we have vma_start_write_killable() we can replace most of the
> vma_start_write() calls with it, improving reaction time to the kill
> signal.
>
> There are several places which are left untouched by this patchset:
>
> 1. free_pgtables(), because the function should free page tables even
> if a fatal signal is pending.
>
> 2. userfaultfd code, where some paths calling vma_start_write() can
> handle EINTR and some can't without a deeper code refactoring.
>
> 3. mpol_rebind_mm() which is used by the cpuset controller for migrations
> and operates on a remote mm. Incomplete operations here would result
> in an inconsistent cgroup state.
>
> 4. vm_flags_{set|mod|clear} require refactoring that involves moving
> vma_start_write() out of these functions and replacing it with
> vma_assert_write_locked(), then callers of these functions should
> lock the vma themselves using vma_start_write_killable() whenever
> possible.
Updated, thanks.
> Changes since v5 [1]:
> - Added Reviewed-by for unchanged patches, per Lorenzo Stoakes
>
> Patch#2:
> - Fixed locked_vm counter if mlock_vma_pages_range() fails in
> mlock_fixup(), per Sashiko
> - Avoid VMA re-locking in madvise_update_vma(), mprotect_fixup() and
> mseal_apply() when vma_modify_XXX creates a new VMA as it will already be
> locked. This prevents the possibility of incomplete operation if signal
> happens after a successful vma_modify_XXX modified the vma tree,
> per Sashiko
> - Removed obsolete comment in madvise_update_vma() and mprotect_fixup()
>
> Patch#4:
> - Added clarifying comment for vma_start_write_killable() when locking a
> detached VMA
> - Override VMA_MERGE_NOMERGE in vma_expand() to prevent callers from
> falling back to a new VMA allocation, per Sashiko
> - Added a note in the changelog about temporary workaround of using
> ENOMEM to propagate the error in vma_merge_existing_range() and
> vma_expand()
>
> Patch#5:
> - Added fatal_signal_pending() check in do_mbind() to detect
> queue_pages_range() failures due to a pending fatal signal, per Sashiko
Changes since v5:
mm/madvise.c | 15 ++++++++++-----
mm/mempolicy.c | 9 ++++++++-
mm/mlock.c | 2 ++
mm/mprotect.c | 26 ++++++++++++++++----------
mm/mseal.c | 27 +++++++++++++++++++--------
mm/vma.c | 20 ++++++++++++++++++--
6 files changed, 73 insertions(+), 26 deletions(-)
--- a/mm/madvise.c~b
+++ a/mm/madvise.c
@@ -172,11 +172,16 @@ static int madvise_update_vma(vm_flags_t
if (IS_ERR(vma))
return PTR_ERR(vma);
- madv_behavior->vma = vma;
-
- /* vm_flags is protected by the mmap_lock held in write mode. */
- if (vma_start_write_killable(vma))
- return -EINTR;
+ /*
+ * If a new vma was created during vma_modify_XXX, the resulting
+ * vma is already locked. Skip re-locking new vma in this case.
+ */
+ if (vma == madv_behavior->vma) {
+ if (vma_start_write_killable(vma))
+ return -EINTR;
+ } else {
+ madv_behavior->vma = vma;
+ }
vma->flags = new_vma_flags;
if (set_new_anon_name)
--- a/mm/mempolicy.c~b
+++ a/mm/mempolicy.c
@@ -1546,7 +1546,14 @@ static long do_mbind(unsigned long start
flags | MPOL_MF_INVERT | MPOL_MF_WRLOCK, &pagelist);
if (nr_failed < 0) {
- err = nr_failed;
+ /*
+ * queue_pages_range() might override the original error with -EFAULT.
+ * Confirm that fatal signals are still treated correctly.
+ */
+ if (fatal_signal_pending(current))
+ err = -EINTR;
+ else
+ err = nr_failed;
nr_failed = 0;
} else {
vma_iter_init(&vmi, mm, start);
--- a/mm/mlock.c~b
+++ a/mm/mlock.c
@@ -518,6 +518,8 @@ static int mlock_fixup(struct vma_iterat
vma->flags = new_vma_flags;
} else {
ret = mlock_vma_pages_range(vma, start, end, &new_vma_flags);
+ if (ret)
+ mm->locked_vm -= nr_pages;
}
out:
*prev = vma;
--- a/mm/mprotect.c~b
+++ a/mm/mprotect.c
@@ -716,6 +716,7 @@ mprotect_fixup(struct vma_iterator *vmi,
const vma_flags_t old_vma_flags = READ_ONCE(vma->flags);
vma_flags_t new_vma_flags = legacy_to_vma_flags(newflags);
long nrpages = (end - start) >> PAGE_SHIFT;
+ struct vm_area_struct *new_vma;
unsigned int mm_cp_flags = 0;
unsigned long charged = 0;
int error;
@@ -772,21 +773,26 @@ mprotect_fixup(struct vma_iterator *vmi,
vma_flags_clear(&new_vma_flags, VMA_ACCOUNT_BIT);
}
- vma = vma_modify_flags(vmi, *pprev, vma, start, end, &new_vma_flags);
- if (IS_ERR(vma)) {
- error = PTR_ERR(vma);
+ new_vma = vma_modify_flags(vmi, *pprev, vma, start, end,
+ &new_vma_flags);
+ if (IS_ERR(new_vma)) {
+ error = PTR_ERR(new_vma);
goto fail;
}
- *pprev = vma;
-
/*
- * vm_flags and vm_page_prot are protected by the mmap_lock
- * held in write mode.
+ * If a new vma was created during vma_modify_flags, the resulting
+ * vma is already locked. Skip re-locking new vma in this case.
*/
- error = vma_start_write_killable(vma);
- if (error)
- goto fail;
+ if (new_vma == vma) {
+ error = vma_start_write_killable(vma);
+ if (error)
+ goto fail;
+ } else {
+ vma = new_vma;
+ }
+
+ *pprev = vma;
vma_flags_reset_once(vma, &new_vma_flags);
if (vma_wants_manual_pte_write_upgrade(vma))
--- a/mm/mseal.c~b
+++ a/mm/mseal.c
@@ -70,17 +70,28 @@ static int mseal_apply(struct mm_struct
if (!vma_test(vma, VMA_SEALED_BIT)) {
vma_flags_t vma_flags = vma->flags;
- int err;
+ struct vm_area_struct *new_vma;
vma_flags_set(&vma_flags, VMA_SEALED_BIT);
- vma = vma_modify_flags(&vmi, prev, vma, curr_start,
- curr_end, &vma_flags);
- if (IS_ERR(vma))
- return PTR_ERR(vma);
- err = vma_start_write_killable(vma);
- if (err)
- return err;
+ new_vma = vma_modify_flags(&vmi, prev, vma, curr_start,
+ curr_end, &vma_flags);
+ if (IS_ERR(new_vma))
+ return PTR_ERR(new_vma);
+
+ /*
+ * If a new vma was created during vma_modify_flags,
+ * the resulting vma is already locked.
+ * Skip re-locking new vma in this case.
+ */
+ if (new_vma == vma) {
+ int err = vma_start_write_killable(vma);
+ if (err)
+ return err;
+ } else {
+ vma = new_vma;
+ }
+
vma_set_flags(vma, VMA_SEALED_BIT);
}
--- a/mm/vma.c~b
+++ a/mm/vma.c
@@ -531,6 +531,10 @@ __split_vma(struct vma_iterator *vmi, st
err = vma_start_write_killable(vma);
if (err)
goto out_free_vma;
+ /*
+ * Locking a new detached VMA will always succeed, but that is just a
+ * detail of the current implementation, so handle it all the same.
+ */
err = vma_start_write_killable(new);
if (err)
goto out_free_vma;
@@ -1197,8 +1201,14 @@ int vma_expand(struct vma_merge_struct *
mmap_assert_write_locked(vmg->mm);
err = vma_start_write_killable(target);
- if (err)
+ if (err) {
+ /*
+ * Override VMA_MERGE_NOMERGE to prevent callers from
+ * falling back to a new VMA allocation.
+ */
+ vmg->state = VMA_MERGE_ERROR_NOMEM;
return err;
+ }
target_sticky = vma_flags_and_mask(&target->flags, VMA_STICKY_FLAGS);
@@ -1231,8 +1241,14 @@ int vma_expand(struct vma_merge_struct *
* is pending.
*/
err = vma_start_write_killable(next);
- if (err)
+ if (err) {
+ /*
+ * Override VMA_MERGE_NOMERGE to prevent callers from
+ * falling back to a new VMA allocation.
+ */
+ vmg->state = VMA_MERGE_ERROR_NOMEM;
return err;
+ }
err = dup_anon_vma(target, next, &anon_dup);
if (err)
return err;
_
* Re: [PATCH v6 2/6] mm: use vma_start_write_killable() in mm syscalls
2026-03-27 20:54 ` [PATCH v6 2/6] mm: use vma_start_write_killable() in mm syscalls Suren Baghdasaryan
@ 2026-03-31 9:35 ` Lorenzo Stoakes (Oracle)
2026-03-31 15:01 ` Suren Baghdasaryan
0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-31 9:35 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, lance.yang, vbabka, jannh,
rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe, chleroy,
borntraeger, frankja, imbrenda, hca, gor, agordeev, svens,
gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390
On Fri, Mar 27, 2026 at 01:54:53PM -0700, Suren Baghdasaryan wrote:
> Replace vma_start_write() with vma_start_write_killable() in syscalls,
> improving reaction time to the kill signal.
>
> In a number of places we now lock VMA earlier than before to avoid
> doing work and undoing it later if a fatal signal is pending. This
> is safe because the moves are happening within sections where we
> already hold the mmap_write_lock, so the moves do not change the
> locking order relative to other kernel locks.
>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
I think Sashiko (presumably) has regressed this series :/
> ---
> mm/madvise.c | 13 ++++++++++---
> mm/memory.c | 2 ++
> mm/mempolicy.c | 11 +++++++++--
> mm/mlock.c | 30 ++++++++++++++++++++++++------
> mm/mprotect.c | 25 +++++++++++++++++--------
> mm/mremap.c | 8 +++++---
> mm/mseal.c | 24 +++++++++++++++++++-----
> 7 files changed, 86 insertions(+), 27 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 69708e953cf5..f2c7b0512cdf 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -172,10 +172,17 @@ static int madvise_update_vma(vm_flags_t new_flags,
> if (IS_ERR(vma))
> return PTR_ERR(vma);
>
> - madv_behavior->vma = vma;
> + /*
> + * If a new vma was created during vma_modify_XXX, the resulting
> + * vma is already locked. Skip re-locking new vma in this case.
> + */
well you're not re-locking, in vma_start_write_killable() we have:
if (__is_vma_write_locked(vma))
return 0;
So I'm not sure this is really worth the effort? Was this a sashiko 'find'?
And is 're-locking' the right thing to say here? Probably nitty but that to me
implies you're locking again when in fact you're just calling the function only
to check seq nums uselessly and have a nop.
But it's not like we're on a hotpath here where we're sweating every little
thing, we're already taking the mmap write lock and doing heavy operations
etc. so not sure we care all that much.
OTOH, it's hardly like this is a bad thing so don't want to hold up series.
> + if (vma == madv_behavior->vma) {
> + if (vma_start_write_killable(vma))
> + return -EINTR;
> + } else {
> + madv_behavior->vma = vma;
> + }
This is kind of ugly.
Can't we just do:
const struct vm_area_struct *old_vma = madv_behavior->vma;
struct vm_area_struct *vma = old_vma;
...
madv_behavior->vma = vma;
/* If the VMA didn't change, it isn't locked yet. */
if (vma == old_vma && vma_start_write_killable(vma))
return -EINTR;
Instead? That is, assuming we really need to care about this at all.
But I think I don't like this change at all?
>
> - /* vm_flags is protected by the mmap_lock held in write mode. */
> - vma_start_write(vma);
> vma->flags = new_vma_flags;
> if (set_new_anon_name)
> return replace_anon_vma_name(vma, anon_name);
> diff --git a/mm/memory.c b/mm/memory.c
> index e44469f9cf65..9f99ec634831 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -366,6 +366,8 @@ void free_pgd_range(struct mmu_gather *tlb,
> * page tables that should be removed. This can differ from the vma mappings on
> * some archs that may have mappings that need to be removed outside the vmas.
> * Note that the prev->vm_end and next->vm_start are often used.
> + * We don't use vma_start_write_killable() because page tables should be freed
> + * even if the task is being killed.
> *
> * The vma_end differs from the pg_end when a dup_mmap() failed and the tree has
> * unrelated data to the mm_struct being torn down.
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index fd08771e2057..c38a90487531 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1784,7 +1784,8 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
> return -EINVAL;
> if (end == start)
> return 0;
> - mmap_write_lock(mm);
> + if (mmap_write_lock_killable(mm))
> + return -EINTR;
> prev = vma_prev(&vmi);
> for_each_vma_range(vmi, vma, end) {
> /*
> @@ -1801,13 +1802,19 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
> err = -EOPNOTSUPP;
> break;
> }
> + /*
> + * Lock the VMA early to avoid extra work if fatal signal
> + * is pending.
> + */
> + err = vma_start_write_killable(vma);
> + if (err)
> + break;
> new = mpol_dup(old);
> if (IS_ERR(new)) {
> err = PTR_ERR(new);
> break;
> }
>
> - vma_start_write(vma);
> new->home_node = home_node;
> err = mbind_range(&vmi, vma, &prev, start, end, new);
> mpol_put(new);
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 8c227fefa2df..2ed454db7cf7 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -419,8 +419,10 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
> *
> * Called for mlock(), mlock2() and mlockall(), to set @vma VM_LOCKED;
> * called for munlock() and munlockall(), to clear VM_LOCKED from @vma.
> + *
> + * Return: 0 on success, -EINTR if a fatal signal is pending.
> */
> -static void mlock_vma_pages_range(struct vm_area_struct *vma,
> +static int mlock_vma_pages_range(struct vm_area_struct *vma,
> unsigned long start, unsigned long end,
> vma_flags_t *new_vma_flags)
> {
> @@ -442,7 +444,9 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
> */
> if (vma_flags_test(new_vma_flags, VMA_LOCKED_BIT))
> vma_flags_set(new_vma_flags, VMA_IO_BIT);
> - vma_start_write(vma);
> + if (vma_start_write_killable(vma))
> + return -EINTR;
> +
> vma_flags_reset_once(vma, new_vma_flags);
>
> lru_add_drain();
> @@ -453,6 +457,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
> vma_flags_clear(new_vma_flags, VMA_IO_BIT);
> vma_flags_reset_once(vma, new_vma_flags);
> }
> + return 0;
> }
>
> /*
> @@ -506,11 +511,15 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> */
> if (vma_flags_test(&new_vma_flags, VMA_LOCKED_BIT) &&
> vma_flags_test(&old_vma_flags, VMA_LOCKED_BIT)) {
> + ret = vma_start_write_killable(vma);
> + if (ret)
> + goto out; /* mm->locked_vm is fine as nr_pages == 0 */
> /* No work to do, and mlocking twice would be wrong */
> - vma_start_write(vma);
> vma->flags = new_vma_flags;
> } else {
> - mlock_vma_pages_range(vma, start, end, &new_vma_flags);
> + ret = mlock_vma_pages_range(vma, start, end, &new_vma_flags);
> + if (ret)
> + mm->locked_vm -= nr_pages;
> }
> out:
> *prev = vma;
> @@ -739,9 +748,18 @@ static int apply_mlockall_flags(int flags)
>
> error = mlock_fixup(&vmi, vma, &prev, vma->vm_start, vma->vm_end,
> newflags);
> - /* Ignore errors, but prev needs fixing up. */
> - if (error)
> + if (error) {
> + /*
> + * If we failed due to a pending fatal signal, return
> + * now. If we locked the vma before the signal arrived, it
> + * will be unlocked when we drop mmap_write_lock.
> + */
> + if (fatal_signal_pending(current))
> + return -EINTR;
> +
> + /* Ignore errors, but prev needs fixing up. */
> prev = vma;
> + }
> cond_resched();
> }
> out:
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 110d47a36d4b..d6227877465f 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -700,6 +700,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
> const vma_flags_t old_vma_flags = READ_ONCE(vma->flags);
> vma_flags_t new_vma_flags = legacy_to_vma_flags(newflags);
> long nrpages = (end - start) >> PAGE_SHIFT;
> + struct vm_area_struct *new_vma;
> unsigned int mm_cp_flags = 0;
> unsigned long charged = 0;
> int error;
> @@ -756,19 +757,27 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
> vma_flags_clear(&new_vma_flags, VMA_ACCOUNT_BIT);
> }
>
> - vma = vma_modify_flags(vmi, *pprev, vma, start, end, &new_vma_flags);
> - if (IS_ERR(vma)) {
> - error = PTR_ERR(vma);
> + new_vma = vma_modify_flags(vmi, *pprev, vma, start, end,
> + &new_vma_flags);
> + if (IS_ERR(new_vma)) {
> + error = PTR_ERR(new_vma);
> goto fail;
> }
>
> - *pprev = vma;
> -
> /*
> - * vm_flags and vm_page_prot are protected by the mmap_lock
> - * held in write mode.
> + * If a new vma was created during vma_modify_flags, the resulting
> + * vma is already locked. Skip re-locking new vma in this case.
> */
> - vma_start_write(vma);
> + if (new_vma == vma) {
> + error = vma_start_write_killable(vma);
> + if (error)
> + goto fail;
> + } else {
> + vma = new_vma;
> + }
I mean again this is hideous and I don't know why we're bothering? This just
introduces even more open-coded VMA lock specific stuff everywhere.
And the comment is just not correct, we're not re-locking anything if it's
already locked...
> +
> + *pprev = vma;
> +
> vma_flags_reset_once(vma, &new_vma_flags);
> if (vma_wants_manual_pte_write_upgrade(vma))
> mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
> diff --git a/mm/mremap.c b/mm/mremap.c
> index e9c8b1d05832..0860102bddab 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -1348,6 +1348,11 @@ static unsigned long move_vma(struct vma_remap_struct *vrm)
> if (err)
> return err;
>
> + /* We don't want racing faults. */
> + err = vma_start_write_killable(vrm->vma);
> + if (err)
> + return err;
> +
> /*
> * If accounted, determine the number of bytes the operation will
> * charge.
> @@ -1355,9 +1360,6 @@ static unsigned long move_vma(struct vma_remap_struct *vrm)
> if (!vrm_calc_charge(vrm))
> return -ENOMEM;
>
> - /* We don't want racing faults. */
> - vma_start_write(vrm->vma);
> -
> /* Perform copy step. */
> err = copy_vma_and_data(vrm, &new_vma);
> /*
> diff --git a/mm/mseal.c b/mm/mseal.c
> index 603df53ad267..1ea19fd3d384 100644
> --- a/mm/mseal.c
> +++ b/mm/mseal.c
> @@ -70,14 +70,28 @@ static int mseal_apply(struct mm_struct *mm,
>
> if (!vma_test(vma, VMA_SEALED_BIT)) {
> vma_flags_t vma_flags = vma->flags;
> + struct vm_area_struct *new_vma;
>
> vma_flags_set(&vma_flags, VMA_SEALED_BIT);
>
> - vma = vma_modify_flags(&vmi, prev, vma, curr_start,
> - curr_end, &vma_flags);
> - if (IS_ERR(vma))
> - return PTR_ERR(vma);
> - vma_start_write(vma);
> + new_vma = vma_modify_flags(&vmi, prev, vma, curr_start,
> + curr_end, &vma_flags);
> + if (IS_ERR(new_vma))
> + return PTR_ERR(new_vma);
> +
> + /*
> + * If a new vma was created during vma_modify_flags,
> + * the resulting vma is already locked.
> + * Skip re-locking new vma in this case.
> + */
> + if (new_vma == vma) {
> + int err = vma_start_write_killable(vma);
> + if (err)
> + return err;
> + } else {
> + vma = new_vma;
> + }
> +
I mean this is the exact same open-coded block all over this patch, and again I
don't see the point...
The VMA locking is tricky enough that I don't think doing this is a good idea,
especially on the basis of 'avoid looking at sequence numbers under mmap write
lock' or something.
> vma_set_flags(vma, VMA_SEALED_BIT);
> }
>
> --
> 2.53.0.1018.g2bb0e51243-goog
>
* Re: [PATCH v6 0/6] Use killable vma write locking in most places
2026-03-27 23:12 ` [PATCH v6 0/6] Use killable vma write locking in most places Andrew Morton
@ 2026-03-31 9:51 ` Lorenzo Stoakes (Oracle)
2026-03-31 15:06 ` Suren Baghdasaryan
0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-31 9:51 UTC (permalink / raw)
To: Andrew Morton
Cc: Suren Baghdasaryan, willy, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390
On Fri, Mar 27, 2026 at 04:12:26PM -0700, Andrew Morton wrote:
> On Fri, 27 Mar 2026 13:54:51 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > Now that we have vma_start_write_killable() we can replace most of the
> > vma_start_write() calls with it, improving reaction time to the kill
> > signal.
> >
> > There are several places which are left untouched by this patchset:
> >
> > 1. free_pgtables(), because the function should free page tables even
> > if a fatal signal is pending.
> >
> > 2. userfaultfd code, where some paths calling vma_start_write() can
> > handle EINTR and some can't without a deeper code refactoring.
> >
> > 3. mpol_rebind_mm() which is used by the cpuset controller for migrations
> > and operates on a remote mm. Incomplete operations here would result
> > in an inconsistent cgroup state.
> >
> > 4. vm_flags_{set|mod|clear} require refactoring that involves moving
> > vma_start_write() out of these functions and replacing it with
> > vma_assert_write_locked(), then callers of these functions should
> > lock the vma themselves using vma_start_write_killable() whenever
> > possible.
>
> Updated, thanks.
Andrew - sorry I think we need to yank this and defer to next cycle,
there's too many functional changes here.
(There was not really any way for me to predict this would happen ahead of
time, unfortunately.)
>
> > Changes since v5 [1]:
> > - Added Reviewed-by for unchanged patches, per Lorenzo Stoakes
> >
> > Patch#2:
> > - Fixed locked_vm counter if mlock_vma_pages_range() fails in
> > mlock_fixup(), per Sashiko
> > - Avoid VMA re-locking in madvise_update_vma(), mprotect_fixup() and
> > mseal_apply() when vma_modify_XXX creates a new VMA as it will already be
> > locked. This prevents the possibility of incomplete operation if signal
> > happens after a successful vma_modify_XXX modified the vma tree,
> > per Sashiko
Prevents the possibility of an incomplete operation? But
vma_start_write_killable() checks to see if you're _already_ write locked
and would make the operation a no-op? So how is this even a delta?
It's a brave new world, arguing with sashiko via a submitter... :)
> > - Removed obsolete comment in madvise_update_vma() and mprotect_fixup()
> >
> > Patch#4:
> > - Added clarifying comment for vma_start_write_killable() when locking a
> > detached VMA
> > - Override VMA_MERGE_NOMERGE in vma_expand() to prevent callers from
> > falling back to a new VMA allocation, per Sashiko
> > - Added a note in the changelog about temporary workaround of using
> > ENOMEM to propagate the error in vma_merge_existing_range() and
> > vma_expand()
> >
> > Patch#5:
> > - Added fatal_signal_pending() check in do_mbind() to detect
> > queue_pages_range() failures due to a pending fatal signal, per Sashiko
>
> Changes since v5:
>
>
> mm/madvise.c | 15 ++++++++++-----
> mm/mempolicy.c | 9 ++++++++-
> mm/mlock.c | 2 ++
> mm/mprotect.c | 26 ++++++++++++++++----------
> mm/mseal.c | 27 +++++++++++++++++++--------
> mm/vma.c | 20 ++++++++++++++++++--
> 6 files changed, 73 insertions(+), 26 deletions(-)
>
> --- a/mm/madvise.c~b
> +++ a/mm/madvise.c
> @@ -172,11 +172,16 @@ static int madvise_update_vma(vm_flags_t
> if (IS_ERR(vma))
> return PTR_ERR(vma);
>
> - madv_behavior->vma = vma;
> -
> - /* vm_flags is protected by the mmap_lock held in write mode. */
> - if (vma_start_write_killable(vma))
> - return -EINTR;
> + /*
> + * If a new vma was created during vma_modify_XXX, the resulting
> + * vma is already locked. Skip re-locking new vma in this case.
> + */
> + if (vma == madv_behavior->vma) {
> + if (vma_start_write_killable(vma))
> + return -EINTR;
> + } else {
> + madv_behavior->vma = vma;
> + }
>
> vma->flags = new_vma_flags;
> if (set_new_anon_name)
> --- a/mm/mempolicy.c~b
> +++ a/mm/mempolicy.c
> @@ -1546,7 +1546,14 @@ static long do_mbind(unsigned long start
> flags | MPOL_MF_INVERT | MPOL_MF_WRLOCK, &pagelist);
>
> if (nr_failed < 0) {
> - err = nr_failed;
> + /*
> + * queue_pages_range() might override the original error with -EFAULT.
> + * Confirm that fatal signals are still treated correctly.
> + */
> + if (fatal_signal_pending(current))
> + err = -EINTR;
> + else
> + err = nr_failed;
> nr_failed = 0;
> } else {
> vma_iter_init(&vmi, mm, start);
> --- a/mm/mlock.c~b
> +++ a/mm/mlock.c
> @@ -518,6 +518,8 @@ static int mlock_fixup(struct vma_iterat
> vma->flags = new_vma_flags;
> } else {
> ret = mlock_vma_pages_range(vma, start, end, &new_vma_flags);
> + if (ret)
> + mm->locked_vm -= nr_pages;
> }
> out:
> *prev = vma;
> --- a/mm/mprotect.c~b
> +++ a/mm/mprotect.c
> @@ -716,6 +716,7 @@ mprotect_fixup(struct vma_iterator *vmi,
> const vma_flags_t old_vma_flags = READ_ONCE(vma->flags);
> vma_flags_t new_vma_flags = legacy_to_vma_flags(newflags);
> long nrpages = (end - start) >> PAGE_SHIFT;
> + struct vm_area_struct *new_vma;
> unsigned int mm_cp_flags = 0;
> unsigned long charged = 0;
> int error;
> @@ -772,21 +773,26 @@ mprotect_fixup(struct vma_iterator *vmi,
> vma_flags_clear(&new_vma_flags, VMA_ACCOUNT_BIT);
> }
>
> - vma = vma_modify_flags(vmi, *pprev, vma, start, end, &new_vma_flags);
> - if (IS_ERR(vma)) {
> - error = PTR_ERR(vma);
> + new_vma = vma_modify_flags(vmi, *pprev, vma, start, end,
> + &new_vma_flags);
> + if (IS_ERR(new_vma)) {
> + error = PTR_ERR(new_vma);
> goto fail;
> }
>
> - *pprev = vma;
> -
> /*
> - * vm_flags and vm_page_prot are protected by the mmap_lock
> - * held in write mode.
> + * If a new vma was created during vma_modify_flags, the resulting
> + * vma is already locked. Skip re-locking new vma in this case.
> */
> - error = vma_start_write_killable(vma);
> - if (error)
> - goto fail;
> + if (new_vma == vma) {
> + error = vma_start_write_killable(vma);
> + if (error)
> + goto fail;
> + } else {
> + vma = new_vma;
> + }
> +
> + *pprev = vma;
>
> vma_flags_reset_once(vma, &new_vma_flags);
> if (vma_wants_manual_pte_write_upgrade(vma))
> --- a/mm/mseal.c~b
> +++ a/mm/mseal.c
> @@ -70,17 +70,28 @@ static int mseal_apply(struct mm_struct
>
> if (!vma_test(vma, VMA_SEALED_BIT)) {
> vma_flags_t vma_flags = vma->flags;
> - int err;
> + struct vm_area_struct *new_vma;
>
> vma_flags_set(&vma_flags, VMA_SEALED_BIT);
>
> - vma = vma_modify_flags(&vmi, prev, vma, curr_start,
> - curr_end, &vma_flags);
> - if (IS_ERR(vma))
> - return PTR_ERR(vma);
> - err = vma_start_write_killable(vma);
> - if (err)
> - return err;
> + new_vma = vma_modify_flags(&vmi, prev, vma, curr_start,
> + curr_end, &vma_flags);
> + if (IS_ERR(new_vma))
> + return PTR_ERR(new_vma);
> +
> + /*
> + * If a new vma was created during vma_modify_flags,
> + * the resulting vma is already locked.
> + * Skip re-locking new vma in this case.
> + */
> + if (new_vma == vma) {
> + int err = vma_start_write_killable(vma);
> + if (err)
> + return err;
> + } else {
> + vma = new_vma;
> + }
> +
> vma_set_flags(vma, VMA_SEALED_BIT);
> }
>
> --- a/mm/vma.c~b
> +++ a/mm/vma.c
> @@ -531,6 +531,10 @@ __split_vma(struct vma_iterator *vmi, st
> err = vma_start_write_killable(vma);
> if (err)
> goto out_free_vma;
> + /*
> + * Locking a new detached VMA will always succeed, but that is just a
> + * detail of the current implementation, so handle the error all the same.
> + */
> err = vma_start_write_killable(new);
> if (err)
> goto out_free_vma;
> @@ -1197,8 +1201,14 @@ int vma_expand(struct vma_merge_struct *
>
> mmap_assert_write_locked(vmg->mm);
> err = vma_start_write_killable(target);
> - if (err)
> + if (err) {
> + /*
> + * Override VMA_MERGE_NOMERGE to prevent callers from
> + * falling back to a new VMA allocation.
> + */
> + vmg->state = VMA_MERGE_ERROR_NOMEM;
> return err;
> + }
>
> target_sticky = vma_flags_and_mask(&target->flags, VMA_STICKY_FLAGS);
>
> @@ -1231,8 +1241,14 @@ int vma_expand(struct vma_merge_struct *
> * is pending.
> */
> err = vma_start_write_killable(next);
> - if (err)
> + if (err) {
> + /*
> + * Override VMA_MERGE_NOMERGE to prevent callers from
> + * falling back to a new VMA allocation.
> + */
> + vmg->state = VMA_MERGE_ERROR_NOMEM;
> return err;
> + }
> err = dup_anon_vma(target, next, &anon_dup);
> if (err)
> return err;
> _
>
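The locked_vm fix in the mlock.c hunk above can be modelled in isolation. This is a simplified userspace sketch, not kernel code: `mm_model`, `populate_range()` and `fixup_tail()` are hypothetical stand-ins for `mm_struct`, `mlock_vma_pages_range()` and the tail of `mlock_fixup()`, and the "charge up front" assumption is inferred from the hunk's context.

```c
#include <errno.h>

/*
 * Model of the accounting pattern: the caller charges mm->locked_vm
 * eagerly before populating the range, so a killable-lock failure
 * inside the populate step must roll the charge back.
 */
struct mm_model { long locked_vm; };

/* Stand-in for mlock_vma_pages_range(); fails with -EINTR when a
 * (simulated) fatal signal is pending. */
static int populate_range(int fatal_signal)
{
	return fatal_signal ? -EINTR : 0;
}

/* Stand-in for the tail of mlock_fixup(): nr_pages were already added
 * to locked_vm by the caller before we got here. */
static int fixup_tail(struct mm_model *mm, long nr_pages, int fatal_signal)
{
	int ret = populate_range(fatal_signal);

	if (ret)
		mm->locked_vm -= nr_pages;	/* undo the eager charge */
	return ret;
}
```

On success the charge stands; on failure the counter returns to its pre-call value, which is the invariant the hunk restores.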
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v6 4/6] mm/vma: use vma_start_write_killable() in vma operations
2026-03-27 20:54 ` [PATCH v6 4/6] mm/vma: use vma_start_write_killable() in vma operations Suren Baghdasaryan
@ 2026-03-31 10:24 ` Lorenzo Stoakes (Oracle)
2026-03-31 15:37 ` Suren Baghdasaryan
0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-31 10:24 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, lance.yang, vbabka, jannh,
rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe, chleroy,
borntraeger, frankja, imbrenda, hca, gor, agordeev, svens,
gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390
On Fri, Mar 27, 2026 at 01:54:55PM -0700, Suren Baghdasaryan wrote:
> Replace vma_start_write() with vma_start_write_killable(), improving
> reaction time to the kill signal, in the paths where we operate on VMAs.
>
> To propagate errors from vma_merge_existing_range() and vma_expand()
> we fake an ENOMEM error when we fail due to a pending fatal signal.
> This is a temporary workaround. Fixing this requires some refactoring
> and will be done separately in the future.
>
> In a number of places we now lock the VMA earlier than before to avoid
> doing work and undoing it later if a fatal signal is pending. This
> is safe because the moves are happening within sections where we
> already hold the mmap_write_lock, so the moves do not change the
> locking order relative to other kernel locks.
>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> mm/vma.c | 146 ++++++++++++++++++++++++++++++++++++++------------
> mm/vma_exec.c | 6 ++-
> 2 files changed, 117 insertions(+), 35 deletions(-)
>
> diff --git a/mm/vma.c b/mm/vma.c
> index ba78ab1f397a..cc382217f730 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -524,6 +524,21 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT);
> }
>
> + /*
> + * Lock VMAs before cloning to avoid extra work if a fatal
> + * signal is pending.
> + */
> + err = vma_start_write_killable(vma);
> + if (err)
> + goto out_free_vma;
> + /*
> > + * Locking a new detached VMA will always succeed, but that is just a
> > + * detail of the current implementation, so handle the error all the same.
> + */
> + err = vma_start_write_killable(new);
> + if (err)
> + goto out_free_vma;
> +
> err = -ENOMEM;
> vma_iter_config(vmi, new->vm_start, new->vm_end);
> if (vma_iter_prealloc(vmi, new))
> @@ -543,9 +558,6 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> if (new->vm_ops && new->vm_ops->open)
> new->vm_ops->open(new);
>
> - vma_start_write(vma);
> - vma_start_write(new);
> -
> init_vma_prep(&vp, vma);
> vp.insert = new;
> vma_prepare(&vp);
> @@ -900,12 +912,22 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
> }
>
> /* No matter what happens, we will be adjusting middle. */
> - vma_start_write(middle);
> + err = vma_start_write_killable(middle);
> + if (err) {
> + /* Ensure error propagates. */
> + vmg->give_up_on_oom = false;
> + goto abort;
> + }
>
> if (merge_right) {
> vma_flags_t next_sticky;
>
> - vma_start_write(next);
> + err = vma_start_write_killable(next);
> + if (err) {
> + /* Ensure error propagates. */
> + vmg->give_up_on_oom = false;
> + goto abort;
> + }
> vmg->target = next;
> next_sticky = vma_flags_and_mask(&next->flags, VMA_STICKY_FLAGS);
> vma_flags_set_mask(&sticky_flags, next_sticky);
> @@ -914,7 +936,12 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
> if (merge_left) {
> vma_flags_t prev_sticky;
>
> - vma_start_write(prev);
> + err = vma_start_write_killable(prev);
> + if (err) {
> + /* Ensure error propagates. */
> + vmg->give_up_on_oom = false;
> + goto abort;
> + }
> vmg->target = prev;
>
> prev_sticky = vma_flags_and_mask(&prev->flags, VMA_STICKY_FLAGS);
> @@ -1170,10 +1197,18 @@ int vma_expand(struct vma_merge_struct *vmg)
> vma_flags_t sticky_flags =
> vma_flags_and_mask(&vmg->vma_flags, VMA_STICKY_FLAGS);
> vma_flags_t target_sticky;
> - int err = 0;
> + int err;
>
> mmap_assert_write_locked(vmg->mm);
> - vma_start_write(target);
> + err = vma_start_write_killable(target);
> + if (err) {
> + /*
> + * Override VMA_MERGE_NOMERGE to prevent callers from
> + * falling back to a new VMA allocation.
> + */
> + vmg->state = VMA_MERGE_ERROR_NOMEM;
> + return err;
> + }
>
> target_sticky = vma_flags_and_mask(&target->flags, VMA_STICKY_FLAGS);
>
> @@ -1201,6 +1236,19 @@ int vma_expand(struct vma_merge_struct *vmg)
> * we don't need to account for vmg->give_up_on_mm here.
> */
> if (remove_next) {
> + /*
> + * Lock the VMA early to avoid extra work if a fatal
> + * signal is pending.
> + */
> + err = vma_start_write_killable(next);
> + if (err) {
> + /*
> + * Override VMA_MERGE_NOMERGE to prevent callers from
> + * falling back to a new VMA allocation.
> + */
I don't think we need huge, duplicated comments everywhere.
I don't like us effectively lying about an OOM.
Here's what I said on v4:
https://lore.kernel.org/all/9845b243-1984-4d74-9feb-d9d28757fba6@lucifer.local/
I think we need to update vma_modify():
/* First, try to merge. */
merged = vma_merge_existing_range(vmg);
if (merged)
return merged;
if (vmg_nomem(vmg))
return ERR_PTR(-ENOMEM);
+ if (fatal_signal_pending(current))
+ return ERR_PTR(-EINTR);
OK I see you replied:
We need to be careful here. I think there are cases when vma is
modified from a context of a different process, for example in
process_madvise(). fatal_signal_pending(current) would yield incorrect
result because vma->vm_mm is not the same as current->mm.
Sorry missed that.
That's utterly horrible, yes.
I'm sorry but I think this series then is going to have to wait for me to rework
this code, unfortunately.
I can't really stand you returning a fake error code it's too confusing.
I guess I'll have to go do that as a priority then and maybe queue it up so it's
kinda ready for 7.2.
In any case, as I said in my reply to the cover letter, I think this series is
going to have to wait for the next cycle (i.e. 7.2), sorry. There are just too
many functional changes in this revision.
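The `vmg->state` override under discussion can be modelled standalone. This is an illustrative userspace sketch, not the kernel's definitions: the enum values, `expand()` and `caller()` below are simplified stand-ins for `vma_expand()` and its merge callers, based only on the semantics described in this thread (callers treat NOMERGE as "fall back to allocating a fresh VMA").

```c
#include <errno.h>

/* Simplified merge states; only the fallback-relevant distinction is kept. */
enum vma_merge_state {
	VMA_MERGE_SUCCESS,
	VMA_MERGE_NOMERGE,	/* caller may fall back to a new VMA */
	VMA_MERGE_ERROR_NOMEM,	/* hard failure: caller must bail out */
};

struct vmg_model { enum vma_merge_state state; };

/* Stand-in for vma_expand(); lock_err models vma_start_write_killable()
 * failing with a pending fatal signal. */
static int expand(struct vmg_model *vmg, int lock_err)
{
	vmg->state = VMA_MERGE_NOMERGE;
	if (lock_err) {
		/* Without this override the caller would "fall back"
		 * and do work a dying task should skip. */
		vmg->state = VMA_MERGE_ERROR_NOMEM;
		return lock_err;
	}
	vmg->state = VMA_MERGE_SUCCESS;
	return 0;
}

/* Caller pattern: only NOMERGE triggers the allocation fallback. */
static int caller(int lock_err, int *fell_back)
{
	struct vmg_model vmg;
	int err = expand(&vmg, lock_err);

	*fell_back = (vmg.state == VMA_MERGE_NOMERGE);
	return err;
}
```

The override is what suppresses the fallback; Lorenzo's objection is that it does so by reporting a state named after OOM rather than the real cause.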
> + vmg->state = VMA_MERGE_ERROR_NOMEM;
> + return err;
> + }
> err = dup_anon_vma(target, next, &anon_dup);
> if (err)
> return err;
> @@ -1214,7 +1262,6 @@ int vma_expand(struct vma_merge_struct *vmg)
> if (remove_next) {
> vma_flags_t next_sticky;
>
> - vma_start_write(next);
> vmg->__remove_next = true;
>
> next_sticky = vma_flags_and_mask(&next->flags, VMA_STICKY_FLAGS);
> @@ -1252,9 +1299,14 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> unsigned long start, unsigned long end, pgoff_t pgoff)
> {
> struct vma_prepare vp;
> + int err;
>
> WARN_ON((vma->vm_start != start) && (vma->vm_end != end));
>
> + err = vma_start_write_killable(vma);
> + if (err)
> + return err;
> +
> if (vma->vm_start < start)
> vma_iter_config(vmi, vma->vm_start, start);
> else
> @@ -1263,8 +1315,6 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> if (vma_iter_prealloc(vmi, NULL))
> return -ENOMEM;
>
> - vma_start_write(vma);
> -
> init_vma_prep(&vp, vma);
> vma_prepare(&vp);
> vma_adjust_trans_huge(vma, start, end, NULL);
> @@ -1453,7 +1503,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> if (error)
> goto end_split_failed;
> }
> - vma_start_write(next);
> + error = vma_start_write_killable(next);
> + if (error)
> + goto munmap_gather_failed;
> mas_set(mas_detach, vms->vma_count++);
> error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
> if (error)
> @@ -1848,12 +1900,16 @@ static void vma_link_file(struct vm_area_struct *vma, bool hold_rmap_lock)
> static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
> {
> VMA_ITERATOR(vmi, mm, 0);
> + int err;
> +
> + err = vma_start_write_killable(vma);
> + if (err)
> + return err;
>
> vma_iter_config(&vmi, vma->vm_start, vma->vm_end);
> if (vma_iter_prealloc(&vmi, vma))
> return -ENOMEM;
>
> - vma_start_write(vma);
> vma_iter_store_new(&vmi, vma);
> vma_link_file(vma, /* hold_rmap_lock= */false);
> mm->map_count++;
> @@ -2239,9 +2295,8 @@ int mm_take_all_locks(struct mm_struct *mm)
> * is reached.
> */
> for_each_vma(vmi, vma) {
> - if (signal_pending(current))
> + if (signal_pending(current) || vma_start_write_killable(vma))
> goto out_unlock;
> - vma_start_write(vma);
> }
>
> vma_iter_init(&vmi, mm, 0);
> @@ -2540,8 +2595,8 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
> struct mmap_action *action)
> {
> struct vma_iterator *vmi = map->vmi;
> - int error = 0;
> struct vm_area_struct *vma;
> + int error;
>
> /*
> * Determine the object being mapped and call the appropriate
> @@ -2552,6 +2607,14 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
> if (!vma)
> return -ENOMEM;
>
> + /*
> + * Lock the VMA early to avoid extra work if a fatal
> + * signal is pending.
> + */
> + error = vma_start_write_killable(vma);
> + if (error)
> + goto free_vma;
> +
> vma_iter_config(vmi, map->addr, map->end);
> vma_set_range(vma, map->addr, map->end, map->pgoff);
> vma->flags = map->vma_flags;
> @@ -2582,8 +2645,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
> WARN_ON_ONCE(!arch_validate_flags(map->vm_flags));
> #endif
>
> - /* Lock the VMA since it is modified after insertion into VMA tree */
> - vma_start_write(vma);
> vma_iter_store_new(vmi, vma);
> map->mm->map_count++;
> vma_link_file(vma, action->hide_from_rmap_until_complete);
> @@ -2878,6 +2939,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> unsigned long addr, unsigned long len, vma_flags_t vma_flags)
> {
> struct mm_struct *mm = current->mm;
> + int err;
>
> /*
> * Check against address space limits by the changed size
> @@ -2910,24 +2972,33 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
>
> if (vma_merge_new_range(&vmg))
> goto out;
> - else if (vmg_nomem(&vmg))
> + if (vmg_nomem(&vmg)) {
> + err = -ENOMEM;
> goto unacct_fail;
> + }
> }
>
> if (vma)
> vma_iter_next_range(vmi);
> /* create a vma struct for an anonymous mapping */
> vma = vm_area_alloc(mm);
> - if (!vma)
> + if (!vma) {
> + err = -ENOMEM;
> goto unacct_fail;
> + }
>
> vma_set_anonymous(vma);
> vma_set_range(vma, addr, addr + len, addr >> PAGE_SHIFT);
> vma->flags = vma_flags;
> vma->vm_page_prot = vm_get_page_prot(vma_flags_to_legacy(vma_flags));
> - vma_start_write(vma);
> - if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
> + if (vma_start_write_killable(vma)) {
> + err = -EINTR;
> + goto vma_lock_fail;
> + }
> + if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL)) {
> + err = -ENOMEM;
> goto mas_store_fail;
> + }
>
> mm->map_count++;
> validate_mm(mm);
> @@ -2942,10 +3013,11 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> return 0;
>
> mas_store_fail:
> +vma_lock_fail:
> vm_area_free(vma);
> unacct_fail:
> vm_unacct_memory(len >> PAGE_SHIFT);
> - return -ENOMEM;
> + return err;
> }
>
> /**
> @@ -3112,8 +3184,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> struct mm_struct *mm = vma->vm_mm;
> struct vm_area_struct *next;
> unsigned long gap_addr;
> - int error = 0;
> VMA_ITERATOR(vmi, mm, vma->vm_start);
> + int error;
>
> if (!vma_test(vma, VMA_GROWSUP_BIT))
> return -EFAULT;
> @@ -3149,12 +3221,14 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>
> /* We must make sure the anon_vma is allocated. */
> if (unlikely(anon_vma_prepare(vma))) {
> - vma_iter_free(&vmi);
> - return -ENOMEM;
> + error = -ENOMEM;
> + goto vma_prep_fail;
> }
>
> /* Lock the VMA before expanding to prevent concurrent page faults */
> - vma_start_write(vma);
> + error = vma_start_write_killable(vma);
> + if (error)
> + goto vma_lock_fail;
> /* We update the anon VMA tree. */
> anon_vma_lock_write(vma->anon_vma);
>
> @@ -3183,8 +3257,10 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> }
> }
> anon_vma_unlock_write(vma->anon_vma);
> - vma_iter_free(&vmi);
> validate_mm(mm);
> +vma_lock_fail:
> +vma_prep_fail:
> + vma_iter_free(&vmi);
> return error;
> }
> #endif /* CONFIG_STACK_GROWSUP */
> @@ -3197,8 +3273,8 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> {
> struct mm_struct *mm = vma->vm_mm;
> struct vm_area_struct *prev;
> - int error = 0;
> VMA_ITERATOR(vmi, mm, vma->vm_start);
> + int error;
>
> if (!vma_test(vma, VMA_GROWSDOWN_BIT))
> return -EFAULT;
> @@ -3228,12 +3304,14 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
>
> /* We must make sure the anon_vma is allocated. */
> if (unlikely(anon_vma_prepare(vma))) {
> - vma_iter_free(&vmi);
> - return -ENOMEM;
> + error = -ENOMEM;
> + goto vma_prep_fail;
> }
>
> /* Lock the VMA before expanding to prevent concurrent page faults */
> - vma_start_write(vma);
> + error = vma_start_write_killable(vma);
> + if (error)
> + goto vma_lock_fail;
> /* We update the anon VMA tree. */
> anon_vma_lock_write(vma->anon_vma);
>
> @@ -3263,8 +3341,10 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> }
> }
> anon_vma_unlock_write(vma->anon_vma);
> - vma_iter_free(&vmi);
> validate_mm(mm);
> +vma_lock_fail:
> +vma_prep_fail:
> + vma_iter_free(&vmi);
> return error;
> }
>
> diff --git a/mm/vma_exec.c b/mm/vma_exec.c
> index 5cee8b7efa0f..8ddcc791d828 100644
> --- a/mm/vma_exec.c
> +++ b/mm/vma_exec.c
> @@ -41,6 +41,7 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
> struct vm_area_struct *next;
> struct mmu_gather tlb;
> PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length);
> + int err;
>
> BUG_ON(new_start > new_end);
>
> @@ -56,8 +57,9 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
> * cover the whole range: [new_start, old_end)
> */
> vmg.target = vma;
> - if (vma_expand(&vmg))
> - return -ENOMEM;
> + err = vma_expand(&vmg);
> + if (err)
> + return err;
>
> /*
> * move the page tables downwards, on failure we rely on
> --
> 2.53.0.1018.g2bb0e51243-goog
>
* Re: [PATCH v6 5/6] mm: use vma_start_write_killable() in process_vma_walk_lock()
2026-03-27 20:54 ` [PATCH v6 5/6] mm: use vma_start_write_killable() in process_vma_walk_lock() Suren Baghdasaryan
@ 2026-03-31 10:38 ` Lorenzo Stoakes (Oracle)
2026-03-31 15:43 ` Suren Baghdasaryan
0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-31 10:38 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, lance.yang, vbabka, jannh,
rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe, chleroy,
borntraeger, frankja, imbrenda, hca, gor, agordeev, svens,
gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390
On Fri, Mar 27, 2026 at 01:54:56PM -0700, Suren Baghdasaryan wrote:
> Replace vma_start_write() with vma_start_write_killable() when
> process_vma_walk_lock() is used with PGWALK_WRLOCK option.
> Adjust its direct and indirect users to check for a possible error
> and handle it. Ensure users handle EINTR correctly and do not ignore
> it. When queue_pages_range() fails, check whether it failed due to
> a fatal signal or some other reason, and return the appropriate error.
>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> fs/proc/task_mmu.c | 12 ++++++------
> mm/mempolicy.c | 10 +++++++++-
> mm/pagewalk.c | 22 +++++++++++++++-------
> 3 files changed, 30 insertions(+), 14 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index e091931d7ca1..33e5094a7842 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1774,15 +1774,15 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> struct vm_area_struct *vma;
> enum clear_refs_types type;
> int itype;
> - int rv;
> + int err;
>
> if (count > sizeof(buffer) - 1)
> count = sizeof(buffer) - 1;
> if (copy_from_user(buffer, buf, count))
> return -EFAULT;
> - rv = kstrtoint(strstrip(buffer), 10, &itype);
> - if (rv < 0)
> - return rv;
> + err = kstrtoint(strstrip(buffer), 10, &itype);
> + if (err)
> + return err;
> type = (enum clear_refs_types)itype;
> if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
> return -EINVAL;
> @@ -1824,7 +1824,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> 0, mm, 0, -1UL);
> mmu_notifier_invalidate_range_start(&range);
> }
> - walk_page_range(mm, 0, -1, &clear_refs_walk_ops, &cp);
> + err = walk_page_range(mm, 0, -1, &clear_refs_walk_ops, &cp);
> if (type == CLEAR_REFS_SOFT_DIRTY) {
> mmu_notifier_invalidate_range_end(&range);
> flush_tlb_mm(mm);
> @@ -1837,7 +1837,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> }
> put_task_struct(task);
>
> - return count;
> + return err ? : count;
> }
>
> const struct file_operations proc_clear_refs_operations = {
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index c38a90487531..51f298cfc33b 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -969,6 +969,7 @@ static const struct mm_walk_ops queue_pages_lock_vma_walk_ops = {
> * (a hugetlbfs page or a transparent huge page being counted as 1).
> * -EIO - a misplaced page found, when MPOL_MF_STRICT specified without MOVEs.
> * -EFAULT - a hole in the memory range, when MPOL_MF_DISCONTIG_OK unspecified.
> + * -EINTR - walk got terminated due to a pending fatal signal.
> */
> static long
> queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
> @@ -1545,7 +1546,14 @@ static long do_mbind(unsigned long start, unsigned long len,
> flags | MPOL_MF_INVERT | MPOL_MF_WRLOCK, &pagelist);
>
> if (nr_failed < 0) {
> - err = nr_failed;
> + /*
> + * queue_pages_range() might override the original error with -EFAULT.
> + * Confirm that fatal signals are still treated correctly.
> + */
> + if (fatal_signal_pending(current))
> + err = -EINTR;
> + else
> + err = nr_failed;
Is that really a big deal? Does it really matter if the caller doesn't get
-EINTR in this case? This feels like another Sashiko nitpick and is adding a
bunch of additional complexity here.
I mean, if you 'filter' error codes you might always end up with an error
that's different from the original...
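The masking problem this filtering guards against can be sketched in isolation. This is a hypothetical userspace model, not the actual kernel flow: `queue_range_model()` and `mbind_err_model()` are simplified stand-ins for `queue_pages_range()` and the `do_mbind()` hunk above, built only from the behaviour described in this thread (the walk can abort on -EINTR but later report the unfinished tail of the range as a hole, -EFAULT).

```c
#include <errno.h>

/* A walk that stops early on a fatal signal; the discontinuity check at
 * the end of the range overwrites the reason it stopped. */
static int queue_range_model(int fatal_signal)
{
	if (fatal_signal)
		return -EFAULT;	/* the original -EINTR got masked */
	return 0;
}

/* Caller-side filter, as in the do_mbind() hunk above: re-derive -EINTR
 * from the signal state rather than trusting the walk's return value. */
static int mbind_err_model(int fatal_signal)
{
	int nr_failed = queue_range_model(fatal_signal);

	if (nr_failed < 0)
		return fatal_signal ? -EINTR : nr_failed;
	return 0;
}
```

Whether recovering the precise errno here is worth the extra branch is exactly the judgement call being debated.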
> nr_failed = 0;
> } else {
> vma_iter_init(&vmi, mm, start);
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 3ae2586ff45b..eca7bc711617 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -443,14 +443,13 @@ static inline void process_mm_walk_lock(struct mm_struct *mm,
> mmap_assert_write_locked(mm);
> }
>
> -static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> - enum page_walk_lock walk_lock)
> +static int process_vma_walk_lock(struct vm_area_struct *vma,
> + enum page_walk_lock walk_lock)
> {
> #ifdef CONFIG_PER_VMA_LOCK
> switch (walk_lock) {
> case PGWALK_WRLOCK:
> - vma_start_write(vma);
> - break;
> + return vma_start_write_killable(vma);
> case PGWALK_WRLOCK_VERIFY:
> vma_assert_write_locked(vma);
> break;
> @@ -462,6 +461,7 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> break;
> }
> #endif
> + return 0;
> }
>
> /*
> @@ -505,7 +505,9 @@ int walk_page_range_mm_unsafe(struct mm_struct *mm, unsigned long start,
> if (ops->pte_hole)
> err = ops->pte_hole(start, next, -1, &walk);
> } else { /* inside vma */
> - process_vma_walk_lock(vma, ops->walk_lock);
> + err = process_vma_walk_lock(vma, ops->walk_lock);
> + if (err)
> + break;
> walk.vma = vma;
> next = min(end, vma->vm_end);
> vma = find_vma(mm, vma->vm_end);
> @@ -722,6 +724,7 @@ int walk_page_range_vma_unsafe(struct vm_area_struct *vma, unsigned long start,
> .vma = vma,
> .private = private,
> };
> + int err;
>
> if (start >= end || !walk.mm)
> return -EINVAL;
> @@ -729,7 +732,9 @@ int walk_page_range_vma_unsafe(struct vm_area_struct *vma, unsigned long start,
> return -EINVAL;
>
> process_mm_walk_lock(walk.mm, ops->walk_lock);
> - process_vma_walk_lock(vma, ops->walk_lock);
> + err = process_vma_walk_lock(vma, ops->walk_lock);
> + if (err)
> + return err;
> return __walk_page_range(start, end, &walk);
> }
>
> @@ -752,6 +757,7 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
> .vma = vma,
> .private = private,
> };
> + int err;
>
> if (!walk.mm)
> return -EINVAL;
> @@ -759,7 +765,9 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
> return -EINVAL;
>
> process_mm_walk_lock(walk.mm, ops->walk_lock);
> - process_vma_walk_lock(vma, ops->walk_lock);
> + err = process_vma_walk_lock(vma, ops->walk_lock);
> + if (err)
> + return err;
> return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
> }
>
> --
> 2.53.0.1018.g2bb0e51243-goog
>
* Re: [PATCH v6 2/6] mm: use vma_start_write_killable() in mm syscalls
2026-03-31 9:35 ` Lorenzo Stoakes (Oracle)
@ 2026-03-31 15:01 ` Suren Baghdasaryan
2026-03-31 18:29 ` Andrew Morton
0 siblings, 1 reply; 23+ messages in thread
From: Suren Baghdasaryan @ 2026-03-31 15:01 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: akpm, willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, lance.yang, vbabka, jannh,
rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe, chleroy,
borntraeger, frankja, imbrenda, hca, gor, agordeev, svens,
gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390
On Tue, Mar 31, 2026 at 2:35 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> On Fri, Mar 27, 2026 at 01:54:53PM -0700, Suren Baghdasaryan wrote:
> > Replace vma_start_write() with vma_start_write_killable() in syscalls,
> > improving reaction time to the kill signal.
> >
> > In a number of places we now lock the VMA earlier than before to avoid
> > doing work and undoing it later if a fatal signal is pending. This
> > is safe because the moves are happening within sections where we
> > already hold the mmap_write_lock, so the moves do not change the
> > locking order relative to other kernel locks.
> >
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> I think Sashiko (presumably) has regressed this series :/
>
> > ---
> > mm/madvise.c | 13 ++++++++++---
> > mm/memory.c | 2 ++
> > mm/mempolicy.c | 11 +++++++++--
> > mm/mlock.c | 30 ++++++++++++++++++++++++------
> > mm/mprotect.c | 25 +++++++++++++++++--------
> > mm/mremap.c | 8 +++++---
> > mm/mseal.c | 24 +++++++++++++++++++-----
> > 7 files changed, 86 insertions(+), 27 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 69708e953cf5..f2c7b0512cdf 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -172,10 +172,17 @@ static int madvise_update_vma(vm_flags_t new_flags,
> > if (IS_ERR(vma))
> > return PTR_ERR(vma);
> >
> > - madv_behavior->vma = vma;
> > + /*
> > + * If a new vma was created during vma_modify_XXX, the resulting
> > + * vma is already locked. Skip re-locking new vma in this case.
> > + */
>
> Well, you're not re-locking; in vma_start_write_killable() we have:
>
> if (__is_vma_write_locked(vma))
> return 0;
>
> So I'm not sure this is really worth the effort? Was this a Sashiko 'find'?
>
> And is 're-locking' the right thing to say here? Probably nitty, but to me that
> implies you're locking again, when in fact you're just calling the function to
> check the seq nums uselessly and get a no-op.
>
> But it's not like we're on a hotpath here where we're sweating every little
> thing, we're already taking the mmap write lock and doing heavy operations
> etc. so not sure we care all that much.
>
> OTOH, it's hardly like this is a bad thing so don't want to hold up series.
>
> > + if (vma == madv_behavior->vma) {
> > + if (vma_start_write_killable(vma))
> > + return -EINTR;
> > + } else {
> > + madv_behavior->vma = vma;
> > + }
>
> This is kind of ugly.
>
> Can't we just do:
>
> const struct vm_area_struct *old_vma = madv_behavior->vma;
> struct vm_area_struct *vma = old_vma;
>
> ...
> madv_behavior->vma = vma;
> /* If the VMA didn't change, it isn't locked yet. */
> if (vma == old_vma && vma_start_write_killable(vma))
> return -EINTR;
>
> Instead? That is, assuming we really need to care about this at all.
>
> But I think I don't like this change at all?
Yeah, this was the part I wasn't sure was worth adding. With your vote
confirming my scepticism, I'll go ahead and remove the parts I added to
avoid the extra vma_start_write_killable() calls (3 instances in all)
and will post v7.
Thanks,
Suren.
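The "nop" behaviour Lorenzo quotes above can be modelled standalone. This is an illustrative sketch, not the kernel implementation: `vma_model` and `start_write_killable_model()` are hypothetical stand-ins built from the `__is_vma_write_locked()` snippet quoted in this thread; the real function's sequence-number and wait mechanics are elided.

```c
#include <errno.h>

struct vma_model {
	int write_locked;	/* models __is_vma_write_locked() */
	int slow_path_taken;	/* counts real locking attempts */
};

/* Returns 0 immediately when the VMA is already write-locked, so calling
 * it on a freshly created (already locked) VMA is a no-op, not a real
 * re-lock; only the slow path can observe a pending fatal signal. */
static int start_write_killable_model(struct vma_model *vma, int fatal_signal)
{
	if (vma->write_locked)
		return 0;	/* fast path: nothing to do */
	vma->slow_path_taken++;
	if (fatal_signal)
		return -EINTR;	/* killable wait interrupted */
	vma->write_locked = 1;
	return 0;
}
```

This is why skipping the call for newly created VMAs saves only a cheap check, which is the basis for dropping those three special cases in v7.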
>
> >
> > - /* vm_flags is protected by the mmap_lock held in write mode. */
> > - vma_start_write(vma);
> > vma->flags = new_vma_flags;
> > if (set_new_anon_name)
> > return replace_anon_vma_name(vma, anon_name);
> > diff --git a/mm/memory.c b/mm/memory.c
> > index e44469f9cf65..9f99ec634831 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -366,6 +366,8 @@ void free_pgd_range(struct mmu_gather *tlb,
> > * page tables that should be removed. This can differ from the vma mappings on
> > * some archs that may have mappings that need to be removed outside the vmas.
> > * Note that the prev->vm_end and next->vm_start are often used.
> > + * We don't use vma_start_write_killable() because page tables should be freed
> > + * even if the task is being killed.
> > *
> > * The vma_end differs from the pg_end when a dup_mmap() failed and the tree has
> > * unrelated data to the mm_struct being torn down.
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index fd08771e2057..c38a90487531 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1784,7 +1784,8 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
> > return -EINVAL;
> > if (end == start)
> > return 0;
> > - mmap_write_lock(mm);
> > + if (mmap_write_lock_killable(mm))
> > + return -EINTR;
> > prev = vma_prev(&vmi);
> > for_each_vma_range(vmi, vma, end) {
> > /*
> > @@ -1801,13 +1802,19 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
> > err = -EOPNOTSUPP;
> > break;
> > }
> > + /*
> > + * Lock the VMA early to avoid extra work if a fatal
> > + * signal is pending.
> > + */
> > + err = vma_start_write_killable(vma);
> > + if (err)
> > + break;
> > new = mpol_dup(old);
> > if (IS_ERR(new)) {
> > err = PTR_ERR(new);
> > break;
> > }
> >
> > - vma_start_write(vma);
> > new->home_node = home_node;
> > err = mbind_range(&vmi, vma, &prev, start, end, new);
> > mpol_put(new);
> > diff --git a/mm/mlock.c b/mm/mlock.c
> > index 8c227fefa2df..2ed454db7cf7 100644
> > --- a/mm/mlock.c
> > +++ b/mm/mlock.c
> > @@ -419,8 +419,10 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
> > *
> > * Called for mlock(), mlock2() and mlockall(), to set @vma VM_LOCKED;
> > * called for munlock() and munlockall(), to clear VM_LOCKED from @vma.
> > + *
> > + * Return: 0 on success, -EINTR if a fatal signal is pending.
> > */
> > -static void mlock_vma_pages_range(struct vm_area_struct *vma,
> > +static int mlock_vma_pages_range(struct vm_area_struct *vma,
> > unsigned long start, unsigned long end,
> > vma_flags_t *new_vma_flags)
> > {
> > @@ -442,7 +444,9 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
> > */
> > if (vma_flags_test(new_vma_flags, VMA_LOCKED_BIT))
> > vma_flags_set(new_vma_flags, VMA_IO_BIT);
> > - vma_start_write(vma);
> > + if (vma_start_write_killable(vma))
> > + return -EINTR;
> > +
> > vma_flags_reset_once(vma, new_vma_flags);
> >
> > lru_add_drain();
> > @@ -453,6 +457,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
> > vma_flags_clear(new_vma_flags, VMA_IO_BIT);
> > vma_flags_reset_once(vma, new_vma_flags);
> > }
> > + return 0;
> > }
> >
> > /*
> > @@ -506,11 +511,15 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > */
> > if (vma_flags_test(&new_vma_flags, VMA_LOCKED_BIT) &&
> > vma_flags_test(&old_vma_flags, VMA_LOCKED_BIT)) {
> > + ret = vma_start_write_killable(vma);
> > + if (ret)
> > + goto out; /* mm->locked_vm is fine as nr_pages == 0 */
> > /* No work to do, and mlocking twice would be wrong */
> > - vma_start_write(vma);
> > vma->flags = new_vma_flags;
> > } else {
> > - mlock_vma_pages_range(vma, start, end, &new_vma_flags);
> > + ret = mlock_vma_pages_range(vma, start, end, &new_vma_flags);
> > + if (ret)
> > + mm->locked_vm -= nr_pages;
> > }
> > out:
> > *prev = vma;
> > @@ -739,9 +748,18 @@ static int apply_mlockall_flags(int flags)
> >
> > error = mlock_fixup(&vmi, vma, &prev, vma->vm_start, vma->vm_end,
> > newflags);
> > - /* Ignore errors, but prev needs fixing up. */
> > - if (error)
> > + if (error) {
> > + /*
> > + * If we failed due to a pending fatal signal, return
> > + * now. If we locked the vma before signal arrived, it
> > + * will be unlocked when we drop mmap_write_lock.
> > + */
> > + if (fatal_signal_pending(current))
> > + return -EINTR;
> > +
> > + /* Ignore errors, but prev needs fixing up. */
> > prev = vma;
> > + }
> > cond_resched();
> > }
> > out:
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 110d47a36d4b..d6227877465f 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -700,6 +700,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
> > const vma_flags_t old_vma_flags = READ_ONCE(vma->flags);
> > vma_flags_t new_vma_flags = legacy_to_vma_flags(newflags);
> > long nrpages = (end - start) >> PAGE_SHIFT;
> > + struct vm_area_struct *new_vma;
> > unsigned int mm_cp_flags = 0;
> > unsigned long charged = 0;
> > int error;
> > @@ -756,19 +757,27 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
> > vma_flags_clear(&new_vma_flags, VMA_ACCOUNT_BIT);
> > }
> >
> > - vma = vma_modify_flags(vmi, *pprev, vma, start, end, &new_vma_flags);
> > - if (IS_ERR(vma)) {
> > - error = PTR_ERR(vma);
> > + new_vma = vma_modify_flags(vmi, *pprev, vma, start, end,
> > + &new_vma_flags);
> > + if (IS_ERR(new_vma)) {
> > + error = PTR_ERR(new_vma);
> > goto fail;
> > }
> >
> > - *pprev = vma;
> > -
> > /*
> > - * vm_flags and vm_page_prot are protected by the mmap_lock
> > - * held in write mode.
> > + * If a new vma was created during vma_modify_flags, the resulting
> > + * vma is already locked. Skip re-locking new vma in this case.
> > */
> > - vma_start_write(vma);
> > + if (new_vma == vma) {
> > + error = vma_start_write_killable(vma);
> > + if (error)
> > + goto fail;
> > + } else {
> > + vma = new_vma;
> > + }
>
> I mean again this is hideous and I don't know why we're bothering? This just
> introduces even more open-coded VMA lock specific stuff everywhere.
>
> And the comment is just not correct, we're not re-locking anything if it's
> already locked...
>
> > +
> > + *pprev = vma;
> > +
> > vma_flags_reset_once(vma, &new_vma_flags);
> > if (vma_wants_manual_pte_write_upgrade(vma))
> > mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index e9c8b1d05832..0860102bddab 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -1348,6 +1348,11 @@ static unsigned long move_vma(struct vma_remap_struct *vrm)
> > if (err)
> > return err;
> >
> > + /* We don't want racing faults. */
> > + err = vma_start_write_killable(vrm->vma);
> > + if (err)
> > + return err;
> > +
> > /*
> > * If accounted, determine the number of bytes the operation will
> > * charge.
> > @@ -1355,9 +1360,6 @@ static unsigned long move_vma(struct vma_remap_struct *vrm)
> > if (!vrm_calc_charge(vrm))
> > return -ENOMEM;
> >
> > - /* We don't want racing faults. */
> > - vma_start_write(vrm->vma);
> > -
> > /* Perform copy step. */
> > err = copy_vma_and_data(vrm, &new_vma);
> > /*
> > diff --git a/mm/mseal.c b/mm/mseal.c
> > index 603df53ad267..1ea19fd3d384 100644
> > --- a/mm/mseal.c
> > +++ b/mm/mseal.c
> > @@ -70,14 +70,28 @@ static int mseal_apply(struct mm_struct *mm,
> >
> > if (!vma_test(vma, VMA_SEALED_BIT)) {
> > vma_flags_t vma_flags = vma->flags;
> > + struct vm_area_struct *new_vma;
> >
> > vma_flags_set(&vma_flags, VMA_SEALED_BIT);
> >
> > - vma = vma_modify_flags(&vmi, prev, vma, curr_start,
> > - curr_end, &vma_flags);
> > - if (IS_ERR(vma))
> > - return PTR_ERR(vma);
> > - vma_start_write(vma);
> > + new_vma = vma_modify_flags(&vmi, prev, vma, curr_start,
> > + curr_end, &vma_flags);
> > + if (IS_ERR(new_vma))
> > + return PTR_ERR(new_vma);
> > +
> > + /*
> > + * If a new vma was created during vma_modify_flags,
> > + * the resulting vma is already locked.
> > + * Skip re-locking new vma in this case.
> > + */
> > + if (new_vma == vma) {
> > + int err = vma_start_write_killable(vma);
> > + if (err)
> > + return err;
> > + } else {
> > + vma = new_vma;
> > + }
> > +
>
> I mean this is the exact same open-coded block all over this patch, and again I
> don't see the point...
>
> The VMA locking is tricky enough that I don't think doing this is a good idea,
> especially on the basis of 'avoid looking at sequence numbers under mmap write
> lock' or something.
>
> > vma_set_flags(vma, VMA_SEALED_BIT);
> > }
> >
> > --
> > 2.53.0.1018.g2bb0e51243-goog
> >
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v6 0/6] Use killable vma write locking in most places
2026-03-31 9:51 ` Lorenzo Stoakes (Oracle)
@ 2026-03-31 15:06 ` Suren Baghdasaryan
2026-03-31 15:34 ` Suren Baghdasaryan
0 siblings, 1 reply; 23+ messages in thread
From: Suren Baghdasaryan @ 2026-03-31 15:06 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: Andrew Morton, willy, david, ziy, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, gourry, ying.huang, apopple, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe,
chleroy, borntraeger, frankja, imbrenda, hca, gor, agordeev,
svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390
On Tue, Mar 31, 2026 at 2:51 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> On Fri, Mar 27, 2026 at 04:12:26PM -0700, Andrew Morton wrote:
> > On Fri, 27 Mar 2026 13:54:51 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > > Now that we have vma_start_write_killable() we can replace most of the
> > > vma_start_write() calls with it, improving reaction time to the kill
> > > signal.
> > >
> > > There are several places which are left untouched by this patchset:
> > >
> > > 1. free_pgtables() because the function should free page tables even if a
> > > fatal signal is pending.
> > >
> > > 2. userfaultfd code, where some paths calling vma_start_write() can
> > > handle EINTR and some can't without a deeper code refactoring.
> > >
> > > 3. mpol_rebind_mm() which is used by the cpuset controller for migrations
> > > and operates on a remote mm. Incomplete operations here would result
> > > in an inconsistent cgroup state.
> > >
> > > 4. vm_flags_{set|mod|clear} require refactoring that involves moving
> > > vma_start_write() out of these functions and replacing it with
> > > vma_assert_write_locked(), then callers of these functions should
> > > lock the vma themselves using vma_start_write_killable() whenever
> > > possible.
> >
> > Updated, thanks.
>
> Andrew - sorry I think we need to yank this and defer to next cycle,
> there's too many functional changes here.
>
> (There was not really any way for me to predict this would happen ahead of
> time, unfortunately.)
Ok, no objections from me. I'll post v7 removing the part Lorenzo
objects to and you can pick it up whenever you deem appropriate.
>
> >
> > > Changes since v5 [1]:
> > > - Added Reviewed-by for unchanged patches, per Lorenzo Stoakes
> > >
> > > Patch#2:
> > > - Fixed locked_vm counter if mlock_vma_pages_range() fails in
> > > mlock_fixup(), per Sashiko
> > > - Avoid VMA re-locking in madvise_update_vma(), mprotect_fixup() and
> > > mseal_apply() when vma_modify_XXX creates a new VMA as it will already be
> > > locked. This prevents the possibility of an incomplete operation if a signal
> > > happens after a successful vma_modify_XXX has modified the vma tree,
> > > per Sashiko
>
> Prevents the possibility of an incomplete operation? But
> vma_write_lock_killable() checks to see if you're _already_ write locked
> and would make the operation a no-op? So how is this even a delta?
>
> It's a brave new world, arguing with sashiko via a submitter... :)
Yeah, this is not really a problem but I thought I would change it up
to make it apparent that the extra vma_write_lock_killable() is not
even called.
>
> > > - Removed obsolete comment in madvise_update_vma() and mprotect_fixup()
> > >
> > > Patch#4:
> > > - Added clarifying comment for vma_start_write_killable() when locking a
> > > detached VMA
> > > - Override VMA_MERGE_NOMERGE in vma_expand() to prevent callers from
> > > falling back to a new VMA allocation, per Sashiko
> > > - Added a note in the changelog about temporary workaround of using
> > > ENOMEM to propagate the error in vma_merge_existing_range() and
> > > vma_expand()
> > >
> > > Patch#5:
> > > - Added fatal_signal_pending() check in do_mbind() to detect
> > > queue_pages_range() failures due to a pending fatal signal, per Sashiko
> >
> > Changes since v5:
> >
> >
> > mm/madvise.c | 15 ++++++++++-----
> > mm/mempolicy.c | 9 ++++++++-
> > mm/mlock.c | 2 ++
> > mm/mprotect.c | 26 ++++++++++++++++----------
> > mm/mseal.c | 27 +++++++++++++++++++--------
> > mm/vma.c | 20 ++++++++++++++++++--
> > 6 files changed, 73 insertions(+), 26 deletions(-)
> >
> > --- a/mm/madvise.c~b
> > +++ a/mm/madvise.c
> > @@ -172,11 +172,16 @@ static int madvise_update_vma(vm_flags_t
> > if (IS_ERR(vma))
> > return PTR_ERR(vma);
> >
> > - madv_behavior->vma = vma;
> > -
> > - /* vm_flags is protected by the mmap_lock held in write mode. */
> > - if (vma_start_write_killable(vma))
> > - return -EINTR;
> > + /*
> > + * If a new vma was created during vma_modify_XXX, the resulting
> > + * vma is already locked. Skip re-locking new vma in this case.
> > + */
> > + if (vma == madv_behavior->vma) {
> > + if (vma_start_write_killable(vma))
> > + return -EINTR;
> > + } else {
> > + madv_behavior->vma = vma;
> > + }
> >
> > vma->flags = new_vma_flags;
> > if (set_new_anon_name)
> > --- a/mm/mempolicy.c~b
> > +++ a/mm/mempolicy.c
> > @@ -1546,7 +1546,14 @@ static long do_mbind(unsigned long start
> > flags | MPOL_MF_INVERT | MPOL_MF_WRLOCK, &pagelist);
> >
> > if (nr_failed < 0) {
> > - err = nr_failed;
> > + /*
> > + * queue_pages_range() might override the original error with -EFAULT.
> > + * Confirm that fatal signals are still treated correctly.
> > + */
> > + if (fatal_signal_pending(current))
> > + err = -EINTR;
> > + else
> > + err = nr_failed;
> > nr_failed = 0;
> > } else {
> > vma_iter_init(&vmi, mm, start);
> > --- a/mm/mlock.c~b
> > +++ a/mm/mlock.c
> > @@ -518,6 +518,8 @@ static int mlock_fixup(struct vma_iterat
> > vma->flags = new_vma_flags;
> > } else {
> > ret = mlock_vma_pages_range(vma, start, end, &new_vma_flags);
> > + if (ret)
> > + mm->locked_vm -= nr_pages;
> > }
> > out:
> > *prev = vma;
> > --- a/mm/mprotect.c~b
> > +++ a/mm/mprotect.c
> > @@ -716,6 +716,7 @@ mprotect_fixup(struct vma_iterator *vmi,
> > const vma_flags_t old_vma_flags = READ_ONCE(vma->flags);
> > vma_flags_t new_vma_flags = legacy_to_vma_flags(newflags);
> > long nrpages = (end - start) >> PAGE_SHIFT;
> > + struct vm_area_struct *new_vma;
> > unsigned int mm_cp_flags = 0;
> > unsigned long charged = 0;
> > int error;
> > @@ -772,21 +773,26 @@ mprotect_fixup(struct vma_iterator *vmi,
> > vma_flags_clear(&new_vma_flags, VMA_ACCOUNT_BIT);
> > }
> >
> > - vma = vma_modify_flags(vmi, *pprev, vma, start, end, &new_vma_flags);
> > - if (IS_ERR(vma)) {
> > - error = PTR_ERR(vma);
> > + new_vma = vma_modify_flags(vmi, *pprev, vma, start, end,
> > + &new_vma_flags);
> > + if (IS_ERR(new_vma)) {
> > + error = PTR_ERR(new_vma);
> > goto fail;
> > }
> >
> > - *pprev = vma;
> > -
> > /*
> > - * vm_flags and vm_page_prot are protected by the mmap_lock
> > - * held in write mode.
> > + * If a new vma was created during vma_modify_flags, the resulting
> > + * vma is already locked. Skip re-locking new vma in this case.
> > */
> > - error = vma_start_write_killable(vma);
> > - if (error)
> > - goto fail;
> > + if (new_vma == vma) {
> > + error = vma_start_write_killable(vma);
> > + if (error)
> > + goto fail;
> > + } else {
> > + vma = new_vma;
> > + }
> > +
> > + *pprev = vma;
> >
> > vma_flags_reset_once(vma, &new_vma_flags);
> > if (vma_wants_manual_pte_write_upgrade(vma))
> > --- a/mm/mseal.c~b
> > +++ a/mm/mseal.c
> > @@ -70,17 +70,28 @@ static int mseal_apply(struct mm_struct
> >
> > if (!vma_test(vma, VMA_SEALED_BIT)) {
> > vma_flags_t vma_flags = vma->flags;
> > - int err;
> > + struct vm_area_struct *new_vma;
> >
> > vma_flags_set(&vma_flags, VMA_SEALED_BIT);
> >
> > - vma = vma_modify_flags(&vmi, prev, vma, curr_start,
> > - curr_end, &vma_flags);
> > - if (IS_ERR(vma))
> > - return PTR_ERR(vma);
> > - err = vma_start_write_killable(vma);
> > - if (err)
> > - return err;
> > + new_vma = vma_modify_flags(&vmi, prev, vma, curr_start,
> > + curr_end, &vma_flags);
> > + if (IS_ERR(new_vma))
> > + return PTR_ERR(new_vma);
> > +
> > + /*
> > + * If a new vma was created during vma_modify_flags,
> > + * the resulting vma is already locked.
> > + * Skip re-locking new vma in this case.
> > + */
> > + if (new_vma == vma) {
> > + int err = vma_start_write_killable(vma);
> > + if (err)
> > + return err;
> > + } else {
> > + vma = new_vma;
> > + }
> > +
> > vma_set_flags(vma, VMA_SEALED_BIT);
> > }
> >
> > --- a/mm/vma.c~b
> > +++ a/mm/vma.c
> > @@ -531,6 +531,10 @@ __split_vma(struct vma_iterator *vmi, st
> > err = vma_start_write_killable(vma);
> > if (err)
> > goto out_free_vma;
> > + /*
> > + * Locking a new detached VMA will always succeed but it's just a
> > + * detail of the current implementation, so handle it all the same.
> > + */
> > err = vma_start_write_killable(new);
> > if (err)
> > goto out_free_vma;
> > @@ -1197,8 +1201,14 @@ int vma_expand(struct vma_merge_struct *
> >
> > mmap_assert_write_locked(vmg->mm);
> > err = vma_start_write_killable(target);
> > - if (err)
> > + if (err) {
> > + /*
> > + * Override VMA_MERGE_NOMERGE to prevent callers from
> > + * falling back to a new VMA allocation.
> > + */
> > + vmg->state = VMA_MERGE_ERROR_NOMEM;
> > return err;
> > + }
> >
> > target_sticky = vma_flags_and_mask(&target->flags, VMA_STICKY_FLAGS);
> >
> > @@ -1231,8 +1241,14 @@ int vma_expand(struct vma_merge_struct *
> > * is pending.
> > */
> > err = vma_start_write_killable(next);
> > - if (err)
> > + if (err) {
> > + /*
> > + * Override VMA_MERGE_NOMERGE to prevent callers from
> > + * falling back to a new VMA allocation.
> > + */
> > + vmg->state = VMA_MERGE_ERROR_NOMEM;
> > return err;
> > + }
> > err = dup_anon_vma(target, next, &anon_dup);
> > if (err)
> > return err;
> > _
> >
* Re: [PATCH v6 0/6] Use killable vma write locking in most places
2026-03-31 15:06 ` Suren Baghdasaryan
@ 2026-03-31 15:34 ` Suren Baghdasaryan
0 siblings, 0 replies; 23+ messages in thread
From: Suren Baghdasaryan @ 2026-03-31 15:34 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: Andrew Morton, willy, david, ziy, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, gourry, ying.huang, apopple, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe,
chleroy, borntraeger, frankja, imbrenda, hca, gor, agordeev,
svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390
On Tue, Mar 31, 2026 at 8:06 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Mar 31, 2026 at 2:51 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > On Fri, Mar 27, 2026 at 04:12:26PM -0700, Andrew Morton wrote:
> > > On Fri, 27 Mar 2026 13:54:51 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > > Now that we have vma_start_write_killable() we can replace most of the
> > > > vma_start_write() calls with it, improving reaction time to the kill
> > > > signal.
> > > >
> > > > There are several places which are left untouched by this patchset:
> > > >
> > > > 1. free_pgtables() because the function should free page tables even if a
> > > > fatal signal is pending.
> > > >
> > > > 2. userfaultfd code, where some paths calling vma_start_write() can
> > > > handle EINTR and some can't without a deeper code refactoring.
> > > >
> > > > 3. mpol_rebind_mm() which is used by the cpuset controller for migrations
> > > > and operates on a remote mm. Incomplete operations here would result
> > > > in an inconsistent cgroup state.
> > > >
> > > > 4. vm_flags_{set|mod|clear} require refactoring that involves moving
> > > > vma_start_write() out of these functions and replacing it with
> > > > vma_assert_write_locked(), then callers of these functions should
> > > > lock the vma themselves using vma_start_write_killable() whenever
> > > > possible.
> > >
> > > Updated, thanks.
> >
> > Andrew - sorry I think we need to yank this and defer to next cycle,
> > there's too many functional changes here.
> >
> > (There was not really any way for me to predict this would happen ahead of
> > time, unfortunately.)
>
> Ok, no objections from me. I'll post v7 removing the part Lorenzo
> objects to and you can pick it up whenever you deem appropriate.
Just saw Lorenzo's other reply about reworking some vma error handling
first. I'll wait for that rework before posting the new version.
>
> >
> > >
> > > > Changes since v5 [1]:
> > > > - Added Reviewed-by for unchanged patches, per Lorenzo Stoakes
> > > >
> > > > Patch#2:
> > > > - Fixed locked_vm counter if mlock_vma_pages_range() fails in
> > > > mlock_fixup(), per Sashiko
> > > > - Avoid VMA re-locking in madvise_update_vma(), mprotect_fixup() and
> > > > mseal_apply() when vma_modify_XXX creates a new VMA as it will already be
> > > > locked. This prevents the possibility of an incomplete operation if a signal
> > > > happens after a successful vma_modify_XXX has modified the vma tree,
> > > > per Sashiko
> >
> > Prevents the possibility of an incomplete operation? But
> > vma_write_lock_killable() checks to see if you're _already_ write locked
> > and would make the operation a no-op? So how is this even a delta?
> >
> > It's a brave new world, arguing with sashiko via a submitter... :)
>
> Yeah, this is not really a problem but I thought I would change it up
> to make it apparent that the extra vma_write_lock_killable() is not
> even called.
>
> >
> > > > - Removed obsolete comment in madvise_update_vma() and mprotect_fixup()
> > > >
> > > > Patch#4:
> > > > - Added clarifying comment for vma_start_write_killable() when locking a
> > > > detached VMA
> > > > - Override VMA_MERGE_NOMERGE in vma_expand() to prevent callers from
> > > > falling back to a new VMA allocation, per Sashiko
> > > > - Added a note in the changelog about temporary workaround of using
> > > > ENOMEM to propagate the error in vma_merge_existing_range() and
> > > > vma_expand()
> > > >
> > > > Patch#5:
> > > > - Added fatal_signal_pending() check in do_mbind() to detect
> > > > queue_pages_range() failures due to a pending fatal signal, per Sashiko
> > >
> > > Changes since v5:
> > >
> > >
> > > mm/madvise.c | 15 ++++++++++-----
> > > mm/mempolicy.c | 9 ++++++++-
> > > mm/mlock.c | 2 ++
> > > mm/mprotect.c | 26 ++++++++++++++++----------
> > > mm/mseal.c | 27 +++++++++++++++++++--------
> > > mm/vma.c | 20 ++++++++++++++++++--
> > > 6 files changed, 73 insertions(+), 26 deletions(-)
> > >
> > > --- a/mm/madvise.c~b
> > > +++ a/mm/madvise.c
> > > @@ -172,11 +172,16 @@ static int madvise_update_vma(vm_flags_t
> > > if (IS_ERR(vma))
> > > return PTR_ERR(vma);
> > >
> > > - madv_behavior->vma = vma;
> > > -
> > > - /* vm_flags is protected by the mmap_lock held in write mode. */
> > > - if (vma_start_write_killable(vma))
> > > - return -EINTR;
> > > + /*
> > > + * If a new vma was created during vma_modify_XXX, the resulting
> > > + * vma is already locked. Skip re-locking new vma in this case.
> > > + */
> > > + if (vma == madv_behavior->vma) {
> > > + if (vma_start_write_killable(vma))
> > > + return -EINTR;
> > > + } else {
> > > + madv_behavior->vma = vma;
> > > + }
> > >
> > > vma->flags = new_vma_flags;
> > > if (set_new_anon_name)
> > > --- a/mm/mempolicy.c~b
> > > +++ a/mm/mempolicy.c
> > > @@ -1546,7 +1546,14 @@ static long do_mbind(unsigned long start
> > > flags | MPOL_MF_INVERT | MPOL_MF_WRLOCK, &pagelist);
> > >
> > > if (nr_failed < 0) {
> > > - err = nr_failed;
> > > + /*
> > > + * queue_pages_range() might override the original error with -EFAULT.
> > > + * Confirm that fatal signals are still treated correctly.
> > > + */
> > > + if (fatal_signal_pending(current))
> > > + err = -EINTR;
> > > + else
> > > + err = nr_failed;
> > > nr_failed = 0;
> > > } else {
> > > vma_iter_init(&vmi, mm, start);
> > > --- a/mm/mlock.c~b
> > > +++ a/mm/mlock.c
> > > @@ -518,6 +518,8 @@ static int mlock_fixup(struct vma_iterat
> > > vma->flags = new_vma_flags;
> > > } else {
> > > ret = mlock_vma_pages_range(vma, start, end, &new_vma_flags);
> > > + if (ret)
> > > + mm->locked_vm -= nr_pages;
> > > }
> > > out:
> > > *prev = vma;
> > > --- a/mm/mprotect.c~b
> > > +++ a/mm/mprotect.c
> > > @@ -716,6 +716,7 @@ mprotect_fixup(struct vma_iterator *vmi,
> > > const vma_flags_t old_vma_flags = READ_ONCE(vma->flags);
> > > vma_flags_t new_vma_flags = legacy_to_vma_flags(newflags);
> > > long nrpages = (end - start) >> PAGE_SHIFT;
> > > + struct vm_area_struct *new_vma;
> > > unsigned int mm_cp_flags = 0;
> > > unsigned long charged = 0;
> > > int error;
> > > @@ -772,21 +773,26 @@ mprotect_fixup(struct vma_iterator *vmi,
> > > vma_flags_clear(&new_vma_flags, VMA_ACCOUNT_BIT);
> > > }
> > >
> > > - vma = vma_modify_flags(vmi, *pprev, vma, start, end, &new_vma_flags);
> > > - if (IS_ERR(vma)) {
> > > - error = PTR_ERR(vma);
> > > + new_vma = vma_modify_flags(vmi, *pprev, vma, start, end,
> > > + &new_vma_flags);
> > > + if (IS_ERR(new_vma)) {
> > > + error = PTR_ERR(new_vma);
> > > goto fail;
> > > }
> > >
> > > - *pprev = vma;
> > > -
> > > /*
> > > - * vm_flags and vm_page_prot are protected by the mmap_lock
> > > - * held in write mode.
> > > + * If a new vma was created during vma_modify_flags, the resulting
> > > + * vma is already locked. Skip re-locking new vma in this case.
> > > */
> > > - error = vma_start_write_killable(vma);
> > > - if (error)
> > > - goto fail;
> > > + if (new_vma == vma) {
> > > + error = vma_start_write_killable(vma);
> > > + if (error)
> > > + goto fail;
> > > + } else {
> > > + vma = new_vma;
> > > + }
> > > +
> > > + *pprev = vma;
> > >
> > > vma_flags_reset_once(vma, &new_vma_flags);
> > > if (vma_wants_manual_pte_write_upgrade(vma))
> > > --- a/mm/mseal.c~b
> > > +++ a/mm/mseal.c
> > > @@ -70,17 +70,28 @@ static int mseal_apply(struct mm_struct
> > >
> > > if (!vma_test(vma, VMA_SEALED_BIT)) {
> > > vma_flags_t vma_flags = vma->flags;
> > > - int err;
> > > + struct vm_area_struct *new_vma;
> > >
> > > vma_flags_set(&vma_flags, VMA_SEALED_BIT);
> > >
> > > - vma = vma_modify_flags(&vmi, prev, vma, curr_start,
> > > - curr_end, &vma_flags);
> > > - if (IS_ERR(vma))
> > > - return PTR_ERR(vma);
> > > - err = vma_start_write_killable(vma);
> > > - if (err)
> > > - return err;
> > > + new_vma = vma_modify_flags(&vmi, prev, vma, curr_start,
> > > + curr_end, &vma_flags);
> > > + if (IS_ERR(new_vma))
> > > + return PTR_ERR(new_vma);
> > > +
> > > + /*
> > > + * If a new vma was created during vma_modify_flags,
> > > + * the resulting vma is already locked.
> > > + * Skip re-locking new vma in this case.
> > > + */
> > > + if (new_vma == vma) {
> > > + int err = vma_start_write_killable(vma);
> > > + if (err)
> > > + return err;
> > > + } else {
> > > + vma = new_vma;
> > > + }
> > > +
> > > vma_set_flags(vma, VMA_SEALED_BIT);
> > > }
> > >
> > > --- a/mm/vma.c~b
> > > +++ a/mm/vma.c
> > > @@ -531,6 +531,10 @@ __split_vma(struct vma_iterator *vmi, st
> > > err = vma_start_write_killable(vma);
> > > if (err)
> > > goto out_free_vma;
> > > + /*
> > > + * Locking a new detached VMA will always succeed but it's just a
> > > + * detail of the current implementation, so handle it all the same.
> > > + */
> > > err = vma_start_write_killable(new);
> > > if (err)
> > > goto out_free_vma;
> > > @@ -1197,8 +1201,14 @@ int vma_expand(struct vma_merge_struct *
> > >
> > > mmap_assert_write_locked(vmg->mm);
> > > err = vma_start_write_killable(target);
> > > - if (err)
> > > + if (err) {
> > > + /*
> > > + * Override VMA_MERGE_NOMERGE to prevent callers from
> > > + * falling back to a new VMA allocation.
> > > + */
> > > + vmg->state = VMA_MERGE_ERROR_NOMEM;
> > > return err;
> > > + }
> > >
> > > target_sticky = vma_flags_and_mask(&target->flags, VMA_STICKY_FLAGS);
> > >
> > > @@ -1231,8 +1241,14 @@ int vma_expand(struct vma_merge_struct *
> > > * is pending.
> > > */
> > > err = vma_start_write_killable(next);
> > > - if (err)
> > > + if (err) {
> > > + /*
> > > + * Override VMA_MERGE_NOMERGE to prevent callers from
> > > + * falling back to a new VMA allocation.
> > > + */
> > > + vmg->state = VMA_MERGE_ERROR_NOMEM;
> > > return err;
> > > + }
> > > err = dup_anon_vma(target, next, &anon_dup);
> > > if (err)
> > > return err;
> > > _
> > >
* Re: [PATCH v6 4/6] mm/vma: use vma_start_write_killable() in vma operations
2026-03-31 10:24 ` Lorenzo Stoakes (Oracle)
@ 2026-03-31 15:37 ` Suren Baghdasaryan
0 siblings, 0 replies; 23+ messages in thread
From: Suren Baghdasaryan @ 2026-03-31 15:37 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: akpm, willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, lance.yang, vbabka, jannh,
rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe, chleroy,
borntraeger, frankja, imbrenda, hca, gor, agordeev, svens,
gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390
On Tue, Mar 31, 2026 at 3:24 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> On Fri, Mar 27, 2026 at 01:54:55PM -0700, Suren Baghdasaryan wrote:
> > Replace vma_start_write() with vma_start_write_killable(), improving
> > reaction time to the kill signal.
> > Replace vma_start_write() calls when we operate on VMAs.
> >
> > To propagate errors from vma_merge_existing_range() and vma_expand()
> > we fake an ENOMEM error when we fail due to a pending fatal signal.
> > This is a temporary workaround. Fixing this requires some refactoring
> > and will be done separately in the future.
> >
> > In a number of places we now lock VMA earlier than before to avoid
> > doing work and undoing it later if a fatal signal is pending. This
> > is safe because the moves are happening within sections where we
> > already hold the mmap_write_lock, so the moves do not change the
> > locking order relative to other kernel locks.
> >
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > mm/vma.c | 146 ++++++++++++++++++++++++++++++++++++++------------
> > mm/vma_exec.c | 6 ++-
> > 2 files changed, 117 insertions(+), 35 deletions(-)
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index ba78ab1f397a..cc382217f730 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -524,6 +524,21 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT);
> > }
> >
> > + /*
> > + * Lock VMAs before cloning to avoid extra work if fatal signal
> > + * is pending.
> > + */
> > + err = vma_start_write_killable(vma);
> > + if (err)
> > + goto out_free_vma;
> > + /*
> > + * Locking a new detached VMA will always succeed but it's just a
> > + * detail of the current implementation, so handle it all the same.
> > + */
> > + err = vma_start_write_killable(new);
> > + if (err)
> > + goto out_free_vma;
> > +
> > err = -ENOMEM;
> > vma_iter_config(vmi, new->vm_start, new->vm_end);
> > if (vma_iter_prealloc(vmi, new))
> > @@ -543,9 +558,6 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > if (new->vm_ops && new->vm_ops->open)
> > new->vm_ops->open(new);
> >
> > - vma_start_write(vma);
> > - vma_start_write(new);
> > -
> > init_vma_prep(&vp, vma);
> > vp.insert = new;
> > vma_prepare(&vp);
> > @@ -900,12 +912,22 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
> > }
> >
> > /* No matter what happens, we will be adjusting middle. */
> > - vma_start_write(middle);
> > + err = vma_start_write_killable(middle);
> > + if (err) {
> > + /* Ensure error propagates. */
> > + vmg->give_up_on_oom = false;
> > + goto abort;
> > + }
> >
> > if (merge_right) {
> > vma_flags_t next_sticky;
> >
> > - vma_start_write(next);
> > + err = vma_start_write_killable(next);
> > + if (err) {
> > + /* Ensure error propagates. */
> > + vmg->give_up_on_oom = false;
> > + goto abort;
> > + }
> > vmg->target = next;
> > next_sticky = vma_flags_and_mask(&next->flags, VMA_STICKY_FLAGS);
> > vma_flags_set_mask(&sticky_flags, next_sticky);
> > @@ -914,7 +936,12 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
> > if (merge_left) {
> > vma_flags_t prev_sticky;
> >
> > - vma_start_write(prev);
> > + err = vma_start_write_killable(prev);
> > + if (err) {
> > + /* Ensure error propagates. */
> > + vmg->give_up_on_oom = false;
> > + goto abort;
> > + }
> > vmg->target = prev;
> >
> > prev_sticky = vma_flags_and_mask(&prev->flags, VMA_STICKY_FLAGS);
> > @@ -1170,10 +1197,18 @@ int vma_expand(struct vma_merge_struct *vmg)
> > vma_flags_t sticky_flags =
> > vma_flags_and_mask(&vmg->vma_flags, VMA_STICKY_FLAGS);
> > vma_flags_t target_sticky;
> > - int err = 0;
> > + int err;
> >
> > mmap_assert_write_locked(vmg->mm);
> > - vma_start_write(target);
> > + err = vma_start_write_killable(target);
> > + if (err) {
> > + /*
> > + * Override VMA_MERGE_NOMERGE to prevent callers from
> > + * falling back to a new VMA allocation.
> > + */
> > + vmg->state = VMA_MERGE_ERROR_NOMEM;
> > + return err;
> > + }
> >
> > target_sticky = vma_flags_and_mask(&target->flags, VMA_STICKY_FLAGS);
> >
> > @@ -1201,6 +1236,19 @@ int vma_expand(struct vma_merge_struct *vmg)
> > * we don't need to account for vmg->give_up_on_mm here.
> > */
> > if (remove_next) {
> > + /*
> > + * Lock the VMA early to avoid extra work if fatal signal
> > + * is pending.
> > + */
> > + err = vma_start_write_killable(next);
> > + if (err) {
> > + /*
> > + * Override VMA_MERGE_NOMERGE to prevent callers from
> > + * falling back to a new VMA allocation.
> > + */
>
> I don't think we need huge, duplicated comments everywhere.
>
> I don't like us effectively lying about an OOM.
>
> Here's what I said on v4:
>
> https://lore.kernel.org/all/9845b243-1984-4d74-9feb-d9d28757fba6@lucifer.local/
>
> I think we need to update vma_modify():
>
> /* First, try to merge. */
> merged = vma_merge_existing_range(vmg);
> if (merged)
> return merged;
> if (vmg_nomem(vmg))
> return ERR_PTR(-ENOMEM);
> + if (fatal_signal_pending(current))
> +		return ERR_PTR(-EINTR);
>
> OK I see you replied:
>
> We need to be careful here. I think there are cases when vma is
> modified from a context of a different process, for example in
> process_madvise(). fatal_signal_pending(current) would yield an incorrect
> result because vma->vm_mm is not the same as current->mm.
>
> Sorry missed that.
>
> That's utterly horrible, yes.
>
> I'm sorry but I think this series then is going to have to wait for me to rework
> this code, unfortunately.
>
> I can't really stand you returning a fake error code it's too confusing.
>
> I guess I'll have to go do that as a priority then and maybe queue it up so it's
> kinda ready for 7.2.
>
> In any case I said in reply to the cover, I think this series is going to have
> to wait for next cycle (i.e. 7.2), sorry. Just too many functional changes in
> this revision.
Sounds reasonable. I'm not a fan of faking the error code myself, so
hopefully this change becomes much simpler after your rework.
>
> > + vmg->state = VMA_MERGE_ERROR_NOMEM;
> > + return err;
> > + }
> > err = dup_anon_vma(target, next, &anon_dup);
> > if (err)
> > return err;
> > @@ -1214,7 +1262,6 @@ int vma_expand(struct vma_merge_struct *vmg)
> > if (remove_next) {
> > vma_flags_t next_sticky;
> >
> > - vma_start_write(next);
> > vmg->__remove_next = true;
> >
> > next_sticky = vma_flags_and_mask(&next->flags, VMA_STICKY_FLAGS);
> > @@ -1252,9 +1299,14 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > unsigned long start, unsigned long end, pgoff_t pgoff)
> > {
> > struct vma_prepare vp;
> > + int err;
> >
> > WARN_ON((vma->vm_start != start) && (vma->vm_end != end));
> >
> > + err = vma_start_write_killable(vma);
> > + if (err)
> > + return err;
> > +
> > if (vma->vm_start < start)
> > vma_iter_config(vmi, vma->vm_start, start);
> > else
> > @@ -1263,8 +1315,6 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > if (vma_iter_prealloc(vmi, NULL))
> > return -ENOMEM;
> >
> > - vma_start_write(vma);
> > -
> > init_vma_prep(&vp, vma);
> > vma_prepare(&vp);
> > vma_adjust_trans_huge(vma, start, end, NULL);
> > @@ -1453,7 +1503,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > if (error)
> > goto end_split_failed;
> > }
> > - vma_start_write(next);
> > + error = vma_start_write_killable(next);
> > + if (error)
> > + goto munmap_gather_failed;
> > mas_set(mas_detach, vms->vma_count++);
> > error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
> > if (error)
> > @@ -1848,12 +1900,16 @@ static void vma_link_file(struct vm_area_struct *vma, bool hold_rmap_lock)
> > static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
> > {
> > VMA_ITERATOR(vmi, mm, 0);
> > + int err;
> > +
> > + err = vma_start_write_killable(vma);
> > + if (err)
> > + return err;
> >
> > vma_iter_config(&vmi, vma->vm_start, vma->vm_end);
> > if (vma_iter_prealloc(&vmi, vma))
> > return -ENOMEM;
> >
> > - vma_start_write(vma);
> > vma_iter_store_new(&vmi, vma);
> > vma_link_file(vma, /* hold_rmap_lock= */false);
> > mm->map_count++;
> > @@ -2239,9 +2295,8 @@ int mm_take_all_locks(struct mm_struct *mm)
> > * is reached.
> > */
> > for_each_vma(vmi, vma) {
> > - if (signal_pending(current))
> > + if (signal_pending(current) || vma_start_write_killable(vma))
> > goto out_unlock;
> > - vma_start_write(vma);
> > }
> >
> > vma_iter_init(&vmi, mm, 0);
> > @@ -2540,8 +2595,8 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
> > struct mmap_action *action)
> > {
> > struct vma_iterator *vmi = map->vmi;
> > - int error = 0;
> > struct vm_area_struct *vma;
> > + int error;
> >
> > /*
> > * Determine the object being mapped and call the appropriate
> > @@ -2552,6 +2607,14 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
> > if (!vma)
> > return -ENOMEM;
> >
> > + /*
> > + * Lock the VMA early to avoid extra work if fatal signal
> > + * is pending.
> > + */
> > + error = vma_start_write_killable(vma);
> > + if (error)
> > + goto free_vma;
> > +
> > vma_iter_config(vmi, map->addr, map->end);
> > vma_set_range(vma, map->addr, map->end, map->pgoff);
> > vma->flags = map->vma_flags;
> > @@ -2582,8 +2645,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
> > WARN_ON_ONCE(!arch_validate_flags(map->vm_flags));
> > #endif
> >
> > - /* Lock the VMA since it is modified after insertion into VMA tree */
> > - vma_start_write(vma);
> > vma_iter_store_new(vmi, vma);
> > map->mm->map_count++;
> > vma_link_file(vma, action->hide_from_rmap_until_complete);
> > @@ -2878,6 +2939,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > unsigned long addr, unsigned long len, vma_flags_t vma_flags)
> > {
> > struct mm_struct *mm = current->mm;
> > + int err;
> >
> > /*
> > * Check against address space limits by the changed size
> > @@ -2910,24 +2972,33 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >
> > if (vma_merge_new_range(&vmg))
> > goto out;
> > - else if (vmg_nomem(&vmg))
> > + if (vmg_nomem(&vmg)) {
> > + err = -ENOMEM;
> > goto unacct_fail;
> > + }
> > }
> >
> > if (vma)
> > vma_iter_next_range(vmi);
> > /* create a vma struct for an anonymous mapping */
> > vma = vm_area_alloc(mm);
> > - if (!vma)
> > + if (!vma) {
> > + err = -ENOMEM;
> > goto unacct_fail;
> > + }
> >
> > vma_set_anonymous(vma);
> > vma_set_range(vma, addr, addr + len, addr >> PAGE_SHIFT);
> > vma->flags = vma_flags;
> > vma->vm_page_prot = vm_get_page_prot(vma_flags_to_legacy(vma_flags));
> > - vma_start_write(vma);
> > - if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
> > + if (vma_start_write_killable(vma)) {
> > + err = -EINTR;
> > + goto vma_lock_fail;
> > + }
> > + if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL)) {
> > + err = -ENOMEM;
> > goto mas_store_fail;
> > + }
> >
> > mm->map_count++;
> > validate_mm(mm);
> > @@ -2942,10 +3013,11 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > return 0;
> >
> > mas_store_fail:
> > +vma_lock_fail:
> > vm_area_free(vma);
> > unacct_fail:
> > vm_unacct_memory(len >> PAGE_SHIFT);
> > - return -ENOMEM;
> > + return err;
> > }
> >
> > /**
> > @@ -3112,8 +3184,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > struct mm_struct *mm = vma->vm_mm;
> > struct vm_area_struct *next;
> > unsigned long gap_addr;
> > - int error = 0;
> > VMA_ITERATOR(vmi, mm, vma->vm_start);
> > + int error;
> >
> > if (!vma_test(vma, VMA_GROWSUP_BIT))
> > return -EFAULT;
> > @@ -3149,12 +3221,14 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> >
> > /* We must make sure the anon_vma is allocated. */
> > if (unlikely(anon_vma_prepare(vma))) {
> > - vma_iter_free(&vmi);
> > - return -ENOMEM;
> > + error = -ENOMEM;
> > + goto vma_prep_fail;
> > }
> >
> > /* Lock the VMA before expanding to prevent concurrent page faults */
> > - vma_start_write(vma);
> > + error = vma_start_write_killable(vma);
> > + if (error)
> > + goto vma_lock_fail;
> > /* We update the anon VMA tree. */
> > anon_vma_lock_write(vma->anon_vma);
> >
> > @@ -3183,8 +3257,10 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > }
> > }
> > anon_vma_unlock_write(vma->anon_vma);
> > - vma_iter_free(&vmi);
> > validate_mm(mm);
> > +vma_lock_fail:
> > +vma_prep_fail:
> > + vma_iter_free(&vmi);
> > return error;
> > }
> > #endif /* CONFIG_STACK_GROWSUP */
> > @@ -3197,8 +3273,8 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > struct vm_area_struct *prev;
> > - int error = 0;
> > VMA_ITERATOR(vmi, mm, vma->vm_start);
> > + int error;
> >
> > if (!vma_test(vma, VMA_GROWSDOWN_BIT))
> > return -EFAULT;
> > @@ -3228,12 +3304,14 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> >
> > /* We must make sure the anon_vma is allocated. */
> > if (unlikely(anon_vma_prepare(vma))) {
> > - vma_iter_free(&vmi);
> > - return -ENOMEM;
> > + error = -ENOMEM;
> > + goto vma_prep_fail;
> > }
> >
> > /* Lock the VMA before expanding to prevent concurrent page faults */
> > - vma_start_write(vma);
> > + error = vma_start_write_killable(vma);
> > + if (error)
> > + goto vma_lock_fail;
> > /* We update the anon VMA tree. */
> > anon_vma_lock_write(vma->anon_vma);
> >
> > @@ -3263,8 +3341,10 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> > }
> > }
> > anon_vma_unlock_write(vma->anon_vma);
> > - vma_iter_free(&vmi);
> > validate_mm(mm);
> > +vma_lock_fail:
> > +vma_prep_fail:
> > + vma_iter_free(&vmi);
> > return error;
> > }
> >
> > diff --git a/mm/vma_exec.c b/mm/vma_exec.c
> > index 5cee8b7efa0f..8ddcc791d828 100644
> > --- a/mm/vma_exec.c
> > +++ b/mm/vma_exec.c
> > @@ -41,6 +41,7 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
> > struct vm_area_struct *next;
> > struct mmu_gather tlb;
> > PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length);
> > + int err;
> >
> > BUG_ON(new_start > new_end);
> >
> > @@ -56,8 +57,9 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
> > * cover the whole range: [new_start, old_end)
> > */
> > vmg.target = vma;
> > - if (vma_expand(&vmg))
> > - return -ENOMEM;
> > + err = vma_expand(&vmg);
> > + if (err)
> > + return err;
> >
> > /*
> > * move the page tables downwards, on failure we rely on
> > --
> > 2.53.0.1018.g2bb0e51243-goog
> >
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v6 5/6] mm: use vma_start_write_killable() in process_vma_walk_lock()
2026-03-31 10:38 ` Lorenzo Stoakes (Oracle)
@ 2026-03-31 15:43 ` Suren Baghdasaryan
0 siblings, 0 replies; 23+ messages in thread
From: Suren Baghdasaryan @ 2026-03-31 15:43 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: akpm, willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, lance.yang, vbabka, jannh,
rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe, chleroy,
borntraeger, frankja, imbrenda, hca, gor, agordeev, svens,
gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390
On Tue, Mar 31, 2026 at 3:39 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> On Fri, Mar 27, 2026 at 01:54:56PM -0700, Suren Baghdasaryan wrote:
> > Replace vma_start_write() with vma_start_write_killable() when
> > process_vma_walk_lock() is used with PGWALK_WRLOCK option.
> > Adjust its direct and indirect users to check for a possible error
> > and handle it. Ensure users handle EINTR correctly and do not ignore
> > it. When queue_pages_range() fails, check whether it failed due to
> > a fatal signal or some other reason and return appropriate error.
> >
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > fs/proc/task_mmu.c | 12 ++++++------
> > mm/mempolicy.c | 10 +++++++++-
> > mm/pagewalk.c | 22 +++++++++++++++-------
> > 3 files changed, 30 insertions(+), 14 deletions(-)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index e091931d7ca1..33e5094a7842 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1774,15 +1774,15 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > struct vm_area_struct *vma;
> > enum clear_refs_types type;
> > int itype;
> > - int rv;
> > + int err;
> >
> > if (count > sizeof(buffer) - 1)
> > count = sizeof(buffer) - 1;
> > if (copy_from_user(buffer, buf, count))
> > return -EFAULT;
> > - rv = kstrtoint(strstrip(buffer), 10, &itype);
> > - if (rv < 0)
> > - return rv;
> > + err = kstrtoint(strstrip(buffer), 10, &itype);
> > + if (err)
> > + return err;
> > type = (enum clear_refs_types)itype;
> > if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
> > return -EINVAL;
> > @@ -1824,7 +1824,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > 0, mm, 0, -1UL);
> > mmu_notifier_invalidate_range_start(&range);
> > }
> > - walk_page_range(mm, 0, -1, &clear_refs_walk_ops, &cp);
> > + err = walk_page_range(mm, 0, -1, &clear_refs_walk_ops, &cp);
> > if (type == CLEAR_REFS_SOFT_DIRTY) {
> > mmu_notifier_invalidate_range_end(&range);
> > flush_tlb_mm(mm);
> > @@ -1837,7 +1837,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > }
> > put_task_struct(task);
> >
> > - return count;
> > + return err ? : count;
> > }
> >
> > const struct file_operations proc_clear_refs_operations = {
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index c38a90487531..51f298cfc33b 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -969,6 +969,7 @@ static const struct mm_walk_ops queue_pages_lock_vma_walk_ops = {
> > * (a hugetlbfs page or a transparent huge page being counted as 1).
> > * -EIO - a misplaced page found, when MPOL_MF_STRICT specified without MOVEs.
> > * -EFAULT - a hole in the memory range, when MPOL_MF_DISCONTIG_OK unspecified.
> > + * -EINTR - walk got terminated due to pending fatal signal.
> > */
> > static long
> > queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
> > @@ -1545,7 +1546,14 @@ static long do_mbind(unsigned long start, unsigned long len,
> > flags | MPOL_MF_INVERT | MPOL_MF_WRLOCK, &pagelist);
> >
> > if (nr_failed < 0) {
> > - err = nr_failed;
> > + /*
> > + * queue_pages_range() might override the original error with -EFAULT.
> > + * Confirm that fatal signals are still treated correctly.
> > + */
> > + if (fatal_signal_pending(current))
> > + err = -EINTR;
> > + else
> > + err = nr_failed;
>
> Is that really a big deal? Does it really matter if the caller doesn't get
> -EINTR in this case? This feels like another sashiko nitpick and is adding a
> bunch of additional complexity here.
I think there is a difference in that the userspace caller will never
see EINTR, as Matthew explained in an earlier version, while EFAULT
will be seen and can be handled. I don't see why we wouldn't want to
report the correct error, and the code doing that seems clear and
straight-forward to me. That was in fact your suggested solution for
not checking for EINTR :)
>
> I mean if you 'filter' error messages you might always end up with an error
> that's different than the original...
>
> > nr_failed = 0;
> > } else {
> > vma_iter_init(&vmi, mm, start);
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index 3ae2586ff45b..eca7bc711617 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -443,14 +443,13 @@ static inline void process_mm_walk_lock(struct mm_struct *mm,
> > mmap_assert_write_locked(mm);
> > }
> >
> > -static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> > - enum page_walk_lock walk_lock)
> > +static int process_vma_walk_lock(struct vm_area_struct *vma,
> > + enum page_walk_lock walk_lock)
> > {
> > #ifdef CONFIG_PER_VMA_LOCK
> > switch (walk_lock) {
> > case PGWALK_WRLOCK:
> > - vma_start_write(vma);
> > - break;
> > + return vma_start_write_killable(vma);
> > case PGWALK_WRLOCK_VERIFY:
> > vma_assert_write_locked(vma);
> > break;
> > @@ -462,6 +461,7 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> > break;
> > }
> > #endif
> > + return 0;
> > }
> >
> > /*
> > @@ -505,7 +505,9 @@ int walk_page_range_mm_unsafe(struct mm_struct *mm, unsigned long start,
> > if (ops->pte_hole)
> > err = ops->pte_hole(start, next, -1, &walk);
> > } else { /* inside vma */
> > - process_vma_walk_lock(vma, ops->walk_lock);
> > + err = process_vma_walk_lock(vma, ops->walk_lock);
> > + if (err)
> > + break;
> > walk.vma = vma;
> > next = min(end, vma->vm_end);
> > vma = find_vma(mm, vma->vm_end);
> > @@ -722,6 +724,7 @@ int walk_page_range_vma_unsafe(struct vm_area_struct *vma, unsigned long start,
> > .vma = vma,
> > .private = private,
> > };
> > + int err;
> >
> > if (start >= end || !walk.mm)
> > return -EINVAL;
> > @@ -729,7 +732,9 @@ int walk_page_range_vma_unsafe(struct vm_area_struct *vma, unsigned long start,
> > return -EINVAL;
> >
> > process_mm_walk_lock(walk.mm, ops->walk_lock);
> > - process_vma_walk_lock(vma, ops->walk_lock);
> > + err = process_vma_walk_lock(vma, ops->walk_lock);
> > + if (err)
> > + return err;
> > return __walk_page_range(start, end, &walk);
> > }
> >
> > @@ -752,6 +757,7 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
> > .vma = vma,
> > .private = private,
> > };
> > + int err;
> >
> > if (!walk.mm)
> > return -EINVAL;
> > @@ -759,7 +765,9 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
> > return -EINVAL;
> >
> > process_mm_walk_lock(walk.mm, ops->walk_lock);
> > - process_vma_walk_lock(vma, ops->walk_lock);
> > + err = process_vma_walk_lock(vma, ops->walk_lock);
> > + if (err)
> > + return err;
> > return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
> > }
> >
> > --
> > 2.53.0.1018.g2bb0e51243-goog
> >
* Re: [PATCH v6 2/6] mm: use vma_start_write_killable() in mm syscalls
2026-03-31 15:01 ` Suren Baghdasaryan
@ 2026-03-31 18:29 ` Andrew Morton
2026-03-31 18:47 ` Lorenzo Stoakes (Oracle)
0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2026-03-31 18:29 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Lorenzo Stoakes (Oracle), willy, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390
On Tue, 31 Mar 2026 08:01:11 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> > Instead? That is, assuming we really need to care about this at all.
> >
> > But I think I don't like this change at all?
>
> Yeah, this was the part I wasn't sure if it's worth adding. With your
> vote confirming my scepticism I'll go ahead and remove the parts I
> added to avoid extra vma_start_write_killable() call (3 instances in
> all) and will post v7.
Thanks. I'll remove v6 from mm.git and shall await Lorenzo's advice on
v7.
* Re: [PATCH v6 2/6] mm: use vma_start_write_killable() in mm syscalls
2026-03-31 18:29 ` Andrew Morton
@ 2026-03-31 18:47 ` Lorenzo Stoakes (Oracle)
2026-03-31 20:14 ` Suren Baghdasaryan
0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-31 18:47 UTC (permalink / raw)
To: Andrew Morton
Cc: Suren Baghdasaryan, willy, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390
On Tue, Mar 31, 2026 at 11:29:21AM -0700, Andrew Morton wrote:
> On Tue, 31 Mar 2026 08:01:11 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > > Instead? That is, assuming we really need to care about this at all.
> > >
> > > But I think I don't like this change at all?
> >
> > Yeah, this was the part I wasn't sure if it's worth adding. With your
> > vote confirming my scepticism I'll go ahead and remove the parts I
> > added to avoid extra vma_start_write_killable() call (3 instances in
> > all) and will post v7.
>
> Thanks. I'll remove v6 from mm.git and shall await Lorenzo's advice on
> v7.
>
Ack, if we just drop the problematic bits we might still be fine for 7.1 :)
Cheers, Lorenzo
* Re: [PATCH v6 2/6] mm: use vma_start_write_killable() in mm syscalls
2026-03-31 18:47 ` Lorenzo Stoakes (Oracle)
@ 2026-03-31 20:14 ` Suren Baghdasaryan
2026-04-02 13:19 ` Lorenzo Stoakes (Oracle)
0 siblings, 1 reply; 23+ messages in thread
From: Suren Baghdasaryan @ 2026-03-31 20:14 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: Andrew Morton, willy, david, ziy, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, gourry, ying.huang, apopple, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe,
chleroy, borntraeger, frankja, imbrenda, hca, gor, agordeev,
svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390
On Tue, Mar 31, 2026 at 11:47 AM Lorenzo Stoakes (Oracle)
<ljs@kernel.org> wrote:
>
> On Tue, Mar 31, 2026 at 11:29:21AM -0700, Andrew Morton wrote:
> > On Tue, 31 Mar 2026 08:01:11 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > > > Instead? That is, assuming we really need to care about this at all.
> > > >
> > > > But I think I don't like this change at all?
> > >
> > > Yeah, this was the part I wasn't sure if it's worth adding. With your
> > > vote confirming my scepticism I'll go ahead and remove the parts I
> > > added to avoid extra vma_start_write_killable() call (3 instances in
> > > all) and will post v7.
> >
> > Thanks. I'll remove v6 from mm.git and shall await Lorenzo's advice on
> > v7.
> >
>
> Ack, if we just drop the problematic bits we might still be fine for 7.1 :)
Let's get your cleanup first and then see where we are. I also hate
the way I have to report a fake error code, so with that cleanup the
patchset should be much nicer.
>
> Cheers, Lorenzo
* Re: [PATCH v6 2/6] mm: use vma_start_write_killable() in mm syscalls
2026-03-31 20:14 ` Suren Baghdasaryan
@ 2026-04-02 13:19 ` Lorenzo Stoakes (Oracle)
2026-04-02 15:11 ` Suren Baghdasaryan
0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-04-02 13:19 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Andrew Morton, willy, david, ziy, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, gourry, ying.huang, apopple, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe,
chleroy, borntraeger, frankja, imbrenda, hca, gor, agordeev,
svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390
On Tue, Mar 31, 2026 at 01:14:51PM -0700, Suren Baghdasaryan wrote:
> On Tue, Mar 31, 2026 at 11:47 AM Lorenzo Stoakes (Oracle)
> <ljs@kernel.org> wrote:
> >
> > On Tue, Mar 31, 2026 at 11:29:21AM -0700, Andrew Morton wrote:
> > > On Tue, 31 Mar 2026 08:01:11 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > > > Instead? That is, assuming we really need to care about this at all.
> > > > >
> > > > > But I think I don't like this change at all?
> > > >
> > > > Yeah, this was the part I wasn't sure if it's worth adding. With your
> > > > vote confirming my scepticism I'll go ahead and remove the parts I
> > > > added to avoid extra vma_start_write_killable() call (3 instances in
> > > > all) and will post v7.
> > >
> > > Thanks. I'll remove v6 from mm.git and shall await Lorenzo's advice on
> > > v7.
> > >
> >
> > Ack, if we just drop the problematic bits we might still be fine for 7.1 :)
>
> Let's get your cleanup first and then see where we are. I also hate
> the way I have to report a fake error code, so with that cleanup the
> patchset should be much nicer.
Ack, yeah it's horrid, + _my fault_ :)
I can put something together now perhaps that can potentially be queued up for
7.2 at 7.1-rc1.
Cheers, Lorenzo
* Re: [PATCH v6 2/6] mm: use vma_start_write_killable() in mm syscalls
2026-04-02 13:19 ` Lorenzo Stoakes (Oracle)
@ 2026-04-02 15:11 ` Suren Baghdasaryan
2026-04-02 15:20 ` Lorenzo Stoakes (Oracle)
0 siblings, 1 reply; 23+ messages in thread
From: Suren Baghdasaryan @ 2026-04-02 15:11 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: Andrew Morton, willy, david, ziy, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, gourry, ying.huang, apopple, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe,
chleroy, borntraeger, frankja, imbrenda, hca, gor, agordeev,
svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390
On Thu, Apr 2, 2026 at 6:19 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> On Tue, Mar 31, 2026 at 01:14:51PM -0700, Suren Baghdasaryan wrote:
> > On Tue, Mar 31, 2026 at 11:47 AM Lorenzo Stoakes (Oracle)
> > <ljs@kernel.org> wrote:
> > >
> > > On Tue, Mar 31, 2026 at 11:29:21AM -0700, Andrew Morton wrote:
> > > > On Tue, 31 Mar 2026 08:01:11 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > > > Instead? That is, assuming we really need to care about this at all.
> > > > > >
> > > > > > But I think I don't like this change at all?
> > > > >
> > > > > Yeah, this was the part I wasn't sure if it's worth adding. With your
> > > > > vote confirming my scepticism I'll go ahead and remove the parts I
> > > > > added to avoid extra vma_start_write_killable() call (3 instances in
> > > > > all) and will post v7.
> > > >
> > > > Thanks. I'll remove v6 from mm.git and shall await Lorenzo's advice on
> > > > v7.
> > > >
> > >
> > > Ack, if we just drop the problematic bits we might still be fine for 7.1 :)
> >
> > Let's get your cleanup first and then see where we are. I also hate
> > the way I have to report a fake error code, so with that cleanup the
> > patchset should be much nicer.
>
> Ack, yeah it's horrid, + _my fault_ :)
No worries, I can wait.
>
> I can put something together now perhaps that can potentially be queued up for
> 7.2 at 7.1-rc1.
Thanks! That would be appreciated.
>
> Cheers, Lorenzo
* Re: [PATCH v6 2/6] mm: use vma_start_write_killable() in mm syscalls
2026-04-02 15:11 ` Suren Baghdasaryan
@ 2026-04-02 15:20 ` Lorenzo Stoakes (Oracle)
0 siblings, 0 replies; 23+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-04-02 15:20 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Andrew Morton, willy, david, ziy, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, gourry, ying.huang, apopple, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy, npiggin, mpe,
chleroy, borntraeger, frankja, imbrenda, hca, gor, agordeev,
svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm, linux-kernel,
linux-s390
On Thu, Apr 02, 2026 at 08:11:44AM -0700, Suren Baghdasaryan wrote:
> On Thu, Apr 2, 2026 at 6:19 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > On Tue, Mar 31, 2026 at 01:14:51PM -0700, Suren Baghdasaryan wrote:
> > > On Tue, Mar 31, 2026 at 11:47 AM Lorenzo Stoakes (Oracle)
> > > <ljs@kernel.org> wrote:
> > > >
> > > > On Tue, Mar 31, 2026 at 11:29:21AM -0700, Andrew Morton wrote:
> > > > > On Tue, 31 Mar 2026 08:01:11 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >
> > > > > > > Instead? That is, assuming we really need to care about this at all.
> > > > > > >
> > > > > > > But I think I don't like this change at all?
> > > > > >
> > > > > > Yeah, this was the part I wasn't sure if it's worth adding. With your
> > > > > > vote confirming my scepticism I'll go ahead and remove the parts I
> > > > > > added to avoid extra vma_start_write_killable() call (3 instances in
> > > > > > all) and will post v7.
> > > > >
> > > > > Thanks. I'll remove v6 from mm.git and shall await Lorenzo's advice on
> > > > > v7.
> > > > >
> > > >
> > > > Ack, if we just drop the problematic bits we might still be fine for 7.1 :)
> > >
> > > Let's get your cleanup first and then see where we are. I also hate
> > > the way I have to report a fake error code, so with that cleanup the
> > > patchset should be much nicer.
> >
> > Ack, yeah it's horrid, + _my fault_ :)
>
> No worries, I can wait.
>
> >
> > I can put something together now perhaps that can potentially be queued up for
> > 7.2 at 7.1-rc1.
>
> Thanks! That would be appreciated.
This might not quite be 'now' as of course it's grown into something much
bigger :P
Cheers, Lorenzo
Thread overview: 23+ messages
2026-03-27 20:54 [PATCH v6 0/6] Use killable vma write locking in most places Suren Baghdasaryan
2026-03-27 20:54 ` [PATCH v6 1/6] mm/vma: cleanup error handling path in vma_expand() Suren Baghdasaryan
2026-03-27 20:54 ` [PATCH v6 2/6] mm: use vma_start_write_killable() in mm syscalls Suren Baghdasaryan
2026-03-31 9:35 ` Lorenzo Stoakes (Oracle)
2026-03-31 15:01 ` Suren Baghdasaryan
2026-03-31 18:29 ` Andrew Morton
2026-03-31 18:47 ` Lorenzo Stoakes (Oracle)
2026-03-31 20:14 ` Suren Baghdasaryan
2026-04-02 13:19 ` Lorenzo Stoakes (Oracle)
2026-04-02 15:11 ` Suren Baghdasaryan
2026-04-02 15:20 ` Lorenzo Stoakes (Oracle)
2026-03-27 20:54 ` [PATCH v6 3/6] mm/khugepaged: use vma_start_write_killable() in collapse_huge_page() Suren Baghdasaryan
2026-03-27 20:54 ` [PATCH v6 4/6] mm/vma: use vma_start_write_killable() in vma operations Suren Baghdasaryan
2026-03-31 10:24 ` Lorenzo Stoakes (Oracle)
2026-03-31 15:37 ` Suren Baghdasaryan
2026-03-27 20:54 ` [PATCH v6 5/6] mm: use vma_start_write_killable() in process_vma_walk_lock() Suren Baghdasaryan
2026-03-31 10:38 ` Lorenzo Stoakes (Oracle)
2026-03-31 15:43 ` Suren Baghdasaryan
2026-03-27 20:54 ` [PATCH v6 6/6] KVM: PPC: use vma_start_write_killable() in kvmppc_memslot_page_merge() Suren Baghdasaryan
2026-03-27 23:12 ` [PATCH v6 0/6] Use killable vma write locking in most places Andrew Morton
2026-03-31 9:51 ` Lorenzo Stoakes (Oracle)
2026-03-31 15:06 ` Suren Baghdasaryan
2026-03-31 15:34 ` Suren Baghdasaryan