* [PATCH v4 0/4] Use killable vma write locking in most places
@ 2026-03-22 5:43 Suren Baghdasaryan
2026-03-22 5:43 ` [PATCH v4 1/4] mm/vma: cleanup error handling path in vma_expand() Suren Baghdasaryan
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Suren Baghdasaryan @ 2026-03-22 5:43 UTC (permalink / raw)
To: akpm
Cc: willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390, surenb
Now that we have vma_start_write_killable() we can replace most of the
vma_start_write() calls with it, improving reaction time to the kill
signal.
There are several places which are left untouched by this patchset:
1. free_pgtables(), because the function should free page tables even
if a fatal signal is pending.
2. userfaultfd code, where some paths calling vma_start_write() can
handle EINTR and some can't without a deeper code refactoring.
3. mpol_rebind_mm() which is used by the cpuset controller for migrations
and operates on a remote mm. Incomplete operations here would result
in an inconsistent cgroup state.
4. vm_flags_{set|mod|clear} require refactoring that involves moving
vma_start_write() out of these functions and replacing it with
vma_assert_write_locked(), then callers of these functions should
lock the vma themselves using vma_start_write_killable() whenever
possible.
A cleanup patch is added in the beginning to make later changes more
readable. The second patch contains most of the changes. The third patch
is a small error handling fixup. The last patch contains the changes
associated with process_vma_walk_lock() error handling.
Changes since v3 [1]:
- rebased over mm-unstable
- added Reviewed-by, per Liam R. Howlett and Lorenzo Stoakes
- moved locking before vma_iter_prealloc in vma_shrink and in vma_link,
per Liam R. Howlett
- added a separate jump label for vma lock failure case in do_brk_flags(),
per Liam R. Howlett
- fixed cpusset -> cpuset, per Lorenzo Stoakes
- added comments explaining vma_start_write moves, per Lorenzo Stoakes
- amended patch description with explanation why vma_start_write moves
are safe, per Lorenzo Stoakes
- added comments listing EINTR as a new possible error code,
per Lorenzo Stoakes
- moved comments in mlock_fixup() and apply_mlockall_flags() to more
appropriate places, per Lorenzo Stoakes
- replaced check for EINTR with fatal_signal_pending() with a comment why
it is safe, per Lorenzo Stoakes
- fixed error check in mprotect_fixup(), per Lorenzo Stoakes
- moved vma_start_write_killable() before allocations inside __split_vma()
with a clarifying comment
- changed mmap_region() to set err for each failing case,
per Lorenzo Stoakes
- changed label names in expand_upwards() and expand_downwards(),
per Lorenzo Stoakes
- changed "if (err < 0)" to "if (err)" for consistency,
per Lorenzo Stoakes
- separated error checking fix for s390 into its own patch,
per Lorenzo Stoakes
- removed special handling for EINTR, per Lorenzo Stoakes
- dropped changes trying to propagate EINTR during vma merge,
per Lorenzo Stoakes
[1] https://lore.kernel.org/all/20260226070609.3072570-1-surenb@google.com/
Suren Baghdasaryan (4):
mm/vma: cleanup error handling path in vma_expand()
mm: replace vma_start_write() with vma_start_write_killable()
KVM: s390: avoid kvm_s390_handle_pv() error overwrite
mm: use vma_start_write_killable() in process_vma_walk_lock()
arch/powerpc/kvm/book3s_hv_uvmem.c | 5 +-
arch/s390/kvm/kvm-s390.c | 2 +
fs/proc/task_mmu.c | 5 +-
mm/khugepaged.c | 5 +-
mm/madvise.c | 4 +-
mm/memory.c | 2 +
mm/mempolicy.c | 13 ++-
mm/mlock.c | 28 ++++--
mm/mprotect.c | 5 +-
mm/mremap.c | 4 +-
mm/pagewalk.c | 20 +++--
mm/vma.c | 133 +++++++++++++++++++++--------
mm/vma_exec.c | 6 +-
13 files changed, 173 insertions(+), 59 deletions(-)
base-commit: 8c65073d94c8b7cc3170de31af38edc9f5d96f0e
--
2.53.0.1018.g2bb0e51243-goog
* [PATCH v4 1/4] mm/vma: cleanup error handling path in vma_expand()
2026-03-22 5:43 [PATCH v4 0/4] Use killable vma write locking in most places Suren Baghdasaryan
@ 2026-03-22 5:43 ` Suren Baghdasaryan
2026-03-22 7:49 ` Barry Song
2026-03-22 5:43 ` [PATCH v4 2/4] mm: replace vma_start_write() with vma_start_write_killable() Suren Baghdasaryan
` (3 subsequent siblings)
4 siblings, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2026-03-22 5:43 UTC (permalink / raw)
To: akpm
Cc: willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390, surenb
vma_expand() error handling is a bit confusing with "if (ret) return ret;"
mixed with "if (!ret && ...) ret = ...;". Simplify the code to check
for errors and return immediately after an operation that might fail.
This also makes later changes to this function more readable.
Change variable name for storing the error code from "ret" to "err".
No functional change intended.
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/vma.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/mm/vma.c b/mm/vma.c
index a43f3c5d4b3d..ba78ab1f397a 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -1170,7 +1170,7 @@ int vma_expand(struct vma_merge_struct *vmg)
vma_flags_t sticky_flags =
vma_flags_and_mask(&vmg->vma_flags, VMA_STICKY_FLAGS);
vma_flags_t target_sticky;
- int ret = 0;
+ int err = 0;
mmap_assert_write_locked(vmg->mm);
vma_start_write(target);
@@ -1200,12 +1200,16 @@ int vma_expand(struct vma_merge_struct *vmg)
* Note that, by convention, callers ignore OOM for this case, so
* we don't need to account for vmg->give_up_on_mm here.
*/
- if (remove_next)
- ret = dup_anon_vma(target, next, &anon_dup);
- if (!ret && vmg->copied_from)
- ret = dup_anon_vma(target, vmg->copied_from, &anon_dup);
- if (ret)
- return ret;
+ if (remove_next) {
+ err = dup_anon_vma(target, next, &anon_dup);
+ if (err)
+ return err;
+ }
+ if (vmg->copied_from) {
+ err = dup_anon_vma(target, vmg->copied_from, &anon_dup);
+ if (err)
+ return err;
+ }
if (remove_next) {
vma_flags_t next_sticky;
--
2.53.0.1018.g2bb0e51243-goog
* [PATCH v4 2/4] mm: replace vma_start_write() with vma_start_write_killable()
2026-03-22 5:43 [PATCH v4 0/4] Use killable vma write locking in most places Suren Baghdasaryan
2026-03-22 5:43 ` [PATCH v4 1/4] mm/vma: cleanup error handling path in vma_expand() Suren Baghdasaryan
@ 2026-03-22 5:43 ` Suren Baghdasaryan
2026-03-23 17:49 ` Lorenzo Stoakes (Oracle)
2026-03-22 5:43 ` [PATCH v4 3/4] KVM: s390: avoid kvm_s390_handle_pv() error overwrite Suren Baghdasaryan
` (2 subsequent siblings)
4 siblings, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2026-03-22 5:43 UTC (permalink / raw)
To: akpm
Cc: willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390, surenb, Ritesh Harjani (IBM)
Now that we have vma_start_write_killable() we can replace most of the
vma_start_write() calls with it, improving reaction time to the kill
signal.
There are several places which are left untouched by this patch:
1. free_pgtables(), because the function should free page tables even
if a fatal signal is pending.
2. process_vma_walk_lock(), which requires changes in its callers and
will be handled in the next patch.
3. userfaultfd code, where some paths calling vma_start_write() can
handle EINTR and some can't without a deeper code refactoring.
4. mpol_rebind_mm() which is used by the cpuset controller for migrations
and operates on a remote mm. Incomplete operations here would result
in an inconsistent cgroup state.
5. vm_flags_{set|mod|clear} require refactoring that involves moving
vma_start_write() out of these functions and replacing it with
vma_assert_write_locked(), then callers of these functions should
lock the vma themselves using vma_start_write_killable() whenever
possible.
In a number of places we now lock VMA earlier than before to avoid
doing work and undoing it later if a fatal signal is pending. This
is safe because the moves are happening within sections where we
already hold the mmap_write_lock, so the moves do not change the
locking order relative to other kernel locks.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> # powerpc
---
arch/powerpc/kvm/book3s_hv_uvmem.c | 5 +-
mm/khugepaged.c | 5 +-
mm/madvise.c | 4 +-
mm/memory.c | 2 +
mm/mempolicy.c | 12 ++-
mm/mlock.c | 28 +++++--
mm/mprotect.c | 5 +-
mm/mremap.c | 4 +-
mm/vma.c | 117 +++++++++++++++++++++--------
mm/vma_exec.c | 6 +-
10 files changed, 142 insertions(+), 46 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 5fbb95d90e99..0a28b48a46b8 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -410,7 +410,10 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
ret = H_STATE;
break;
}
- vma_start_write(vma);
+ if (vma_start_write_killable(vma)) {
+ ret = H_STATE;
+ break;
+ }
/* Copy vm_flags to avoid partial modifications in ksm_madvise */
vm_flags = vma->vm_flags;
ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4b0e59c7c0e6..e2f263076084 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1159,7 +1159,10 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
if (result != SCAN_SUCCEED)
goto out_up_write;
/* check if the pmd is still valid */
- vma_start_write(vma);
+ if (vma_start_write_killable(vma)) {
+ result = SCAN_FAIL;
+ goto out_up_write;
+ }
result = check_pmd_still_valid(mm, address, pmd);
if (result != SCAN_SUCCEED)
goto out_up_write;
diff --git a/mm/madvise.c b/mm/madvise.c
index 69708e953cf5..feaa16b0e1dc 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -175,7 +175,9 @@ static int madvise_update_vma(vm_flags_t new_flags,
madv_behavior->vma = vma;
/* vm_flags is protected by the mmap_lock held in write mode. */
- vma_start_write(vma);
+ if (vma_start_write_killable(vma))
+ return -EINTR;
+
vma->flags = new_vma_flags;
if (set_new_anon_name)
return replace_anon_vma_name(vma, anon_name);
diff --git a/mm/memory.c b/mm/memory.c
index 68cc592ff0ba..b930459e32ec 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -366,6 +366,8 @@ void free_pgd_range(struct mmu_gather *tlb,
* page tables that should be removed. This can differ from the vma mappings on
* some archs that may have mappings that need to be removed outside the vmas.
* Note that the prev->vm_end and next->vm_start are often used.
+ * We don't use vma_start_write_killable() because page tables should be freed
+ * even if the task is being killed.
*
* The vma_end differs from the pg_end when a dup_mmap() failed and the tree has
* unrelated data to the mm_struct being torn down.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e5528c35bbb8..929e843543cf 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1784,7 +1784,8 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
return -EINVAL;
if (end == start)
return 0;
- mmap_write_lock(mm);
+ if (mmap_write_lock_killable(mm))
+ return -EINTR;
prev = vma_prev(&vmi);
for_each_vma_range(vmi, vma, end) {
/*
@@ -1801,13 +1802,20 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
err = -EOPNOTSUPP;
break;
}
+ /*
+ * Lock the VMA early to avoid extra work if fatal signal
+ * is pending.
+ */
+ if (vma_start_write_killable(vma)) {
+ err = -EINTR;
+ break;
+ }
new = mpol_dup(old);
if (IS_ERR(new)) {
err = PTR_ERR(new);
break;
}
- vma_start_write(vma);
new->home_node = home_node;
err = mbind_range(&vmi, vma, &prev, start, end, new);
mpol_put(new);
diff --git a/mm/mlock.c b/mm/mlock.c
index 8c227fefa2df..efbb9c783f25 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -419,8 +419,10 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
*
* Called for mlock(), mlock2() and mlockall(), to set @vma VM_LOCKED;
* called for munlock() and munlockall(), to clear VM_LOCKED from @vma.
+ *
+ * Return: 0 on success, -EINTR if fatal signal is pending.
*/
-static void mlock_vma_pages_range(struct vm_area_struct *vma,
+static int mlock_vma_pages_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
vma_flags_t *new_vma_flags)
{
@@ -442,7 +444,9 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
*/
if (vma_flags_test(new_vma_flags, VMA_LOCKED_BIT))
vma_flags_set(new_vma_flags, VMA_IO_BIT);
- vma_start_write(vma);
+ if (vma_start_write_killable(vma))
+ return -EINTR;
+
vma_flags_reset_once(vma, new_vma_flags);
lru_add_drain();
@@ -453,6 +457,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
vma_flags_clear(new_vma_flags, VMA_IO_BIT);
vma_flags_reset_once(vma, new_vma_flags);
}
+ return 0;
}
/*
@@ -506,11 +511,13 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
*/
if (vma_flags_test(&new_vma_flags, VMA_LOCKED_BIT) &&
vma_flags_test(&old_vma_flags, VMA_LOCKED_BIT)) {
+ ret = vma_start_write_killable(vma);
+ if (ret)
+ goto out;
/* No work to do, and mlocking twice would be wrong */
- vma_start_write(vma);
vma->flags = new_vma_flags;
} else {
- mlock_vma_pages_range(vma, start, end, &new_vma_flags);
+ ret = mlock_vma_pages_range(vma, start, end, &new_vma_flags);
}
out:
*prev = vma;
@@ -739,9 +746,18 @@ static int apply_mlockall_flags(int flags)
error = mlock_fixup(&vmi, vma, &prev, vma->vm_start, vma->vm_end,
newflags);
- /* Ignore errors, but prev needs fixing up. */
- if (error)
+ if (error) {
+ /*
+ * If we failed due to a pending fatal signal, return
+ * now. If we locked the vma before signal arrived, it
+ * will be unlocked when we drop mmap_write_lock.
+ */
+ if (fatal_signal_pending(current))
+ return -EINTR;
+
+ /* Ignore errors, but prev needs fixing up. */
prev = vma;
+ }
cond_resched();
}
out:
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 110d47a36d4b..ae6ed882b600 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -768,7 +768,10 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
* vm_flags and vm_page_prot are protected by the mmap_lock
* held in write mode.
*/
- vma_start_write(vma);
+ error = vma_start_write_killable(vma);
+ if (error)
+ goto fail;
+
vma_flags_reset_once(vma, &new_vma_flags);
if (vma_wants_manual_pte_write_upgrade(vma))
mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
diff --git a/mm/mremap.c b/mm/mremap.c
index e9c8b1d05832..dec39ec314f9 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1356,7 +1356,9 @@ static unsigned long move_vma(struct vma_remap_struct *vrm)
return -ENOMEM;
/* We don't want racing faults. */
- vma_start_write(vrm->vma);
+ err = vma_start_write_killable(vrm->vma);
+ if (err)
+ return err;
/* Perform copy step. */
err = copy_vma_and_data(vrm, &new_vma);
diff --git a/mm/vma.c b/mm/vma.c
index ba78ab1f397a..7930a4270eb9 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -524,6 +524,17 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT);
}
+ /*
+ * Lock VMAs before cloning to avoid extra work if fatal signal
+ * is pending.
+ */
+ err = vma_start_write_killable(vma);
+ if (err)
+ goto out_free_vma;
+ err = vma_start_write_killable(new);
+ if (err)
+ goto out_free_vma;
+
err = -ENOMEM;
vma_iter_config(vmi, new->vm_start, new->vm_end);
if (vma_iter_prealloc(vmi, new))
@@ -543,9 +554,6 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (new->vm_ops && new->vm_ops->open)
new->vm_ops->open(new);
- vma_start_write(vma);
- vma_start_write(new);
-
init_vma_prep(&vp, vma);
vp.insert = new;
vma_prepare(&vp);
@@ -900,12 +908,16 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
}
/* No matter what happens, we will be adjusting middle. */
- vma_start_write(middle);
+ err = vma_start_write_killable(middle);
+ if (err)
+ goto abort;
if (merge_right) {
vma_flags_t next_sticky;
- vma_start_write(next);
+ err = vma_start_write_killable(next);
+ if (err)
+ goto abort;
vmg->target = next;
next_sticky = vma_flags_and_mask(&next->flags, VMA_STICKY_FLAGS);
vma_flags_set_mask(&sticky_flags, next_sticky);
@@ -914,7 +926,9 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
if (merge_left) {
vma_flags_t prev_sticky;
- vma_start_write(prev);
+ err = vma_start_write_killable(prev);
+ if (err)
+ goto abort;
vmg->target = prev;
prev_sticky = vma_flags_and_mask(&prev->flags, VMA_STICKY_FLAGS);
@@ -1170,10 +1184,12 @@ int vma_expand(struct vma_merge_struct *vmg)
vma_flags_t sticky_flags =
vma_flags_and_mask(&vmg->vma_flags, VMA_STICKY_FLAGS);
vma_flags_t target_sticky;
- int err = 0;
+ int err;
mmap_assert_write_locked(vmg->mm);
- vma_start_write(target);
+ err = vma_start_write_killable(target);
+ if (err)
+ return err;
target_sticky = vma_flags_and_mask(&target->flags, VMA_STICKY_FLAGS);
@@ -1201,6 +1217,13 @@ int vma_expand(struct vma_merge_struct *vmg)
* we don't need to account for vmg->give_up_on_mm here.
*/
if (remove_next) {
+ /*
+ * Lock the VMA early to avoid extra work if fatal signal
+ * is pending.
+ */
+ err = vma_start_write_killable(next);
+ if (err)
+ return err;
err = dup_anon_vma(target, next, &anon_dup);
if (err)
return err;
@@ -1214,7 +1237,6 @@ int vma_expand(struct vma_merge_struct *vmg)
if (remove_next) {
vma_flags_t next_sticky;
- vma_start_write(next);
vmg->__remove_next = true;
next_sticky = vma_flags_and_mask(&next->flags, VMA_STICKY_FLAGS);
@@ -1252,9 +1274,14 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
unsigned long start, unsigned long end, pgoff_t pgoff)
{
struct vma_prepare vp;
+ int err;
WARN_ON((vma->vm_start != start) && (vma->vm_end != end));
+ err = vma_start_write_killable(vma);
+ if (err)
+ return err;
+
if (vma->vm_start < start)
vma_iter_config(vmi, vma->vm_start, start);
else
@@ -1263,8 +1290,6 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (vma_iter_prealloc(vmi, NULL))
return -ENOMEM;
- vma_start_write(vma);
-
init_vma_prep(&vp, vma);
vma_prepare(&vp);
vma_adjust_trans_huge(vma, start, end, NULL);
@@ -1453,7 +1478,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
if (error)
goto end_split_failed;
}
- vma_start_write(next);
+ error = vma_start_write_killable(next);
+ if (error)
+ goto munmap_gather_failed;
mas_set(mas_detach, vms->vma_count++);
error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
if (error)
@@ -1848,12 +1875,16 @@ static void vma_link_file(struct vm_area_struct *vma, bool hold_rmap_lock)
static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
{
VMA_ITERATOR(vmi, mm, 0);
+ int err;
+
+ err = vma_start_write_killable(vma);
+ if (err)
+ return err;
vma_iter_config(&vmi, vma->vm_start, vma->vm_end);
if (vma_iter_prealloc(&vmi, vma))
return -ENOMEM;
- vma_start_write(vma);
vma_iter_store_new(&vmi, vma);
vma_link_file(vma, /* hold_rmap_lock= */false);
mm->map_count++;
@@ -2239,9 +2270,8 @@ int mm_take_all_locks(struct mm_struct *mm)
* is reached.
*/
for_each_vma(vmi, vma) {
- if (signal_pending(current))
+ if (signal_pending(current) || vma_start_write_killable(vma))
goto out_unlock;
- vma_start_write(vma);
}
vma_iter_init(&vmi, mm, 0);
@@ -2540,8 +2570,8 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
struct mmap_action *action)
{
struct vma_iterator *vmi = map->vmi;
- int error = 0;
struct vm_area_struct *vma;
+ int error;
/*
* Determine the object being mapped and call the appropriate
@@ -2552,6 +2582,14 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
if (!vma)
return -ENOMEM;
+ /*
+ * Lock the VMA early to avoid extra work if fatal signal
+ * is pending.
+ */
+ error = vma_start_write_killable(vma);
+ if (error)
+ goto free_vma;
+
vma_iter_config(vmi, map->addr, map->end);
vma_set_range(vma, map->addr, map->end, map->pgoff);
vma->flags = map->vma_flags;
@@ -2582,8 +2620,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
WARN_ON_ONCE(!arch_validate_flags(map->vm_flags));
#endif
- /* Lock the VMA since it is modified after insertion into VMA tree */
- vma_start_write(vma);
vma_iter_store_new(vmi, vma);
map->mm->map_count++;
vma_link_file(vma, action->hide_from_rmap_until_complete);
@@ -2878,6 +2914,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
unsigned long addr, unsigned long len, vma_flags_t vma_flags)
{
struct mm_struct *mm = current->mm;
+ int err;
/*
* Check against address space limits by the changed size
@@ -2910,24 +2947,33 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (vma_merge_new_range(&vmg))
goto out;
- else if (vmg_nomem(&vmg))
+ if (vmg_nomem(&vmg)) {
+ err = -ENOMEM;
goto unacct_fail;
+ }
}
if (vma)
vma_iter_next_range(vmi);
/* create a vma struct for an anonymous mapping */
vma = vm_area_alloc(mm);
- if (!vma)
+ if (!vma) {
+ err = -ENOMEM;
goto unacct_fail;
+ }
vma_set_anonymous(vma);
vma_set_range(vma, addr, addr + len, addr >> PAGE_SHIFT);
vma->flags = vma_flags;
vma->vm_page_prot = vm_get_page_prot(vma_flags_to_legacy(vma_flags));
- vma_start_write(vma);
- if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
+ if (vma_start_write_killable(vma)) {
+ err = -EINTR;
+ goto vma_lock_fail;
+ }
+ if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL)) {
+ err = -ENOMEM;
goto mas_store_fail;
+ }
mm->map_count++;
validate_mm(mm);
@@ -2942,10 +2988,11 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
return 0;
mas_store_fail:
+vma_lock_fail:
vm_area_free(vma);
unacct_fail:
vm_unacct_memory(len >> PAGE_SHIFT);
- return -ENOMEM;
+ return err;
}
/**
@@ -3112,8 +3159,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *next;
unsigned long gap_addr;
- int error = 0;
VMA_ITERATOR(vmi, mm, vma->vm_start);
+ int error;
if (!vma_test(vma, VMA_GROWSUP_BIT))
return -EFAULT;
@@ -3149,12 +3196,14 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
/* We must make sure the anon_vma is allocated. */
if (unlikely(anon_vma_prepare(vma))) {
- vma_iter_free(&vmi);
- return -ENOMEM;
+ error = -ENOMEM;
+ goto vma_prep_fail;
}
/* Lock the VMA before expanding to prevent concurrent page faults */
- vma_start_write(vma);
+ error = vma_start_write_killable(vma);
+ if (error)
+ goto vma_lock_fail;
/* We update the anon VMA tree. */
anon_vma_lock_write(vma->anon_vma);
@@ -3183,6 +3232,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
}
}
anon_vma_unlock_write(vma->anon_vma);
+vma_lock_fail:
+vma_prep_fail:
vma_iter_free(&vmi);
validate_mm(mm);
return error;
@@ -3197,8 +3248,8 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
{
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *prev;
- int error = 0;
VMA_ITERATOR(vmi, mm, vma->vm_start);
+ int error;
if (!vma_test(vma, VMA_GROWSDOWN_BIT))
return -EFAULT;
@@ -3228,12 +3279,14 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
/* We must make sure the anon_vma is allocated. */
if (unlikely(anon_vma_prepare(vma))) {
- vma_iter_free(&vmi);
- return -ENOMEM;
+ error = -ENOMEM;
+ goto vma_prep_fail;
}
/* Lock the VMA before expanding to prevent concurrent page faults */
- vma_start_write(vma);
+ error = vma_start_write_killable(vma);
+ if (error)
+ goto vma_lock_fail;
/* We update the anon VMA tree. */
anon_vma_lock_write(vma->anon_vma);
@@ -3263,6 +3316,8 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
}
}
anon_vma_unlock_write(vma->anon_vma);
+vma_lock_fail:
+vma_prep_fail:
vma_iter_free(&vmi);
validate_mm(mm);
return error;
diff --git a/mm/vma_exec.c b/mm/vma_exec.c
index 5cee8b7efa0f..8ddcc791d828 100644
--- a/mm/vma_exec.c
+++ b/mm/vma_exec.c
@@ -41,6 +41,7 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
struct vm_area_struct *next;
struct mmu_gather tlb;
PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length);
+ int err;
BUG_ON(new_start > new_end);
@@ -56,8 +57,9 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
* cover the whole range: [new_start, old_end)
*/
vmg.target = vma;
- if (vma_expand(&vmg))
- return -ENOMEM;
+ err = vma_expand(&vmg);
+ if (err)
+ return err;
/*
* move the page tables downwards, on failure we rely on
--
2.53.0.1018.g2bb0e51243-goog
* [PATCH v4 3/4] KVM: s390: avoid kvm_s390_handle_pv() error overwrite
2026-03-22 5:43 [PATCH v4 0/4] Use killable vma write locking in most places Suren Baghdasaryan
2026-03-22 5:43 ` [PATCH v4 1/4] mm/vma: cleanup error handling path in vma_expand() Suren Baghdasaryan
2026-03-22 5:43 ` [PATCH v4 2/4] mm: replace vma_start_write() with vma_start_write_killable() Suren Baghdasaryan
@ 2026-03-22 5:43 ` Suren Baghdasaryan
2026-03-23 4:41 ` Suren Baghdasaryan
2026-03-23 17:53 ` Lorenzo Stoakes (Oracle)
2026-03-22 5:43 ` [PATCH v4 4/4] mm: use vma_start_write_killable() in process_vma_walk_lock() Suren Baghdasaryan
2026-03-22 16:17 ` [PATCH v4 0/4] Use killable vma write locking in most places Andrew Morton
4 siblings, 2 replies; 16+ messages in thread
From: Suren Baghdasaryan @ 2026-03-22 5:43 UTC (permalink / raw)
To: akpm
Cc: willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390, surenb
If the kvm_s390_handle_pv() call fails, its error code gets recorded but
execution proceeds as if the call was successful. If the subsequent
copy_to_user() call fails, the original error is overwritten.
The follow-up patch adds fatal signal checks during the VMA walk, which
makes it possible for kvm_s390_handle_pv() to return an -EINTR error.
Without this fix any error, including -EINTR, can be overwritten and the
original error will be lost.
Change the error handling for kvm_s390_handle_pv() to bail out of the
normal flow once a failure happens. This is consistent with how
kvm_arch_vm_ioctl() handles errors for other ioctl commands.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
arch/s390/kvm/kvm-s390.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 3eb60aa932ec..ddad08c0926f 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2947,6 +2947,8 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
}
/* must be called without kvm->lock */
r = kvm_s390_handle_pv(kvm, &args);
+ if (r)
+ break;
if (copy_to_user(argp, &args, sizeof(args))) {
r = -EFAULT;
break;
--
2.53.0.1018.g2bb0e51243-goog
* [PATCH v4 4/4] mm: use vma_start_write_killable() in process_vma_walk_lock()
2026-03-22 5:43 [PATCH v4 0/4] Use killable vma write locking in most places Suren Baghdasaryan
` (2 preceding siblings ...)
2026-03-22 5:43 ` [PATCH v4 3/4] KVM: s390: avoid kvm_s390_handle_pv() error overwrite Suren Baghdasaryan
@ 2026-03-22 5:43 ` Suren Baghdasaryan
2026-03-23 18:04 ` Lorenzo Stoakes (Oracle)
2026-03-22 16:17 ` [PATCH v4 0/4] Use killable vma write locking in most places Andrew Morton
4 siblings, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2026-03-22 5:43 UTC (permalink / raw)
To: akpm
Cc: willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390, surenb
Replace vma_start_write() with vma_start_write_killable() when
process_vma_walk_lock() is used with the PGWALK_WRLOCK option.
Adjust its direct and indirect users to check for a possible error
and handle it. Ensure users handle -EINTR correctly and do not
ignore it.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
fs/proc/task_mmu.c | 5 ++++-
mm/mempolicy.c | 1 +
mm/pagewalk.c | 20 ++++++++++++++------
3 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e091931d7ca1..2fe3d11aad03 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1797,6 +1797,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
struct clear_refs_private cp = {
.type = type,
};
+ int err;
if (mmap_write_lock_killable(mm)) {
count = -EINTR;
@@ -1824,7 +1825,9 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
0, mm, 0, -1UL);
mmu_notifier_invalidate_range_start(&range);
}
- walk_page_range(mm, 0, -1, &clear_refs_walk_ops, &cp);
+ err = walk_page_range(mm, 0, -1, &clear_refs_walk_ops, &cp);
+ if (err)
+ count = err;
if (type == CLEAR_REFS_SOFT_DIRTY) {
mmu_notifier_invalidate_range_end(&range);
flush_tlb_mm(mm);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 929e843543cf..bb5b0e83ce0f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -969,6 +969,7 @@ static const struct mm_walk_ops queue_pages_lock_vma_walk_ops = {
* (a hugetlbfs page or a transparent huge page being counted as 1).
* -EIO - a misplaced page found, when MPOL_MF_STRICT specified without MOVEs.
* -EFAULT - a hole in the memory range, when MPOL_MF_DISCONTIG_OK unspecified.
* -EINTR - the walk was terminated due to a pending fatal signal.
*/
static long
queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index eda74273c8ec..a42cd6a6d812 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -438,14 +438,13 @@ static inline void process_mm_walk_lock(struct mm_struct *mm,
mmap_assert_write_locked(mm);
}
-static inline void process_vma_walk_lock(struct vm_area_struct *vma,
+static inline int process_vma_walk_lock(struct vm_area_struct *vma,
enum page_walk_lock walk_lock)
{
#ifdef CONFIG_PER_VMA_LOCK
switch (walk_lock) {
case PGWALK_WRLOCK:
- vma_start_write(vma);
- break;
+ return vma_start_write_killable(vma);
case PGWALK_WRLOCK_VERIFY:
vma_assert_write_locked(vma);
break;
@@ -457,6 +456,7 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
break;
}
#endif
+ return 0;
}
/*
@@ -500,7 +500,9 @@ int walk_page_range_mm_unsafe(struct mm_struct *mm, unsigned long start,
if (ops->pte_hole)
err = ops->pte_hole(start, next, -1, &walk);
} else { /* inside vma */
- process_vma_walk_lock(vma, ops->walk_lock);
+ err = process_vma_walk_lock(vma, ops->walk_lock);
+ if (err)
+ break;
walk.vma = vma;
next = min(end, vma->vm_end);
vma = find_vma(mm, vma->vm_end);
@@ -717,6 +719,7 @@ int walk_page_range_vma_unsafe(struct vm_area_struct *vma, unsigned long start,
.vma = vma,
.private = private,
};
+ int err;
if (start >= end || !walk.mm)
return -EINVAL;
@@ -724,7 +727,9 @@ int walk_page_range_vma_unsafe(struct vm_area_struct *vma, unsigned long start,
return -EINVAL;
process_mm_walk_lock(walk.mm, ops->walk_lock);
- process_vma_walk_lock(vma, ops->walk_lock);
+ err = process_vma_walk_lock(vma, ops->walk_lock);
+ if (err)
+ return err;
return __walk_page_range(start, end, &walk);
}
@@ -747,6 +752,7 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
.vma = vma,
.private = private,
};
+ int err;
if (!walk.mm)
return -EINVAL;
@@ -754,7 +760,9 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
return -EINVAL;
process_mm_walk_lock(walk.mm, ops->walk_lock);
- process_vma_walk_lock(vma, ops->walk_lock);
+ err = process_vma_walk_lock(vma, ops->walk_lock);
+ if (err)
+ return err;
return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
}
--
2.53.0.1018.g2bb0e51243-goog
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v4 1/4] mm/vma: cleanup error handling path in vma_expand()
2026-03-22 5:43 ` [PATCH v4 1/4] mm/vma: cleanup error handling path in vma_expand() Suren Baghdasaryan
@ 2026-03-22 7:49 ` Barry Song
0 siblings, 0 replies; 16+ messages in thread
From: Barry Song @ 2026-03-22 7:49 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390
On Sun, Mar 22, 2026 at 1:43 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> vma_expand() error handling is a bit confusing with "if (ret) return ret;"
> mixed with "if (!ret && ...) ret = ...;". Simplify the code to check
> for errors and return immediately after an operation that might fail.
> This also makes later changes to this function more readable.
> Change variable name for storing the error code from "ret" to "err".
>
> No functional change intended.
>
> Suggested-by: Jann Horn <jannh@google.com>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
LGTM,
Reviewed-by: Barry Song <baohua@kernel.org>
* Re: [PATCH v4 0/4] Use killable vma write locking in most places
2026-03-22 5:43 [PATCH v4 0/4] Use killable vma write locking in most places Suren Baghdasaryan
` (3 preceding siblings ...)
2026-03-22 5:43 ` [PATCH v4 4/4] mm: use vma_start_write_killable() in process_vma_walk_lock() Suren Baghdasaryan
@ 2026-03-22 16:17 ` Andrew Morton
2026-03-23 4:29 ` Suren Baghdasaryan
4 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2026-03-22 16:17 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390
On Sat, 21 Mar 2026 22:43:04 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> Now that we have vma_start_write_killable() we can replace most of the
> vma_start_write() calls with it, improving reaction time to the kill
> signal.
Thanks. The Sashiko review raised a few possible issues:
https://sashiko.dev/#/patchset/20260322054309.898214-1-surenb@google.com
* Re: [PATCH v4 0/4] Use killable vma write locking in most places
2026-03-22 16:17 ` [PATCH v4 0/4] Use killable vma write locking in most places Andrew Morton
@ 2026-03-23 4:29 ` Suren Baghdasaryan
2026-03-24 15:59 ` Suren Baghdasaryan
0 siblings, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2026-03-23 4:29 UTC (permalink / raw)
To: Andrew Morton
Cc: willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390
On Sun, Mar 22, 2026 at 9:17 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Sat, 21 Mar 2026 22:43:04 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > Now that we have vma_start_write_killable() we can replace most of the
> > vma_start_write() calls with it, improving reaction time to the kill
> > signal.
>
> Thanks. Sashiko review raised a few possible issues:
> https://sashiko.dev/#/patchset/20260322054309.898214-1-surenb@google.com
Thanks! This Sashiko dude is good :)
* Re: [PATCH v4 3/4] KVM: s390: avoid kvm_s390_handle_pv() error overwrite
2026-03-22 5:43 ` [PATCH v4 3/4] KVM: s390: avoid kvm_s390_handle_pv() error overwrite Suren Baghdasaryan
@ 2026-03-23 4:41 ` Suren Baghdasaryan
2026-03-23 17:53 ` Lorenzo Stoakes (Oracle)
1 sibling, 0 replies; 16+ messages in thread
From: Suren Baghdasaryan @ 2026-03-23 4:41 UTC (permalink / raw)
To: akpm
Cc: willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390
On Sat, Mar 21, 2026 at 10:43 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> If the kvm_s390_handle_pv() call fails, its error code gets recorded but
> execution proceeds as if the call was successful. If the next call to
> copy_to_user() fails, the original error is overwritten.
> The follow-up patch adds fatal signal checks during the VMA walk, which
> makes it possible for kvm_s390_handle_pv() to return an EINTR error.
> Without this fix any error, including EINTR, can be overwritten and the
> original error lost.
>
> Change the error handling for kvm_s390_handle_pv() to break out of the
> normal flow once a failure happens. This is consistent with how
> kvm_arch_vm_ioctl() handles errors for other ioctl commands.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> arch/s390/kvm/kvm-s390.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 3eb60aa932ec..ddad08c0926f 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -2947,6 +2947,8 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> }
> /* must be called without kvm->lock */
> r = kvm_s390_handle_pv(kvm, &args);
> + if (r)
> + break;
At [1] Sashiko says:
"Does skipping the copy_to_user() call here prevent returning Ultravisor
error codes (rc and rrc) to userspace when kvm_s390_handle_pv() fails?
When an Ultravisor command fails inside kvm_s390_handle_pv(), it populates
args.rc and args.rrc with hardware failure codes and returns a negative
error code (e.g. -EINVAL).
Before this patch, execution unconditionally continued to copy_to_user(),
allowing these diagnostic codes to reach userspace even if the ioctl
ultimately returned an error.
By breaking early on error, this skips copy_to_user() entirely and silently
drops the updated rc and rrc values, which might break userspace's ability
to handle and diagnose hardware Secure Execution failures.
If the goal is to prevent overwriting the original error with -EFAULT,
could we perform the copy unconditionally and only update 'r' if it was
previously 0?"
[1] https://sashiko.dev/#/patchset/20260322054309.898214-1-surenb@google.com
This might be the reason why copy_to_user() in the original code is
called even when kvm_s390_handle_pv() fails. Then I guess it would not
be a problem if copy_to_user() later fails and overwrites EINTR with
EFAULT. We could avoid that overwrite by checking whether r already
holds an error code, but that would change the current behavior and
possibly userspace expectations. I'm more inclined to simply drop this
patch and let errors, including EINTR, be handled as they are today.
If anyone has objections please let me know.
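Sashiko's alternative — performing the copy unconditionally but only letting EFAULT through when no earlier error was recorded — can be sketched in a self-contained form. Note that handle_pv(), copy_out() and the literal errno values below are stand-ins for illustration, not the real KVM helpers:

```c
/* Stand-ins for kvm_s390_handle_pv() and copy_to_user() (hypothetical). */
static int handle_pv(int fail) { return fail ? -22 /* -EINVAL */ : 0; }
static int copy_out(int fail)  { return fail ? 1 : 0; } /* nonzero: bytes not copied */

/*
 * Copy unconditionally so diagnostic fields (rc/rrc) still reach
 * userspace, but never let a copy failure overwrite an earlier,
 * more specific error code.
 */
static int ioctl_sketch(int pv_fails, int copy_fails)
{
	int r = handle_pv(pv_fails);

	if (copy_out(copy_fails) && !r)
		r = -14; /* -EFAULT, only when no prior error */
	return r;
}
```

With this shape, a failing kvm_s390_handle_pv() keeps its error code even if the subsequent copy also fails, while a copy failure after a successful call still surfaces as EFAULT.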
> if (copy_to_user(argp, &args, sizeof(args))) {
> r = -EFAULT;
> break;
> --
> 2.53.0.1018.g2bb0e51243-goog
>
* Re: [PATCH v4 2/4] mm: replace vma_start_write() with vma_start_write_killable()
2026-03-22 5:43 ` [PATCH v4 2/4] mm: replace vma_start_write() with vma_start_write_killable() Suren Baghdasaryan
@ 2026-03-23 17:49 ` Lorenzo Stoakes (Oracle)
2026-03-23 19:22 ` Suren Baghdasaryan
0 siblings, 1 reply; 16+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-23 17:49 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390, Ritesh Harjani (IBM)
On Sat, Mar 21, 2026 at 10:43:06PM -0700, Suren Baghdasaryan wrote:
> Now that we have vma_start_write_killable() we can replace most of the
> vma_start_write() calls with it, improving reaction time to the kill
> signal.
>
> There are several places which are left untouched by this patch:
>
> 1. free_pgtables(), because the function should free page tables even if a
> fatal signal is pending.
>
> 2. process_vma_walk_lock(), which requires changes in its callers and
> will be handled in the next patch.
>
> 3. userfaultfd code, where some paths calling vma_start_write() can
> handle EINTR and some can't without a deeper code refactoring.
>
> 4. mpol_rebind_mm() which is used by cpuset controller for migrations
> and operates on a remote mm. Incomplete operations here would result
> in an inconsistent cgroup state.
>
> 5. vm_flags_{set|mod|clear} require refactoring that involves moving
> vma_start_write() out of these functions and replacing it with
> vma_assert_write_locked(); callers of these functions should then
> lock the vma themselves using vma_start_write_killable() whenever
> possible.
>
> In a number of places we now lock VMA earlier than before to avoid
> doing work and undoing it later if a fatal signal is pending. This
> is safe because the moves are happening within sections where we
> already hold the mmap_write_lock, so the moves do not change the
> locking order relative to other kernel locks.
>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> # powerpc
This really, really needs splitting up into separate patches for the
various bits you change. I know it might seem pedantic, but it's much
harder to review this as one big patch, and it hurts bisectability and the
specificity of fixes etc.
Come, embrace the stats :)
You are welcome to apply a 'Reviewed-by: Lorenzo Stoakes (Oracle)
<ljs@kernel.org>' tag to all of the split-out patches where I have said
LGTM throughout!
> ---
> arch/powerpc/kvm/book3s_hv_uvmem.c | 5 +-
> mm/khugepaged.c | 5 +-
> mm/madvise.c | 4 +-
> mm/memory.c | 2 +
> mm/mempolicy.c | 12 ++-
> mm/mlock.c | 28 +++++--
> mm/mprotect.c | 5 +-
> mm/mremap.c | 4 +-
> mm/vma.c | 117 +++++++++++++++++++++--------
> mm/vma_exec.c | 6 +-
> 10 files changed, 142 insertions(+), 46 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 5fbb95d90e99..0a28b48a46b8 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -410,7 +410,10 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
> ret = H_STATE;
> break;
> }
> - vma_start_write(vma);
> + if (vma_start_write_killable(vma)) {
> + ret = H_STATE;
> + break;
> + }
LGTM
> /* Copy vm_flags to avoid partial modifications in ksm_madvise */
> vm_flags = vma->vm_flags;
> ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 4b0e59c7c0e6..e2f263076084 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1159,7 +1159,10 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> - vma_start_write(vma);
> + if (vma_start_write_killable(vma)) {
> + result = SCAN_FAIL;
> + goto out_up_write;
> + }
LGTM
> result = check_pmd_still_valid(mm, address, pmd);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 69708e953cf5..feaa16b0e1dc 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -175,7 +175,9 @@ static int madvise_update_vma(vm_flags_t new_flags,
> madv_behavior->vma = vma;
>
> /* vm_flags is protected by the mmap_lock held in write mode. */
> - vma_start_write(vma);
> + if (vma_start_write_killable(vma))
> + return -EINTR;
> +
LGTM
> vma->flags = new_vma_flags;
> if (set_new_anon_name)
> return replace_anon_vma_name(vma, anon_name);
> diff --git a/mm/memory.c b/mm/memory.c
> index 68cc592ff0ba..b930459e32ec 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -366,6 +366,8 @@ void free_pgd_range(struct mmu_gather *tlb,
> * page tables that should be removed. This can differ from the vma mappings on
> * some archs that may have mappings that need to be removed outside the vmas.
> * Note that the prev->vm_end and next->vm_start are often used.
> + * We don't use vma_start_write_killable() because page tables should be freed
> + * even if the task is being killed.
Nice.
> *
> * The vma_end differs from the pg_end when a dup_mmap() failed and the tree has
> * unrelated data to the mm_struct being torn down.
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index e5528c35bbb8..929e843543cf 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1784,7 +1784,8 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
> return -EINVAL;
> if (end == start)
> return 0;
> - mmap_write_lock(mm);
> + if (mmap_write_lock_killable(mm))
> + return -EINTR;
> prev = vma_prev(&vmi);
> for_each_vma_range(vmi, vma, end) {
> /*
> @@ -1801,13 +1802,20 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
> err = -EOPNOTSUPP;
> break;
> }
> + /*
> + * Lock the VMA early to avoid extra work if fatal signal
> + * is pending.
> + */
> + if (vma_start_write_killable(vma)) {
> + err = -EINTR;
> + break;
> + }
LGTM, one nitty thing - wonder if we shouldn't pass through the error from
vma_start_write_killable()? OTOH, it's not really a big deal. I don't
foresee us ever returning anything but -EINTR or 0 :)
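The nit above — propagating whatever vma_start_write_killable() returns instead of hard-coding -EINTR — is a one-liner; a self-contained sketch, with a stand-in for the real helper:

```c
/* Stand-in for vma_start_write_killable(): 0 on success, negative errno. */
static int vma_start_write_killable_stub(int pending_err)
{
	return pending_err;
}

/*
 * Propagate the helper's own return value rather than a literal -EINTR,
 * so any future error codes from the lock helper flow through unchanged.
 */
static int lock_vma_sketch(int pending_err)
{
	int err = vma_start_write_killable_stub(pending_err);

	if (err)
		return err;
	return 0;
}
```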
> new = mpol_dup(old);
> if (IS_ERR(new)) {
> err = PTR_ERR(new);
> break;
> }
>
> - vma_start_write(vma);
> new->home_node = home_node;
> err = mbind_range(&vmi, vma, &prev, start, end, new);
> mpol_put(new);
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 8c227fefa2df..efbb9c783f25 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -419,8 +419,10 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
> *
> * Called for mlock(), mlock2() and mlockall(), to set @vma VM_LOCKED;
> * called for munlock() and munlockall(), to clear VM_LOCKED from @vma.
> + *
> + * Return: 0 on success, -EINTR if fatal signal is pending.
> */
> -static void mlock_vma_pages_range(struct vm_area_struct *vma,
> +static int mlock_vma_pages_range(struct vm_area_struct *vma,
> unsigned long start, unsigned long end,
> vma_flags_t *new_vma_flags)
> {
> @@ -442,7 +444,9 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
> */
> if (vma_flags_test(new_vma_flags, VMA_LOCKED_BIT))
> vma_flags_set(new_vma_flags, VMA_IO_BIT);
> - vma_start_write(vma);
> + if (vma_start_write_killable(vma))
> + return -EINTR;
> +
LGTM
> vma_flags_reset_once(vma, new_vma_flags);
>
> lru_add_drain();
> @@ -453,6 +457,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
> vma_flags_clear(new_vma_flags, VMA_IO_BIT);
> vma_flags_reset_once(vma, new_vma_flags);
> }
> + return 0;
> }
>
> /*
> @@ -506,11 +511,13 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> */
> if (vma_flags_test(&new_vma_flags, VMA_LOCKED_BIT) &&
> vma_flags_test(&old_vma_flags, VMA_LOCKED_BIT)) {
> + ret = vma_start_write_killable(vma);
> + if (ret)
> + goto out;
LGTM
> /* No work to do, and mlocking twice would be wrong */
> - vma_start_write(vma);
> vma->flags = new_vma_flags;
> } else {
> - mlock_vma_pages_range(vma, start, end, &new_vma_flags);
> + ret = mlock_vma_pages_range(vma, start, end, &new_vma_flags);
LGTM
> }
> out:
> *prev = vma;
> @@ -739,9 +746,18 @@ static int apply_mlockall_flags(int flags)
>
> error = mlock_fixup(&vmi, vma, &prev, vma->vm_start, vma->vm_end,
> newflags);
> - /* Ignore errors, but prev needs fixing up. */
> - if (error)
> + if (error) {
> + /*
> + * If we failed due to a pending fatal signal, return
> + * now. If we locked the vma before signal arrived, it
> + * will be unlocked when we drop mmap_write_lock.
> + */
> + if (fatal_signal_pending(current))
> + return -EINTR;
LGTM, and thanks for the careful explanation, which hopefully addresses
Matthew's concerns ([0]).
[0]:https://lore.kernel.org/all/aaiBX5Mm36Kg0wq1@casper.infradead.org/
> +
> + /* Ignore errors, but prev needs fixing up. */
> prev = vma;
> + }
> cond_resched();
> }
> out:
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 110d47a36d4b..ae6ed882b600 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -768,7 +768,10 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
> * vm_flags and vm_page_prot are protected by the mmap_lock
> * held in write mode.
> */
> - vma_start_write(vma);
> + error = vma_start_write_killable(vma);
> + if (error)
> + goto fail;
> +
LGTM
> vma_flags_reset_once(vma, &new_vma_flags);
> if (vma_wants_manual_pte_write_upgrade(vma))
> mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
> diff --git a/mm/mremap.c b/mm/mremap.c
> index e9c8b1d05832..dec39ec314f9 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -1356,7 +1356,9 @@ static unsigned long move_vma(struct vma_remap_struct *vrm)
> return -ENOMEM;
>
> /* We don't want racing faults. */
> - vma_start_write(vrm->vma);
> + err = vma_start_write_killable(vrm->vma);
> + if (err)
> + return err;
Yeah, this is subtle. I saw Sashiko flagged it, so I checked, and this
isn't being accounted for properly.
vrm_calc_charge() charges via
security_vm_enough_memory_mm()->__vm_enough_memory()->vm_acct_memory()
And we only uncharge via vrm_uncharge()->vm_unacct_memory()
The other error function there is:
err = copy_vma_and_data(vrm, &new_vma);
...
if (err && !new_vma)
return err;
And in copy_vma_and_data():
if (!new_vma) {
vrm_uncharge(vrm);
*new_vma_ptr = NULL;
return -ENOMEM;
}
So that's already taken care of for us.
It's stuff like this that is why it's important to separate this patch into
separate patches :)
We also do some weird and wonderful stuff with error handling where we try
to back out the remapped page tables, having swapped the new and old VMAs
around so the unmapping of the source VMA is done on the destination (yeah
it's horrible).
So this is all very subtle really.
Honestly you might just be better off moving the vma_start_write_killable()
above vrm_calc_charge()?
If not, you'd just need to replace this with:
err = vma_start_write_killable(vrm->vma);
if (err) {
vrm_uncharge(vrm);
return err;
}
>
> /* Perform copy step. */
> err = copy_vma_and_data(vrm, &new_vma);
> diff --git a/mm/vma.c b/mm/vma.c
> index ba78ab1f397a..7930a4270eb9 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -524,6 +524,17 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT);
> }
>
> + /*
> + * Lock VMAs before cloning to avoid extra work if fatal signal
> + * is pending.
> + */
> + err = vma_start_write_killable(vma);
> + if (err)
> + goto out_free_vma;
> + err = vma_start_write_killable(new);
> + if (err)
> + goto out_free_vma;
> +
LGTM
> err = -ENOMEM;
> vma_iter_config(vmi, new->vm_start, new->vm_end);
> if (vma_iter_prealloc(vmi, new))
> @@ -543,9 +554,6 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> if (new->vm_ops && new->vm_ops->open)
> new->vm_ops->open(new);
>
> - vma_start_write(vma);
> - vma_start_write(new);
> -
LGTM
> init_vma_prep(&vp, vma);
> vp.insert = new;
> vma_prepare(&vp);
> @@ -900,12 +908,16 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
> }
>
> /* No matter what happens, we will be adjusting middle. */
> - vma_start_write(middle);
> + err = vma_start_write_killable(middle);
> + if (err)
> + goto abort;
I mean looking at this again the vmg_oom() etc. stuff is horrible and I
definitely need to rework it. Sorry about that!
I think we need to update vma_modify():
/* First, try to merge. */
merged = vma_merge_existing_range(vmg);
if (merged)
return merged;
if (vmg_nomem(vmg))
return ERR_PTR(-ENOMEM);
+ if (fatal_signal_pending(current))
+ return ERR_PTR(-EINTR);
So we can just get out early in this case. I _think_ that should be safe?
:)
If not, we could hack things here by doing:
if (err) {
/* Ensure error propagated. */
vmg->give_up_on_oom = false;
goto abort;
}
And I can just go fix this all up afterwards.
I still don't want a vmg_intr() or anything though, I will rework this into
something saner!
Anyway apologies, it's entirely my fault that this is a bit of a mess :)
>
> if (merge_right) {
> vma_flags_t next_sticky;
>
> - vma_start_write(next);
> + err = vma_start_write_killable(next);
> + if (err)
> + goto abort;
Obv not withstanding above LGTM
> vmg->target = next;
> next_sticky = vma_flags_and_mask(&next->flags, VMA_STICKY_FLAGS);
> vma_flags_set_mask(&sticky_flags, next_sticky);
> @@ -914,7 +926,9 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
> if (merge_left) {
> vma_flags_t prev_sticky;
>
> - vma_start_write(prev);
> + err = vma_start_write_killable(prev);
> + if (err)
> + goto abort;
Ditto
> vmg->target = prev;
>
> prev_sticky = vma_flags_and_mask(&prev->flags, VMA_STICKY_FLAGS);
> @@ -1170,10 +1184,12 @@ int vma_expand(struct vma_merge_struct *vmg)
> vma_flags_t sticky_flags =
> vma_flags_and_mask(&vmg->vma_flags, VMA_STICKY_FLAGS);
> vma_flags_t target_sticky;
> - int err = 0;
> + int err;
>
> mmap_assert_write_locked(vmg->mm);
> - vma_start_write(target);
> + err = vma_start_write_killable(target);
> + if (err)
> + return err;
Yeah dup_anon stuff will not be done yet so this LGTM
>
> target_sticky = vma_flags_and_mask(&target->flags, VMA_STICKY_FLAGS);
>
> @@ -1201,6 +1217,13 @@ int vma_expand(struct vma_merge_struct *vmg)
> * we don't need to account for vmg->give_up_on_mm here.
> */
> if (remove_next) {
> + /*
> + * Lock the VMA early to avoid extra work if fatal signal
> + * is pending.
> + */
> + err = vma_start_write_killable(next);
> + if (err)
> + return err;
Same here, so LGTM
> err = dup_anon_vma(target, next, &anon_dup);
> if (err)
> return err;
> @@ -1214,7 +1237,6 @@ int vma_expand(struct vma_merge_struct *vmg)
> if (remove_next) {
> vma_flags_t next_sticky;
>
> - vma_start_write(next);
LGTM
> vmg->__remove_next = true;
>
> next_sticky = vma_flags_and_mask(&next->flags, VMA_STICKY_FLAGS);
> @@ -1252,9 +1274,14 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> unsigned long start, unsigned long end, pgoff_t pgoff)
> {
> struct vma_prepare vp;
> + int err;
>
> WARN_ON((vma->vm_start != start) && (vma->vm_end != end));
>
> + err = vma_start_write_killable(vma);
> + if (err)
> + return err;
> +
> if (vma->vm_start < start)
> vma_iter_config(vmi, vma->vm_start, start);
> else
> @@ -1263,8 +1290,6 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> if (vma_iter_prealloc(vmi, NULL))
> return -ENOMEM;
>
> - vma_start_write(vma);
> -
We will hold the VMA write lock even though we return -ENOMEM, but I guess
not a big deal as we will drop the mmap write lock on error at the earliest
opportunity anyway.
So LGTM
> init_vma_prep(&vp, vma);
> vma_prepare(&vp);
> vma_adjust_trans_huge(vma, start, end, NULL);
> @@ -1453,7 +1478,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> if (error)
> goto end_split_failed;
> }
> - vma_start_write(next);
> + error = vma_start_write_killable(next);
> + if (error)
> + goto munmap_gather_failed;
LGTM
> mas_set(mas_detach, vms->vma_count++);
> error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
> if (error)
> @@ -1848,12 +1875,16 @@ static void vma_link_file(struct vm_area_struct *vma, bool hold_rmap_lock)
> static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
> {
> VMA_ITERATOR(vmi, mm, 0);
> + int err;
> +
> + err = vma_start_write_killable(vma);
> + if (err)
> + return err;
LGTM
>
> vma_iter_config(&vmi, vma->vm_start, vma->vm_end);
> if (vma_iter_prealloc(&vmi, vma))
> return -ENOMEM;
>
> - vma_start_write(vma);
LGTM
> vma_iter_store_new(&vmi, vma);
> vma_link_file(vma, /* hold_rmap_lock= */false);
> mm->map_count++;
> @@ -2239,9 +2270,8 @@ int mm_take_all_locks(struct mm_struct *mm)
> * is reached.
> */
> for_each_vma(vmi, vma) {
> - if (signal_pending(current))
> + if (signal_pending(current) || vma_start_write_killable(vma))
Hmm, surely signal_pending() should catch it :) but I suppose there could
be a (very, very tight) timing window here. There's perhaps a case to be
made for not converting this?
OTOH it doesn't really make much difference I guess.
> goto out_unlock;
> - vma_start_write(vma);
> }
>
> vma_iter_init(&vmi, mm, 0);
> @@ -2540,8 +2570,8 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
> struct mmap_action *action)
> {
> struct vma_iterator *vmi = map->vmi;
> - int error = 0;
> struct vm_area_struct *vma;
> + int error;
>
> /*
> * Determine the object being mapped and call the appropriate
> @@ -2552,6 +2582,14 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
> if (!vma)
> return -ENOMEM;
>
> + /*
> + * Lock the VMA early to avoid extra work if fatal signal
> + * is pending.
> + */
> + error = vma_start_write_killable(vma);
> + if (error)
> + goto free_vma;
> +
LGTM
> vma_iter_config(vmi, map->addr, map->end);
> vma_set_range(vma, map->addr, map->end, map->pgoff);
> vma->flags = map->vma_flags;
> @@ -2582,8 +2620,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
> WARN_ON_ONCE(!arch_validate_flags(map->vm_flags));
> #endif
>
> - /* Lock the VMA since it is modified after insertion into VMA tree */
> - vma_start_write(vma);
LGTM
> vma_iter_store_new(vmi, vma);
> map->mm->map_count++;
> vma_link_file(vma, action->hide_from_rmap_until_complete);
> @@ -2878,6 +2914,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> unsigned long addr, unsigned long len, vma_flags_t vma_flags)
> {
> struct mm_struct *mm = current->mm;
> + int err;
LGTM
>
> /*
> * Check against address space limits by the changed size
> @@ -2910,24 +2947,33 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
>
> if (vma_merge_new_range(&vmg))
> goto out;
> - else if (vmg_nomem(&vmg))
> + if (vmg_nomem(&vmg)) {
> + err = -ENOMEM;
> goto unacct_fail;
> + }
LGTM
> }
>
> if (vma)
> vma_iter_next_range(vmi);
> /* create a vma struct for an anonymous mapping */
> vma = vm_area_alloc(mm);
> - if (!vma)
> + if (!vma) {
> + err = -ENOMEM;
> goto unacct_fail;
> + }
LGTM
>
> vma_set_anonymous(vma);
> vma_set_range(vma, addr, addr + len, addr >> PAGE_SHIFT);
> vma->flags = vma_flags;
> vma->vm_page_prot = vm_get_page_prot(vma_flags_to_legacy(vma_flags));
> - vma_start_write(vma);
> - if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
> + if (vma_start_write_killable(vma)) {
> + err = -EINTR;
> + goto vma_lock_fail;
> + }
LGTM
> + if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL)) {
> + err = -ENOMEM;
> goto mas_store_fail;
> + }
LGTM
>
> mm->map_count++;
> validate_mm(mm);
> @@ -2942,10 +2988,11 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> return 0;
>
> mas_store_fail:
> +vma_lock_fail:
> vm_area_free(vma);
> unacct_fail:
> vm_unacct_memory(len >> PAGE_SHIFT);
> - return -ENOMEM;
> + return err;
LGTM
> }
>
> /**
> @@ -3112,8 +3159,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> struct mm_struct *mm = vma->vm_mm;
> struct vm_area_struct *next;
> unsigned long gap_addr;
> - int error = 0;
> VMA_ITERATOR(vmi, mm, vma->vm_start);
> + int error;
LGTM
>
> if (!vma_test(vma, VMA_GROWSUP_BIT))
> return -EFAULT;
> @@ -3149,12 +3196,14 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>
> /* We must make sure the anon_vma is allocated. */
> if (unlikely(anon_vma_prepare(vma))) {
> - vma_iter_free(&vmi);
> - return -ENOMEM;
> + error = -ENOMEM;
> + goto vma_prep_fail;
Hm, but before this we weren't calling validate_mm() here, and now we will
be; I suspect we probably don't want to do that on error. I'll comment on
it below in that bit of the code.
Otherwise, LGTM.
> }
>
> /* Lock the VMA before expanding to prevent concurrent page faults */
> - vma_start_write(vma);
> + error = vma_start_write_killable(vma);
> + if (error)
> + goto vma_lock_fail;
LGTM
> /* We update the anon VMA tree. */
> anon_vma_lock_write(vma->anon_vma);
>
> @@ -3183,6 +3232,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> }
> }
> anon_vma_unlock_write(vma->anon_vma);
> +vma_lock_fail:
> +vma_prep_fail:
LGTM
> vma_iter_free(&vmi);
> validate_mm(mm);
Maybe this should be put before the labels?
Then again, I guess there's no harm in it. But I'm not sure taking a long
time to check the entire maple tree when CONFIG_DEBUG_VM_MAPLE_TREE is
enabled is a good idea with a fatal signal pending?
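The suggestion — running validate_mm() only on the success path so that the error labels skip the (potentially slow) maple-tree check — might look roughly like this. The helpers below are stubbed stand-ins for the kernel functions, purely to illustrate the control flow:

```c
static int validate_calls; /* counts validate_mm() invocations */

/* Stand-ins for the real kernel helpers (hypothetical). */
static void vma_iter_free_stub(void) { }
static void validate_mm_stub(void)   { validate_calls++; }

/*
 * Sketch: the success path frees the iterator and validates the tree;
 * the failure label only frees the iterator, skipping validation when
 * a fatal signal (or other error) aborted the expansion.
 */
static int expand_sketch(int lock_error)
{
	int error;

	if (lock_error) {
		error = lock_error; /* e.g. -EINTR from the killable lock */
		goto vma_lock_fail;
	}

	vma_iter_free_stub();
	validate_mm_stub(); /* success path only */
	return 0;

vma_lock_fail:
	vma_iter_free_stub();
	return error;
}
```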
> return error;
> @@ -3197,8 +3248,8 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> {
> struct mm_struct *mm = vma->vm_mm;
> struct vm_area_struct *prev;
> - int error = 0;
> VMA_ITERATOR(vmi, mm, vma->vm_start);
> + int error;
LGTM
>
> if (!vma_test(vma, VMA_GROWSDOWN_BIT))
> return -EFAULT;
> @@ -3228,12 +3279,14 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
>
> /* We must make sure the anon_vma is allocated. */
> if (unlikely(anon_vma_prepare(vma))) {
> - vma_iter_free(&vmi);
> - return -ENOMEM;
> + error = -ENOMEM;
> + goto vma_prep_fail;
> }
>
> /* Lock the VMA before expanding to prevent concurrent page faults */
> - vma_start_write(vma);
> + error = vma_start_write_killable(vma);
> + if (error)
> + goto vma_lock_fail;
> /* We update the anon VMA tree. */
> anon_vma_lock_write(vma->anon_vma);
>
> @@ -3263,6 +3316,8 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> }
> }
> anon_vma_unlock_write(vma->anon_vma);
> +vma_lock_fail:
> +vma_prep_fail:
Obv same comments for expand_upwards() apply here also.
> vma_iter_free(&vmi);
> validate_mm(mm);
> return error;
> diff --git a/mm/vma_exec.c b/mm/vma_exec.c
> index 5cee8b7efa0f..8ddcc791d828 100644
> --- a/mm/vma_exec.c
> +++ b/mm/vma_exec.c
> @@ -41,6 +41,7 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
> struct vm_area_struct *next;
> struct mmu_gather tlb;
> PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length);
> + int err;
>
> BUG_ON(new_start > new_end);
>
> @@ -56,8 +57,9 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
> * cover the whole range: [new_start, old_end)
> */
> vmg.target = vma;
> - if (vma_expand(&vmg))
> - return -ENOMEM;
> + err = vma_expand(&vmg);
> + if (err)
> + return err;
LGTM
>
> /*
> * move the page tables downwards, on failure we rely on
> --
> 2.53.0.1018.g2bb0e51243-goog
>
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 3/4] KVM: s390: avoid kvm_s390_handle_pv() error overwrite
2026-03-22 5:43 ` [PATCH v4 3/4] KVM: s390: avoid kvm_s390_handle_pv() error overwrite Suren Baghdasaryan
2026-03-23 4:41 ` Suren Baghdasaryan
@ 2026-03-23 17:53 ` Lorenzo Stoakes (Oracle)
2026-03-23 18:48 ` Suren Baghdasaryan
1 sibling, 1 reply; 16+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-23 17:53 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390
On Sat, Mar 21, 2026 at 10:43:07PM -0700, Suren Baghdasaryan wrote:
> If the kvm_s390_handle_pv() call fails, its error code gets recorded but
> execution proceeds as if the call had succeeded. If the next call to
> copy_to_user() fails then the original error is overwritten.
Is that really a big deal though, as you're returning an error in either case?
> The follow-up patch adds fatal signal checks during the VMA walk, which
> makes it possible for kvm_s390_handle_pv() to return an EINTR error.
> Without this fix any error, including EINTR, can be overwritten and the
> original error will be lost.
>
> Change the error handling for kvm_s390_handle_pv() to leave the normal
> flow once a failure happens. This is consistent with how kvm_arch_vm_ioctl
> handles errors for other ioctl commands.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> arch/s390/kvm/kvm-s390.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 3eb60aa932ec..ddad08c0926f 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -2947,6 +2947,8 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> }
> /* must be called without kvm->lock */
> r = kvm_s390_handle_pv(kvm, &args);
> + if (r)
> + break;
> if (copy_to_user(argp, &args, sizeof(args))) {
Yeah as per Sashiko we probably need to copy_to_user() still.
But in that case, do we even need a change at all? I'm not sure it really
matters which error code terminates things, does it?
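To spell out the overwrite the commit message is describing, here's a
minimal userspace model of the two flows (the stub names are hypothetical
stand-ins for kvm_s390_handle_pv() and copy_to_user(), not the real code):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for kvm_s390_handle_pv(): 0 on success, -errno on failure. */
static int handle_pv_stub(bool fail) { return fail ? -EINTR : 0; }

/* Stand-in for copy_to_user(): returns bytes NOT copied, nonzero = failure. */
static unsigned long copy_stub(bool fail) { return fail ? 8 : 0; }

/* Current flow: a copy failure overwrites whatever the handler returned. */
static int ioctl_current(bool handle_fails, bool copy_fails)
{
	int r = handle_pv_stub(handle_fails);

	if (copy_stub(copy_fails))
		r = -EFAULT;
	return r;
}

/* Patched flow: bail out before the copy once the handler has failed,
 * so the handler's -EINTR survives. */
static int ioctl_patched(bool handle_fails, bool copy_fails)
{
	int r = handle_pv_stub(handle_fails);

	if (r)
		return r;
	if (copy_stub(copy_fails))
		r = -EFAULT;
	return r;
}
```

So the only behavioural difference is which errno the caller sees when both
steps fail, which is exactly why I'm questioning whether it matters.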
> r = -EFAULT;
> break;
> --
> 2.53.0.1018.g2bb0e51243-goog
>
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 4/4] mm: use vma_start_write_killable() in process_vma_walk_lock()
2026-03-22 5:43 ` [PATCH v4 4/4] mm: use vma_start_write_killable() in process_vma_walk_lock() Suren Baghdasaryan
@ 2026-03-23 18:04 ` Lorenzo Stoakes (Oracle)
2026-03-23 19:29 ` Suren Baghdasaryan
0 siblings, 1 reply; 16+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-23 18:04 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390
On Sat, Mar 21, 2026 at 10:43:08PM -0700, Suren Baghdasaryan wrote:
> Replace vma_start_write() with vma_start_write_killable() when
> process_vma_walk_lock() is used with the PGWALK_WRLOCK option.
> Adjust its direct and indirect users to check for a possible error
> and handle it. Ensure users handle EINTR correctly and do not ignore
> it.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> fs/proc/task_mmu.c | 5 ++++-
> mm/mempolicy.c | 1 +
> mm/pagewalk.c | 20 ++++++++++++++------
> 3 files changed, 19 insertions(+), 7 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index e091931d7ca1..2fe3d11aad03 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1797,6 +1797,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> struct clear_refs_private cp = {
> .type = type,
> };
> + int err;
Maybe better to make it a ssize_t given return type of function?
>
> if (mmap_write_lock_killable(mm)) {
> count = -EINTR;
> @@ -1824,7 +1825,9 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> 0, mm, 0, -1UL);
> mmu_notifier_invalidate_range_start(&range);
> }
> - walk_page_range(mm, 0, -1, &clear_refs_walk_ops, &cp);
> + err = walk_page_range(mm, 0, -1, &clear_refs_walk_ops, &cp);
> + if (err)
> + count = err;
Hmm this is gross, but it's an established pattern here, ugh.
Now we have an err though, could we update:
if (mmap_write_lock_killable(mm)) {
- count = -EINTR;
+ err = -EINTR;
goto out_mm;
}
Then we can just do:
+ err = walk_page_range(mm, 0, -1, &clear_refs_walk_ops, &cp);
And at the end do:
return err ?: count;
Which possibly _necessitates_ err being a ssize_t.
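In miniature, the suggested tail relies on the GNU "a ?: b" extension,
which evaluates to a when a is nonzero and to b otherwise (hypothetical
helper, just to show the semantics):

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/* Sketch of the proposed clear_refs_write() return: a nonzero walk
 * error takes precedence over the byte count, and making err a ssize_t
 * keeps both arms of the ?: the same width as the function's return. */
static ssize_t finish(ssize_t err, ssize_t count)
{
	return err ?: count;
}
```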
> if (type == CLEAR_REFS_SOFT_DIRTY) {
> mmu_notifier_invalidate_range_end(&range);
> flush_tlb_mm(mm);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 929e843543cf..bb5b0e83ce0f 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -969,6 +969,7 @@ static const struct mm_walk_ops queue_pages_lock_vma_walk_ops = {
> * (a hugetlbfs page or a transparent huge page being counted as 1).
> * -EIO - a misplaced page found, when MPOL_MF_STRICT specified without MOVEs.
> * -EFAULT - a hole in the memory range, when MPOL_MF_DISCONTIG_OK unspecified.
> + * -EINTR - walk got terminated due to pending fatal signal.
Thanks!
> */
> static long
> queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index eda74273c8ec..a42cd6a6d812 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -438,14 +438,13 @@ static inline void process_mm_walk_lock(struct mm_struct *mm,
> mmap_assert_write_locked(mm);
> }
>
> -static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> +static inline int process_vma_walk_lock(struct vm_area_struct *vma,
NIT: Don't need this to be an inline any longer. May as well fix up while we're
here.
> enum page_walk_lock walk_lock)
> {
> #ifdef CONFIG_PER_VMA_LOCK
> switch (walk_lock) {
> case PGWALK_WRLOCK:
> - vma_start_write(vma);
> - break;
> + return vma_start_write_killable(vma);
LGTM
> case PGWALK_WRLOCK_VERIFY:
> vma_assert_write_locked(vma);
> break;
> @@ -457,6 +456,7 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> break;
> }
> #endif
> + return 0;
> }
>
> /*
> @@ -500,7 +500,9 @@ int walk_page_range_mm_unsafe(struct mm_struct *mm, unsigned long start,
> if (ops->pte_hole)
> err = ops->pte_hole(start, next, -1, &walk);
> } else { /* inside vma */
> - process_vma_walk_lock(vma, ops->walk_lock);
> + err = process_vma_walk_lock(vma, ops->walk_lock);
> + if (err)
> + break;
In every other case we set walk.vma = vma or NULL. Is it a problem not setting
it at all in this code path?
> walk.vma = vma;
> next = min(end, vma->vm_end);
> vma = find_vma(mm, vma->vm_end);
> @@ -717,6 +719,7 @@ int walk_page_range_vma_unsafe(struct vm_area_struct *vma, unsigned long start,
> .vma = vma,
> .private = private,
> };
> + int err;
>
> if (start >= end || !walk.mm)
> return -EINVAL;
> @@ -724,7 +727,9 @@ int walk_page_range_vma_unsafe(struct vm_area_struct *vma, unsigned long start,
> return -EINVAL;
>
> process_mm_walk_lock(walk.mm, ops->walk_lock);
> - process_vma_walk_lock(vma, ops->walk_lock);
> + err = process_vma_walk_lock(vma, ops->walk_lock);
> + if (err)
> + return err;
LGTM
> return __walk_page_range(start, end, &walk);
> }
>
> @@ -747,6 +752,7 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
> .vma = vma,
> .private = private,
> };
> + int err;
>
> if (!walk.mm)
> return -EINVAL;
> @@ -754,7 +760,9 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
> return -EINVAL;
>
> process_mm_walk_lock(walk.mm, ops->walk_lock);
> - process_vma_walk_lock(vma, ops->walk_lock);
> + err = process_vma_walk_lock(vma, ops->walk_lock);
> + if (err)
> + return err;
LGTM
> return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
> }
>
> --
> 2.53.0.1018.g2bb0e51243-goog
>
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 3/4] KVM: s390: avoid kvm_s390_handle_pv() error overwrite
2026-03-23 17:53 ` Lorenzo Stoakes (Oracle)
@ 2026-03-23 18:48 ` Suren Baghdasaryan
0 siblings, 0 replies; 16+ messages in thread
From: Suren Baghdasaryan @ 2026-03-23 18:48 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: akpm, willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390
On Mon, Mar 23, 2026 at 10:53 AM Lorenzo Stoakes (Oracle)
<ljs@kernel.org> wrote:
>
> On Sat, Mar 21, 2026 at 10:43:07PM -0700, Suren Baghdasaryan wrote:
> > If the kvm_s390_handle_pv() call fails, its error code gets recorded but
> > execution proceeds as if the call had succeeded. If the next call to
> > copy_to_user() fails then the original error is overwritten.
>
> Is that really a big deal though, as you're returning an error in either case?
>
> > The follow-up patch adds fatal signal checks during the VMA walk, which
> > makes it possible for kvm_s390_handle_pv() to return an EINTR error.
> > Without this fix any error, including EINTR, can be overwritten and the
> > original error will be lost.
> >
> > Change the error handling for kvm_s390_handle_pv() to leave the normal
> > flow once a failure happens. This is consistent with how kvm_arch_vm_ioctl
> > handles errors for other ioctl commands.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > arch/s390/kvm/kvm-s390.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> > index 3eb60aa932ec..ddad08c0926f 100644
> > --- a/arch/s390/kvm/kvm-s390.c
> > +++ b/arch/s390/kvm/kvm-s390.c
> > @@ -2947,6 +2947,8 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> > }
> > /* must be called without kvm->lock */
> > r = kvm_s390_handle_pv(kvm, &args);
> > + if (r)
> > + break;
> > if (copy_to_user(argp, &args, sizeof(args))) {
>
> Yeah as per Sashiko we probably need to copy_to_user() still.
>
> But in that case, do we even need a change at all? I'm not sure it really
> matters which error code terminates things, does it?
I agree. I plan to drop this patch from the series (see my earlier
reply) unless someone tells me otherwise.
>
> > r = -EFAULT;
> > break;
> > --
> > 2.53.0.1018.g2bb0e51243-goog
> >
>
> Thanks, Lorenzo
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 2/4] mm: replace vma_start_write() with vma_start_write_killable()
2026-03-23 17:49 ` Lorenzo Stoakes (Oracle)
@ 2026-03-23 19:22 ` Suren Baghdasaryan
0 siblings, 0 replies; 16+ messages in thread
From: Suren Baghdasaryan @ 2026-03-23 19:22 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: akpm, willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390, Ritesh Harjani (IBM)
On Mon, Mar 23, 2026 at 10:49 AM Lorenzo Stoakes (Oracle)
<ljs@kernel.org> wrote:
>
> On Sat, Mar 21, 2026 at 10:43:06PM -0700, Suren Baghdasaryan wrote:
> > Now that we have vma_start_write_killable() we can replace most of the
> > vma_start_write() calls with it, improving reaction time to the kill
> > signal.
> >
> > There are several places which are left untouched by this patch:
> >
> > 1. free_pgtables() because the function should free page tables even if a
> > fatal signal is pending.
> >
> > 2. process_vma_walk_lock(), which requires changes in its callers and
> > will be handled in the next patch.
> >
> > 3. userfaultfd code, where some paths calling vma_start_write() can
> > handle EINTR and some can't without a deeper code refactoring.
> >
> > 4. mpol_rebind_mm() which is used by cpuset controller for migrations
> > and operates on a remote mm. Incomplete operations here would result
> > in an inconsistent cgroup state.
> >
> > 5. vm_flags_{set|mod|clear} require refactoring that involves moving
> > vma_start_write() out of these functions and replacing it with
> > vma_assert_write_locked(), then callers of these functions should
> > lock the vma themselves using vma_start_write_killable() whenever
> > possible.
> >
> > In a number of places we now lock VMA earlier than before to avoid
> > doing work and undoing it later if a fatal signal is pending. This
> > is safe because the moves are happening within sections where we
> > already hold the mmap_write_lock, so the moves do not change the
> > locking order relative to other kernel locks.
> >
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> # powerpc
>
> This really really needs splitting up into separate patches for the various
> bits you change, I know it might seem pedantic, but it's much harder to
> review this as one big patch, and hurts bisectability/specificity of fixes
> etc.
>
> Come, embrace the stats :)
>
> You are welcome to apply a 'Reviewed-by: Lorenzo Stoakes (Oracle)
> <ljs@kernel.org>' tag to all of the split-out patches where I have said
> LGTM throughout!
Thanks!
How would you suggest splitting the patch? I'm outlining one possible
division below. Let me know if you want to slice it differently:
>
> > ---
> > arch/powerpc/kvm/book3s_hv_uvmem.c | 5 +-
The above one can obviously be split as it's powerpc specific.
> > mm/khugepaged.c | 5 +-
> > mm/madvise.c | 4 +-
> > mm/memory.c | 2 +
The changes in the above files are almost one-liners with the last one
being just a comment change. Should each one go into a separate patch?
> > mm/mempolicy.c | 12 ++-
I guess mempolicy one can be a separate patch.
> > mm/mlock.c | 28 +++++--
> > mm/mprotect.c | 5 +-
> > mm/mremap.c | 4 +-
Maybe the above 3 can go together?
> > mm/vma.c | 117 +++++++++++++++++++++--------
> > mm/vma_exec.c | 6 +-
The above two files I think can go together and this is the most sizable one.
> > 10 files changed, 142 insertions(+), 46 deletions(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> > index 5fbb95d90e99..0a28b48a46b8 100644
> > --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> > +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> > @@ -410,7 +410,10 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
> > ret = H_STATE;
> > break;
> > }
> > - vma_start_write(vma);
> > + if (vma_start_write_killable(vma)) {
> > + ret = H_STATE;
> > + break;
> > + }
>
> LGTM
>
> > /* Copy vm_flags to avoid partial modifications in ksm_madvise */
> > vm_flags = vma->vm_flags;
> > ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 4b0e59c7c0e6..e2f263076084 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1159,7 +1159,10 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > if (result != SCAN_SUCCEED)
> > goto out_up_write;
> > /* check if the pmd is still valid */
> > - vma_start_write(vma);
> > + if (vma_start_write_killable(vma)) {
> > + result = SCAN_FAIL;
> > + goto out_up_write;
> > + }
>
> LGTM
>
> > result = check_pmd_still_valid(mm, address, pmd);
> > if (result != SCAN_SUCCEED)
> > goto out_up_write;
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 69708e953cf5..feaa16b0e1dc 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -175,7 +175,9 @@ static int madvise_update_vma(vm_flags_t new_flags,
> > madv_behavior->vma = vma;
> >
> > /* vm_flags is protected by the mmap_lock held in write mode. */
> > - vma_start_write(vma);
> > + if (vma_start_write_killable(vma))
> > + return -EINTR;
> > +
>
> LGTM
>
> > vma->flags = new_vma_flags;
> > if (set_new_anon_name)
> > return replace_anon_vma_name(vma, anon_name);
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 68cc592ff0ba..b930459e32ec 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -366,6 +366,8 @@ void free_pgd_range(struct mmu_gather *tlb,
> > * page tables that should be removed. This can differ from the vma mappings on
> > * some archs that may have mappings that need to be removed outside the vmas.
> > * Note that the prev->vm_end and next->vm_start are often used.
> > + * We don't use vma_start_write_killable() because page tables should be freed
> > + * even if the task is being killed.
>
> Nice.
>
> > *
> > * The vma_end differs from the pg_end when a dup_mmap() failed and the tree has
> > * unrelated data to the mm_struct being torn down.
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index e5528c35bbb8..929e843543cf 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1784,7 +1784,8 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
> > return -EINVAL;
> > if (end == start)
> > return 0;
> > - mmap_write_lock(mm);
> > + if (mmap_write_lock_killable(mm))
> > + return -EINTR;
> > prev = vma_prev(&vmi);
> > for_each_vma_range(vmi, vma, end) {
> > /*
> > @@ -1801,13 +1802,20 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
> > err = -EOPNOTSUPP;
> > break;
> > }
> > + /*
> > + * Lock the VMA early to avoid extra work if fatal signal
> > + * is pending.
> > + */
> > + if (vma_start_write_killable(vma)) {
> > + err = -EINTR;
> > + break;
> > + }
>
> LGTM, one nitty thing - wonder if we shouldn't pass through the error from
> vma_start_write_killable()? OTOH, it's not really a big deal. I don't
> foresee us ever returning anything but -EINTR or 0 :)
Ack. I'll change err to record vma_start_write_killable() return value instead.
>
> > new = mpol_dup(old);
> > if (IS_ERR(new)) {
> > err = PTR_ERR(new);
> > break;
> > }
> >
> > - vma_start_write(vma);
> > new->home_node = home_node;
> > err = mbind_range(&vmi, vma, &prev, start, end, new);
> > mpol_put(new);
> > diff --git a/mm/mlock.c b/mm/mlock.c
> > index 8c227fefa2df..efbb9c783f25 100644
> > --- a/mm/mlock.c
> > +++ b/mm/mlock.c
> > @@ -419,8 +419,10 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
> > *
> > * Called for mlock(), mlock2() and mlockall(), to set @vma VM_LOCKED;
> > * called for munlock() and munlockall(), to clear VM_LOCKED from @vma.
> > + *
> > + * Return: 0 on success, -EINTR if fatal signal is pending.
> > */
> > -static void mlock_vma_pages_range(struct vm_area_struct *vma,
> > +static int mlock_vma_pages_range(struct vm_area_struct *vma,
> > unsigned long start, unsigned long end,
> > vma_flags_t *new_vma_flags)
> > {
> > @@ -442,7 +444,9 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
> > */
> > if (vma_flags_test(new_vma_flags, VMA_LOCKED_BIT))
> > vma_flags_set(new_vma_flags, VMA_IO_BIT);
> > - vma_start_write(vma);
> > + if (vma_start_write_killable(vma))
> > + return -EINTR;
> > +
>
> LGTM
>
> > vma_flags_reset_once(vma, new_vma_flags);
> >
> > lru_add_drain();
> > @@ -453,6 +457,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
> > vma_flags_clear(new_vma_flags, VMA_IO_BIT);
> > vma_flags_reset_once(vma, new_vma_flags);
> > }
> > + return 0;
> > }
> >
> > /*
> > @@ -506,11 +511,13 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > */
> > if (vma_flags_test(&new_vma_flags, VMA_LOCKED_BIT) &&
> > vma_flags_test(&old_vma_flags, VMA_LOCKED_BIT)) {
> > + ret = vma_start_write_killable(vma);
> > + if (ret)
> > + goto out;
>
> LGTM
>
> > /* No work to do, and mlocking twice would be wrong */
> > - vma_start_write(vma);
> > vma->flags = new_vma_flags;
> > } else {
> > - mlock_vma_pages_range(vma, start, end, &new_vma_flags);
> > + ret = mlock_vma_pages_range(vma, start, end, &new_vma_flags);
>
> LGTM
>
> > }
> > out:
> > *prev = vma;
> > @@ -739,9 +746,18 @@ static int apply_mlockall_flags(int flags)
> >
> > error = mlock_fixup(&vmi, vma, &prev, vma->vm_start, vma->vm_end,
> > newflags);
> > - /* Ignore errors, but prev needs fixing up. */
> > - if (error)
> > + if (error) {
> > + /*
> > + * If we failed due to a pending fatal signal, return
> > + * now. If we locked the vma before signal arrived, it
> > + * will be unlocked when we drop mmap_write_lock.
> > + */
> > + if (fatal_signal_pending(current))
> > + return -EINTR;
>
> LGTM, and thanks for the careful explanation; hopefully this addresses
> Matthew's concerns ([0]).
That was my goal :)
>
> [0]:https://lore.kernel.org/all/aaiBX5Mm36Kg0wq1@casper.infradead.org/
>
> > +
> > + /* Ignore errors, but prev needs fixing up. */
> > prev = vma;
> > + }
> > cond_resched();
> > }
> > out:
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 110d47a36d4b..ae6ed882b600 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -768,7 +768,10 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
> > * vm_flags and vm_page_prot are protected by the mmap_lock
> > * held in write mode.
> > */
> > - vma_start_write(vma);
> > + error = vma_start_write_killable(vma);
> > + if (error)
> > + goto fail;
> > +
>
> LGTM
>
> > vma_flags_reset_once(vma, &new_vma_flags);
> > if (vma_wants_manual_pte_write_upgrade(vma))
> > mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index e9c8b1d05832..dec39ec314f9 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -1356,7 +1356,9 @@ static unsigned long move_vma(struct vma_remap_struct *vrm)
> > return -ENOMEM;
> >
> > /* We don't want racing faults. */
> > - vma_start_write(vrm->vma);
> > + err = vma_start_write_killable(vrm->vma);
> > + if (err)
> > + return err;
>
> Yeah this is subtle, I saw Sashiko flagged it, so I checked and this isn't
> being accounted properly.
>
> vrm_calc_charge() charges via
> security_vm_enough_memory_mm()->__vm_enough_memory()->vm_acct_memory()
>
> And we only uncharge via vrm_uncharge()->vm_unacct_memory()
>
> The other error function there is:
>
> err = copy_vma_and_data(vrm, &new_vma);
> ...
> if (err && !new_vma)
> return err;
>
> And in copy_vma_and_data():
>
> if (!new_vma) {
> vrm_uncharge(vrm);
> *new_vma_ptr = NULL;
> return -ENOMEM;
> }
>
> So that's already taken care of for us.
>
> It's stuff like this that is why it's important to separate this patch into
> separate patches :)
>
> We also do some weird and wonderful stuff with error handling where we try
> to back out the remapped page tables, having swapped the new and old VMAs
> around so the unmapping of the source VMA is done on the destination (yeah
> it's horrible).
>
> So this is all very subtle really.
>
> Honestly you might just be better off moving the vma_start_write_killable()
> above vrm_calc_charge()?
Yes I'll try that...
>
> If not, you'd just need to replace this with:
>
> err = vma_start_write_killable(vrm->vma);
> if (err) {
> vrm_uncharge(vrm);
> return err;
> }
and if it doesn't work will do the above.
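Just to make the accounting hazard above concrete, here's a userspace
sketch; charge()/uncharge() are hypothetical stand-ins for the
vrm_calc_charge()/vrm_uncharge() bookkeeping, and lock_err models a failed
vma_start_write_killable():

```c
#include <assert.h>
#include <errno.h>

static long acct_pages;	/* stands in for the vm_acct_memory() counter */

static void charge(long pages)   { acct_pages += pages; }
static void uncharge(long pages) { acct_pages -= pages; }

/* Shape of the patch as posted: charge first, then fail the killable
 * lock -> the charge leaks, because nothing on this path uncharges. */
static int move_vma_leaky(long pages, int lock_err)
{
	charge(pages);
	if (lock_err)
		return lock_err;	/* acct_pages stays elevated */
	uncharge(pages);		/* settle up, just for this sketch */
	return 0;
}

/* Suggested shape: take the killable lock before charging, so an early
 * -EINTR has nothing to undo. */
static int move_vma_fixed(long pages, int lock_err)
{
	if (lock_err)
		return lock_err;	/* nothing charged yet */
	charge(pages);
	uncharge(pages);		/* settle up, just for this sketch */
	return 0;
}
```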
>
> >
> > /* Perform copy step. */
> > err = copy_vma_and_data(vrm, &new_vma);
> > diff --git a/mm/vma.c b/mm/vma.c
> > index ba78ab1f397a..7930a4270eb9 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -524,6 +524,17 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT);
> > }
> >
> > + /*
> > + * Lock VMAs before cloning to avoid extra work if fatal signal
> > + * is pending.
> > + */
> > + err = vma_start_write_killable(vma);
> > + if (err)
> > + goto out_free_vma;
> > + err = vma_start_write_killable(new);
> > + if (err)
> > + goto out_free_vma;
> > +
>
> LGTM
>
> > err = -ENOMEM;
> > vma_iter_config(vmi, new->vm_start, new->vm_end);
> > if (vma_iter_prealloc(vmi, new))
> > @@ -543,9 +554,6 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > if (new->vm_ops && new->vm_ops->open)
> > new->vm_ops->open(new);
> >
> > - vma_start_write(vma);
> > - vma_start_write(new);
> > -
>
> LGTM
>
> > init_vma_prep(&vp, vma);
> > vp.insert = new;
> > vma_prepare(&vp);
> > @@ -900,12 +908,16 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
> > }
> >
> > /* No matter what happens, we will be adjusting middle. */
> > - vma_start_write(middle);
> > + err = vma_start_write_killable(middle);
> > + if (err)
> > + goto abort;
>
> I mean looking at this again the vmg_oom() etc. stuff is horrible and I
> definitely need to rework it. Sorry about that!
>
> I think we need to update vma_modify():
>
> /* First, try to merge. */
> merged = vma_merge_existing_range(vmg);
> if (merged)
> return merged;
> if (vmg_nomem(vmg))
> return ERR_PTR(-ENOMEM);
> + if (fatal_signal_pending(current))
We need to be careful here. I think there are cases where the vma is
modified from the context of a different process, for example in
process_madvise(). fatal_signal_pending(current) would yield an incorrect
result because vma->vm_mm is not the same as current->mm.
> + return ERR_PTR(-EINTR);
>
> So we can just get out early in this case. I _think_ that should be safe?
> :)
>
> If not, we could hack things here by doing:
>
> if (err) {
> /* Ensure error propagated. */
> vmg->give_up_on_oom = false;
> goto abort;
> }
I think that would be safer.
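Side note on why the early return above has to be ERR_PTR(-EINTR) rather
than a bare -EINTR: vma_modify() returns a pointer, so errors travel in
the pointer itself. A userspace model of the kernel's ERR_PTR encoding
(simplified; the real macros live in include/linux/err.h):

```c
#include <assert.h>
#include <errno.h>

/* Small negative errnos occupy the top MAX_ERRNO values of the address
 * space, which no valid pointer can, so one return value can carry
 * either a real pointer or an error code. */
#define MAX_ERRNO 4095

static void *err_ptr(long err) { return (void *)err; }
static long ptr_err(const void *ptr) { return (long)ptr; }
static int is_err(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

static int valid_object;	/* any real object's address is not an error */
```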
>
> And I can just go fix this all up afterwards.
>
> I still don't want a vmg_intr() or anything though, I will rework this into
> something saner!
>
> Anyway apologies, it's entirely my fault that this is a bit of a mess :)
No worries. We all make mistakes that become obvious only later.
Hindsight is always 20/20.
>
> >
> > if (merge_right) {
> > vma_flags_t next_sticky;
> >
> > - vma_start_write(next);
> > + err = vma_start_write_killable(next);
> > + if (err)
> > + goto abort;
>
> Obv not withstanding above LGTM
>
> > vmg->target = next;
> > next_sticky = vma_flags_and_mask(&next->flags, VMA_STICKY_FLAGS);
> > vma_flags_set_mask(&sticky_flags, next_sticky);
> > @@ -914,7 +926,9 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
> > if (merge_left) {
> > vma_flags_t prev_sticky;
> >
> > - vma_start_write(prev);
> > + err = vma_start_write_killable(prev);
> > + if (err)
> > + goto abort;
>
> Ditto
>
> > vmg->target = prev;
> >
> > prev_sticky = vma_flags_and_mask(&prev->flags, VMA_STICKY_FLAGS);
> > @@ -1170,10 +1184,12 @@ int vma_expand(struct vma_merge_struct *vmg)
> > vma_flags_t sticky_flags =
> > vma_flags_and_mask(&vmg->vma_flags, VMA_STICKY_FLAGS);
> > vma_flags_t target_sticky;
> > - int err = 0;
> > + int err;
> >
> > mmap_assert_write_locked(vmg->mm);
> > - vma_start_write(target);
> > + err = vma_start_write_killable(target);
> > + if (err)
> > + return err;
>
> Yeah dup_anon stuff will not be done yet so this LGTM
>
> >
> > target_sticky = vma_flags_and_mask(&target->flags, VMA_STICKY_FLAGS);
> >
> > @@ -1201,6 +1217,13 @@ int vma_expand(struct vma_merge_struct *vmg)
> > * we don't need to account for vmg->give_up_on_mm here.
> > */
> > if (remove_next) {
> > + /*
> > + * Lock the VMA early to avoid extra work if fatal signal
> > + * is pending.
> > + */
> > + err = vma_start_write_killable(next);
> > + if (err)
> > + return err;
>
> Same here, so LGTM
>
> > err = dup_anon_vma(target, next, &anon_dup);
> > if (err)
> > return err;
> > @@ -1214,7 +1237,6 @@ int vma_expand(struct vma_merge_struct *vmg)
> > if (remove_next) {
> > vma_flags_t next_sticky;
> >
> > - vma_start_write(next);
>
> LGTM
>
> > vmg->__remove_next = true;
> >
> > next_sticky = vma_flags_and_mask(&next->flags, VMA_STICKY_FLAGS);
> > @@ -1252,9 +1274,14 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > unsigned long start, unsigned long end, pgoff_t pgoff)
> > {
> > struct vma_prepare vp;
> > + int err;
> >
> > WARN_ON((vma->vm_start != start) && (vma->vm_end != end));
> >
> > + err = vma_start_write_killable(vma);
> > + if (err)
> > + return err;
> > +
> > if (vma->vm_start < start)
> > vma_iter_config(vmi, vma->vm_start, start);
> > else
> > @@ -1263,8 +1290,6 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > if (vma_iter_prealloc(vmi, NULL))
> > return -ENOMEM;
> >
> > - vma_start_write(vma);
> > -
>
> We will hold the VMA write lock even though we return -ENOMEM, but I guess
> not a big deal as we will drop the mmap write lock on error at the earliest
> opportunity anyway.
>
> So LGTM
>
> > init_vma_prep(&vp, vma);
> > vma_prepare(&vp);
> > vma_adjust_trans_huge(vma, start, end, NULL);
> > @@ -1453,7 +1478,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > if (error)
> > goto end_split_failed;
> > }
> > - vma_start_write(next);
> > + error = vma_start_write_killable(next);
> > + if (error)
> > + goto munmap_gather_failed;
>
> LGTM
>
> > mas_set(mas_detach, vms->vma_count++);
> > error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
> > if (error)
> > @@ -1848,12 +1875,16 @@ static void vma_link_file(struct vm_area_struct *vma, bool hold_rmap_lock)
> > static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
> > {
> > VMA_ITERATOR(vmi, mm, 0);
> > + int err;
> > +
> > + err = vma_start_write_killable(vma);
> > + if (err)
> > + return err;
>
> LGTM
>
> >
> > vma_iter_config(&vmi, vma->vm_start, vma->vm_end);
> > if (vma_iter_prealloc(&vmi, vma))
> > return -ENOMEM;
> >
> > - vma_start_write(vma);
>
> LGTM
>
> > vma_iter_store_new(&vmi, vma);
> > vma_link_file(vma, /* hold_rmap_lock= */false);
> > mm->map_count++;
> > @@ -2239,9 +2270,8 @@ int mm_take_all_locks(struct mm_struct *mm)
> > * is reached.
> > */
> > for_each_vma(vmi, vma) {
> > - if (signal_pending(current))
> > + if (signal_pending(current) || vma_start_write_killable(vma))
>
> Hmm, surely signal_pending() should catch it :) but I suppose there could
> be a (very very tight) timing issue here, though there's a case to be made
> for not converting this?
I want to convert all possible vma_start_write() cases in the hopes
that over time I might be able to deprecate vma_start_write()
completely. It's not possible right now (see the changelog) but with
some effort we might get there eventually.
>
> OTOH it doesn't really make much difference I guess.
>
> > goto out_unlock;
> > - vma_start_write(vma);
> > }
> >
> > vma_iter_init(&vmi, mm, 0);
> > @@ -2540,8 +2570,8 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
> > struct mmap_action *action)
> > {
> > struct vma_iterator *vmi = map->vmi;
> > - int error = 0;
> > struct vm_area_struct *vma;
> > + int error;
> >
> > /*
> > * Determine the object being mapped and call the appropriate
> > @@ -2552,6 +2582,14 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
> > if (!vma)
> > return -ENOMEM;
> >
> > + /*
> > + * Lock the VMA early to avoid extra work if fatal signal
> > + * is pending.
> > + */
> > + error = vma_start_write_killable(vma);
> > + if (error)
> > + goto free_vma;
> > +
>
> LGTM
>
> > vma_iter_config(vmi, map->addr, map->end);
> > vma_set_range(vma, map->addr, map->end, map->pgoff);
> > vma->flags = map->vma_flags;
> > @@ -2582,8 +2620,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
> > WARN_ON_ONCE(!arch_validate_flags(map->vm_flags));
> > #endif
> >
> > - /* Lock the VMA since it is modified after insertion into VMA tree */
> > - vma_start_write(vma);
>
> LGTM
>
> > vma_iter_store_new(vmi, vma);
> > map->mm->map_count++;
> > vma_link_file(vma, action->hide_from_rmap_until_complete);
> > @@ -2878,6 +2914,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > unsigned long addr, unsigned long len, vma_flags_t vma_flags)
> > {
> > struct mm_struct *mm = current->mm;
> > + int err;
>
> LGTM
>
> >
> > /*
> > * Check against address space limits by the changed size
> > @@ -2910,24 +2947,33 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >
> > if (vma_merge_new_range(&vmg))
> > goto out;
> > - else if (vmg_nomem(&vmg))
> > + if (vmg_nomem(&vmg)) {
> > + err = -ENOMEM;
> > goto unacct_fail;
> > + }
>
> LGTM
>
> > }
> >
> > if (vma)
> > vma_iter_next_range(vmi);
> > /* create a vma struct for an anonymous mapping */
> > vma = vm_area_alloc(mm);
> > - if (!vma)
> > + if (!vma) {
> > + err = -ENOMEM;
> > goto unacct_fail;
> > + }
>
> LGTM
>
> >
> > vma_set_anonymous(vma);
> > vma_set_range(vma, addr, addr + len, addr >> PAGE_SHIFT);
> > vma->flags = vma_flags;
> > vma->vm_page_prot = vm_get_page_prot(vma_flags_to_legacy(vma_flags));
> > - vma_start_write(vma);
> > - if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
> > + if (vma_start_write_killable(vma)) {
> > + err = -EINTR;
> > + goto vma_lock_fail;
> > + }
>
> LGTM
>
> > + if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL)) {
> > + err = -ENOMEM;
> > goto mas_store_fail;
> > + }
>
> LGTM
>
> >
> > mm->map_count++;
> > validate_mm(mm);
> > @@ -2942,10 +2988,11 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > return 0;
> >
> > mas_store_fail:
> > +vma_lock_fail:
> > vm_area_free(vma);
> > unacct_fail:
> > vm_unacct_memory(len >> PAGE_SHIFT);
> > - return -ENOMEM;
> > + return err;
>
> LGTM
>
> > }
> >
> > /**
> > @@ -3112,8 +3159,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > struct mm_struct *mm = vma->vm_mm;
> > struct vm_area_struct *next;
> > unsigned long gap_addr;
> > - int error = 0;
> > VMA_ITERATOR(vmi, mm, vma->vm_start);
> > + int error;
>
> LGTM
>
> >
> > if (!vma_test(vma, VMA_GROWSUP_BIT))
> > return -EFAULT;
> > @@ -3149,12 +3196,14 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> >
> > /* We must make sure the anon_vma is allocated. */
> > if (unlikely(anon_vma_prepare(vma))) {
> > - vma_iter_free(&vmi);
> > - return -ENOMEM;
> > + error = -ENOMEM;
> > + goto vma_prep_fail;
>
> Hm, but before we weren't calling validate_mm() here, and now we will be, I
> suspect we probably don't want to do that on error. Will comment on it
> below in that bit of the code.
>
> Otherwise, LGTM.
>
> > }
> >
> > /* Lock the VMA before expanding to prevent concurrent page faults */
> > - vma_start_write(vma);
> > + error = vma_start_write_killable(vma);
> > + if (error)
> > + goto vma_lock_fail;
>
> LGTM
>
> > /* We update the anon VMA tree. */
> > anon_vma_lock_write(vma->anon_vma);
> >
> > @@ -3183,6 +3232,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > }
> > }
> > anon_vma_unlock_write(vma->anon_vma);
> > +vma_lock_fail:
> > +vma_prep_fail:
>
> LGTM
>
> > vma_iter_free(&vmi);
> > validate_mm(mm);
>
> Maybe this should be put before the labels?
>
> Then again, I guess there's no harm in it. But I'm not sure taking a long
> time to check the entire maple tree when CONFIG_DEBUG_VM_MAPLE_TREE is
> enabled is a good idea with a fatal signal pending?
That makes sense. Validating a tree when we know something went wrong
is probably meaningless. I'll move it before the jump labels.
>
> > return error;
> > @@ -3197,8 +3248,8 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > struct vm_area_struct *prev;
> > - int error = 0;
> > VMA_ITERATOR(vmi, mm, vma->vm_start);
> > + int error;
>
> LGTM
>
> >
> > if (!vma_test(vma, VMA_GROWSDOWN_BIT))
> > return -EFAULT;
> > @@ -3228,12 +3279,14 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> >
> > /* We must make sure the anon_vma is allocated. */
> > if (unlikely(anon_vma_prepare(vma))) {
> > - vma_iter_free(&vmi);
> > - return -ENOMEM;
> > + error = -ENOMEM;
> > + goto vma_prep_fail;
> > }
> >
> > /* Lock the VMA before expanding to prevent concurrent page faults */
> > - vma_start_write(vma);
> > + error = vma_start_write_killable(vma);
> > + if (error)
> > + goto vma_lock_fail;
> > /* We update the anon VMA tree. */
> > anon_vma_lock_write(vma->anon_vma);
> >
> > @@ -3263,6 +3316,8 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> > }
> > }
> > anon_vma_unlock_write(vma->anon_vma);
> > +vma_lock_fail:
> > +vma_prep_fail:
>
> Obv same comments for expand_upwards() apply here also.
>
> > vma_iter_free(&vmi);
> > validate_mm(mm);
> > return error;
> > diff --git a/mm/vma_exec.c b/mm/vma_exec.c
> > index 5cee8b7efa0f..8ddcc791d828 100644
> > --- a/mm/vma_exec.c
> > +++ b/mm/vma_exec.c
> > @@ -41,6 +41,7 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
> > struct vm_area_struct *next;
> > struct mmu_gather tlb;
> > PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length);
> > + int err;
> >
> > BUG_ON(new_start > new_end);
> >
> > @@ -56,8 +57,9 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
> > * cover the whole range: [new_start, old_end)
> > */
> > vmg.target = vma;
> > - if (vma_expand(&vmg))
> > - return -ENOMEM;
> > + err = vma_expand(&vmg);
> > + if (err)
> > + return err;
>
> LGTM
Thanks for the detailed review!
>
> >
> > /*
> > * move the page tables downwards, on failure we rely on
> > --
> > 2.53.0.1018.g2bb0e51243-goog
> >
>
> Thanks, Lorenzo
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 4/4] mm: use vma_start_write_killable() in process_vma_walk_lock()
2026-03-23 18:04 ` Lorenzo Stoakes (Oracle)
@ 2026-03-23 19:29 ` Suren Baghdasaryan
0 siblings, 0 replies; 16+ messages in thread
From: Suren Baghdasaryan @ 2026-03-23 19:29 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: akpm, willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390
On Mon, Mar 23, 2026 at 11:04 AM Lorenzo Stoakes (Oracle)
<ljs@kernel.org> wrote:
>
> On Sat, Mar 21, 2026 at 10:43:08PM -0700, Suren Baghdasaryan wrote:
> > Replace vma_start_write() with vma_start_write_killable() when
> > process_vma_walk_lock() is used with PGWALK_WRLOCK option.
> > Adjust its direct and indirect users to check for a possible error
> > and handle it. Ensure users handle EINTR correctly and do not ignore
> > it.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > fs/proc/task_mmu.c | 5 ++++-
> > mm/mempolicy.c | 1 +
> > mm/pagewalk.c | 20 ++++++++++++++------
> > 3 files changed, 19 insertions(+), 7 deletions(-)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index e091931d7ca1..2fe3d11aad03 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1797,6 +1797,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > struct clear_refs_private cp = {
> > .type = type,
> > };
> > + int err;
>
> Maybe better to make it a ssize_t given return type of function?
Ack.
>
> >
> > if (mmap_write_lock_killable(mm)) {
> > count = -EINTR;
> > @@ -1824,7 +1825,9 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > 0, mm, 0, -1UL);
> > mmu_notifier_invalidate_range_start(&range);
> > }
> > - walk_page_range(mm, 0, -1, &clear_refs_walk_ops, &cp);
> > + err = walk_page_range(mm, 0, -1, &clear_refs_walk_ops, &cp);
> > + if (err)
> > + count = err;
>
> Hmm this is gross, but it's an established pattern here, ugh.
>
> Now we have an err though, could we update:
>
> if (mmap_write_lock_killable(mm)) {
> - count = -EINTR;
> + err = -EINTR;
> goto out_mm;
> }
>
> Then we can just do:
>
> + err = walk_page_range(mm, 0, -1, &clear_refs_walk_ops, &cp);
>
> And at the end do:
>
> return err ?: count;
>
> Which possibly _necessitates_ err being a ssize_t.
Sounds doable. Let me try that.
>
> > if (type == CLEAR_REFS_SOFT_DIRTY) {
> > mmu_notifier_invalidate_range_end(&range);
> > flush_tlb_mm(mm);
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 929e843543cf..bb5b0e83ce0f 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -969,6 +969,7 @@ static const struct mm_walk_ops queue_pages_lock_vma_walk_ops = {
> > * (a hugetlbfs page or a transparent huge page being counted as 1).
> > * -EIO - a misplaced page found, when MPOL_MF_STRICT specified without MOVEs.
> > * -EFAULT - a hole in the memory range, when MPOL_MF_DISCONTIG_OK unspecified.
> > + * -EINTR - walk got terminated due to pending fatal signal.
>
> Thanks!
>
> > */
> > static long
> > queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index eda74273c8ec..a42cd6a6d812 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -438,14 +438,13 @@ static inline void process_mm_walk_lock(struct mm_struct *mm,
> > mmap_assert_write_locked(mm);
> > }
> >
> > -static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> > +static inline int process_vma_walk_lock(struct vm_area_struct *vma,
>
> NIT: Don't need this to be an inline any longer. May as well fix up while we're
> here.
Ack.
>
> > enum page_walk_lock walk_lock)
> > {
> > #ifdef CONFIG_PER_VMA_LOCK
> > switch (walk_lock) {
> > case PGWALK_WRLOCK:
> > - vma_start_write(vma);
> > - break;
> > + return vma_start_write_killable(vma);
>
> LGTM
>
> > case PGWALK_WRLOCK_VERIFY:
> > vma_assert_write_locked(vma);
> > break;
> > @@ -457,6 +456,7 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> > break;
> > }
> > #endif
> > + return 0;
> > }
> >
> > /*
> > @@ -500,7 +500,9 @@ int walk_page_range_mm_unsafe(struct mm_struct *mm, unsigned long start,
> > if (ops->pte_hole)
> > err = ops->pte_hole(start, next, -1, &walk);
> > } else { /* inside vma */
> > - process_vma_walk_lock(vma, ops->walk_lock);
> > + err = process_vma_walk_lock(vma, ops->walk_lock);
> > + if (err)
> > + break;
>
> In every other case we set walk.vma = vma or NULL. Is it a problem not setting
> it at all in this code path?
IIUC the other cases set walk.vma because they later call
ops->pte_hole(..., walk). In our case we immediately break out of the
loop and exit the function, at which point the "walk" variable goes out
of scope. So walk.vma won't be used and setting it would achieve nothing.
>
> > walk.vma = vma;
> > next = min(end, vma->vm_end);
> > vma = find_vma(mm, vma->vm_end);
> > @@ -717,6 +719,7 @@ int walk_page_range_vma_unsafe(struct vm_area_struct *vma, unsigned long start,
> > .vma = vma,
> > .private = private,
> > };
> > + int err;
> >
> > if (start >= end || !walk.mm)
> > return -EINVAL;
> > @@ -724,7 +727,9 @@ int walk_page_range_vma_unsafe(struct vm_area_struct *vma, unsigned long start,
> > return -EINVAL;
> >
> > process_mm_walk_lock(walk.mm, ops->walk_lock);
> > - process_vma_walk_lock(vma, ops->walk_lock);
> > + err = process_vma_walk_lock(vma, ops->walk_lock);
> > + if (err)
> > + return err;
>
> LGTM
>
> > return __walk_page_range(start, end, &walk);
> > }
> >
> > @@ -747,6 +752,7 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
> > .vma = vma,
> > .private = private,
> > };
> > + int err;
> >
> > if (!walk.mm)
> > return -EINVAL;
> > @@ -754,7 +760,9 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
> > return -EINVAL;
> >
> > process_mm_walk_lock(walk.mm, ops->walk_lock);
> > - process_vma_walk_lock(vma, ops->walk_lock);
> > + err = process_vma_walk_lock(vma, ops->walk_lock);
> > + if (err)
> > + return err;
>
> LGTM
>
> > return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
> > }
> >
> > --
> > 2.53.0.1018.g2bb0e51243-goog
> >
>
> Thanks, Lorenzo
* Re: [PATCH v4 0/4] Use killable vma write locking in most places
2026-03-23 4:29 ` Suren Baghdasaryan
@ 2026-03-24 15:59 ` Suren Baghdasaryan
0 siblings, 0 replies; 16+ messages in thread
From: Suren Baghdasaryan @ 2026-03-24 15:59 UTC (permalink / raw)
To: Andrew Morton
Cc: willy, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
lance.yang, vbabka, jannh, rppt, mhocko, pfalcato, kees, maddy,
npiggin, mpe, chleroy, borntraeger, frankja, imbrenda, hca, gor,
agordeev, svens, gerald.schaefer, linux-mm, linuxppc-dev, kvm,
linux-kernel, linux-s390, Roman Gushchin
On Sun, Mar 22, 2026 at 9:29 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Sun, Mar 22, 2026 at 9:17 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Sat, 21 Mar 2026 22:43:04 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > > Now that we have vma_start_write_killable() we can replace most of the
> > > vma_start_write() calls with it, improving reaction time to the kill
> > > signal.
> >
> > Thanks. Sashiko review raised a few possible issues:
> > https://sashiko.dev/#/patchset/20260322054309.898214-1-surenb@google.com
>
> Thanks! This Sashiko dude is good :)
Interestingly, Sashiko raised one false positive: "Does this code leave
mm->locked_vm permanently corrupted if vma_start_write_killable()
fails?"
In mlock_fixup(), the path where we call vma_start_write_killable() is
taken only if both new_vma_flags and old_vma_flags have their
VMA_LOCKED_BIT set. In that case nr_pages is 0, so "mm->locked_vm +=
nr_pages;" does not change the value of mm->locked_vm and we are fine.
Perhaps this can be used to improve the model?
CC'ing Roman.
Thread overview: 16+ messages
2026-03-22 5:43 [PATCH v4 0/4] Use killable vma write locking in most places Suren Baghdasaryan
2026-03-22 5:43 ` [PATCH v4 1/4] mm/vma: cleanup error handling path in vma_expand() Suren Baghdasaryan
2026-03-22 7:49 ` Barry Song
2026-03-22 5:43 ` [PATCH v4 2/4] mm: replace vma_start_write() with vma_start_write_killable() Suren Baghdasaryan
2026-03-23 17:49 ` Lorenzo Stoakes (Oracle)
2026-03-23 19:22 ` Suren Baghdasaryan
2026-03-22 5:43 ` [PATCH v4 3/4] KVM: s390: avoid kvm_s390_handle_pv() error overwrite Suren Baghdasaryan
2026-03-23 4:41 ` Suren Baghdasaryan
2026-03-23 17:53 ` Lorenzo Stoakes (Oracle)
2026-03-23 18:48 ` Suren Baghdasaryan
2026-03-22 5:43 ` [PATCH v4 4/4] mm: use vma_start_write_killable() in process_vma_walk_lock() Suren Baghdasaryan
2026-03-23 18:04 ` Lorenzo Stoakes (Oracle)
2026-03-23 19:29 ` Suren Baghdasaryan
2026-03-22 16:17 ` [PATCH v4 0/4] Use killable vma write locking in most places Andrew Morton
2026-03-23 4:29 ` Suren Baghdasaryan
2026-03-24 15:59 ` Suren Baghdasaryan