public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] mm/madvise: prefer VMA lock for MADV_REMOVE
@ 2026-04-10  8:02 jiang.kun2
  2026-04-10 13:40 ` Lorenzo Stoakes
  0 siblings, 1 reply; 2+ messages in thread
From: jiang.kun2 @ 2026-04-10  8:02 UTC (permalink / raw)
  To: akpm, liam.howlett, ljs, david, vbabka, jannh
  Cc: linux-mm, linux-kernel, xu.xin16, wang.yaxin, jiang.kun2,
	lu.zhongjun

From: Jiang Kun <jiang.kun2@zte.com.cn>

MADV_REMOVE prefers the per-VMA read lock for single-VMA, local-mm,
non-UFFD-armed ranges, avoiding mmap_lock contention for such ranges.

However, calling into the filesystem while holding vm_lock (VMA lock) can
create lock ordering issues. syzbot reported a possible deadlock in
blkdev_fallocate() when vfs_fallocate() is called under vm_lock.

Fix this by dropping the VMA lock before invoking vfs_fallocate(), after
taking an extra reference to the file. Keep the existing mmap_lock fallback
path and its userfaultfd coordination unchanged.

Repeated benchmark runs show no regression in the uncontended case, and show
benefit once mmap_lock contention is introduced.

Link: https://ci.syzbot.org/series/30acb9df-ca55-4cbf-81ed-89b84da8edc1
Link: https://lore.kernel.org/all/aWcZCwz__qwwKbxw@casper.infradead.org/
Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn>
Signed-off-by: Yaxin Wang <wang.yaxin@zte.com.cn>
---
 mm/madvise.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 69708e953cf5..0932579bccb4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1008,8 +1008,6 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
 	unsigned long start = madv_behavior->range.start;
 	unsigned long end = madv_behavior->range.end;

-	mark_mmap_lock_dropped(madv_behavior);
-
 	if (vma->vm_flags & VM_LOCKED)
 		return -EINVAL;

@@ -1025,6 +1023,20 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
 	offset = (loff_t)(start - vma->vm_start)
 			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);

+	/* Avoid calling into the filesystem while holding a VMA lock. */
+	if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
+		get_file(f);
+		vma_end_read(vma);
+		madv_behavior->vma = NULL;
+		error = vfs_fallocate(f,
+				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+				offset, end - start);
+		fput(f);
+		return error;
+	}
+
+	mark_mmap_lock_dropped(madv_behavior);
+
 	/*
 	 * Filesystem's fallocate may need to take i_rwsem.  We need to
 	 * explicitly grab a reference because the vma (and hence the
@@ -1677,7 +1689,8 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
 	if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK &&
 	    try_vma_read_lock(madv_behavior)) {
 		error = madvise_vma_behavior(madv_behavior);
-		vma_end_read(madv_behavior->vma);
+		if (madv_behavior->vma)
+			vma_end_read(madv_behavior->vma);
 		return error;
 	}

@@ -1746,7 +1759,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
 		return MADVISE_NO_LOCK;

 	switch (madv_behavior->behavior) {
-	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_COLD:
 	case MADV_PAGEOUT:
@@ -1754,6 +1766,7 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
 	case MADV_POPULATE_WRITE:
 	case MADV_COLLAPSE:
 		return MADVISE_MMAP_READ_LOCK;
+	case MADV_REMOVE:
 	case MADV_GUARD_INSTALL:
 	case MADV_GUARD_REMOVE:
 	case MADV_DONTNEED:
-- 
2.53.0

^ permalink raw reply related	[flat|nested] 2+ messages in thread

* Re: [PATCH v2] mm/madvise: prefer VMA lock for MADV_REMOVE
  2026-04-10  8:02 [PATCH v2] mm/madvise: prefer VMA lock for MADV_REMOVE jiang.kun2
@ 2026-04-10 13:40 ` Lorenzo Stoakes
  0 siblings, 0 replies; 2+ messages in thread
From: Lorenzo Stoakes @ 2026-04-10 13:40 UTC (permalink / raw)
  To: jiang.kun2
  Cc: akpm, liam.howlett, david, vbabka, jannh, linux-mm, linux-kernel,
	xu.xin16, wang.yaxin, lu.zhongjun

On Fri, Apr 10, 2026 at 04:02:49PM +0800, jiang.kun2@zte.com.cn wrote:
> From: Jiang Kun <jiang.kun2@zte.com.cn>
>
> MADV_REMOVE prefers the per-VMA read lock for single-VMA, local-mm,
> non-UFFD-armed ranges, avoiding mmap_lock contention for such ranges.

I'm not sure what 'MADV_REMOVE prefers' means here? You're making the change,
you should be making a case for it, ideally with performance numbers.

I'm really not sure we should just be enabling the VMA lock for things
arbitrarily unless there's a real need for it.

>
> However, calling into the filesystem while holding vm_lock (VMA lock) can

Not sure what vm_lock is? :)

> create lock ordering issues. syzbot reported a possible deadlock in
> blkdev_fallocate() when vfs_fallocate() is called under vm_lock.

It's not really worth mentioning an issue that came up in a v1 of this code in
the commit message. The commit message should describe what you're actually
doing, why, and any details that might be helpful to people looking at the
patch.

>
> Fix this by dropping the VMA lock before invoking vfs_fallocate(), after
> taking an extra reference to the file. Keep the existing mmap_lock fallback
> path and its userfaultfd coordination unchanged.

It's worth mentioning this, but not in terms of a bug that happened in v1

>
> Repeated benchmark runs show no regression in the uncontended case, and show
> benefit once mmap_lock contention is introduced.

Numbers please :) ideally with some statistics like standard deviation etc.

>
> Link: https://ci.syzbot.org/series/30acb9df-ca55-4cbf-81ed-89b84da8edc1
> Link: https://lore.kernel.org/all/aWcZCwz__qwwKbxw@casper.infradead.org/
> Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn>
> Signed-off-by: Yaxin Wang <wang.yaxin@zte.com.cn>

Some things to address, I have attached a patch that reworks what you have
here in a way that avoids duplication, so if you're good with that feel
free to send as v3, I don't need attribution other than a Suggested-by if
you like.

The logic in general seems correct, but we need numbers! And the commit
message to be improved as per the above also.

Thanks, Lorenzo

> ---
>  mm/madvise.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 69708e953cf5..0932579bccb4 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1008,8 +1008,6 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
>  	unsigned long start = madv_behavior->range.start;
>  	unsigned long end = madv_behavior->range.end;
>
> -	mark_mmap_lock_dropped(madv_behavior);
> -
>  	if (vma->vm_flags & VM_LOCKED)
>  		return -EINVAL;
>
> @@ -1025,6 +1023,20 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
>  	offset = (loff_t)(start - vma->vm_start)
>  			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>
> +	/* Avoid calling into the filesystem while holding a VMA lock. */
> +	if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
> +		get_file(f);
> +		vma_end_read(vma);
> +		madv_behavior->vma = NULL;
> +		error = vfs_fallocate(f,
> +				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +				offset, end - start);
> +		fput(f);
> +		return error;
> +	}
> +
> +	mark_mmap_lock_dropped(madv_behavior);

I don't like this code duplication, you're literally copy/pasting code here.

Let's use what we already have.

We can avoid duplicating like this, make mark_mmap_lock_dropped() valid for VMA
locks too and rename it to mark_lock_dropped(), then change the logic around a
bit and have something a bit nicer I think.

See attached patch, you're welcome to use this as-is (you can add a Suggested-by
if you like), but that avoids the duplication and keeps things unified.

> +
>  	/*
>  	 * Filesystem's fallocate may need to take i_rwsem.  We need to
>  	 * explicitly grab a reference because the vma (and hence the
> @@ -1677,7 +1689,8 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
>  	if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK &&
>  	    try_vma_read_lock(madv_behavior)) {
>  		error = madvise_vma_behavior(madv_behavior);
> -		vma_end_read(madv_behavior->vma);
> +		if (madv_behavior->vma)
> +			vma_end_read(madv_behavior->vma);
>  		return error;
>  	}
>
> @@ -1746,7 +1759,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
>  		return MADVISE_NO_LOCK;
>
>  	switch (madv_behavior->behavior) {
> -	case MADV_REMOVE:
>  	case MADV_WILLNEED:
>  	case MADV_COLD:
>  	case MADV_PAGEOUT:
> @@ -1754,6 +1766,7 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
>  	case MADV_POPULATE_WRITE:
>  	case MADV_COLLAPSE:
>  		return MADVISE_MMAP_READ_LOCK;
> +	case MADV_REMOVE:
>  	case MADV_GUARD_INSTALL:
>  	case MADV_GUARD_REMOVE:
>  	case MADV_DONTNEED:
> --
> 2.53.0

Suggested reworked version below:

----8<----
From 97f6ff9749a38256e919a24ff2867a077a6f8cfc Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <ljs@kernel.org>
Date: Fri, 10 Apr 2026 14:37:23 +0100
Subject: [PATCH] slightly reworked

---
 mm/madvise.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 69708e953cf5..4af0565809f2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -269,9 +269,8 @@ static void shmem_swapin_range(struct vm_area_struct *vma,
 }
 #endif		/* CONFIG_SWAP */

-static void mark_mmap_lock_dropped(struct madvise_behavior *madv_behavior)
+static void mark_lock_dropped(struct madvise_behavior *madv_behavior)
 {
-	VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK);
 	madv_behavior->lock_dropped = true;
 }

@@ -315,7 +314,7 @@ static long madvise_willneed(struct madvise_behavior *madv_behavior)
 	 * vma's reference to the file) can go away as soon as we drop
 	 * mmap_lock.
 	 */
-	mark_mmap_lock_dropped(madv_behavior);
+	mark_lock_dropped(madv_behavior);
 	get_file(file);
 	offset = (loff_t)(start - vma->vm_start)
 			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
@@ -909,7 +908,7 @@ static long madvise_dontneed_free(struct madvise_behavior *madv_behavior)
 	if (!userfaultfd_remove(madv_behavior->vma, range->start, range->end)) {
 		struct vm_area_struct *vma;

-		mark_mmap_lock_dropped(madv_behavior);
+		mark_lock_dropped(madv_behavior);
 		mmap_read_lock(mm);
 		madv_behavior->vma = vma = vma_lookup(mm, range->start);
 		if (!vma)
@@ -1005,10 +1004,9 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
 	struct file *f;
 	struct mm_struct *mm = madv_behavior->mm;
 	struct vm_area_struct *vma = madv_behavior->vma;
-	unsigned long start = madv_behavior->range.start;
-	unsigned long end = madv_behavior->range.end;
-
-	mark_mmap_lock_dropped(madv_behavior);
+	const unsigned long start = madv_behavior->range.start;
+	const unsigned long end = madv_behavior->range.end;
+	const bool vma_locked = madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK;

 	if (vma->vm_flags & VM_LOCKED)
 		return -EINVAL;
@@ -1029,18 +1027,20 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
 	 * Filesystem's fallocate may need to take i_rwsem.  We need to
 	 * explicitly grab a reference because the vma (and hence the
 	 * vma's reference to the file) can go away as soon as we drop
-	 * mmap_lock.
+	 * its lock.
 	 */
 	get_file(f);
-	if (userfaultfd_remove(vma, start, end)) {
-		/* mmap_lock was not released by userfaultfd_remove() */
+	mark_lock_dropped(madv_behavior);
+	if (vma_locked)
+		vma_end_read(vma);
+	else if (userfaultfd_remove(vma, start, end)) /* uffd didn't drop? */
 		mmap_read_unlock(mm);
-	}
 	error = vfs_fallocate(f,
 				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
 				offset, end - start);
 	fput(f);
-	mmap_read_lock(mm);
+	if (!vma_locked)
+		mmap_read_lock(mm);
 	return error;
 }

@@ -1677,7 +1677,8 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
 	if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK &&
 	    try_vma_read_lock(madv_behavior)) {
 		error = madvise_vma_behavior(madv_behavior);
-		vma_end_read(madv_behavior->vma);
+		if (!madv_behavior->lock_dropped)
+			vma_end_read(madv_behavior->vma);
 		return error;
 	}

@@ -1746,7 +1747,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
 		return MADVISE_NO_LOCK;

 	switch (madv_behavior->behavior) {
-	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_COLD:
 	case MADV_PAGEOUT:
@@ -1754,6 +1754,7 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
 	case MADV_POPULATE_WRITE:
 	case MADV_COLLAPSE:
 		return MADVISE_MMAP_READ_LOCK;
+	case MADV_REMOVE:
 	case MADV_GUARD_INSTALL:
 	case MADV_GUARD_REMOVE:
 	case MADV_DONTNEED:
--
2.53.0

^ permalink raw reply related	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2026-04-10 13:40 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-10  8:02 [PATCH v2] mm/madvise: prefer VMA lock for MADV_REMOVE jiang.kun2
2026-04-10 13:40 ` Lorenzo Stoakes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox