* [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
@ 2025-06-07 22:01 Barry Song
2025-06-09 7:21 ` Qi Zheng
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: Barry Song @ 2025-06-07 22:01 UTC (permalink / raw)
To: akpm, linux-mm
Cc: linux-kernel, Barry Song, Lorenzo Stoakes, Liam R. Howlett,
David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, Qi Zheng
From: Barry Song <v-songbaohua@oppo.com>
Certain madvise operations, especially MADV_DONTNEED, occur far more
frequently than other madvise options, particularly in native and Java
heaps for dynamic memory management.
Currently, the mmap_lock is always held during these operations, even when
unnecessary. This causes lock contention and can lead to severe priority
inversion, where low-priority threads—such as Android's HeapTaskDaemon—
hold the lock and block higher-priority threads.
This patch enables the use of per-VMA locks when the advised range lies
entirely within a single VMA, avoiding the need for full VMA traversal. In
practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
benefits from this per-VMA lock optimization. After extended runtime,
217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
only 1,231 fell back to mmap_lock.
To simplify handling, the implementation falls back to the standard
mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
userfaultfd_remove().
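For illustration, a minimal userspace sketch of the call pattern that takes
the new fast path (not part of this patch; the mapping size and offsets are
arbitrary). A heap allocator typically discards ranges inside one large
anonymous mapping, so the advised range stays within a single VMA:

	#include <stddef.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 64UL << 20;	/* one anonymous mapping == one VMA */
		char *heap = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (heap == MAP_FAILED)
			return 1;
		heap[0] = 1;			/* fault in at least one page */

		/* Entirely within the single VMA: eligible for the per-VMA lock. */
		madvise(heap + (4UL << 20), 8UL << 20, MADV_DONTNEED);

		/*
		 * A range crossing vm_end, a remote mm, or a uffd-armed VMA
		 * falls back to mmap_read_lock as described above.
		 */
		munmap(heap, len);
		return 0;
	}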
Many thanks to Lorenzo's work[1] on:
"Refactor the madvise() code to retain state about the locking mode
utilised for traversing VMAs.
Then use this mechanism to permit VMA locking to be done later in the
madvise() logic and also to allow altering of the locking mode to permit
falling back to an mmap read lock if required."
One important point, as pointed out by Jann[2], is that
untagged_addr_remote() requires holding mmap_lock. This is because
address tagging on x86 and RISC-V is quite complex.
Until untagged_addr_remote() becomes atomic—which seems unlikely in
the near future—we cannot support per-VMA locks for remote processes.
So for now, only local processes are supported.
Link: https://lore.kernel.org/all/0b96ce61-a52c-4036-b5b6-5c50783db51f@lucifer.local/ [1]
Link: https://lore.kernel.org/all/CAG48ez11zi-1jicHUZtLhyoNPGGVB+ROeAJCUw48bsjk4bbEkA@mail.gmail.com/ [2]
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
-v4:
* collect Lorenzo's RB;
* use visit() for per-vma path
mm/madvise.c | 195 ++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 147 insertions(+), 48 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 56d9ca2557b9..8382614b71d1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -48,38 +48,19 @@ struct madvise_walk_private {
bool pageout;
};
+enum madvise_lock_mode {
+ MADVISE_NO_LOCK,
+ MADVISE_MMAP_READ_LOCK,
+ MADVISE_MMAP_WRITE_LOCK,
+ MADVISE_VMA_READ_LOCK,
+};
+
struct madvise_behavior {
int behavior;
struct mmu_gather *tlb;
+ enum madvise_lock_mode lock_mode;
};
-/*
- * Any behaviour which results in changes to the vma->vm_flags needs to
- * take mmap_lock for writing. Others, which simply traverse vmas, need
- * to only take it for reading.
- */
-static int madvise_need_mmap_write(int behavior)
-{
- switch (behavior) {
- case MADV_REMOVE:
- case MADV_WILLNEED:
- case MADV_DONTNEED:
- case MADV_DONTNEED_LOCKED:
- case MADV_COLD:
- case MADV_PAGEOUT:
- case MADV_FREE:
- case MADV_POPULATE_READ:
- case MADV_POPULATE_WRITE:
- case MADV_COLLAPSE:
- case MADV_GUARD_INSTALL:
- case MADV_GUARD_REMOVE:
- return 0;
- default:
- /* be safe, default to 1. list exceptions explicitly */
- return 1;
- }
-}
-
#ifdef CONFIG_ANON_VMA_NAME
struct anon_vma_name *anon_vma_name_alloc(const char *name)
{
@@ -1486,6 +1467,44 @@ static bool process_madvise_remote_valid(int behavior)
}
}
+/*
+ * Try to acquire a VMA read lock if possible.
+ *
+ * We only support this lock over a single VMA, which the input range must
+ * span either partially or fully.
+ *
+ * This function always returns with an appropriate lock held. If a VMA read
+ * lock could be acquired, we return the locked VMA.
+ *
+ * If a VMA read lock could not be acquired, we return NULL and expect caller to
+ * fallback to mmap lock behaviour.
+ */
+static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm,
+ struct madvise_behavior *madv_behavior,
+ unsigned long start, unsigned long end)
+{
+ struct vm_area_struct *vma;
+
+ vma = lock_vma_under_rcu(mm, start);
+ if (!vma)
+ goto take_mmap_read_lock;
+ /*
+ * Must span only a single VMA; uffd and remote processes are
+ * unsupported.
+ */
+ if (end > vma->vm_end || current->mm != mm ||
+ userfaultfd_armed(vma)) {
+ vma_end_read(vma);
+ goto take_mmap_read_lock;
+ }
+ return vma;
+
+take_mmap_read_lock:
+ mmap_read_lock(mm);
+ madv_behavior->lock_mode = MADVISE_MMAP_READ_LOCK;
+ return NULL;
+}
+
/*
* Walk the vmas in range [start,end), and call the visit function on each one.
* The visit function will get start and end parameters that cover the overlap
@@ -1496,7 +1515,8 @@ static bool process_madvise_remote_valid(int behavior)
*/
static
int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
- unsigned long end, void *arg,
+ unsigned long end, struct madvise_behavior *madv_behavior,
+ void *arg,
int (*visit)(struct vm_area_struct *vma,
struct vm_area_struct **prev, unsigned long start,
unsigned long end, void *arg))
@@ -1505,6 +1525,20 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
struct vm_area_struct *prev;
unsigned long tmp;
int unmapped_error = 0;
+ int error;
+
+ /*
+ * If VMA read lock is supported, apply madvise to a single VMA
+ * tentatively, avoiding walking VMAs.
+ */
+ if (madv_behavior && madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
+ vma = try_vma_read_lock(mm, madv_behavior, start, end);
+ if (vma) {
+ error = visit(vma, &prev, start, end, arg);
+ vma_end_read(vma);
+ return error;
+ }
+ }
/*
* If the interval [start,end) covers some unmapped address
@@ -1516,8 +1550,6 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
prev = vma;
for (;;) {
- int error;
-
/* Still start < end. */
if (!vma)
return -ENOMEM;
@@ -1598,34 +1630,86 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
if (end == start)
return 0;
- return madvise_walk_vmas(mm, start, end, anon_name,
+ return madvise_walk_vmas(mm, start, end, NULL, anon_name,
madvise_vma_anon_name);
}
#endif /* CONFIG_ANON_VMA_NAME */
-static int madvise_lock(struct mm_struct *mm, int behavior)
+
+/*
+ * Any behaviour which results in changes to the vma->vm_flags needs to
+ * take mmap_lock for writing. Others, which simply traverse vmas, need
+ * to only take it for reading.
+ */
+static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavior)
{
+ int behavior = madv_behavior->behavior;
+
if (is_memory_failure(behavior))
- return 0;
+ return MADVISE_NO_LOCK;
- if (madvise_need_mmap_write(behavior)) {
+ switch (behavior) {
+ case MADV_REMOVE:
+ case MADV_WILLNEED:
+ case MADV_COLD:
+ case MADV_PAGEOUT:
+ case MADV_FREE:
+ case MADV_POPULATE_READ:
+ case MADV_POPULATE_WRITE:
+ case MADV_COLLAPSE:
+ case MADV_GUARD_INSTALL:
+ case MADV_GUARD_REMOVE:
+ return MADVISE_MMAP_READ_LOCK;
+ case MADV_DONTNEED:
+ case MADV_DONTNEED_LOCKED:
+ return MADVISE_VMA_READ_LOCK;
+ default:
+ return MADVISE_MMAP_WRITE_LOCK;
+ }
+}
+
+static int madvise_lock(struct mm_struct *mm,
+ struct madvise_behavior *madv_behavior)
+{
+ enum madvise_lock_mode lock_mode = get_lock_mode(madv_behavior);
+
+ switch (lock_mode) {
+ case MADVISE_NO_LOCK:
+ break;
+ case MADVISE_MMAP_WRITE_LOCK:
if (mmap_write_lock_killable(mm))
return -EINTR;
- } else {
+ break;
+ case MADVISE_MMAP_READ_LOCK:
mmap_read_lock(mm);
+ break;
+ case MADVISE_VMA_READ_LOCK:
+ /* We will acquire the lock per-VMA in madvise_walk_vmas(). */
+ break;
}
+
+ madv_behavior->lock_mode = lock_mode;
return 0;
}
-static void madvise_unlock(struct mm_struct *mm, int behavior)
+static void madvise_unlock(struct mm_struct *mm,
+ struct madvise_behavior *madv_behavior)
{
- if (is_memory_failure(behavior))
+ switch (madv_behavior->lock_mode) {
+ case MADVISE_NO_LOCK:
return;
-
- if (madvise_need_mmap_write(behavior))
+ case MADVISE_MMAP_WRITE_LOCK:
mmap_write_unlock(mm);
- else
+ break;
+ case MADVISE_MMAP_READ_LOCK:
mmap_read_unlock(mm);
+ break;
+ case MADVISE_VMA_READ_LOCK:
+ /* We will drop the lock per-VMA in madvise_walk_vmas(). */
+ break;
+ }
+
+ madv_behavior->lock_mode = MADVISE_NO_LOCK;
}
static bool madvise_batch_tlb_flush(int behavior)
@@ -1710,6 +1794,21 @@ static bool is_madvise_populate(int behavior)
}
}
+/*
+ * untagged_addr_remote() assumes mmap_lock is already held. On
+ * architectures like x86 and RISC-V, tagging is tricky because each
+ * mm may have a different tagging mask. However, we might only hold
+ * the per-VMA lock (currently only local processes are supported),
+ * so untagged_addr is used to avoid the mmap_lock assertion for
+ * local processes.
+ */
+static inline unsigned long get_untagged_addr(struct mm_struct *mm,
+ unsigned long start)
+{
+ return current->mm == mm ? untagged_addr(start) :
+ untagged_addr_remote(mm, start);
+}
+
static int madvise_do_behavior(struct mm_struct *mm,
unsigned long start, size_t len_in,
struct madvise_behavior *madv_behavior)
@@ -1721,7 +1820,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
if (is_memory_failure(behavior))
return madvise_inject_error(behavior, start, start + len_in);
- start = untagged_addr_remote(mm, start);
+ start = get_untagged_addr(mm, start);
end = start + PAGE_ALIGN(len_in);
blk_start_plug(&plug);
@@ -1729,7 +1828,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
error = madvise_populate(mm, start, end, behavior);
else
error = madvise_walk_vmas(mm, start, end, madv_behavior,
- madvise_vma_behavior);
+ madv_behavior, madvise_vma_behavior);
blk_finish_plug(&plug);
return error;
}
@@ -1817,13 +1916,13 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
if (madvise_should_skip(start, len_in, behavior, &error))
return error;
- error = madvise_lock(mm, behavior);
+ error = madvise_lock(mm, &madv_behavior);
if (error)
return error;
madvise_init_tlb(&madv_behavior, mm);
error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
madvise_finish_tlb(&madv_behavior);
- madvise_unlock(mm, behavior);
+ madvise_unlock(mm, &madv_behavior);
return error;
}
@@ -1847,7 +1946,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
total_len = iov_iter_count(iter);
- ret = madvise_lock(mm, behavior);
+ ret = madvise_lock(mm, &madv_behavior);
if (ret)
return ret;
madvise_init_tlb(&madv_behavior, mm);
@@ -1880,8 +1979,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
/* Drop and reacquire lock to unwind race. */
madvise_finish_tlb(&madv_behavior);
- madvise_unlock(mm, behavior);
- ret = madvise_lock(mm, behavior);
+ madvise_unlock(mm, &madv_behavior);
+ ret = madvise_lock(mm, &madv_behavior);
if (ret)
goto out;
madvise_init_tlb(&madv_behavior, mm);
@@ -1892,7 +1991,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
iov_iter_advance(iter, iter_iov_len(iter));
}
madvise_finish_tlb(&madv_behavior);
- madvise_unlock(mm, behavior);
+ madvise_unlock(mm, &madv_behavior);
out:
ret = (total_len - iov_iter_count(iter)) ? : ret;
--
2.39.3 (Apple Git-146)
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-06-07 22:01 [PATCH v4] mm: use per_vma lock for MADV_DONTNEED Barry Song
@ 2025-06-09 7:21 ` Qi Zheng
2025-06-17 13:38 ` Lorenzo Stoakes
2025-08-04 0:58 ` Lai, Yi
2 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2025-06-09 7:21 UTC (permalink / raw)
To: Barry Song, akpm, linux-mm
Cc: linux-kernel, Barry Song, Lorenzo Stoakes, Liam R. Howlett,
David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng
On 6/8/25 6:01 AM, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> Certain madvise operations, especially MADV_DONTNEED, occur far more
> frequently than other madvise options, particularly in native and Java
> heaps for dynamic memory management.
>
> Currently, the mmap_lock is always held during these operations, even when
> unnecessary. This causes lock contention and can lead to severe priority
> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
> hold the lock and block higher-priority threads.
>
> This patch enables the use of per-VMA locks when the advised range lies
> entirely within a single VMA, avoiding the need for full VMA traversal. In
> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
>
> Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
> benefits from this per-VMA lock optimization. After extended runtime,
> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> only 1,231 fell back to mmap_lock.
>
> To simplify handling, the implementation falls back to the standard
> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> userfaultfd_remove().
>
> Many thanks to Lorenzo's work[1] on:
> "Refactor the madvise() code to retain state about the locking mode
> utilised for traversing VMAs.
>
> Then use this mechanism to permit VMA locking to be done later in the
> madvise() logic and also to allow altering of the locking mode to permit
> falling back to an mmap read lock if required."
>
> One important point, as pointed out by Jann[2], is that
> untagged_addr_remote() requires holding mmap_lock. This is because
> address tagging on x86 and RISC-V is quite complex.
>
> Until untagged_addr_remote() becomes atomic—which seems unlikely in
> the near future—we cannot support per-VMA locks for remote processes.
> So for now, only local processes are supported.
>
> Link: https://lore.kernel.org/all/0b96ce61-a52c-4036-b5b6-5c50783db51f@lucifer.local/ [1]
> Link: https://lore.kernel.org/all/CAG48ez11zi-1jicHUZtLhyoNPGGVB+ROeAJCUw48bsjk4bbEkA@mail.gmail.com/ [2]
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jann Horn <jannh@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Cc: Qi Zheng <zhengqi.arch@bytedance.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> -v4:
> * collect Lorenzo's RB;
> * use visit() for per-vma path
>
> mm/madvise.c | 195 ++++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 147 insertions(+), 48 deletions(-)
>
Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
Thanks.
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-06-07 22:01 [PATCH v4] mm: use per_vma lock for MADV_DONTNEED Barry Song
2025-06-09 7:21 ` Qi Zheng
@ 2025-06-17 13:38 ` Lorenzo Stoakes
2025-06-18 2:25 ` Lance Yang
2025-06-18 10:11 ` Barry Song
2025-08-04 0:58 ` Lai, Yi
2 siblings, 2 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-06-17 13:38 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, Qi Zheng, Lance Yang
+cc Lance
Hi Barry,
This needs a quick fixpatch for an issue discovered by Lance in [0], which I
analysed in [1].
Basically, _theoretically_ though not currently in practice, we might end up
accessing uninitialised state through the struct vm_area_struct **prev value
passed around madvise.
The solution for now is to simply initialise prev in the VMA read lock case, as
all users of this path set *prev = vma prior to performing the operation.
Cheers, Lorenzo
[0]: https://lore.kernel.org/all/20250617020544.57305-1-lance.yang@linux.dev/
[1]: https://lore.kernel.org/all/6181fd25-6527-4cd0-b67f-2098191d262d@lucifer.local/
On Sun, Jun 08, 2025 at 10:01:50AM +1200, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> Certain madvise operations, especially MADV_DONTNEED, occur far more
> frequently than other madvise options, particularly in native and Java
> heaps for dynamic memory management.
>
> Currently, the mmap_lock is always held during these operations, even when
> unnecessary. This causes lock contention and can lead to severe priority
> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
> hold the lock and block higher-priority threads.
>
> This patch enables the use of per-VMA locks when the advised range lies
> entirely within a single VMA, avoiding the need for full VMA traversal. In
> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
>
> Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
> benefits from this per-VMA lock optimization. After extended runtime,
> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> only 1,231 fell back to mmap_lock.
>
> To simplify handling, the implementation falls back to the standard
> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> userfaultfd_remove().
>
> Many thanks to Lorenzo's work[1] on:
> "Refactor the madvise() code to retain state about the locking mode
> utilised for traversing VMAs.
>
> Then use this mechanism to permit VMA locking to be done later in the
> madvise() logic and also to allow altering of the locking mode to permit
> falling back to an mmap read lock if required."
>
> One important point, as pointed out by Jann[2], is that
> untagged_addr_remote() requires holding mmap_lock. This is because
> address tagging on x86 and RISC-V is quite complex.
>
> Until untagged_addr_remote() becomes atomic—which seems unlikely in
> the near future—we cannot support per-VMA locks for remote processes.
> So for now, only local processes are supported.
>
> Link: https://lore.kernel.org/all/0b96ce61-a52c-4036-b5b6-5c50783db51f@lucifer.local/ [1]
> Link: https://lore.kernel.org/all/CAG48ez11zi-1jicHUZtLhyoNPGGVB+ROeAJCUw48bsjk4bbEkA@mail.gmail.com/ [2]
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jann Horn <jannh@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Cc: Qi Zheng <zhengqi.arch@bytedance.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> -v4:
> * collect Lorenzo's RB;
> * use visit() for per-vma path
>
> mm/madvise.c | 195 ++++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 147 insertions(+), 48 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 56d9ca2557b9..8382614b71d1 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -48,38 +48,19 @@ struct madvise_walk_private {
> bool pageout;
> };
>
> +enum madvise_lock_mode {
> + MADVISE_NO_LOCK,
> + MADVISE_MMAP_READ_LOCK,
> + MADVISE_MMAP_WRITE_LOCK,
> + MADVISE_VMA_READ_LOCK,
> +};
> +
> struct madvise_behavior {
> int behavior;
> struct mmu_gather *tlb;
> + enum madvise_lock_mode lock_mode;
> };
>
> -/*
> - * Any behaviour which results in changes to the vma->vm_flags needs to
> - * take mmap_lock for writing. Others, which simply traverse vmas, need
> - * to only take it for reading.
> - */
> -static int madvise_need_mmap_write(int behavior)
> -{
> - switch (behavior) {
> - case MADV_REMOVE:
> - case MADV_WILLNEED:
> - case MADV_DONTNEED:
> - case MADV_DONTNEED_LOCKED:
> - case MADV_COLD:
> - case MADV_PAGEOUT:
> - case MADV_FREE:
> - case MADV_POPULATE_READ:
> - case MADV_POPULATE_WRITE:
> - case MADV_COLLAPSE:
> - case MADV_GUARD_INSTALL:
> - case MADV_GUARD_REMOVE:
> - return 0;
> - default:
> - /* be safe, default to 1. list exceptions explicitly */
> - return 1;
> - }
> -}
> -
> #ifdef CONFIG_ANON_VMA_NAME
> struct anon_vma_name *anon_vma_name_alloc(const char *name)
> {
> @@ -1486,6 +1467,44 @@ static bool process_madvise_remote_valid(int behavior)
> }
> }
>
> +/*
> + * Try to acquire a VMA read lock if possible.
> + *
> + * We only support this lock over a single VMA, which the input range must
> + * span either partially or fully.
> + *
> + * This function always returns with an appropriate lock held. If a VMA read
> + * lock could be acquired, we return the locked VMA.
> + *
> + * If a VMA read lock could not be acquired, we return NULL and expect caller to
> + * fallback to mmap lock behaviour.
> + */
> +static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm,
> + struct madvise_behavior *madv_behavior,
> + unsigned long start, unsigned long end)
> +{
> + struct vm_area_struct *vma;
> +
> + vma = lock_vma_under_rcu(mm, start);
> + if (!vma)
> + goto take_mmap_read_lock;
> + /*
> + * Must span only a single VMA; uffd and remote processes are
> + * unsupported.
> + */
> + if (end > vma->vm_end || current->mm != mm ||
> + userfaultfd_armed(vma)) {
> + vma_end_read(vma);
> + goto take_mmap_read_lock;
> + }
> + return vma;
> +
> +take_mmap_read_lock:
> + mmap_read_lock(mm);
> + madv_behavior->lock_mode = MADVISE_MMAP_READ_LOCK;
> + return NULL;
> +}
> +
> /*
> * Walk the vmas in range [start,end), and call the visit function on each one.
> * The visit function will get start and end parameters that cover the overlap
> @@ -1496,7 +1515,8 @@ static bool process_madvise_remote_valid(int behavior)
> */
> static
> int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> - unsigned long end, void *arg,
> + unsigned long end, struct madvise_behavior *madv_behavior,
> + void *arg,
> int (*visit)(struct vm_area_struct *vma,
> struct vm_area_struct **prev, unsigned long start,
> unsigned long end, void *arg))
> @@ -1505,6 +1525,20 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> struct vm_area_struct *prev;
> unsigned long tmp;
> int unmapped_error = 0;
> + int error;
> +
> + /*
> + * If VMA read lock is supported, apply madvise to a single VMA
> + * tentatively, avoiding walking VMAs.
> + */
> + if (madv_behavior && madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
> + vma = try_vma_read_lock(mm, madv_behavior, start, end);
> + if (vma) {
> + error = visit(vma, &prev, start, end, arg);
> + vma_end_read(vma);
> + return error;
> + }
> + }
>
> /*
> * If the interval [start,end) covers some unmapped address
> @@ -1516,8 +1550,6 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> prev = vma;
>
> for (;;) {
> - int error;
> -
> /* Still start < end. */
> if (!vma)
> return -ENOMEM;
> @@ -1598,34 +1630,86 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> if (end == start)
> return 0;
>
> - return madvise_walk_vmas(mm, start, end, anon_name,
> + return madvise_walk_vmas(mm, start, end, NULL, anon_name,
> madvise_vma_anon_name);
> }
> #endif /* CONFIG_ANON_VMA_NAME */
>
> -static int madvise_lock(struct mm_struct *mm, int behavior)
> +
> +/*
> + * Any behaviour which results in changes to the vma->vm_flags needs to
> + * take mmap_lock for writing. Others, which simply traverse vmas, need
> + * to only take it for reading.
> + */
> +static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavior)
> {
> + int behavior = madv_behavior->behavior;
> +
> if (is_memory_failure(behavior))
> - return 0;
> + return MADVISE_NO_LOCK;
>
> - if (madvise_need_mmap_write(behavior)) {
> + switch (behavior) {
> + case MADV_REMOVE:
> + case MADV_WILLNEED:
> + case MADV_COLD:
> + case MADV_PAGEOUT:
> + case MADV_FREE:
> + case MADV_POPULATE_READ:
> + case MADV_POPULATE_WRITE:
> + case MADV_COLLAPSE:
> + case MADV_GUARD_INSTALL:
> + case MADV_GUARD_REMOVE:
> + return MADVISE_MMAP_READ_LOCK;
> + case MADV_DONTNEED:
> + case MADV_DONTNEED_LOCKED:
> + return MADVISE_VMA_READ_LOCK;
> + default:
> + return MADVISE_MMAP_WRITE_LOCK;
> + }
> +}
> +
> +static int madvise_lock(struct mm_struct *mm,
> + struct madvise_behavior *madv_behavior)
> +{
> + enum madvise_lock_mode lock_mode = get_lock_mode(madv_behavior);
> +
> + switch (lock_mode) {
> + case MADVISE_NO_LOCK:
> + break;
> + case MADVISE_MMAP_WRITE_LOCK:
> if (mmap_write_lock_killable(mm))
> return -EINTR;
> - } else {
> + break;
> + case MADVISE_MMAP_READ_LOCK:
> mmap_read_lock(mm);
> + break;
> + case MADVISE_VMA_READ_LOCK:
> + /* We will acquire the lock per-VMA in madvise_walk_vmas(). */
> + break;
> }
> +
> + madv_behavior->lock_mode = lock_mode;
> return 0;
> }
>
> -static void madvise_unlock(struct mm_struct *mm, int behavior)
> +static void madvise_unlock(struct mm_struct *mm,
> + struct madvise_behavior *madv_behavior)
> {
> - if (is_memory_failure(behavior))
> + switch (madv_behavior->lock_mode) {
> + case MADVISE_NO_LOCK:
> return;
> -
> - if (madvise_need_mmap_write(behavior))
> + case MADVISE_MMAP_WRITE_LOCK:
> mmap_write_unlock(mm);
> - else
> + break;
> + case MADVISE_MMAP_READ_LOCK:
> mmap_read_unlock(mm);
> + break;
> + case MADVISE_VMA_READ_LOCK:
> + /* We will drop the lock per-VMA in madvise_walk_vmas(). */
> + break;
> + }
> +
> + madv_behavior->lock_mode = MADVISE_NO_LOCK;
> }
>
> static bool madvise_batch_tlb_flush(int behavior)
> @@ -1710,6 +1794,21 @@ static bool is_madvise_populate(int behavior)
> }
> }
>
> +/*
> + * untagged_addr_remote() assumes mmap_lock is already held. On
> + * architectures like x86 and RISC-V, tagging is tricky because each
> + * mm may have a different tagging mask. However, we might only hold
> + * the per-VMA lock (currently only local processes are supported),
> + * so untagged_addr is used to avoid the mmap_lock assertion for
> + * local processes.
> + */
> +static inline unsigned long get_untagged_addr(struct mm_struct *mm,
> + unsigned long start)
> +{
> + return current->mm == mm ? untagged_addr(start) :
> + untagged_addr_remote(mm, start);
> +}
> +
> static int madvise_do_behavior(struct mm_struct *mm,
> unsigned long start, size_t len_in,
> struct madvise_behavior *madv_behavior)
> @@ -1721,7 +1820,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
>
> if (is_memory_failure(behavior))
> return madvise_inject_error(behavior, start, start + len_in);
> - start = untagged_addr_remote(mm, start);
> + start = get_untagged_addr(mm, start);
> end = start + PAGE_ALIGN(len_in);
>
> blk_start_plug(&plug);
> @@ -1729,7 +1828,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
> error = madvise_populate(mm, start, end, behavior);
> else
> error = madvise_walk_vmas(mm, start, end, madv_behavior,
> - madvise_vma_behavior);
> + madv_behavior, madvise_vma_behavior);
> blk_finish_plug(&plug);
> return error;
> }
> @@ -1817,13 +1916,13 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>
> if (madvise_should_skip(start, len_in, behavior, &error))
> return error;
> - error = madvise_lock(mm, behavior);
> + error = madvise_lock(mm, &madv_behavior);
> if (error)
> return error;
> madvise_init_tlb(&madv_behavior, mm);
> error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
> madvise_finish_tlb(&madv_behavior);
> - madvise_unlock(mm, behavior);
> + madvise_unlock(mm, &madv_behavior);
>
> return error;
> }
> @@ -1847,7 +1946,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>
> total_len = iov_iter_count(iter);
>
> - ret = madvise_lock(mm, behavior);
> + ret = madvise_lock(mm, &madv_behavior);
> if (ret)
> return ret;
> madvise_init_tlb(&madv_behavior, mm);
> @@ -1880,8 +1979,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>
> /* Drop and reacquire lock to unwind race. */
> madvise_finish_tlb(&madv_behavior);
> - madvise_unlock(mm, behavior);
> - ret = madvise_lock(mm, behavior);
> + madvise_unlock(mm, &madv_behavior);
> + ret = madvise_lock(mm, &madv_behavior);
> if (ret)
> goto out;
> madvise_init_tlb(&madv_behavior, mm);
> @@ -1892,7 +1991,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> iov_iter_advance(iter, iter_iov_len(iter));
> }
> madvise_finish_tlb(&madv_behavior);
> - madvise_unlock(mm, behavior);
> + madvise_unlock(mm, &madv_behavior);
>
> out:
> ret = (total_len - iov_iter_count(iter)) ? : ret;
> --
> 2.39.3 (Apple Git-146)
>
----8<----
From 1ffcaea75ebdaffe15805386f6d7733883d265a5 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Tue, 17 Jun 2025 14:35:13 +0100
Subject: [PATCH] mm/madvise: avoid any chance of uninitialised pointer deref
If we were to extend madvise() to support more operations under the VMA lock,
we could potentially dereference an uninitialised prev pointer in
madvise_update_vma().
Avoid this by explicitly setting prev to vma before invoking the visit()
function.
This has no impact on behaviour: none of the visitors compatible with a VMA
lock requires prev to be set to the previous VMA, and in any case we only
examine a single VMA in VMA lock mode.
Reported-by: Lance Yang <ioworker0@gmail.com>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/madvise.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/madvise.c b/mm/madvise.c
index efe5d64e1175..0970623a0e98 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1333,6 +1333,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
return madvise_guard_remove(vma, prev, start, end);
}
+ /* We cannot provide prev in this lock mode. */
+ VM_WARN_ON_ONCE(arg->lock_mode == MADVISE_VMA_READ_LOCK);
anon_name = anon_vma_name(vma);
anon_vma_name_get(anon_name);
error = madvise_update_vma(vma, prev, start, end, new_flags,
@@ -1549,6 +1551,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
if (madv_behavior && madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
vma = try_vma_read_lock(mm, madv_behavior, start, end);
if (vma) {
+ prev = vma;
error = visit(vma, &prev, start, end, arg);
vma_end_read(vma);
return error;
--
2.49.0
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-06-17 13:38 ` Lorenzo Stoakes
@ 2025-06-18 2:25 ` Lance Yang
2025-06-18 9:52 ` Barry Song
2025-06-18 10:11 ` Barry Song
1 sibling, 1 reply; 26+ messages in thread
From: Lance Yang @ 2025-06-18 2:25 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, Qi Zheng, Lance Yang,
Lorenzo Stoakes, Zi Li
Hi all,
Crazy, the per-VMA lock for madvise is an absolute game-changer ;)
On 2025/6/17 21:38, Lorenzo Stoakes wrote:
[...]
>
> On Sun, Jun 08, 2025 at 10:01:50AM +1200, Barry Song wrote:
>> From: Barry Song <v-songbaohua@oppo.com>
>>
>> Certain madvise operations, especially MADV_DONTNEED, occur far more
>> frequently than other madvise options, particularly in native and Java
>> heaps for dynamic memory management.
>>
>> Currently, the mmap_lock is always held during these operations, even when
>> unnecessary. This causes lock contention and can lead to severe priority
>> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
>> hold the lock and block higher-priority threads.
>>
>> This patch enables the use of per-VMA locks when the advised range lies
>> entirely within a single VMA, avoiding the need for full VMA traversal. In
>> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
>>
>> Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
>> benefits from this per-VMA lock optimization. After extended runtime,
>> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
>> only 1,231 fell back to mmap_lock.
>>
>> To simplify handling, the implementation falls back to the standard
>> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
>> userfaultfd_remove().
>>
>> Many thanks to Lorenzo's work[1] on:
>> "Refactor the madvise() code to retain state about the locking mode
>> utilised for traversing VMAs.
>>
>> Then use this mechanism to permit VMA locking to be done later in the
>> madvise() logic and also to allow altering of the locking mode to permit
>> falling back to an mmap read lock if required."
>>
>> One important point, as pointed out by Jann[2], is that
>> untagged_addr_remote() requires holding mmap_lock. This is because
>> address tagging on x86 and RISC-V is quite complex.
>>
>> Until untagged_addr_remote() becomes atomic—which seems unlikely in
>> the near future—we cannot support per-VMA locks for remote processes.
>> So for now, only local processes are supported.
Just to put some numbers on it, I ran a micro-benchmark with 100
parallel threads, where each thread calls madvise() on its own 1GiB
chunk of 64KiB mTHP-backed memory. The performance gain is huge:
1) MADV_DONTNEED saw its average time drop from 0.0508s to 0.0270s (~47%
faster)
2) MADV_FREE saw its average time drop from 0.3078s to 0.1095s (~64%
faster)
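For reference, below is a rough sketch of this kind of benchmark. The 100
threads, 1GiB-per-thread chunks and MADV_DONTNEED/MADV_FREE advice are taken
from the description above; everything else (the timing, the page-touch
stride, and enabling 64KiB mTHP beforehand via
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled) is my own
assumption and may differ from the actual test:

	#define _GNU_SOURCE
	#include <pthread.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <time.h>

	#define NR_THREADS	100
	#define CHUNK		(1UL << 30)	/* 1 GiB per thread */

	static void *worker(void *arg)
	{
		char *buf = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return NULL;
		for (size_t i = 0; i < CHUNK; i += 4096)
			buf[i] = 1;			/* fault the chunk in */
		madvise(buf, CHUNK, MADV_DONTNEED);	/* or MADV_FREE */
		munmap(buf, CHUNK);
		return NULL;
	}

	int main(void)
	{
		pthread_t tids[NR_THREADS];
		struct timespec t0, t1;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (int i = 0; i < NR_THREADS; i++)
			pthread_create(&tids[i], NULL, worker, NULL);
		for (int i = 0; i < NR_THREADS; i++)
			pthread_join(tids[i], NULL);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		printf("total: %.4fs\n", (t1.tv_sec - t0.tv_sec) +
					 (t1.tv_nsec - t0.tv_nsec) / 1e9);
		return 0;
	}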
Thanks,
Lance
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-06-18 2:25 ` Lance Yang
@ 2025-06-18 9:52 ` Barry Song
2025-06-18 10:18 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Barry Song @ 2025-06-18 9:52 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, Qi Zheng, Lance Yang,
Lorenzo Stoakes, Zi Li
On Wed, Jun 18, 2025 at 10:25 AM Lance Yang <lance.yang@linux.dev> wrote:
>
> Hi all,
>
> Crazy, the per-VMA lock for madvise is an absolute game-changer ;)
>
> On 2025/6/17 21:38, Lorenzo Stoakes wrote:
> [...]
> >
> > On Sun, Jun 08, 2025 at 10:01:50AM +1200, Barry Song wrote:
> >> From: Barry Song <v-songbaohua@oppo.com>
> >>
> >> Certain madvise operations, especially MADV_DONTNEED, occur far more
> >> frequently than other madvise options, particularly in native and Java
> >> heaps for dynamic memory management.
> >>
> >> Currently, the mmap_lock is always held during these operations, even when
> >> unnecessary. This causes lock contention and can lead to severe priority
> >> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
> >> hold the lock and block higher-priority threads.
> >>
> >> This patch enables the use of per-VMA locks when the advised range lies
> >> entirely within a single VMA, avoiding the need for full VMA traversal. In
> >> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
> >>
> >> Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
> >> benefits from this per-VMA lock optimization. After extended runtime,
> >> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> >> only 1,231 fell back to mmap_lock.
> >>
> >> To simplify handling, the implementation falls back to the standard
> >> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> >> userfaultfd_remove().
> >>
> >> Many thanks to Lorenzo's work[1] on:
> >> "Refactor the madvise() code to retain state about the locking mode
> >> utilised for traversing VMAs.
> >>
> >> Then use this mechanism to permit VMA locking to be done later in the
> >> madvise() logic and also to allow altering of the locking mode to permit
> >> falling back to an mmap read lock if required."
> >>
> >> One important point, as pointed out by Jann[2], is that
> >> untagged_addr_remote() requires holding mmap_lock. This is because
> >> address tagging on x86 and RISC-V is quite complex.
> >>
> >> Until untagged_addr_remote() becomes atomic—which seems unlikely in
> >> the near future—we cannot support per-VMA locks for remote processes.
> >> So for now, only local processes are supported.
>
> Just to put some numbers on it, I ran a micro-benchmark with 100
> parallel threads, where each thread calls madvise() on its own 1GiB
> chunk of 64KiB mTHP-backed memory. The performance gain is huge:
>
> 1) MADV_DONTNEED saw its average time drop from 0.0508s to 0.0270s (~47%
> faster)
> 2) MADV_FREE saw its average time drop from 0.3078s to 0.1095s (~64%
> faster)
Thanks for the report, Lance. I assume your micro-benchmark includes some
explicit or implicit operations that may require mmap_write_lock(), since
mmap_read_lock() only waits for writers and does not block other
mmap_read_lock() calls.
By the way, I would expect that per-VMA locking for madvise_dontneed or
madvise_free would benefit nearly all Linux and Android systems, as long
as they use a dynamic C/Java memory allocator.
>
> Thanks,
> Lance
Thanks
Barry
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-06-17 13:38 ` Lorenzo Stoakes
2025-06-18 2:25 ` Lance Yang
@ 2025-06-18 10:11 ` Barry Song
2025-06-18 10:33 ` Lorenzo Stoakes
1 sibling, 1 reply; 26+ messages in thread
From: Barry Song @ 2025-06-18 10:11 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, Qi Zheng, Lance Yang
On Tue, Jun 17, 2025 at 9:39 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> +cc Lance
>
> Hi Barry,
>
> This needs a quick fixpatch, as discovered by Lance in [0], which I did an
> analysis on [1].
>
> Basically, _theoretically_ though not currently in practice, we might end up
> accessing uninitialised state in the struct vm_area_struct **prev value passed
> around madvise.
>
> The solution for now is to simply initialise it in the VMA read lock case, as
> all users of this set *prev = vma prior to performing the operation.
>
> Cheers, Lorenzo
>
> [0]: https://lore.kernel.org/all/20250617020544.57305-1-lance.yang@linux.dev/
> [1]: https://lore.kernel.org/all/6181fd25-6527-4cd0-b67f-2098191d262d@lucifer.local/
>
> On Sun, Jun 08, 2025 at 10:01:50AM +1200, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Certain madvise operations, especially MADV_DONTNEED, occur far more
> > frequently than other madvise options, particularly in native and Java
> > heaps for dynamic memory management.
> >
> > Currently, the mmap_lock is always held during these operations, even when
> > unnecessary. This causes lock contention and can lead to severe priority
> > inversion, where low-priority threads—such as Android's HeapTaskDaemon—
> > hold the lock and block higher-priority threads.
> >
> > This patch enables the use of per-VMA locks when the advised range lies
> > entirely within a single VMA, avoiding the need for full VMA traversal. In
> > practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
> >
> > Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
> > benefits from this per-VMA lock optimization. After extended runtime,
> > 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> > only 1,231 fell back to mmap_lock.
> >
> > To simplify handling, the implementation falls back to the standard
> > mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> > userfaultfd_remove().
> >
> > Many thanks to Lorenzo's work[1] on:
> > "Refactor the madvise() code to retain state about the locking mode
> > utilised for traversing VMAs.
> >
> > Then use this mechanism to permit VMA locking to be done later in the
> > madvise() logic and also to allow altering of the locking mode to permit
> > falling back to an mmap read lock if required."
> >
> > One important point, as pointed out by Jann[2], is that
> > untagged_addr_remote() requires holding mmap_lock. This is because
> > address tagging on x86 and RISC-V is quite complex.
> >
> > Until untagged_addr_remote() becomes atomic—which seems unlikely in
> > the near future—we cannot support per-VMA locks for remote processes.
> > So for now, only local processes are supported.
> >
> > Link: https://lore.kernel.org/all/0b96ce61-a52c-4036-b5b6-5c50783db51f@lucifer.local/ [1]
> > Link: https://lore.kernel.org/all/CAG48ez11zi-1jicHUZtLhyoNPGGVB+ROeAJCUw48bsjk4bbEkA@mail.gmail.com/ [2]
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Lokesh Gidra <lokeshgidra@google.com>
> > Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> > Cc: Qi Zheng <zhengqi.arch@bytedance.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> > -v4:
> > * collect Lorenzo's RB;
> > * use visit() for per-vma path
> >
> > mm/madvise.c | 195 ++++++++++++++++++++++++++++++++++++++-------------
> > 1 file changed, 147 insertions(+), 48 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 56d9ca2557b9..8382614b71d1 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -48,38 +48,19 @@ struct madvise_walk_private {
> > bool pageout;
> > };
> >
> > +enum madvise_lock_mode {
> > + MADVISE_NO_LOCK,
> > + MADVISE_MMAP_READ_LOCK,
> > + MADVISE_MMAP_WRITE_LOCK,
> > + MADVISE_VMA_READ_LOCK,
> > +};
> > +
> > struct madvise_behavior {
> > int behavior;
> > struct mmu_gather *tlb;
> > + enum madvise_lock_mode lock_mode;
> > };
> >
> > -/*
> > - * Any behaviour which results in changes to the vma->vm_flags needs to
> > - * take mmap_lock for writing. Others, which simply traverse vmas, need
> > - * to only take it for reading.
> > - */
> > -static int madvise_need_mmap_write(int behavior)
> > -{
> > - switch (behavior) {
> > - case MADV_REMOVE:
> > - case MADV_WILLNEED:
> > - case MADV_DONTNEED:
> > - case MADV_DONTNEED_LOCKED:
> > - case MADV_COLD:
> > - case MADV_PAGEOUT:
> > - case MADV_FREE:
> > - case MADV_POPULATE_READ:
> > - case MADV_POPULATE_WRITE:
> > - case MADV_COLLAPSE:
> > - case MADV_GUARD_INSTALL:
> > - case MADV_GUARD_REMOVE:
> > - return 0;
> > - default:
> > - /* be safe, default to 1. list exceptions explicitly */
> > - return 1;
> > - }
> > -}
> > -
> > #ifdef CONFIG_ANON_VMA_NAME
> > struct anon_vma_name *anon_vma_name_alloc(const char *name)
> > {
> > @@ -1486,6 +1467,44 @@ static bool process_madvise_remote_valid(int behavior)
> > }
> > }
> >
> > +/*
> > + * Try to acquire a VMA read lock if possible.
> > + *
> > + * We only support this lock over a single VMA, which the input range must
> > + * span either partially or fully.
> > + *
> > + * This function always returns with an appropriate lock held. If a VMA read
> > + * lock could be acquired, we return the locked VMA.
> > + *
> > + * If a VMA read lock could not be acquired, we return NULL and expect caller to
> > + * fallback to mmap lock behaviour.
> > + */
> > +static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm,
> > + struct madvise_behavior *madv_behavior,
> > + unsigned long start, unsigned long end)
> > +{
> > + struct vm_area_struct *vma;
> > +
> > + vma = lock_vma_under_rcu(mm, start);
> > + if (!vma)
> > + goto take_mmap_read_lock;
> > + /*
> > + * Must span only a single VMA; uffd and remote processes are
> > + * unsupported.
> > + */
> > + if (end > vma->vm_end || current->mm != mm ||
> > + userfaultfd_armed(vma)) {
> > + vma_end_read(vma);
> > + goto take_mmap_read_lock;
> > + }
> > + return vma;
> > +
> > +take_mmap_read_lock:
> > + mmap_read_lock(mm);
> > + madv_behavior->lock_mode = MADVISE_MMAP_READ_LOCK;
> > + return NULL;
> > +}
> > +
> > /*
> > * Walk the vmas in range [start,end), and call the visit function on each one.
> > * The visit function will get start and end parameters that cover the overlap
> > @@ -1496,7 +1515,8 @@ static bool process_madvise_remote_valid(int behavior)
> > */
> > static
> > int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> > - unsigned long end, void *arg,
> > + unsigned long end, struct madvise_behavior *madv_behavior,
> > + void *arg,
> > int (*visit)(struct vm_area_struct *vma,
> > struct vm_area_struct **prev, unsigned long start,
> > unsigned long end, void *arg))
> > @@ -1505,6 +1525,20 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> > struct vm_area_struct *prev;
> > unsigned long tmp;
> > int unmapped_error = 0;
> > + int error;
> > +
> > + /*
> > + * If VMA read lock is supported, apply madvise to a single VMA
> > + * tentatively, avoiding walking VMAs.
> > + */
> > + if (madv_behavior && madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
> > + vma = try_vma_read_lock(mm, madv_behavior, start, end);
> > + if (vma) {
> > + error = visit(vma, &prev, start, end, arg);
> > + vma_end_read(vma);
> > + return error;
> > + }
> > + }
> >
> > /*
> > * If the interval [start,end) covers some unmapped address
> > @@ -1516,8 +1550,6 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> > prev = vma;
> >
> > for (;;) {
> > - int error;
> > -
> > /* Still start < end. */
> > if (!vma)
> > return -ENOMEM;
> > @@ -1598,34 +1630,86 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> > if (end == start)
> > return 0;
> >
> > - return madvise_walk_vmas(mm, start, end, anon_name,
> > + return madvise_walk_vmas(mm, start, end, NULL, anon_name,
> > madvise_vma_anon_name);
> > }
> > #endif /* CONFIG_ANON_VMA_NAME */
> >
> > -static int madvise_lock(struct mm_struct *mm, int behavior)
> > +
> > +/*
> > + * Any behaviour which results in changes to the vma->vm_flags needs to
> > + * take mmap_lock for writing. Others, which simply traverse vmas, need
> > + * to only take it for reading.
> > + */
> > +static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavior)
> > {
> > + int behavior = madv_behavior->behavior;
> > +
> > if (is_memory_failure(behavior))
> > - return 0;
> > + return MADVISE_NO_LOCK;
> >
> > - if (madvise_need_mmap_write(behavior)) {
> > + switch (behavior) {
> > + case MADV_REMOVE:
> > + case MADV_WILLNEED:
> > + case MADV_COLD:
> > + case MADV_PAGEOUT:
> > + case MADV_FREE:
> > + case MADV_POPULATE_READ:
> > + case MADV_POPULATE_WRITE:
> > + case MADV_COLLAPSE:
> > + case MADV_GUARD_INSTALL:
> > + case MADV_GUARD_REMOVE:
> > + return MADVISE_MMAP_READ_LOCK;
> > + case MADV_DONTNEED:
> > + case MADV_DONTNEED_LOCKED:
> > + return MADVISE_VMA_READ_LOCK;
> > + default:
> > + return MADVISE_MMAP_WRITE_LOCK;
> > + }
> > +}
> > +
> > +static int madvise_lock(struct mm_struct *mm,
> > + struct madvise_behavior *madv_behavior)
> > +{
> > + enum madvise_lock_mode lock_mode = get_lock_mode(madv_behavior);
> > +
> > + switch (lock_mode) {
> > + case MADVISE_NO_LOCK:
> > + break;
> > + case MADVISE_MMAP_WRITE_LOCK:
> > if (mmap_write_lock_killable(mm))
> > return -EINTR;
> > - } else {
> > + break;
> > + case MADVISE_MMAP_READ_LOCK:
> > mmap_read_lock(mm);
> > + break;
> > + case MADVISE_VMA_READ_LOCK:
> > + /* We will acquire the lock per-VMA in madvise_walk_vmas(). */
> > + break;
> > }
> > +
> > + madv_behavior->lock_mode = lock_mode;
> > return 0;
> > }
> >
> > -static void madvise_unlock(struct mm_struct *mm, int behavior)
> > +static void madvise_unlock(struct mm_struct *mm,
> > + struct madvise_behavior *madv_behavior)
> > {
> > - if (is_memory_failure(behavior))
> > + switch (madv_behavior->lock_mode) {
> > + case MADVISE_NO_LOCK:
> > return;
> > -
> > - if (madvise_need_mmap_write(behavior))
> > + case MADVISE_MMAP_WRITE_LOCK:
> > mmap_write_unlock(mm);
> > - else
> > + break;
> > + case MADVISE_MMAP_READ_LOCK:
> > mmap_read_unlock(mm);
> > + break;
> > + case MADVISE_VMA_READ_LOCK:
> > + /* We will drop the lock per-VMA in madvise_walk_vmas(). */
> > + break;
> > + }
> > +
> > + madv_behavior->lock_mode = MADVISE_NO_LOCK;
> > }
> >
> > static bool madvise_batch_tlb_flush(int behavior)
> > @@ -1710,6 +1794,21 @@ static bool is_madvise_populate(int behavior)
> > }
> > }
> >
> > +/*
> > + * untagged_addr_remote() assumes mmap_lock is already held. On
> > + * architectures like x86 and RISC-V, tagging is tricky because each
> > + * mm may have a different tagging mask. However, we might only hold
> > + * the per-VMA lock (currently only local processes are supported),
> > + * so untagged_addr is used to avoid the mmap_lock assertion for
> > + * local processes.
> > + */
> > +static inline unsigned long get_untagged_addr(struct mm_struct *mm,
> > + unsigned long start)
> > +{
> > + return current->mm == mm ? untagged_addr(start) :
> > + untagged_addr_remote(mm, start);
> > +}
> > +
> > static int madvise_do_behavior(struct mm_struct *mm,
> > unsigned long start, size_t len_in,
> > struct madvise_behavior *madv_behavior)
> > @@ -1721,7 +1820,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
> >
> > if (is_memory_failure(behavior))
> > return madvise_inject_error(behavior, start, start + len_in);
> > - start = untagged_addr_remote(mm, start);
> > + start = get_untagged_addr(mm, start);
> > end = start + PAGE_ALIGN(len_in);
> >
> > blk_start_plug(&plug);
> > @@ -1729,7 +1828,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
> > error = madvise_populate(mm, start, end, behavior);
> > else
> > error = madvise_walk_vmas(mm, start, end, madv_behavior,
> > - madvise_vma_behavior);
> > + madv_behavior, madvise_vma_behavior);
> > blk_finish_plug(&plug);
> > return error;
> > }
> > @@ -1817,13 +1916,13 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> >
> > if (madvise_should_skip(start, len_in, behavior, &error))
> > return error;
> > - error = madvise_lock(mm, behavior);
> > + error = madvise_lock(mm, &madv_behavior);
> > if (error)
> > return error;
> > madvise_init_tlb(&madv_behavior, mm);
> > error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
> > madvise_finish_tlb(&madv_behavior);
> > - madvise_unlock(mm, behavior);
> > + madvise_unlock(mm, &madv_behavior);
> >
> > return error;
> > }
> > @@ -1847,7 +1946,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> >
> > total_len = iov_iter_count(iter);
> >
> > - ret = madvise_lock(mm, behavior);
> > + ret = madvise_lock(mm, &madv_behavior);
> > if (ret)
> > return ret;
> > madvise_init_tlb(&madv_behavior, mm);
> > @@ -1880,8 +1979,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> >
> > /* Drop and reacquire lock to unwind race. */
> > madvise_finish_tlb(&madv_behavior);
> > - madvise_unlock(mm, behavior);
> > - ret = madvise_lock(mm, behavior);
> > + madvise_unlock(mm, &madv_behavior);
> > + ret = madvise_lock(mm, &madv_behavior);
> > if (ret)
> > goto out;
> > madvise_init_tlb(&madv_behavior, mm);
> > @@ -1892,7 +1991,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> > iov_iter_advance(iter, iter_iov_len(iter));
> > }
> > madvise_finish_tlb(&madv_behavior);
> > - madvise_unlock(mm, behavior);
> > + madvise_unlock(mm, &madv_behavior);
> >
> > out:
> > ret = (total_len - iov_iter_count(iter)) ? : ret;
> > --
> > 2.39.3 (Apple Git-146)
> >
>
> ----8<----
> From 1ffcaea75ebdaffe15805386f6d7733883d265a5 Mon Sep 17 00:00:00 2001
> From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Date: Tue, 17 Jun 2025 14:35:13 +0100
> Subject: [PATCH] mm/madvise: avoid any chance of uninitialised pointer deref
>
> If we were to extend madvise() to support more operations under VMA lock,
> we could potentially dereference prev to uninitialised state in
> madvise_update_vma().
>
> Avoid this by explicitly setting prev to vma before invoking the visit()
> function.
>
> This has no impact on behaviour, as all visitors compatible with a VMA lock
> do not require prev to be set to the previous VMA and at any rate we only
> examine a single VMA in VMA lock mode.
>
> Reported-by: Lance Yang <ioworker0@gmail.com>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/madvise.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index efe5d64e1175..0970623a0e98 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1333,6 +1333,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> return madvise_guard_remove(vma, prev, start, end);
> }
>
> + /* We cannot provide prev in this lock mode. */
> + VM_WARN_ON_ONCE(arg->lock_mode == MADVISE_VMA_READ_LOCK);
Thanks, Lorenzo.
Do we even reach this point for MADVISE_MMAP_READ_LOCK cases?
madvise_update_vma() attempts to merge or split VMAs—wouldn't that be
a scenario that requires a write lock?
The prerequisite for using a VMA read lock is that the operation must
be safe under an mmap read lock as well.
> anon_name = anon_vma_name(vma);
> anon_vma_name_get(anon_name);
> error = madvise_update_vma(vma, prev, start, end, new_flags,
> @@ -1549,6 +1551,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> if (madv_behavior && madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
> vma = try_vma_read_lock(mm, madv_behavior, start, end);
> if (vma) {
> + prev = vma;
> error = visit(vma, &prev, start, end, arg);
> vma_end_read(vma);
> return error;
> --
> 2.49.0
Thanks
Barry
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-06-18 9:52 ` Barry Song
@ 2025-06-18 10:18 ` David Hildenbrand
2025-06-18 10:30 ` Barry Song
2025-06-18 13:05 ` Lance Yang
0 siblings, 2 replies; 26+ messages in thread
From: David Hildenbrand @ 2025-06-18 10:18 UTC (permalink / raw)
To: Barry Song, Lance Yang
Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Suren Baghdasaryan, Lokesh Gidra,
Tangquan Zheng, Qi Zheng, Lance Yang, Lorenzo Stoakes, Zi Li
On 18.06.25 11:52, Barry Song wrote:
> On Wed, Jun 18, 2025 at 10:25 AM Lance Yang <lance.yang@linux.dev> wrote:
>>
>> Hi all,
>>
>> Crazy, the per-VMA lock for madvise is an absolute game-changer ;)
>>
>> On 2025/6/17 21:38, Lorenzo Stoakes wrote:
>> [...]
>>>
>>> On Sun, Jun 08, 2025 at 10:01:50AM +1200, Barry Song wrote:
>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>
>>>> Certain madvise operations, especially MADV_DONTNEED, occur far more
>>>> frequently than other madvise options, particularly in native and Java
>>>> heaps for dynamic memory management.
>>>>
>>>> Currently, the mmap_lock is always held during these operations, even when
>>>> unnecessary. This causes lock contention and can lead to severe priority
>>>> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
>>>> hold the lock and block higher-priority threads.
>>>>
>>>> This patch enables the use of per-VMA locks when the advised range lies
>>>> entirely within a single VMA, avoiding the need for full VMA traversal. In
>>>> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
>>>>
>>>> Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
>>>> benefits from this per-VMA lock optimization. After extended runtime,
>>>> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
>>>> only 1,231 fell back to mmap_lock.
>>>>
>>>> To simplify handling, the implementation falls back to the standard
>>>> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
>>>> userfaultfd_remove().
>>>>
>>>> Many thanks to Lorenzo's work[1] on:
>>>> "Refactor the madvise() code to retain state about the locking mode
>>>> utilised for traversing VMAs.
>>>>
>>>> Then use this mechanism to permit VMA locking to be done later in the
>>>> madvise() logic and also to allow altering of the locking mode to permit
>>>> falling back to an mmap read lock if required."
>>>>
>>>> One important point, as pointed out by Jann[2], is that
>>>> untagged_addr_remote() requires holding mmap_lock. This is because
>>>> address tagging on x86 and RISC-V is quite complex.
>>>>
>>>> Until untagged_addr_remote() becomes atomic—which seems unlikely in
>>>> the near future—we cannot support per-VMA locks for remote processes.
>>>> So for now, only local processes are supported.
>>
>> Just to put some numbers on it, I ran a micro-benchmark with 100
>> parallel threads, where each thread calls madvise() on its own 1GiB
>> chunk of 64KiB mTHP-backed memory. The performance gain is huge:
>>
>> 1) MADV_DONTNEED saw its average time drop from 0.0508s to 0.0270s (~47%
>> faster)
>> 2) MADV_FREE saw its average time drop from 0.3078s to 0.1095s (~64%
>> faster)
>
> Thanks for the report, Lance. I assume your micro-benchmark includes some
> explicit or implicit operations that may require mmap_write_lock().
> As mmap_read_lock() only waits for writers and does not block other
> mmap_read_lock() calls.
The numbers rather indicate that one test was run with (m)THPs enabled
and the other not? Just a thought. The locking overhead, in my
experience, is not that significant.
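(One way to sanity-check that would be to compare the per-size counters under
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/ before and after each
run, assuming the kernel is recent enough to expose the per-size mTHP stats.)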
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-06-18 10:18 ` David Hildenbrand
@ 2025-06-18 10:30 ` Barry Song
2025-06-18 10:32 ` Barry Song
2025-06-18 13:05 ` Lance Yang
1 sibling, 1 reply; 26+ messages in thread
From: Barry Song @ 2025-06-18 10:30 UTC (permalink / raw)
To: David Hildenbrand
Cc: Lance Yang, akpm, linux-mm, linux-kernel, Barry Song,
Liam R. Howlett, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, Qi Zheng, Lance Yang,
Lorenzo Stoakes, Zi Li
On Wed, Jun 18, 2025 at 6:18 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 18.06.25 11:52, Barry Song wrote:
> > On Wed, Jun 18, 2025 at 10:25 AM Lance Yang <lance.yang@linux.dev> wrote:
> >>
> >> Hi all,
> >>
> >> Crazy, the per-VMA lock for madvise is an absolute game-changer ;)
> >>
> >> On 2025/6/17 21:38, Lorenzo Stoakes wrote:
> >> [...]
> >>>
> >>> On Sun, Jun 08, 2025 at 10:01:50AM +1200, Barry Song wrote:
> >>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>
> >>>> Certain madvise operations, especially MADV_DONTNEED, occur far more
> >>>> frequently than other madvise options, particularly in native and Java
> >>>> heaps for dynamic memory management.
> >>>>
> >>>> Currently, the mmap_lock is always held during these operations, even when
> >>>> unnecessary. This causes lock contention and can lead to severe priority
> >>>> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
> >>>> hold the lock and block higher-priority threads.
> >>>>
> >>>> This patch enables the use of per-VMA locks when the advised range lies
> >>>> entirely within a single VMA, avoiding the need for full VMA traversal. In
> >>>> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
> >>>>
> >>>> Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
> >>>> benefits from this per-VMA lock optimization. After extended runtime,
> >>>> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> >>>> only 1,231 fell back to mmap_lock.
> >>>>
> >>>> To simplify handling, the implementation falls back to the standard
> >>>> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> >>>> userfaultfd_remove().
> >>>>
> >>>> Many thanks to Lorenzo's work[1] on:
> >>>> "Refactor the madvise() code to retain state about the locking mode
> >>>> utilised for traversing VMAs.
> >>>>
> >>>> Then use this mechanism to permit VMA locking to be done later in the
> >>>> madvise() logic and also to allow altering of the locking mode to permit
> >>>> falling back to an mmap read lock if required."
> >>>>
> >>>> One important point, as pointed out by Jann[2], is that
> >>>> untagged_addr_remote() requires holding mmap_lock. This is because
> >>>> address tagging on x86 and RISC-V is quite complex.
> >>>>
> >>>> Until untagged_addr_remote() becomes atomic—which seems unlikely in
> >>>> the near future—we cannot support per-VMA locks for remote processes.
> >>>> So for now, only local processes are supported.
> >>
> >> Just to put some numbers on it, I ran a micro-benchmark with 100
> >> parallel threads, where each thread calls madvise() on its own 1GiB
> >> chunk of 64KiB mTHP-backed memory. The performance gain is huge:
> >>
> >> 1) MADV_DONTNEED saw its average time drop from 0.0508s to 0.0270s (~47%
> >> faster)
> >> 2) MADV_FREE saw its average time drop from 0.3078s to 0.1095s (~64%
> >> faster)
> >
> > Thanks for the report, Lance. I assume your micro-benchmark includes some
> > explicit or implicit operations that may require mmap_write_lock().
> > As mmap_read_lock() only waits for writers and does not block other
> > mmap_read_lock() calls.
>
> The number rather indicate that one test was run with (m)THPs enabled
> and the other not? Just a thought. The locking overhead from my
> experience is not that significant.
Right. I don't expect pure madvise_dontneed/free—without any additional
behavior requiring mmap_write_lock—to improve performance significantly.
The main benefit would be avoiding contention on the write lock.
Consider this scenario:
timestamp1: Thread A acquires the read lock
timestamp2: Thread B attempts to acquire the write lock
timestamp3: Threads C, D, and E attempt to acquire the read lock
In this case, thread B must wait for A, and threads C, D, and E will
wait for both A and B. Any write lock request effectively blocks all
subsequent read acquisitions.
In the worst case, thread A might be a GC thread with a high nice value.
If it's preempted by other threads, the delay can reach several
milliseconds—as we've observed in some cases.
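To make the ordering above concrete, here is a minimal userspace sketch
(illustrative only; the usleep() calls stand in for the timestamps, the long
MADV_DONTNEED in thread A holds mmap_read_lock on a kernel without this patch,
the mprotect() in thread B queues for mmap_write_lock, and the later madvise()
calls then queue behind B):
```
/* Illustrative sketch of read/write mmap_lock queueing, not a benchmark. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

#define BIG   (1L << 30)   /* 1GiB range for the slow MADV_DONTNEED */
#define SMALL (1L << 20)   /* 1MiB range for the short operations */

static char *big_area, *small_area;

static void *thread_a(void *arg)    /* long-running mmap_read_lock holder */
{
    madvise(big_area, BIG, MADV_DONTNEED);
    return NULL;
}

static void *thread_b(void *arg)    /* writer: mprotect() needs mmap_write_lock */
{
    usleep(1000);
    mprotect(small_area, SMALL, PROT_READ);
    return NULL;
}

static void *thread_cde(void *arg)  /* later readers queue behind the writer */
{
    usleep(2000);
    madvise(small_area, SMALL, MADV_DONTNEED);
    return NULL;
}

int main(void)
{
    pthread_t a, b, c[3];
    int i;

    big_area = mmap(NULL, BIG, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    small_area = mmap(NULL, SMALL, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (big_area == MAP_FAILED || small_area == MAP_FAILED)
        return 1;

    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    for (i = 0; i < 3; i++)
        pthread_create(&c[i], NULL, thread_cde, NULL);

    pthread_join(a, NULL);
    pthread_join(b, NULL);
    for (i = 0; i < 3; i++)
        pthread_join(c[i], NULL);
    return 0;
}
```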
>
> --
> Cheers,
>
> David / dhildenb
>
Thanks
Barry
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-06-18 10:30 ` Barry Song
@ 2025-06-18 10:32 ` Barry Song
0 siblings, 0 replies; 26+ messages in thread
From: Barry Song @ 2025-06-18 10:32 UTC (permalink / raw)
To: David Hildenbrand
Cc: Lance Yang, akpm, linux-mm, linux-kernel, Barry Song,
Liam R. Howlett, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, Qi Zheng, Lance Yang,
Lorenzo Stoakes, Zi Li
On Wed, Jun 18, 2025 at 6:30 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Jun 18, 2025 at 6:18 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 18.06.25 11:52, Barry Song wrote:
> > > On Wed, Jun 18, 2025 at 10:25 AM Lance Yang <lance.yang@linux.dev> wrote:
> > >>
> > >> Hi all,
> > >>
> > >> Crazy, the per-VMA lock for madvise is an absolute game-changer ;)
> > >>
> > >> On 2025/6/17 21:38, Lorenzo Stoakes wrote:
> > >> [...]
> > >>>
> > >>> On Sun, Jun 08, 2025 at 10:01:50AM +1200, Barry Song wrote:
> > >>>> From: Barry Song <v-songbaohua@oppo.com>
> > >>>>
> > >>>> Certain madvise operations, especially MADV_DONTNEED, occur far more
> > >>>> frequently than other madvise options, particularly in native and Java
> > >>>> heaps for dynamic memory management.
> > >>>>
> > >>>> Currently, the mmap_lock is always held during these operations, even when
> > >>>> unnecessary. This causes lock contention and can lead to severe priority
> > >>>> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
> > >>>> hold the lock and block higher-priority threads.
> > >>>>
> > >>>> This patch enables the use of per-VMA locks when the advised range lies
> > >>>> entirely within a single VMA, avoiding the need for full VMA traversal. In
> > >>>> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
> > >>>>
> > >>>> Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
> > >>>> benefits from this per-VMA lock optimization. After extended runtime,
> > >>>> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> > >>>> only 1,231 fell back to mmap_lock.
> > >>>>
> > >>>> To simplify handling, the implementation falls back to the standard
> > >>>> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> > >>>> userfaultfd_remove().
> > >>>>
> > >>>> Many thanks to Lorenzo's work[1] on:
> > >>>> "Refactor the madvise() code to retain state about the locking mode
> > >>>> utilised for traversing VMAs.
> > >>>>
> > >>>> Then use this mechanism to permit VMA locking to be done later in the
> > >>>> madvise() logic and also to allow altering of the locking mode to permit
> > >>>> falling back to an mmap read lock if required."
> > >>>>
> > >>>> One important point, as pointed out by Jann[2], is that
> > >>>> untagged_addr_remote() requires holding mmap_lock. This is because
> > >>>> address tagging on x86 and RISC-V is quite complex.
> > >>>>
> > >>>> Until untagged_addr_remote() becomes atomic—which seems unlikely in
> > >>>> the near future—we cannot support per-VMA locks for remote processes.
> > >>>> So for now, only local processes are supported.
> > >>
> > >> Just to put some numbers on it, I ran a micro-benchmark with 100
> > >> parallel threads, where each thread calls madvise() on its own 1GiB
> > >> chunk of 64KiB mTHP-backed memory. The performance gain is huge:
> > >>
> > >> 1) MADV_DONTNEED saw its average time drop from 0.0508s to 0.0270s (~47%
> > >> faster)
> > >> 2) MADV_FREE saw its average time drop from 0.3078s to 0.1095s (~64%
> > >> faster)
> > >
> > > Thanks for the report, Lance. I assume your micro-benchmark includes some
> > > explicit or implicit operations that may require mmap_write_lock().
> > > As mmap_read_lock() only waits for writers and does not block other
> > > mmap_read_lock() calls.
> >
> > The number rather indicate that one test was run with (m)THPs enabled
> > and the other not? Just a thought. The locking overhead from my
> > experience is not that significant.
>
> Right. I don't expect pure madvise_dontneed/free—without any additional
> behavior requiring mmap_write_lock—to improve performance significantly.
> The main benefit would be avoiding contention on the write lock.
>
> Consider this scenario:
> timestamp1: Thread A acquires the read lock
> timestamp2: Thread B attempts to acquire the write lock
> timestamp3: Threads C, D, and E attempt to acquire the read lock
>
> In this case, thread B must wait for A, and threads C, D, and E will
> wait for both A and B. Any write lock request effectively blocks all
> subsequent read acquisitions.
>
> In the worst case, thread A might be a GC thread with a high nice value.
> If it's preempted by other threads, the delay can reach several
> milliseconds—as we've observed in some cases.
Sorry for the typo. I meant a few hundred milliseconds.
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
>
> Thanks
> Barry
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-06-18 10:11 ` Barry Song
@ 2025-06-18 10:33 ` Lorenzo Stoakes
2025-06-18 10:36 ` Barry Song
0 siblings, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-06-18 10:33 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, Qi Zheng, Lance Yang
On Wed, Jun 18, 2025 at 06:11:26PM +0800, Barry Song wrote:
[snip]
> > ----8<----
> > From 1ffcaea75ebdaffe15805386f6d7733883d265a5 Mon Sep 17 00:00:00 2001
> > From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Date: Tue, 17 Jun 2025 14:35:13 +0100
> > Subject: [PATCH] mm/madvise: avoid any chance of uninitialised pointer deref
> >
> > If we were to extend madvise() to support more operations under VMA lock,
> > we could potentially dereference prev to uninitialised state in
> > madvise_update_vma().
> >
> > Avoid this by explicitly setting prev to vma before invoking the visit()
> > function.
> >
> > This has no impact on behaviour, as all visitors compatible with a VMA lock
> > do not require prev to be set to the previous VMA and at any rate we only
> > examine a single VMA in VMA lock mode.
> >
> > Reported-by: Lance Yang <ioworker0@gmail.com>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> > mm/madvise.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index efe5d64e1175..0970623a0e98 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1333,6 +1333,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> > return madvise_guard_remove(vma, prev, start, end);
> > }
> >
> > + /* We cannot provide prev in this lock mode. */
> > + VM_WARN_ON_ONCE(arg->lock_mode == MADVISE_VMA_READ_LOCK);
>
> Thanks, Lorenzo.
> Do we even reach this point for MADVISE_MMAP_READ_LOCK cases?
> madvise_update_vma() attempts to merge or split VMAs—wouldn't that be
> a scenario that requires a write lock?
Well, we're relying on happening to reach here with the correct lock held, afaict.
I'm going to be doing some follow-up series to clean all this up!
I'd rather keep this in here for now just to ensure we don't miss some stupidity
here.
Thanks!
>
> The prerequisite for using a VMA read lock is that the operation must
> be safe under an mmap read lock as well.
>
> > anon_name = anon_vma_name(vma);
> > anon_vma_name_get(anon_name);
> > error = madvise_update_vma(vma, prev, start, end, new_flags,
> > @@ -1549,6 +1551,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> > if (madv_behavior && madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
> > vma = try_vma_read_lock(mm, madv_behavior, start, end);
> > if (vma) {
> > + prev = vma;
> > error = visit(vma, &prev, start, end, arg);
> > vma_end_read(vma);
> > return error;
> > --
> > 2.49.0
>
> Thanks
> Barry
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-06-18 10:33 ` Lorenzo Stoakes
@ 2025-06-18 10:36 ` Barry Song
0 siblings, 0 replies; 26+ messages in thread
From: Barry Song @ 2025-06-18 10:36 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, Qi Zheng, Lance Yang
On Wed, Jun 18, 2025 at 6:33 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Jun 18, 2025 at 06:11:26PM +0800, Barry Song wrote:
> [sip]
> > > ----8<----
> > > From 1ffcaea75ebdaffe15805386f6d7733883d265a5 Mon Sep 17 00:00:00 2001
> > > From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > Date: Tue, 17 Jun 2025 14:35:13 +0100
> > > Subject: [PATCH] mm/madvise: avoid any chance of uninitialised pointer deref
> > >
> > > If we were to extend madvise() to support more operations under VMA lock,
> > > we could potentially dereference prev to uninitialised state in
> > > madvise_update_vma().
> > >
> > > Avoid this by explicitly setting prev to vma before invoking the visit()
> > > function.
> > >
> > > This has no impact on behaviour, as all visitors compatible with a VMA lock
> > > do not require prev to be set to the previous VMA and at any rate we only
> > > examine a single VMA in VMA lock mode.
> > >
> > > Reported-by: Lance Yang <ioworker0@gmail.com>
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > ---
> > > mm/madvise.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index efe5d64e1175..0970623a0e98 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -1333,6 +1333,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> > > return madvise_guard_remove(vma, prev, start, end);
> > > }
> > >
> > > + /* We cannot provide prev in this lock mode. */
> > > + VM_WARN_ON_ONCE(arg->lock_mode == MADVISE_VMA_READ_LOCK);
> >
> > Thanks, Lorenzo.
> > Do we even reach this point for MADVISE_MMAP_READ_LOCK cases?
> > madvise_update_vma() attempts to merge or split VMAs—wouldn't that be
> > a scenario that requires a write lock?
>
> Well we're relying on happening to reach here with the correct lock afaict.
>
> I'm going to be doing some follow-up series to clean all this up!
>
> I'd rather keep this in here for now just to ensure we don't miss some stupidity
> here.
I have no objection to keeping this as-is—just curious if using
VM_WARN_ON_ONCE(arg->lock_mode != MADVISE_MMAP_WRITE_LOCK)
would be more accurate.
In any case, your cleanup series will address this, so it's probably
not something we need to handle right now.
>
> Thanks!
>
> >
> > The prerequisite for using a VMA read lock is that the operation must
> > be safe under an mmap read lock as well.
> >
> > > anon_name = anon_vma_name(vma);
> > > anon_vma_name_get(anon_name);
> > > error = madvise_update_vma(vma, prev, start, end, new_flags,
> > > @@ -1549,6 +1551,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> > > if (madv_behavior && madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
> > > vma = try_vma_read_lock(mm, madv_behavior, start, end);
> > > if (vma) {
> > > + prev = vma;
> > > error = visit(vma, &prev, start, end, arg);
> > > vma_end_read(vma);
> > > return error;
> > > --
> > > 2.49.0
> >
Thanks
Barry
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-06-18 10:18 ` David Hildenbrand
2025-06-18 10:30 ` Barry Song
@ 2025-06-18 13:05 ` Lance Yang
2025-06-18 13:13 ` David Hildenbrand
1 sibling, 1 reply; 26+ messages in thread
From: Lance Yang @ 2025-06-18 13:05 UTC (permalink / raw)
To: David Hildenbrand, Barry Song
Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Suren Baghdasaryan, Lokesh Gidra,
Tangquan Zheng, Qi Zheng, Lance Yang, Lorenzo Stoakes, Zi Li
On 2025/6/18 18:18, David Hildenbrand wrote:
> On 18.06.25 11:52, Barry Song wrote:
>> On Wed, Jun 18, 2025 at 10:25 AM Lance Yang <lance.yang@linux.dev> wrote:
>>>
>>> Hi all,
>>>
>>> Crazy, the per-VMA lock for madvise is an absolute game-changer ;)
>>>
>>> On 2025/6/17 21:38, Lorenzo Stoakes wrote:
>>> [...]
>>>>
>>>> On Sun, Jun 08, 2025 at 10:01:50AM +1200, Barry Song wrote:
>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>
>>>>> Certain madvise operations, especially MADV_DONTNEED, occur far more
>>>>> frequently than other madvise options, particularly in native and Java
>>>>> heaps for dynamic memory management.
>>>>>
>>>>> Currently, the mmap_lock is always held during these operations,
>>>>> even when
>>>>> unnecessary. This causes lock contention and can lead to severe
>>>>> priority
>>>>> inversion, where low-priority threads—such as Android's
>>>>> HeapTaskDaemon—
>>>>> hold the lock and block higher-priority threads.
>>>>>
>>>>> This patch enables the use of per-VMA locks when the advised range
>>>>> lies
>>>>> entirely within a single VMA, avoiding the need for full VMA
>>>>> traversal. In
>>>>> practice, userspace heaps rarely issue MADV_DONTNEED across
>>>>> multiple VMAs.
>>>>>
>>>>> Tangquan’s testing shows that over 99.5% of memory reclaimed by
>>>>> Android
>>>>> benefits from this per-VMA lock optimization. After extended runtime,
>>>>> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
>>>>> only 1,231 fell back to mmap_lock.
>>>>>
>>>>> To simplify handling, the implementation falls back to the standard
>>>>> mmap_lock if userfaultfd is enabled on the VMA, avoiding the
>>>>> complexity of
>>>>> userfaultfd_remove().
>>>>>
>>>>> Many thanks to Lorenzo's work[1] on:
>>>>> "Refactor the madvise() code to retain state about the locking mode
>>>>> utilised for traversing VMAs.
>>>>>
>>>>> Then use this mechanism to permit VMA locking to be done later in the
>>>>> madvise() logic and also to allow altering of the locking mode to
>>>>> permit
>>>>> falling back to an mmap read lock if required."
>>>>>
>>>>> One important point, as pointed out by Jann[2], is that
>>>>> untagged_addr_remote() requires holding mmap_lock. This is because
>>>>> address tagging on x86 and RISC-V is quite complex.
>>>>>
>>>>> Until untagged_addr_remote() becomes atomic—which seems unlikely in
>>>>> the near future—we cannot support per-VMA locks for remote processes.
>>>>> So for now, only local processes are supported.
>>>
>>> Just to put some numbers on it, I ran a micro-benchmark with 100
>>> parallel threads, where each thread calls madvise() on its own 1GiB
Correction: it uses 256MiB chunks per thread, not 1GiB ...
>>> chunk of 64KiB mTHP-backed memory. The performance gain is huge:
>>>
>>> 1) MADV_DONTNEED saw its average time drop from 0.0508s to 0.0270s (~47%
>>> faster)
>>> 2) MADV_FREE saw its average time drop from 0.3078s to 0.1095s (~64%
>>> faster)
>>
>> Thanks for the report, Lance. I assume your micro-benchmark includes some
>> explicit or implicit operations that may require mmap_write_lock().
>> As mmap_read_lock() only waits for writers and does not block other
>> mmap_read_lock() calls.
>
> The number rather indicate that one test was run with (m)THPs enabled
> and the other not? Just a thought. The locking overhead from my
> experience is not that significant.
>
Both tests were run with 64KiB mTHP enabled on an Intel(R) Xeon(R)
Silver 4314 CPU. The micro-benchmark code is as follows:
```
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/mman.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NUM_THREADS 100
#define MMAP_SIZE   (512L * 1024 * 1024)
#define WRITE_START (128L * 1024 * 1024)
#define WRITE_SIZE  (256L * 1024 * 1024)

#define MADV_HUGEPAGE 14
#define MADV_DONTNEED 4
#define MADV_FREE     8

typedef struct {
    int id;
    int madvise_option;
} thread_data_t;

void *thread_function(void *arg) {
    thread_data_t *data = (thread_data_t *)arg;

    uint8_t *mmap_area = mmap(NULL, MMAP_SIZE, PROT_NONE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mmap_area == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }

    if (mprotect(mmap_area + WRITE_START, WRITE_SIZE,
                 PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect");
        munmap(mmap_area, MMAP_SIZE);
        return NULL;
    }

    if (madvise(mmap_area + WRITE_START, WRITE_SIZE, MADV_HUGEPAGE) != 0) {
        perror("madvise hugepage");
        munmap(mmap_area, MMAP_SIZE);
        return NULL;
    }

    /* Touch every byte so the 256MiB chunk is fully populated. */
    for (size_t i = 0; i < WRITE_SIZE; i++) {
        mmap_area[WRITE_START + i] = 255;
    }

    /* Time a single madvise() call over the whole chunk. */
    struct timespec start_time, end_time;
    clock_gettime(CLOCK_MONOTONIC, &start_time);
    if (madvise(mmap_area + WRITE_START, WRITE_SIZE,
                data->madvise_option) != 0) {
        perror("madvise");
    }
    clock_gettime(CLOCK_MONOTONIC, &end_time);

    double elapsed_time = (end_time.tv_sec - start_time.tv_sec) +
                          (end_time.tv_nsec - start_time.tv_nsec) / 1e9;
    printf("Thread %d elapsed time: %.6f seconds\n", data->id, elapsed_time);

    munmap(mmap_area, MMAP_SIZE);
    return NULL;
}

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <madvise_option>\n", argv[0]);
        fprintf(stderr, "  1: MADV_DONTNEED\n");
        fprintf(stderr, "  2: MADV_FREE\n");
        return EXIT_FAILURE;
    }

    int madvise_option;
    if (atoi(argv[1]) == 1) {
        madvise_option = MADV_DONTNEED;
    } else if (atoi(argv[1]) == 2) {
        madvise_option = MADV_FREE;
    } else {
        fprintf(stderr, "Invalid madvise_option. Use 1 for MADV_DONTNEED or 2 for MADV_FREE.\n");
        return EXIT_FAILURE;
    }

    pthread_t threads[NUM_THREADS];
    thread_data_t thread_data[NUM_THREADS];
    int i;

    for (i = 0; i < NUM_THREADS; i++) {
        thread_data[i].id = i;
        thread_data[i].madvise_option = madvise_option;
        pthread_create(&threads[i], NULL, thread_function, &thread_data[i]);
    }

    for (i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }

    /* Pause before exiting (e.g. to inspect memory usage). */
    sleep(10);
    return 0;
}
```
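For anyone wanting to reproduce the numbers: build with something like
"gcc -O2 -pthread bench.c -o bench" and run "./bench 1" for MADV_DONTNEED or
"./bench 2" for MADV_FREE (the source and binary names are just placeholders),
then average the per-thread times it prints.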
Thanks,
Lance
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-06-18 13:05 ` Lance Yang
@ 2025-06-18 13:13 ` David Hildenbrand
0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2025-06-18 13:13 UTC (permalink / raw)
To: Lance Yang, Barry Song
Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Suren Baghdasaryan, Lokesh Gidra,
Tangquan Zheng, Qi Zheng, Lance Yang, Lorenzo Stoakes, Zi Li
On 18.06.25 15:05, Lance Yang wrote:
>
>
> On 2025/6/18 18:18, David Hildenbrand wrote:
>> On 18.06.25 11:52, Barry Song wrote:
>>> On Wed, Jun 18, 2025 at 10:25 AM Lance Yang <lance.yang@linux.dev> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Crazy, the per-VMA lock for madvise is an absolute game-changer ;)
>>>>
>>>> On 2025/6/17 21:38, Lorenzo Stoakes wrote:
>>>> [...]
>>>>>
>>>>> On Sun, Jun 08, 2025 at 10:01:50AM +1200, Barry Song wrote:
>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>
>>>>>> Certain madvise operations, especially MADV_DONTNEED, occur far more
>>>>>> frequently than other madvise options, particularly in native and Java
>>>>>> heaps for dynamic memory management.
>>>>>>
>>>>>> Currently, the mmap_lock is always held during these operations,
>>>>>> even when
>>>>>> unnecessary. This causes lock contention and can lead to severe
>>>>>> priority
>>>>>> inversion, where low-priority threads—such as Android's
>>>>>> HeapTaskDaemon—
>>>>>> hold the lock and block higher-priority threads.
>>>>>>
>>>>>> This patch enables the use of per-VMA locks when the advised range
>>>>>> lies
>>>>>> entirely within a single VMA, avoiding the need for full VMA
>>>>>> traversal. In
>>>>>> practice, userspace heaps rarely issue MADV_DONTNEED across
>>>>>> multiple VMAs.
>>>>>>
>>>>>> Tangquan’s testing shows that over 99.5% of memory reclaimed by
>>>>>> Android
>>>>>> benefits from this per-VMA lock optimization. After extended runtime,
>>>>>> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
>>>>>> only 1,231 fell back to mmap_lock.
>>>>>>
>>>>>> To simplify handling, the implementation falls back to the standard
>>>>>> mmap_lock if userfaultfd is enabled on the VMA, avoiding the
>>>>>> complexity of
>>>>>> userfaultfd_remove().
>>>>>>
>>>>>> Many thanks to Lorenzo's work[1] on:
>>>>>> "Refactor the madvise() code to retain state about the locking mode
>>>>>> utilised for traversing VMAs.
>>>>>>
>>>>>> Then use this mechanism to permit VMA locking to be done later in the
>>>>>> madvise() logic and also to allow altering of the locking mode to
>>>>>> permit
>>>>>> falling back to an mmap read lock if required."
>>>>>>
>>>>>> One important point, as pointed out by Jann[2], is that
>>>>>> untagged_addr_remote() requires holding mmap_lock. This is because
>>>>>> address tagging on x86 and RISC-V is quite complex.
>>>>>>
>>>>>> Until untagged_addr_remote() becomes atomic—which seems unlikely in
>>>>>> the near future—we cannot support per-VMA locks for remote processes.
>>>>>> So for now, only local processes are supported.
>>>>
>>>> Just to put some numbers on it, I ran a micro-benchmark with 100
>>>> parallel threads, where each thread calls madvise() on its own 1GiB
>
> Correction: it uses 256MiB chunks per thread, not 1GiB ...
>
>>>> chunk of 64KiB mTHP-backed memory. The performance gain is huge:
>>>>
>>>> 1) MADV_DONTNEED saw its average time drop from 0.0508s to 0.0270s (~47%
>>>> faster)
>>>> 2) MADV_FREE saw its average time drop from 0.3078s to 0.1095s (~64%
>>>> faster)
>>>
>>> Thanks for the report, Lance. I assume your micro-benchmark includes some
>>> explicit or implicit operations that may require mmap_write_lock().
>>> As mmap_read_lock() only waits for writers and does not block other
>>> mmap_read_lock() calls.
>>
>> The number rather indicate that one test was run with (m)THPs enabled
>> and the other not? Just a thought. The locking overhead from my
>> experience is not that significant.
>>
>
> Both tests were run with 64KiB mTHP enabled on an Intel(R) Xeon(R)
> Silver 4314 CPU. The micro-benchmark code is following:
Ah, I missed the "100 threads" above. Yeah, there should be plenty of
locking contention with all the mprotect() in there.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-06-07 22:01 [PATCH v4] mm: use per_vma lock for MADV_DONTNEED Barry Song
2025-06-09 7:21 ` Qi Zheng
2025-06-17 13:38 ` Lorenzo Stoakes
@ 2025-08-04 0:58 ` Lai, Yi
2025-08-04 7:19 ` Barry Song
` (2 more replies)
2 siblings, 3 replies; 26+ messages in thread
From: Lai, Yi @ 2025-08-04 0:58 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, linux-kernel, Barry Song, Lorenzo Stoakes,
Liam R. Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng, Qi Zheng,
yi1.lai
Hi Barry Song,
Greetings!
I used Syzkaller and found that there is a general protection fault in __pte_offset_map_lock in linux-next next-20250801.
After bisection, the first bad commit is:
"
a6fde7add78d mm: use per_vma lock for MADV_DONTNEED
"
All detailed info can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250803_193026___pte_offset_map_lock/bzImage_next-20250801
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/250803_193026___pte_offset_map_lock/next-20250801_dmesg.log
"
[ 38.151897] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] SMP KASI
[ 38.153519] KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
[ 38.154605] CPU: 0 UID: 0 PID: 721 Comm: repro Not tainted 6.16.0-next-20250801-next-2025080 #1 PREEMPT(voluntary)
[ 38.155929] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org4
[ 38.157418] RIP: 0010:kasan_byte_accessible+0x15/0x30
[ 38.158125] Code: 03 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 48 b8 00 00 00 00 00 fc0
[ 38.160461] RSP: 0018:ffff88800feef678 EFLAGS: 00010286
[ 38.161220] RAX: dffffc0000000000 RBX: 0000000000000001 RCX: 1ffffffff0dde60c
[ 38.162232] RDX: 0000000000000000 RSI: ffffffff85da1e18 RDI: dffffc0000000003
[ 38.163176] RBP: ffff88800feef698 R08: 0000000000000001 R09: 0000000000000000
[ 38.164195] R10: 0000000000000000 R11: ffff888016a8ba58 R12: 0000000000000018
[ 38.165189] R13: 0000000000000018 R14: ffffffff85da1e18 R15: 0000000000000000
[ 38.166100] FS: 0000000000000000(0000) GS:ffff8880e3b40000(0000) knlGS:0000000000000000
[ 38.167137] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 38.167891] CR2: 00007f97fadfe504 CR3: 0000000007088005 CR4: 0000000000770ef0
[ 38.168812] PKRU: 55555554
[ 38.169275] Call Trace:
[ 38.169647] <TASK>
[ 38.169975] ? __kasan_check_byte+0x19/0x50
[ 38.170581] lock_acquire+0xea/0x310
[ 38.171083] ? rcu_is_watching+0x19/0xc0
[ 38.171615] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 38.172343] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[ 38.173130] _raw_spin_lock+0x38/0x50
[ 38.173707] ? __pte_offset_map_lock+0x1a2/0x3c0
[ 38.174390] __pte_offset_map_lock+0x1a2/0x3c0
[ 38.174987] ? __pfx___pte_offset_map_lock+0x10/0x10
[ 38.175724] ? __pfx_pud_val+0x10/0x10
[ 38.176308] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[ 38.177183] unmap_page_range+0xb60/0x43e0
[ 38.177824] ? __pfx_unmap_page_range+0x10/0x10
[ 38.178485] ? mas_next_slot+0x133a/0x1a50
[ 38.179079] unmap_single_vma.constprop.0+0x15b/0x250
[ 38.179830] unmap_vmas+0x1fa/0x460
[ 38.180373] ? __pfx_unmap_vmas+0x10/0x10
[ 38.180994] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 38.181877] exit_mmap+0x1a2/0xb40
[ 38.182396] ? lock_release+0x14f/0x2c0
[ 38.182929] ? __pfx_exit_mmap+0x10/0x10
[ 38.183474] ? __pfx___mutex_unlock_slowpath+0x10/0x10
[ 38.184188] ? mutex_unlock+0x16/0x20
[ 38.184704] mmput+0x132/0x370
[ 38.185208] do_exit+0x7e7/0x28c0
[ 38.185682] ? __this_cpu_preempt_check+0x21/0x30
[ 38.186328] ? do_group_exit+0x1d8/0x2c0
[ 38.186873] ? __pfx_do_exit+0x10/0x10
[ 38.187401] ? __this_cpu_preempt_check+0x21/0x30
[ 38.188036] ? _raw_spin_unlock_irq+0x2c/0x60
[ 38.188634] ? lockdep_hardirqs_on+0x89/0x110
[ 38.189313] do_group_exit+0xe4/0x2c0
[ 38.189831] __x64_sys_exit_group+0x4d/0x60
[ 38.190413] x64_sys_call+0x2174/0x2180
[ 38.190935] do_syscall_64+0x6d/0x2e0
[ 38.191449] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 38.192118] RIP: 0033:0x7f97fab18a4d
[ 38.192614] Code: Unable to access opcode bytes at 0x7f97fab18a23.
[ 38.193552] RSP: 002b:00007ffd957e62d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ 38.194557] RAX: ffffffffffffffda RBX: 00007f97fabf69e0 RCX: 00007f97fab18a4d
[ 38.195472] RDX: 00000000000000e7 RSI: ffffffffffffff80 RDI: 0000000000000000
[ 38.196388] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020
[ 38.197357] R10: 00007ffd957e6180 R11: 0000000000000246 R12: 00007f97fabf69e0
[ 38.198273] R13: 00007f97fabfbf00 R14: 0000000000000001 R15: 00007f97fabfbee8
[ 38.199205] </TASK>
[ 38.199512] Modules linked in:
[ 38.200069] ---[ end trace 0000000000000000 ]---
[ 38.200678] RIP: 0010:kasan_byte_accessible+0x15/0x30
[ 38.201443] Code: 03 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 48 b8 00 00 00 00 00 fc0
[ 38.203815] RSP: 0018:ffff88800feef678 EFLAGS: 00010286
[ 38.204503] RAX: dffffc0000000000 RBX: 0000000000000001 RCX: 1ffffffff0dde60c
[ 38.205492] RDX: 0000000000000000 RSI: ffffffff85da1e18 RDI: dffffc0000000003
[ 38.206428] RBP: ffff88800feef698 R08: 0000000000000001 R09: 0000000000000000
[ 38.207350] R10: 0000000000000000 R11: ffff888016a8ba58 R12: 0000000000000018
[ 38.208269] R13: 0000000000000018 R14: ffffffff85da1e18 R15: 0000000000000000
[ 38.209270] FS: 0000000000000000(0000) GS:ffff8880e3b40000(0000) knlGS:0000000000000000
[ 38.210329] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 38.211080] CR2: 00007f97fadfe504 CR3: 0000000007088005 CR4: 0000000000770ef0
[ 38.211999] PKRU: 55555554
[ 38.212374] note: repro[721] exited with preempt_count 1
[ 38.213121] Fixing recursive fault but reboot is needed!
[ 38.213815] BUG: using smp_processor_id() in preemptible [00000000] code: repro/721
[ 38.214856] caller is debug_smp_processor_id+0x20/0x30
[ 38.215553] CPU: 0 UID: 0 PID: 721 Comm: repro Tainted: G D 6.16.0-next-20250801-next-2025080 #1 PR
[ 38.215588] Tainted: [D]=DIE
[ 38.215595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org4
[ 38.215611] Call Trace:
[ 38.215617] <TASK>
[ 38.215626] dump_stack_lvl+0x121/0x150
[ 38.215657] dump_stack+0x19/0x20
[ 38.215685] check_preemption_disabled+0x168/0x180
[ 38.215722] ? do_task_dead+0x48/0x120
[ 38.215753] debug_smp_processor_id+0x20/0x30
[ 38.215788] __schedule+0xd9/0x3840
[ 38.215827] ? __pfx___schedule+0x10/0x10
[ 38.215860] ? do_task_dead+0xae/0x120
[ 38.215888] ? do_task_dead+0x48/0x120
[ 38.215917] ? debug_smp_processor_id+0x20/0x30
[ 38.215953] ? rcu_is_watching+0x19/0xc0
[ 38.215978] ? trace_irq_enable+0xd1/0x110
[ 38.216017] ? do_task_dead+0x48/0x120
[ 38.216046] do_task_dead+0xe6/0x120
[ 38.216076] make_task_dead+0x384/0x3c0
[ 38.216116] rewind_stack_and_make_dead+0x16/0x20
[ 38.216143] RIP: 0033:0x7f97fab18a4d
[ 38.216159] Code: Unable to access opcode bytes at 0x7f97fab18a23.
[ 38.216170] RSP: 002b:00007ffd957e62d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ 38.216190] RAX: ffffffffffffffda RBX: 00007f97fabf69e0 RCX: 00007f97fab18a4d
[ 38.216205] RDX: 00000000000000e7 RSI: ffffffffffffff80 RDI: 0000000000000000
[ 38.216220] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020
[ 38.216234] R10: 00007ffd957e6180 R11: 0000000000000246 R12: 00007f97fabf69e0
[ 38.216248] R13: 00007f97fabfbf00 R14: 0000000000000001 R15: 00007f97fabfbee8
[ 38.216279] </TASK>
[ 38.216287] BUG: scheduling while atomic: repro/721/0x00000000
[ 38.236575] INFO: lockdep is turned off.
[ 38.237107] Modules linked in:
[ 38.237574] Preemption disabled at:
[ 38.237582] [<ffffffff81552c46>] do_task_dead+0x26/0x120
[ 38.238779] CPU: 0 UID: 0 PID: 721 Comm: repro Tainted: G D 6.16.0-next-20250801-next-2025080 #1 PR
[ 38.238813] Tainted: [D]=DIE
[ 38.238820] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org4
[ 38.238836] Call Trace:
[ 38.238842] <TASK>
[ 38.238850] dump_stack_lvl+0x121/0x150
[ 38.238881] ? do_task_dead+0x26/0x120
[ 38.238908] dump_stack+0x19/0x20
[ 38.238936] __schedule_bug+0x12d/0x180
[ 38.238977] __schedule+0x276e/0x3840
[ 38.239017] ? __pfx___schedule+0x10/0x10
[ 38.239050] ? do_task_dead+0xae/0x120
[ 38.239078] ? do_task_dead+0x48/0x120
[ 38.239108] ? debug_smp_processor_id+0x20/0x30
[ 38.239148] ? rcu_is_watching+0x19/0xc0
[ 38.239173] ? trace_irq_enable+0xd1/0x110
[ 38.239211] ? do_task_dead+0x48/0x120
[ 38.239241] do_task_dead+0xe6/0x120
[ 38.239271] make_task_dead+0x384/0x3c0
[ 38.239328] rewind_stack_and_make_dead+0x16/0x20
[ 38.239353] RIP: 0033:0x7f97fab18a4d
[ 38.239370] Code: Unable to access opcode bytes at 0x7f97fab18a23.
[ 38.239382] RSP: 002b:00007ffd957e62d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ 38.239404] RAX: ffffffffffffffda RBX: 00007f97fabf69e0 RCX: 00007f97fab18a4d
[ 38.239421] RDX: 00000000000000e7 RSI: ffffffffffffff80 RDI: 0000000000000000
[ 38.239437] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020
[ 38.239452] R10: 00007ffd957e6180 R11: 0000000000000246 R12: 00007f97fabf69e0
[ 38.239469] R13: 00007f97fabfbf00 R14: 0000000000000001 R15: 00007f97fabfbee8
[ 38.239503] </TASK>
[ 38.239513] ------------[ cut here ]------------
[ 38.259566] Voluntary context switch within RCU read-side critical section!
[ 38.259699] WARNING: kernel/rcu/tree_plugin.h:332 at rcu_note_context_switch+0xc6f/0x1900, CPU#0: repro/721
[ 38.262004] Modules linked in:
[ 38.262458] CPU: 0 UID: 0 PID: 721 Comm: repro Tainted: G D W 6.16.0-next-20250801-next-2025080 #1 PR
[ 38.263978] Tainted: [D]=DIE, [W]=WARN
[ 38.264509] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org4
[ 38.266148] RIP: 0010:rcu_note_context_switch+0xc6f/0x1900
[ 38.266891] Code: 8b 45 a0 4c 8b 55 a8 48 c7 c2 44 b4 ef 87 44 8b 4d b0 e9 0f 03 00 00 48 c7 c7 c0 5d ed 85 c6 05 bf0
[ 38.269294] RSP: 0018:ffff88800feefd28 EFLAGS: 00010096
[ 38.269980] RAX: 0000000000000000 RBX: ffff88806ca450c0 RCX: ffffffff814612c2
[ 38.270899] RDX: ffff888016a8ac00 RSI: ffffffff814612cf RDI: 0000000000000001
[ 38.271817] RBP: ffff88800feefd98 R08: 0000000000000001 R09: ffffed100d9459b9
[ 38.272733] R10: 0000000000000000 R11: 6e69687469772068 R12: ffff888016a8ac00
[ 38.273729] R13: 0000000000000000 R14: ffff888016a8ac00 R15: 0000000000000000
[ 38.274649] FS: 0000000000000000(0000) GS:ffff8880e3b40000(0000) knlGS:0000000000000000
[ 38.275685] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 38.276435] CR2: 00007f97fadfe504 CR3: 0000000007088005 CR4: 0000000000770ef0
[ 38.277410] PKRU: 55555554
[ 38.277778] Call Trace:
[ 38.278113] <TASK>
[ 38.278416] ? debug_smp_processor_id+0x20/0x30
[ 38.279034] ? rcu_is_watching+0x19/0xc0
[ 38.279570] ? trace_irq_disable+0xd1/0x110
[ 38.280152] __schedule+0x25c/0x3840
[ 38.280660] ? __pfx___schedule+0x10/0x10
[ 38.281273] ? do_task_dead+0xae/0x120
[ 38.281792] ? do_task_dead+0x48/0x120
[ 38.282314] ? debug_smp_processor_id+0x20/0x30
[ 38.282933] ? rcu_is_watching+0x19/0xc0
[ 38.283472] ? trace_irq_enable+0xd1/0x110
[ 38.284039] ? do_task_dead+0x48/0x120
[ 38.284561] do_task_dead+0xe6/0x120
[ 38.285144] make_task_dead+0x384/0x3c0
[ 38.285742] rewind_stack_and_make_dead+0x16/0x20
[ 38.286371] RIP: 0033:0x7f97fab18a4d
[ 38.286855] Code: Unable to access opcode bytes at 0x7f97fab18a23.
[ 38.287653] RSP: 002b:00007ffd957e62d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ 38.288628] RAX: ffffffffffffffda RBX: 00007f97fabf69e0 RCX: 00007f97fab18a4d
[ 38.289600] RDX: 00000000000000e7 RSI: ffffffffffffff80 RDI: 0000000000000000
[ 38.290519] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020
[ 38.291436] R10: 00007ffd957e6180 R11: 0000000000000246 R12: 00007f97fabf69e0
[ 38.292355] R13: 00007f97fabfbf00 R14: 0000000000000001 R15: 00007f97fabfbee8
[ 38.293362] </TASK>
[ 38.293669] irq event stamp: 605
[ 38.294103] hardirqs last enabled at (605): [<ffffffff812f087b>] cond_local_irq_enable.isra.0+0x3b/0x50
[ 38.295323] hardirqs last disabled at (604): [<ffffffff85d6dba6>] exc_general_protection+0x36/0x340
[ 38.296499] softirqs last enabled at (516): [<ffffffff8132638c>] fpu_clone+0x10c/0x710
[ 38.297603] softirqs last disabled at (514): [<ffffffff8132631f>] fpu_clone+0x9f/0x710
[ 38.298639] ---[ end trace 0000000000000000 ]---
"
Hope this could be insightful to you.
Regards,
Yi Lai
---
If you don't need the following environment to reproduce the problem, or if you
already have a reproduction environment, please ignore the following information.
How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0
// start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
// You could change the bzImage_xxx as you want
// You may need to remove the line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for a different qemu version
You can use the command below to log in; there is no password for root.
ssh -p 10023 root@localhost
After logging in to the vm (virtual machine) successfully, you can transfer the
reproducer binary to the vm as shown below, and reproduce the problem in the vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/
Get the bzImage for the target kernel:
Please use the target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage // x should be equal to or less than the number of CPUs your PC has
Point the above start3.sh at the resulting bzImage to load the target kernel in the vm.
Tips:
If you already have qemu-system-x86_64, please ignore the info below.
If you want to install qemu v7.1.0:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install
On Sun, Jun 08, 2025 at 10:01:50AM +1200, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> Certain madvise operations, especially MADV_DONTNEED, occur far more
> frequently than other madvise options, particularly in native and Java
> heaps for dynamic memory management.
>
> Currently, the mmap_lock is always held during these operations, even when
> unnecessary. This causes lock contention and can lead to severe priority
> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
> hold the lock and block higher-priority threads.
>
> This patch enables the use of per-VMA locks when the advised range lies
> entirely within a single VMA, avoiding the need for full VMA traversal. In
> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
>
> Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
> benefits from this per-VMA lock optimization. After extended runtime,
> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> only 1,231 fell back to mmap_lock.
>
> To simplify handling, the implementation falls back to the standard
> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> userfaultfd_remove().
>
> Many thanks to Lorenzo's work[1] on:
> "Refactor the madvise() code to retain state about the locking mode
> utilised for traversing VMAs.
>
> Then use this mechanism to permit VMA locking to be done later in the
> madvise() logic and also to allow altering of the locking mode to permit
> falling back to an mmap read lock if required."
>
> One important point, as pointed out by Jann[2], is that
> untagged_addr_remote() requires holding mmap_lock. This is because
> address tagging on x86 and RISC-V is quite complex.
>
> Until untagged_addr_remote() becomes atomic—which seems unlikely in
> the near future—we cannot support per-VMA locks for remote processes.
> So for now, only local processes are supported.
>
> Link: https://lore.kernel.org/all/0b96ce61-a52c-4036-b5b6-5c50783db51f@lucifer.local/ [1]
> Link: https://lore.kernel.org/all/CAG48ez11zi-1jicHUZtLhyoNPGGVB+ROeAJCUw48bsjk4bbEkA@mail.gmail.com/ [2]
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jann Horn <jannh@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Cc: Qi Zheng <zhengqi.arch@bytedance.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> -v4:
> * collect Lorenzo's RB;
> * use visit() for per-vma path
>
> mm/madvise.c | 195 ++++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 147 insertions(+), 48 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 56d9ca2557b9..8382614b71d1 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -48,38 +48,19 @@ struct madvise_walk_private {
> bool pageout;
> };
>
> +enum madvise_lock_mode {
> + MADVISE_NO_LOCK,
> + MADVISE_MMAP_READ_LOCK,
> + MADVISE_MMAP_WRITE_LOCK,
> + MADVISE_VMA_READ_LOCK,
> +};
> +
> struct madvise_behavior {
> int behavior;
> struct mmu_gather *tlb;
> + enum madvise_lock_mode lock_mode;
> };
>
> -/*
> - * Any behaviour which results in changes to the vma->vm_flags needs to
> - * take mmap_lock for writing. Others, which simply traverse vmas, need
> - * to only take it for reading.
> - */
> -static int madvise_need_mmap_write(int behavior)
> -{
> - switch (behavior) {
> - case MADV_REMOVE:
> - case MADV_WILLNEED:
> - case MADV_DONTNEED:
> - case MADV_DONTNEED_LOCKED:
> - case MADV_COLD:
> - case MADV_PAGEOUT:
> - case MADV_FREE:
> - case MADV_POPULATE_READ:
> - case MADV_POPULATE_WRITE:
> - case MADV_COLLAPSE:
> - case MADV_GUARD_INSTALL:
> - case MADV_GUARD_REMOVE:
> - return 0;
> - default:
> - /* be safe, default to 1. list exceptions explicitly */
> - return 1;
> - }
> -}
> -
> #ifdef CONFIG_ANON_VMA_NAME
> struct anon_vma_name *anon_vma_name_alloc(const char *name)
> {
> @@ -1486,6 +1467,44 @@ static bool process_madvise_remote_valid(int behavior)
> }
> }
>
> +/*
> + * Try to acquire a VMA read lock if possible.
> + *
> + * We only support this lock over a single VMA, which the input range must
> + * span either partially or fully.
> + *
> + * This function always returns with an appropriate lock held. If a VMA read
> + * lock could be acquired, we return the locked VMA.
> + *
> + * If a VMA read lock could not be acquired, we return NULL and expect caller to
> + * fallback to mmap lock behaviour.
> + */
> +static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm,
> + struct madvise_behavior *madv_behavior,
> + unsigned long start, unsigned long end)
> +{
> + struct vm_area_struct *vma;
> +
> + vma = lock_vma_under_rcu(mm, start);
> + if (!vma)
> + goto take_mmap_read_lock;
> + /*
> + * Must span only a single VMA; uffd and remote processes are
> + * unsupported.
> + */
> + if (end > vma->vm_end || current->mm != mm ||
> + userfaultfd_armed(vma)) {
> + vma_end_read(vma);
> + goto take_mmap_read_lock;
> + }
> + return vma;
> +
> +take_mmap_read_lock:
> + mmap_read_lock(mm);
> + madv_behavior->lock_mode = MADVISE_MMAP_READ_LOCK;
> + return NULL;
> +}
> +
> /*
> * Walk the vmas in range [start,end), and call the visit function on each one.
> * The visit function will get start and end parameters that cover the overlap
> @@ -1496,7 +1515,8 @@ static bool process_madvise_remote_valid(int behavior)
> */
> static
> int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> - unsigned long end, void *arg,
> + unsigned long end, struct madvise_behavior *madv_behavior,
> + void *arg,
> int (*visit)(struct vm_area_struct *vma,
> struct vm_area_struct **prev, unsigned long start,
> unsigned long end, void *arg))
> @@ -1505,6 +1525,20 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> struct vm_area_struct *prev;
> unsigned long tmp;
> int unmapped_error = 0;
> + int error;
> +
> + /*
> + * If VMA read lock is supported, apply madvise to a single VMA
> + * tentatively, avoiding walking VMAs.
> + */
> + if (madv_behavior && madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
> + vma = try_vma_read_lock(mm, madv_behavior, start, end);
> + if (vma) {
> + error = visit(vma, &prev, start, end, arg);
> + vma_end_read(vma);
> + return error;
> + }
> + }
>
> /*
> * If the interval [start,end) covers some unmapped address
> @@ -1516,8 +1550,6 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> prev = vma;
>
> for (;;) {
> - int error;
> -
> /* Still start < end. */
> if (!vma)
> return -ENOMEM;
> @@ -1598,34 +1630,86 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> if (end == start)
> return 0;
>
> - return madvise_walk_vmas(mm, start, end, anon_name,
> + return madvise_walk_vmas(mm, start, end, NULL, anon_name,
> madvise_vma_anon_name);
> }
> #endif /* CONFIG_ANON_VMA_NAME */
>
> -static int madvise_lock(struct mm_struct *mm, int behavior)
> +
> +/*
> + * Any behaviour which results in changes to the vma->vm_flags needs to
> + * take mmap_lock for writing. Others, which simply traverse vmas, need
> + * to only take it for reading.
> + */
> +static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavior)
> {
> + int behavior = madv_behavior->behavior;
> +
> if (is_memory_failure(behavior))
> - return 0;
> + return MADVISE_NO_LOCK;
>
> - if (madvise_need_mmap_write(behavior)) {
> + switch (behavior) {
> + case MADV_REMOVE:
> + case MADV_WILLNEED:
> + case MADV_COLD:
> + case MADV_PAGEOUT:
> + case MADV_FREE:
> + case MADV_POPULATE_READ:
> + case MADV_POPULATE_WRITE:
> + case MADV_COLLAPSE:
> + case MADV_GUARD_INSTALL:
> + case MADV_GUARD_REMOVE:
> + return MADVISE_MMAP_READ_LOCK;
> + case MADV_DONTNEED:
> + case MADV_DONTNEED_LOCKED:
> + return MADVISE_VMA_READ_LOCK;
> + default:
> + return MADVISE_MMAP_WRITE_LOCK;
> + }
> +}
> +
> +static int madvise_lock(struct mm_struct *mm,
> + struct madvise_behavior *madv_behavior)
> +{
> + enum madvise_lock_mode lock_mode = get_lock_mode(madv_behavior);
> +
> + switch (lock_mode) {
> + case MADVISE_NO_LOCK:
> + break;
> + case MADVISE_MMAP_WRITE_LOCK:
> if (mmap_write_lock_killable(mm))
> return -EINTR;
> - } else {
> + break;
> + case MADVISE_MMAP_READ_LOCK:
> mmap_read_lock(mm);
> + break;
> + case MADVISE_VMA_READ_LOCK:
> + /* We will acquire the lock per-VMA in madvise_walk_vmas(). */
> + break;
> }
> +
> + madv_behavior->lock_mode = lock_mode;
> return 0;
> }
>
> -static void madvise_unlock(struct mm_struct *mm, int behavior)
> +static void madvise_unlock(struct mm_struct *mm,
> + struct madvise_behavior *madv_behavior)
> {
> - if (is_memory_failure(behavior))
> + switch (madv_behavior->lock_mode) {
> + case MADVISE_NO_LOCK:
> return;
> -
> - if (madvise_need_mmap_write(behavior))
> + case MADVISE_MMAP_WRITE_LOCK:
> mmap_write_unlock(mm);
> - else
> + break;
> + case MADVISE_MMAP_READ_LOCK:
> mmap_read_unlock(mm);
> + break;
> + case MADVISE_VMA_READ_LOCK:
> + /* We will drop the lock per-VMA in madvise_walk_vmas(). */
> + break;
> + }
> +
> + madv_behavior->lock_mode = MADVISE_NO_LOCK;
> }
>
> static bool madvise_batch_tlb_flush(int behavior)
> @@ -1710,6 +1794,21 @@ static bool is_madvise_populate(int behavior)
> }
> }
>
> +/*
> + * untagged_addr_remote() assumes mmap_lock is already held. On
> + * architectures like x86 and RISC-V, tagging is tricky because each
> + * mm may have a different tagging mask. However, we might only hold
> + * the per-VMA lock (currently only local processes are supported),
> + * so untagged_addr is used to avoid the mmap_lock assertion for
> + * local processes.
> + */
> +static inline unsigned long get_untagged_addr(struct mm_struct *mm,
> + unsigned long start)
> +{
> + return current->mm == mm ? untagged_addr(start) :
> + untagged_addr_remote(mm, start);
> +}
> +
> static int madvise_do_behavior(struct mm_struct *mm,
> unsigned long start, size_t len_in,
> struct madvise_behavior *madv_behavior)
> @@ -1721,7 +1820,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
>
> if (is_memory_failure(behavior))
> return madvise_inject_error(behavior, start, start + len_in);
> - start = untagged_addr_remote(mm, start);
> + start = get_untagged_addr(mm, start);
> end = start + PAGE_ALIGN(len_in);
>
> blk_start_plug(&plug);
> @@ -1729,7 +1828,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
> error = madvise_populate(mm, start, end, behavior);
> else
> error = madvise_walk_vmas(mm, start, end, madv_behavior,
> - madvise_vma_behavior);
> + madv_behavior, madvise_vma_behavior);
> blk_finish_plug(&plug);
> return error;
> }
> @@ -1817,13 +1916,13 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>
> if (madvise_should_skip(start, len_in, behavior, &error))
> return error;
> - error = madvise_lock(mm, behavior);
> + error = madvise_lock(mm, &madv_behavior);
> if (error)
> return error;
> madvise_init_tlb(&madv_behavior, mm);
> error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
> madvise_finish_tlb(&madv_behavior);
> - madvise_unlock(mm, behavior);
> + madvise_unlock(mm, &madv_behavior);
>
> return error;
> }
> @@ -1847,7 +1946,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>
> total_len = iov_iter_count(iter);
>
> - ret = madvise_lock(mm, behavior);
> + ret = madvise_lock(mm, &madv_behavior);
> if (ret)
> return ret;
> madvise_init_tlb(&madv_behavior, mm);
> @@ -1880,8 +1979,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>
> /* Drop and reacquire lock to unwind race. */
> madvise_finish_tlb(&madv_behavior);
> - madvise_unlock(mm, behavior);
> - ret = madvise_lock(mm, behavior);
> + madvise_unlock(mm, &madv_behavior);
> + ret = madvise_lock(mm, &madv_behavior);
> if (ret)
> goto out;
> madvise_init_tlb(&madv_behavior, mm);
> @@ -1892,7 +1991,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> iov_iter_advance(iter, iter_iov_len(iter));
> }
> madvise_finish_tlb(&madv_behavior);
> - madvise_unlock(mm, behavior);
> + madvise_unlock(mm, &madv_behavior);
>
> out:
> ret = (total_len - iov_iter_count(iter)) ? : ret;
> --
> 2.39.3 (Apple Git-146)
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-08-04 0:58 ` Lai, Yi
@ 2025-08-04 7:19 ` Barry Song
2025-08-04 7:57 ` David Hildenbrand
2025-08-04 8:19 ` Barry Song
2 siblings, 0 replies; 26+ messages in thread
From: Barry Song @ 2025-08-04 7:19 UTC (permalink / raw)
To: Lai, Yi
Cc: akpm, linux-mm, linux-kernel, Barry Song, Lorenzo Stoakes,
Liam R. Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng, Qi Zheng,
yi1.lai
On Mon, Aug 4, 2025 at 8:58 AM Lai, Yi <yi1.lai@linux.intel.com> wrote:
>
> Hi Barry Song,
>
> Greetings!
>
> I used Syzkaller and found that there is general protection fault in __pte_offset_map_lock in linux-next next-20250801.
>
> After bisection and the first bad commit is:
> "
> a6fde7add78d mm: use per_vma lock for MADV_DONTNEED
> "
>
> All detailed into can be found at:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock
> Syzkaller repro code:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/repro.c
Hi Lai,
Thanks for your email. I gave "repro" an initial try with today's mm-unstable
on arm64, but I wasn't able to reproduce the issue.
Please give me some time to test with your kernel and config, as well as
on x86_64 (though I'm not very familiar with x86).
Thanks
Barry
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-08-04 0:58 ` Lai, Yi
2025-08-04 7:19 ` Barry Song
@ 2025-08-04 7:57 ` David Hildenbrand
2025-08-04 8:26 ` Qi Zheng
2025-08-04 21:48 ` Barry Song
2025-08-04 8:19 ` Barry Song
2 siblings, 2 replies; 26+ messages in thread
From: David Hildenbrand @ 2025-08-04 7:57 UTC (permalink / raw)
To: Lai, Yi, Barry Song
Cc: akpm, linux-mm, linux-kernel, Barry Song, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, Qi Zheng, yi1.lai
On 04.08.25 02:58, Lai, Yi wrote:
> Hi Barry Song,
>
> Greetings!
>
> I used Syzkaller and found that there is general protection fault in __pte_offset_map_lock in linux-next next-20250801.
>
> After bisection and the first bad commit is:
> "
> a6fde7add78d mm: use per_vma lock for MADV_DONTNEED
> "
>
> All detailed into can be found at:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock
> Syzkaller repro code:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/repro.c
> Syzkaller repro syscall steps:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/repro.prog
> Syzkaller report:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/repro.report
> Kconfig(make olddefconfig):
> https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/kconfig_origin
> Bisect info:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/bisect_info.log
> bzImage:
> https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250803_193026___pte_offset_map_lock/bzImage_next-20250801
> Issue dmesg:
> https://github.com/laifryiee/syzkaller_logs/blob/main/250803_193026___pte_offset_map_lock/next-20250801_dmesg.log
Skimming over the reproducer, we seem to have racing MADV_DONTNEED and
MADV_COLLAPSE on the same anon area, but the problem only shows up once
we tear down that MM.
If I would have to guess, I'd assume it's related to PT_RECLAIM
reclaiming empty page tables during MADV_DONTNEED -- but the kconfig
does not indicate that CONFIG_PT_RECLAIM was set.
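Very roughly, the pattern looks like two threads hammering the same anon
area, something along the lines of the sketch below (this is not the actual
repro.c linked above; MADV_COLLAPSE is defined manually in case the libc
headers are older):

#define _GNU_SOURCE
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* uapi value; missing from older headers */
#endif

#define LEN (4UL << 20)		/* 4 MiB anon area (alignment not guaranteed) */

static void *area;

static void *dontneed_thread(void *arg)
{
	for (int i = 0; i < 100000; i++)
		madvise(area, LEN, MADV_DONTNEED);	/* per-VMA-lock fast path */
	return NULL;
}

static void *collapse_thread(void *arg)
{
	for (int i = 0; i < 100000; i++) {
		memset(area, 1, LEN);			/* repopulate PTEs */
		madvise(area, LEN, MADV_COLLAPSE);	/* synchronous collapse */
	}
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	area = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED)
		return 1;
	pthread_create(&t1, NULL, dontneed_thread, NULL);
	pthread_create(&t2, NULL, collapse_thread, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;	/* the reported fault shows up at MM teardown on exit */
}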
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-08-04 0:58 ` Lai, Yi
2025-08-04 7:19 ` Barry Song
2025-08-04 7:57 ` David Hildenbrand
@ 2025-08-04 8:19 ` Barry Song
2 siblings, 0 replies; 26+ messages in thread
From: Barry Song @ 2025-08-04 8:19 UTC (permalink / raw)
To: Lai, Yi
Cc: akpm, linux-mm, linux-kernel, Barry Song, Lorenzo Stoakes,
Liam R. Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng, Qi Zheng,
yi1.lai
On Mon, Aug 4, 2025 at 12:58 PM Lai, Yi <yi1.lai@linux.intel.com> wrote:
>
> Hi Barry Song,
>
> Greetings!
>
> I used Syzkaller and found that there is general protection fault in __pte_offset_map_lock in linux-next next-20250801.
>
> After bisection and the first bad commit is:
> "
> a6fde7add78d mm: use per_vma lock for MADV_DONTNEED
> "
>
> All detailed into can be found at:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock
> Syzkaller repro code:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/repro.c
> Syzkaller repro syscall steps:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/repro.prog
> Syzkaller report:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/repro.report
> Kconfig(make olddefconfig):
> https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/kconfig_origin
[...]
>
> If you don't need the following environment to reproduce the problem or if you
> already have one reproduced environment, please ignore the following information.
>
> How to reproduce:
> git clone https://gitlab.com/xupengfe/repro_vm_env.git
> cd repro_vm_env
> tar -xvf repro_vm_env.tar.gz
> cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0
> // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
> // You could change the bzImage_xxx as you want
> // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
> You could use below command to log in, there is no password for root.
> ssh -p 10023 root@localhost
>
> After login vm(virtual machine) successfully, you could transfer reproduced
> binary to the vm by below way, and reproduce the problem in vm:
> gcc -pthread -o repro repro.c
> scp -P 10023 repro root@localhost:/root/
>
> Get the bzImage for target kernel:
> Please use target kconfig and copy it to kernel_src/.config
> make olddefconfig
> make -jx bzImage //x should equal or less than cpu num your pc has
>
> Fill the bzImage file into above start3.sh to load the target kernel in vm.
>
I can boot successfully with your bzImage for the 6.2 kernel.
However, when I build a new kernel from the next-20250801 linux-next source
using your `kconfig_origin`, the system fails to boot:
Warning: unable to open an initial console.
check access for rdinit=/init failed: -2, ignoring
input: ImExPS/2 Generic Explorer Mouse as
/devices/platform/i8042/serio1/input/input2
List of all partitions:
No filesystem could mount root, tried:
Kernel panic - not syncing: VFS: Unable to mount root fs on "/dev/sda"
or unknown-block(0,0)
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted
6.16.0-next-20250801-gb9ddaa95fd28 #1 PREEMPT(none)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
vpanic+0x2a1/0x2b0
panic+0x55/0x60
mount_root_generic+0x2c6/0x2e0
? __pfx_kernel_init+0x10/0x10
prepare_namespace+0x49/0x260
? __pfx_kernel_init+0x10/0x10
kernel_init+0x15/0x1a0
ret_from_fork+0x68/0xd0
? __pfx_kernel_init+0x10/0x10
ret_from_fork_asm+0x19/0x30
</TASK>
Kernel Offset: 0x21600000 from 0xffffffff81000000 (relocation range:
0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on
"/dev/sda" or unknown-block(0,0) ]---
Is there anything missing from the reproducer guide? Perhaps an initrd/ramdisk?
Thanks
Barry
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-08-04 7:57 ` David Hildenbrand
@ 2025-08-04 8:26 ` Qi Zheng
2025-08-04 8:30 ` David Hildenbrand
2025-08-04 21:48 ` Barry Song
1 sibling, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2025-08-04 8:26 UTC (permalink / raw)
To: David Hildenbrand, Lai, Yi, Barry Song
Cc: akpm, linux-mm, linux-kernel, Barry Song, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, yi1.lai
On 8/4/25 3:57 PM, David Hildenbrand wrote:
> On 04.08.25 02:58, Lai, Yi wrote:
>> Hi Barry Song,
>>
>> Greetings!
>>
>> I used Syzkaller and found that there is general protection fault in
>> __pte_offset_map_lock in linux-next next-20250801.
>>
>> After bisection and the first bad commit is:
>> "
>> a6fde7add78d mm: use per_vma lock for MADV_DONTNEED
>> "
>>
>> All detailed into can be found at:
>> https://github.com/laifryiee/syzkaller_logs/tree/
>> main/250803_193026___pte_offset_map_lock
>> Syzkaller repro code:
>> https://github.com/laifryiee/syzkaller_logs/tree/
>> main/250803_193026___pte_offset_map_lock/repro.c
>> Syzkaller repro syscall steps:
>> https://github.com/laifryiee/syzkaller_logs/tree/
>> main/250803_193026___pte_offset_map_lock/repro.prog
>> Syzkaller report:
>> https://github.com/laifryiee/syzkaller_logs/tree/
>> main/250803_193026___pte_offset_map_lock/repro.report
>> Kconfig(make olddefconfig):
>> https://github.com/laifryiee/syzkaller_logs/tree/
>> main/250803_193026___pte_offset_map_lock/kconfig_origin
>> Bisect info:
>> https://github.com/laifryiee/syzkaller_logs/tree/
>> main/250803_193026___pte_offset_map_lock/bisect_info.log
>> bzImage:
>> https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/
>> main/250803_193026___pte_offset_map_lock/bzImage_next-20250801
>> Issue dmesg:
>> https://github.com/laifryiee/syzkaller_logs/blob/
>> main/250803_193026___pte_offset_map_lock/next-20250801_dmesg.log
>
> Skimming over the reproducer, we seem to have racing MADV_DONTNEED and
> MADV_COLLAPSE on the same anon area, but the problem only shows up once
> we tear down that MM.
>
> If I would have to guess, I'd assume it's related to PT_RECLAIM
> reclaiming empty page tables during MADV_DONTNEED -- but the kconfig
> does not indicate that CONFIG_PT_RECLAIM was set.
On x86_64, PT_RECLAIM should be enabled unless it is manually disabled;
but since __pte_offset_map_lock() takes the RCU read lock, there should be
no problem even if PT_RECLAIM frees the PTE page (via RCU).
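For reference, a typical caller of that interface looks roughly like the
sketch below (the function name is made up for illustration; this is not
code from the patch):

#include <linux/mm.h>

/* Illustrative only: the usual locked-PTE lookup pattern. */
static bool sketch_walk_one_pte(struct mm_struct *mm, pmd_t *pmd,
				unsigned long addr)
{
	spinlock_t *ptl;
	pte_t *pte;

	/*
	 * __pte_offset_map() takes rcu_read_lock() internally, and PT_RECLAIM
	 * frees PTE pages via RCU, so the lookup never dereferences an
	 * already-freed page table; a racing free simply makes it fail.
	 */
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return false;	/* page table changed or was freed; retry */

	/* ... inspect or modify *pte under the PTE lock ... */

	pte_unmap_unlock(pte, ptl);
	return true;
}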
Anyway, I will also find time to reproduce the problem locally and then
look into this issue.
Thanks.
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-08-04 8:26 ` Qi Zheng
@ 2025-08-04 8:30 ` David Hildenbrand
2025-08-04 8:49 ` Lai, Yi
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-08-04 8:30 UTC (permalink / raw)
To: Qi Zheng, Lai, Yi, Barry Song
Cc: akpm, linux-mm, linux-kernel, Barry Song, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, yi1.lai
On 04.08.25 10:26, Qi Zheng wrote:
>
>
> On 8/4/25 3:57 PM, David Hildenbrand wrote:
>> On 04.08.25 02:58, Lai, Yi wrote:
>>> Hi Barry Song,
>>>
>>> Greetings!
>>>
>>> I used Syzkaller and found that there is general protection fault in
>>> __pte_offset_map_lock in linux-next next-20250801.
>>>
>>> After bisection and the first bad commit is:
>>> "
>>> a6fde7add78d mm: use per_vma lock for MADV_DONTNEED
>>> "
>>>
>>> All detailed into can be found at:
>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>> main/250803_193026___pte_offset_map_lock
>>> Syzkaller repro code:
>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>> main/250803_193026___pte_offset_map_lock/repro.c
>>> Syzkaller repro syscall steps:
>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>> main/250803_193026___pte_offset_map_lock/repro.prog
>>> Syzkaller report:
>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>> main/250803_193026___pte_offset_map_lock/repro.report
>>> Kconfig(make olddefconfig):
>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>> main/250803_193026___pte_offset_map_lock/kconfig_origin
>>> Bisect info:
>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>> main/250803_193026___pte_offset_map_lock/bisect_info.log
>>> bzImage:
>>> https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/
>>> main/250803_193026___pte_offset_map_lock/bzImage_next-20250801
>>> Issue dmesg:
>>> https://github.com/laifryiee/syzkaller_logs/blob/
>>> main/250803_193026___pte_offset_map_lock/next-20250801_dmesg.log
>>
>> Skimming over the reproducer, we seem to have racing MADV_DONTNEED and
>> MADV_COLLAPSE on the same anon area, but the problem only shows up once
>> we tear down that MM.
>>
>> If I would have to guess, I'd assume it's related to PT_RECLAIM
>> reclaiming empty page tables during MADV_DONTNEED -- but the kconfig
>> does not indicate that CONFIG_PT_RECLAIM was set.
>
> On the x86_64, if PT_RECLAIM is not manually disabled, PT_RECLAIM should
> be enabled
That's what I thought, but I was not able to spot it in the provided
config [1].
Or is that config *before* "make olddefconfig"? Confusing. I would want to
see the config that was actually used.
[1]
https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/kconfig_origin
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-08-04 8:30 ` David Hildenbrand
@ 2025-08-04 8:49 ` Lai, Yi
2025-08-04 9:15 ` Barry Song
0 siblings, 1 reply; 26+ messages in thread
From: Lai, Yi @ 2025-08-04 8:49 UTC (permalink / raw)
To: David Hildenbrand
Cc: Qi Zheng, Barry Song, akpm, linux-mm, linux-kernel, Barry Song,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Jann Horn,
Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng, yi1.lai
On Mon, Aug 04, 2025 at 10:30:45AM +0200, David Hildenbrand wrote:
> On 04.08.25 10:26, Qi Zheng wrote:
> >
> >
> > On 8/4/25 3:57 PM, David Hildenbrand wrote:
> > > On 04.08.25 02:58, Lai, Yi wrote:
> > > > Hi Barry Song,
> > > >
> > > > Greetings!
> > > >
> > > > I used Syzkaller and found that there is general protection fault in
> > > > __pte_offset_map_lock in linux-next next-20250801.
> > > >
> > > > After bisection and the first bad commit is:
> > > > "
> > > > a6fde7add78d mm: use per_vma lock for MADV_DONTNEED
> > > > "
> > > >
> > > > All detailed into can be found at:
> > > > https://github.com/laifryiee/syzkaller_logs/tree/
> > > > main/250803_193026___pte_offset_map_lock
> > > > Syzkaller repro code:
> > > > https://github.com/laifryiee/syzkaller_logs/tree/
> > > > main/250803_193026___pte_offset_map_lock/repro.c
> > > > Syzkaller repro syscall steps:
> > > > https://github.com/laifryiee/syzkaller_logs/tree/
> > > > main/250803_193026___pte_offset_map_lock/repro.prog
> > > > Syzkaller report:
> > > > https://github.com/laifryiee/syzkaller_logs/tree/
> > > > main/250803_193026___pte_offset_map_lock/repro.report
> > > > Kconfig(make olddefconfig):
> > > > https://github.com/laifryiee/syzkaller_logs/tree/
> > > > main/250803_193026___pte_offset_map_lock/kconfig_origin
> > > > Bisect info:
> > > > https://github.com/laifryiee/syzkaller_logs/tree/
> > > > main/250803_193026___pte_offset_map_lock/bisect_info.log
> > > > bzImage:
> > > > https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/
> > > > main/250803_193026___pte_offset_map_lock/bzImage_next-20250801
> > > > Issue dmesg:
> > > > https://github.com/laifryiee/syzkaller_logs/blob/
> > > > main/250803_193026___pte_offset_map_lock/next-20250801_dmesg.log
> > >
> > > Skimming over the reproducer, we seem to have racing MADV_DONTNEED and
> > > MADV_COLLAPSE on the same anon area, but the problem only shows up once
> > > we tear down that MM.
> > >
> > > If I would have to guess, I'd assume it's related to PT_RECLAIM
> > > reclaiming empty page tables during MADV_DONTNEED -- but the kconfig
> > > does not indicate that CONFIG_PT_RECLAIM was set.
> >
> > On the x86_64, if PT_RECLAIM is not manually disabled, PT_RECLAIM should
> > be enabled
>
> That's what I thought: but I was not able to spot it in the provided config
> [1].
>
> Or is that config *before* "make olfconfig"? confusing. I would want to see
> the actually used config.
>
>
>
My kernel compiling steps:
1. copy kconfig_origin to kernel_source_folder/.config
2. make olddefconfig
3. make bzImage -jx
I have also uploaded the actual .config used during compilation.
[2] https://github.com/laifryiee/syzkaller_logs/blob/main/250803_193026___pte_offset_map_lock/.config
CONFIG_ARCH_SUPPORTS_PT_RECLAIM=y
CONFIG_PT_RECLAIM=y
> [1] https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/kconfig_origin
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-08-04 8:49 ` Lai, Yi
@ 2025-08-04 9:15 ` Barry Song
2025-08-04 9:35 ` Qi Zheng
0 siblings, 1 reply; 26+ messages in thread
From: Barry Song @ 2025-08-04 9:15 UTC (permalink / raw)
To: Lai, Yi
Cc: David Hildenbrand, Qi Zheng, akpm, linux-mm, linux-kernel,
Barry Song, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Jann Horn, Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng,
yi1.lai
On Mon, Aug 4, 2025 at 8:49 PM Lai, Yi <yi1.lai@linux.intel.com> wrote:
>
> On Mon, Aug 04, 2025 at 10:30:45AM +0200, David Hildenbrand wrote:
> > On 04.08.25 10:26, Qi Zheng wrote:
> > >
> > >
> > > On 8/4/25 3:57 PM, David Hildenbrand wrote:
> > > > On 04.08.25 02:58, Lai, Yi wrote:
> > > > > Hi Barry Song,
> > > > >
> > > > > Greetings!
> > > > >
> > > > > I used Syzkaller and found that there is general protection fault in
> > > > > __pte_offset_map_lock in linux-next next-20250801.
> > > > >
> > > > > After bisection and the first bad commit is:
> > > > > "
> > > > > a6fde7add78d mm: use per_vma lock for MADV_DONTNEED
> > > > > "
> > > > >
> > > > > All detailed into can be found at:
> > > > > https://github.com/laifryiee/syzkaller_logs/tree/
> > > > > main/250803_193026___pte_offset_map_lock
> > > > > Syzkaller repro code:
> > > > > https://github.com/laifryiee/syzkaller_logs/tree/
> > > > > main/250803_193026___pte_offset_map_lock/repro.c
> > > > > Syzkaller repro syscall steps:
> > > > > https://github.com/laifryiee/syzkaller_logs/tree/
> > > > > main/250803_193026___pte_offset_map_lock/repro.prog
> > > > > Syzkaller report:
> > > > > https://github.com/laifryiee/syzkaller_logs/tree/
> > > > > main/250803_193026___pte_offset_map_lock/repro.report
> > > > > Kconfig(make olddefconfig):
> > > > > https://github.com/laifryiee/syzkaller_logs/tree/
> > > > > main/250803_193026___pte_offset_map_lock/kconfig_origin
> > > > > Bisect info:
> > > > > https://github.com/laifryiee/syzkaller_logs/tree/
> > > > > main/250803_193026___pte_offset_map_lock/bisect_info.log
> > > > > bzImage:
> > > > > https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/
> > > > > main/250803_193026___pte_offset_map_lock/bzImage_next-20250801
> > > > > Issue dmesg:
> > > > > https://github.com/laifryiee/syzkaller_logs/blob/
> > > > > main/250803_193026___pte_offset_map_lock/next-20250801_dmesg.log
> > > >
> > > > Skimming over the reproducer, we seem to have racing MADV_DONTNEED and
> > > > MADV_COLLAPSE on the same anon area, but the problem only shows up once
> > > > we tear down that MM.
> > > >
> > > > If I would have to guess, I'd assume it's related to PT_RECLAIM
> > > > reclaiming empty page tables during MADV_DONTNEED -- but the kconfig
> > > > does not indicate that CONFIG_PT_RECLAIM was set.
> > >
> > > On the x86_64, if PT_RECLAIM is not manually disabled, PT_RECLAIM should
> > > be enabled
> >
> > That's what I thought: but I was not able to spot it in the provided config
> > [1].
> >
> > Or is that config *before* "make olfconfig"? confusing. I would want to see
> > the actually used config.
> >
> >
> >
> My kernel compiling steps:
> 1. copy kconfig_origin to kernel_source_folder/.config
> 2. make olddefconfig
> 3. make bzImage -jx
>
> I have also uploaded the actual .config during compiling.
> [2] https://github.com/laifryiee/syzkaller_logs/blob/main/250803_193026___pte_offset_map_lock/.config
> CONFIG_ARCH_SUPPORTS_PT_RECLAIM=y
> CONFIG_PT_RECLAIM=y
Thanks! I can reproduce the issue within one second.
After disabling PT_RECLAIM in .config, the issue disappears.
The reason it doesn't occur on arm64 is that x86 is the only architecture
that selects ARCH_SUPPORTS_PT_RECLAIM.
>
> > [1] https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/kconfig_origin
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
Thanks
Barry
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-08-04 9:15 ` Barry Song
@ 2025-08-04 9:35 ` Qi Zheng
2025-08-04 9:52 ` Qi Zheng
0 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2025-08-04 9:35 UTC (permalink / raw)
To: Barry Song, Lai, Yi
Cc: David Hildenbrand, akpm, linux-mm, linux-kernel, Barry Song,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Jann Horn,
Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng, yi1.lai
On 8/4/25 5:15 PM, Barry Song wrote:
> On Mon, Aug 4, 2025 at 8:49 PM Lai, Yi <yi1.lai@linux.intel.com> wrote:
>>
>> On Mon, Aug 04, 2025 at 10:30:45AM +0200, David Hildenbrand wrote:
>>> On 04.08.25 10:26, Qi Zheng wrote:
>>>>
>>>>
>>>> On 8/4/25 3:57 PM, David Hildenbrand wrote:
>>>>> On 04.08.25 02:58, Lai, Yi wrote:
>>>>>> Hi Barry Song,
>>>>>>
>>>>>> Greetings!
>>>>>>
>>>>>> I used Syzkaller and found that there is general protection fault in
>>>>>> __pte_offset_map_lock in linux-next next-20250801.
>>>>>>
>>>>>> After bisection and the first bad commit is:
>>>>>> "
>>>>>> a6fde7add78d mm: use per_vma lock for MADV_DONTNEED
>>>>>> "
>>>>>>
>>>>>> All detailed into can be found at:
>>>>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>>>>> main/250803_193026___pte_offset_map_lock
>>>>>> Syzkaller repro code:
>>>>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>>>>> main/250803_193026___pte_offset_map_lock/repro.c
>>>>>> Syzkaller repro syscall steps:
>>>>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>>>>> main/250803_193026___pte_offset_map_lock/repro.prog
>>>>>> Syzkaller report:
>>>>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>>>>> main/250803_193026___pte_offset_map_lock/repro.report
>>>>>> Kconfig(make olddefconfig):
>>>>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>>>>> main/250803_193026___pte_offset_map_lock/kconfig_origin
>>>>>> Bisect info:
>>>>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>>>>> main/250803_193026___pte_offset_map_lock/bisect_info.log
>>>>>> bzImage:
>>>>>> https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/
>>>>>> main/250803_193026___pte_offset_map_lock/bzImage_next-20250801
>>>>>> Issue dmesg:
>>>>>> https://github.com/laifryiee/syzkaller_logs/blob/
>>>>>> main/250803_193026___pte_offset_map_lock/next-20250801_dmesg.log
>>>>>
>>>>> Skimming over the reproducer, we seem to have racing MADV_DONTNEED and
>>>>> MADV_COLLAPSE on the same anon area, but the problem only shows up once
>>>>> we tear down that MM.
>>>>>
>>>>> If I would have to guess, I'd assume it's related to PT_RECLAIM
>>>>> reclaiming empty page tables during MADV_DONTNEED -- but the kconfig
>>>>> does not indicate that CONFIG_PT_RECLAIM was set.
>>>>
>>>> On the x86_64, if PT_RECLAIM is not manually disabled, PT_RECLAIM should
>>>> be enabled
>>>
>>> That's what I thought: but I was not able to spot it in the provided config
>>> [1].
>>>
>>> Or is that config *before* "make olfconfig"? confusing. I would want to see
>>> the actually used config.
>>>
>>>
>>>
>> My kernel compiling steps:
>> 1. copy kconfig_origin to kernel_source_folder/.config
>> 2. make olddefconfig
>> 3. make bzImage -jx
>>
>> I have also uploaded the actual .config during compiling.
>> [2] https://github.com/laifryiee/syzkaller_logs/blob/main/250803_193026___pte_offset_map_lock/.config
>> CONFIG_ARCH_SUPPORTS_PT_RECLAIM=y
>> CONFIG_PT_RECLAIM=y
>
> Thanks! I can reproduce the issue within one second.
I also reproduced it locally.
BUG: Bad page map in process repro pte:f000e987f000fea5 pmd:00000067
[22099.667758][T22301] addr:0000000020004000 vm_flags:00100077
anon_vma:ffff8882c5b5fc98 mapping:0000000000000000 index:20004
[22099.671248][T22301] file:(null) fault:0x0 mmap:0x0 mmap_prepare: 0x0
read_folio:0x0
[22099.673833][T22301] CPU: 15 UID: 0 PID: 22301 Comm: repro Tainted: G
B W 6.16.0-rc4-next-20250704+ #200 PREEMPT(voluntary)
[22099.673838][T22301] Tainted: [B]=BAD_PAGE, [W]=WARN
[22099.673838][T22301] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.12.0-1 04/01/2014
[22099.673840][T22301] Call Trace:
[22099.673841][T22301] <TASK>
[22099.673842][T22301] dump_stack_lvl+0x53/0x70
[22099.673845][T22301] print_bad_pte+0x178/0x220
[22099.673849][T22301] vm_normal_page+0x8a/0xa0
[22099.673852][T22301] unmap_page_range+0x5cb/0x1d40
[22099.673855][T22301] ? flush_tlb_mm_range+0x21a/0x780
[22099.673859][T22301] ? tlb_flush_mmu+0x30/0x1c0
[22099.673861][T22301] unmap_vmas+0xab/0x160
[22099.673863][T22301] exit_mmap+0xda/0x3c0
[22099.673868][T22301] mmput+0x6e/0x130
[22099.673869][T22301] do_exit+0x242/0xb40
[22099.673871][T22301] do_group_exit+0x30/0x80
[22099.673873][T22301] get_signal+0x951/0x980
[22099.673876][T22301] ? futex_wake+0x84/0x170
[22099.673880][T22301] arch_do_signal_or_restart+0x2d/0x1f0
[22099.673883][T22301] ? do_futex+0x11a/0x1d0
[22099.673885][T22301] ? __x64_sys_futex+0x67/0x1c0
[22099.673888][T22301] exit_to_user_mode_loop+0x86/0x110
[22099.673890][T22301] do_syscall_64+0x184/0x2b0
[22099.673892][T22301] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[22099.673895][T22301] RIP: 0033:0x7fafb0048af9
[22099.673896][T22301] Code: Unable to access opcode bytes at
0x7fafb0048acf.
[22099.673898][T22301] RSP: 002b:00007fafaff50ea8 EFLAGS: 00000246
ORIG_RAX: 00000000000000ca
[22099.673900][T22301] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX:
00007fafb0048af9
[22099.673901][T22301] RDX: 0000000000000000 RSI: 0000000000000080 RDI:
0000559d33cab1a8
[22099.673903][T22301] RBP: 00007fafaff50ec0 R08: 0000000000000000 R09:
0000000000000000
[22099.673904][T22301] R10: 0000000000000000 R11: 0000000000000246 R12:
00007ffe78dbcd2e
[22099.673905][T22301] R13: 00007ffe78dbcd2f R14: 00007fafaff51700 R15:
0000000000000000
[22099.673907][T22301] </TASK>
[22099.673913][T22301] BUG: unable to handle page fault for address:
ffffea7be1ffe548
[22099.674789][T22301] #PF: supervisor read access in kernel mode
[22099.674789][T22301] #PF: error_code(0x0000) - not-present page
[22099.674789][T22301] PGD 2bfff7067 P4D 2bfff7067 PUD 0
[22099.674789][T22301] Oops: Oops: 0000 [#1] SMP PTI
[22099.674789][T22301] CPU: 15 UID: 0 PID: 22301 Comm: repro Tainted: G
B W 6.16.0-rc4-next-20250704+ #200 PREEMPT(voluntary)
[22099.674789][T22301] Tainted: [B]=BAD_PAGE, [W]=WARN
[22099.674789][T22301] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.12.0-1 04/01/2014
[22099.674789][T22301] RIP: 0010:unmap_page_range+0x1101/0x1d40
[22099.674789][T22301] Code: eb 03 cc cc cc f3 0f 1e fa f3 0f 1e fa e9
ea 01 00 00 48 b8 ff ff ff ff ff 00 00 00 49 21 c2 49 c1 e2 06 4c 03 15
ef a6 fd 00 <49> 8b 52 08 4c 89 d0 f6 c2 01 0f 8
[22099.674789][T22301] RSP: 0018:ffffc9000557baa0 EFLAGS: 00010282
[22099.674789][T22301] RAX: 00000003ffffffff RBX: 0000000020005000 RCX:
f000000000000420
[22099.674789][T22301] RDX: 000000000000001e RSI: 0000000000000000 RDI:
7803ff95ef87ff95
[22099.674789][T22301] RBP: f000d420f000d420 R08: ffff888000000028 R09:
c000000100000864
[22099.674789][T22301] R10: ffffea7be1ffe540 R11: ffffc9000557b6b0 R12:
0000000000000000
[22099.674789][T22301] R13: 00000000000001fb R14: ffffc9000557bcc0 R15:
ffff888000000028
[22099.674789][T22301] FS: 00007fafaff51700(0000)
GS:ffff8885b2b29000(0000) knlGS:0000000000000000
[22099.674789][T22301] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[22099.674789][T22301] CR2: ffffea7be1ffe548 CR3: 0000000103d8c000 CR4:
00000000000006f0
[22099.674789][T22301] Call Trace:
[22099.674789][T22301] <TASK>
[22099.674789][T22301] ? flush_tlb_mm_range+0x21a/0x780
[22099.674789][T22301] ? tlb_flush_mmu+0x30/0x1c0
[22099.674789][T22301] unmap_vmas+0xab/0x160
[22099.674789][T22301] exit_mmap+0xda/0x3c0
[22099.674789][T22301] mmput+0x6e/0x130
[22099.674789][T22301] do_exit+0x242/0xb40
[22099.674789][T22301] do_group_exit+0x30/0x80
[22099.674789][T22301] get_signal+0x951/0x980
[22099.674789][T22301] ? futex_wake+0x84/0x170
[22099.674789][T22301] arch_do_signal_or_restart+0x2d/0x1f0
[22099.674789][T22301] ? do_futex+0x11a/0x1d0
[22099.674789][T22301] ? __x64_sys_futex+0x67/0x1c0
[22099.674789][T22301] exit_to_user_mode_loop+0x86/0x110
[22099.674789][T22301] do_syscall_64+0x184/0x2b0
[22099.674789][T22301] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[22099.674789][T22301] RIP: 0033:0x7fafb0048af9
[22099.674789][T22301] Code: Unable to access opcode bytes at
0x7fafb0048acf.
[22099.674789][T22301] RSP: 002b:00007fafaff50ea8 EFLAGS: 00000246
ORIG_RAX: 00000000000000ca
[22099.674789][T22301] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX:
00007fafb0048af9
[22099.674789][T22301] RDX: 0000000000000000 RSI: 0000000000000080 RDI:
0000559d33cab1a8
[22099.674789][T22301] RBP: 00007fafaff50ec0 R08: 0000000000000000 R09:
0000000000000000
[22099.674789][T22301] R10: 0000000000000000 R11: 0000000000000246 R12:
00007ffe78dbcd2e
[22099.674789][T22301] R13: 00007ffe78dbcd2f R14: 00007fafaff51700 R15:
0000000000000000
[22099.674789][T22301] </TASK>
> After disabling PT_RECLAIM in .config, the issue disappears.
Thanks for the test, I'll take a closer look.
> The reason it doesn't occur on arm64 is that x86 is the only platform
> that supports ARCH_SUPPORTS_PT_RECLAIM.
>
>>
>>> [1] https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/kconfig_origin
>>>
>>> --
>>> Cheers,
>>>
>>> David / dhildenb
>>>
>
> Thanks
> Barry
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-08-04 9:35 ` Qi Zheng
@ 2025-08-04 9:52 ` Qi Zheng
2025-08-04 10:04 ` Barry Song
0 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2025-08-04 9:52 UTC (permalink / raw)
To: Barry Song
Cc: David Hildenbrand, akpm, linux-mm, linux-kernel, Barry Song,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Jann Horn,
Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng, yi1.lai,
Lai, Yi
On 8/4/25 5:35 PM, Qi Zheng wrote:
>
>
> On 8/4/25 5:15 PM, Barry Song wrote:
>> On Mon, Aug 4, 2025 at 8:49 PM Lai, Yi <yi1.lai@linux.intel.com> wrote:
>>>
>>> On Mon, Aug 04, 2025 at 10:30:45AM +0200, David Hildenbrand wrote:
>>>> On 04.08.25 10:26, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 8/4/25 3:57 PM, David Hildenbrand wrote:
>>>>>> On 04.08.25 02:58, Lai, Yi wrote:
>>>>>>> Hi Barry Song,
>>>>>>>
>>>>>>> Greetings!
>>>>>>>
>>>>>>> I used Syzkaller and found that there is general protection fault in
>>>>>>> __pte_offset_map_lock in linux-next next-20250801.
>>>>>>>
>>>>>>> After bisection and the first bad commit is:
>>>>>>> "
>>>>>>> a6fde7add78d mm: use per_vma lock for MADV_DONTNEED
>>>>>>> "
>>>>>>>
>>>>>>> All detailed into can be found at:
>>>>>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>>>>>> main/250803_193026___pte_offset_map_lock
>>>>>>> Syzkaller repro code:
>>>>>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>>>>>> main/250803_193026___pte_offset_map_lock/repro.c
>>>>>>> Syzkaller repro syscall steps:
>>>>>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>>>>>> main/250803_193026___pte_offset_map_lock/repro.prog
>>>>>>> Syzkaller report:
>>>>>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>>>>>> main/250803_193026___pte_offset_map_lock/repro.report
>>>>>>> Kconfig(make olddefconfig):
>>>>>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>>>>>> main/250803_193026___pte_offset_map_lock/kconfig_origin
>>>>>>> Bisect info:
>>>>>>> https://github.com/laifryiee/syzkaller_logs/tree/
>>>>>>> main/250803_193026___pte_offset_map_lock/bisect_info.log
>>>>>>> bzImage:
>>>>>>> https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/
>>>>>>> main/250803_193026___pte_offset_map_lock/bzImage_next-20250801
>>>>>>> Issue dmesg:
>>>>>>> https://github.com/laifryiee/syzkaller_logs/blob/
>>>>>>> main/250803_193026___pte_offset_map_lock/next-20250801_dmesg.log
>>>>>>
>>>>>> Skimming over the reproducer, we seem to have racing MADV_DONTNEED
>>>>>> and
>>>>>> MADV_COLLAPSE on the same anon area, but the problem only shows up
>>>>>> once
>>>>>> we tear down that MM.
>>>>>>
>>>>>> If I would have to guess, I'd assume it's related to PT_RECLAIM
>>>>>> reclaiming empty page tables during MADV_DONTNEED -- but the kconfig
>>>>>> does not indicate that CONFIG_PT_RECLAIM was set.
>>>>>
>>>>> On the x86_64, if PT_RECLAIM is not manually disabled, PT_RECLAIM
>>>>> should
>>>>> be enabled
>>>>
>>>> That's what I thought: but I was not able to spot it in the provided
>>>> config
>>>> [1].
>>>>
>>>> Or is that config *before* "make olfconfig"? confusing. I would want
>>>> to see
>>>> the actually used config.
>>>>
>>>>
>>>>
>>> My kernel compiling steps:
>>> 1. copy kconfig_origin to kernel_source_folder/.config
>>> 2. make olddefconfig
>>> 3. make bzImage -jx
>>>
>>> I have also uploaded the actual .config during compiling.
>>> [2] https://github.com/laifryiee/syzkaller_logs/blob/
>>> main/250803_193026___pte_offset_map_lock/.config
>>> CONFIG_ARCH_SUPPORTS_PT_RECLAIM=y
>>> CONFIG_PT_RECLAIM=y
>>
>> Thanks! I can reproduce the issue within one second.
>
> I also reproduced it locally.
Hi Barry, can you reproduce this problem reliably? I haven't been able to
reproduce it again after hitting it once. :(
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-08-04 9:52 ` Qi Zheng
@ 2025-08-04 10:04 ` Barry Song
0 siblings, 0 replies; 26+ messages in thread
From: Barry Song @ 2025-08-04 10:04 UTC (permalink / raw)
To: Qi Zheng
Cc: David Hildenbrand, akpm, linux-mm, linux-kernel, Barry Song,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Jann Horn,
Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng, yi1.lai,
Lai, Yi
On Mon, Aug 4, 2025 at 9:54 PM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
[...]
> >>>>
> >>> My kernel compiling steps:
> >>> 1. copy kconfig_origin to kernel_source_folder/.config
> >>> 2. make olddefconfig
> >>> 3. make bzImage -jx
> >>>
> >>> I have also uploaded the actual .config during compiling.
> >>> [2] https://github.com/laifryiee/syzkaller_logs/blob/
> >>> main/250803_193026___pte_offset_map_lock/.config
> >>> CONFIG_ARCH_SUPPORTS_PT_RECLAIM=y
> >>> CONFIG_PT_RECLAIM=y
> >>
> >> Thanks! I can reproduce the issue within one second.
> >
> > I also reproduced it locally.
>
> Hi Barry, can you reproduce this problem stably? I can't reproduce it
> again after reproducing it once. :(
>
Yes, the issue reliably reproduces within a short time after every reboot.
[root@test ~]# ./repro
[ 18.189594] loop0: detected capacity change from 0 to 65536
[ 18.339652] loop0: detected capacity change from 0 to 65536
[ 18.488875] loop0: detected capacity change from 0 to 65536
[ 18.658206] loop0: detected capacity change from 0 to 65536
[ 18.793696] loop0: detected capacity change from 0 to 65536
[ 18.796660] Oops: general protection fault, probably for
non-canonical address 0xdffffc0000000003: 0000 [#1] SMP KASAN NOPTI
[ 18.797417] KASAN: null-ptr-deref in range
[0x0000000000000018-0x000000000000001f]
[ 18.797926] CPU: 0 UID: 0 PID: 635 Comm: repro Not tainted
6.16.0-next-20250801-kvm+ #3 PREEMPT(voluntary)
[ 18.798568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 18.799324] RIP: 0010:kasan_byte_accessible+0x15/0x30
[ 18.799685] Code: cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90
90 90 90 90 90 66 0f 1f 00 48 b8 00 00 00 00 00 fc ff df 48 c1 ef 03
48 01 c7 <0f> b6 07 3c 07 0f 96 c0 c3 cc cc cc cc 66 66 2e 0f 1f 0
[ 18.800906] RSP: 0018:ffff8880136df678 EFLAGS: 00010286
[ 18.801265] RAX: dffffc0000000000 RBX: 0000000000000001 RCX: 1ffffffff0dde60c
[ 18.801742] RDX: 0000000000000000 RSI: ffffffff85da2e18 RDI: dffffc0000000003
[ 18.802224] RBP: ffff8880136df698 R08: 0000000000000001 R09: 0000000000000000
[ 18.802702] R10: 0000000000000000 R11: ffff8880123558d8 R12: 0000000000000018
[ 18.803181] R13: 0000000000000018 R14: ffffffff85da2e18 R15: 0000000000000000
[ 18.803659] FS: 0000000000000000(0000) GS:ffff8880e3b40000(0000)
knlGS:0000000000000000
[ 18.804199] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 18.804588] CR2: 00007f52bf71ffc0 CR3: 00000000119eb000 CR4: 0000000000750ef0
[ 18.805066] PKRU: 55555554
[ 18.805260] Call Trace:
Thanks
Barry
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-08-04 7:57 ` David Hildenbrand
2025-08-04 8:26 ` Qi Zheng
@ 2025-08-04 21:48 ` Barry Song
2025-08-05 2:52 ` Lai, Yi
1 sibling, 1 reply; 26+ messages in thread
From: Barry Song @ 2025-08-04 21:48 UTC (permalink / raw)
To: David Hildenbrand, Lai, Yi
Cc: akpm, linux-mm, linux-kernel, Barry Song, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, Qi Zheng, yi1.lai
On Mon, Aug 4, 2025 at 7:57 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 04.08.25 02:58, Lai, Yi wrote:
> > Hi Barry Song,
> >
> > Greetings!
> >
> > I used Syzkaller and found that there is general protection fault in __pte_offset_map_lock in linux-next next-20250801.
> >
> > After bisection and the first bad commit is:
> > "
> > a6fde7add78d mm: use per_vma lock for MADV_DONTNEED
> > "
> >
> > All detailed into can be found at:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock
> > Syzkaller repro code:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/repro.c
> > Syzkaller repro syscall steps:
[...]
>
> Skimming over the reproducer, we seem to have racing MADV_DONTNEED and
> MADV_COLLAPSE on the same anon area, but the problem only shows up once
> we tear down that MM.
>
This seems to be where the race happens.
Hi Lai, could you also double-check whether the diff below fixes the problem?
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 374a6a5193a7..6b40bdfd224c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct
*mm, unsigned long address,
if (result != SCAN_SUCCEED)
goto out_up_write;
/* check if the pmd is still valid */
+ vma_start_write(vma);
result = check_pmd_still_valid(mm, address, pmd);
if (result != SCAN_SUCCEED)
goto out_up_write;
- vma_start_write(vma);
anon_vma_lock_write(vma->anon_vma);
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> If I would have to guess, I'd assume it's related to PT_RECLAIM
> reclaiming empty page tables during MADV_DONTNEED -- but the kconfig
> does not indicate that CONFIG_PT_RECLAIM was set.
>
> --
> Cheers,
>
> David / dhildenb
>
Thanks
Barry
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
2025-08-04 21:48 ` Barry Song
@ 2025-08-05 2:52 ` Lai, Yi
0 siblings, 0 replies; 26+ messages in thread
From: Lai, Yi @ 2025-08-05 2:52 UTC (permalink / raw)
To: Barry Song
Cc: David Hildenbrand, akpm, linux-mm, linux-kernel, Barry Song,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Jann Horn,
Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng, Qi Zheng,
yi1.lai
On Tue, Aug 05, 2025 at 09:48:32AM +1200, Barry Song wrote:
> On Mon, Aug 4, 2025 at 7:57 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 04.08.25 02:58, Lai, Yi wrote:
> > > Hi Barry Song,
> > >
> > > Greetings!
> > >
> > > I used Syzkaller and found that there is general protection fault in __pte_offset_map_lock in linux-next next-20250801.
> > >
> > > After bisection and the first bad commit is:
> > > "
> > > a6fde7add78d mm: use per_vma lock for MADV_DONTNEED
> > > "
> > >
> > > All detailed into can be found at:
> > > https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock
> > > Syzkaller repro code:
> > > https://github.com/laifryiee/syzkaller_logs/tree/main/250803_193026___pte_offset_map_lock/repro.c
> > > Syzkaller repro syscall steps:
> [...]
> >
> > Skimming over the reproducer, we seem to have racing MADV_DONTNEED and
> > MADV_COLLAPSE on the same anon area, but the problem only shows up once
> > we tear down that MM.
> >
>
> This seems to be where the race happens.
> Hi Lai, can you also double check if the below diff fixes the problem?
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 374a6a5193a7..6b40bdfd224c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct
> *mm, unsigned long address,
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> + vma_start_write(vma);
> result = check_pmd_still_valid(mm, address, pmd);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
>
> - vma_start_write(vma);
> anon_vma_lock_write(vma->anon_vma);
>
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>
I applied this change on top of both linux-next next-20250801 and
next-20250804 separately. The issue can no longer be reproduced with the
reproducer.
> > If I would have to guess, I'd assume it's related to PT_RECLAIM
> > reclaiming empty page tables during MADV_DONTNEED -- but the kconfig
> > does not indicate that CONFIG_PT_RECLAIM was set.
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
>
> Thanks
> Barry
^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2025-08-05 2:52 UTC | newest]
Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-07 22:01 [PATCH v4] mm: use per_vma lock for MADV_DONTNEED Barry Song
2025-06-09 7:21 ` Qi Zheng
2025-06-17 13:38 ` Lorenzo Stoakes
2025-06-18 2:25 ` Lance Yang
2025-06-18 9:52 ` Barry Song
2025-06-18 10:18 ` David Hildenbrand
2025-06-18 10:30 ` Barry Song
2025-06-18 10:32 ` Barry Song
2025-06-18 13:05 ` Lance Yang
2025-06-18 13:13 ` David Hildenbrand
2025-06-18 10:11 ` Barry Song
2025-06-18 10:33 ` Lorenzo Stoakes
2025-06-18 10:36 ` Barry Song
2025-08-04 0:58 ` Lai, Yi
2025-08-04 7:19 ` Barry Song
2025-08-04 7:57 ` David Hildenbrand
2025-08-04 8:26 ` Qi Zheng
2025-08-04 8:30 ` David Hildenbrand
2025-08-04 8:49 ` Lai, Yi
2025-08-04 9:15 ` Barry Song
2025-08-04 9:35 ` Qi Zheng
2025-08-04 9:52 ` Qi Zheng
2025-08-04 10:04 ` Barry Song
2025-08-04 21:48 ` Barry Song
2025-08-05 2:52 ` Lai, Yi
2025-08-04 8:19 ` Barry Song
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).