linux-mm.kvack.org archive mirror
* [PATCH v2 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
@ 2025-04-04 21:06 SeongJae Park
  2025-04-04 21:06 ` [PATCH v2 1/4] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() SeongJae Park
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: SeongJae Park @ 2025-04-04 21:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R.Howlett, David Hildenbrand, Lorenzo Stoakes,
	Rik van Riel, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or
MADV_FREE with multiple address ranges, tlb flushes happen for each of
the given address ranges.  Because such tlb flushes are for the same
process, doing those in a batch is more efficient while still being
safe.  Modify the process_madvise() entry level code path to do such
batched tlb flushes, while the internal unmap logic does only the
gathering of the tlb entries to flush.

In more detail, modify the entry functions to initialize an mmu_gather
object and pass it to the internal logic.  And make the internal logic
do only the gathering of the tlb entries to flush into the received
mmu_gather object.  After all internal function calls are done, the
entry functions flush the gathered tlb entries at once.
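
For example, after this series the do_madvise() path looks roughly like
below.  This is a simplified sketch based on patch 2 of this series;
error handling is abbreviated:

        struct mmu_gather tlb;
        struct madvise_behavior madv_behavior = {
                .behavior = behavior,
                .tlb = &tlb,
        };

        madvise_lock(mm, behavior);
        /* tlb_gather_mmu() only for behaviors that batch tlb flushes */
        madvise_init_tlb(&madv_behavior, mm);
        /* internal logic only gathers tlb entries into madv_behavior.tlb */
        error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
        /* flush the gathered tlb entries at once, via tlb_finish_mmu() */
        madvise_finish_tlb(&madv_behavior);
        madvise_unlock(mm, behavior);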

Because process_madvise() and madvise() share the internal unmap logic,
make the same change to the madvise() entry code as well, to keep the
code consistent and cleaner.  It is only for keeping the code clean, and
shouldn't degrade madvise().  It could rather provide a potential tlb
flush reduction benefit for the case that there are multiple vmas for
the given address range.  Since that is only a side effect of the effort
to keep the code clean, we don't measure it separately.

Similar optimizations might be applicable to other madvise behaviors
such as MADV_COLD and MADV_PAGEOUT.  Those are simply out of the scope
of this patch series, though.

Patch Sequence
==============

The first patch defines a new data structure for managing the
information that is required for batched tlb flushes (mmu_gather and
behavior), and updates the code paths for the internal
MADV_DONTNEED[_LOCKED] and MADV_FREE handling logic to receive it.

The second patch batches tlb flushes for MADV_FREE handling, for both
madvise() and process_madvise().

The remaining two patches are for MADV_DONTNEED[_LOCKED] tlb flushes
batching.  The third patch splits zap_page_range_single() for batching
of MADV_DONTNEED[_LOCKED] handling.  The fourth and final patch batches
tlb flushes for the hint, using the sub-logic that the third patch split
out and the helpers for batched tlb flushes that the second patch
introduced for the MADV_FREE case.

Test Results
============

I measured the latency of applying MADV_DONTNEED advice to 256 MiB of
memory using multiple process_madvise() calls.  I apply the advice at
4 KiB region granularity, while varying the batch size per
process_madvise() call (vlen) from 1 to 1024.  The source code for the
measurement is available on GitHub[1].  To reduce measurement errors, I
repeated the measurement five times.
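
For reference, each process_madvise() call of the benchmark looks
roughly like below.  This is only an illustrative sketch; the helper
name is made up here, and the real code is in the linked repository:

        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <sys/uio.h>
        #include <unistd.h>

        /* Apply MADV_DONTNEED to 'vlen' consecutive 4 KiB regions of
         * 'buf' in the process that 'pidfd' refers to, in one call. */
        static long dontneed_batch(int pidfd, char *buf, size_t vlen)
        {
                struct iovec vec[1024];        /* vlen <= 1024 in this test */
                size_t i;

                for (i = 0; i < vlen; i++) {
                        vec[i].iov_base = buf + i * 4096;
                        vec[i].iov_len = 4096;
                }
                return syscall(__NR_process_madvise, pidfd, vec, vlen,
                               MADV_DONTNEED, 0);
        }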

The measurement results are as below.  The 'sz_batch' column shows the
batch size of the process_madvise() calls.  The 'Before' and 'After'
columns show the average latencies in nanoseconds, measured five times,
on kernels built without and with the tlb flush batching of this series
(patches 3 and 4), respectively.  For the baseline, the mm-new tree of
2025-04-04[2] has been used.  The 'B-stdev' and 'A-stdev' columns show
the ratio of the standard deviation of the latency measurements to their
average, in percent, for 'Before' and 'After', respectively.
'Latency_reduction' shows the reduction of latency that 'After' achieved
compared to 'Before', in percent.  Higher 'Latency_reduction' values
mean larger efficiency improvements.

    sz_batch   Before        B-stdev   After         A-stdev   Latency_reduction
    1          110948138.2   5.55      109476402.8   4.28      1.33
    2          75678535.6    1.67      70470722.2    3.55      6.88
    4          59530647.6    4.77      51735606.6    3.44      13.09
    8          50013051.6    4.39      44377029.8    5.20      11.27
    16         48657878.2    9.32      37291600.4    3.39      23.36
    32         43614180.2    6.06      34127428      3.75      21.75
    64         42466694.2    5.70      26737935.2    2.54      37.04
    128        42977970      6.99      25050444.2    4.06      41.71
    256        41549546      1.88      24644375.8    3.77      40.69
    512        42162698.6    6.17      24068224.8    2.87      42.92
    1024       40978574      5.44      23872024.2    3.65      41.75

As expected, the tlb flush batching provides a latency reduction that
grows with the batch size.  The efficiency gain ranges from about 6.88
percent with batch size 2 to about 40 percent with batch size 128.

Please note that this is a very simple microbenchmark, so the
efficiency gain on real workloads could be very different.

Changelog
=========

Changes from v1
(https://lore.kernel.org/20250310172318.653630-1-sj@kernel.org)
- Split code cleanup part out
- Keep the order between tlb flushes and hugetlb_zap_end()
- Put mm/memory change just before its real usage
- Add VM_WARN_ON_ONCE() for invalid tlb argument to unmap_vma_single()
- Cleanups following nice reviewers suggestions

Changes from RFC
(https://lore.kernel.org/20250305181611.54484-1-sj@kernel.org)
- Clarify motivation of the change on the cover letter
- Add average and stdev of evaluation results
- Show latency reduction on evaluation results
- Fix !CONFIG_MEMORY_FAILURE build error
- Rename is_memory_populate() to is_madvise_populate()
- Squash patches 5-8
- Add kerneldoc for unmap_vm_area_struct()
- Squash patches 10 and 11
- Squash patches 12-14
- Squash patches 15 and 16

References
==========

[1] https://github.com/sjp38/eval_proc_madvise
[2] commit edd67244fe67 ("mm/show_mem: optimize si_meminfo_node by reducing redundant code") # mm-new

SeongJae Park (4):
  mm/madvise: define and use madvise_behavior struct for
    madvise_do_behavior()
  mm/madvise: batch tlb flushes for MADV_FREE
  mm/memory: split non-tlb flushing part from zap_page_range_single()
  mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED]

 mm/internal.h |   3 ++
 mm/madvise.c  | 110 ++++++++++++++++++++++++++++++++++++--------------
 mm/memory.c   |  47 ++++++++++++++++-----
 3 files changed, 121 insertions(+), 39 deletions(-)


base-commit: 85b87628fae973dedae95f2ea2782b7df4c11322
-- 
2.39.5



* [PATCH v2 1/4] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior()
  2025-04-04 21:06 [PATCH v2 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
@ 2025-04-04 21:06 ` SeongJae Park
  2025-04-04 21:06 ` [PATCH v2 2/4] mm/madvise: batch tlb flushes for MADV_FREE SeongJae Park
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: SeongJae Park @ 2025-04-04 21:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R.Howlett, David Hildenbrand, Lorenzo Stoakes,
	Rik van Riel, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

To implement batched tlb flushes for MADV_DONTNEED[_LOCKED] and
MADV_FREE, an mmu_gather object needs to be passed to the internal
logic, in addition to the behavior integer.  Using a struct makes this
easy without increasing the number of parameters on all code paths
towards the internal logic.  Define a struct for the purpose and use it
on the code path that starts from madvise_do_behavior() and ends on
madvise_dontneed_free().  Note that this changes the
madvise_walk_vmas() visitor type signature, too.  Specifically, it
changes its 'arg' type from 'unsigned long' to the new struct pointer.

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/madvise.c | 46 +++++++++++++++++++++++++++++-----------------
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index b17f684322ad..8bcfdd995d18 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -48,6 +48,11 @@ struct madvise_walk_private {
 	bool pageout;
 };
 
+struct madvise_behavior {
+	int behavior;
+	struct mmu_gather *tlb;
+};
+
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
  * take mmap_lock for writing. Others, which simply traverse vmas, need
@@ -893,12 +898,13 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_free(struct vm_area_struct *vma,
 				  struct vm_area_struct **prev,
 				  unsigned long start, unsigned long end,
-				  int behavior)
+				  struct madvise_behavior *behavior)
 {
+	int action = behavior->behavior;
 	struct mm_struct *mm = vma->vm_mm;
 
 	*prev = vma;
-	if (!madvise_dontneed_free_valid_vma(vma, start, &end, behavior))
+	if (!madvise_dontneed_free_valid_vma(vma, start, &end, action))
 		return -EINVAL;
 
 	if (start == end)
@@ -915,8 +921,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 		 * Potential end adjustment for hugetlb vma is OK as
 		 * the check below keeps end within vma.
 		 */
-		if (!madvise_dontneed_free_valid_vma(vma, start, &end,
-						     behavior))
+		if (!madvise_dontneed_free_valid_vma(vma, start, &end, action))
 			return -EINVAL;
 		if (end > vma->vm_end) {
 			/*
@@ -945,9 +950,9 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 		VM_WARN_ON(start > end);
 	}
 
-	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
+	if (action == MADV_DONTNEED || action == MADV_DONTNEED_LOCKED)
 		return madvise_dontneed_single_vma(vma, start, end);
-	else if (behavior == MADV_FREE)
+	else if (action == MADV_FREE)
 		return madvise_free_single_vma(vma, start, end);
 	else
 		return -EINVAL;
@@ -1249,8 +1254,10 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
 static int madvise_vma_behavior(struct vm_area_struct *vma,
 				struct vm_area_struct **prev,
 				unsigned long start, unsigned long end,
-				unsigned long behavior)
+				void *behavior_arg)
 {
+	struct madvise_behavior *arg = behavior_arg;
+	int behavior = arg->behavior;
 	int error;
 	struct anon_vma_name *anon_name;
 	unsigned long new_flags = vma->vm_flags;
@@ -1270,7 +1277,7 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 	case MADV_FREE:
 	case MADV_DONTNEED:
 	case MADV_DONTNEED_LOCKED:
-		return madvise_dontneed_free(vma, prev, start, end, behavior);
+		return madvise_dontneed_free(vma, prev, start, end, arg);
 	case MADV_NORMAL:
 		new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
 		break;
@@ -1487,10 +1494,10 @@ static bool process_madvise_remote_valid(int behavior)
  */
 static
 int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
-		      unsigned long end, unsigned long arg,
+		      unsigned long end, void *arg,
 		      int (*visit)(struct vm_area_struct *vma,
 				   struct vm_area_struct **prev, unsigned long start,
-				   unsigned long end, unsigned long arg))
+				   unsigned long end, void *arg))
 {
 	struct vm_area_struct *vma;
 	struct vm_area_struct *prev;
@@ -1548,7 +1555,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
 static int madvise_vma_anon_name(struct vm_area_struct *vma,
 				 struct vm_area_struct **prev,
 				 unsigned long start, unsigned long end,
-				 unsigned long anon_name)
+				 void *anon_name)
 {
 	int error;
 
@@ -1557,7 +1564,7 @@ static int madvise_vma_anon_name(struct vm_area_struct *vma,
 		return -EBADF;
 
 	error = madvise_update_vma(vma, prev, start, end, vma->vm_flags,
-				   (struct anon_vma_name *)anon_name);
+				   anon_name);
 
 	/*
 	 * madvise() returns EAGAIN if kernel resources, such as
@@ -1589,7 +1596,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 	if (end == start)
 		return 0;
 
-	return madvise_walk_vmas(mm, start, end, (unsigned long)anon_name,
+	return madvise_walk_vmas(mm, start, end, anon_name,
 				 madvise_vma_anon_name);
 }
 #endif /* CONFIG_ANON_VMA_NAME */
@@ -1677,8 +1684,10 @@ static bool is_madvise_populate(int behavior)
 }
 
 static int madvise_do_behavior(struct mm_struct *mm,
-		unsigned long start, size_t len_in, int behavior)
+		unsigned long start, size_t len_in,
+		struct madvise_behavior *madv_behavior)
 {
+	int behavior = madv_behavior->behavior;
 	struct blk_plug plug;
 	unsigned long end;
 	int error;
@@ -1692,7 +1701,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
 	if (is_madvise_populate(behavior))
 		error = madvise_populate(mm, start, end, behavior);
 	else
-		error = madvise_walk_vmas(mm, start, end, behavior,
+		error = madvise_walk_vmas(mm, start, end, madv_behavior,
 					  madvise_vma_behavior);
 	blk_finish_plug(&plug);
 	return error;
@@ -1773,13 +1782,14 @@ static int madvise_do_behavior(struct mm_struct *mm,
 int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
 {
 	int error;
+	struct madvise_behavior madv_behavior = {.behavior = behavior};
 
 	if (madvise_should_skip(start, len_in, behavior, &error))
 		return error;
 	error = madvise_lock(mm, behavior);
 	if (error)
 		return error;
-	error = madvise_do_behavior(mm, start, len_in, behavior);
+	error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
 	madvise_unlock(mm, behavior);
 
 	return error;
@@ -1796,6 +1806,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 {
 	ssize_t ret = 0;
 	size_t total_len;
+	struct madvise_behavior madv_behavior = {.behavior = behavior};
 
 	total_len = iov_iter_count(iter);
 
@@ -1811,7 +1822,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 		if (madvise_should_skip(start, len_in, behavior, &error))
 			ret = error;
 		else
-			ret = madvise_do_behavior(mm, start, len_in, behavior);
+			ret = madvise_do_behavior(mm, start, len_in,
+					&madv_behavior);
 		/*
 		 * An madvise operation is attempting to restart the syscall,
 		 * but we cannot proceed as it would not be correct to repeat
-- 
2.39.5



* [PATCH v2 2/4] mm/madvise: batch tlb flushes for MADV_FREE
  2025-04-04 21:06 [PATCH v2 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
  2025-04-04 21:06 ` [PATCH v2 1/4] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() SeongJae Park
@ 2025-04-04 21:06 ` SeongJae Park
  2025-04-08 12:58   ` Lorenzo Stoakes
  2025-04-04 21:06 ` [PATCH v2 3/4] mm/memory: split non-tlb flushing part from zap_page_range_single() SeongJae Park
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: SeongJae Park @ 2025-04-04 21:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R.Howlett, David Hildenbrand, Lorenzo Stoakes,
	Rik van Riel, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

MADV_FREE handling for [process_]madvise() flushes the tlb for each vma
of each address range.  Update the logic to do the tlb flushes in a
batched way.  Initialize an mmu_gather object from do_madvise() and
vector_madvise(), which are the entry level functions for madvise() and
process_madvise(), respectively.  And pass those objects to the function
for per-vma work, via the madvise_behavior struct.  Make the per-vma
logic not flush the tlb on its own but just save the tlb entries to the
received mmu_gather object.  Finally, the entry level functions flush
the tlb entries gathered for the entire user request, at once.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/madvise.c | 59 +++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 47 insertions(+), 12 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 8bcfdd995d18..564095e381b2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -799,12 +799,13 @@ static const struct mm_walk_ops madvise_free_walk_ops = {
 	.walk_lock		= PGWALK_RDLOCK,
 };
 
-static int madvise_free_single_vma(struct vm_area_struct *vma,
-			unsigned long start_addr, unsigned long end_addr)
+static int madvise_free_single_vma(
+		struct madvise_behavior *behavior, struct vm_area_struct *vma,
+		unsigned long start_addr, unsigned long end_addr)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_notifier_range range;
-	struct mmu_gather tlb;
+	struct mmu_gather *tlb = behavior->tlb;
 
 	/* MADV_FREE works for only anon vma at the moment */
 	if (!vma_is_anonymous(vma))
@@ -820,17 +821,14 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 				range.start, range.end);
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm);
 	update_hiwater_rss(mm);
 
 	mmu_notifier_invalidate_range_start(&range);
-	tlb_start_vma(&tlb, vma);
+	tlb_start_vma(tlb, vma);
 	walk_page_range(vma->vm_mm, range.start, range.end,
-			&madvise_free_walk_ops, &tlb);
-	tlb_end_vma(&tlb, vma);
+			&madvise_free_walk_ops, tlb);
+	tlb_end_vma(tlb, vma);
 	mmu_notifier_invalidate_range_end(&range);
-	tlb_finish_mmu(&tlb);
-
 	return 0;
 }
 
@@ -953,7 +951,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 	if (action == MADV_DONTNEED || action == MADV_DONTNEED_LOCKED)
 		return madvise_dontneed_single_vma(vma, start, end);
 	else if (action == MADV_FREE)
-		return madvise_free_single_vma(vma, start, end);
+		return madvise_free_single_vma(behavior, vma, start, end);
 	else
 		return -EINVAL;
 }
@@ -1626,6 +1624,29 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
 		mmap_read_unlock(mm);
 }
 
+static bool madvise_batch_tlb_flush(int behavior)
+{
+	switch (behavior) {
+	case MADV_FREE:
+		return true;
+	default:
+		return false;
+	}
+}
+
+static void madvise_init_tlb(struct madvise_behavior *madv_behavior,
+		struct mm_struct *mm)
+{
+	if (madvise_batch_tlb_flush(madv_behavior->behavior))
+		tlb_gather_mmu(madv_behavior->tlb, mm);
+}
+
+static void madvise_finish_tlb(struct madvise_behavior *madv_behavior)
+{
+	if (madvise_batch_tlb_flush(madv_behavior->behavior))
+		tlb_finish_mmu(madv_behavior->tlb);
+}
+
 static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
 {
 	size_t len;
@@ -1782,14 +1803,20 @@ static int madvise_do_behavior(struct mm_struct *mm,
 int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
 {
 	int error;
-	struct madvise_behavior madv_behavior = {.behavior = behavior};
+	struct mmu_gather tlb;
+	struct madvise_behavior madv_behavior = {
+		.behavior = behavior,
+		.tlb = &tlb,
+	};
 
 	if (madvise_should_skip(start, len_in, behavior, &error))
 		return error;
 	error = madvise_lock(mm, behavior);
 	if (error)
 		return error;
+	madvise_init_tlb(&madv_behavior, mm);
 	error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
+	madvise_finish_tlb(&madv_behavior);
 	madvise_unlock(mm, behavior);
 
 	return error;
@@ -1806,13 +1833,18 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 {
 	ssize_t ret = 0;
 	size_t total_len;
-	struct madvise_behavior madv_behavior = {.behavior = behavior};
+	struct mmu_gather tlb;
+	struct madvise_behavior madv_behavior = {
+		.behavior = behavior,
+		.tlb = &tlb,
+	};
 
 	total_len = iov_iter_count(iter);
 
 	ret = madvise_lock(mm, behavior);
 	if (ret)
 		return ret;
+	madvise_init_tlb(&madv_behavior, mm);
 
 	while (iov_iter_count(iter)) {
 		unsigned long start = (unsigned long)iter_iov_addr(iter);
@@ -1841,14 +1873,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 			}
 
 			/* Drop and reacquire lock to unwind race. */
+			madvise_finish_tlb(&madv_behavior);
 			madvise_unlock(mm, behavior);
 			madvise_lock(mm, behavior);
+			madvise_init_tlb(&madv_behavior, mm);
 			continue;
 		}
 		if (ret < 0)
 			break;
 		iov_iter_advance(iter, iter_iov_len(iter));
 	}
+	madvise_finish_tlb(&madv_behavior);
 	madvise_unlock(mm, behavior);
 
 	ret = (total_len - iov_iter_count(iter)) ? : ret;
-- 
2.39.5



* [PATCH v2 3/4] mm/memory: split non-tlb flushing part from zap_page_range_single()
  2025-04-04 21:06 [PATCH v2 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
  2025-04-04 21:06 ` [PATCH v2 1/4] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() SeongJae Park
  2025-04-04 21:06 ` [PATCH v2 2/4] mm/madvise: batch tlb flushes for MADV_FREE SeongJae Park
@ 2025-04-04 21:06 ` SeongJae Park
  2025-04-08 13:29   ` Lorenzo Stoakes
  2025-04-04 21:07 ` [PATCH v2 4/4] mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED] SeongJae Park
  2025-04-08 13:44 ` [PATCH v2 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE Lorenzo Stoakes
  4 siblings, 1 reply; 14+ messages in thread
From: SeongJae Park @ 2025-04-04 21:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R.Howlett, David Hildenbrand, Lorenzo Stoakes,
	Rik van Riel, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

Some of the zap_page_range_single() callers, such as [process_]madvise()
with MADV_DONTNEED[_LOCKED], cannot batch tlb flushes because
zap_page_range_single() flushes the tlb for each invocation.  Split out
the body of zap_page_range_single(), except the mmu_gather object
initialization and the flushing of the gathered tlb entries, for such
batched tlb flushing usage.

To avoid hugetlb page allocation failures from concurrent page faults,
however, the tlb flush should be done before unlocking hugetlb faults.
Do the flush and the unlock in that order inside the split-out function
for the hugetlb vma case.  Refer to commit 2820b0f09be9 ("hugetlbfs:
close race between MADV_DONTNEED and page fault") for more details about
the concurrent faults' page allocation failure problem.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/memory.c | 49 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 39 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8669b2c981a5..8c9bbb1a008c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1989,36 +1989,65 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
 	mmu_notifier_invalidate_range_end(&range);
 }
 
-/**
- * zap_page_range_single - remove user pages in a given range
+/*
+ * notify_unmap_single_vma - remove user pages in a given range
+ * @tlb: pointer to the caller's struct mmu_gather
  * @vma: vm_area_struct holding the applicable pages
- * @address: starting address of pages to zap
- * @size: number of bytes to zap
+ * @address: starting address of pages to remove
+ * @size: number of bytes to remove
  * @details: details of shared cache invalidation
  *
- * The range must fit into one VMA.
+ * @tlb shouldn't be NULL.  The range must fit into one VMA.  If @vma is for
+ * hugetlb, @tlb is flushed and re-initialized by this function.
  */
-void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+static void notify_unmap_single_vma(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *details)
 {
 	const unsigned long end = address + size;
 	struct mmu_notifier_range range;
-	struct mmu_gather tlb;
+
+	VM_WARN_ON_ONCE(!tlb);
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
 				address, end);
 	hugetlb_zap_begin(vma, &range.start, &range.end);
-	tlb_gather_mmu(&tlb, vma->vm_mm);
 	update_hiwater_rss(vma->vm_mm);
 	mmu_notifier_invalidate_range_start(&range);
 	/*
 	 * unmap 'address-end' not 'range.start-range.end' as range
 	 * could have been expanded for hugetlb pmd sharing.
 	 */
-	unmap_single_vma(&tlb, vma, address, end, details, false);
+	unmap_single_vma(tlb, vma, address, end, details, false);
 	mmu_notifier_invalidate_range_end(&range);
+	if (is_vm_hugetlb_page(vma)) {
+		/*
+		 * flush tlb and free resources before hugetlb_zap_end(), to
+		 * avoid concurrent page faults' allocation failure
+		 */
+		tlb_finish_mmu(tlb);
+		hugetlb_zap_end(vma, details);
+		tlb_gather_mmu(tlb, vma->vm_mm);
+	}
+}
+
+/**
+ * zap_page_range_single - remove user pages in a given range
+ * @vma: vm_area_struct holding the applicable pages
+ * @address: starting address of pages to zap
+ * @size: number of bytes to zap
+ * @details: details of shared cache invalidation
+ *
+ * The range must fit into one VMA.
+ */
+void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+		unsigned long size, struct zap_details *details)
+{
+	struct mmu_gather tlb;
+
+	tlb_gather_mmu(&tlb, vma->vm_mm);
+	notify_unmap_single_vma(&tlb, vma, address, size, details);
 	tlb_finish_mmu(&tlb);
-	hugetlb_zap_end(vma, details);
 }
 
 /**
-- 
2.39.5



* [PATCH v2 4/4] mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED]
  2025-04-04 21:06 [PATCH v2 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
                   ` (2 preceding siblings ...)
  2025-04-04 21:06 ` [PATCH v2 3/4] mm/memory: split non-tlb flushing part from zap_page_range_single() SeongJae Park
@ 2025-04-04 21:07 ` SeongJae Park
  2025-04-08 13:36   ` Lorenzo Stoakes
  2025-04-08 13:44 ` [PATCH v2 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE Lorenzo Stoakes
  4 siblings, 1 reply; 14+ messages in thread
From: SeongJae Park @ 2025-04-04 21:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R.Howlett, David Hildenbrand, Lorenzo Stoakes,
	Rik van Riel, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

Batch tlb flushes for MADV_DONTNEED[_LOCKED] for better efficiency, in
a way that is very similar to the tlb flush batching for MADV_FREE.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/internal.h | 3 +++
 mm/madvise.c  | 9 ++++++---
 mm/memory.c   | 4 ++--
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index e9695baa5922..be0c46837e22 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -435,6 +435,9 @@ void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details);
+void notify_unmap_single_vma(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, unsigned long addr,
+		unsigned long size, struct zap_details *details);
 int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
 			   gfp_t gfp);
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 564095e381b2..c7ac32b4a371 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -851,7 +851,8 @@ static int madvise_free_single_vma(
  * An interface that causes the system to free clean pages and flush
  * dirty pages is already available as msync(MS_INVALIDATE).
  */
-static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
+static long madvise_dontneed_single_vma(struct madvise_behavior *behavior,
+					struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
 	struct zap_details details = {
@@ -859,7 +860,7 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 		.even_cows = true,
 	};
 
-	zap_page_range_single(vma, start, end - start, &details);
+	notify_unmap_single_vma(behavior->tlb, vma, start, end - start, &details);
 	return 0;
 }
 
@@ -949,7 +950,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 	}
 
 	if (action == MADV_DONTNEED || action == MADV_DONTNEED_LOCKED)
-		return madvise_dontneed_single_vma(vma, start, end);
+		return madvise_dontneed_single_vma(behavior, vma, start, end);
 	else if (action == MADV_FREE)
 		return madvise_free_single_vma(behavior, vma, start, end);
 	else
@@ -1627,6 +1628,8 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
 static bool madvise_batch_tlb_flush(int behavior)
 {
 	switch (behavior) {
+	case MADV_DONTNEED:
+	case MADV_DONTNEED_LOCKED:
 	case MADV_FREE:
 		return true;
 	default:
diff --git a/mm/memory.c b/mm/memory.c
index 8c9bbb1a008c..6a01b73aa111 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1989,7 +1989,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
 	mmu_notifier_invalidate_range_end(&range);
 }
 
-/*
+/**
  * notify_unmap_single_vma - remove user pages in a given range
  * @tlb: pointer to the caller's struct mmu_gather
  * @vma: vm_area_struct holding the applicable pages
@@ -2000,7 +2000,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
  * @tlb shouldn't be NULL.  The range must fit into one VMA.  If @vma is for
  * hugetlb, @tlb is flushed and re-initialized by this function.
  */
-static void notify_unmap_single_vma(struct mmu_gather *tlb,
+void notify_unmap_single_vma(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *details)
 {
-- 
2.39.5



* Re: [PATCH v2 2/4] mm/madvise: batch tlb flushes for MADV_FREE
  2025-04-04 21:06 ` [PATCH v2 2/4] mm/madvise: batch tlb flushes for MADV_FREE SeongJae Park
@ 2025-04-08 12:58   ` Lorenzo Stoakes
  2025-04-08 18:49     ` SeongJae Park
  0 siblings, 1 reply; 14+ messages in thread
From: Lorenzo Stoakes @ 2025-04-08 12:58 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R.Howlett, David Hildenbrand, Rik van Riel,
	Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm

On Fri, Apr 04, 2025 at 02:06:58PM -0700, SeongJae Park wrote:
> MADV_FREE handling for [process_]madvise() flushes tlb for each vma of
> each address range.  Update the logic to do tlb flushes in a batched
> way.  Initialize an mmu_gather object from do_madvise() and
> vector_madvise(), which are the entry level functions for
> [process_]madvise(), respectively.  And pass those objects to the
> function for per-vma work, via madvise_behavior struct.  Make the
> per-vma logic not flushes tlb on their own but just saves the tlb
> entries to the received mmu_gather object.  Finally, the entry level
> functions flush the tlb entries that gathered for the entire user
> request, at once.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>

Other than some nitty stuff, and a desire for some careful testing of the
horrid edge case that err... I introduced :P this looks fine, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/madvise.c | 59 +++++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 47 insertions(+), 12 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 8bcfdd995d18..564095e381b2 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -799,12 +799,13 @@ static const struct mm_walk_ops madvise_free_walk_ops = {
>  	.walk_lock		= PGWALK_RDLOCK,
>  };
>
> -static int madvise_free_single_vma(struct vm_area_struct *vma,
> -			unsigned long start_addr, unsigned long end_addr)
> +static int madvise_free_single_vma(
> +		struct madvise_behavior *behavior, struct vm_area_struct *vma,

This is pedantic, but elsewhere you differentiate between the int behavior
and struct madvise_behavior by referring to the latter as madv_behavior.

The naming kind of sucks in general though.

But for consistency, let's maybe rename this to madv_behavior, and we can
maybe do a commit later to do a rename across the board?

> +		unsigned long start_addr, unsigned long end_addr)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct mmu_notifier_range range;
> -	struct mmu_gather tlb;
> +	struct mmu_gather *tlb = behavior->tlb;
>
>  	/* MADV_FREE works for only anon vma at the moment */
>  	if (!vma_is_anonymous(vma))
> @@ -820,17 +821,14 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>  				range.start, range.end);
>
>  	lru_add_drain();
> -	tlb_gather_mmu(&tlb, mm);
>  	update_hiwater_rss(mm);
>
>  	mmu_notifier_invalidate_range_start(&range);
> -	tlb_start_vma(&tlb, vma);
> +	tlb_start_vma(tlb, vma);
>  	walk_page_range(vma->vm_mm, range.start, range.end,
> -			&madvise_free_walk_ops, &tlb);
> -	tlb_end_vma(&tlb, vma);
> +			&madvise_free_walk_ops, tlb);
> +	tlb_end_vma(tlb, vma);
>  	mmu_notifier_invalidate_range_end(&range);
> -	tlb_finish_mmu(&tlb);
> -
>  	return 0;
>  }
>
> @@ -953,7 +951,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>  	if (action == MADV_DONTNEED || action == MADV_DONTNEED_LOCKED)
>  		return madvise_dontneed_single_vma(vma, start, end);
>  	else if (action == MADV_FREE)
> -		return madvise_free_single_vma(vma, start, end);
> +		return madvise_free_single_vma(behavior, vma, start, end);
>  	else
>  		return -EINVAL;

On error paths, do we correctly finish the batched (botched? :P) TLB
operation?

>  }
> @@ -1626,6 +1624,29 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
>  		mmap_read_unlock(mm);
>  }
>
> +static bool madvise_batch_tlb_flush(int behavior)
> +{
> +	switch (behavior) {
> +	case MADV_FREE:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
> +static void madvise_init_tlb(struct madvise_behavior *madv_behavior,
> +		struct mm_struct *mm)
> +{
> +	if (madvise_batch_tlb_flush(madv_behavior->behavior))
> +		tlb_gather_mmu(madv_behavior->tlb, mm);
> +}
> +
> +static void madvise_finish_tlb(struct madvise_behavior *madv_behavior)
> +{
> +	if (madvise_batch_tlb_flush(madv_behavior->behavior))
> +		tlb_finish_mmu(madv_behavior->tlb);
> +}

These are nice.

> +
>  static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
>  {
>  	size_t len;
> @@ -1782,14 +1803,20 @@ static int madvise_do_behavior(struct mm_struct *mm,
>  int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
>  {
>  	int error;
> -	struct madvise_behavior madv_behavior = {.behavior = behavior};
> +	struct mmu_gather tlb;
> +	struct madvise_behavior madv_behavior = {
> +		.behavior = behavior,
> +		.tlb = &tlb,
> +	};
>
>  	if (madvise_should_skip(start, len_in, behavior, &error))
>  		return error;
>  	error = madvise_lock(mm, behavior);
>  	if (error)
>  		return error;
> +	madvise_init_tlb(&madv_behavior, mm);
>  	error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
> +	madvise_finish_tlb(&madv_behavior);
>  	madvise_unlock(mm, behavior);
>
>  	return error;
> @@ -1806,13 +1833,18 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>  {
>  	ssize_t ret = 0;
>  	size_t total_len;
> -	struct madvise_behavior madv_behavior = {.behavior = behavior};
> +	struct mmu_gather tlb;
> +	struct madvise_behavior madv_behavior = {
> +		.behavior = behavior,
> +		.tlb = &tlb,
> +	};

Again the naming is kinda yucky, but let's just yeah for now stick with
'madv_behavior' for values of this helper struct and 'behavior' for the
actual int value, and we can revisit that later.

>
>  	total_len = iov_iter_count(iter);
>
>  	ret = madvise_lock(mm, behavior);
>  	if (ret)
>  		return ret;
> +	madvise_init_tlb(&madv_behavior, mm);
>
>  	while (iov_iter_count(iter)) {
>  		unsigned long start = (unsigned long)iter_iov_addr(iter);
> @@ -1841,14 +1873,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>  			}
>
>  			/* Drop and reacquire lock to unwind race. */
> +			madvise_finish_tlb(&madv_behavior);
>  			madvise_unlock(mm, behavior);
>  			madvise_lock(mm, behavior);
> +			madvise_init_tlb(&madv_behavior, mm);
>  			continue;

Have you found a way in which to test this? Perhaps force this case and
find a means of asserting the TLB flushing behaves as expected? I think
we're ok from the logic, but it's such a tricky one it'd be good to find a
means of doing so, albeit in a manual way.

>  		}
>  		if (ret < 0)
>  			break;
>  		iov_iter_advance(iter, iter_iov_len(iter));
>  	}
> +	madvise_finish_tlb(&madv_behavior);
>  	madvise_unlock(mm, behavior);
>
>  	ret = (total_len - iov_iter_count(iter)) ? : ret;
> --
> 2.39.5



* Re: [PATCH v2 3/4] mm/memory: split non-tlb flushing part from zap_page_range_single()
  2025-04-04 21:06 ` [PATCH v2 3/4] mm/memory: split non-tlb flushing part from zap_page_range_single() SeongJae Park
@ 2025-04-08 13:29   ` Lorenzo Stoakes
  2025-04-08 20:12     ` SeongJae Park
  0 siblings, 1 reply; 14+ messages in thread
From: Lorenzo Stoakes @ 2025-04-08 13:29 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R.Howlett, David Hildenbrand, Rik van Riel,
	Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm

On Fri, Apr 04, 2025 at 02:06:59PM -0700, SeongJae Park wrote:
> Some of zap_page_range_single() callers such as [process_]madvise() with
> MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because
> zap_page_range_single() flushes tlb for each invocation.  Split out the
> body of zap_page_range_single() except mmu_gather object initialization
> and gathered tlb entries flushing for such batched tlb flushing usage.
>
> To avoid hugetlb pages allocation failures from concurrent page faults,
> the tlb flush should be done before hugetlb faults unlocking, though.
> Do the flush and the unlock inside the split out function in the order
> for hugetlb vma case.  Refer to commit 2820b0f09be9 ("hugetlbfs: close
> race between MADV_DONTNEED and page fault") for more details about the
> concurrent faults' page allocation failure problem.

Good lord, I really hate how we do hugetlb.

>
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
>  mm/memory.c | 49 +++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 39 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 8669b2c981a5..8c9bbb1a008c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1989,36 +1989,65 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
>  	mmu_notifier_invalidate_range_end(&range);
>  }
>
> -/**
> - * zap_page_range_single - remove user pages in a given range
> +/*
> + * notify_unmap_single_vma - remove user pages in a given range
> + * @tlb: pointer to the caller's struct mmu_gather
>   * @vma: vm_area_struct holding the applicable pages
> - * @address: starting address of pages to zap
> - * @size: number of bytes to zap
> + * @address: starting address of pages to remove
> + * @size: number of bytes to remove
>   * @details: details of shared cache invalidation
>   *
> - * The range must fit into one VMA.
> + * @tlb shouldn't be NULL.  The range must fit into one VMA.  If @vma is for
> + * hugetlb, @tlb is flushed and re-initialized by this function.
>   */
> -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> +static void notify_unmap_single_vma(struct mmu_gather *tlb,
> +		struct vm_area_struct *vma, unsigned long address,
>  		unsigned long size, struct zap_details *details)

Don't love this name. It seems to imply that the primary thing here is the MMU
notification bit.

This is like unmap_vmas() but for a single VMA, that is - contains the
logic unmap_vmas() does for mmu notifier stuff and hugetlb stuff (vom in
mouth), deferring heavy lifting to unmap_single_vma().

I think it might be better to just go with the brainless
'__zap_page_range_single()' here honestly. Then we at least reduce the mess
of confusion in this function naming here.

Of course you intend to un-static this shortly... so maybe that's not so
great.

zap_page_range_single_batched()?

>  {
>  	const unsigned long end = address + size;
>  	struct mmu_notifier_range range;
> -	struct mmu_gather tlb;
> +
> +	VM_WARN_ON_ONCE(!tlb);

Maybe pedantic, but we probably want to ensure not only that tlb is set but
also has had tlb_gather_mmu() performed upon it. Then again probably a bit
much given this is a static function called only from
zap_page_range_single().

Hm actually you intend to un-static this shortly right? So I guess in that
case we do want some kind of check like this, perhaps abstracting this bit
of tlb_flush_mmu_tlbonly():

	if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
	      tlb->cleared_puds || tlb->cleared_p4ds))
		return;

Into a separate is_tlb_flushable() helper function or something. Then this
warning can become:

/* tlb should be initialised for a new gather operation. */
VM_WARN_ON_ONCE(!tlb || is_tlb_flushable(tlb));
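
Putting those pieces together, a minimal sketch of such a helper (based
only on the mmu_gather fields quoted above, untested) might be:

        static inline bool is_tlb_flushable(struct mmu_gather *tlb)
        {
                /* true if the gather already has something to flush */
                return tlb->freed_tables || tlb->cleared_ptes ||
                       tlb->cleared_pmds || tlb->cleared_puds ||
                       tlb->cleared_p4ds;
        }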

>
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
>  				address, end);
>  	hugetlb_zap_begin(vma, &range.start, &range.end);

Is it a problem that we invoke this function now _after_ tlb_gather_mmu()
has begun?

> -	tlb_gather_mmu(&tlb, vma->vm_mm);
>  	update_hiwater_rss(vma->vm_mm);
>  	mmu_notifier_invalidate_range_start(&range);
>  	/*
>  	 * unmap 'address-end' not 'range.start-range.end' as range
>  	 * could have been expanded for hugetlb pmd sharing.
>  	 */

Oh GOD I HATE THAT WE HANDLE HUGETLB THIS WAY!!!

Anything where you have to open-code concerns about how a specific use case
uses something like this is just asking to be broken.

Obviously this is nothing to do with your series and is just a grumble to
the sky, but still! *shakes fist at cloud*

> -	unmap_single_vma(&tlb, vma, address, end, details, false);
> +	unmap_single_vma(tlb, vma, address, end, details, false);
>  	mmu_notifier_invalidate_range_end(&range);
> +	if (is_vm_hugetlb_page(vma)) {
> +		/*
> +		 * flush tlb and free resources before hugetlb_zap_end(), to
> +		 * avoid concurrent page faults' allocation failure

Nit: add a full stop (or for those residing in North America 'period' :>)
at the end of this sentence. This is very bikesheddy I know, and can only
apologise.

> +		 */
> +		tlb_finish_mmu(tlb);
> +		hugetlb_zap_end(vma, details);
> +		tlb_gather_mmu(tlb, vma->vm_mm);
> +	}

OK, so as far as I can tell, after this change, we are still ensuring that
the tlb is flushed _prior_ to the invocation of hugetlb_zap_end(), only we,
in order to later export this function, need to do it here instead... I mean
this is horrid, but it's sort of unavoidable really.

So I guess this just undoes the optimisation in the case of hugetlb, which
probably doesn't really matter all that much at huge page size
anyway... and plus it's hugetlb so.

Yeah fine, just kind of horrid.

> +}
> +
> +/**
> + * zap_page_range_single - remove user pages in a given range
> + * @vma: vm_area_struct holding the applicable pages
> + * @address: starting address of pages to zap
> + * @size: number of bytes to zap
> + * @details: details of shared cache invalidation
> + *
> + * The range must fit into one VMA.
> + */
> +void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> +		unsigned long size, struct zap_details *details)
> +{
> +	struct mmu_gather tlb;
> +
> +	tlb_gather_mmu(&tlb, vma->vm_mm);
> +	notify_unmap_single_vma(&tlb, vma, address, size, details);
>  	tlb_finish_mmu(&tlb);
> -	hugetlb_zap_end(vma, details);
>  }
>
>  /**
> --
> 2.39.5



* Re: [PATCH v2 4/4] mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED]
  2025-04-04 21:07 ` [PATCH v2 4/4] mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED] SeongJae Park
@ 2025-04-08 13:36   ` Lorenzo Stoakes
  2025-04-08 20:16     ` SeongJae Park
  0 siblings, 1 reply; 14+ messages in thread
From: Lorenzo Stoakes @ 2025-04-08 13:36 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R.Howlett, David Hildenbrand, Rik van Riel,
	Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm

On Fri, Apr 04, 2025 at 02:07:00PM -0700, SeongJae Park wrote:
> Batch tlb flushes for MADV_DONTNEED[_LOCKED] for better efficiency, in a
> way that very similar to the tlb flushes batching for MADV_FREE.

This seems like a rather succinct commit message under the circumstances :) can
we put some meat on the bone?

Perhaps explain why one might want to do so, propagating some of your excellent
cover letter contents here, etc.

Also you're doing more than this, you're also exporting the (soon to be renamed,
ideally :) notify_unmap_single_vma() function, let's mention this here please,
and also mention why.

>
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
>  mm/internal.h | 3 +++
>  mm/madvise.c  | 9 ++++++---
>  mm/memory.c   | 4 ++--
>  3 files changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index e9695baa5922..be0c46837e22 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -435,6 +435,9 @@ void unmap_page_range(struct mmu_gather *tlb,
>  			     struct vm_area_struct *vma,
>  			     unsigned long addr, unsigned long end,
>  			     struct zap_details *details);
> +void notify_unmap_single_vma(struct mmu_gather *tlb,
> +		struct vm_area_struct *vma, unsigned long addr,
> +		unsigned long size, struct zap_details *details);

Yeah I know I said in 3/4 but I really hate this name. We need to change it... :)

>  int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
>  			   gfp_t gfp);
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 564095e381b2..c7ac32b4a371 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -851,7 +851,8 @@ static int madvise_free_single_vma(
>   * An interface that causes the system to free clean pages and flush
>   * dirty pages is already available as msync(MS_INVALIDATE).
>   */
> -static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> +static long madvise_dontneed_single_vma(struct madvise_behavior *behavior,

Again, let's go with madv_behavior for now please. Otherwise we have a weird
inconsistency that sometimes behavior = the int 'behavior' value and sometimes
it's a pointer to the helper struct.

> +					struct vm_area_struct *vma,
>  					unsigned long start, unsigned long end)
>  {
>  	struct zap_details details = {
> @@ -859,7 +860,7 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>  		.even_cows = true,
>  	};
>
> -	zap_page_range_single(vma, start, end - start, &details);
> +	notify_unmap_single_vma(behavior->tlb, vma, start, end - start, &details);
>  	return 0;
>  }
>
> @@ -949,7 +950,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>  	}
>
>  	if (action == MADV_DONTNEED || action == MADV_DONTNEED_LOCKED)
> -		return madvise_dontneed_single_vma(vma, start, end);
> +		return madvise_dontneed_single_vma(behavior, vma, start, end);
>  	else if (action == MADV_FREE)
>  		return madvise_free_single_vma(behavior, vma, start, end);
>  	else
> @@ -1627,6 +1628,8 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
>  static bool madvise_batch_tlb_flush(int behavior)
>  {
>  	switch (behavior) {
> +	case MADV_DONTNEED:
> +	case MADV_DONTNEED_LOCKED:
>  	case MADV_FREE:
>  		return true;
>  	default:
> diff --git a/mm/memory.c b/mm/memory.c
> index 8c9bbb1a008c..6a01b73aa111 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1989,7 +1989,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
>  	mmu_notifier_invalidate_range_end(&range);
>  }
>
> -/*
> +/**
>   * notify_unmap_single_vma - remove user pages in a given range
>   * @tlb: pointer to the caller's struct mmu_gather
>   * @vma: vm_area_struct holding the applicable pages
> @@ -2000,7 +2000,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
>   * @tlb shouldn't be NULL.  The range must fit into one VMA.  If @vma is for
>   * hugetlb, @tlb is flushed and re-initialized by this function.
>   */
> -static void notify_unmap_single_vma(struct mmu_gather *tlb,
> +void notify_unmap_single_vma(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, unsigned long address,
>  		unsigned long size, struct zap_details *details)
>  {
> --
> 2.39.5



* Re: [PATCH v2 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
  2025-04-04 21:06 [PATCH v2 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
                   ` (3 preceding siblings ...)
  2025-04-04 21:07 ` [PATCH v2 4/4] mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED] SeongJae Park
@ 2025-04-08 13:44 ` Lorenzo Stoakes
  2025-04-08 20:23   ` SeongJae Park
  4 siblings, 1 reply; 14+ messages in thread
From: Lorenzo Stoakes @ 2025-04-08 13:44 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R.Howlett, David Hildenbrand, Rik van Riel,
	Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm

On Fri, Apr 04, 2025 at 02:06:56PM -0700, SeongJae Park wrote:
> When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or
> MADV_FREE with multiple address ranges, tlb flushes happen for each of
> the given address ranges.  Because such tlb flushes are for same

Nit: for _the_ same.

> process, doing those in a batch is more efficient while still being
> safe.  Modify process_madvise() entry level code path to do such batched
> tlb flushes, while the internal unmap logics do only gathering of the
> tlb entries to flush.
>
> In more detail, modify the entry functions to initialize an mmu_gather
> ojbect and pass it to the internal logics.  And make the internal logics

Typo: ojbect -> object, logics -> logic.

> do only gathering of the tlb entries to flush into the received
> mmu_gather object.  After all internal function calls are done, the
> entry functions flush the gathered tlb entries at once.
>
> Because process_madvise() and madvise() share the internal unmap logic,
> make same change to madvise() entry code together, to make code
> consistent and cleaner.  It is only for keeping the code clean, and
> shouldn't degrade madvise().  It could rather provide a potential tlb
> flushes reduction benefit for a case that there are multiple vmas for
> the given address range.  It is only a side effect from an effort to
> keep code clean, so we don't measure it separately.
>
> Similar optimizations might be applicable to other madvise behaviros

Typo: behaviros -> behavior (or 'behaviors', but since behavior is already plural
probably unnecessary).

> such as MADV_COLD and MADV_PAGEOUT.  Those are simply out of the scope
> of this patch series, though.

Well well, for now :)

>
> Patches Seuquence

Typo: Seuquence -> Sequence.

> =================
>
> First patch defines new data structure for managing information that

Nit: _The_ first patch. _a_ new data structure. that -> 'that is'.

> required for batched tlb flushes (mmu_gather and behavior), and update
> code paths for MADV_DONTNEED[_LOCKED] and MADV_FREE handling internal
> logics to receive it.

Typo: logics -> logic.

>
> Second patch batches tlb flushes for MADV_FREE handling for both

Nit: _The_ second patch.

> madvise() and process_madvise().
>
> Remaining two patches are for MADV_DONTNEED[_LOCKED] tlb flushes
> batching.  Third patch splits zap_page_range_single() for batching of

Nit: Third patch -> _The_ third patch.

> MADV_DONTNEED[_LOCKED] handling.  The final and fourth patch batches tlb

Nit: 'patch batches' -> 'patches batch'.

> flushes for the hint using the sub-logic that the third patch split out,
> and the helpers for batched tlb flushes that intorduced for MADV_FREE

Typo: intorduced -> introduced.

Nit: 'for MADV_FREE' -> 'for the MADV_FREE'.

> case, by the second patch.
>
> Test Results
> ============
>
> I measured the latency to apply MADV_DONTNEED advice to 256 MiB memory
> using multiple process_madvise() calls.  I apply the advice in 4 KiB
> sized regions granularity, but with varying batch size per
> process_madvise() call (vlen) from 1 to 1024.  The source code for the
> measurement is available at GitHub[1].  To reduce measurement errors, I
> did the measurement five times.

Be interesting to see how this behaves with mTHP sizing too! But probably a bit
out of scope perhaps.

>
> The measurement results are as below.  'sz_batch' column shows the batch
> size of process_madvise() calls.  'Before' and 'After' columns show the
> average of latencies in nanoseconds that measured five times on kernels
> that built without and with the tlb flushes batching of this series
> (patches 3 and 4), respectively.  For the baseline, mm-new tree of
> 2025-04-04[2] has been used.  'B-stdev' and 'A-stdev' columns show
> ratios of latency measurements standard deviation to average in percent
> for 'Before' and 'After', respectively.  'Latency_reduction' shows the
> reduction of the latency that the 'After' has achieved compared to
> 'Before', in percent.  Higher 'Latency_reduction' values mean more
> efficiency improvements.
>
>     sz_batch   Before        B-stdev   After         A-stdev   Latency_reduction
>     1          110948138.2   5.55      109476402.8   4.28      1.33
>     2          75678535.6    1.67      70470722.2    3.55      6.88
>     4          59530647.6    4.77      51735606.6    3.44      13.09
>     8          50013051.6    4.39      44377029.8    5.20      11.27
>     16         48657878.2    9.32      37291600.4    3.39      23.36
>     32         43614180.2    6.06      34127428      3.75      21.75
>     64         42466694.2    5.70      26737935.2    2.54      37.04
>     128        42977970      6.99      25050444.2    4.06      41.71
>     256        41549546      1.88      24644375.8    3.77      40.69
>     512        42162698.6    6.17      24068224.8    2.87      42.92
>     1024       40978574      5.44      23872024.2    3.65      41.75

Very nice! Great work.

>
> As expected, tlb flushes batching provides latency reduction that
> proportional to the batch size.  The efficiency gain ranges from about
> 6.88 percent with batch size 2, to about 40 percent with batch size 128.
>
> Please note that this is a very simple microbenchmark, so real
> efficiency gain on real workload could be very different.

Indeed, accepted, but it makes a great deal of sense to batch these operations,
especially when we get to the point of actually increasing the process_madvise()
iov size.

>
> Chagelong
> =========
>
> Changes from v1
> (https://lore.kernel.org/20250310172318.653630-1-sj@kernel.org)
> - Split code cleanup part out
> - Keep the order between tlb flushes and hugetlb_zap_end()
> - Put mm/memory change just before its real usage
> - Add VM_WARN_ON_ONCE() for invlaid tlb argument to unmap_vma_single()
> - Cleanups following nice reviewers suggestions
>
> Changes from RFC
> (https://lore.kernel.org/20250305181611.54484-1-sj@kernel.org)
> - Clarify motivation of the change on the cover letter
> - Add average and stdev of evaluation results
> - Show latency reduction on evaluation results
> - Fix !CONFIG_MEMORY_FAILURE build error
> - Rename is_memory_populate() to is_madvise_populate()
> - Squash patches 5-8
> - Add kerneldoc for unmap_vm_area_struct()
> - Squash patches 10 and 11
> - Squash patches 12-14
> - Squash patches 15 and 16
>
> References
> ==========
>
> [1] https://github.com/sjp38/eval_proc_madvise
> [2] commit edd67244fe67 ("mm/show_mem: optimize si_meminfo_node by reducing redundant code") # mm-new
>
> SeongJae Park (4):
>   mm/madvise: define and use madvise_behavior struct for
>     madvise_do_behavior()
>   mm/madvise: batch tlb flushes for MADV_FREE
>   mm/memory: split non-tlb flushing part from zap_page_range_single()
>   mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED]
>
>  mm/internal.h |   3 ++
>  mm/madvise.c  | 110 ++++++++++++++++++++++++++++++++++++--------------
>  mm/memory.c   |  47 ++++++++++++++++-----
>  3 files changed, 121 insertions(+), 39 deletions(-)
>
>
> base-commit: 85b87628fae973dedae95f2ea2782b7df4c11322
> --
> 2.39.5



* Re: [PATCH v2 2/4] mm/madvise: batch tlb flushes for MADV_FREE
  2025-04-08 12:58   ` Lorenzo Stoakes
@ 2025-04-08 18:49     ` SeongJae Park
  0 siblings, 0 replies; 14+ messages in thread
From: SeongJae Park @ 2025-04-08 18:49 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: SeongJae Park, Andrew Morton, Liam R.Howlett, David Hildenbrand,
	Rik van Riel, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

On Tue, 8 Apr 2025 13:58:18 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Fri, Apr 04, 2025 at 02:06:58PM -0700, SeongJae Park wrote:
> > MADV_FREE handling for [process_]madvise() flushes tlb for each vma of
> > each address range.  Update the logic to do tlb flushes in a batched
> > way.  Initialize an mmu_gather object from do_madvise() and
> > vector_madvise(), which are the entry level functions for
> > [process_]madvise(), respectively.  And pass those objects to the
> > function for per-vma work, via madvise_behavior struct.  Make the
> > per-vma logic not flushes tlb on their own but just saves the tlb
> > entries to the received mmu_gather object.  Finally, the entry level
> > functions flush the tlb entries that gathered for the entire user
> > request, at once.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> 
> Other than some nitty stuff, and a desire for some careful testing of the
> horrid edge case that err... I introduced :P this looks fine, so:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thank you for your kind review!  I will make the next revision following your
suggestions as I answered below.

> 
> > ---
> >  mm/madvise.c | 59 +++++++++++++++++++++++++++++++++++++++++-----------
> >  1 file changed, 47 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 8bcfdd995d18..564095e381b2 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -799,12 +799,13 @@ static const struct mm_walk_ops madvise_free_walk_ops = {
> >  	.walk_lock		= PGWALK_RDLOCK,
> >  };
> >
> > -static int madvise_free_single_vma(struct vm_area_struct *vma,
> > -			unsigned long start_addr, unsigned long end_addr)
> > +static int madvise_free_single_vma(
> > +		struct madvise_behavior *behavior, struct vm_area_struct *vma,
> 
> This is pedantic, but elsewhere you differentiate between int behavior and
> struct madvise_behavior by referring to the latter as madv_behavior.
> 
> The naming kind of sucks in general though.
> 
> But for consistency, let's maybe rename this to madv_behavior, and we can
> maybe do a commit later to do a rename across the board?

I completely agree.  I will rename it so in the next spin.
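
For the record, the struct that the pointer refers to is roughly the below
after this patch; only the ->tlb member matters for this function (a
sketch, not the literal definition from patch 1):

struct madvise_behavior {
	int behavior;		/* the MADV_* value */
	struct mmu_gather *tlb;	/* gather state shared across vmas */
};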

> 
> > +		unsigned long start_addr, unsigned long end_addr)
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	struct mmu_notifier_range range;
> > -	struct mmu_gather tlb;
> > +	struct mmu_gather *tlb = behavior->tlb;
> >
> >  	/* MADV_FREE works for only anon vma at the moment */
> >  	if (!vma_is_anonymous(vma))
[...]
> > @@ -953,7 +951,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> >  	if (action == MADV_DONTNEED || action == MADV_DONTNEED_LOCKED)
> >  		return madvise_dontneed_single_vma(vma, start, end);
> >  	else if (action == MADV_FREE)
> > -		return madvise_free_single_vma(vma, start, end);
> > +		return madvise_free_single_vma(behavior, vma, start, end);
> >  	else
> >  		return -EINVAL;
> 
> On error paths, do we correctly finish the batched (botched? :P) TLB
> operation?

Yes, the change calls tlb_finish_mmu() and tlb_gather_mmu() as needed in the
error paths.  Of course I might have forgotten to call those in some edge cases.
Please let me know if you find such mistakes.
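
To be specific, the intended entry-level pattern is roughly the below, so
any path that returns after madvise_init_tlb() also goes through
madvise_finish_tlb() (an illustrative sketch, not the literal diff; error
handling trimmed):

	struct madvise_behavior madv_behavior = { .behavior = behavior };

	madvise_lock(mm, behavior);
	madvise_init_tlb(&madv_behavior, mm);	/* tlb_gather_mmu() */
	error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
	madvise_finish_tlb(&madv_behavior);	/* tlb_finish_mmu(): one flush per request */
	madvise_unlock(mm, behavior);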

> 
> >  }
[...]
> > @@ -1841,14 +1873,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> >  			}
> >
> >  			/* Drop and reacquire lock to unwind race. */
> > +			madvise_finish_tlb(&madv_behavior);
> >  			madvise_unlock(mm, behavior);
> >  			madvise_lock(mm, behavior);
> > +			madvise_init_tlb(&madv_behavior, mm);
> >  			continue;
> 
> Have you found a way in which to test this? Perhaps force this case and
> find a means of asserting the TLB flushing behaves as expected? I think
> we're ok from the logic, but it's such a tricky one it'd be good to find a
> means of doing so, albeit in a manual way.

No, unfortunately I haven't found a good way to test this case.

> 
> >  		}
> >  		if (ret < 0)
> >  			break;
> >  		iov_iter_advance(iter, iter_iov_len(iter));
> >  	}
> > +	madvise_finish_tlb(&madv_behavior);
> >  	madvise_unlock(mm, behavior);
> >
> >  	ret = (total_len - iov_iter_count(iter)) ? : ret;
> > --
> > 2.39.5


Thanks,
SJ


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/4] mm/memory: split non-tlb flushing part from zap_page_range_single()
  2025-04-08 13:29   ` Lorenzo Stoakes
@ 2025-04-08 20:12     ` SeongJae Park
  2025-04-09 10:37       ` Lorenzo Stoakes
  0 siblings, 1 reply; 14+ messages in thread
From: SeongJae Park @ 2025-04-08 20:12 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: SeongJae Park, Andrew Morton, Liam R.Howlett, David Hildenbrand,
	Rik van Riel, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

On Tue, 8 Apr 2025 14:29:41 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Fri, Apr 04, 2025 at 02:06:59PM -0700, SeongJae Park wrote:
> > Some of zap_page_range_single() callers such as [process_]madvise() with
> > MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because
> > zap_page_range_single() flushes tlb for each invocation.  Split out the
> > body of zap_page_range_single() except mmu_gather object initialization
> > and gathered tlb entries flushing for such batched tlb flushing usage.
> >
> > To avoid hugetlb pages allocation failures from concurrent page faults,
> > the tlb flush should be done before hugetlb faults unlocking, though.
> > Do the flush and the unlock inside the split out function in the order
> > for hugetlb vma case.  Refer to commit 2820b0f09be9 ("hugetlbfs: close
> > race between MADV_DONTNEED and page fault") for more details about the
> > concurrent faults' page allocation failure problem.
> 
> Good lord, I really hate how we do hugetlb.
> 
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> > ---
> >  mm/memory.c | 49 +++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 39 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8669b2c981a5..8c9bbb1a008c 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1989,36 +1989,65 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> >  	mmu_notifier_invalidate_range_end(&range);
> >  }
> >
> > -/**
> > - * zap_page_range_single - remove user pages in a given range
> > +/*
> > + * notify_unmap_single_vma - remove user pages in a given range
> > + * @tlb: pointer to the caller's struct mmu_gather
> >   * @vma: vm_area_struct holding the applicable pages
> > - * @address: starting address of pages to zap
> > - * @size: number of bytes to zap
> > + * @address: starting address of pages to remove
> > + * @size: number of bytes to remove
> >   * @details: details of shared cache invalidation
> >   *
> > - * The range must fit into one VMA.
> > + * @tlb shouldn't be NULL.  The range must fit into one VMA.  If @vma is for
> > + * hugetlb, @tlb is flushed and re-initialized by this function.
> >   */
> > -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > +static void notify_unmap_single_vma(struct mmu_gather *tlb,
> > +		struct vm_area_struct *vma, unsigned long address,
> >  		unsigned long size, struct zap_details *details)
> 
> Don't love this name. It seems to imply that the primary thing here is the MMU
> notification bit.
> 
> This is like unmap_vmas() but for a single VMA, that is - contains the
> logic unmap_vmas() does for mmu notifier stuff and hugetlb stuff (vom in
> mouth), deferring heavy lifting to unmap_single_vma().
> 
> I think it might be better to just go with the brainless
> '__zap_page_range_single()' here honestly. Then we at least reduce the mess
> of confusion in this function naming here.
> 
> Of course you intend to un-static this shortly... so maybe that's not so
> great.
> 
> zap_page_range_single_batched()?

Sounds good to me.  I will rename the function so, unless we get other
opinions.

> 
> >  {
> >  	const unsigned long end = address + size;
> >  	struct mmu_notifier_range range;
> > -	struct mmu_gather tlb;
> > +
> > +	VM_WARN_ON_ONCE(!tlb);
> 
> Maybe pedantic, but we probably want to ensure not only that tlb is set but
> also has had tlb_gather_mmu() performed upon it. Then again probably a bit
> much given this is a static function called only from
> zap_page_range_single().
> 
> Hm actually you intend to un-static this shortly right? So I guess in that
> case we do want some kind of check like this, perhaps abstracting this bit
> of tlb_flush_mmu_tlbonly():
> 
> 	if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
> 	      tlb->cleared_puds || tlb->cleared_p4ds))
> 		return;

This code is to see if there are tlb entries to flush.

In this function, valid 'tlb' can either have such entries or not.  So this
code could be a good reference but couldn't be used as is.  I think most fields
of mmu_gather can be arbitrarily updated depending on use cases, so making a
perfect but simple check wouldn't be easy.

I think tlb->mm shouldn't be changed after the initialization, though.  So
(tlb->mm == vma->vm_mm) could be a good enough additional check.

> 
> Into a separate is_tlb_flushable() helper function or something. Then this
> warning can become:
> 
> /* tlb should be initialised for a new gather operation. */
> VM_WARN_ON_ONCE(!tlb || is_tlb_flushable(tlb));

If you agree (tlb->mm == vma->vm_mm) is sufficient check, this could be

VM_WARN_ON_ONCE(!tlb || tlb->mm != vma->vm_mm)
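
That is, the top of the renamed function would become something like the
below (untested sketch):

static void zap_page_range_single_batched(struct mmu_gather *tlb,
		struct vm_area_struct *vma, unsigned long address,
		unsigned long size, struct zap_details *details)
{
	/* a valid @tlb may or may not hold entries, but its mm never changes */
	VM_WARN_ON_ONCE(!tlb || tlb->mm != vma->vm_mm);

	/* ... rest as in notify_unmap_single_vma() above ... */
}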

> 
> >
> >  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
> >  				address, end);
> >  	hugetlb_zap_begin(vma, &range.start, &range.end);
> 
> Is it a problem that we invoke this function now _after_ tlb_gather_mmu()
> has begun?

I think it is not a problem since I find no requirement on the ordering.
Please let me know if I'm missing something.

> 
> > -	tlb_gather_mmu(&tlb, vma->vm_mm);
> >  	update_hiwater_rss(vma->vm_mm);
> >  	mmu_notifier_invalidate_range_start(&range);
> >  	/*
> >  	 * unmap 'address-end' not 'range.start-range.end' as range
> >  	 * could have been expanded for hugetlb pmd sharing.
> >  	 */
> 
> Oh GOD I HATE THAT WE HANDLE HUGETLB THIS WAY!!!
> 
> Anything where you have to open-code concerns about how a specific use case
> uses something like this is just asking to be broken.
> 
> Obviously this is nothing to do with your series and is just a grumble to
> the sky, but still! *shakes fist at cloud*
> 
> > -	unmap_single_vma(&tlb, vma, address, end, details, false);
> > +	unmap_single_vma(tlb, vma, address, end, details, false);
> >  	mmu_notifier_invalidate_range_end(&range);
> > +	if (is_vm_hugetlb_page(vma)) {
> > +		/*
> > +		 * flush tlb and free resources before hugetlb_zap_end(), to
> > +		 * avoid concurrent page faults' allocation failure
> 
> Nit: add a full stop (or for those residing in North America 'period' :>)
> at the end of this sentence. This is very bikesheddy I know, and can only
> apologise.

Thank you for kindly suggesting this, I will update as you recommended in the
next spin.

> 
> > +		 */
> > +		tlb_finish_mmu(tlb);
> > +		hugetlb_zap_end(vma, details);
> > +		tlb_gather_mmu(tlb, vma->vm_mm);
> > +	}
> 
> OK, so as far as I can tell, after this change, we are still ensuring that
> the tlb is flushed _prior_ to the invocation of hugetlb_zap_end(), only we,
> in order to later export this function, need do it here instead... I mean
> this is horrid, but it's sort of unavoidable really.
> 
> So I guess this just undoes the optimisation in the case of hugetlb, which
> probably doesn't really matter all that much at huge page size
> anyway... and plus it's hugetlb so.

Yes, that's the intended behavior.

> 
> Yeah fine, just kind of horrid.

I agree this is not clean, and thank you for understanding the situation.
Hopefully we will find a chance to revisit for cleanup later.

> 
> > +}
> > +
> > +/**
> > + * zap_page_range_single - remove user pages in a given range
> > + * @vma: vm_area_struct holding the applicable pages
> > + * @address: starting address of pages to zap
> > + * @size: number of bytes to zap
> > + * @details: details of shared cache invalidation
> > + *
> > + * The range must fit into one VMA.
> > + */
> > +void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > +		unsigned long size, struct zap_details *details)
> > +{
> > +	struct mmu_gather tlb;
> > +
> > +	tlb_gather_mmu(&tlb, vma->vm_mm);
> > +	notify_unmap_single_vma(&tlb, vma, address, size, details);
> >  	tlb_finish_mmu(&tlb);
> > -	hugetlb_zap_end(vma, details);
> >  }
> >
> >  /**
> > --
> > 2.39.5

Thank you for taking the time to review this.  If you have no other
opinions, I will make the next spin as mentioned above.


Thanks,
SJ


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 4/4] mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED]
  2025-04-08 13:36   ` Lorenzo Stoakes
@ 2025-04-08 20:16     ` SeongJae Park
  0 siblings, 0 replies; 14+ messages in thread
From: SeongJae Park @ 2025-04-08 20:16 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: SeongJae Park, Andrew Morton, Liam R.Howlett, David Hildenbrand,
	Rik van Riel, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

On Tue, 8 Apr 2025 14:36:18 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Fri, Apr 04, 2025 at 02:07:00PM -0700, SeongJae Park wrote:
> > Batch tlb flushes for MADV_DONTNEED[_LOCKED] for better efficiency, in a
> > way that very similar to the tlb flushes batching for MADV_FREE.
> 
> This seems like a rather succinct commit message under the circumstances :) can
> we put some meat on the bone?
> 
> Perhaps explain why one might want to do so, propagating some of your excellent
> cover letter contents here, etc.
> 
> Also you're doing more than this, you're also exporting the (soon to be renamed,
> ideally :) notify_unmap_single_vma() function, let's mention this here please,
> and also mention why.

Good points, thank you.  I will update the commit message in the next spin
following your suggestions.

> 
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> > ---
> >  mm/internal.h | 3 +++
> >  mm/madvise.c  | 9 ++++++---
> >  mm/memory.c   | 4 ++--
> >  3 files changed, 11 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index e9695baa5922..be0c46837e22 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -435,6 +435,9 @@ void unmap_page_range(struct mmu_gather *tlb,
> >  			     struct vm_area_struct *vma,
> >  			     unsigned long addr, unsigned long end,
> >  			     struct zap_details *details);
> > +void notify_unmap_single_vma(struct mmu_gather *tlb,
> > +		struct vm_area_struct *vma, unsigned long addr,
> > +		unsigned long size, struct zap_details *details);
> 
> Yeah I know I said in 3/4 but I really hate this name. We need to change it... :)

Yes, I will change the name in the next revision.

> 
> >  int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
> >  			   gfp_t gfp);
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 564095e381b2..c7ac32b4a371 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -851,7 +851,8 @@ static int madvise_free_single_vma(
> >   * An interface that causes the system to free clean pages and flush
> >   * dirty pages is already available as msync(MS_INVALIDATE).
> >   */
> > -static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> > +static long madvise_dontneed_single_vma(struct madvise_behavior *behavior,
> 
> Again, let's go with madv_behavior for now please. Otherwise we have a weird
> inconsistency that sometimes behavior = the int 'behavior' value and sometimes
> it's a pointer to the helper struct.

Yes, I will do so.
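
Just to make explicit what madv_behavior carries down here, the per-vma
path after this patch is essentially the below (a sketch, using the rename
agreed for patch 3):

static long madvise_dontneed_single_vma(struct madvise_behavior *madv_behavior,
					struct vm_area_struct *vma,
					unsigned long start, unsigned long end)
{
	struct zap_details details = { /* as before */ };

	/* gather into the caller's mmu_gather instead of flushing here */
	zap_page_range_single_batched(madv_behavior->tlb, vma, start,
			end - start, &details);
	return 0;
}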

> 
> > +					struct vm_area_struct *vma,
> >  					unsigned long start, unsigned long end)
> >  {
> >  	struct zap_details details = {


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
  2025-04-08 13:44 ` [PATCH v2 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE Lorenzo Stoakes
@ 2025-04-08 20:23   ` SeongJae Park
  0 siblings, 0 replies; 14+ messages in thread
From: SeongJae Park @ 2025-04-08 20:23 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: SeongJae Park, Andrew Morton, Liam R.Howlett, David Hildenbrand,
	Rik van Riel, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

On Tue, 8 Apr 2025 14:44:40 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Fri, Apr 04, 2025 at 02:06:56PM -0700, SeongJae Park wrote:
> > When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or
> > MADV_FREE with multiple address ranges, tlb flushes happen for each of
> > the given address ranges.  Because such tlb flushes are for same
> 
> Nit: for _the_ same.

Thank you for kindly finding and suggesting fixes for these mistakes.  I will
update following your suggestions here and below.

[...]
> > Similar optimizations might be applicable to other madvise behaviros
> 
> Typo: behaviros -> behavior (or 'behaviors', but since behavior is already plural
> probably unnecessary).
> 
> > such as MADV_COLD and MADV_PAGEOUT.  Those are simply out of the scope
> > of this patch series, though.
> 
> Well well, for now :)

Yes.  Hopefully we will have another chance to further improve the cases.

[...]
> > Test Results
> > ============
> >
> > I measured the latency to apply MADV_DONTNEED advice to 256 MiB memory
> > using multiple process_madvise() calls.  I apply the advice in 4 KiB
> > sized regions granularity, but with varying batch size per
> > process_madvise() call (vlen) from 1 to 1024.  The source code for the
> > measurement is available at GitHub[1].  To reduce measurement errors, I
> > did the measurement five times.
> 
> Be interesting to see how this behaves with mTHP sizing too! But probably a bit
> out of scope perhaps.

Obviously we have much more room to explore and have fun :)

> 
> >
> > The measurement results are as below.  'sz_batch' column shows the batch
> > size of process_madvise() calls.  'Before' and 'After' columns show the
> > average of latencies in nanoseconds that measured five times on kernels
> > that built without and with the tlb flushes batching of this series
> > (patches 3 and 4), respectively.  For the baseline, mm-new tree of
> > 2025-04-04[2] has been used.  'B-stdev' and 'A-stdev' columns show
> > ratios of latency measurements standard deviation to average in percent
> > for 'Before' and 'After', respectively.  'Latency_reduction' shows the
> > reduction of the latency that the 'After' has achieved compared to
> > 'Before', in percent.  Higher 'Latency_reduction' values mean more
> > efficiency improvements.
> >
> >     sz_batch   Before        B-stdev   After         A-stdev   Latency_reduction
> >     1          110948138.2   5.55      109476402.8   4.28      1.33
> >     2          75678535.6    1.67      70470722.2    3.55      6.88
> >     4          59530647.6    4.77      51735606.6    3.44      13.09
> >     8          50013051.6    4.39      44377029.8    5.20      11.27
> >     16         48657878.2    9.32      37291600.4    3.39      23.36
> >     32         43614180.2    6.06      34127428      3.75      21.75
> >     64         42466694.2    5.70      26737935.2    2.54      37.04
> >     128        42977970      6.99      25050444.2    4.06      41.71
> >     256        41549546      1.88      24644375.8    3.77      40.69
> >     512        42162698.6    6.17      24068224.8    2.87      42.92
> >     1024       40978574      5.44      23872024.2    3.65      41.75
> 
> Very nice! Great work.
> 
> >
> > As expected, tlb flushes batching provides latency reduction that is
> > proportional to the batch size.  The efficiency gain ranges from about
> > 6.88 percent with batch size 2, to about 40 percent with batch size 128.
> >
> > Please note that this is a very simple microbenchmark, so real
> > efficiency gain on real workload could be very different.
> 
> Indeed, accepted, but it makes a great deal of sense to batch these operations,
> especially when we get to the point of actually increasing the process_madvise()
> iov size.

Cannot agree more.

Thank you for your kind review with great suggestions for this patchset.  I
will post the next spin with the suggested changes soon.


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/4] mm/memory: split non-tlb flushing part from zap_page_range_single()
  2025-04-08 20:12     ` SeongJae Park
@ 2025-04-09 10:37       ` Lorenzo Stoakes
  0 siblings, 0 replies; 14+ messages in thread
From: Lorenzo Stoakes @ 2025-04-09 10:37 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R.Howlett, David Hildenbrand, Rik van Riel,
	Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm

On Tue, Apr 08, 2025 at 01:12:48PM -0700, SeongJae Park wrote:
> On Tue, 8 Apr 2025 14:29:41 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > On Fri, Apr 04, 2025 at 02:06:59PM -0700, SeongJae Park wrote:
> > > Some of zap_page_range_single() callers such as [process_]madvise() with
> > > MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because
> > > zap_page_range_single() flushes tlb for each invocation.  Split out the
> > > body of zap_page_range_single() except mmu_gather object initialization
> > > and gathered tlb entries flushing for such batched tlb flushing usage.
> > >
> > > To avoid hugetlb pages allocation failures from concurrent page faults,
> > > the tlb flush should be done before hugetlb faults unlocking, though.
> > > Do the flush and the unlock inside the split out function in the order
> > > for hugetlb vma case.  Refer to commit 2820b0f09be9 ("hugetlbfs: close
> > > race between MADV_DONTNEED and page fault") for more details about the
> > > concurrent faults' page allocation failure problem.
> >
> > Good lord, I really hate how we do hugetlb.
> >
> > >
> > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > > ---
> > >  mm/memory.c | 49 +++++++++++++++++++++++++++++++++++++++----------
> > >  1 file changed, 39 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 8669b2c981a5..8c9bbb1a008c 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1989,36 +1989,65 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> > >  	mmu_notifier_invalidate_range_end(&range);
> > >  }
> > >
> > > -/**
> > > - * zap_page_range_single - remove user pages in a given range
> > > +/*
> > > + * notify_unmap_single_vma - remove user pages in a given range
> > > + * @tlb: pointer to the caller's struct mmu_gather
> > >   * @vma: vm_area_struct holding the applicable pages
> > > - * @address: starting address of pages to zap
> > > - * @size: number of bytes to zap
> > > + * @address: starting address of pages to remove
> > > + * @size: number of bytes to remove
> > >   * @details: details of shared cache invalidation
> > >   *
> > > - * The range must fit into one VMA.
> > > + * @tlb shouldn't be NULL.  The range must fit into one VMA.  If @vma is for
> > > + * hugetlb, @tlb is flushed and re-initialized by this function.
> > >   */
> > > -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > > +static void notify_unmap_single_vma(struct mmu_gather *tlb,
> > > +		struct vm_area_struct *vma, unsigned long address,
> > >  		unsigned long size, struct zap_details *details)
> >
> > Don't love this name. It seems to imply that the primary thing here is the MMU
> > notification bit.
> >
> > This is like unmap_vmas() but for a single VMA, that is - contains the
> > logic unmap_vmas() does for mmu notifier stuff and hugetlb stuff (vom in
> > mouth), deferring heavy lifting to unmap_single_vma().
> >
> > I think it might be better to just go with the brainless
> > '__zap_page_range_single()' here honestly. Then we at least reduce the mess
> > of confusion in this function naming here.
> >
> > Of course you intend to un-static this shortly... so maybe that's not so
> > great.
> >
> > zap_page_range_single_batched()?
>
> Sounds good to me.  I will rename the function so, unless we get other
> opinions.
>
> >
> > >  {
> > >  	const unsigned long end = address + size;
> > >  	struct mmu_notifier_range range;
> > > -	struct mmu_gather tlb;
> > > +
> > > +	VM_WARN_ON_ONCE(!tlb);
> >
> > Maybe pedantic, but we probably want to ensure not only that tlb is set but
> > also has had tlb_gather_mmu() performed upon it. Then again probably a bit
> > much given this is a static function called only from
> > zap_page_range_single().
> >
> > Hm actually you intend to un-static this shortly right? So I guess in that
> > case we do want some kind of check like this, perhaps abstracting this bit
> > of tlb_flush_mmu_tlbonly():
> >
> > 	if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
> > 	      tlb->cleared_puds || tlb->cleared_p4ds))
> > 		return;
>
> This code is to see if there are tlb entries to flush.
>
> In this function, valid 'tlb' can either have such entries or not.  So this
> code could be a good reference but couldn't be used as is.  I think most fields
> of mmu_gather can be arbitrarily updated depending on use cases, so making a
> perfect but simple check wouldn't be easy.
>
> I think tlb->mm shouldn't be changed after the initialization, though.  So
> (tlb->mm == vma->vm_mm) could be a good enough additional check.

Yeah this is not _that_ much of a big deal, just maybe a nice-to-have, but
appreciate that it's sort of blurry.

Don't feel obliged on the separated out check, mm == vma->vm_mm suffices
(though again that is also not a big deal! :)

>
> >
> > Into a separate is_tlb_flushable() helper function or something. Then this
> > warning can become:
> >
> > /* tlb should be initialised for a new gather operation. */
> > VM_WARN_ON_ONCE(!tlb || is_tlb_flushable(tlb));
>
> If you agree (tlb->mm == vma->vm_mm) is sufficient check, this could be
>
> VM_WARN_ON_ONCE(!tlb || tlb->mm != vma->vm_mm)

Sure, that's fine.

>
> >
> > >
> > >  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
> > >  				address, end);
> > >  	hugetlb_zap_begin(vma, &range.start, &range.end);
> >
> > Is it a problem that we invoke this function now _after_ tlb_gather_mmu()
> > has begun?
>
> I think it is not a problem since I find no requirement on the ordering.
> Please let me know if I'm missing something.
>
> >
> > > -	tlb_gather_mmu(&tlb, vma->vm_mm);
> > >  	update_hiwater_rss(vma->vm_mm);
> > >  	mmu_notifier_invalidate_range_start(&range);
> > >  	/*
> > >  	 * unmap 'address-end' not 'range.start-range.end' as range
> > >  	 * could have been expanded for hugetlb pmd sharing.
> > >  	 */
> >
> > Oh GOD I HATE THAT WE HANDLE HUGETLB THIS WAY!!!
> >
> > Anything where you have to open-code concerns about how a specific use case
> > uses something like this is just asking to be broken.
> >
> > Obviously this is nothing to do with your series and is just a grumble to
> > the sky, but still! *shakes fist at cloud*
> >
> > > -	unmap_single_vma(&tlb, vma, address, end, details, false);
> > > +	unmap_single_vma(tlb, vma, address, end, details, false);
> > >  	mmu_notifier_invalidate_range_end(&range);
> > > +	if (is_vm_hugetlb_page(vma)) {
> > > +		/*
> > > +		 * flush tlb and free resources before hugetlb_zap_end(), to
> > > +		 * avoid concurrent page faults' allocation failure
> >
> > Nit: add a full stop (or for those residing in North America 'period' :>)
> > at the end of this sentence. This is very bikesheddy I know, and can only
> > apologise.
>
> Thank you for kindly suggesting this, I will update as you recommended in the
> next spin.
>
> >
> > > +		 */
> > > +		tlb_finish_mmu(tlb);
> > > +		hugetlb_zap_end(vma, details);
> > > +		tlb_gather_mmu(tlb, vma->vm_mm);
> > > +	}
> >
> > OK, so as far as I can tell, after this change, we are still ensuring that
> > the tlb is flushed _prior_ to the invocation of hugetlb_zap_end(), only we,
> > in order to later export this function, need do it here instead... I mean
> > this is horrid, but it's sort of unavoidable really.
> >
> > So I guess this just undoes the optimisation in the case of hugetlb, which
> > probably doesn't really matter all that much at huge page size
> > anyway... and plus it's hugetlb so.
>
> Yes, that's the intended behavior.
>
> >
> > Yeah fine, just kind of horrid.
>
> I agree this is not clean, and thank you for understanding the situation.
> Hopefully we will find a chance to revisit for cleanup later.

Sure!

>
> >
> > > +}
> > > +
> > > +/**
> > > + * zap_page_range_single - remove user pages in a given range
> > > + * @vma: vm_area_struct holding the applicable pages
> > > + * @address: starting address of pages to zap
> > > + * @size: number of bytes to zap
> > > + * @details: details of shared cache invalidation
> > > + *
> > > + * The range must fit into one VMA.
> > > + */
> > > +void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > > +		unsigned long size, struct zap_details *details)
> > > +{
> > > +	struct mmu_gather tlb;
> > > +
> > > +	tlb_gather_mmu(&tlb, vma->vm_mm);
> > > +	notify_unmap_single_vma(&tlb, vma, address, size, details);
> > >  	tlb_finish_mmu(&tlb);
> > > -	hugetlb_zap_end(vma, details);
> > >  }
> > >
> > >  /**
> > > --
> > > 2.39.5
>
> Thank you for taking the time to review this.  If you have no other
> opinions, I will make the next spin as mentioned above.

No problems, yes all fine for respin!

>
>
> Thanks,
> SJ


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-04-09 10:38 UTC | newest]

Thread overview: 14+ messages
2025-04-04 21:06 [PATCH v2 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
2025-04-04 21:06 ` [PATCH v2 1/4] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() SeongJae Park
2025-04-04 21:06 ` [PATCH v2 2/4] mm/madvise: batch tlb flushes for MADV_FREE SeongJae Park
2025-04-08 12:58   ` Lorenzo Stoakes
2025-04-08 18:49     ` SeongJae Park
2025-04-04 21:06 ` [PATCH v2 3/4] mm/memory: split non-tlb flushing part from zap_page_range_single() SeongJae Park
2025-04-08 13:29   ` Lorenzo Stoakes
2025-04-08 20:12     ` SeongJae Park
2025-04-09 10:37       ` Lorenzo Stoakes
2025-04-04 21:07 ` [PATCH v2 4/4] mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED] SeongJae Park
2025-04-08 13:36   ` Lorenzo Stoakes
2025-04-08 20:16     ` SeongJae Park
2025-04-08 13:44 ` [PATCH v2 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE Lorenzo Stoakes
2025-04-08 20:23   ` SeongJae Park
