* Re: [PATCH v2 2/4] mm/madvise: batch tlb flushes for MADV_FREE
  [not found] <20250405013136.3863-1-hdanton@sina.com>
@ 2025-04-05 16:30 ` SeongJae Park
  2025-04-05 23:46   ` SeongJae Park
  0 siblings, 1 reply; 5+ messages in thread

From: SeongJae Park @ 2025-04-05 16:30 UTC (permalink / raw)
Cc: SeongJae Park, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
    Rik van Riel, Vlastimil Babka, linux-kernel, linux-mm, conduct

Deleting Hillf and adding conduct@ from/to recipients.

On Sat, 5 Apr 2025 09:31:35 +0800 Hillf Danton <hdanton@sina.com> wrote:

To my understanding, Hillf is banned from the community for now, due to a
CoC violation.  I'm gonna ignore his reply until his mails can be seen on
the mailing list again.  I believe that aligns with the meaning of the
ban, and helps people be less confused by replies to mails that are not
visible on the mailing list archives.  Please let me know if you have any
concerns about this.

And I have a humble suggestion for conduct@.  This time I was luckily
aware of the fact that Hillf is banned for now.  I think having an
official and formal way of knowing who is banned and for what time
period, and a guideline on how to react to this kind of mail from banned
people, could be helpful.

Thanks,
SJ

[...]
* Re: [PATCH v2 2/4] mm/madvise: batch tlb flushes for MADV_FREE
  2025-04-05 16:30 ` [PATCH v2 2/4] mm/madvise: batch tlb flushes for MADV_FREE SeongJae Park
@ 2025-04-05 23:46   ` SeongJae Park
  0 siblings, 0 replies; 5+ messages in thread

From: SeongJae Park @ 2025-04-05 23:46 UTC (permalink / raw)
To: SeongJae Park
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Rik van Riel,
    Vlastimil Babka, linux-kernel, linux-mm, conduct

On Sat, 5 Apr 2025 09:30:35 -0700 SeongJae Park <sj@kernel.org> wrote:

> Deleting Hillf and adding conduct@ from/to recipients.

I also wanted to delete the public mailing lists (linux-mm@ and
linux-kernel@) from the Cc list, to reduce unnecessary confusion.  But I
forgot to do that...  Sorry if this confused you by any chance.  Please
ignore this and the previous mail if you're not directly added to the
recipients list.

Thanks,
SJ

[...]
* [PATCH v2 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
@ 2025-04-04 21:06 SeongJae Park
  2025-04-04 21:06 ` [PATCH v2 2/4] mm/madvise: batch tlb flushes for MADV_FREE SeongJae Park
  0 siblings, 1 reply; 5+ messages in thread

From: SeongJae Park @ 2025-04-04 21:06 UTC (permalink / raw)
To: Andrew Morton
Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand, Lorenzo Stoakes,
    Rik van Riel, Shakeel Butt, Vlastimil Babka, kernel-team,
    linux-kernel, linux-mm

When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or
MADV_FREE with multiple address ranges, tlb flushes happen for each of
the given address ranges.  Because such tlb flushes are for the same
process, doing those in a batch is more efficient while still being
safe.  Modify the process_madvise() entry level code path to do such
batched tlb flushes, while the internal unmap logic does only gathering
of the tlb entries to flush.

In more detail, modify the entry functions to initialize an mmu_gather
object and pass it to the internal logic.  And make the internal logic
do only gathering of the tlb entries to flush, into the received
mmu_gather object.  After all internal function calls are done, the
entry functions flush the gathered tlb entries at once.

Because process_madvise() and madvise() share the internal unmap logic,
make the same change to the madvise() entry code together, to keep the
code consistent and cleaner.  It is only for keeping the code clean, and
shouldn't degrade madvise().  It could rather provide a potential tlb
flushes reduction benefit for a case where there are multiple vmas for
the given address range.  It is only a side effect of the effort to keep
the code clean, so we don't measure it separately.

Similar optimizations might be applicable to other madvise behaviors
such as MADV_COLD and MADV_PAGEOUT.  Those are simply out of the scope
of this patch series, though.
Patches Sequence
================

The first patch defines a new data structure for managing information
required for batched tlb flushes (mmu_gather and behavior), and updates
the code paths for the MADV_DONTNEED[_LOCKED] and MADV_FREE handling
internal logic to receive it.

The second patch batches tlb flushes for MADV_FREE handling, for both
madvise() and process_madvise().

The remaining two patches are for MADV_DONTNEED[_LOCKED] tlb flushes
batching.  The third patch splits zap_page_range_single() for batching
of MADV_DONTNEED[_LOCKED] handling.  The final and fourth patch batches
tlb flushes for the hint using the sub-logic that the third patch split
out, and the helpers for batched tlb flushes that were introduced for
the MADV_FREE case by the second patch.

Test Results
============

I measured the latency to apply MADV_DONTNEED advice to 256 MiB memory
using multiple process_madvise() calls.  I apply the advice in 4 KiB
sized region granularity, but with varying batch size per
process_madvise() call (vlen), from 1 to 1024.  The source code for the
measurement is available at GitHub[1].  To reduce measurement errors, I
did the measurement five times.

The measurement results are as below.  The 'sz_batch' column shows the
batch size of the process_madvise() calls.  The 'Before' and 'After'
columns show the averages of the latencies in nanoseconds, measured five
times, on kernels built without and with the tlb flushes batching of
this series (patches 3 and 4), respectively.  For the baseline, the
mm-new tree of 2025-04-04[2] has been used.  The 'B-stdev' and 'A-stdev'
columns show the ratios of the latency measurements' standard deviation
to the average, in percent, for 'Before' and 'After', respectively.
'Latency_reduction' shows the reduction of the latency that 'After' has
achieved compared to 'Before', in percent.  Higher 'Latency_reduction'
values mean more efficiency improvements.
    sz_batch   Before       B-stdev   After        A-stdev   Latency_reduction
    1          110948138.2  5.55      109476402.8  4.28       1.33
    2           75678535.6  1.67       70470722.2  3.55       6.88
    4           59530647.6  4.77       51735606.6  3.44      13.09
    8           50013051.6  4.39       44377029.8  5.20      11.27
    16          48657878.2  9.32       37291600.4  3.39      23.36
    32          43614180.2  6.06       34127428    3.75      21.75
    64          42466694.2  5.70       26737935.2  2.54      37.04
    128         42977970    6.99       25050444.2  4.06      41.71
    256         41549546    1.88       24644375.8  3.77      40.69
    512         42162698.6  6.17       24068224.8  2.87      42.92
    1024        40978574    5.44       23872024.2  3.65      41.75

As expected, tlb flushes batching provides a latency reduction that is
roughly proportional to the batch size.  The efficiency gain ranges from
about 6.88 percent with batch size 2, to about 40 percent with batch
size 128.  Please note that this is a very simple microbenchmark, so the
real efficiency gain on real workloads could be very different.

Changelog
=========

Changes from v1
(https://lore.kernel.org/20250310172318.653630-1-sj@kernel.org)
- Split the code cleanup part out
- Keep the order between tlb flushes and hugetlb_zap_end()
- Put the mm/memory change just before its real usage
- Add VM_WARN_ON_ONCE() for an invalid tlb argument to unmap_vma_single()
- Cleanups following nice reviewers' suggestions

Changes from RFC
(https://lore.kernel.org/20250305181611.54484-1-sj@kernel.org)
- Clarify the motivation of the change on the cover letter
- Add average and stdev of evaluation results
- Show latency reduction on evaluation results
- Fix !CONFIG_MEMORY_FAILURE build error
- Rename is_memory_populate() to is_madvise_populate()
- Squash patches 5-8
- Add kerneldoc for unmap_vm_area_struct()
- Squash patches 10 and 11
- Squash patches 12-14
- Squash patches 15 and 16

References
==========

[1] https://github.com/sjp38/eval_proc_madvise
[2] commit edd67244fe67 ("mm/show_mem: optimize si_meminfo_node by
    reducing redundant code") # mm-new

SeongJae Park (4):
  mm/madvise: define and use madvise_behavior struct for
    madvise_do_behavior()
  mm/madvise: batch tlb flushes for MADV_FREE
  mm/memory: split non-tlb flushing part from zap_page_range_single()
  mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED]

 mm/internal.h |   3 ++
 mm/madvise.c  | 110 ++++++++++++++++++++++++++++++++++++--------------
 mm/memory.c   |  47 ++++++++++++++++-----
 3 files changed, 121 insertions(+), 39 deletions(-)

base-commit: 85b87628fae973dedae95f2ea2782b7df4c11322
--
2.39.5
* [PATCH v2 2/4] mm/madvise: batch tlb flushes for MADV_FREE
  2025-04-04 21:06 [PATCH v2 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
@ 2025-04-04 21:06 ` SeongJae Park
  2025-04-08 12:58   ` Lorenzo Stoakes
  0 siblings, 1 reply; 5+ messages in thread

From: SeongJae Park @ 2025-04-04 21:06 UTC (permalink / raw)
To: Andrew Morton
Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand, Lorenzo Stoakes,
    Rik van Riel, Shakeel Butt, Vlastimil Babka, kernel-team,
    linux-kernel, linux-mm

MADV_FREE handling for [process_]madvise() flushes tlb for each vma of
each address range.  Update the logic to do tlb flushes in a batched
way.  Initialize an mmu_gather object from do_madvise() and
vector_madvise(), which are the entry level functions for
[process_]madvise(), respectively.  And pass those objects to the
function for per-vma work, via the madvise_behavior struct.  Make the
per-vma logic not flush tlb on its own but just save the tlb entries to
the received mmu_gather object.  Finally, the entry level functions
flush the tlb entries gathered for the entire user request, at once.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/madvise.c | 59 +++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 47 insertions(+), 12 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 8bcfdd995d18..564095e381b2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -799,12 +799,13 @@ static const struct mm_walk_ops madvise_free_walk_ops = {
 	.walk_lock = PGWALK_RDLOCK,
 };
 
-static int madvise_free_single_vma(struct vm_area_struct *vma,
-		unsigned long start_addr, unsigned long end_addr)
+static int madvise_free_single_vma(
+		struct madvise_behavior *behavior, struct vm_area_struct *vma,
+		unsigned long start_addr, unsigned long end_addr)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_notifier_range range;
-	struct mmu_gather tlb;
+	struct mmu_gather *tlb = behavior->tlb;
 
 	/* MADV_FREE works for only anon vma at the moment */
 	if (!vma_is_anonymous(vma))
@@ -820,17 +821,14 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 			range.start, range.end);
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm);
 	update_hiwater_rss(mm);
 
 	mmu_notifier_invalidate_range_start(&range);
-	tlb_start_vma(&tlb, vma);
+	tlb_start_vma(tlb, vma);
 	walk_page_range(vma->vm_mm, range.start, range.end,
-			&madvise_free_walk_ops, &tlb);
-	tlb_end_vma(&tlb, vma);
+			&madvise_free_walk_ops, tlb);
+	tlb_end_vma(tlb, vma);
 	mmu_notifier_invalidate_range_end(&range);
-	tlb_finish_mmu(&tlb);
-
 	return 0;
 }
 
@@ -953,7 +951,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 	if (action == MADV_DONTNEED || action == MADV_DONTNEED_LOCKED)
 		return madvise_dontneed_single_vma(vma, start, end);
 	else if (action == MADV_FREE)
-		return madvise_free_single_vma(vma, start, end);
+		return madvise_free_single_vma(behavior, vma, start, end);
 	else
 		return -EINVAL;
 }
@@ -1626,6 +1624,29 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
 		mmap_read_unlock(mm);
 }
 
+static bool madvise_batch_tlb_flush(int behavior)
+{
+	switch (behavior) {
+	case MADV_FREE:
+		return true;
+	default:
+		return false;
+	}
+}
+
+static void madvise_init_tlb(struct madvise_behavior *madv_behavior,
+		struct mm_struct *mm)
+{
+	if (madvise_batch_tlb_flush(madv_behavior->behavior))
+		tlb_gather_mmu(madv_behavior->tlb, mm);
+}
+
+static void madvise_finish_tlb(struct madvise_behavior *madv_behavior)
+{
+	if (madvise_batch_tlb_flush(madv_behavior->behavior))
+		tlb_finish_mmu(madv_behavior->tlb);
+}
+
 static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
 {
 	size_t len;
@@ -1782,14 +1803,20 @@ static int madvise_do_behavior(struct mm_struct *mm,
 int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
 {
 	int error;
-	struct madvise_behavior madv_behavior = {.behavior = behavior};
+	struct mmu_gather tlb;
+	struct madvise_behavior madv_behavior = {
+		.behavior = behavior,
+		.tlb = &tlb,
+	};
 
 	if (madvise_should_skip(start, len_in, behavior, &error))
 		return error;
 	error = madvise_lock(mm, behavior);
 	if (error)
 		return error;
+	madvise_init_tlb(&madv_behavior, mm);
 	error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
+	madvise_finish_tlb(&madv_behavior);
 	madvise_unlock(mm, behavior);
 
 	return error;
@@ -1806,13 +1833,18 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 {
 	ssize_t ret = 0;
 	size_t total_len;
-	struct madvise_behavior madv_behavior = {.behavior = behavior};
+	struct mmu_gather tlb;
+	struct madvise_behavior madv_behavior = {
+		.behavior = behavior,
+		.tlb = &tlb,
+	};
 
 	total_len = iov_iter_count(iter);
 
 	ret = madvise_lock(mm, behavior);
 	if (ret)
 		return ret;
+	madvise_init_tlb(&madv_behavior, mm);
 
 	while (iov_iter_count(iter)) {
 		unsigned long start = (unsigned long)iter_iov_addr(iter);
@@ -1841,14 +1873,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 		}
 
 		/* Drop and reacquire lock to unwind race. */
+		madvise_finish_tlb(&madv_behavior);
 		madvise_unlock(mm, behavior);
 		madvise_lock(mm, behavior);
+		madvise_init_tlb(&madv_behavior, mm);
 		continue;
 	}
 	if (ret < 0)
 		break;
 	iov_iter_advance(iter, iter_iov_len(iter));
 	}
+	madvise_finish_tlb(&madv_behavior);
 	madvise_unlock(mm, behavior);
 
 	ret = (total_len - iov_iter_count(iter)) ? : ret;
-- 
2.39.5
* Re: [PATCH v2 2/4] mm/madvise: batch tlb flushes for MADV_FREE
  2025-04-04 21:06 ` [PATCH v2 2/4] mm/madvise: batch tlb flushes for MADV_FREE SeongJae Park
@ 2025-04-08 12:58   ` Lorenzo Stoakes
  2025-04-08 18:49     ` SeongJae Park
  0 siblings, 1 reply; 5+ messages in thread

From: Lorenzo Stoakes @ 2025-04-08 12:58 UTC (permalink / raw)
To: SeongJae Park
Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand, Rik van Riel,
    Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel, linux-mm

On Fri, Apr 04, 2025 at 02:06:58PM -0700, SeongJae Park wrote:
> MADV_FREE handling for [process_]madvise() flushes tlb for each vma of
> each address range.  Update the logic to do tlb flushes in a batched
> way.  Initialize an mmu_gather object from do_madvise() and
> vector_madvise(), which are the entry level functions for
> [process_]madvise(), respectively.  And pass those objects to the
> function for per-vma work, via the madvise_behavior struct.  Make the
> per-vma logic not flush tlb on its own but just save the tlb entries to
> the received mmu_gather object.  Finally, the entry level functions
> flush the tlb entries gathered for the entire user request, at once.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>

Other than some nitty stuff, and a desire for some careful testing of the
horrid edge case that err... I introduced :P this looks fine, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/madvise.c | 59 +++++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 47 insertions(+), 12 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 8bcfdd995d18..564095e381b2 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -799,12 +799,13 @@ static const struct mm_walk_ops madvise_free_walk_ops = {
>  	.walk_lock = PGWALK_RDLOCK,
>  };
>
> -static int madvise_free_single_vma(struct vm_area_struct *vma,
> -		unsigned long start_addr, unsigned long end_addr)
> +static int madvise_free_single_vma(
> +		struct madvise_behavior *behavior, struct vm_area_struct *vma,

This is pedantic, but elsewhere you differentiate between int behavior
and struct madvise_behavior by referring to the latter as madv_behavior.

The naming kind of sucks in general though.

But for consistency, let's maybe rename this to madv_behavior, and we can
maybe do a commit later to do a rename across the board?

> +		unsigned long start_addr, unsigned long end_addr)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct mmu_notifier_range range;
> -	struct mmu_gather tlb;
> +	struct mmu_gather *tlb = behavior->tlb;
>
>  	/* MADV_FREE works for only anon vma at the moment */
>  	if (!vma_is_anonymous(vma))
> @@ -820,17 +821,14 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>  			range.start, range.end);
>
>  	lru_add_drain();
> -	tlb_gather_mmu(&tlb, mm);
>  	update_hiwater_rss(mm);
>
>  	mmu_notifier_invalidate_range_start(&range);
> -	tlb_start_vma(&tlb, vma);
> +	tlb_start_vma(tlb, vma);
>  	walk_page_range(vma->vm_mm, range.start, range.end,
> -			&madvise_free_walk_ops, &tlb);
> -	tlb_end_vma(&tlb, vma);
> +			&madvise_free_walk_ops, tlb);
> +	tlb_end_vma(tlb, vma);
>  	mmu_notifier_invalidate_range_end(&range);
> -	tlb_finish_mmu(&tlb);
> -
>  	return 0;
>  }
>
> @@ -953,7 +951,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>  	if (action == MADV_DONTNEED || action == MADV_DONTNEED_LOCKED)
>  		return madvise_dontneed_single_vma(vma, start, end);
>  	else if (action == MADV_FREE)
> -		return madvise_free_single_vma(vma, start, end);
> +		return madvise_free_single_vma(behavior, vma, start, end);
>  	else
>  		return -EINVAL;

On error paths, do we correctly finish the batched (botched? :P) TLB
operation?

>  }
> @@ -1626,6 +1624,29 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
>  		mmap_read_unlock(mm);
>  }
>
> +static bool madvise_batch_tlb_flush(int behavior)
> +{
> +	switch (behavior) {
> +	case MADV_FREE:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
> +static void madvise_init_tlb(struct madvise_behavior *madv_behavior,
> +		struct mm_struct *mm)
> +{
> +	if (madvise_batch_tlb_flush(madv_behavior->behavior))
> +		tlb_gather_mmu(madv_behavior->tlb, mm);
> +}
> +
> +static void madvise_finish_tlb(struct madvise_behavior *madv_behavior)
> +{
> +	if (madvise_batch_tlb_flush(madv_behavior->behavior))
> +		tlb_finish_mmu(madv_behavior->tlb);
> +}

These are nice.

> +
>  static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
>  {
>  	size_t len;
> @@ -1782,14 +1803,20 @@ static int madvise_do_behavior(struct mm_struct *mm,
>  int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
>  {
>  	int error;
> -	struct madvise_behavior madv_behavior = {.behavior = behavior};
> +	struct mmu_gather tlb;
> +	struct madvise_behavior madv_behavior = {
> +		.behavior = behavior,
> +		.tlb = &tlb,
> +	};
>
>  	if (madvise_should_skip(start, len_in, behavior, &error))
>  		return error;
>  	error = madvise_lock(mm, behavior);
>  	if (error)
>  		return error;
> +	madvise_init_tlb(&madv_behavior, mm);
>  	error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
> +	madvise_finish_tlb(&madv_behavior);
>  	madvise_unlock(mm, behavior);
>
>  	return error;
> @@ -1806,13 +1833,18 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>  {
>  	ssize_t ret = 0;
>  	size_t total_len;
> -	struct madvise_behavior madv_behavior = {.behavior = behavior};
> +	struct mmu_gather tlb;
> +	struct madvise_behavior madv_behavior = {
> +		.behavior = behavior,
> +		.tlb = &tlb,
> +	};

Again the naming is kinda yucky, but let's just yeah for now stick with
'madv_behavior' for values of this helper struct and 'behavior' for the
actual int value, and we can revisit that later.

>
>  	total_len = iov_iter_count(iter);
>
>  	ret = madvise_lock(mm, behavior);
>  	if (ret)
>  		return ret;
> +	madvise_init_tlb(&madv_behavior, mm);
>
>  	while (iov_iter_count(iter)) {
>  		unsigned long start = (unsigned long)iter_iov_addr(iter);
> @@ -1841,14 +1873,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>  		}
>
>  		/* Drop and reacquire lock to unwind race. */
> +		madvise_finish_tlb(&madv_behavior);
>  		madvise_unlock(mm, behavior);
>  		madvise_lock(mm, behavior);
> +		madvise_init_tlb(&madv_behavior, mm);
>  		continue;

Have you found a way in which to test this? Perhaps force this case and
find a means of asserting the TLB flushing behaves as expected? I think
we're ok from the logic, but it's such a tricky one it'd be good to find
a means of doing so, albeit in a manual way.

>  	}
>  	if (ret < 0)
>  		break;
>  	iov_iter_advance(iter, iter_iov_len(iter));
>  	}
> +	madvise_finish_tlb(&madv_behavior);
>  	madvise_unlock(mm, behavior);
>
>  	ret = (total_len - iov_iter_count(iter)) ? : ret;
> --
> 2.39.5
* Re: [PATCH v2 2/4] mm/madvise: batch tlb flushes for MADV_FREE
  2025-04-08 12:58   ` Lorenzo Stoakes
@ 2025-04-08 18:49     ` SeongJae Park
  0 siblings, 0 replies; 5+ messages in thread

From: SeongJae Park @ 2025-04-08 18:49 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, Andrew Morton, Liam R. Howlett, David Hildenbrand,
    Rik van Riel, Shakeel Butt, Vlastimil Babka, kernel-team,
    linux-kernel, linux-mm

On Tue, 8 Apr 2025 13:58:18 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Fri, Apr 04, 2025 at 02:06:58PM -0700, SeongJae Park wrote:
> > MADV_FREE handling for [process_]madvise() flushes tlb for each vma of
> > each address range.  Update the logic to do tlb flushes in a batched
> > way.  Initialize an mmu_gather object from do_madvise() and
> > vector_madvise(), which are the entry level functions for
> > [process_]madvise(), respectively.  And pass those objects to the
> > function for per-vma work, via the madvise_behavior struct.  Make the
> > per-vma logic not flush tlb on its own but just save the tlb entries
> > to the received mmu_gather object.  Finally, the entry level functions
> > flush the tlb entries gathered for the entire user request, at once.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
>
> Other than some nitty stuff, and a desire for some careful testing of the
> horrid edge case that err... I introduced :P this looks fine, so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thank you for your kind review!  I will make the next revision following
your suggestions, as I answered below.

> > ---
> >  mm/madvise.c | 59 +++++++++++++++++++++++++++++++++++++++++-----------
> >  1 file changed, 47 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 8bcfdd995d18..564095e381b2 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -799,12 +799,13 @@ static const struct mm_walk_ops madvise_free_walk_ops = {
> >  	.walk_lock = PGWALK_RDLOCK,
> >  };
> >
> > -static int madvise_free_single_vma(struct vm_area_struct *vma,
> > -		unsigned long start_addr, unsigned long end_addr)
> > +static int madvise_free_single_vma(
> > +		struct madvise_behavior *behavior, struct vm_area_struct *vma,
>
> This is pedantic, but elsewhere you differentiate between int behavior
> and struct madvise_behavior by referring to the latter as madv_behavior.
>
> The naming kind of sucks in general though.
>
> But for consistency, let's maybe rename this to madv_behavior, and we can
> maybe do a commit later to do a rename across the board?

I completely agree.  I will rename it so in the next spin.

> > +		unsigned long start_addr, unsigned long end_addr)
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	struct mmu_notifier_range range;
> > -	struct mmu_gather tlb;
> > +	struct mmu_gather *tlb = behavior->tlb;
> >
> >  	/* MADV_FREE works for only anon vma at the moment */
> >  	if (!vma_is_anonymous(vma))
[...]
> > @@ -953,7 +951,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> >  	if (action == MADV_DONTNEED || action == MADV_DONTNEED_LOCKED)
> >  		return madvise_dontneed_single_vma(vma, start, end);
> >  	else if (action == MADV_FREE)
> > -		return madvise_free_single_vma(vma, start, end);
> > +		return madvise_free_single_vma(behavior, vma, start, end);
> >  	else
> >  		return -EINVAL;
>
> On error paths, do we correctly finish the batched (botched? :P) TLB
> operation?

Yes, the change calls tlb_finish_mmu() and tlb_gather_mmu() as needed in
the error paths.  Of course I might have forgotten to call those in some
edge cases.  Please let me know if you find such mistakes.

> >  }
[...]
> > @@ -1841,14 +1873,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> >  		}
> >
> >  		/* Drop and reacquire lock to unwind race. */
> > +		madvise_finish_tlb(&madv_behavior);
> >  		madvise_unlock(mm, behavior);
> >  		madvise_lock(mm, behavior);
> > +		madvise_init_tlb(&madv_behavior, mm);
> >  		continue;
>
> Have you found a way in which to test this? Perhaps force this case and
> find a means of asserting the TLB flushing behaves as expected? I think
> we're ok from the logic, but it's such a tricky one it'd be good to find
> a means of doing so, albeit in a manual way.

No, unfortunately I haven't found a good way to test this case.

> >  	}
> >  	if (ret < 0)
> >  		break;
> >  	iov_iter_advance(iter, iter_iov_len(iter));
> >  	}
> > +	madvise_finish_tlb(&madv_behavior);
> >  	madvise_unlock(mm, behavior);
> >
> >  	ret = (total_len - iov_iter_count(iter)) ? : ret;
> > --
> > 2.39.5

Thanks,
SJ