Linux-mm Archive on lore.kernel.org
* Re: [PATCH v4 4/5] ksm: Optimize rmap_walk_ksm by passing a suitable address range
@ 2026-05-15  7:13 xu.xin16
  2026-05-15 12:28 ` Lorenzo Stoakes
  0 siblings, 1 reply; 5+ messages in thread
From: xu.xin16 @ 2026-05-15  7:13 UTC (permalink / raw)
  To: david; +Cc: akpm, ljs, hughd, linux-mm, linux-kernel, michel

> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 0299a53ba7c9..a13184d00759 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -3200,6 +3200,7 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
> >  	hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
> >  		/* Ignore the stable/unstable/sqnr flags */
> >  		const unsigned long addr = rmap_item->address & PAGE_MASK;
> > +		const unsigned long vm_pgoff = rmap_item->vm_pgoff;
> >  		struct anon_vma *anon_vma = rmap_item->anon_vma;
> >  		struct anon_vma_chain *vmac;
> >  		struct vm_area_struct *vma;
> > @@ -3213,8 +3214,12 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
> >  			anon_vma_lock_read(anon_vma);
> >  		}
> >
> > +		/*
> > +		 * Currently KSM folios are order-0 normal pages, so pgoff_end
> > +		 * should be the same as pgoff_start.
> > +		 */
> >  		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
> > -					       0, ULONG_MAX) {
> > +					       vm_pgoff, vm_pgoff) {
>
> But vm_pgoff would just correspond to the start of the VMA, not where the page
> is actually mapped?
>
> I'd assume you really want the linear page index of the original page?

Right. I've reconsidered and realized that using vm_pgoff is indeed unreliable.

My initial idea was: as long as we can find the VMA that maps this page,
it's sufficient for anon_vma_interval_tree_foreach() to check whether
"vm_pgoff <= pgoff of the original page <= (vm_pgoff + vma_pages(v) - 1)".

However, the flaw is that the VMA may be split (e.g., due to madvise or
mprotect), causing vma_pages(v) to change and the condition above to no
longer hold.

Indeed, it&apos;s better to use the linear page index of the original page.

I&apos;ll send v5 to correct this.

>
> --
> Cheers,
>
> David
>


* [PATCH v4 0/5] KSM: Optimizations for rmap_walk_ksm
@ 2026-05-03 12:35 xu.xin16
  2026-05-03 12:50 ` [PATCH v4 4/5] ksm: Optimize rmap_walk_ksm by passing a suitable address range xu.xin16
  0 siblings, 1 reply; 5+ messages in thread
From: xu.xin16 @ 2026-05-03 12:35 UTC (permalink / raw)
  To: akpm, david, ljs, hughd; +Cc: linux-mm, linux-kernel, michel, xu.xin16

From: xu xin <xu.xin16@zte.com.cn>

Deep investigation revealed that 99.9% of the iterations inside
rmap_walk_ksm's anon_vma_interval_tree_foreach loop are skipped by the
first check "if (addr < vma->vm_start || addr >= vma->vm_end)", i.e. the
vast majority of loop iterations do no useful work. This is because the
pgoff_start and pgoff_end parameters passed to
anon_vma_interval_tree_foreach span the entire pgoff range from 0 to
ULONG_MAX, so every VMA on the anon_vma's interval tree is visited.

An initial, immature thought was to pass "rmap_item->address >> PAGE_SHIFT"
as the pgoff to anon_vma_interval_tree_foreach(). But this is flawed:
when a range has been moved by mremap, its anon folio indexes and anon_vma
pgoff correspond to the original user address, not to the current user
address, as was pointed out at:

  https://lore.kernel.org/all/02e1b8df-d568-8cbb-b8f6-46d5476d9d75@google.com/

Looking at the implementation of anon_vma_interval_tree_foreach, it
essentially iterates over the VMAs whose pgoff range
[vm_pgoff, vm_pgoff + vma_pages(vma) - 1] contains the provided pgoff.

So the solution is to add a vm_pgoff field to ksm_rmap_item and pass
vm_pgoff instead of address >> PAGE_SHIFT.
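
Roughly, the shape of the change (a sketch only, not the actual patch; the
exact field placement and the recording site are illustrative):

struct ksm_rmap_item {
	/* ... existing fields (rmap_list, anon_vma, mm, address, ...) ... */
	pgoff_t vm_pgoff;	/* vma->vm_pgoff, recorded while scanning */
};

rmap_walk_ksm() then passes rmap_item->vm_pgoff as both pgoff_start and
pgoff_end, as in patch 4/5.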


Changes in v4:
 - Add a tracepoint for rmap_walk
 - Provide a testbench for rmap_walk
 - Add vm_pgoff field in ksm_rmap_item
 - Use vm_pgoff instead of address >> PAGE_SHIFT (Suggested by David and Lorenzo)

Changes in v3:
- Fix some typos in commit description
- Replace "pgoff_start" and 'pgoff_end' by 'pgoff'.

Changes in v2:
 - Use const variables to initialize 'addr', 'pgoff_start' and 'pgoff_end'
- Let pgoff_end = pgoff_start, since KSM folios are always order-0 (Suggested by David)

xu xin (5):
  mm/rmap: add tracepoint for rmap_walk
  tools/testing: add rmap walk latency benchmark for KSM, anonymous and
    file pages
  ksm: add vm_pgoff into ksm_rmap_item
  ksm: Optimize rmap_walk_ksm by passing a suitable address range
  ksm: add mremap selftests for ksm_rmap_walk

 MAINTAINERS                          |   3 +
 include/trace/events/rmap.h          |  49 +++
 mm/ksm.c                             |  48 ++-
 mm/rmap.c                            |  14 +
 tools/testing/rmap/Makefile          |  11 +
 tools/testing/rmap/rmap_benchmark.c  | 488 +++++++++++++++++++++++++++
 tools/testing/selftests/mm/rmap.c    |  79 +++++
 tools/testing/selftests/mm/vm_util.c |  38 +++
 tools/testing/selftests/mm/vm_util.h |   2 +
 9 files changed, 724 insertions(+), 8 deletions(-)
 create mode 100644 include/trace/events/rmap.h
 create mode 100644 tools/testing/rmap/Makefile
 create mode 100644 tools/testing/rmap/rmap_benchmark.c

-- 
2.25.1



Thread overview: 5+ messages
2026-05-15  7:13 [PATCH v4 4/5] ksm: Optimize rmap_walk_ksm by passing a suitable address range xu.xin16
2026-05-15 12:28 ` Lorenzo Stoakes
  -- strict thread matches above, loose matches on Subject: below --
2026-05-03 12:35 [PATCH v4 0/5] KSM: Optimizations for rmap_walk_ksm xu.xin16
2026-05-03 12:50 ` [PATCH v4 4/5] ksm: Optimize rmap_walk_ksm by passing a suitable address range xu.xin16
2026-05-13 12:10   ` David Hildenbrand (Arm)
2026-05-15  7:15     ` xu.xin16
