* [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type
@ 2025-07-24 8:44 Huan Yang
2025-07-24 8:44 ` [RFC PATCH 1/9] mm: introduce PAGE_TYPE_SHIFT Huan Yang
` (10 more replies)
0 siblings, 11 replies; 25+ messages in thread
From: Huan Yang @ 2025-07-24 8:44 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple,
Matthew Wilcox (Oracle), Huan Yang, Christian Brauner, Usama Arif,
Yu Zhao, Baolin Wang, linux-mm, linux-kernel
Summary
==
This patchset reuses page_type to store the migrate entry count during
the period from migrate entry setup to removal, enabling accelerated VMA
traversal when removing migrate entries, following a principle similar to
the early termination performed in try_to_migrate once the folio is
unmapped.
In my self-constructed test scenario, the migration time can be reduced
from 150+ms to around 30ms, achieving nearly a 70% performance
improvement. Additionally, the flame graph shows that the proportion of
remove_migration_ptes can be reduced from 80%+ to 60%+.
Note: "migrate entry" here specifically refers to a migrate PTE entry,
since large folios support neither this page_type usage nor the
zero-mapcount reuse described below.
Principle
==
When try_to_migrate removes all of a page's PTEs and replaces them with
migrate PTE entries, we can determine whether the traversal of the
remaining VMAs can be terminated early by checking whether the mapcount
is zero. This optimization helps improve performance during migration.
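For reference, the existing early exit works roughly as sketched below
(a simplified sketch of current mainline behaviour; folio_not_mapped()
and the done-hook check are also visible in the diff context of patches
3 and 4):
```c
/* mm/rmap.c (simplified): the done hook used by try_to_migrate */
static int folio_not_mapped(struct folio *folio)
{
	return !folio_mapped(folio);	/* mapcount dropped to zero */
}

/* rmap_walk_anon(), after handling each VMA */
	if (!rwc->rmap_one(folio, vma, address, rwc->arg))
		break;
	if (rwc->done && rwc->done(folio))
		break;	/* folio fully unmapped, stop walking remaining VMAs */
```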
However, when removing migrate PTE entries and setting up PTEs for the
destination folio in remove_migration_ptes, there is no such information
available to assist in deciding whether the traversal of the remaining
VMAs can be ended early. Therefore, it is necessary to traverse all VMAs
associated with this folio.
In reality, when a folio is fully unmapped and before all migrate PTE
entries are removed, the mapcount will always be zero. Since page_type
and mapcount share a union, and as folio_mapcount shows, we can reuse
page_type to record the number of migrate PTE entries of the current
folio in the system, as long as it is not a large folio. This reuse does
not affect calls to folio_mapcount, which will still always return zero.
Therefore, we can set the folio's page_type to PGTY_mgt_entry when
try_to_migrate completes, the folio is already unmapped, and it's not a
large folio. The remaining 24 bits can then be used to record the number
of migrate PTE entries generated by try_to_migrate.
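For illustration, the intended layout of the 32-bit page_type word is
sketched below (assuming the PAGE_TYPE_SHIFT/PAGE_TYPE_MASK macros added
in patch 1 and the value helpers added in patch 2):
```c
/*
 *  31            24 23                          0
 * +----------------+-----------------------------+
 * | PGTY_mgt_entry |   nr_mgt_entry (24 bits)    |
 * +----------------+-----------------------------+
 */
page->page_type = ((unsigned int)PGTY_mgt_entry << PAGE_TYPE_SHIFT) |
		  (nr_mgt_entry & PAGE_TYPE_MASK);
```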
Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
zero, we can terminate the VMA traversal early.
It's important to note that we must initialize the folio's page_type to
PGTY_mgt_entry and set the migrate entry count only while holding the
rmap walk lock. This is because, while the lock is held, we can prevent a
new VMA from being added by fork (which would increase migrate entries)
and a VMA from being unmapped (which would decrease migrate entries).
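A minimal sketch of how the count is published while the lock is still
held, mirroring the exit hook that patch 9 registers in try_to_migrate:
```c
/* Runs from rmap_walk before the anon_vma/i_mmap lock is dropped. */
static void folio_set_migrate_entry_type(struct folio *folio,
					 struct rmap_walk_control *rwc)
{
	struct migrate_walk_arg *mwa = rwc->arg;

	/* Only small, fully unmapped folios can reuse page_type. */
	if (mwa->nr_migrate_entry && !folio_test_large(folio) &&
	    !folio_mapped(folio))
		folio_init_mgte(folio, mwa->nr_migrate_entry);
}
```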
However, I suspect there is actually an additional window here; take
anon as an example:

Process Parent                      fork
try_to_migrate
                                    anon_vma_clone
                                      write_lock
                                      avc_insert_tree tail
                                      ....
folio_lock_anon_vma_read            copy_pte_range
  vma_iter                            pte_lock
  ....                                pte_present copy
                                      ...
  pte_lock
  new forked pte clean
  ....
remove_migration_ptes
  rmap_walk_anon_lock
If my understanding is correct and such a critical section exists, it
shouldn't cause any issues—newly added PTEs can still be properly
removed and converted into migrate entries.
But in this case:

Process Parent                      fork
try_to_migrate
                                    anon_vma_clone
                                      write_lock
                                      avc_insert_tree
                                      ....
folio_lock_anon_vma_read            copy_pte_range
  vma_iter
  pte_lock
  migrate entry set
  ....                                pte_lock
                                      pte_nonpresent copy
                                      ....
  ....
remove_migration_ptes
  rmap_walk_anon_lock
If the parent process first acquires the pte_lock to set a migrate
entry, the child process will then directly copy the non-present migrate
entry, resulting in an increase in migrate entries. However, since the
newly added VMA is positioned later in the rb tree of the folio's
anon_vma, when we traverse to this child-process-added migrate entry,
the count of migrate entries will still be correctly recorded, and this
will not cause any issues.
If I misunderstand, please correct me. :)
After a folio exits try_to_migrate and before remove_migration_ptes
acquires the rmap lock, the system can perform normal fork and unmap
operations. Therefore, we need to increment or decrement the migrate
entry count recorded in the folio (if it's of type PGTY_mgt_entry) when
handling copy/zap_nonpresent_pte.
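Concretely, the adjustments look roughly like this (a sketch mirroring
the mm/memory.c hunks in patch 9; both helpers internally bail out if
the folio is not of type PGTY_mgt_entry):
```c
/* copy_nonpresent_pte(), migration-entry branch: fork adds an entry */
	folio_inc_mgte_count(folio);

/* zap_nonpresent_ptes(), migration-entry branch: unmap drops an entry */
	folio_dec_mgte_count(folio);
```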
When performing remove_migration_ptes during migration to start removing
migrate entries, we need to dynamically decrement the recorded migrate
entry count. Once this count reaches zero, it indicates there are no
remaining migrate entries in the associated VMAs that need to be cleared
and replaced with the destination PFN. This allows us to safely
terminate the VMA traversal early.
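The corresponding check is the done hook wired up in patch 9; a sketch:
```c
/*
 * remove_migration_pte() calls folio_dec_mgte_count(src) for every
 * migrate entry it restores; the done hook then stops the rmap walk
 * once nothing is left.
 */
static int folio_removed_all_migrate_entry(struct folio *folio,
					   struct rmap_walk_control *rwc)
{
	struct rmap_walk_arg *arg = rwc->arg;

	/* No migrate PTE entries remain: the walk can stop early. */
	return !folio_get_mgte_count(arg->src);
}
```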
However, it's important to note that if issues occur during migration
and an undo operation is required, PGTY_mgt_entry can no longer be used.
This is because the dst needs to be set back to the src, and the presence
of PGTY_mgt_entry would interfere with the normal usage of mapcount when
setting up the rmap information.
Test
==
I set up a 2-node test environment using QEMU, and used mbind to trigger
page migration between nodes for the specified VMA.
The core idea of the test scenario is to create a situation where the
number of VMAs that need to be iterated in the anon_vma is significantly
larger than the folio's mapcount.
To achieve this, I constructed an exaggerated scenario: the parent
process allocates 5MB of memory and binds it to node0, then immediately
forks 1000 child processes. Each child process immediately memsets all of
this memory so that every page is COW-ed. Afterwards, the parent process
calls mbind to migrate the memory from node0 to node1, while recording
the time consumed during this period.
Additionally, perf is used to capture a flame graph during the mbind
execution.
The time cost results are as follows:

    Patch1-9        Normal(f817b6d)
    18ms            197ms
    58ms            152ms
    40ms            120ms
The hot paths shown in the flame graph:
Patch1-9
move_to_new_folio 38.89%
remove_migration_ptes 61.11%
---------------------
move_to_new_folio 32.76%
remove_migration_ptes 67.24%
---------------------
move_to_new_folio 37.50%
remove_migration_ptes 62.50%
Normal(f817b6d)
move_to_new_folio 11.43%
remove_migration_ptes 87.43%
---------------------
move_to_new_folio 13.91%
remove_migration_ptes 86.09%
---------------------
move_to_new_folio 12.50%
remove_migration_ptes 85.83%
It is easy to see that the time cost is reduced by approximately 75.3%,
and the proportion of the remove_migration_ptes path in the profile
decreases by roughly 20 percentage points.
Simplified test code:
```c
/* build with: gcc -O2 test.c -o test -lnuma */
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <numaif.h>

#define size		(5 << 20)
#define CHILD_COUNT	1000

int main(void)
{
	int *buffer = (int *)mmap(NULL, size, PROT_READ | PROT_WRITE,
				  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	unsigned long mask = 1UL << 0;

	mbind(buffer, size, MPOL_BIND, &mask, 2, 0);
	/* let all pages be faulted in on node 0 */
	memset(buffer, 0, size);

	/* fork children */
	pid_t children[CHILD_COUNT];

	for (int i = 0; i < CHILD_COUNT; i++) {
		pid_t pid = fork();

		if (pid == 0) {
			/* let every child process COW the whole buffer */
			memset(buffer, 0, size);
			sleep(100000);
			_exit(0);
		} else {
			children[i] = pid;
		}
	}

	/* maybe you need to sleep to wait for the children to COW */
	sleep(10);

	/* You can start perf here */
	mask = 1UL << 1;
	/* migrate this buffer from node 0 -> node 1 */
	mbind(buffer, size, MPOL_BIND, &mask, 4, MPOL_MF_MOVE);

	return 0;
}
```
Note: this code omits error checking, resource cleanup, timing
instrumentation, and so on.
Why RFC
==
Memory migration is a very general-purpose mechanism.
My own tests cannot cover all system scenarios, and there may be
omissions or misunderstandings in the code modifications.
If the approach looks good enough, I will send a formal patch.
Patches 1-7 do some code cleanup.
Patch 8 prepares the infrastructure for PGTY_mgt_entry.
Patch 9 applies it.
Huan Yang (9):
mm: introduce PAGE_TYPE_SHIFT
mm: add page_type value helper
mm/rmap: simplify rmap_walk invoke
mm/rmap: add args in rmap_walk_control done hook
mm/rmap: introduce exit hook
mm/rmap: introduce migrate_walk_arg
mm/migrate: rename rmap_walk_arg folio
mm/migrate: infrastructure for migrate entry page_type.
mm/migrate: apply migrate entry page_type
include/linux/page-flags.h | 106 +++++++++++++++++++++++++++++++++++--
include/linux/rmap.h | 7 ++-
mm/ksm.c | 2 +-
mm/memory.c | 2 +
mm/migrate.c | 38 ++++++++++---
mm/rmap.c | 85 ++++++++++++++++++-----------
6 files changed, 193 insertions(+), 47 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 25+ messages in thread
* [RFC PATCH 1/9] mm: introduce PAGE_TYPE_SHIFT
2025-07-24 8:44 [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type Huan Yang
@ 2025-07-24 8:44 ` Huan Yang
2025-07-24 8:44 ` [RFC PATCH 2/9] mm: add page_type value helper Huan Yang
` (9 subsequent siblings)
10 siblings, 0 replies; 25+ messages in thread
From: Huan Yang @ 2025-07-24 8:44 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple,
Matthew Wilcox (Oracle), Huan Yang, Christian Brauner, Usama Arif,
Yu Zhao, Baolin Wang, linux-mm, linux-kernel, Kirill A. Shutemov
The current shift value for page_type is 24. To avoid hardcoding it,
define the macros PAGE_TYPE_SHIFT and PAGE_TYPE_MASK.
No functional change.
Signed-off-by: Huan Yang <link@vivo.com>
---
include/linux/page-flags.h | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4fe5ee67535b..3c7103c2eee4 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -961,9 +961,12 @@ enum pagetype {
PGTY_mapcount_underflow = 0xff
};
+#define PAGE_TYPE_SHIFT 24
+#define PAGE_TYPE_MASK ((1 << PAGE_TYPE_SHIFT) - 1)
+
static inline bool page_type_has_type(int page_type)
{
- return page_type < (PGTY_mapcount_underflow << 24);
+ return page_type < (PGTY_mapcount_underflow << PAGE_TYPE_SHIFT);
}
/* This takes a mapcount which is one more than page->_mapcount */
@@ -980,7 +983,8 @@ static inline bool page_has_type(const struct page *page)
#define FOLIO_TYPE_OPS(lname, fname) \
static __always_inline bool folio_test_##fname(const struct folio *folio) \
{ \
- return data_race(folio->page.page_type >> 24) == PGTY_##lname; \
+ return data_race(folio->page.page_type >> PAGE_TYPE_SHIFT) \
+ == PGTY_##lname; \
} \
static __always_inline void __folio_set_##fname(struct folio *folio) \
{ \
@@ -988,7 +992,8 @@ static __always_inline void __folio_set_##fname(struct folio *folio) \
return; \
VM_BUG_ON_FOLIO(data_race(folio->page.page_type) != UINT_MAX, \
folio); \
- folio->page.page_type = (unsigned int)PGTY_##lname << 24; \
+ folio->page.page_type = (unsigned int)PGTY_##lname \
+ << PAGE_TYPE_SHIFT; \
} \
static __always_inline void __folio_clear_##fname(struct folio *folio) \
{ \
@@ -1002,14 +1007,16 @@ static __always_inline void __folio_clear_##fname(struct folio *folio) \
FOLIO_TYPE_OPS(lname, fname) \
static __always_inline int Page##uname(const struct page *page) \
{ \
- return data_race(page->page_type >> 24) == PGTY_##lname; \
+ return data_race(page->page_type >> PAGE_TYPE_SHIFT) \
+ == PGTY_##lname; \
} \
static __always_inline void __SetPage##uname(struct page *page) \
{ \
if (Page##uname(page)) \
return; \
VM_BUG_ON_PAGE(data_race(page->page_type) != UINT_MAX, page); \
- page->page_type = (unsigned int)PGTY_##lname << 24; \
+ page->page_type = (unsigned int)PGTY_##lname \
+ << PAGE_TYPE_SHIFT; \
} \
static __always_inline void __ClearPage##uname(struct page *page) \
{ \
--
2.34.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [RFC PATCH 2/9] mm: add page_type value helper
2025-07-24 8:44 [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type Huan Yang
2025-07-24 8:44 ` [RFC PATCH 1/9] mm: introduce PAGE_TYPE_SHIFT Huan Yang
@ 2025-07-24 8:44 ` Huan Yang
2025-07-24 8:44 ` [RFC PATCH 3/9] mm/rmap: simplify rmap_walk invoke Huan Yang
` (8 subsequent siblings)
10 siblings, 0 replies; 25+ messages in thread
From: Huan Yang @ 2025-07-24 8:44 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple,
Matthew Wilcox (Oracle), Huan Yang, Christian Brauner, Usama Arif,
Yu Zhao, Baolin Wang, linux-mm, linux-kernel, Kirill A. Shutemov
Add two helper functions, __PageSetXXXValue() and __PageGetXXXValue(), to
assist in storing and reading a value in the lower bits of the specified
page_type. Since storing a page_type value is currently not supported on
large folios, only the page version of the helper functions is provided.
Signed-off-by: Huan Yang <link@vivo.com>
---
include/linux/page-flags.h | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 3c7103c2eee4..52c9435079d5 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -1024,6 +1024,23 @@ static __always_inline void __ClearPage##uname(struct page *page) \
return; \
VM_BUG_ON_PAGE(!Page##uname(page), page); \
page->page_type = UINT_MAX; \
+} \
+static __always_inline void __PageSet##uname##Value(struct page *page, \
+ unsigned int value) \
+{ \
+ if (!Page##uname(page)) \
+ return; \
+ if (unlikely(value > (PAGE_TYPE_MASK))) \
+ return; \
+ WRITE_ONCE(page->page_type, (unsigned int)PGTY_##lname \
+ << PAGE_TYPE_SHIFT | value); \
+} \
+static __always_inline unsigned int __PageGet##uname##Value( \
+ struct page *page) \
+{ \
+ if (!Page##uname(page)) \
+ return 0; \
+ return READ_ONCE(page->page_type) & PAGE_TYPE_MASK; \
}
/*
--
2.34.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [RFC PATCH 3/9] mm/rmap: simplify rmap_walk invoke
2025-07-24 8:44 [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type Huan Yang
2025-07-24 8:44 ` [RFC PATCH 1/9] mm: introduce PAGE_TYPE_SHIFT Huan Yang
2025-07-24 8:44 ` [RFC PATCH 2/9] mm: add page_type value helper Huan Yang
@ 2025-07-24 8:44 ` Huan Yang
2025-07-24 8:44 ` [RFC PATCH 4/9] mm/rmap: add args in rmap_walk_control done hook Huan Yang
` (7 subsequent siblings)
10 siblings, 0 replies; 25+ messages in thread
From: Huan Yang @ 2025-07-24 8:44 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple,
Matthew Wilcox (Oracle), Huan Yang, Christian Brauner, Usama Arif,
Yu Zhao, Baolin Wang, linux-mm, linux-kernel
Currently, the rmap walk is split into two functions, rmap_walk_locked
and rmap_walk, whose implementations are very similar.
This patch merges them into a single rmap_walk function and moves the
locked parameter into rmap_walk_control.
No functional change.
Signed-off-by: Huan Yang <link@vivo.com>
---
include/linux/rmap.h | 3 ++-
mm/migrate.c | 6 ++----
mm/rmap.c | 43 ++++++++++++++++---------------------------
3 files changed, 20 insertions(+), 32 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 45904ff413ab..f0d17c971a20 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -996,6 +996,7 @@ void remove_migration_ptes(struct folio *src, struct folio *dst, int flags);
* arg: passed to rmap_one() and invalid_vma()
* try_lock: bail out if the rmap lock is contended
* contended: indicate the rmap traversal bailed out due to lock contention
+ * locked: already locked before invoke rmap_walk
* rmap_one: executed on each vma where page is mapped
* done: for checking traversing termination condition
* anon_lock: for getting anon_lock by optimized way rather than default
@@ -1005,6 +1006,7 @@ struct rmap_walk_control {
void *arg;
bool try_lock;
bool contended;
+ bool locked;
/*
* Return false if page table scanning in rmap_walk should be stopped.
* Otherwise, return true.
@@ -1018,7 +1020,6 @@ struct rmap_walk_control {
};
void rmap_walk(struct folio *folio, struct rmap_walk_control *rwc);
-void rmap_walk_locked(struct folio *folio, struct rmap_walk_control *rwc);
struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
struct rmap_walk_control *rwc);
diff --git a/mm/migrate.c b/mm/migrate.c
index 8cf0f9c9599d..a5a49af7857a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -355,15 +355,13 @@ void remove_migration_ptes(struct folio *src, struct folio *dst, int flags)
struct rmap_walk_control rwc = {
.rmap_one = remove_migration_pte,
+ .locked = flags & RMP_LOCKED,
.arg = &rmap_walk_arg,
};
VM_BUG_ON_FOLIO((flags & RMP_USE_SHARED_ZEROPAGE) && (src != dst), src);
- if (flags & RMP_LOCKED)
- rmap_walk_locked(dst, &rwc);
- else
- rmap_walk(dst, &rwc);
+ rmap_walk(dst, &rwc);
}
/*
diff --git a/mm/rmap.c b/mm/rmap.c
index a312cae16bb5..bae9f79c7dc9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2253,14 +2253,12 @@ void try_to_unmap(struct folio *folio, enum ttu_flags flags)
struct rmap_walk_control rwc = {
.rmap_one = try_to_unmap_one,
.arg = (void *)flags,
+ .locked = flags & TTU_RMAP_LOCKED,
.done = folio_not_mapped,
.anon_lock = folio_lock_anon_vma_read,
};
- if (flags & TTU_RMAP_LOCKED)
- rmap_walk_locked(folio, &rwc);
- else
- rmap_walk(folio, &rwc);
+ rmap_walk(folio, &rwc);
}
/*
@@ -2581,6 +2579,7 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
.rmap_one = try_to_migrate_one,
.arg = (void *)flags,
.done = folio_not_mapped,
+ .locked = flags & TTU_RMAP_LOCKED,
.anon_lock = folio_lock_anon_vma_read,
};
@@ -2607,10 +2606,7 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
if (!folio_test_ksm(folio) && folio_test_anon(folio))
rwc.invalid_vma = invalid_migration_vma;
- if (flags & TTU_RMAP_LOCKED)
- rmap_walk_locked(folio, &rwc);
- else
- rmap_walk(folio, &rwc);
+ rmap_walk(folio, &rwc);
}
#ifdef CONFIG_DEVICE_PRIVATE
@@ -2795,17 +2791,16 @@ static struct anon_vma *rmap_walk_anon_lock(const struct folio *folio,
* rmap method
* @folio: the folio to be handled
* @rwc: control variable according to each walk type
- * @locked: caller holds relevant rmap lock
*
* Find all the mappings of a folio using the mapping pointer and the vma
* chains contained in the anon_vma struct it points to.
*/
-static void rmap_walk_anon(struct folio *folio,
- struct rmap_walk_control *rwc, bool locked)
+static void rmap_walk_anon(struct folio *folio, struct rmap_walk_control *rwc)
{
struct anon_vma *anon_vma;
pgoff_t pgoff_start, pgoff_end;
struct anon_vma_chain *avc;
+ bool locked = rwc->locked;
if (locked) {
anon_vma = folio_anon_vma(folio);
@@ -2908,14 +2903,14 @@ static void __rmap_walk_file(struct folio *folio, struct address_space *mapping,
* rmap_walk_file - do something to file page using the object-based rmap method
* @folio: the folio to be handled
* @rwc: control variable according to each walk type
- * @locked: caller holds relevant rmap lock
*
* Find all the mappings of a folio using the mapping pointer and the vma chains
* contained in the address_space struct it points to.
*/
-static void rmap_walk_file(struct folio *folio,
- struct rmap_walk_control *rwc, bool locked)
+static void rmap_walk_file(struct folio *folio, struct rmap_walk_control *rwc)
{
+ bool locked = rwc->locked;
+
/*
* The folio lock not only makes sure that folio->mapping cannot
* suddenly be NULLified by truncation, it makes sure that the structure
@@ -2933,23 +2928,17 @@ static void rmap_walk_file(struct folio *folio,
void rmap_walk(struct folio *folio, struct rmap_walk_control *rwc)
{
+ /* no ksm support for now if locked */
+ VM_BUG_ON_FOLIO(rwc->locked && folio_test_ksm(folio), folio);
+ /* if already locked, why try lock again? */
+ VM_BUG_ON(rwc->locked && rwc->try_lock);
+
if (unlikely(folio_test_ksm(folio)))
rmap_walk_ksm(folio, rwc);
else if (folio_test_anon(folio))
- rmap_walk_anon(folio, rwc, false);
- else
- rmap_walk_file(folio, rwc, false);
-}
-
-/* Like rmap_walk, but caller holds relevant rmap lock */
-void rmap_walk_locked(struct folio *folio, struct rmap_walk_control *rwc)
-{
- /* no ksm support for now */
- VM_BUG_ON_FOLIO(folio_test_ksm(folio), folio);
- if (folio_test_anon(folio))
- rmap_walk_anon(folio, rwc, true);
+ rmap_walk_anon(folio, rwc);
else
- rmap_walk_file(folio, rwc, true);
+ rmap_walk_file(folio, rwc);
}
#ifdef CONFIG_HUGETLB_PAGE
--
2.34.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [RFC PATCH 4/9] mm/rmap: add args in rmap_walk_control done hook
2025-07-24 8:44 [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type Huan Yang
` (2 preceding siblings ...)
2025-07-24 8:44 ` [RFC PATCH 3/9] mm/rmap: simplify rmap_walk invoke Huan Yang
@ 2025-07-24 8:44 ` Huan Yang
2025-07-24 8:44 ` [RFC PATCH 5/9] mm/rmap: introduce exit hook Huan Yang
` (6 subsequent siblings)
10 siblings, 0 replies; 25+ messages in thread
From: Huan Yang @ 2025-07-24 8:44 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple,
Matthew Wilcox (Oracle), Huan Yang, Christian Brauner, Usama Arif,
Yu Zhao, Baolin Wang, linux-mm, linux-kernel
In the done hook that determines whether rmap_walk can terminate the
traversal early, we may need to read parameters from rmap_walk_control
to assist in this decision.
This patch adds the rmap_walk_control as a parameter to the done hook.
Signed-off-by: Huan Yang <link@vivo.com>
---
include/linux/rmap.h | 2 +-
mm/ksm.c | 2 +-
mm/rmap.c | 6 +++---
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index f0d17c971a20..a305811d6310 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -1013,7 +1013,7 @@ struct rmap_walk_control {
*/
bool (*rmap_one)(struct folio *folio, struct vm_area_struct *vma,
unsigned long addr, void *arg);
- int (*done)(struct folio *folio);
+ int (*done)(struct folio *folio, struct rmap_walk_control *rwc);
struct anon_vma *(*anon_lock)(const struct folio *folio,
struct rmap_walk_control *rwc);
bool (*invalid_vma)(struct vm_area_struct *vma, void *arg);
diff --git a/mm/ksm.c b/mm/ksm.c
index ef73b25fd65a..635fe402af91 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -3072,7 +3072,7 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
anon_vma_unlock_read(anon_vma);
return;
}
- if (rwc->done && rwc->done(folio)) {
+ if (rwc->done && rwc->done(folio, rwc)) {
anon_vma_unlock_read(anon_vma);
return;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index bae9f79c7dc9..e590f4071eca 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2232,7 +2232,7 @@ static bool invalid_migration_vma(struct vm_area_struct *vma, void *arg)
return vma_is_temporary_stack(vma);
}
-static int folio_not_mapped(struct folio *folio)
+static int folio_not_mapped(struct folio *folio, struct rmap_walk_control *rwc)
{
return !folio_mapped(folio);
}
@@ -2828,7 +2828,7 @@ static void rmap_walk_anon(struct folio *folio, struct rmap_walk_control *rwc)
if (!rwc->rmap_one(folio, vma, address, rwc->arg))
break;
- if (rwc->done && rwc->done(folio))
+ if (rwc->done && rwc->done(folio, rwc))
break;
}
@@ -2891,7 +2891,7 @@ static void __rmap_walk_file(struct folio *folio, struct address_space *mapping,
if (!rwc->rmap_one(folio, vma, address, rwc->arg))
goto done;
- if (rwc->done && rwc->done(folio))
+ if (rwc->done && rwc->done(folio, rwc))
goto done;
}
done:
--
2.34.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [RFC PATCH 5/9] mm/rmap: introduce exit hook
2025-07-24 8:44 [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type Huan Yang
` (3 preceding siblings ...)
2025-07-24 8:44 ` [RFC PATCH 4/9] mm/rmap: add args in rmap_walk_control done hook Huan Yang
@ 2025-07-24 8:44 ` Huan Yang
2025-07-24 8:44 ` [RFC PATCH 6/9] mm/rmap: introduce migrate_walk_arg Huan Yang
` (5 subsequent siblings)
10 siblings, 0 replies; 25+ messages in thread
From: Huan Yang @ 2025-07-24 8:44 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple,
Matthew Wilcox (Oracle), Huan Yang, Christian Brauner, Usama Arif,
Yu Zhao, Baolin Wang, linux-mm, linux-kernel
When approaching the end of an rmap_walk traversal, we may need to
perform some resource statistics or cleanup work, and these operations
need to be done under lock.
This patch adds a new exit hook to rmap_walk_control.
Note that currently, the exit hook is only used in the anon and file
rmap_walk implementations, as ksm does not need it.
Signed-off-by: Huan Yang <link@vivo.com>
---
include/linux/rmap.h | 2 ++
mm/rmap.c | 6 ++++++
2 files changed, 8 insertions(+)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a305811d6310..594dc4df3bfb 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -999,6 +999,7 @@ void remove_migration_ptes(struct folio *src, struct folio *dst, int flags);
* locked: already locked before invoke rmap_walk
* rmap_one: executed on each vma where page is mapped
* done: for checking traversing termination condition
+ * exit: do some clean work below lock before leave rmap_walk
* anon_lock: for getting anon_lock by optimized way rather than default
* invalid_vma: for skipping uninterested vma
*/
@@ -1014,6 +1015,7 @@ struct rmap_walk_control {
bool (*rmap_one)(struct folio *folio, struct vm_area_struct *vma,
unsigned long addr, void *arg);
int (*done)(struct folio *folio, struct rmap_walk_control *rwc);
+ void (*exit)(struct folio *folio, struct rmap_walk_control *rwc);
struct anon_vma *(*anon_lock)(const struct folio *folio,
struct rmap_walk_control *rwc);
bool (*invalid_vma)(struct vm_area_struct *vma, void *arg);
diff --git a/mm/rmap.c b/mm/rmap.c
index e590f4071eca..66b48ab192f5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2832,6 +2832,9 @@ static void rmap_walk_anon(struct folio *folio, struct rmap_walk_control *rwc)
break;
}
+ if (rwc->exit)
+ rwc->exit(folio, rwc);
+
if (!locked)
anon_vma_unlock_read(anon_vma);
}
@@ -2895,6 +2898,9 @@ static void __rmap_walk_file(struct folio *folio, struct address_space *mapping,
goto done;
}
done:
+ if (rwc->exit)
+ rwc->exit(folio, rwc);
+
if (!locked)
i_mmap_unlock_read(mapping);
}
--
2.34.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [RFC PATCH 6/9] mm/rmap: introduce migrate_walk_arg
2025-07-24 8:44 [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type Huan Yang
` (4 preceding siblings ...)
2025-07-24 8:44 ` [RFC PATCH 5/9] mm/rmap: introduce exit hook Huan Yang
@ 2025-07-24 8:44 ` Huan Yang
2025-07-24 8:44 ` [RFC PATCH 7/9] mm/migrate: rename rmap_walk_arg folio Huan Yang
` (4 subsequent siblings)
10 siblings, 0 replies; 25+ messages in thread
From: Huan Yang @ 2025-07-24 8:44 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple,
Matthew Wilcox (Oracle), Huan Yang, Christian Brauner, Usama Arif,
Yu Zhao, Baolin Wang, linux-mm, linux-kernel, Kirill A. Shutemov
In try_to_migrate, rmap_one as well as the done and exit hooks may
require more information from try_to_migrate to assist in performing
migration-related operations.
This patch introduces a new migrate_walk_arg structure to serve as
the arg parameter for rmap_walk_control in try_to_migrate.
Signed-off-by: Huan Yang <link@vivo.com>
---
mm/rmap.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 66b48ab192f5..2433e12c131d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2261,6 +2261,10 @@ void try_to_unmap(struct folio *folio, enum ttu_flags flags)
rmap_walk(folio, &rwc);
}
+struct migrate_walk_arg {
+ enum ttu_flags flags;
+};
+
/*
* @arg: enum ttu_flags will be passed to this argument.
*
@@ -2276,7 +2280,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
pte_t pteval;
struct page *subpage;
struct mmu_notifier_range range;
- enum ttu_flags flags = (enum ttu_flags)(long)arg;
+ struct migrate_walk_arg *mwa = (struct migrate_walk_arg *)arg;
+ enum ttu_flags flags = mwa->flags;
unsigned long pfn;
unsigned long hsz = 0;
@@ -2575,9 +2580,13 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
*/
void try_to_migrate(struct folio *folio, enum ttu_flags flags)
{
+ struct migrate_walk_arg arg = {
+ .flags = flags,
+ };
+
struct rmap_walk_control rwc = {
.rmap_one = try_to_migrate_one,
- .arg = (void *)flags,
+ .arg = (void *)&arg,
.done = folio_not_mapped,
.locked = flags & TTU_RMAP_LOCKED,
.anon_lock = folio_lock_anon_vma_read,
--
2.34.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [RFC PATCH 7/9] mm/migrate: rename rmap_walk_arg folio
2025-07-24 8:44 [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type Huan Yang
` (5 preceding siblings ...)
2025-07-24 8:44 ` [RFC PATCH 6/9] mm/rmap: introduce migrate_walk_arg Huan Yang
@ 2025-07-24 8:44 ` Huan Yang
2025-07-24 8:44 ` [RFC PATCH 8/9] mm/migrate: infrastructure for migrate entry page_type Huan Yang
` (3 subsequent siblings)
10 siblings, 0 replies; 25+ messages in thread
From: Huan Yang @ 2025-07-24 8:44 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple,
Matthew Wilcox (Oracle), Huan Yang, Christian Brauner, Usama Arif,
Yu Zhao, Baolin Wang, linux-mm, linux-kernel
The current naming of "folio" in rmap_walk_arg does not clearly reflect
the actual role of this parameter. This patch renames it to "src" to
distinguish it from the "folio" parameter in remove_migration_pte.
No functional change.
Signed-off-by: Huan Yang <link@vivo.com>
---
mm/migrate.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index a5a49af7857a..a5ea8fba2997 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -230,7 +230,7 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
}
struct rmap_walk_arg {
- struct folio *folio;
+ struct folio *src;
bool map_unused_to_zeropage;
};
@@ -241,7 +241,7 @@ static bool remove_migration_pte(struct folio *folio,
struct vm_area_struct *vma, unsigned long addr, void *arg)
{
struct rmap_walk_arg *rmap_walk_arg = arg;
- DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
+ DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->src, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
while (page_vma_mapped_walk(&pvmw)) {
rmap_t rmap_flags = RMAP_NONE;
@@ -349,7 +349,7 @@ static bool remove_migration_pte(struct folio *folio,
void remove_migration_ptes(struct folio *src, struct folio *dst, int flags)
{
struct rmap_walk_arg rmap_walk_arg = {
- .folio = src,
+ .src = src,
.map_unused_to_zeropage = flags & RMP_USE_SHARED_ZEROPAGE,
};
--
2.34.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [RFC PATCH 8/9] mm/migrate: infrastructure for migrate entry page_type.
2025-07-24 8:44 [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type Huan Yang
` (6 preceding siblings ...)
2025-07-24 8:44 ` [RFC PATCH 7/9] mm/migrate: rename rmap_walk_arg folio Huan Yang
@ 2025-07-24 8:44 ` Huan Yang
2025-07-24 8:44 ` [RFC PATCH 9/9] mm/migrate: apply " Huan Yang
` (2 subsequent siblings)
10 siblings, 0 replies; 25+ messages in thread
From: Huan Yang @ 2025-07-24 8:44 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple,
Matthew Wilcox (Oracle), Huan Yang, Christian Brauner, Usama Arif,
Yu Zhao, Baolin Wang, linux-mm, linux-kernel
When try_to_migrate removes all of a page's PTEs and replaces them with
migrate PTE entries, we can determine whether the traversal of the
remaining VMAs can be terminated early by checking whether the mapcount
is zero. This optimization helps improve performance during migration.
However, when removing migrate PTE entries and setting up PTEs for the
destination folio in remove_migration_ptes, there is no such information
available to assist in deciding whether the traversal of the remaining
VMAs can be ended early. Therefore, it is necessary to traverse all VMAs
associated with this folio.
In reality, when a folio is fully unmapped and before all migrate PTE
entries are removed, the mapcount will always be zero. Since page_type
and mapcount share a union, and as folio_mapcount shows, we can reuse
page_type to record the number of migrate PTE entries of the current
folio in the system, as long as it is not a large folio. This reuse does
not affect calls to folio_mapcount, which will still always return zero.
Therefore, we can set the folio's page_type to PGTY_mgt_entry when
try_to_migrate completes, the folio is already unmapped, and it's not a
large folio. The remaining 24 bits can then be used to record the number
of migrate PTE entries generated by try_to_migrate.
Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
zero, we can terminate the VMA traversal early.
This patch only completes the infrastructure setup for PGTY_mgt_entry,
with no actual usage implemented; the next patch will apply it.
Signed-off-by: Huan Yang <link@vivo.com>
---
include/linux/page-flags.h | 72 ++++++++++++++++++++++++++++++++++++++
1 file changed, 72 insertions(+)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 52c9435079d5..d8e80d5ae1f8 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -957,6 +957,7 @@ enum pagetype {
PGTY_zsmalloc = 0xf6,
PGTY_unaccepted = 0xf7,
PGTY_large_kmalloc = 0xf8,
+ PGTY_mgt_entry = 0xf9,
PGTY_mapcount_underflow = 0xff
};
@@ -1125,6 +1126,77 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
PAGE_TYPE_OPS(Unaccepted, unaccepted, unaccepted)
FOLIO_TYPE_OPS(large_kmalloc, large_kmalloc)
+
+PAGE_TYPE_OPS(Mentry, mgt_entry, mgt_entry)
+
+#define PAGE_MGT_ENTRY_TYPE_MAX PAGE_TYPE_MASK
+
+static inline void folio_remove_mgte(struct folio *folio)
+{
+ if (!folio_test_mgt_entry(folio))
+ return;
+
+ __folio_clear_mgt_entry(folio);
+}
+
+static inline void folio_init_mgte(struct folio *folio, int nr_mgt_entry)
+{
+ // PAGE_TYPE value currently do not support large folio.
+ VM_BUG_ON(folio_test_large(folio));
+ VM_BUG_ON(folio_test_mgt_entry(folio));
+
+ if (unlikely(nr_mgt_entry > PAGE_MGT_ENTRY_TYPE_MAX))
+ return;
+
+ __folio_set_mgt_entry(folio);
+ __PageSetMentryValue(&folio->page, nr_mgt_entry);
+}
+
+static inline int folio_get_mgte_count(struct folio *folio)
+{
+ if (!folio_test_mgt_entry(folio))
+ return 0;
+
+ VM_BUG_ON(folio_test_large(folio));
+
+ return __PageGetMentryValue(&folio->page);
+}
+
+static inline void folio_dec_mgte_count(struct folio *folio)
+{
+ unsigned int nr_mgte, old;
+
+ do {
+ old = READ_ONCE(folio->page.page_type);
+
+ if ((old >> PAGE_TYPE_SHIFT) != PGTY_mgt_entry)
+ return;
+
+ nr_mgte = old & PAGE_TYPE_MASK;
+ BUG_ON(nr_mgte == 0);
+ } while (cmpxchg(&folio->page.page_type, old, old - 1) != old);
+}
+
+static inline void folio_inc_mgte_count(struct folio *folio)
+{
+ unsigned int nr_mgte, old;
+
+ do {
+ old = READ_ONCE(folio->page.page_type);
+
+ if ((old >> PAGE_TYPE_SHIFT) != PGTY_mgt_entry)
+ return;
+
+ nr_mgte = old & PAGE_TYPE_MASK;
+
+ if (unlikely(nr_mgte == PAGE_MGT_ENTRY_TYPE_MAX)) {
+ // overflow, can't reuse PAGE_TYPE here.
+ folio_remove_mgte(folio);
+ break;
+ }
+ } while (cmpxchg(&folio->page.page_type, old, old + 1) != old);
+}
+
/**
* PageHuge - Determine if the page belongs to hugetlbfs
* @page: The page to test.
--
2.34.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [RFC PATCH 9/9] mm/migrate: apply migrate entry page_type
2025-07-24 8:44 [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type Huan Yang
` (7 preceding siblings ...)
2025-07-24 8:44 ` [RFC PATCH 8/9] mm/migrate: infrastructure for migrate entry page_type Huan Yang
@ 2025-07-24 8:44 ` Huan Yang
2025-07-24 8:59 ` [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type David Hildenbrand
2025-07-24 9:15 ` Lorenzo Stoakes
10 siblings, 0 replies; 25+ messages in thread
From: Huan Yang @ 2025-07-24 8:44 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple,
Matthew Wilcox (Oracle), Huan Yang, Christian Brauner, Usama Arif,
Yu Zhao, Baolin Wang, linux-mm, linux-kernel, Kirill A. Shutemov
Cc: Qianfeng Rong
When a single-page folio is already unmapped, we can reuse the page_type
field by setting it to PGTY_mgt_entry. This indicates that the folio is
in a critical state in which no page table entries in the system map this
folio's PFN; migrate entries are used instead.
The lower 24 bits of this field can be utilized to store the count of
migrate entries during try_to_migrate.
It's important to note that we must initialize the folio's page_type to
PGTY_mgt_entry and set the migrate entry count only while holding the
rmap walk lock. This is because, while the lock is held, we can prevent a
new VMA from being added by fork (which would increase migrate entries)
and a VMA from being unmapped (which would decrease migrate entries).
After a folio exits try_to_migrate and before remove_migration_ptes
acquires the rmap lock, the system can perform normal fork and unmap
operations. Therefore, we need to increment or decrement the migrate
entry count recorded in the folio (if it's of type PGTY_mgt_entry) when
handling copy/zap_nonpresent_pte.
When performing remove_migration_ptes during migration to start removing
migrate entries, we need to dynamically decrement the recorded migrate
entry count. Once this count reaches zero, it indicates there are no
remaining migrate entries in the associated VMAs that need to be cleared
and replaced with the destination PFN. This allows us to safely
terminate the VMA traversal early.
However, it's important to note that if issues occur during migration
and an undo operation is required, PGTY_mgt_entry can no longer be used.
This is because the dst needs to be set back to the src, and the presence
of PGTY_mgt_entry would interfere with the normal usage of mapcount when
setting up the rmap information.
Signed-off-by: Huan Yang <link@vivo.com>
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
---
mm/memory.c | 2 ++
mm/migrate.c | 28 +++++++++++++++++++++++++++-
mm/rmap.c | 17 +++++++++++++++++
3 files changed, 46 insertions(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index b4a7695b1e31..f9d71b118c11 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -861,6 +861,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte = pte_swp_mkuffd_wp(pte);
set_pte_at(src_mm, addr, src_pte, pte);
}
+ folio_inc_mgte_count(folio);
} else if (is_device_private_entry(entry)) {
page = pfn_swap_entry_to_page(entry);
folio = page_folio(page);
@@ -1651,6 +1652,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
if (!should_zap_folio(details, folio))
return 1;
rss[mm_counter(folio)]--;
+ folio_dec_mgte_count(folio);
} else if (pte_marker_entry_uffd_wp(entry)) {
/*
* For anon: always drop the marker; for file: only
diff --git a/mm/migrate.c b/mm/migrate.c
index a5ea8fba2997..fc2fac1559bd 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -241,7 +241,8 @@ static bool remove_migration_pte(struct folio *folio,
struct vm_area_struct *vma, unsigned long addr, void *arg)
{
struct rmap_walk_arg *rmap_walk_arg = arg;
- DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->src, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
+ struct folio *src = rmap_walk_arg->src;
+ DEFINE_FOLIO_VMA_WALK(pvmw, src, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
while (page_vma_mapped_walk(&pvmw)) {
rmap_t rmap_flags = RMAP_NONE;
@@ -334,6 +335,7 @@ static bool remove_migration_pte(struct folio *folio,
trace_remove_migration_pte(pvmw.address, pte_val(pte),
compound_order(new));
+ folio_dec_mgte_count(src);
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, pvmw.address, pvmw.pte);
@@ -342,12 +344,27 @@ static bool remove_migration_pte(struct folio *folio,
return true;
}
+static int folio_removed_all_migrate_entry(struct folio *folio,
+ struct rmap_walk_control *rwc)
+{
+ struct rmap_walk_arg *arg = (struct rmap_walk_arg *)rwc->arg;
+ struct folio *src = arg->src;
+
+ VM_BUG_ON(!folio_test_mgt_entry(src));
+
+ if (!folio_get_mgte_count(src))
+ return true;
+ return false;
+}
+
/*
* Get rid of all migration entries and replace them by
* references to the indicated page.
*/
void remove_migration_ptes(struct folio *src, struct folio *dst, int flags)
{
+ bool undo = src == dst;
+
struct rmap_walk_arg rmap_walk_arg = {
.src = src,
.map_unused_to_zeropage = flags & RMP_USE_SHARED_ZEROPAGE,
@@ -356,12 +373,21 @@ void remove_migration_ptes(struct folio *src, struct folio *dst, int flags)
struct rmap_walk_control rwc = {
.rmap_one = remove_migration_pte,
.locked = flags & RMP_LOCKED,
+ .done = !undo && folio_test_mgt_entry(src) ?
+ folio_removed_all_migrate_entry :
+ NULL,
.arg = &rmap_walk_arg,
};
VM_BUG_ON_FOLIO((flags & RMP_USE_SHARED_ZEROPAGE) && (src != dst), src);
+ if (undo)
+ folio_remove_mgte(src);
+
rmap_walk(dst, &rwc);
+
+ if (!undo)
+ folio_remove_mgte(src);
}
/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 2433e12c131d..f3911491b2d9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2263,6 +2263,7 @@ void try_to_unmap(struct folio *folio, enum ttu_flags flags)
struct migrate_walk_arg {
enum ttu_flags flags;
+ unsigned int nr_migrate_entry;
};
/*
@@ -2282,6 +2283,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
struct mmu_notifier_range range;
struct migrate_walk_arg *mwa = (struct migrate_walk_arg *)arg;
enum ttu_flags flags = mwa->flags;
+ unsigned int nr_migrate_entry = 0;
unsigned long pfn;
unsigned long hsz = 0;
@@ -2548,6 +2550,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
hsz);
else
set_pte_at(mm, address, pvmw.pte, swp_pte);
+ nr_migrate_entry++;
trace_set_migration_pte(address, pte_val(swp_pte),
folio_order(folio));
/*
@@ -2565,11 +2568,24 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
folio_put(folio);
}
+ mwa->nr_migrate_entry += nr_migrate_entry;
+
mmu_notifier_invalidate_range_end(&range);
return ret;
}
+static void folio_set_migrate_entry_type(struct folio *folio,
+ struct rmap_walk_control *rwc)
+{
+ struct migrate_walk_arg *mwa = (struct migrate_walk_arg *)rwc->arg;
+ unsigned int nr_migrate_entry = mwa->nr_migrate_entry;
+
+ if (nr_migrate_entry && !folio_test_large(folio) &&
+ !folio_mapped(folio))
+ folio_init_mgte(folio, nr_migrate_entry);
+}
+
/**
* try_to_migrate - try to replace all page table mappings with swap entries
* @folio: the folio to replace page table entries for
@@ -2588,6 +2604,7 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
.rmap_one = try_to_migrate_one,
.arg = (void *)&arg,
.done = folio_not_mapped,
+ .exit = folio_set_migrate_entry_type,
.locked = flags & TTU_RMAP_LOCKED,
.anon_lock = folio_lock_anon_vma_read,
};
--
2.34.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type
2025-07-24 8:44 [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type Huan Yang
` (8 preceding siblings ...)
2025-07-24 8:44 ` [RFC PATCH 9/9] mm/migrate: apply " Huan Yang
@ 2025-07-24 8:59 ` David Hildenbrand
2025-07-24 9:09 ` Huan Yang
2025-07-24 9:15 ` Lorenzo Stoakes
10 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2025-07-24 8:59 UTC (permalink / raw)
To: Huan Yang, Andrew Morton, Lorenzo Stoakes, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple,
Matthew Wilcox (Oracle), Christian Brauner, Usama Arif, Yu Zhao,
Baolin Wang, linux-mm, linux-kernel
On 24.07.25 10:44, Huan Yang wrote:
> Summary
> ==
> This patchset reuses page_type to store migrate entry count during the
> period from migrate entry setup to removal, enabling accelerated VMA
> traversal when removing migrate entries, following a similar principle to
> early termination when folio is unmapped in try_to_migrate.
I absolutely detest (ab)using page types for that, so no from my side
unless I am missing something important.
>
> In my self-constructed test scenario, the migration time can be reduced
How relevant is that in practice?
> from over 150+ms to around 30+ms, achieving nearly a 70% performance
> improvement. Additionally, the flame graph shows that the proportion of
> remove_migration_ptes can be reduced from 80%+ to 60%+.
>
> Notice: migrate entry specifically refers to migrate PTE entry, as large
> folio are not supported page type and 0 mapcount reuse.
>
> Principle
> ==
> When a page removes all PTEs in try_to_migrate and sets up a migrate PTE
> entry, we can determine whether the traversal of remaining VMAs can be
> terminated early by checking if mapcount is zero. This optimization
> helps improve performance during migration.
>
> However, when removing migrate PTE entries and setting up PTEs for the
> destination folio in remove_migration_ptes, there is no such information
> available to assist in deciding whether the traversal of remaining VMAs
> can be ended early. Therefore, it is necessary to traversal all VMAs
> associated with this folio.
Yes, we don't know how many migration entries are still pointing at the
page.
>
> In reality, when a folio is fully unmapped and before all migrate PTE
> entries are removed, the mapcount will always be zero. Since page_type
> and mapcount share a union, and referring to folio_mapcount, we can
> reuse page_type to record the number of migrate PTE entries of the
> current folio in the system as long as it's not a large folio. This
> reuse does not affect calls to folio_mapcount, which will always return
> zero.
>
> Therefore, we can set the folio's page_type to PGTY_mgt_entry when
> try_to_migrate completes, the folio is already unmapped, and it's not a
> large folio. The remaining 24 bits can then be used to record the number
> of migrate PTE entries generated by try_to_migrate.
In the future the page type will no longer overlay the mapcount and,
consequently, be sticky.
>
> Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
> zero, we can terminate the VMA traversal early.
>
> It's important to note that we need to initialize the folio's page_type
> to PGTY_mgt_entry and set the migrate entry count only while holding the
> rmap walk lock.This is because during the lock period, we can prevent
> new VMA fork (which would increase migrate entries) and VMA unmap
> (which would decrease migrate entries).
The more I read about PGTY_mgt_entry, the more I hate it.
>
> However, I doubt there is actually an additional critical section here, for
> example anon:
>
> Process Parent fork
> try_to_migrate
> anon_vma_clone
> write_lock
> avc_inster_tree tail
> ....
> folio_lock_anon_vma_read copy_pte_range
> vma_iter pte_lock
> .... pte_present copy
> ...
> pte_lock
> new forked pte clean
> ....
> remove_migration_ptes
> rmap_walk_anon_lock
>
> If my understanding is correct and such a critical section exists, it
> shouldn't cause any issues—newly added PTEs can still be properly
> removed and converted into migrate entries.
>
> But in this:
>
> Process Parent fork
> try_to_migrate
> anon_vma_clone
> write_lock
> avc_inster_tree
> ....
> folio_lock_anon_vma_read copy_pte_range
> vma_iter
> pte_lock
> migrate entry set
> .... pte_lock
> pte_nonpresent copy
> ....
> ....
> remove_migration_ptes
> rmap_walk_anon_lock
Just a note: migration entries also apply to non-anon folios.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type
2025-07-24 8:59 ` [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type David Hildenbrand
@ 2025-07-24 9:09 ` Huan Yang
2025-07-24 9:12 ` David Hildenbrand
0 siblings, 1 reply; 25+ messages in thread
From: Huan Yang @ 2025-07-24 9:09 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Lorenzo Stoakes, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple,
Matthew Wilcox (Oracle), Christian Brauner, Usama Arif, Yu Zhao,
Baolin Wang, linux-mm, linux-kernel
On 2025/7/24 16:59, David Hildenbrand wrote:
> On 24.07.25 10:44, Huan Yang wrote:
>> Summary
>> ==
>> This patchset reuses page_type to store migrate entry count during the
>> period from migrate entry setup to removal, enabling accelerated VMA
>> traversal when removing migrate entries, following a similar
>> principle to
>> early termination when folio is unmapped in try_to_migrate.
>
> I absolutely detest (ab)using page types for that, so no from my side
> unless I am missing something important.
>
>>
>> In my self-constructed test scenario, the migration time can be reduced
>
> How relevant is that in practice?
IMO, any folio whose mapcount is smaller than the number of VMAs in its
mapping (anon_vma or address_space) will benefit from this.
So, all pages that have been COW-ed by child processes can be skipped.
>
>> from over 150+ms to around 30+ms, achieving nearly a 70% performance
>> improvement. Additionally, the flame graph shows that the proportion of
>> remove_migration_ptes can be reduced from 80%+ to 60%+.
>>
>> Notice: migrate entry specifically refers to migrate PTE entry, as large
>> folio are not supported page type and 0 mapcount reuse.
>>
>> Principle
>> ==
>> When a page removes all PTEs in try_to_migrate and sets up a migrate PTE
>> entry, we can determine whether the traversal of remaining VMAs can be
>> terminated early by checking if mapcount is zero. This optimization
>> helps improve performance during migration.
>>
>> However, when removing migrate PTE entries and setting up PTEs for the
>> destination folio in remove_migration_ptes, there is no such information
>> available to assist in deciding whether the traversal of remaining VMAs
>> can be ended early. Therefore, it is necessary to traversal all VMAs
>> associated with this folio.
>
> Yes, we don't know how many migration entries are still pointing at
> the page.
>
>>
>> In reality, when a folio is fully unmapped and before all migrate PTE
>> entries are removed, the mapcount will always be zero. Since page_type
>> and mapcount share a union, and referring to folio_mapcount, we can
>> reuse page_type to record the number of migrate PTE entries of the
>> current folio in the system as long as it's not a large folio. This
>> reuse does not affect calls to folio_mapcount, which will always return
>> zero.
>>
>> Therefore, we can set the folio's page_type to PGTY_mgt_entry when
>> try_to_migrate completes, the folio is already unmapped, and it's not a
>> large folio. The remaining 24 bits can then be used to record the number
>> of migrate PTE entries generated by try_to_migrate.
>
> In the future the page type will no longer overlay the mapcount and,
> consequently, be sticky.
>
>>
>> Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
>> zero, we can terminate the VMA traversal early.
>>
>> It's important to note that we need to initialize the folio's page_type
>> to PGTY_mgt_entry and set the migrate entry count only while holding the
>> rmap walk lock.This is because during the lock period, we can prevent
>> new VMA fork (which would increase migrate entries) and VMA unmap
>> (which would decrease migrate entries).
>
> The more I read about PGTY_mgt_entry, the more I hate it.
>
>>
>> However, I doubt there is actually an additional critical section
>> here, for
>> example anon:
>>
>> Process Parent fork
>> try_to_migrate
>> anon_vma_clone
>> write_lock
>> avc_inster_tree tail
>> ....
>> folio_lock_anon_vma_read copy_pte_range
>> vma_iter pte_lock
>> .... pte_present copy
>> ...
>> pte_lock
>> new forked pte clean
>> ....
>> remove_migration_ptes
>> rmap_walk_anon_lock
>>
>> If my understanding is correct and such a critical section exists, it
>> shouldn't cause any issues—newly added PTEs can still be properly
>> removed and converted into migrate entries.
>>
>> But in this:
>>
>> Process Parent fork
>> try_to_migrate
>> anon_vma_clone
>> write_lock
>> avc_inster_tree
>> ....
>> folio_lock_anon_vma_read copy_pte_range
>> vma_iter
>> pte_lock
>> migrate entry set
>> .... pte_lock
>> pte_nonpresent copy
>> ....
>> ....
>> remove_migration_ptes
>> rmap_walk_anon_lock
>
> Just a note: migration entries also apply to non-anon folios.
Yes, this is just an example.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type
2025-07-24 9:09 ` Huan Yang
@ 2025-07-24 9:12 ` David Hildenbrand
2025-07-24 9:20 ` David Hildenbrand
0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2025-07-24 9:12 UTC (permalink / raw)
To: Huan Yang, Andrew Morton, Lorenzo Stoakes, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple,
Matthew Wilcox (Oracle), Christian Brauner, Usama Arif, Yu Zhao,
Baolin Wang, linux-mm, linux-kernel
On 24.07.25 11:09, Huan Yang wrote:
>
> 在 2025/7/24 16:59, David Hildenbrand 写道:
>> On 24.07.25 10:44, Huan Yang wrote:
>>> Summary
>>> ==
>>> This patchset reuses page_type to store migrate entry count during the
>>> period from migrate entry setup to removal, enabling accelerated VMA
>>> traversal when removing migrate entries, following a similar
>>> principle to
>>> early termination when folio is unmapped in try_to_migrate.
>>
>> I absolutely detest (ab)using page types for that, so no from my side
>> unless I am missing something important.
>>
>>>
>>> In my self-constructed test scenario, the migration time can be reduced
>>
>> How relevant is that in practice?
>
> IMO, any folio mapped < nr vma in mapping(anon_vma, addresss_space),
> will benefit from this.
>
> So, all pages that have been COW-ed by child processes can be skipped.
For small anon folios, you could use the anon-exclusive marker to derive
"there can only be a single mapping".
It's stored alongside the migration entry.
So once you restored that single migration entry, you can just stop the
walk.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type
2025-07-24 8:44 [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type Huan Yang
` (9 preceding siblings ...)
2025-07-24 8:59 ` [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type David Hildenbrand
@ 2025-07-24 9:15 ` Lorenzo Stoakes
2025-07-24 9:29 ` Huan Yang
10 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24 9:15 UTC (permalink / raw)
To: Huan Yang
Cc: Andrew Morton, David Hildenbrand, Rik van Riel, Liam R. Howlett,
Vlastimil Babka, Harry Yoo, Xu Xin, Chengming Zhou, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Matthew Wilcox (Oracle), Christian Brauner,
Usama Arif, Yu Zhao, Baolin Wang, linux-mm, linux-kernel
NAK. This series is completely un-upstreamable in any form.
David has responded to you already, but to underline.
The lesson here is that you really ought to discuss things with people in
the subsystem you are changing in advance of spending a lot of time doing
work like this which you intend to upstream.
On Thu, Jul 24, 2025 at 04:44:28PM +0800, Huan Yang wrote:
> Summary
> ==
> This patchset reuses page_type to store migrate entry count during the
> period from migrate entry setup to removal, enabling accelerated VMA
> traversal when removing migrate entries, following a similar principle to
> early termination when folio is unmapped in try_to_migrate.
>
> In my self-constructed test scenario, the migration time can be reduced
> from over 150+ms to around 30+ms, achieving nearly a 70% performance
> improvement. Additionally, the flame graph shows that the proportion of
> remove_migration_ptes can be reduced from 80%+ to 60%+.
This sounds completely contrived. I don't even know if you have a use case
here.
>
> Notice: migrate entry specifically refers to migrate PTE entry, as large
> folio are not supported page type and 0 mapcount reuse.
>
> Principle
> ==
> When a page removes all PTEs in try_to_migrate and sets up a migrate PTE
> entry, we can determine whether the traversal of remaining VMAs can be
> terminated early by checking if mapcount is zero. This optimization
> helps improve performance during migration.
>
> However, when removing migrate PTE entries and setting up PTEs for the
> destination folio in remove_migration_ptes, there is no such information
> available to assist in deciding whether the traversal of remaining VMAs
> can be ended early. Therefore, it is necessary to traversal all VMAs
> associated with this folio.
>
> In reality, when a folio is fully unmapped and before all migrate PTE
> entries are removed, the mapcount will always be zero. Since page_type
> and mapcount share a union, and referring to folio_mapcount, we can
> reuse page_type to record the number of migrate PTE entries of the
> current folio in the system as long as it's not a large folio. This
> reuse does not affect calls to folio_mapcount, which will always return
> zero.
OK so - if you ever find yourself thinking this way, please stop. We are in
the midst of fundamentally changing how folios and pages work.
There is absolutely ZERO room for reusing arbitrary fields in this way. Any
series that attempts to do this will be rejected.
Again, I must say - if you had raised this ahead of time we could have
saved you some effort.
>
> Therefore, we can set the folio's page_type to PGTY_mgt_entry when
> try_to_migrate completes, the folio is already unmapped, and it's not a
> large folio. The remaining 24 bits can then be used to record the number
> of migrate PTE entries generated by try_to_migrate.
I mean there's so much wrong here. The future is large folios. Making some
fundamental change that relies on folios not being large is a mistake. 24
bits... I mean no.
>
> Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
> zero, we can terminate the VMA traversal early.
>
> It's important to note that we need to initialize the folio's page_type
> to PGTY_mgt_entry and set the migrate entry count only while holding the
> rmap walk lock.This is because during the lock period, we can prevent
> new VMA fork (which would increase migrate entries) and VMA unmap
> (which would decrease migrate entries).
No, no no. NO.
You are not introducing new locking complexity for this.
I could go on, but there's no point.
This series is not upstreamable, NAK.
* Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type
2025-07-24 9:12 ` David Hildenbrand
@ 2025-07-24 9:20 ` David Hildenbrand
2025-07-24 9:32 ` David Hildenbrand
0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2025-07-24 9:20 UTC (permalink / raw)
To: Huan Yang, Andrew Morton, Lorenzo Stoakes, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple,
Matthew Wilcox (Oracle), Christian Brauner, Usama Arif, Yu Zhao,
Baolin Wang, linux-mm, linux-kernel
On 24.07.25 11:12, David Hildenbrand wrote:
> On 24.07.25 11:09, Huan Yang wrote:
>>
>> On 2025/7/24 16:59, David Hildenbrand wrote:
>>> On 24.07.25 10:44, Huan Yang wrote:
>>>> Summary
>>>> ==
>>>> This patchset reuses page_type to store migrate entry count during the
>>>> period from migrate entry setup to removal, enabling accelerated VMA
>>>> traversal when removing migrate entries, following a similar
>>>> principle to
>>>> early termination when folio is unmapped in try_to_migrate.
>>>
>>> I absolutely detest (ab)using page types for that, so no from my side
>>> unless I am missing something important.
>>>
>>>>
>>>> In my self-constructed test scenario, the migration time can be reduced
>>>
>>> How relevant is that in practice?
>>
>> IMO, any folio mapped < nr vma in mapping(anon_vma, addresss_space),
>> will benefit from this.
>>
>> So, all pages that have been COW-ed by child processes can be skipped.
>
> For small anon folios, you could use the anon-exclusive marker to derive
> "there can only be a single mapping".
>
> It's stored alongside the migration entry.
>
> So once you restored that single migration entry, you can just stop the
> walk.
Essentially, something (untested) like this:
diff --git a/mm/migrate.c b/mm/migrate.c
index 425401b2d4e14..aa5bf96b1daee 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -421,6 +421,15 @@ static bool remove_migration_pte(struct folio *folio,
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, pvmw.address, pvmw.pte);
+
+ /*
+ * If the small anon folio is exclusive, here can be exactly one
+ * page mapping -- the one we just restored.
+ */
+ if (!folio_test_large(folio) && (rmap_flags & RMAP_EXCLUSIVE)) {
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
}
return true;
--
Cheers,
David / dhildenb
* Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type
2025-07-24 9:15 ` Lorenzo Stoakes
@ 2025-07-24 9:29 ` Huan Yang
2025-07-25 1:37 ` Huang, Ying
0 siblings, 1 reply; 25+ messages in thread
From: Huan Yang @ 2025-07-24 9:29 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Rik van Riel, Liam R. Howlett,
Vlastimil Babka, Harry Yoo, Xu Xin, Chengming Zhou, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Matthew Wilcox (Oracle), Christian Brauner,
Usama Arif, Yu Zhao, Baolin Wang, linux-mm, linux-kernel
On 2025/7/24 17:15, Lorenzo Stoakes wrote:
> NAK. This series is completely un-upstreamable in any form.
>
> David has responded to you already, but to underline.
>
> The lesson here is that you really ought to discuss things with people in
> the subsystem you are changing in advance of spending a lot of time doing
> work like this which you intend to upstream.
Yes, this is a very useful lesson. :)
In the future, when I have ideas in this area, I will bring them up for
discussion first, especially when they involve folios or pages.
>
> On Thu, Jul 24, 2025 at 04:44:28PM +0800, Huan Yang wrote:
>> Summary
>> ==
>> This patchset reuses page_type to store migrate entry count during the
>> period from migrate entry setup to removal, enabling accelerated VMA
>> traversal when removing migrate entries, following a similar principle to
>> early termination when folio is unmapped in try_to_migrate.
>>
>> In my self-constructed test scenario, the migration time can be reduced
>> from over 150+ms to around 30+ms, achieving nearly a 70% performance
>> improvement. Additionally, the flame graph shows that the proportion of
>> remove_migration_ptes can be reduced from 80%+ to 60%+.
> This sounds completely contrived. I don't even know if you have a use case
> here.
The test case I provided does have an amplified effect, but the
optimization it demonstrates is real. It's just that when scaled up to
the system level, the effect becomes difficult to observe.
>
>> Notice: migrate entry specifically refers to migrate PTE entry, as large
>> folio are not supported page type and 0 mapcount reuse.
>>
>> Principle
>> ==
>> When a page removes all PTEs in try_to_migrate and sets up a migrate PTE
>> entry, we can determine whether the traversal of remaining VMAs can be
>> terminated early by checking if mapcount is zero. This optimization
>> helps improve performance during migration.
>>
>> However, when removing migrate PTE entries and setting up PTEs for the
>> destination folio in remove_migration_ptes, there is no such information
>> available to assist in deciding whether the traversal of remaining VMAs
>> can be ended early. Therefore, it is necessary to traversal all VMAs
>> associated with this folio.
>>
>> In reality, when a folio is fully unmapped and before all migrate PTE
>> entries are removed, the mapcount will always be zero. Since page_type
>> and mapcount share a union, and referring to folio_mapcount, we can
>> reuse page_type to record the number of migrate PTE entries of the
>> current folio in the system as long as it's not a large folio. This
>> reuse does not affect calls to folio_mapcount, which will always return
>> zero.
> OK so - if you ever find yourself thinking this way, please stop. We are in
> the midst of fundamentally changing how folios and pages work.
>
> There is absolutely ZERO room for reusing arbitrary fields in this way. Any
> series that attempts to do this will be rejected.
>
> Again, I must say - if you had raised this ahead of time we could have
> saved you some effort.
>
>> Therefore, we can set the folio's page_type to PGTY_mgt_entry when
>> try_to_migrate completes, the folio is already unmapped, and it's not a
>> large folio. The remaining 24 bits can then be used to record the number
>> of migrate PTE entries generated by try_to_migrate.
> I mean there's so much wrong here. The future is large folios. Making some
> fundamental change that relies on not-large folio is a mistake. 24
> bits... I mean no.
Thanks, I understand it.
>
>> Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
>> zero, we can terminate the VMA traversal early.
>>
>> It's important to note that we need to initialize the folio's page_type
>> to PGTY_mgt_entry and set the migrate entry count only while holding the
>> rmap walk lock.This is because during the lock period, we can prevent
>> new VMA fork (which would increase migrate entries) and VMA unmap
>> (which would decrease migrate entries).
> No, no no. NO.
>
> You are not introducing new locking complexity for this.
>
> I could go on, but there's no point.
>
> This series is not upstreamable, NAK.
>
* Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type
2025-07-24 9:20 ` David Hildenbrand
@ 2025-07-24 9:32 ` David Hildenbrand
2025-07-24 9:36 ` Huan Yang
0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2025-07-24 9:32 UTC (permalink / raw)
To: Huan Yang, Andrew Morton, Lorenzo Stoakes, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple,
Matthew Wilcox (Oracle), Christian Brauner, Usama Arif, Yu Zhao,
Baolin Wang, linux-mm, linux-kernel
On 24.07.25 11:20, David Hildenbrand wrote:
> On 24.07.25 11:12, David Hildenbrand wrote:
>> On 24.07.25 11:09, Huan Yang wrote:
>>>
>>> On 2025/7/24 16:59, David Hildenbrand wrote:
>>>> On 24.07.25 10:44, Huan Yang wrote:
>>>>> Summary
>>>>> ==
>>>>> This patchset reuses page_type to store migrate entry count during the
>>>>> period from migrate entry setup to removal, enabling accelerated VMA
>>>>> traversal when removing migrate entries, following a similar
>>>>> principle to
>>>>> early termination when folio is unmapped in try_to_migrate.
>>>>
>>>> I absolutely detest (ab)using page types for that, so no from my side
>>>> unless I am missing something important.
>>>>
>>>>>
>>>>> In my self-constructed test scenario, the migration time can be reduced
>>>>
>>>> How relevant is that in practice?
>>>
>>> IMO, any folio mapped < nr vma in mapping(anon_vma, addresss_space),
>>> will benefit from this.
>>>
>>> So, all pages that have been COW-ed by child processes can be skipped.
>>
>> For small anon folios, you could use the anon-exclusive marker to derive
>> "there can only be a single mapping".
>>
>> It's stored alongside the migration entry.
>>
>> So once you restored that single migration entry, you can just stop the
>> walk.
>
> Essentially, something (untested) like this:
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 425401b2d4e14..aa5bf96b1daee 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -421,6 +421,15 @@ static bool remove_migration_pte(struct folio *folio,
>
> /* No need to invalidate - it was non-present before */
> update_mmu_cache(vma, pvmw.address, pvmw.pte);
> +
> + /*
> + * If the small anon folio is exclusive, here can be exactly one
> + * page mapping -- the one we just restored.
> + */
> + if (!folio_test_large(folio) && (rmap_flags & RMAP_EXCLUSIVE)) {
> + page_vma_mapped_walk_done(&pvmw);
> + break;
> + }
> }
>
> return true;
Probably that won't really help, I assume, because __folio_set_anon()
will move the new anon folio under vma->anon_vma, not vma->anon_vma->root.
So I assume you mean that we had a COW-shared folio now mapped only into
some VMAs (some mappings in other processes removed due to CoW or similar).
In that case aborting early can help.
Not in all cases though, just imagine that the very last VMA we're
iterating maps the page. You have to iterate through all of them either
way ... no way around that, really.
--
Cheers,
David / dhildenb
* Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type
2025-07-24 9:32 ` David Hildenbrand
@ 2025-07-24 9:36 ` Huan Yang
2025-07-24 9:45 ` Lorenzo Stoakes
0 siblings, 1 reply; 25+ messages in thread
From: Huan Yang @ 2025-07-24 9:36 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Lorenzo Stoakes, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple,
Matthew Wilcox (Oracle), Christian Brauner, Usama Arif, Yu Zhao,
Baolin Wang, linux-mm, linux-kernel
On 2025/7/24 17:32, David Hildenbrand wrote:
> On 24.07.25 11:20, David Hildenbrand wrote:
>> On 24.07.25 11:12, David Hildenbrand wrote:
>>> On 24.07.25 11:09, Huan Yang wrote:
>>>>
>>>> On 2025/7/24 16:59, David Hildenbrand wrote:
>>>>> On 24.07.25 10:44, Huan Yang wrote:
>>>>>> Summary
>>>>>> ==
>>>>>> This patchset reuses page_type to store migrate entry count
>>>>>> during the
>>>>>> period from migrate entry setup to removal, enabling accelerated VMA
>>>>>> traversal when removing migrate entries, following a similar
>>>>>> principle to
>>>>>> early termination when folio is unmapped in try_to_migrate.
>>>>>
>>>>> I absolutely detest (ab)using page types for that, so no from my side
>>>>> unless I am missing something important.
>>>>>
>>>>>>
>>>>>> In my self-constructed test scenario, the migration time can be
>>>>>> reduced
>>>>>
>>>>> How relevant is that in practice?
>>>>
>>>> IMO, any folio mapped < nr vma in mapping(anon_vma, addresss_space),
>>>> will benefit from this.
>>>>
>>>> So, all pages that have been COW-ed by child processes can be skipped.
>>>
>>> For small anon folios, you could use the anon-exclusive marker to
>>> derive
>>> "there can only be a single mapping".
>>>
>>> It's stored alongside the migration entry.
>>>
>>> So once you restored that single migration entry, you can just stop the
>>> walk.
>>
>> Essentially, something (untested) like this:
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 425401b2d4e14..aa5bf96b1daee 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -421,6 +421,15 @@ static bool remove_migration_pte(struct folio
>> *folio,
>> /* No need to invalidate - it was non-present
>> before */
>> update_mmu_cache(vma, pvmw.address, pvmw.pte);
>> +
>> + /*
>> + * If the small anon folio is exclusive, here can be
>> exactly one
>> + * page mapping -- the one we just restored.
>> + */
>> + if (!folio_test_large(folio) && (rmap_flags &
>> RMAP_EXCLUSIVE)) {
>> + page_vma_mapped_walk_done(&pvmw);
>> + break;
>> + }
>> }
>> return true;
>
> Probably that won't really help I assume, because __folio_set_anon()
> will move the new anon folio under vma->anon_vma, not
> vma->anon_vma->root.
>
> So I assume you mean that we had a COW-shared folio now mapped only
> into some VMAs (some mappings in other processes removed due to CoW or
> similar).
>
> In that case aborting early can help.
>
> Not in all cases though, just imagine that the very last VMA we're
> iterating maps the page. You have to iterate through all of them
> either way ... no way around that, really.
Indeed, whether we can exit the loop early depends on the position of
the terminating VMA in the tree.
I think a better approach would be to remove the fully COW-ed VMAs and
their associated AVCs from the anon_vma's tree.
I've been researching this aspect, but haven't made any progress yet. (I
have some ideas, but the specific implementation is still challenging.)
* Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type
2025-07-24 9:36 ` Huan Yang
@ 2025-07-24 9:45 ` Lorenzo Stoakes
2025-07-24 9:56 ` Huan Yang
0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24 9:45 UTC (permalink / raw)
To: Huan Yang
Cc: David Hildenbrand, Andrew Morton, Rik van Riel, Liam R. Howlett,
Vlastimil Babka, Harry Yoo, Xu Xin, Chengming Zhou, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Matthew Wilcox (Oracle), Christian Brauner,
Usama Arif, Yu Zhao, Baolin Wang, linux-mm, linux-kernel
On Thu, Jul 24, 2025 at 05:36:27PM +0800, Huan Yang wrote:
>
> On 2025/7/24 17:32, David Hildenbrand wrote:
> > On 24.07.25 11:20, David Hildenbrand wrote:
> > > On 24.07.25 11:12, David Hildenbrand wrote:
> > > > On 24.07.25 11:09, Huan Yang wrote:
> > > > >
> > > > > On 2025/7/24 16:59, David Hildenbrand wrote:
> > > > > > On 24.07.25 10:44, Huan Yang wrote:
> > > > > > > Summary
> > > > > > > ==
> > > > > > > This patchset reuses page_type to store migrate
> > > > > > > entry count during the
> > > > > > > period from migrate entry setup to removal, enabling accelerated VMA
> > > > > > > traversal when removing migrate entries, following a similar
> > > > > > > principle to
> > > > > > > early termination when folio is unmapped in try_to_migrate.
> > > > > >
> > > > > > I absolutely detest (ab)using page types for that, so no from my side
> > > > > > unless I am missing something important.
> > > > > >
> > > > > > >
> > > > > > > In my self-constructed test scenario, the migration
> > > > > > > time can be reduced
> > > > > >
> > > > > > How relevant is that in practice?
> > > > >
> > > > > IMO, any folio mapped < nr vma in mapping(anon_vma, addresss_space),
> > > > > will benefit from this.
> > > > >
> > > > > So, all pages that have been COW-ed by child processes can be skipped.
> > > >
> > > > For small anon folios, you could use the anon-exclusive marker
> > > > to derive
> > > > "there can only be a single mapping".
> > > >
> > > > It's stored alongside the migration entry.
> > > >
> > > > So once you restored that single migration entry, you can just stop the
> > > > walk.
> > >
> > > Essentially, something (untested) like this:
> > >
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index 425401b2d4e14..aa5bf96b1daee 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -421,6 +421,15 @@ static bool remove_migration_pte(struct folio
> > > *folio,
> > > /* No need to invalidate - it was non-present
> > > before */
> > > update_mmu_cache(vma, pvmw.address, pvmw.pte);
> > > +
> > > + /*
> > > + * If the small anon folio is exclusive, here can be
> > > exactly one
> > > + * page mapping -- the one we just restored.
> > > + */
> > > + if (!folio_test_large(folio) && (rmap_flags &
> > > RMAP_EXCLUSIVE)) {
> > > + page_vma_mapped_walk_done(&pvmw);
> > > + break;
> > > + }
> > > }
> > > return true;
> >
> > Probably that won't really help I assume, because __folio_set_anon()
> > will move the new anon folio under vma->anon_vma, not
> > vma->anon_vma->root.
> >
> > So I assume you mean that we had a COW-shared folio now mapped only into
> > some VMAs (some mappings in other processes removed due to CoW or
> > similar).
> >
> > In that case aborting early can help.
> >
> > Not in all cases though, just imagine that the very last VMA we're
> > iterating maps the page. You have to iterate through all of them either
> > way ... no way around that, really.
>
> Indeed, whether we can exit the loop early depends on the position of the
> terminating VMA in the tree.
>
> I think a better approach would be to remove the fully COW-ed VMAs and their
> associated AVCs from the anon_vma's tree.
>
> I've been researching this aspect, but haven't made any progress yet.(I have
> some ideas, but the specific implementation is still challenging.)
>
Please leave this alone, I'm in the midst of trying to make fundamental changes
to the anon rmap logic and it's really very subtle and indeed challenging (as
you've seen).
Since I intend to change the whole mechanism around this, efforts to adjust the
existing behaviour are going to strictly conflict with that.
We are 'lazy' about properly accounting for fully CoW'd VMAs, so we can only
know 'maybe' whether a VMA has been fully CoW'd, as you've noticed above.
The CoW hierarchy also makes life challenging; see vma_had_uncowed_parents() for
an example of the subtlety.
Having looked at anon rmap in detail, I have come to think the only sensible way
forward is something fairly bold.
Thanks!
* Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type
2025-07-24 9:45 ` Lorenzo Stoakes
@ 2025-07-24 9:56 ` Huan Yang
2025-07-24 9:58 ` Lorenzo Stoakes
0 siblings, 1 reply; 25+ messages in thread
From: Huan Yang @ 2025-07-24 9:56 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand, Andrew Morton, Rik van Riel, Liam R. Howlett,
Vlastimil Babka, Harry Yoo, Xu Xin, Chengming Zhou, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Matthew Wilcox (Oracle), Christian Brauner,
Usama Arif, Yu Zhao, Baolin Wang, linux-mm, linux-kernel
On 2025/7/24 17:45, Lorenzo Stoakes wrote:
> On Thu, Jul 24, 2025 at 05:36:27PM +0800, Huan Yang wrote:
>> On 2025/7/24 17:32, David Hildenbrand wrote:
>>> On 24.07.25 11:20, David Hildenbrand wrote:
>>>> On 24.07.25 11:12, David Hildenbrand wrote:
>>>>> On 24.07.25 11:09, Huan Yang wrote:
>>>>>> On 2025/7/24 16:59, David Hildenbrand wrote:
>>>>>>> On 24.07.25 10:44, Huan Yang wrote:
>>>>>>>> Summary
>>>>>>>> ==
>>>>>>>> This patchset reuses page_type to store migrate
>>>>>>>> entry count during the
>>>>>>>> period from migrate entry setup to removal, enabling accelerated VMA
>>>>>>>> traversal when removing migrate entries, following a similar
>>>>>>>> principle to
>>>>>>>> early termination when folio is unmapped in try_to_migrate.
>>>>>>> I absolutely detest (ab)using page types for that, so no from my side
>>>>>>> unless I am missing something important.
>>>>>>>
>>>>>>>> In my self-constructed test scenario, the migration
>>>>>>>> time can be reduced
>>>>>>> How relevant is that in practice?
>>>>>> IMO, any folio mapped < nr vma in mapping(anon_vma, addresss_space),
>>>>>> will benefit from this.
>>>>>>
>>>>>> So, all pages that have been COW-ed by child processes can be skipped.
>>>>> For small anon folios, you could use the anon-exclusive marker
>>>>> to derive
>>>>> "there can only be a single mapping".
>>>>>
>>>>> It's stored alongside the migration entry.
>>>>>
>>>>> So once you restored that single migration entry, you can just stop the
>>>>> walk.
>>>> Essentially, something (untested) like this:
>>>>
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index 425401b2d4e14..aa5bf96b1daee 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -421,6 +421,15 @@ static bool remove_migration_pte(struct folio
>>>> *folio,
>>>> /* No need to invalidate - it was non-present
>>>> before */
>>>> update_mmu_cache(vma, pvmw.address, pvmw.pte);
>>>> +
>>>> + /*
>>>> + * If the small anon folio is exclusive, here can be
>>>> exactly one
>>>> + * page mapping -- the one we just restored.
>>>> + */
>>>> + if (!folio_test_large(folio) && (rmap_flags &
>>>> RMAP_EXCLUSIVE)) {
>>>> + page_vma_mapped_walk_done(&pvmw);
>>>> + break;
>>>> + }
>>>> }
>>>> return true;
>>> Probably that won't really help I assume, because __folio_set_anon()
>>> will move the new anon folio under vma->anon_vma, not
>>> vma->anon_vma->root.
>>>
>>> So I assume you mean that we had a COW-shared folio now mapped only into
>>> some VMAs (some mappings in other processes removed due to CoW or
>>> similar).
>>>
>>> In that case aborting early can help.
>>>
>>> Not in all cases though, just imagine that the very last VMA we're
>>> iterating maps the page. You have to iterate through all of them either
>>> way ... no way around that, really.
>> Indeed, whether we can exit the loop early depends on the position of the
>> terminating VMA in the tree.
>>
>> I think a better approach would be to remove the fully COW-ed VMAs and their
>> associated AVCs from the anon_vma's tree.
>>
>> I've been researching this aspect, but haven't made any progress yet.(I have
>> some ideas, but the specific implementation is still challenging.)
>>
> Please leave this alone, I'm in the midst of trying to make fundamental changes
> to the anon rmap logic and it's really very subtle and indeed challenging (as
> you've seen).
OK, but I will still try to research it, not only for upstreaming purposes.
>
> Since I intend to change the whole mechanism around this, efforts to adjust the
> existing behaviour are going to strictly conflict with that.
>
> We are 'lazy' in actually properly accounting for fully CoW'd VMAs and so can
> only know 'maybe' if it has, I mean as from above you've noticed.
>
> The CoW hierarchy also makes life challenging, see vma_had_uncowed_parents() for
> an example of the subtlty.
>
> Having looked at anon rmap in detail, I have come to think the only sensible way
> forward is something fairly bold.
Woo, that's really interesting. I'm looking forward to this.
Thanks.
>
> Thanks!
>
* Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type
2025-07-24 9:56 ` Huan Yang
@ 2025-07-24 9:58 ` Lorenzo Stoakes
2025-07-24 10:01 ` Huan Yang
0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24 9:58 UTC (permalink / raw)
To: Huan Yang
Cc: David Hildenbrand, Andrew Morton, Rik van Riel, Liam R. Howlett,
Vlastimil Babka, Harry Yoo, Xu Xin, Chengming Zhou, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Matthew Wilcox (Oracle), Christian Brauner,
Usama Arif, Yu Zhao, Baolin Wang, linux-mm, linux-kernel
On Thu, Jul 24, 2025 at 05:56:09PM +0800, Huan Yang wrote:
>
> On 2025/7/24 17:45, Lorenzo Stoakes wrote:
> > Having looked at anon rmap in detail, I have come to think the only sensible way
> > forward is something fairly bold.
>
> Woo, that's really interesting. I'm looking forward to this.
Haha thanks, I found the one person in the world who's excited by anon rmap :P
Or rather, the only OTHER person ;)
Cheers, Lorenzo
* Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type
2025-07-24 9:58 ` Lorenzo Stoakes
@ 2025-07-24 10:01 ` Huan Yang
0 siblings, 0 replies; 25+ messages in thread
From: Huan Yang @ 2025-07-24 10:01 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand, Andrew Morton, Rik van Riel, Liam R. Howlett,
Vlastimil Babka, Harry Yoo, Xu Xin, Chengming Zhou, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Matthew Wilcox (Oracle), Christian Brauner,
Usama Arif, Yu Zhao, Baolin Wang, linux-mm, linux-kernel
On 2025/7/24 17:58, Lorenzo Stoakes wrote:
> On Thu, Jul 24, 2025 at 05:56:09PM +0800, Huan Yang wrote:
>> On 2025/7/24 17:45, Lorenzo Stoakes wrote:
>>> Having looked at anon rmap in detail, I have come to think the only sensible way
>>> forward is something fairly bold.
>> Woo, that's really interesting. I'm looking forward to this.
> Haha thanks, I found the one person in the world who's excited by anon rmap :P
>
> Or rather, the only OTHER person ;)
Not only is this interesting, but the efficiency of anonymous page
reclamation also frequently affects the performance of our products.
In our flame graphs, we can see that many anonymous page reverse mapping
paths are among the hotspots.
Therefore, please continue your work—I’m really looking forward to the
results.
Additionally, if there’s anything I can do to help, I’d be happy to
participate.
Thanks.
>
> Cheers, Lorenzo
>
* Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type
2025-07-24 9:29 ` Huan Yang
@ 2025-07-25 1:37 ` Huang, Ying
2025-07-25 1:47 ` Huan Yang
0 siblings, 1 reply; 25+ messages in thread
From: Huang, Ying @ 2025-07-25 1:37 UTC (permalink / raw)
To: Huan Yang
Cc: Lorenzo Stoakes, Andrew Morton, David Hildenbrand, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Alistair Popple, Matthew Wilcox (Oracle),
Christian Brauner, Usama Arif, Yu Zhao, Baolin Wang, linux-mm,
linux-kernel
Huan Yang <link@vivo.com> writes:
> On 2025/7/24 17:15, Lorenzo Stoakes wrote:
[snip]
>> On Thu, Jul 24, 2025 at 04:44:28PM +0800, Huan Yang wrote:
>>> Summary
>>> ==
>>> This patchset reuses page_type to store migrate entry count during the
>>> period from migrate entry setup to removal, enabling accelerated VMA
>>> traversal when removing migrate entries, following a similar principle to
>>> early termination when folio is unmapped in try_to_migrate.
>>>
>>> In my self-constructed test scenario, the migration time can be reduced
>>> from over 150+ms to around 30+ms, achieving nearly a 70% performance
>>> improvement. Additionally, the flame graph shows that the proportion of
>>> remove_migration_ptes can be reduced from 80%+ to 60%+.
>> This sounds completely contrived. I don't even know if you have a use case
>> here.
>
> The test case I provided does have an amplified effect, but the
> optimization it demonstrates is real. It's just that when scaled up to
> the system level, the effect becomes difficult to observe.
>
It's more important to sell your problems than to sell your code :-)
If you cannot prove that the optimization has some practical effect,
it's hard to persuade others to accept the increased complexity.
---
Best Regards,
Huang, Ying
* Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type
2025-07-25 1:37 ` Huang, Ying
@ 2025-07-25 1:47 ` Huan Yang
2025-07-25 9:26 ` David Hildenbrand
0 siblings, 1 reply; 25+ messages in thread
From: Huan Yang @ 2025-07-25 1:47 UTC (permalink / raw)
To: Huang, Ying
Cc: Lorenzo Stoakes, Andrew Morton, David Hildenbrand, Rik van Riel,
Liam R. Howlett, Vlastimil Babka, Harry Yoo, Xu Xin,
Chengming Zhou, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Alistair Popple, Matthew Wilcox (Oracle),
Christian Brauner, Usama Arif, Yu Zhao, Baolin Wang, linux-mm,
linux-kernel
On 2025/7/25 09:37, Huang, Ying wrote:
> Huan Yang <link@vivo.com> writes:
>
>> On 2025/7/24 17:15, Lorenzo Stoakes wrote:
> [snip]
>
>>> On Thu, Jul 24, 2025 at 04:44:28PM +0800, Huan Yang wrote:
>>>> Summary
>>>> ==
>>>> This patchset reuses page_type to store migrate entry count during the
>>>> period from migrate entry setup to removal, enabling accelerated VMA
>>>> traversal when removing migrate entries, following a similar principle to
>>>> early termination when folio is unmapped in try_to_migrate.
>>>>
>>>> In my self-constructed test scenario, the migration time can be reduced
>>>> from over 150+ms to around 30+ms, achieving nearly a 70% performance
>>>> improvement. Additionally, the flame graph shows that the proportion of
>>>> remove_migration_ptes can be reduced from 80%+ to 60%+.
>>> This sounds completely contrived. I don't even know if you have a use case
>>> here.
>> The test case I provided does have an amplified effect, but the
>> optimization it demonstrates is real. It's just that when scaled up to
>> the system level, the effect becomes difficult to observe.
>>
> It's more important to sell your problems than selling your code :-)
I'll remember it. Thanks. :)
>
> If you cannot prove that the optimization has some practical effect,
> it's hard to persuade others for increased complexity.
To be honest, this patch stems from an issue I noticed during code review.
When this patchset was completed, I did put in some effort to find its
benefits, and it was only under such an exaggeratedly constructed test
scenario that the effect could be demonstrated. :(
The actual problem I'm facing has been described in other replies. It's
really about anonymous pages whose mappings have been fully COW-ed but
whose AVCs haven't been removed from the anon_vma's RB tree, resulting in
inefficient traversal.
Lorenzo has mentioned that he has some bold ideas regarding this; let's
look forward to it. :)
Thanks.
>
> ---
> Best Regards,
> Huang, Ying
* Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type
2025-07-25 1:47 ` Huan Yang
@ 2025-07-25 9:26 ` David Hildenbrand
0 siblings, 0 replies; 25+ messages in thread
From: David Hildenbrand @ 2025-07-25 9:26 UTC (permalink / raw)
To: Huan Yang, Huang, Ying
Cc: Lorenzo Stoakes, Andrew Morton, Rik van Riel, Liam R. Howlett,
Vlastimil Babka, Harry Yoo, Xu Xin, Chengming Zhou, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price,
Alistair Popple, Matthew Wilcox (Oracle), Christian Brauner,
Usama Arif, Yu Zhao, Baolin Wang, linux-mm, linux-kernel
>> If you cannot prove that the optimization has some practical effect,
>> it's hard to persuade others for increased complexity.
>
> To be honest, this patch stems from an issue I noticed during code review.
>
> When this patchset was completed, I did put in some effort to find its
> benefits, and it was only
>
> under such an exaggeratedly constructed test scenario that the effect
> could be demonstrated. :(
I mean, thanks for looking into that and trying to find a way to improve
it. :)
That VMA walk is the real problem; stopping earlier is just an
optimization that works in some cases. I guess on average it will
improve things, although that is probably really hard to quantify in reality.
I think tracking the #migration entries might be a very good debugging tool.
A cleaner and more reliable solution along the lines of what you tried to
implement would be to:
(a) track it in a separate counter, at the time we establish/remove a
migration entry, not once the mapcount is already 0. With "struct folio"
getting allocated separately in the future this could maybe be feasible
(and it could be put under a config knob).
(b) do it for large folios as well.
(b) might be tricky with migration entries being used for THP splits,
but probably it could be special-cased somehow, I am sure.
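To make (a) a bit more concrete, here is a rough, untested sketch. Note that
CONFIG_MIGRATION_ENTRY_COUNT, the _nr_migration_entries field and the helper
names below are hypothetical; they only illustrate the idea and are not
existing kernel API:

/*
 * Hypothetical sketch of idea (a): keep the migration-entry count in a
 * dedicated atomic_t added to "struct folio", updated whenever a migration
 * entry is installed or removed, and guarded by a config knob so the extra
 * cost is opt-in. Assumes <linux/atomic.h> and a folio field that does not
 * exist today.
 */
#ifdef CONFIG_MIGRATION_ENTRY_COUNT
static inline void folio_migration_entry_add(struct folio *folio)
{
	/* Called wherever a migration entry for this folio is installed. */
	atomic_inc(&folio->_nr_migration_entries);
}

static inline bool folio_migration_entry_remove(struct folio *folio)
{
	/* Returns true once the last migration entry has been removed. */
	return atomic_dec_and_test(&folio->_nr_migration_entries);
}
#else
static inline void folio_migration_entry_add(struct folio *folio)
{
}

static inline bool folio_migration_entry_remove(struct folio *folio)
{
	return false;
}
#endif

remove_migration_pte() could then stop the rmap walk once
folio_migration_entry_remove() reports that the last entry is gone, which
would also work for large folios (modulo the THP-split special case above).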
--
Cheers,
David / dhildenb
Thread overview: 25+ messages
2025-07-24 8:44 [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type Huan Yang
2025-07-24 8:44 ` [RFC PATCH 1/9] mm: introduce PAGE_TYPE_SHIFT Huan Yang
2025-07-24 8:44 ` [RFC PATCH 2/9] mm: add page_type value helper Huan Yang
2025-07-24 8:44 ` [RFC PATCH 3/9] mm/rmap: simplify rmap_walk invoke Huan Yang
2025-07-24 8:44 ` [RFC PATCH 4/9] mm/rmap: add args in rmap_walk_control done hook Huan Yang
2025-07-24 8:44 ` [RFC PATCH 5/9] mm/rmap: introduce exit hook Huan Yang
2025-07-24 8:44 ` [RFC PATCH 6/9] mm/rmap: introduce migrate_walk_arg Huan Yang
2025-07-24 8:44 ` [RFC PATCH 7/9] mm/migrate: rename rmap_walk_arg folio Huan Yang
2025-07-24 8:44 ` [RFC PATCH 8/9] mm/migrate: infrastructure for migrate entry page_type Huan Yang
2025-07-24 8:44 ` [RFC PATCH 9/9] mm/migrate: apply " Huan Yang
2025-07-24 8:59 ` [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type David Hildenbrand
2025-07-24 9:09 ` Huan Yang
2025-07-24 9:12 ` David Hildenbrand
2025-07-24 9:20 ` David Hildenbrand
2025-07-24 9:32 ` David Hildenbrand
2025-07-24 9:36 ` Huan Yang
2025-07-24 9:45 ` Lorenzo Stoakes
2025-07-24 9:56 ` Huan Yang
2025-07-24 9:58 ` Lorenzo Stoakes
2025-07-24 10:01 ` Huan Yang
2025-07-24 9:15 ` Lorenzo Stoakes
2025-07-24 9:29 ` Huan Yang
2025-07-25 1:37 ` Huang, Ying
2025-07-25 1:47 ` Huan Yang
2025-07-25 9:26 ` David Hildenbrand