[PATCH v5 0/6] mm: split underused THPs

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v5 0/6] mm: split underused THPs
@ 2024-08-30 10:03 Usama Arif
  2024-08-30 10:03 ` [PATCH v5 1/6] mm: free zapped tail pages when splitting isolated thp Usama Arif
                   ` (5 more replies)
  0 siblings, 6 replies; 33+ messages in thread
From: Usama Arif @ 2024-08-30 10:03 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, npache,
	baohua, ryan.roberts, rppt, willy, cerasuolodomenico, ryncsn,
	corbet, linux-kernel, linux-doc, kernel-team, Usama Arif

The current upstream default policy for THP is always. However, Meta
uses madvise in production as the current THP=always policy vastly
overprovisions THPs in sparsely accessed memory areas, resulting in
excessive memory pressure and premature OOM killing.
Using madvise + relying on khugepaged has certain drawbacks over
THP=always. Using madvise hints mean THPs aren't "transparent" and
require userspace changes. Waiting for khugepaged to scan memory and
collapse pages into THP can be slow and unpredictable in terms of performance
(i.e. you dont know when the collapse will happen), while production
environments require predictable performance. If there is enough memory
available, its better for both performance and predictability to have
a THP from fault time, i.e. THP=always rather than wait for khugepaged
to collapse it, and deal with sparsely populated THPs when the system is
running out of memory.

This patch-series is an attempt to mitigate the issue of running out of
memory when THP is always enabled. During runtime whenever a THP is being
faulted in or collapsed by khugepaged, the THP is added to a list.
Whenever memory reclaim happens, the kernel runs the deferred_split
shrinker which goes through the list and checks if the THP was underused,
i.e. how many of the base 4K pages of the entire THP were zero-filled.
If this number goes above a certain threshold, the shrinker will attempt
to split that THP. Then at remap time, the pages that were zero-filled are
mapped to the shared zeropage, hence saving memory. This method avoids the
downside of wasting memory in areas where THP is sparsely filled when THP
is always enabled, while still providing the upside THPs like reduced TLB
misses without having to use madvise.

Meta production workloads that were CPU bound (>99% CPU utilzation) were
tested with THP shrinker. The results after 2 hours are as follows:

                            | THP=madvise |  THP=always   | THP=always
                            |             |               | + shrinker series
                            |             |               | + max_ptes_none=409
-----------------------------------------------------------------------------
Performance improvement     |      -      |    +1.8%      |     +1.7%
(over THP=madvise)          |             |               |
-----------------------------------------------------------------------------
Memory usage                |    54.6G    | 58.8G (+7.7%) |   55.9G (+2.4%)
-----------------------------------------------------------------------------
max_ptes_none=409 means that any THP that has more than 409 out of 512
(80%) zero filled filled pages will be split.

To test out the patches, the below commands without the shrinker will
invoke OOM killer immediately and kill stress, but will not fail with
the shrinker:

echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
mkdir /sys/fs/cgroup/test
echo $$ > /sys/fs/cgroup/test/cgroup.procs
echo 20M > /sys/fs/cgroup/test/memory.max
echo 0 > /sys/fs/cgroup/test/memory.swap.max
# allocate twice memory.max for each stress worker and touch 40/512 of
# each THP, i.e. vm-stride 50K.
# With the shrinker, max_ptes_none of 470 and below won't invoke OOM
# killer.
# Without the shrinker, OOM killer is invoked immediately irrespective
# of max_ptes_none value and kills stress.
stress --vm 1 --vm-bytes 40M --vm-stride 50K

---
v4 -> v5:
- rebase on top of latest mm-unstable. This includes Barrys'
  anon mTHP accounting series. All merge conflicts should be resolved
  with the patches on mailing list. This also means all places
  where partially_mapped flag is set is first tested for it (Barry).
- uint64_t to unsigned long for rss_anon function in selftest.
- convert PageMlocked to folio_test_mlocked in
  try_to_map_unused_to_zeropage.

v3 -> v4:
- do not clear partially_mapped flag on hugeTLB folios (Yu Zhao).
- fix condition for calling deferred_folio_split in partially mapped case
  and count for partially mapped vm events (Barry Song).
- use non-atomic versions of set/clear partially_mapped flags
  (David Hildenbrand)
- use PG_partially_mapped = PG_reclaim (Matthew Wilcox)
- delete folio from lru list and folio_batch_add "new_folio" instead
  of folio in __split_huge_page. (Kairui Song)
- fix deadlock in deferred_split_scan by not doing folio_put while
  holding split_queue_lock (Hugh Dickins)
- underutilized to underused and thp_low_util_shrinker to shrink_underused
  (Hugh Dickins)

v2 -> v3:
- Use my_zero_pfn instead of page_to_pfn(ZERO_PAGE(..)) (Johannes)
- Use flags argument instead of bools in remove_migration_ptes (Johannes)
- Use a new flag in folio->_flags_1 instead of folio->_partially_mapped
  (David Hildenbrand).
- Split out the last patch of v2 into 3, one for introducing the flag,
  one for splitting underutilized THPs on _deferred_list and one for adding
  sysfs entry to disable splitting (David Hildenbrand).

v1 -> v2:
- Turn page checks and operations to folio versions in __split_huge_page.
  This means patches 1 and 2 from v1 are no longer needed.
  (David Hildenbrand)
- Map to shared zeropage in all cases if the base page is zero-filled.
  The uffd selftest was removed.
  (David Hildenbrand).
- rename 'dirty' to 'contains_data' in try_to_map_unused_to_zeropage
  (Rik van Riel).
- Use unsigned long instead of uint64_t (kernel test robot).

Alexander Zhu (1):
  mm: selftest to verify zero-filled pages are mapped to zeropage

Usama Arif (3):
  mm: Introduce a pageflag for partially mapped folios
  mm: split underused THPs
  mm: add sysfs entry to disable splitting underused THPs

Yu Zhao (2):
  mm: free zapped tail pages when splitting isolated thp
  mm: remap unused subpages to shared zeropage when splitting isolated
    thp

 Documentation/admin-guide/mm/transhuge.rst    |  16 ++
 include/linux/huge_mm.h                       |   4 +-
 include/linux/khugepaged.h                    |   1 +
 include/linux/page-flags.h                    |  13 +-
 include/linux/rmap.h                          |   7 +-
 include/linux/vm_event_item.h                 |   1 +
 mm/huge_memory.c                              | 163 ++++++++++++++++--
 mm/khugepaged.c                               |   3 +-
 mm/memcontrol.c                               |   3 +-
 mm/migrate.c                                  |  75 ++++++--
 mm/migrate_device.c                           |   4 +-
 mm/page_alloc.c                               |   5 +-
 mm/rmap.c                                     |   5 +-
 mm/vmscan.c                                   |   3 +-
 mm/vmstat.c                                   |   1 +
 .../selftests/mm/split_huge_page_test.c       |  71 ++++++++
 tools/testing/selftests/mm/vm_util.c          |  22 +++
 tools/testing/selftests/mm/vm_util.h          |   1 +
 18 files changed, 358 insertions(+), 40 deletions(-)

-- 
2.43.5

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v5 1/6] mm: free zapped tail pages when splitting isolated thp
  2024-08-30 10:03 [PATCH v5 0/6] mm: split underused THPs Usama Arif
@ 2024-08-30 10:03 ` Usama Arif
  2024-09-05  8:46   ` Hugh Dickins
  2024-08-30 10:03 ` [PATCH v5 2/6] mm: remap unused subpages to shared zeropage " Usama Arif
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 33+ messages in thread
From: Usama Arif @ 2024-08-30 10:03 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, npache,
	baohua, ryan.roberts, rppt, willy, cerasuolodomenico, ryncsn,
	corbet, linux-kernel, linux-doc, kernel-team, Shuang Zhai,
	Usama Arif

From: Yu Zhao <yuzhao@google.com>

If a tail page has only two references left, one inherited from the
isolation of its head and the other from lru_add_page_tail() which we
are about to drop, it means this tail page was concurrently zapped.
Then we can safely free it and save page reclaim or migration the
trouble of trying it.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Shuang Zhai <zhais@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 mm/huge_memory.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 15418ffdd377..0c48806ccb9a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3170,7 +3170,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	unsigned int new_nr = 1 << new_order;
 	int order = folio_order(folio);
 	unsigned int nr = 1 << order;
+	struct folio_batch free_folios;
 
+	folio_batch_init(&free_folios);
 	/* complete memcg works before add pages to LRU */
 	split_page_memcg(head, order, new_order);
 
@@ -3254,6 +3256,27 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		if (subpage == page)
 			continue;
 		folio_unlock(new_folio);
+		/*
+		 * If a folio has only two references left, one inherited
+		 * from the isolation of its head and the other from
+		 * lru_add_page_tail() which we are about to drop, it means this
+		 * folio was concurrently zapped. Then we can safely free it
+		 * and save page reclaim or migration the trouble of trying it.
+		 */
+		if (list && folio_ref_freeze(new_folio, 2)) {
+			VM_WARN_ON_ONCE_FOLIO(folio_test_lru(new_folio), new_folio);
+			VM_WARN_ON_ONCE_FOLIO(folio_test_large(new_folio), new_folio);
+			VM_WARN_ON_ONCE_FOLIO(folio_mapped(new_folio), new_folio);
+
+			folio_clear_active(new_folio);
+			folio_clear_unevictable(new_folio);
+			list_del(&new_folio->lru);
+			if (!folio_batch_add(&free_folios, new_folio)) {
+				mem_cgroup_uncharge_folios(&free_folios);
+				free_unref_folios(&free_folios);
+			}
+			continue;
+		}
 
 		/*
 		 * Subpages may be freed if there wasn't any mapping
@@ -3264,6 +3287,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		 */
 		free_page_and_swap_cache(subpage);
 	}
+
+	if (free_folios.nr) {
+		mem_cgroup_uncharge_folios(&free_folios);
+		free_unref_folios(&free_folios);
+	}
 }
 
 /* Racy check whether the huge page can be split */
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2024-08-30 10:03 [PATCH v5 0/6] mm: split underused THPs Usama Arif
  2024-08-30 10:03 ` [PATCH v5 1/6] mm: free zapped tail pages when splitting isolated thp Usama Arif
@ 2024-08-30 10:03 ` Usama Arif
  2024-10-23 16:21   ` Zi Yan
  2025-09-18  8:53   ` Qun-wei Lin (林群崴)
  2024-08-30 10:03 ` [PATCH v5 3/6] mm: selftest to verify zero-filled pages are mapped to zeropage Usama Arif
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 33+ messages in thread
From: Usama Arif @ 2024-08-30 10:03 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, npache,
	baohua, ryan.roberts, rppt, willy, cerasuolodomenico, ryncsn,
	corbet, linux-kernel, linux-doc, kernel-team, Shuang Zhai,
	Usama Arif

From: Yu Zhao <yuzhao@google.com>

Here being unused means containing only zeros and inaccessible to
userspace. When splitting an isolated thp under reclaim or migration,
the unused subpages can be mapped to the shared zeropage, hence saving
memory. This is particularly helpful when the internal
fragmentation of a thp is high, i.e. it has many untouched subpages.

This is also a prerequisite for THP low utilization shrinker which will
be introduced in later patches, where underutilized THPs are split, and
the zero-filled pages are freed saving memory.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Shuang Zhai <zhais@google.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 include/linux/rmap.h |  7 ++++-
 mm/huge_memory.c     |  8 ++---
 mm/migrate.c         | 72 ++++++++++++++++++++++++++++++++++++++------
 mm/migrate_device.c  |  4 +--
 4 files changed, 75 insertions(+), 16 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 91b5935e8485..d5e93e44322e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -745,7 +745,12 @@ int folio_mkclean(struct folio *);
 int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
 		      struct vm_area_struct *vma);
 
-void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked);
+enum rmp_flags {
+	RMP_LOCKED		= 1 << 0,
+	RMP_USE_SHARED_ZEROPAGE	= 1 << 1,
+};
+
+void remove_migration_ptes(struct folio *src, struct folio *dst, int flags);
 
 /*
  * rmap_walk_control: To control rmap traversing for specific needs
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0c48806ccb9a..af60684e7c70 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3020,7 +3020,7 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
 	return false;
 }
 
-static void remap_page(struct folio *folio, unsigned long nr)
+static void remap_page(struct folio *folio, unsigned long nr, int flags)
 {
 	int i = 0;
 
@@ -3028,7 +3028,7 @@ static void remap_page(struct folio *folio, unsigned long nr)
 	if (!folio_test_anon(folio))
 		return;
 	for (;;) {
-		remove_migration_ptes(folio, folio, true);
+		remove_migration_ptes(folio, folio, RMP_LOCKED | flags);
 		i += folio_nr_pages(folio);
 		if (i >= nr)
 			break;
@@ -3240,7 +3240,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 	if (nr_dropped)
 		shmem_uncharge(folio->mapping->host, nr_dropped);
-	remap_page(folio, nr);
+	remap_page(folio, nr, PageAnon(head) ? RMP_USE_SHARED_ZEROPAGE : 0);
 
 	/*
 	 * set page to its compound_head when split to non order-0 pages, so
@@ -3542,7 +3542,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		if (mapping)
 			xas_unlock(&xas);
 		local_irq_enable();
-		remap_page(folio, folio_nr_pages(folio));
+		remap_page(folio, folio_nr_pages(folio), 0);
 		ret = -EAGAIN;
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 6f9c62c746be..d039863e014b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -204,13 +204,57 @@ bool isolate_folio_to_list(struct folio *folio, struct list_head *list)
 	return true;
 }
 
+static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
+					  struct folio *folio,
+					  unsigned long idx)
+{
+	struct page *page = folio_page(folio, idx);
+	bool contains_data;
+	pte_t newpte;
+	void *addr;
+
+	VM_BUG_ON_PAGE(PageCompound(page), page);
+	VM_BUG_ON_PAGE(!PageAnon(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
+
+	if (folio_test_mlocked(folio) || (pvmw->vma->vm_flags & VM_LOCKED) ||
+	    mm_forbids_zeropage(pvmw->vma->vm_mm))
+		return false;
+
+	/*
+	 * The pmd entry mapping the old thp was flushed and the pte mapping
+	 * this subpage has been non present. If the subpage is only zero-filled
+	 * then map it to the shared zeropage.
+	 */
+	addr = kmap_local_page(page);
+	contains_data = memchr_inv(addr, 0, PAGE_SIZE);
+	kunmap_local(addr);
+
+	if (contains_data)
+		return false;
+
+	newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
+					pvmw->vma->vm_page_prot));
+	set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
+
+	dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
+	return true;
+}
+
+struct rmap_walk_arg {
+	struct folio *folio;
+	bool map_unused_to_zeropage;
+};
+
 /*
  * Restore a potential migration pte to a working pte entry
  */
 static bool remove_migration_pte(struct folio *folio,
-		struct vm_area_struct *vma, unsigned long addr, void *old)
+		struct vm_area_struct *vma, unsigned long addr, void *arg)
 {
-	DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
+	struct rmap_walk_arg *rmap_walk_arg = arg;
+	DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
 
 	while (page_vma_mapped_walk(&pvmw)) {
 		rmap_t rmap_flags = RMAP_NONE;
@@ -234,6 +278,9 @@ static bool remove_migration_pte(struct folio *folio,
 			continue;
 		}
 #endif
+		if (rmap_walk_arg->map_unused_to_zeropage &&
+		    try_to_map_unused_to_zeropage(&pvmw, folio, idx))
+			continue;
 
 		folio_get(folio);
 		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
@@ -312,14 +359,21 @@ static bool remove_migration_pte(struct folio *folio,
  * Get rid of all migration entries and replace them by
  * references to the indicated page.
  */
-void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked)
+void remove_migration_ptes(struct folio *src, struct folio *dst, int flags)
 {
+	struct rmap_walk_arg rmap_walk_arg = {
+		.folio = src,
+		.map_unused_to_zeropage = flags & RMP_USE_SHARED_ZEROPAGE,
+	};
+
 	struct rmap_walk_control rwc = {
 		.rmap_one = remove_migration_pte,
-		.arg = src,
+		.arg = &rmap_walk_arg,
 	};
 
-	if (locked)
+	VM_BUG_ON_FOLIO((flags & RMP_USE_SHARED_ZEROPAGE) && (src != dst), src);
+
+	if (flags & RMP_LOCKED)
 		rmap_walk_locked(dst, &rwc);
 	else
 		rmap_walk(dst, &rwc);
@@ -934,7 +988,7 @@ static int writeout(struct address_space *mapping, struct folio *folio)
 	 * At this point we know that the migration attempt cannot
 	 * be successful.
 	 */
-	remove_migration_ptes(folio, folio, false);
+	remove_migration_ptes(folio, folio, 0);
 
 	rc = mapping->a_ops->writepage(&folio->page, &wbc);
 
@@ -1098,7 +1152,7 @@ static void migrate_folio_undo_src(struct folio *src,
 				   struct list_head *ret)
 {
 	if (page_was_mapped)
-		remove_migration_ptes(src, src, false);
+		remove_migration_ptes(src, src, 0);
 	/* Drop an anon_vma reference if we took one */
 	if (anon_vma)
 		put_anon_vma(anon_vma);
@@ -1336,7 +1390,7 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
 		lru_add_drain();
 
 	if (old_page_state & PAGE_WAS_MAPPED)
-		remove_migration_ptes(src, dst, false);
+		remove_migration_ptes(src, dst, 0);
 
 out_unlock_both:
 	folio_unlock(dst);
@@ -1474,7 +1528,7 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
 
 	if (page_was_mapped)
 		remove_migration_ptes(src,
-			rc == MIGRATEPAGE_SUCCESS ? dst : src, false);
+			rc == MIGRATEPAGE_SUCCESS ? dst : src, 0);
 
 unlock_put_anon:
 	folio_unlock(dst);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 8d687de88a03..9cf26592ac93 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -424,7 +424,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
 			continue;
 
 		folio = page_folio(page);
-		remove_migration_ptes(folio, folio, false);
+		remove_migration_ptes(folio, folio, 0);
 
 		src_pfns[i] = 0;
 		folio_unlock(folio);
@@ -840,7 +840,7 @@ void migrate_device_finalize(unsigned long *src_pfns,
 			dst = src;
 		}
 
-		remove_migration_ptes(src, dst, false);
+		remove_migration_ptes(src, dst, 0);
 		folio_unlock(src);
 
 		if (folio_is_zone_device(src))
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v5 3/6] mm: selftest to verify zero-filled pages are mapped to zeropage
  2024-08-30 10:03 [PATCH v5 0/6] mm: split underused THPs Usama Arif
  2024-08-30 10:03 ` [PATCH v5 1/6] mm: free zapped tail pages when splitting isolated thp Usama Arif
  2024-08-30 10:03 ` [PATCH v5 2/6] mm: remap unused subpages to shared zeropage " Usama Arif
@ 2024-08-30 10:03 ` Usama Arif
  2024-08-30 10:03 ` [PATCH v5 4/6] mm: Introduce a pageflag for partially mapped folios Usama Arif
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 33+ messages in thread
From: Usama Arif @ 2024-08-30 10:03 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, npache,
	baohua, ryan.roberts, rppt, willy, cerasuolodomenico, ryncsn,
	corbet, linux-kernel, linux-doc, kernel-team, Alexander Zhu,
	Usama Arif

From: Alexander Zhu <alexlzhu@fb.com>

When a THP is split, any subpage that is zero-filled will be mapped
to the shared zeropage, hence saving memory. Add selftest to verify
this by allocating zero-filled THP and comparing RssAnon before and
after split.

Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 .../selftests/mm/split_huge_page_test.c       | 71 +++++++++++++++++++
 tools/testing/selftests/mm/vm_util.c          | 22 ++++++
 tools/testing/selftests/mm/vm_util.h          |  1 +
 3 files changed, 94 insertions(+)

diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c
index e5e8dafc9d94..eb6d1b9fc362 100644
--- a/tools/testing/selftests/mm/split_huge_page_test.c
+++ b/tools/testing/selftests/mm/split_huge_page_test.c
@@ -84,6 +84,76 @@ static void write_debugfs(const char *fmt, ...)
 	write_file(SPLIT_DEBUGFS, input, ret + 1);
 }
 
+static char *allocate_zero_filled_hugepage(size_t len)
+{
+	char *result;
+	size_t i;
+
+	result = memalign(pmd_pagesize, len);
+	if (!result) {
+		printf("Fail to allocate memory\n");
+		exit(EXIT_FAILURE);
+	}
+
+	madvise(result, len, MADV_HUGEPAGE);
+
+	for (i = 0; i < len; i++)
+		result[i] = (char)0;
+
+	return result;
+}
+
+static void verify_rss_anon_split_huge_page_all_zeroes(char *one_page, int nr_hpages, size_t len)
+{
+	unsigned long rss_anon_before, rss_anon_after;
+	size_t i;
+
+	if (!check_huge_anon(one_page, 4, pmd_pagesize)) {
+		printf("No THP is allocated\n");
+		exit(EXIT_FAILURE);
+	}
+
+	rss_anon_before = rss_anon();
+	if (!rss_anon_before) {
+		printf("No RssAnon is allocated before split\n");
+		exit(EXIT_FAILURE);
+	}
+
+	/* split all THPs */
+	write_debugfs(PID_FMT, getpid(), (uint64_t)one_page,
+		      (uint64_t)one_page + len, 0);
+
+	for (i = 0; i < len; i++)
+		if (one_page[i] != (char)0) {
+			printf("%ld byte corrupted\n", i);
+			exit(EXIT_FAILURE);
+		}
+
+	if (!check_huge_anon(one_page, 0, pmd_pagesize)) {
+		printf("Still AnonHugePages not split\n");
+		exit(EXIT_FAILURE);
+	}
+
+	rss_anon_after = rss_anon();
+	if (rss_anon_after >= rss_anon_before) {
+		printf("Incorrect RssAnon value. Before: %ld After: %ld\n",
+		       rss_anon_before, rss_anon_after);
+		exit(EXIT_FAILURE);
+	}
+}
+
+void split_pmd_zero_pages(void)
+{
+	char *one_page;
+	int nr_hpages = 4;
+	size_t len = nr_hpages * pmd_pagesize;
+
+	one_page = allocate_zero_filled_hugepage(len);
+	verify_rss_anon_split_huge_page_all_zeroes(one_page, nr_hpages, len);
+	printf("Split zero filled huge pages successful\n");
+	free(one_page);
+}
+
 void split_pmd_thp(void)
 {
 	char *one_page;
@@ -431,6 +501,7 @@ int main(int argc, char **argv)
 
 	fd_size = 2 * pmd_pagesize;
 
+	split_pmd_zero_pages();
 	split_pmd_thp();
 	split_pte_mapped_thp();
 	split_file_backed_thp();
diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c
index 5a62530da3b5..d8d0cf04bb57 100644
--- a/tools/testing/selftests/mm/vm_util.c
+++ b/tools/testing/selftests/mm/vm_util.c
@@ -12,6 +12,7 @@
 
 #define PMD_SIZE_FILE_PATH "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
 #define SMAP_FILE_PATH "/proc/self/smaps"
+#define STATUS_FILE_PATH "/proc/self/status"
 #define MAX_LINE_LENGTH 500
 
 unsigned int __page_size;
@@ -171,6 +172,27 @@ uint64_t read_pmd_pagesize(void)
 	return strtoul(buf, NULL, 10);
 }
 
+unsigned long rss_anon(void)
+{
+	unsigned long rss_anon = 0;
+	FILE *fp;
+	char buffer[MAX_LINE_LENGTH];
+
+	fp = fopen(STATUS_FILE_PATH, "r");
+	if (!fp)
+		ksft_exit_fail_msg("%s: Failed to open file %s\n", __func__, STATUS_FILE_PATH);
+
+	if (!check_for_pattern(fp, "RssAnon:", buffer, sizeof(buffer)))
+		goto err_out;
+
+	if (sscanf(buffer, "RssAnon:%10lu kB", &rss_anon) != 1)
+		ksft_exit_fail_msg("Reading status error\n");
+
+err_out:
+	fclose(fp);
+	return rss_anon;
+}
+
 bool __check_huge(void *addr, char *pattern, int nr_hpages,
 		  uint64_t hpage_size)
 {
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index 9007c420d52c..2eaed8209925 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -39,6 +39,7 @@ unsigned long pagemap_get_pfn(int fd, char *start);
 void clear_softdirty(void);
 bool check_for_pattern(FILE *fp, const char *pattern, char *buf, size_t len);
 uint64_t read_pmd_pagesize(void);
+unsigned long rss_anon(void);
 bool check_huge_anon(void *addr, int nr_hpages, uint64_t hpage_size);
 bool check_huge_file(void *addr, int nr_hpages, uint64_t hpage_size);
 bool check_huge_shmem(void *addr, int nr_hpages, uint64_t hpage_size);
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v5 4/6] mm: Introduce a pageflag for partially mapped folios
  2024-08-30 10:03 [PATCH v5 0/6] mm: split underused THPs Usama Arif
                   ` (2 preceding siblings ...)
  2024-08-30 10:03 ` [PATCH v5 3/6] mm: selftest to verify zero-filled pages are mapped to zeropage Usama Arif
@ 2024-08-30 10:03 ` Usama Arif
  2024-12-11 15:03   ` David Hildenbrand
  2024-08-30 10:03 ` [PATCH v5 5/6] mm: split underused THPs Usama Arif
  2024-08-30 10:03 ` [PATCH v5 6/6] mm: add sysfs entry to disable splitting " Usama Arif
  5 siblings, 1 reply; 33+ messages in thread
From: Usama Arif @ 2024-08-30 10:03 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, npache,
	baohua, ryan.roberts, rppt, willy, cerasuolodomenico, ryncsn,
	corbet, linux-kernel, linux-doc, kernel-team, Usama Arif

Currently folio->_deferred_list is used to keep track of
partially_mapped folios that are going to be split under memory
pressure. In the next patch, all THPs that are faulted in and collapsed
by khugepaged are also going to be tracked using _deferred_list.

This patch introduces a pageflag to be able to distinguish between
partially mapped folios and others in the deferred_list at split time in
deferred_split_scan. Its needed as __folio_remove_rmap decrements
_mapcount, _large_mapcount and _entire_mapcount, hence it won't be
possible to distinguish between partially mapped folios and others in
deferred_split_scan.

Eventhough it introduces an extra flag to track if the folio is
partially mapped, there is no functional change intended with this
patch and the flag is not useful in this patch itself, it will
become useful in the next patch when _deferred_list has non partially
mapped folios.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 include/linux/huge_mm.h    |  4 ++--
 include/linux/page-flags.h | 13 +++++++++++-
 mm/huge_memory.c           | 41 ++++++++++++++++++++++++++++----------
 mm/memcontrol.c            |  3 ++-
 mm/migrate.c               |  3 ++-
 mm/page_alloc.c            |  5 +++--
 mm/rmap.c                  |  5 +++--
 mm/vmscan.c                |  3 ++-
 8 files changed, 56 insertions(+), 21 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4da102b74a8c..0b0539f4ee1a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -333,7 +333,7 @@ static inline int split_huge_page(struct page *page)
 {
 	return split_huge_page_to_list_to_order(page, NULL, 0);
 }
-void deferred_split_folio(struct folio *folio);
+void deferred_split_folio(struct folio *folio, bool partially_mapped);
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze, struct folio *folio);
@@ -502,7 +502,7 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
-static inline void deferred_split_folio(struct folio *folio) {}
+static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 2175ebceb41c..1b3a76710487 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -186,6 +186,7 @@ enum pageflags {
 	/* At least one page in this folio has the hwpoison flag set */
 	PG_has_hwpoisoned = PG_active,
 	PG_large_rmappable = PG_workingset, /* anon or file-backed */
+	PG_partially_mapped = PG_reclaim, /* was identified to be partially mapped */
 };
 
 #define PAGEFLAGS_MASK		((1UL << NR_PAGEFLAGS) - 1)
@@ -859,8 +860,18 @@ static inline void ClearPageCompound(struct page *page)
 	ClearPageHead(page);
 }
 FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
+FOLIO_TEST_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
+/*
+ * PG_partially_mapped is protected by deferred_split split_queue_lock,
+ * so its safe to use non-atomic set/clear.
+ */
+__FOLIO_SET_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
+__FOLIO_CLEAR_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
 #else
 FOLIO_FLAG_FALSE(large_rmappable)
+FOLIO_TEST_FLAG_FALSE(partially_mapped)
+__FOLIO_SET_FLAG_NOOP(partially_mapped)
+__FOLIO_CLEAR_FLAG_NOOP(partially_mapped)
 #endif
 
 #define PG_head_mask ((1UL << PG_head))
@@ -1171,7 +1182,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
  */
 #define PAGE_FLAGS_SECOND						\
 	(0xffUL /* order */		| 1UL << PG_has_hwpoisoned |	\
-	 1UL << PG_large_rmappable)
+	 1UL << PG_large_rmappable	| 1UL << PG_partially_mapped)
 
 #define PAGE_FLAGS_PRIVATE				\
 	(1UL << PG_private | 1UL << PG_private_2)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index af60684e7c70..166f8810f3c6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3503,7 +3503,11 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		if (folio_order(folio) > 1 &&
 		    !list_empty(&folio->_deferred_list)) {
 			ds_queue->split_queue_len--;
-			mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
+			if (folio_test_partially_mapped(folio)) {
+				__folio_clear_partially_mapped(folio);
+				mod_mthp_stat(folio_order(folio),
+					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
+			}
 			/*
 			 * Reinitialize page_deferred_list after removing the
 			 * page from the split_queue, otherwise a subsequent
@@ -3570,13 +3574,18 @@ void __folio_undo_large_rmappable(struct folio *folio)
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	if (!list_empty(&folio->_deferred_list)) {
 		ds_queue->split_queue_len--;
-		mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
+		if (folio_test_partially_mapped(folio)) {
+			__folio_clear_partially_mapped(folio);
+			mod_mthp_stat(folio_order(folio),
+				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
+		}
 		list_del_init(&folio->_deferred_list);
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 }
 
-void deferred_split_folio(struct folio *folio)
+/* partially_mapped=false won't clear PG_partially_mapped folio flag */
+void deferred_split_folio(struct folio *folio, bool partially_mapped)
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
 #ifdef CONFIG_MEMCG
@@ -3604,15 +3613,21 @@ void deferred_split_folio(struct folio *folio)
 	if (folio_test_swapcache(folio))
 		return;
 
-	if (!list_empty(&folio->_deferred_list))
-		return;
-
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+	if (partially_mapped) {
+		if (!folio_test_partially_mapped(folio)) {
+			__folio_set_partially_mapped(folio);
+			if (folio_test_pmd_mappable(folio))
+				count_vm_event(THP_DEFERRED_SPLIT_PAGE);
+			count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
+			mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, 1);
+
+		}
+	} else {
+		/* partially mapped folios cannot become non-partially mapped */
+		VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
+	}
 	if (list_empty(&folio->_deferred_list)) {
-		if (folio_test_pmd_mappable(folio))
-			count_vm_event(THP_DEFERRED_SPLIT_PAGE);
-		count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
-		mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, 1);
 		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
 		ds_queue->split_queue_len++;
 #ifdef CONFIG_MEMCG
@@ -3660,7 +3675,11 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 			list_move(&folio->_deferred_list, &list);
 		} else {
 			/* We lost race with folio_put() */
-			mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
+			if (folio_test_partially_mapped(folio)) {
+				__folio_clear_partially_mapped(folio);
+				mod_mthp_stat(folio_order(folio),
+					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
+			}
 			list_del_init(&folio->_deferred_list);
 			ds_queue->split_queue_len--;
 		}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 087a8cb1a6d8..e66da58a365d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4629,7 +4629,8 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 	VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
 			!folio_test_hugetlb(folio) &&
-			!list_empty(&folio->_deferred_list), folio);
+			!list_empty(&folio->_deferred_list) &&
+			folio_test_partially_mapped(folio), folio);
 
 	/*
 	 * Nobody should be changing or seriously looking at
diff --git a/mm/migrate.c b/mm/migrate.c
index d039863e014b..35cc9d35064b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1766,7 +1766,8 @@ static int migrate_pages_batch(struct list_head *from,
 			 * use _deferred_list.
 			 */
 			if (nr_pages > 2 &&
-			   !list_empty(&folio->_deferred_list)) {
+			   !list_empty(&folio->_deferred_list) &&
+			   folio_test_partially_mapped(folio)) {
 				if (!try_split_folio(folio, split_folios, mode)) {
 					nr_failed++;
 					stats->nr_thp_failed += is_thp;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c2ffccf9d213..a82c221b7c2e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -962,8 +962,9 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
 		break;
 	case 2:
 		/* the second tail page: deferred_list overlaps ->mapping */
-		if (unlikely(!list_empty(&folio->_deferred_list))) {
-			bad_page(page, "on deferred list");
+		if (unlikely(!list_empty(&folio->_deferred_list) &&
+		    folio_test_partially_mapped(folio))) {
+			bad_page(page, "partially mapped folio on deferred list");
 			goto out;
 		}
 		break;
diff --git a/mm/rmap.c b/mm/rmap.c
index 78529cf0fd66..a8797d1b3d49 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1579,8 +1579,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 	 * Check partially_mapped first to ensure it is a large folio.
 	 */
 	if (partially_mapped && folio_test_anon(folio) &&
-	    list_empty(&folio->_deferred_list))
-		deferred_split_folio(folio);
+	    !folio_test_partially_mapped(folio))
+		deferred_split_folio(folio, true);
+
 	__folio_mod_stat(folio, -nr, -nr_pmdmapped);
 
 	/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f27792e77a0f..4ca612f7e473 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1238,7 +1238,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					 * Split partially mapped folios right away.
 					 * We can free the unmapped pages without IO.
 					 */
-					if (data_race(!list_empty(&folio->_deferred_list)) &&
+					if (data_race(!list_empty(&folio->_deferred_list) &&
+					    folio_test_partially_mapped(folio)) &&
 					    split_folio_to_list(folio, folio_list))
 						goto activate_locked;
 				}
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v5 5/6] mm: split underused THPs
  2024-08-30 10:03 [PATCH v5 0/6] mm: split underused THPs Usama Arif
                   ` (3 preceding siblings ...)
  2024-08-30 10:03 ` [PATCH v5 4/6] mm: Introduce a pageflag for partially mapped folios Usama Arif
@ 2024-08-30 10:03 ` Usama Arif
  2024-08-30 10:03 ` [PATCH v5 6/6] mm: add sysfs entry to disable splitting " Usama Arif
  5 siblings, 0 replies; 33+ messages in thread
From: Usama Arif @ 2024-08-30 10:03 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, npache,
	baohua, ryan.roberts, rppt, willy, cerasuolodomenico, ryncsn,
	corbet, linux-kernel, linux-doc, kernel-team, Usama Arif

This is an attempt to mitigate the issue of running out of memory when
THP is always enabled. During runtime whenever a THP is being faulted in
(__do_huge_pmd_anonymous_page) or collapsed by khugepaged
(collapse_huge_page), the THP is added to  _deferred_list. Whenever
memory reclaim happens in linux, the kernel runs the deferred_split
shrinker which goes through the _deferred_list.

If the folio was partially mapped, the shrinker attempts to split it.
If the folio is not partially mapped, the shrinker checks if the THP
was underused, i.e. how many of the base 4K pages of the entire THP
were zero-filled. If this number goes above a certain threshold (decided by
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none), the
shrinker will attempt to split that THP. Then at remap time, the pages
that were zero-filled are mapped to the shared zeropage, hence saving
memory.

Suggested-by: Rik van Riel <riel@surriel.com>
Co-authored-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  6 +++
 include/linux/khugepaged.h                 |  1 +
 include/linux/vm_event_item.h              |  1 +
 mm/huge_memory.c                           | 60 +++++++++++++++++++++-
 mm/khugepaged.c                            |  3 +-
 mm/vmstat.c                                |  1 +
 6 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 56a086900651..aca0cff852b8 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -471,6 +471,12 @@ thp_deferred_split_page
 	splitting it would free up some memory. Pages on split queue are
 	going to be split under memory pressure.
 
+thp_underused_split_page
+	is incremented when a huge page on the split queue was split
+	because it was underused. A THP is underused if the number of
+	zero pages in the THP is above a certain threshold
+	(/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none).
+
 thp_split_pmd
 	is incremented every time a PMD split into table of PTEs.
 	This can happen, for instance, when application calls mprotect() or
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index f68865e19b0b..30baae91b225 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -4,6 +4,7 @@
 
 #include <linux/sched/coredump.h> /* MMF_VM_HUGEPAGE */
 
+extern unsigned int khugepaged_max_ptes_none __read_mostly;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern struct attribute_group khugepaged_attr_group;
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index aae5c7c5cfb4..aed952d04132 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -105,6 +105,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_SPLIT_PAGE,
 		THP_SPLIT_PAGE_FAILED,
 		THP_DEFERRED_SPLIT_PAGE,
+		THP_UNDERUSED_SPLIT_PAGE,
 		THP_SPLIT_PMD,
 		THP_SCAN_EXCEED_NONE_PTE,
 		THP_SCAN_EXCEED_SWAP_PTE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 166f8810f3c6..a97aeffc55d6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1187,6 +1187,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm_inc_nr_ptes(vma->vm_mm);
+		deferred_split_folio(folio, false);
 		spin_unlock(vmf->ptl);
 		count_vm_event(THP_FAULT_ALLOC);
 		count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
@@ -3652,6 +3653,39 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
 	return READ_ONCE(ds_queue->split_queue_len);
 }
 
+static bool thp_underused(struct folio *folio)
+{
+	int num_zero_pages = 0, num_filled_pages = 0;
+	void *kaddr;
+	int i;
+
+	if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
+		return false;
+
+	for (i = 0; i < folio_nr_pages(folio); i++) {
+		kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
+		if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
+			num_zero_pages++;
+			if (num_zero_pages > khugepaged_max_ptes_none) {
+				kunmap_local(kaddr);
+				return true;
+			}
+		} else {
+			/*
+			 * Another path for early exit once the number
+			 * of non-zero filled pages exceeds threshold.
+			 */
+			num_filled_pages++;
+			if (num_filled_pages >= HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+				kunmap_local(kaddr);
+				return false;
+			}
+		}
+		kunmap_local(kaddr);
+	}
+	return false;
+}
+
 static unsigned long deferred_split_scan(struct shrinker *shrink,
 		struct shrink_control *sc)
 {
@@ -3689,13 +3723,35 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
 	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+		bool did_split = false;
+		bool underused = false;
+
+		if (!folio_test_partially_mapped(folio)) {
+			underused = thp_underused(folio);
+			if (!underused)
+				goto next;
+		}
 		if (!folio_trylock(folio))
 			goto next;
-		/* split_huge_page() removes page from list on success */
-		if (!split_folio(folio))
+		if (!split_folio(folio)) {
+			did_split = true;
+			if (underused)
+				count_vm_event(THP_UNDERUSED_SPLIT_PAGE);
 			split++;
+		}
 		folio_unlock(folio);
 next:
+		/*
+		 * split_folio() removes folio from list on success.
+		 * Only add back to the queue if folio is partially mapped.
+		 * If thp_underused returns false, or if split_folio fails
+		 * in the case it was underused, then consider it used and
+		 * don't add it back to split_queue.
+		 */
+		if (!did_split && !folio_test_partially_mapped(folio)) {
+			list_del_init(&folio->_deferred_list);
+			ds_queue->split_queue_len--;
+		}
 		folio_put(folio);
 	}
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5bfb5594c604..bf1734e8e665 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -85,7 +85,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
  *
  * Note that these are only respected if collapse was initiated by khugepaged.
  */
-static unsigned int khugepaged_max_ptes_none __read_mostly;
+unsigned int khugepaged_max_ptes_none __read_mostly;
 static unsigned int khugepaged_max_ptes_swap __read_mostly;
 static unsigned int khugepaged_max_ptes_shared __read_mostly;
 
@@ -1237,6 +1237,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
+	deferred_split_folio(folio, false);
 	spin_unlock(pmd_ptl);
 
 	folio = NULL;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f41984dc856f..bb081ae4d0ae 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1385,6 +1385,7 @@ const char * const vmstat_text[] = {
 	"thp_split_page",
 	"thp_split_page_failed",
 	"thp_deferred_split_page",
+	"thp_underused_split_page",
 	"thp_split_pmd",
 	"thp_scan_exceed_none_pte",
 	"thp_scan_exceed_swap_pte",
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v5 6/6] mm: add sysfs entry to disable splitting underused THPs
  2024-08-30 10:03 [PATCH v5 0/6] mm: split underused THPs Usama Arif
                   ` (4 preceding siblings ...)
  2024-08-30 10:03 ` [PATCH v5 5/6] mm: split underused THPs Usama Arif
@ 2024-08-30 10:03 ` Usama Arif
  5 siblings, 0 replies; 33+ messages in thread
From: Usama Arif @ 2024-08-30 10:03 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, npache,
	baohua, ryan.roberts, rppt, willy, cerasuolodomenico, ryncsn,
	corbet, linux-kernel, linux-doc, kernel-team, Usama Arif

If disabled, THPs faulted in or collapsed will not be added to
_deferred_list, and therefore won't be considered for splitting under
memory pressure if underused.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 10 +++++++++
 mm/huge_memory.c                           | 26 ++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index aca0cff852b8..cfdd16a52e39 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -202,6 +202,16 @@ PMD-mappable transparent hugepage::
 
 	cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
 
+All THPs at fault and collapse time will be added to _deferred_list,
+and will therefore be split under memory presure if they are considered
+"underused". A THP is underused if the number of zero-filled pages in
+the THP is above max_ptes_none (see below). It is possible to disable
+this behaviour by writing 0 to shrink_underused, and enable it by writing
+1 to it::
+
+	echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
+	echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
+
 khugepaged will be automatically started when PMD-sized THP is enabled
 (either of the per-size anon control or the top-level control are set
 to "always" or "madvise"), and it'll be automatically shutdown when
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a97aeffc55d6..0993dfe9ae94 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -74,6 +74,7 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
 					  struct shrink_control *sc);
 static unsigned long deferred_split_scan(struct shrinker *shrink,
 					 struct shrink_control *sc);
+static bool split_underused_thp = true;
 
 static atomic_t huge_zero_refcount;
 struct folio *huge_zero_folio __read_mostly;
@@ -440,6 +441,27 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj,
 static struct kobj_attribute hpage_pmd_size_attr =
 	__ATTR_RO(hpage_pmd_size);
 
+static ssize_t split_underused_thp_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", split_underused_thp);
+}
+
+static ssize_t split_underused_thp_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	int err = kstrtobool(buf, &split_underused_thp);
+
+	if (err < 0)
+		return err;
+
+	return count;
+}
+
+static struct kobj_attribute split_underused_thp_attr = __ATTR(
+	shrink_underused, 0644, split_underused_thp_show, split_underused_thp_store);
+
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
 	&defrag_attr.attr,
@@ -448,6 +470,7 @@ static struct attribute *hugepage_attr[] = {
 #ifdef CONFIG_SHMEM
 	&shmem_enabled_attr.attr,
 #endif
+	&split_underused_thp_attr.attr,
 	NULL,
 };
 
@@ -3601,6 +3624,9 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
 	if (folio_order(folio) <= 1)
 		return;
 
+	if (!partially_mapped && !split_underused_thp)
+		return;
+
 	/*
 	 * The try_to_unmap() in page reclaim path might reach here too,
 	 * this may cause a race condition to corrupt deferred split queue.
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 1/6] mm: free zapped tail pages when splitting isolated thp
  2024-08-30 10:03 ` [PATCH v5 1/6] mm: free zapped tail pages when splitting isolated thp Usama Arif
@ 2024-09-05  8:46   ` Hugh Dickins
  2024-09-05 10:21     ` Usama Arif
  0 siblings, 1 reply; 33+ messages in thread
From: Hugh Dickins @ 2024-09-05  8:46 UTC (permalink / raw)
  To: Andrew Morton, Usama Arif, Yu Zhao
  Cc: linux-mm, hannes, riel, shakeel.butt, roman.gushchin, david,
	npache, baohua, ryan.roberts, rppt, willy, cerasuolodomenico,
	ryncsn, corbet, linux-kernel, linux-doc, kernel-team, Shuang Zhai

On Fri, 30 Aug 2024, Usama Arif wrote:

> From: Yu Zhao <yuzhao@google.com>
> 
> If a tail page has only two references left, one inherited from the
> isolation of its head and the other from lru_add_page_tail() which we
> are about to drop, it means this tail page was concurrently zapped.
> Then we can safely free it and save page reclaim or migration the
> trouble of trying it.
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Tested-by: Shuang Zhai <zhais@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>

I'm sorry, but I think this patch (just this 1/6) needs to be dropped:
it is only an optimization, and unless a persuasive performance case
can be made to extend it, it ought to go (perhaps revisited later).

The problem I kept hitting was that all my work, requiring compaction and
reclaim, got (killably) stuck in or repeatedly calling reclaim_throttle():
because nr_isolated_anon had grown high - and remained high even when the
load had all been killed.

Bisection led to the 2/6 (remap to shared zeropage), but I'd say this 1/6
is the one to blame. I was intending to send this patch to "fix" it:

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3295,6 +3295,8 @@ static void __split_huge_page(struct pag
 			folio_clear_active(new_folio);
 			folio_clear_unevictable(new_folio);
 			list_del(&new_folio->lru);
+			node_stat_sub_folio(folio, NR_ISOLATED_ANON +
+						folio_is_file_lru(folio));
 			if (!folio_batch_add(&free_folios, new_folio)) {
 				mem_cgroup_uncharge_folios(&free_folios);
 				free_unref_folios(&free_folios);

And that ran nicely, until I terminated the run and did
grep nr_isolated /proc/sys/vm/stat_refresh /proc/vmstat
at the end: stat_refresh kindly left a pr_warn in dmesg to say
nr_isolated_anon -334013737

My patch is not good enough. IIUC, some split_huge_pagers (reclaim?)
know how many pages they isolated and decremented the stats by, and
increment by that same number at the end; whereas other split_huge_pagers
(migration?) decrement one by one as they go through the list afterwards.

I've run out of time (I'm about to take a break): I gave up researching
who needs what, and was already feeling this optimization does too much
second guessing of what's needed (and its array of VM_WARN_ON_ONCE_FOLIOs
rather admits to that).

And I don't think it's as simple as moving the node_stat_sub_folio()
into 2/6 where the zero pte is substituted: that would probably handle
the vast majority of cases, but aren't there others which pass the
folio_ref_freeze(new_folio, 2) test - the title's zapped tail pages,
or racily truncated now that the folio has been unlocked, for example?

Hugh

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 1/6] mm: free zapped tail pages when splitting isolated thp
  2024-09-05  8:46   ` Hugh Dickins
@ 2024-09-05 10:21     ` Usama Arif
  2024-09-05 18:05       ` Hugh Dickins
  0 siblings, 1 reply; 33+ messages in thread
From: Usama Arif @ 2024-09-05 10:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton, Yu Zhao
  Cc: linux-mm, hannes, riel, shakeel.butt, roman.gushchin, david,
	npache, baohua, ryan.roberts, rppt, willy, cerasuolodomenico,
	ryncsn, corbet, linux-kernel, linux-doc, kernel-team, Shuang Zhai



On 05/09/2024 09:46, Hugh Dickins wrote:
> On Fri, 30 Aug 2024, Usama Arif wrote:
> 
>> From: Yu Zhao <yuzhao@google.com>
>>
>> If a tail page has only two references left, one inherited from the
>> isolation of its head and the other from lru_add_page_tail() which we
>> are about to drop, it means this tail page was concurrently zapped.
>> Then we can safely free it and save page reclaim or migration the
>> trouble of trying it.
>>
>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>> Tested-by: Shuang Zhai <zhais@google.com>
>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> 
> I'm sorry, but I think this patch (just this 1/6) needs to be dropped:
> it is only an optimization, and unless a persuasive performance case
> can be made to extend it, it ought to go (perhaps revisited later).
> 

I am ok for patch 1 only to be dropped. Patches 2-6 are not dependent on it.

Its an optimization and underused shrinker doesn't depend on it.
Its possible that folio->new_folio below might fix it? But if it doesn't,
I can retry later on to make this work and resend it only if it alone shows
a significant performance improvement.

Thanks a lot for debugging this! and sorry it caused an issue.


> The problem I kept hitting was that all my work, requiring compaction and
> reclaim, got (killably) stuck in or repeatedly calling reclaim_throttle():
> because nr_isolated_anon had grown high - and remained high even when the
> load had all been killed.
> 
> Bisection led to the 2/6 (remap to shared zeropage), but I'd say this 1/6
> is the one to blame. I was intending to send this patch to "fix" it:
> 
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3295,6 +3295,8 @@ static void __split_huge_page(struct pag
>  			folio_clear_active(new_folio);
>  			folio_clear_unevictable(new_folio);
>  			list_del(&new_folio->lru);
> +			node_stat_sub_folio(folio, NR_ISOLATED_ANON +
> +						folio_is_file_lru(folio));

Maybe this should have been below? (Notice the folio->new_folio)

+			node_stat_sub_folio(new_folio, NR_ISOLATED_ANON +
+						folio_is_file_lru(new_folio));

>  			if (!folio_batch_add(&free_folios, new_folio)) {
>  				mem_cgroup_uncharge_folios(&free_folios);
>  				free_unref_folios(&free_folios);
> 
> And that ran nicely, until I terminated the run and did
> grep nr_isolated /proc/sys/vm/stat_refresh /proc/vmstat
> at the end: stat_refresh kindly left a pr_warn in dmesg to say
> nr_isolated_anon -334013737
> 
> My patch is not good enough. IIUC, some split_huge_pagers (reclaim?)
> know how many pages they isolated and decremented the stats by, and
> increment by that same number at the end; whereas other split_huge_pagers
> (migration?) decrement one by one as they go through the list afterwards.
> 
> I've run out of time (I'm about to take a break): I gave up researching
> who needs what, and was already feeling this optimization does too much
> second guessing of what's needed (and its array of VM_WARN_ON_ONCE_FOLIOs
> rather admits to that).
> 
> And I don't think it's as simple as moving the node_stat_sub_folio()
> into 2/6 where the zero pte is substituted: that would probably handle
> the vast majority of cases, but aren't there others which pass the
> folio_ref_freeze(new_folio, 2) test - the title's zapped tail pages,
> or racily truncated now that the folio has been unlocked, for example?
> 
> Hugh



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 1/6] mm: free zapped tail pages when splitting isolated thp
  2024-09-05 10:21     ` Usama Arif
@ 2024-09-05 18:05       ` Hugh Dickins
  2024-09-05 19:24         ` Usama Arif
  0 siblings, 1 reply; 33+ messages in thread
From: Hugh Dickins @ 2024-09-05 18:05 UTC (permalink / raw)
  To: Usama Arif
  Cc: Hugh Dickins, Andrew Morton, Yu Zhao, linux-mm, hannes, riel,
	shakeel.butt, roman.gushchin, david, npache, baohua, ryan.roberts,
	rppt, willy, cerasuolodomenico, ryncsn, corbet, linux-kernel,
	linux-doc, kernel-team, Shuang Zhai

On Thu, 5 Sep 2024, Usama Arif wrote:
> On 05/09/2024 09:46, Hugh Dickins wrote:
> > On Fri, 30 Aug 2024, Usama Arif wrote:
> > 
> >> From: Yu Zhao <yuzhao@google.com>
> >>
> >> If a tail page has only two references left, one inherited from the
> >> isolation of its head and the other from lru_add_page_tail() which we
> >> are about to drop, it means this tail page was concurrently zapped.
> >> Then we can safely free it and save page reclaim or migration the
> >> trouble of trying it.
> >>
> >> Signed-off-by: Yu Zhao <yuzhao@google.com>
> >> Tested-by: Shuang Zhai <zhais@google.com>
> >> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> > 
> > I'm sorry, but I think this patch (just this 1/6) needs to be dropped:
> > it is only an optimization, and unless a persuasive performance case
> > can be made to extend it, it ought to go (perhaps revisited later).
> > 
> 
> I am ok for patch 1 only to be dropped. Patches 2-6 are not dependent on it.
> 
> Its an optimization and underused shrinker doesn't depend on it.
> Its possible that folio->new_folio below might fix it? But if it doesn't,
> I can retry later on to make this work and resend it only if it alone shows
> a significant performance improvement.
> 
> Thanks a lot for debugging this! and sorry it caused an issue.
> 
> 
> > The problem I kept hitting was that all my work, requiring compaction and
> > reclaim, got (killably) stuck in or repeatedly calling reclaim_throttle():
> > because nr_isolated_anon had grown high - and remained high even when the
> > load had all been killed.
> > 
> > Bisection led to the 2/6 (remap to shared zeropage), but I'd say this 1/6
> > is the one to blame. I was intending to send this patch to "fix" it:
> > 
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3295,6 +3295,8 @@ static void __split_huge_page(struct pag
> >  			folio_clear_active(new_folio);
> >  			folio_clear_unevictable(new_folio);
> >  			list_del(&new_folio->lru);
> > +			node_stat_sub_folio(folio, NR_ISOLATED_ANON +
> > +						folio_is_file_lru(folio));
> 
> Maybe this should have been below? (Notice the folio->new_folio)
> 
> +			node_stat_sub_folio(new_folio, NR_ISOLATED_ANON +
> +						folio_is_file_lru(new_folio));

Yes, how very stupid of me (I'm well aware of the earlier bugfixes here,
and ought not to have done a blind copy and paste of those lines): thanks.

However... it makes no difference. I gave yours a run, expecting a
much smaller negative number, but actually it works out much the same:
because, of course, by this point in the code "folio" is left pointing
to a small folio, and is almost equivalent to new_folio for the
node_stat calculations.

(I say "almost" because I guess there's a chance that the page at
folio got reused as part of a larger folio meanwhile, which would
throw the stats off (if they weren't off already).)

So, even with your fix to my fix, this code doesn't work.
Whereas a revert of this 1/6 does work: nr_isolated_anon and
nr_isolated_file come to 0 when the system is at rest, as expected
(and as silence from vmstat_refresh confirms - /proc/vmstat itself
presents negative stats as 0, in order to hide transient oddities).

Hugh

> 
> >  			if (!folio_batch_add(&free_folios, new_folio)) {
> >  				mem_cgroup_uncharge_folios(&free_folios);
> >  				free_unref_folios(&free_folios);
> > 
> > And that ran nicely, until I terminated the run and did
> > grep nr_isolated /proc/sys/vm/stat_refresh /proc/vmstat
> > at the end: stat_refresh kindly left a pr_warn in dmesg to say
> > nr_isolated_anon -334013737
> > 
> > My patch is not good enough. IIUC, some split_huge_pagers (reclaim?)
> > know how many pages they isolated and decremented the stats by, and
> > increment by that same number at the end; whereas other split_huge_pagers
> > (migration?) decrement one by one as they go through the list afterwards.
> > 
> > I've run out of time (I'm about to take a break): I gave up researching
> > who needs what, and was already feeling this optimization does too much
> > second guessing of what's needed (and its array of VM_WARN_ON_ONCE_FOLIOs
> > rather admits to that).
> > 
> > And I don't think it's as simple as moving the node_stat_sub_folio()
> > into 2/6 where the zero pte is substituted: that would probably handle
> > the vast majority of cases, but aren't there others which pass the
> > folio_ref_freeze(new_folio, 2) test - the title's zapped tail pages,
> > or racily truncated now that the folio has been unlocked, for example?
> > 
> > Hugh


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 1/6] mm: free zapped tail pages when splitting isolated thp
  2024-09-05 18:05       ` Hugh Dickins
@ 2024-09-05 19:24         ` Usama Arif
  0 siblings, 0 replies; 33+ messages in thread
From: Usama Arif @ 2024-09-05 19:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Yu Zhao, linux-mm, hannes, riel, shakeel.butt,
	roman.gushchin, david, npache, baohua, ryan.roberts, rppt, willy,
	cerasuolodomenico, ryncsn, corbet, linux-kernel, linux-doc,
	kernel-team, Shuang Zhai



On 05/09/2024 19:05, Hugh Dickins wrote:
> On Thu, 5 Sep 2024, Usama Arif wrote:
>> On 05/09/2024 09:46, Hugh Dickins wrote:
>>> On Fri, 30 Aug 2024, Usama Arif wrote:
>>>
>>>> From: Yu Zhao <yuzhao@google.com>
>>>>
>>>> If a tail page has only two references left, one inherited from the
>>>> isolation of its head and the other from lru_add_page_tail() which we
>>>> are about to drop, it means this tail page was concurrently zapped.
>>>> Then we can safely free it and save page reclaim or migration the
>>>> trouble of trying it.
>>>>
>>>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>>>> Tested-by: Shuang Zhai <zhais@google.com>
>>>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>
>>> I'm sorry, but I think this patch (just this 1/6) needs to be dropped:
>>> it is only an optimization, and unless a persuasive performance case
>>> can be made to extend it, it ought to go (perhaps revisited later).
>>>
>>
>> I am ok for patch 1 only to be dropped. Patches 2-6 are not dependent on it.
>>
>> Its an optimization and underused shrinker doesn't depend on it.
>> Its possible that folio->new_folio below might fix it? But if it doesn't,
>> I can retry later on to make this work and resend it only if it alone shows
>> a significant performance improvement.
>>
>> Thanks a lot for debugging this! and sorry it caused an issue.
>>
>>
>>> The problem I kept hitting was that all my work, requiring compaction and
>>> reclaim, got (killably) stuck in or repeatedly calling reclaim_throttle():
>>> because nr_isolated_anon had grown high - and remained high even when the
>>> load had all been killed.
>>>
>>> Bisection led to the 2/6 (remap to shared zeropage), but I'd say this 1/6
>>> is the one to blame. I was intending to send this patch to "fix" it:
>>>
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3295,6 +3295,8 @@ static void __split_huge_page(struct pag
>>>  			folio_clear_active(new_folio);
>>>  			folio_clear_unevictable(new_folio);
>>>  			list_del(&new_folio->lru);
>>> +			node_stat_sub_folio(folio, NR_ISOLATED_ANON +
>>> +						folio_is_file_lru(folio));
>>
>> Maybe this should have been below? (Notice the folio->new_folio)
>>
>> +			node_stat_sub_folio(new_folio, NR_ISOLATED_ANON +
>> +						folio_is_file_lru(new_folio));
> 
> Yes, how very stupid of me (I'm well aware of the earlier bugfixes here,
> and ought not to have done a blind copy and paste of those lines): thanks.
> 
> However... it makes no difference. I gave yours a run, expecting a
> much smaller negative number, but actually it works out much the same:
> because, of course, by this point in the code "folio" is left pointing
> to a small folio, and is almost equivalent to new_folio for the
> node_stat calculations.
> 
> (I say "almost" because I guess there's a chance that the page at
> folio got reused as part of a larger folio meanwhile, which would
> throw the stats off (if they weren't off already).)
> 
> So, even with your fix to my fix, this code doesn't work.
> Whereas a revert of this 1/6 does work: nr_isolated_anon and
> nr_isolated_file come to 0 when the system is at rest, as expected
> (and as silence from vmstat_refresh confirms - /proc/vmstat itself
> presents negative stats as 0, in order to hide transient oddities).
> 
> Hugh

Thanks for trying. I was hoping you had mTHPs enabled and then
the folio -> new_folio change would have fixed it.

Happy for patch 1 only to be dropped. I can try to figure it out
later and send if I can actually show any performance numbers
for the fixed version on real world cases.

Thanks,
Usama
> 
>>
>>>  			if (!folio_batch_add(&free_folios, new_folio)) {
>>>  				mem_cgroup_uncharge_folios(&free_folios);
>>>  				free_unref_folios(&free_folios);
>>>
>>> And that ran nicely, until I terminated the run and did
>>> grep nr_isolated /proc/sys/vm/stat_refresh /proc/vmstat
>>> at the end: stat_refresh kindly left a pr_warn in dmesg to say
>>> nr_isolated_anon -334013737
>>>
>>> My patch is not good enough. IIUC, some split_huge_pagers (reclaim?)
>>> know how many pages they isolated and decremented the stats by, and
>>> increment by that same number at the end; whereas other split_huge_pagers
>>> (migration?) decrement one by one as they go through the list afterwards.
>>>
>>> I've run out of time (I'm about to take a break): I gave up researching
>>> who needs what, and was already feeling this optimization does too much
>>> second guessing of what's needed (and its array of VM_WARN_ON_ONCE_FOLIOs
>>> rather admits to that).
>>>
>>> And I don't think it's as simple as moving the node_stat_sub_folio()
>>> into 2/6 where the zero pte is substituted: that would probably handle
>>> the vast majority of cases, but aren't there others which pass the
>>> folio_ref_freeze(new_folio, 2) test - the title's zapped tail pages,
>>> or racily truncated now that the folio has been unlocked, for example?
>>>
>>> Hugh



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2024-08-30 10:03 ` [PATCH v5 2/6] mm: remap unused subpages to shared zeropage " Usama Arif
@ 2024-10-23 16:21   ` Zi Yan
  2024-10-23 16:50     ` Usama Arif
  2025-09-18  8:53   ` Qun-wei Lin (林群崴)
  1 sibling, 1 reply; 33+ messages in thread
From: Zi Yan @ 2024-10-23 16:21 UTC (permalink / raw)
  To: Usama Arif
  Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
	yuzhao, david, npache, baohua, ryan.roberts, rppt, willy,
	cerasuolodomenico, ryncsn, corbet, linux-kernel, linux-doc,
	kernel-team, Shuang Zhai

[-- Attachment #1: Type: text/plain, Size: 9206 bytes --]

On 30 Aug 2024, at 6:03, Usama Arif wrote:

> From: Yu Zhao <yuzhao@google.com>
>
> Here being unused means containing only zeros and inaccessible to
> userspace. When splitting an isolated thp under reclaim or migration,
> the unused subpages can be mapped to the shared zeropage, hence saving
> memory. This is particularly helpful when the internal
> fragmentation of a thp is high, i.e. it has many untouched subpages.
>
> This is also a prerequisite for THP low utilization shrinker which will
> be introduced in later patches, where underutilized THPs are split, and
> the zero-filled pages are freed saving memory.
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Tested-by: Shuang Zhai <zhais@google.com>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
>  include/linux/rmap.h |  7 ++++-
>  mm/huge_memory.c     |  8 ++---
>  mm/migrate.c         | 72 ++++++++++++++++++++++++++++++++++++++------
>  mm/migrate_device.c  |  4 +--
>  4 files changed, 75 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 91b5935e8485..d5e93e44322e 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -745,7 +745,12 @@ int folio_mkclean(struct folio *);
>  int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
>  		      struct vm_area_struct *vma);
>
> -void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked);
> +enum rmp_flags {
> +	RMP_LOCKED		= 1 << 0,
> +	RMP_USE_SHARED_ZEROPAGE	= 1 << 1,
> +};
> +
> +void remove_migration_ptes(struct folio *src, struct folio *dst, int flags);
>
>  /*
>   * rmap_walk_control: To control rmap traversing for specific needs
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 0c48806ccb9a..af60684e7c70 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3020,7 +3020,7 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
>  	return false;
>  }
>
> -static void remap_page(struct folio *folio, unsigned long nr)
> +static void remap_page(struct folio *folio, unsigned long nr, int flags)
>  {
>  	int i = 0;
>
> @@ -3028,7 +3028,7 @@ static void remap_page(struct folio *folio, unsigned long nr)
>  	if (!folio_test_anon(folio))
>  		return;
>  	for (;;) {
> -		remove_migration_ptes(folio, folio, true);
> +		remove_migration_ptes(folio, folio, RMP_LOCKED | flags);
>  		i += folio_nr_pages(folio);
>  		if (i >= nr)
>  			break;
> @@ -3240,7 +3240,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>
>  	if (nr_dropped)
>  		shmem_uncharge(folio->mapping->host, nr_dropped);
> -	remap_page(folio, nr);
> +	remap_page(folio, nr, PageAnon(head) ? RMP_USE_SHARED_ZEROPAGE : 0);
>
>  	/*
>  	 * set page to its compound_head when split to non order-0 pages, so
> @@ -3542,7 +3542,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		if (mapping)
>  			xas_unlock(&xas);
>  		local_irq_enable();
> -		remap_page(folio, folio_nr_pages(folio));
> +		remap_page(folio, folio_nr_pages(folio), 0);
>  		ret = -EAGAIN;
>  	}
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6f9c62c746be..d039863e014b 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -204,13 +204,57 @@ bool isolate_folio_to_list(struct folio *folio, struct list_head *list)
>  	return true;
>  }
>
> +static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
> +					  struct folio *folio,
> +					  unsigned long idx)
> +{
> +	struct page *page = folio_page(folio, idx);
> +	bool contains_data;
> +	pte_t newpte;
> +	void *addr;
> +
> +	VM_BUG_ON_PAGE(PageCompound(page), page);

This should be:

diff --git a/mm/migrate.c b/mm/migrate.c
index e950fd62607f..7ffdbe078aa7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -206,7 +206,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
        pte_t newpte;
        void *addr;

-       VM_BUG_ON_PAGE(PageCompound(page), page);
+       if (PageCompound(page))
+               return false;
        VM_BUG_ON_PAGE(!PageAnon(page), page);
        VM_BUG_ON_PAGE(!PageLocked(page), page);
        VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);

Otherwise, splitting anonymous large folios to non order-0 ones just
triggers this BUG_ON.

> +	VM_BUG_ON_PAGE(!PageAnon(page), page);
> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> +	VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
> +
> +	if (folio_test_mlocked(folio) || (pvmw->vma->vm_flags & VM_LOCKED) ||
> +	    mm_forbids_zeropage(pvmw->vma->vm_mm))
> +		return false;
> +
> +	/*
> +	 * The pmd entry mapping the old thp was flushed and the pte mapping
> +	 * this subpage has been non present. If the subpage is only zero-filled
> +	 * then map it to the shared zeropage.
> +	 */
> +	addr = kmap_local_page(page);
> +	contains_data = memchr_inv(addr, 0, PAGE_SIZE);
> +	kunmap_local(addr);
> +
> +	if (contains_data)
> +		return false;
> +
> +	newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
> +					pvmw->vma->vm_page_prot));
> +	set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
> +
> +	dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
> +	return true;
> +}
> +
> +struct rmap_walk_arg {
> +	struct folio *folio;
> +	bool map_unused_to_zeropage;
> +};
> +
>  /*
>   * Restore a potential migration pte to a working pte entry
>   */
>  static bool remove_migration_pte(struct folio *folio,
> -		struct vm_area_struct *vma, unsigned long addr, void *old)
> +		struct vm_area_struct *vma, unsigned long addr, void *arg)
>  {
> -	DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
> +	struct rmap_walk_arg *rmap_walk_arg = arg;
> +	DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
>
>  	while (page_vma_mapped_walk(&pvmw)) {
>  		rmap_t rmap_flags = RMAP_NONE;
> @@ -234,6 +278,9 @@ static bool remove_migration_pte(struct folio *folio,
>  			continue;
>  		}
>  #endif
> +		if (rmap_walk_arg->map_unused_to_zeropage &&
> +		    try_to_map_unused_to_zeropage(&pvmw, folio, idx))
> +			continue;
>
>  		folio_get(folio);
>  		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
> @@ -312,14 +359,21 @@ static bool remove_migration_pte(struct folio *folio,
>   * Get rid of all migration entries and replace them by
>   * references to the indicated page.
>   */
> -void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked)
> +void remove_migration_ptes(struct folio *src, struct folio *dst, int flags)
>  {
> +	struct rmap_walk_arg rmap_walk_arg = {
> +		.folio = src,
> +		.map_unused_to_zeropage = flags & RMP_USE_SHARED_ZEROPAGE,
> +	};
> +
>  	struct rmap_walk_control rwc = {
>  		.rmap_one = remove_migration_pte,
> -		.arg = src,
> +		.arg = &rmap_walk_arg,
>  	};
>
> -	if (locked)
> +	VM_BUG_ON_FOLIO((flags & RMP_USE_SHARED_ZEROPAGE) && (src != dst), src);
> +
> +	if (flags & RMP_LOCKED)
>  		rmap_walk_locked(dst, &rwc);
>  	else
>  		rmap_walk(dst, &rwc);
> @@ -934,7 +988,7 @@ static int writeout(struct address_space *mapping, struct folio *folio)
>  	 * At this point we know that the migration attempt cannot
>  	 * be successful.
>  	 */
> -	remove_migration_ptes(folio, folio, false);
> +	remove_migration_ptes(folio, folio, 0);
>
>  	rc = mapping->a_ops->writepage(&folio->page, &wbc);
>
> @@ -1098,7 +1152,7 @@ static void migrate_folio_undo_src(struct folio *src,
>  				   struct list_head *ret)
>  {
>  	if (page_was_mapped)
> -		remove_migration_ptes(src, src, false);
> +		remove_migration_ptes(src, src, 0);
>  	/* Drop an anon_vma reference if we took one */
>  	if (anon_vma)
>  		put_anon_vma(anon_vma);
> @@ -1336,7 +1390,7 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
>  		lru_add_drain();
>
>  	if (old_page_state & PAGE_WAS_MAPPED)
> -		remove_migration_ptes(src, dst, false);
> +		remove_migration_ptes(src, dst, 0);
>
>  out_unlock_both:
>  	folio_unlock(dst);
> @@ -1474,7 +1528,7 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
>
>  	if (page_was_mapped)
>  		remove_migration_ptes(src,
> -			rc == MIGRATEPAGE_SUCCESS ? dst : src, false);
> +			rc == MIGRATEPAGE_SUCCESS ? dst : src, 0);
>
>  unlock_put_anon:
>  	folio_unlock(dst);
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 8d687de88a03..9cf26592ac93 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -424,7 +424,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
>  			continue;
>
>  		folio = page_folio(page);
> -		remove_migration_ptes(folio, folio, false);
> +		remove_migration_ptes(folio, folio, 0);
>
>  		src_pfns[i] = 0;
>  		folio_unlock(folio);
> @@ -840,7 +840,7 @@ void migrate_device_finalize(unsigned long *src_pfns,
>  			dst = src;
>  		}
>
> -		remove_migration_ptes(src, dst, false);
> +		remove_migration_ptes(src, dst, 0);
>  		folio_unlock(src);
>
>  		if (folio_is_zone_device(src))
> -- 
> 2.43.5

Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2024-10-23 16:21   ` Zi Yan
@ 2024-10-23 16:50     ` Usama Arif
  2024-10-23 16:55       ` Zi Yan
  2024-10-23 16:56       ` Yu Zhao
  0 siblings, 2 replies; 33+ messages in thread
From: Usama Arif @ 2024-10-23 16:50 UTC (permalink / raw)
  To: Zi Yan, yuzhao
  Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin, david,
	npache, baohua, ryan.roberts, rppt, willy, cerasuolodomenico,
	ryncsn, corbet, linux-kernel, linux-doc, kernel-team, Shuang Zhai



On 23/10/2024 17:21, Zi Yan wrote:
> On 30 Aug 2024, at 6:03, Usama Arif wrote:
> 
>> From: Yu Zhao <yuzhao@google.com>
>>
>> Here being unused means containing only zeros and inaccessible to
>> userspace. When splitting an isolated thp under reclaim or migration,
>> the unused subpages can be mapped to the shared zeropage, hence saving
>> memory. This is particularly helpful when the internal
>> fragmentation of a thp is high, i.e. it has many untouched subpages.
>>
>> This is also a prerequisite for THP low utilization shrinker which will
>> be introduced in later patches, where underutilized THPs are split, and
>> the zero-filled pages are freed saving memory.
>>
>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>> Tested-by: Shuang Zhai <zhais@google.com>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>>  include/linux/rmap.h |  7 ++++-
>>  mm/huge_memory.c     |  8 ++---
>>  mm/migrate.c         | 72 ++++++++++++++++++++++++++++++++++++++------
>>  mm/migrate_device.c  |  4 +--
>>  4 files changed, 75 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index 91b5935e8485..d5e93e44322e 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -745,7 +745,12 @@ int folio_mkclean(struct folio *);
>>  int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
>>  		      struct vm_area_struct *vma);
>>
>> -void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked);
>> +enum rmp_flags {
>> +	RMP_LOCKED		= 1 << 0,
>> +	RMP_USE_SHARED_ZEROPAGE	= 1 << 1,
>> +};
>> +
>> +void remove_migration_ptes(struct folio *src, struct folio *dst, int flags);
>>
>>  /*
>>   * rmap_walk_control: To control rmap traversing for specific needs
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 0c48806ccb9a..af60684e7c70 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3020,7 +3020,7 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
>>  	return false;
>>  }
>>
>> -static void remap_page(struct folio *folio, unsigned long nr)
>> +static void remap_page(struct folio *folio, unsigned long nr, int flags)
>>  {
>>  	int i = 0;
>>
>> @@ -3028,7 +3028,7 @@ static void remap_page(struct folio *folio, unsigned long nr)
>>  	if (!folio_test_anon(folio))
>>  		return;
>>  	for (;;) {
>> -		remove_migration_ptes(folio, folio, true);
>> +		remove_migration_ptes(folio, folio, RMP_LOCKED | flags);
>>  		i += folio_nr_pages(folio);
>>  		if (i >= nr)
>>  			break;
>> @@ -3240,7 +3240,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>
>>  	if (nr_dropped)
>>  		shmem_uncharge(folio->mapping->host, nr_dropped);
>> -	remap_page(folio, nr);
>> +	remap_page(folio, nr, PageAnon(head) ? RMP_USE_SHARED_ZEROPAGE : 0);
>>
>>  	/*
>>  	 * set page to its compound_head when split to non order-0 pages, so
>> @@ -3542,7 +3542,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>  		if (mapping)
>>  			xas_unlock(&xas);
>>  		local_irq_enable();
>> -		remap_page(folio, folio_nr_pages(folio));
>> +		remap_page(folio, folio_nr_pages(folio), 0);
>>  		ret = -EAGAIN;
>>  	}
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 6f9c62c746be..d039863e014b 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -204,13 +204,57 @@ bool isolate_folio_to_list(struct folio *folio, struct list_head *list)
>>  	return true;
>>  }
>>
>> +static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
>> +					  struct folio *folio,
>> +					  unsigned long idx)
>> +{
>> +	struct page *page = folio_page(folio, idx);
>> +	bool contains_data;
>> +	pte_t newpte;
>> +	void *addr;
>> +
>> +	VM_BUG_ON_PAGE(PageCompound(page), page);
> 
> This should be:
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e950fd62607f..7ffdbe078aa7 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -206,7 +206,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
>         pte_t newpte;
>         void *addr;
> 
> -       VM_BUG_ON_PAGE(PageCompound(page), page);
> +       if (PageCompound(page))
> +               return false;
>         VM_BUG_ON_PAGE(!PageAnon(page), page);
>         VM_BUG_ON_PAGE(!PageLocked(page), page);
>         VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
> 
> Otherwise, splitting anonymous large folios to non order-0 ones just
> triggers this BUG_ON.
> 

That makes sense, would you like to send the fix?

Adding Yu Zhao to "To" incase he has any objections.

Thanks,
Usama

>> +	VM_BUG_ON_PAGE(!PageAnon(page), page);
>> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
>> +	VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
>> +
>> +	if (folio_test_mlocked(folio) || (pvmw->vma->vm_flags & VM_LOCKED) ||
>> +	    mm_forbids_zeropage(pvmw->vma->vm_mm))
>> +		return false;
>> +
>> +	/*
>> +	 * The pmd entry mapping the old thp was flushed and the pte mapping
>> +	 * this subpage has been non present. If the subpage is only zero-filled
>> +	 * then map it to the shared zeropage.
>> +	 */
>> +	addr = kmap_local_page(page);
>> +	contains_data = memchr_inv(addr, 0, PAGE_SIZE);
>> +	kunmap_local(addr);
>> +
>> +	if (contains_data)
>> +		return false;
>> +
>> +	newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
>> +					pvmw->vma->vm_page_prot));
>> +	set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
>> +
>> +	dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
>> +	return true;
>> +}
>> +
>> +struct rmap_walk_arg {
>> +	struct folio *folio;
>> +	bool map_unused_to_zeropage;
>> +};
>> +
>>  /*
>>   * Restore a potential migration pte to a working pte entry
>>   */
>>  static bool remove_migration_pte(struct folio *folio,
>> -		struct vm_area_struct *vma, unsigned long addr, void *old)
>> +		struct vm_area_struct *vma, unsigned long addr, void *arg)
>>  {
>> -	DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
>> +	struct rmap_walk_arg *rmap_walk_arg = arg;
>> +	DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
>>
>>  	while (page_vma_mapped_walk(&pvmw)) {
>>  		rmap_t rmap_flags = RMAP_NONE;
>> @@ -234,6 +278,9 @@ static bool remove_migration_pte(struct folio *folio,
>>  			continue;
>>  		}
>>  #endif
>> +		if (rmap_walk_arg->map_unused_to_zeropage &&
>> +		    try_to_map_unused_to_zeropage(&pvmw, folio, idx))
>> +			continue;
>>
>>  		folio_get(folio);
>>  		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
>> @@ -312,14 +359,21 @@ static bool remove_migration_pte(struct folio *folio,
>>   * Get rid of all migration entries and replace them by
>>   * references to the indicated page.
>>   */
>> -void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked)
>> +void remove_migration_ptes(struct folio *src, struct folio *dst, int flags)
>>  {
>> +	struct rmap_walk_arg rmap_walk_arg = {
>> +		.folio = src,
>> +		.map_unused_to_zeropage = flags & RMP_USE_SHARED_ZEROPAGE,
>> +	};
>> +
>>  	struct rmap_walk_control rwc = {
>>  		.rmap_one = remove_migration_pte,
>> -		.arg = src,
>> +		.arg = &rmap_walk_arg,
>>  	};
>>
>> -	if (locked)
>> +	VM_BUG_ON_FOLIO((flags & RMP_USE_SHARED_ZEROPAGE) && (src != dst), src);
>> +
>> +	if (flags & RMP_LOCKED)
>>  		rmap_walk_locked(dst, &rwc);
>>  	else
>>  		rmap_walk(dst, &rwc);
>> @@ -934,7 +988,7 @@ static int writeout(struct address_space *mapping, struct folio *folio)
>>  	 * At this point we know that the migration attempt cannot
>>  	 * be successful.
>>  	 */
>> -	remove_migration_ptes(folio, folio, false);
>> +	remove_migration_ptes(folio, folio, 0);
>>
>>  	rc = mapping->a_ops->writepage(&folio->page, &wbc);
>>
>> @@ -1098,7 +1152,7 @@ static void migrate_folio_undo_src(struct folio *src,
>>  				   struct list_head *ret)
>>  {
>>  	if (page_was_mapped)
>> -		remove_migration_ptes(src, src, false);
>> +		remove_migration_ptes(src, src, 0);
>>  	/* Drop an anon_vma reference if we took one */
>>  	if (anon_vma)
>>  		put_anon_vma(anon_vma);
>> @@ -1336,7 +1390,7 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
>>  		lru_add_drain();
>>
>>  	if (old_page_state & PAGE_WAS_MAPPED)
>> -		remove_migration_ptes(src, dst, false);
>> +		remove_migration_ptes(src, dst, 0);
>>
>>  out_unlock_both:
>>  	folio_unlock(dst);
>> @@ -1474,7 +1528,7 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
>>
>>  	if (page_was_mapped)
>>  		remove_migration_ptes(src,
>> -			rc == MIGRATEPAGE_SUCCESS ? dst : src, false);
>> +			rc == MIGRATEPAGE_SUCCESS ? dst : src, 0);
>>
>>  unlock_put_anon:
>>  	folio_unlock(dst);
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 8d687de88a03..9cf26592ac93 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -424,7 +424,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
>>  			continue;
>>
>>  		folio = page_folio(page);
>> -		remove_migration_ptes(folio, folio, false);
>> +		remove_migration_ptes(folio, folio, 0);
>>
>>  		src_pfns[i] = 0;
>>  		folio_unlock(folio);
>> @@ -840,7 +840,7 @@ void migrate_device_finalize(unsigned long *src_pfns,
>>  			dst = src;
>>  		}
>>
>> -		remove_migration_ptes(src, dst, false);
>> +		remove_migration_ptes(src, dst, 0);
>>  		folio_unlock(src);
>>
>>  		if (folio_is_zone_device(src))
>> -- 
>> 2.43.5
> 
> Best Regards,
> Yan, Zi



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2024-10-23 16:50     ` Usama Arif
@ 2024-10-23 16:55       ` Zi Yan
  2024-10-23 16:56       ` Yu Zhao
  1 sibling, 0 replies; 33+ messages in thread
From: Zi Yan @ 2024-10-23 16:55 UTC (permalink / raw)
  To: Usama Arif
  Cc: yuzhao, akpm, linux-mm, hannes, riel, shakeel.butt,
	roman.gushchin, david, npache, baohua, ryan.roberts, rppt, willy,
	cerasuolodomenico, ryncsn, corbet, linux-kernel, linux-doc,
	kernel-team, Shuang Zhai

[-- Attachment #1: Type: text/plain, Size: 4891 bytes --]

On 23 Oct 2024, at 12:50, Usama Arif wrote:

> On 23/10/2024 17:21, Zi Yan wrote:
>> On 30 Aug 2024, at 6:03, Usama Arif wrote:
>>
>>> From: Yu Zhao <yuzhao@google.com>
>>>
>>> Here being unused means containing only zeros and inaccessible to
>>> userspace. When splitting an isolated thp under reclaim or migration,
>>> the unused subpages can be mapped to the shared zeropage, hence saving
>>> memory. This is particularly helpful when the internal
>>> fragmentation of a thp is high, i.e. it has many untouched subpages.
>>>
>>> This is also a prerequisite for THP low utilization shrinker which will
>>> be introduced in later patches, where underutilized THPs are split, and
>>> the zero-filled pages are freed saving memory.
>>>
>>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>>> Tested-by: Shuang Zhai <zhais@google.com>
>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>> ---
>>>  include/linux/rmap.h |  7 ++++-
>>>  mm/huge_memory.c     |  8 ++---
>>>  mm/migrate.c         | 72 ++++++++++++++++++++++++++++++++++++++------
>>>  mm/migrate_device.c  |  4 +--
>>>  4 files changed, 75 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>> index 91b5935e8485..d5e93e44322e 100644
>>> --- a/include/linux/rmap.h
>>> +++ b/include/linux/rmap.h
>>> @@ -745,7 +745,12 @@ int folio_mkclean(struct folio *);
>>>  int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
>>>  		      struct vm_area_struct *vma);
>>>
>>> -void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked);
>>> +enum rmp_flags {
>>> +	RMP_LOCKED		= 1 << 0,
>>> +	RMP_USE_SHARED_ZEROPAGE	= 1 << 1,
>>> +};
>>> +
>>> +void remove_migration_ptes(struct folio *src, struct folio *dst, int flags);
>>>
>>>  /*
>>>   * rmap_walk_control: To control rmap traversing for specific needs
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 0c48806ccb9a..af60684e7c70 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3020,7 +3020,7 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
>>>  	return false;
>>>  }
>>>
>>> -static void remap_page(struct folio *folio, unsigned long nr)
>>> +static void remap_page(struct folio *folio, unsigned long nr, int flags)
>>>  {
>>>  	int i = 0;
>>>
>>> @@ -3028,7 +3028,7 @@ static void remap_page(struct folio *folio, unsigned long nr)
>>>  	if (!folio_test_anon(folio))
>>>  		return;
>>>  	for (;;) {
>>> -		remove_migration_ptes(folio, folio, true);
>>> +		remove_migration_ptes(folio, folio, RMP_LOCKED | flags);
>>>  		i += folio_nr_pages(folio);
>>>  		if (i >= nr)
>>>  			break;
>>> @@ -3240,7 +3240,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>>
>>>  	if (nr_dropped)
>>>  		shmem_uncharge(folio->mapping->host, nr_dropped);
>>> -	remap_page(folio, nr);
>>> +	remap_page(folio, nr, PageAnon(head) ? RMP_USE_SHARED_ZEROPAGE : 0);
>>>
>>>  	/*
>>>  	 * set page to its compound_head when split to non order-0 pages, so
>>> @@ -3542,7 +3542,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>  		if (mapping)
>>>  			xas_unlock(&xas);
>>>  		local_irq_enable();
>>> -		remap_page(folio, folio_nr_pages(folio));
>>> +		remap_page(folio, folio_nr_pages(folio), 0);
>>>  		ret = -EAGAIN;
>>>  	}
>>>
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 6f9c62c746be..d039863e014b 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -204,13 +204,57 @@ bool isolate_folio_to_list(struct folio *folio, struct list_head *list)
>>>  	return true;
>>>  }
>>>
>>> +static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
>>> +					  struct folio *folio,
>>> +					  unsigned long idx)
>>> +{
>>> +	struct page *page = folio_page(folio, idx);
>>> +	bool contains_data;
>>> +	pte_t newpte;
>>> +	void *addr;
>>> +
>>> +	VM_BUG_ON_PAGE(PageCompound(page), page);
>>
>> This should be:
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index e950fd62607f..7ffdbe078aa7 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -206,7 +206,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
>>         pte_t newpte;
>>         void *addr;
>>
>> -       VM_BUG_ON_PAGE(PageCompound(page), page);
>> +       if (PageCompound(page))
>> +               return false;
>>         VM_BUG_ON_PAGE(!PageAnon(page), page);
>>         VM_BUG_ON_PAGE(!PageLocked(page), page);
>>         VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
>>
>> Otherwise, splitting anonymous large folios to non order-0 ones just
>> triggers this BUG_ON.
>>
>
> That makes sense, would you like to send the fix?
>
> Adding Yu Zhao to "To" incase he has any objections.
>
Sure, will do.

Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2024-10-23 16:50     ` Usama Arif
  2024-10-23 16:55       ` Zi Yan
@ 2024-10-23 16:56       ` Yu Zhao
  1 sibling, 0 replies; 33+ messages in thread
From: Yu Zhao @ 2024-10-23 16:56 UTC (permalink / raw)
  To: Usama Arif
  Cc: Zi Yan, akpm, linux-mm, hannes, riel, shakeel.butt,
	roman.gushchin, david, npache, baohua, ryan.roberts, rppt, willy,
	cerasuolodomenico, ryncsn, corbet, linux-kernel, linux-doc,
	kernel-team, Shuang Zhai

On Wed, Oct 23, 2024 at 10:51 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
> On 23/10/2024 17:21, Zi Yan wrote:
> > On 30 Aug 2024, at 6:03, Usama Arif wrote:
> >
> >> From: Yu Zhao <yuzhao@google.com>
> >>
> >> Here being unused means containing only zeros and inaccessible to
> >> userspace. When splitting an isolated thp under reclaim or migration,
> >> the unused subpages can be mapped to the shared zeropage, hence saving
> >> memory. This is particularly helpful when the internal
> >> fragmentation of a thp is high, i.e. it has many untouched subpages.
> >>
> >> This is also a prerequisite for THP low utilization shrinker which will
> >> be introduced in later patches, where underutilized THPs are split, and
> >> the zero-filled pages are freed saving memory.
> >>
> >> Signed-off-by: Yu Zhao <yuzhao@google.com>
> >> Tested-by: Shuang Zhai <zhais@google.com>
> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >> ---
> >>  include/linux/rmap.h |  7 ++++-
> >>  mm/huge_memory.c     |  8 ++---
> >>  mm/migrate.c         | 72 ++++++++++++++++++++++++++++++++++++++------
> >>  mm/migrate_device.c  |  4 +--
> >>  4 files changed, 75 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> >> index 91b5935e8485..d5e93e44322e 100644
> >> --- a/include/linux/rmap.h
> >> +++ b/include/linux/rmap.h
> >> @@ -745,7 +745,12 @@ int folio_mkclean(struct folio *);
> >>  int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
> >>                    struct vm_area_struct *vma);
> >>
> >> -void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked);
> >> +enum rmp_flags {
> >> +    RMP_LOCKED              = 1 << 0,
> >> +    RMP_USE_SHARED_ZEROPAGE = 1 << 1,
> >> +};
> >> +
> >> +void remove_migration_ptes(struct folio *src, struct folio *dst, int flags);
> >>
> >>  /*
> >>   * rmap_walk_control: To control rmap traversing for specific needs
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 0c48806ccb9a..af60684e7c70 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -3020,7 +3020,7 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
> >>      return false;
> >>  }
> >>
> >> -static void remap_page(struct folio *folio, unsigned long nr)
> >> +static void remap_page(struct folio *folio, unsigned long nr, int flags)
> >>  {
> >>      int i = 0;
> >>
> >> @@ -3028,7 +3028,7 @@ static void remap_page(struct folio *folio, unsigned long nr)
> >>      if (!folio_test_anon(folio))
> >>              return;
> >>      for (;;) {
> >> -            remove_migration_ptes(folio, folio, true);
> >> +            remove_migration_ptes(folio, folio, RMP_LOCKED | flags);
> >>              i += folio_nr_pages(folio);
> >>              if (i >= nr)
> >>                      break;
> >> @@ -3240,7 +3240,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> >>
> >>      if (nr_dropped)
> >>              shmem_uncharge(folio->mapping->host, nr_dropped);
> >> -    remap_page(folio, nr);
> >> +    remap_page(folio, nr, PageAnon(head) ? RMP_USE_SHARED_ZEROPAGE : 0);
> >>
> >>      /*
> >>       * set page to its compound_head when split to non order-0 pages, so
> >> @@ -3542,7 +3542,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >>              if (mapping)
> >>                      xas_unlock(&xas);
> >>              local_irq_enable();
> >> -            remap_page(folio, folio_nr_pages(folio));
> >> +            remap_page(folio, folio_nr_pages(folio), 0);
> >>              ret = -EAGAIN;
> >>      }
> >>
> >> diff --git a/mm/migrate.c b/mm/migrate.c
> >> index 6f9c62c746be..d039863e014b 100644
> >> --- a/mm/migrate.c
> >> +++ b/mm/migrate.c
> >> @@ -204,13 +204,57 @@ bool isolate_folio_to_list(struct folio *folio, struct list_head *list)
> >>      return true;
> >>  }
> >>
> >> +static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
> >> +                                      struct folio *folio,
> >> +                                      unsigned long idx)
> >> +{
> >> +    struct page *page = folio_page(folio, idx);
> >> +    bool contains_data;
> >> +    pte_t newpte;
> >> +    void *addr;
> >> +
> >> +    VM_BUG_ON_PAGE(PageCompound(page), page);
> >
> > This should be:
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index e950fd62607f..7ffdbe078aa7 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -206,7 +206,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
> >         pte_t newpte;
> >         void *addr;
> >
> > -       VM_BUG_ON_PAGE(PageCompound(page), page);
> > +       if (PageCompound(page))
> > +               return false;
> >         VM_BUG_ON_PAGE(!PageAnon(page), page);
> >         VM_BUG_ON_PAGE(!PageLocked(page), page);
> >         VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
> >
> > Otherwise, splitting anonymous large folios to non order-0 ones just
> > triggers this BUG_ON.
> >
>
> That makes sense, would you like to send the fix?
>
> Adding Yu Zhao to "To" incase he has any objections.

LGTM. Thanks!


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 4/6] mm: Introduce a pageflag for partially mapped folios
  2024-08-30 10:03 ` [PATCH v5 4/6] mm: Introduce a pageflag for partially mapped folios Usama Arif
@ 2024-12-11 15:03   ` David Hildenbrand
  2024-12-12 10:30     ` Usama Arif
  0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2024-12-11 15:03 UTC (permalink / raw)
  To: Usama Arif, akpm, linux-mm
  Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, npache,
	baohua, ryan.roberts, rppt, willy, cerasuolodomenico, ryncsn,
	corbet, linux-kernel, linux-doc, kernel-team

On 30.08.24 12:03, Usama Arif wrote:
> Currently folio->_deferred_list is used to keep track of
> partially_mapped folios that are going to be split under memory
> pressure. In the next patch, all THPs that are faulted in and collapsed
> by khugepaged are also going to be tracked using _deferred_list.
> 
> This patch introduces a pageflag to be able to distinguish between
> partially mapped folios and others in the deferred_list at split time in
> deferred_split_scan. Its needed as __folio_remove_rmap decrements
> _mapcount, _large_mapcount and _entire_mapcount, hence it won't be
> possible to distinguish between partially mapped folios and others in
> deferred_split_scan.
> 
> Eventhough it introduces an extra flag to track if the folio is
> partially mapped, there is no functional change intended with this
> patch and the flag is not useful in this patch itself, it will
> become useful in the next patch when _deferred_list has non partially
> mapped folios.
> 
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
>   include/linux/huge_mm.h    |  4 ++--
>   include/linux/page-flags.h | 13 +++++++++++-
>   mm/huge_memory.c           | 41 ++++++++++++++++++++++++++++----------
>   mm/memcontrol.c            |  3 ++-
>   mm/migrate.c               |  3 ++-
>   mm/page_alloc.c            |  5 +++--
>   mm/rmap.c                  |  5 +++--
>   mm/vmscan.c                |  3 ++-
>   8 files changed, 56 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 4da102b74a8c..0b0539f4ee1a 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -333,7 +333,7 @@ static inline int split_huge_page(struct page *page)
>   {
>   	return split_huge_page_to_list_to_order(page, NULL, 0);
>   }
> -void deferred_split_folio(struct folio *folio);
> +void deferred_split_folio(struct folio *folio, bool partially_mapped);
>   
>   void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>   		unsigned long address, bool freeze, struct folio *folio);
> @@ -502,7 +502,7 @@ static inline int split_huge_page(struct page *page)
>   {
>   	return 0;
>   }
> -static inline void deferred_split_folio(struct folio *folio) {}
> +static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
>   #define split_huge_pmd(__vma, __pmd, __address)	\
>   	do { } while (0)
>   
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 2175ebceb41c..1b3a76710487 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -186,6 +186,7 @@ enum pageflags {
>   	/* At least one page in this folio has the hwpoison flag set */
>   	PG_has_hwpoisoned = PG_active,
>   	PG_large_rmappable = PG_workingset, /* anon or file-backed */
> +	PG_partially_mapped = PG_reclaim, /* was identified to be partially mapped */
>   };
>   
>   #define PAGEFLAGS_MASK		((1UL << NR_PAGEFLAGS) - 1)
> @@ -859,8 +860,18 @@ static inline void ClearPageCompound(struct page *page)
>   	ClearPageHead(page);
>   }
>   FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
> +FOLIO_TEST_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
> +/*
> + * PG_partially_mapped is protected by deferred_split split_queue_lock,
> + * so its safe to use non-atomic set/clear.

Just stumbled over that. In my understanding, this assumption is wrong.

I don't think anything prevents other PF_ANY (PG_anon_exclusive, 
PG_PG_hwpoison) / PF_SECOND (PF_has_hwpoisoned) flags from getting 
modified concurrently I'm afraid.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 4/6] mm: Introduce a pageflag for partially mapped folios
  2024-12-11 15:03   ` David Hildenbrand
@ 2024-12-12 10:30     ` Usama Arif
  2024-12-12 10:49       ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: Usama Arif @ 2024-12-12 10:30 UTC (permalink / raw)
  To: David Hildenbrand, akpm, linux-mm
  Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, npache,
	baohua, ryan.roberts, rppt, willy, cerasuolodomenico, ryncsn,
	corbet, linux-kernel, linux-doc, kernel-team



On 11/12/2024 18:03, David Hildenbrand wrote:
> On 30.08.24 12:03, Usama Arif wrote:
>> Currently folio->_deferred_list is used to keep track of
>> partially_mapped folios that are going to be split under memory
>> pressure. In the next patch, all THPs that are faulted in and collapsed
>> by khugepaged are also going to be tracked using _deferred_list.
>>
>> This patch introduces a pageflag to be able to distinguish between
>> partially mapped folios and others in the deferred_list at split time in
>> deferred_split_scan. Its needed as __folio_remove_rmap decrements
>> _mapcount, _large_mapcount and _entire_mapcount, hence it won't be
>> possible to distinguish between partially mapped folios and others in
>> deferred_split_scan.
>>
>> Eventhough it introduces an extra flag to track if the folio is
>> partially mapped, there is no functional change intended with this
>> patch and the flag is not useful in this patch itself, it will
>> become useful in the next patch when _deferred_list has non partially
>> mapped folios.
>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>>   include/linux/huge_mm.h    |  4 ++--
>>   include/linux/page-flags.h | 13 +++++++++++-
>>   mm/huge_memory.c           | 41 ++++++++++++++++++++++++++++----------
>>   mm/memcontrol.c            |  3 ++-
>>   mm/migrate.c               |  3 ++-
>>   mm/page_alloc.c            |  5 +++--
>>   mm/rmap.c                  |  5 +++--
>>   mm/vmscan.c                |  3 ++-
>>   8 files changed, 56 insertions(+), 21 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 4da102b74a8c..0b0539f4ee1a 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -333,7 +333,7 @@ static inline int split_huge_page(struct page *page)
>>   {
>>       return split_huge_page_to_list_to_order(page, NULL, 0);
>>   }
>> -void deferred_split_folio(struct folio *folio);
>> +void deferred_split_folio(struct folio *folio, bool partially_mapped);
>>     void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>           unsigned long address, bool freeze, struct folio *folio);
>> @@ -502,7 +502,7 @@ static inline int split_huge_page(struct page *page)
>>   {
>>       return 0;
>>   }
>> -static inline void deferred_split_folio(struct folio *folio) {}
>> +static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
>>   #define split_huge_pmd(__vma, __pmd, __address)    \
>>       do { } while (0)
>>   diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index 2175ebceb41c..1b3a76710487 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -186,6 +186,7 @@ enum pageflags {
>>       /* At least one page in this folio has the hwpoison flag set */
>>       PG_has_hwpoisoned = PG_active,
>>       PG_large_rmappable = PG_workingset, /* anon or file-backed */
>> +    PG_partially_mapped = PG_reclaim, /* was identified to be partially mapped */
>>   };
>>     #define PAGEFLAGS_MASK        ((1UL << NR_PAGEFLAGS) - 1)
>> @@ -859,8 +860,18 @@ static inline void ClearPageCompound(struct page *page)
>>       ClearPageHead(page);
>>   }
>>   FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
>> +FOLIO_TEST_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
>> +/*
>> + * PG_partially_mapped is protected by deferred_split split_queue_lock,
>> + * so its safe to use non-atomic set/clear.
> 
> Just stumbled over that. In my understanding, this assumption is wrong.
> 
> I don't think anything prevents other PF_ANY (PG_anon_exclusive, PG_PG_hwpoison) / PF_SECOND (PF_has_hwpoisoned) flags from getting modified concurrently I'm afraid.
> 
Hi David,

Just to clear my understanding, what you are suggesting could happen in __folio_set/clear_partially_mapped is:
1) __folio_set/clear_partially_mapped reads the 2nd page flags (x) where one of the other 2nd page flags is lets say not set.
2) One of the other 2nd page flags are set atomically.
3) __folio_set/clear_partially_mapped writes x + changes to partially_mapped. However, the change in step 2 to one of the other 2nd page flag is lost.

Is that correct? But that would mean we shouldn't have any page flags (first or second page) as non atomic? although it would depend if they are being
changed at the same time point. If you encountered a particular instance of PG_anon_exclusive or PF_has_hwpoisoned being changed at the same point as
__folio_set/clear_partially_mapped, could you point to it?

I am happy to send a fix to change all set/clear_partially_mapped to atomic, but just want to understand this better.

Thanks!
Usama



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 4/6] mm: Introduce a pageflag for partially mapped folios
  2024-12-12 10:30     ` Usama Arif
@ 2024-12-12 10:49       ` David Hildenbrand
  0 siblings, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2024-12-12 10:49 UTC (permalink / raw)
  To: Usama Arif, akpm, linux-mm
  Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, npache,
	baohua, ryan.roberts, rppt, willy, cerasuolodomenico, ryncsn,
	corbet, linux-kernel, linux-doc, kernel-team

On 12.12.24 11:30, Usama Arif wrote:
> 
> 
> On 11/12/2024 18:03, David Hildenbrand wrote:
>> On 30.08.24 12:03, Usama Arif wrote:
>>> Currently folio->_deferred_list is used to keep track of
>>> partially_mapped folios that are going to be split under memory
>>> pressure. In the next patch, all THPs that are faulted in and collapsed
>>> by khugepaged are also going to be tracked using _deferred_list.
>>>
>>> This patch introduces a pageflag to be able to distinguish between
>>> partially mapped folios and others in the deferred_list at split time in
>>> deferred_split_scan. Its needed as __folio_remove_rmap decrements
>>> _mapcount, _large_mapcount and _entire_mapcount, hence it won't be
>>> possible to distinguish between partially mapped folios and others in
>>> deferred_split_scan.
>>>
>>> Eventhough it introduces an extra flag to track if the folio is
>>> partially mapped, there is no functional change intended with this
>>> patch and the flag is not useful in this patch itself, it will
>>> become useful in the next patch when _deferred_list has non partially
>>> mapped folios.
>>>
>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>> ---
>>>    include/linux/huge_mm.h    |  4 ++--
>>>    include/linux/page-flags.h | 13 +++++++++++-
>>>    mm/huge_memory.c           | 41 ++++++++++++++++++++++++++++----------
>>>    mm/memcontrol.c            |  3 ++-
>>>    mm/migrate.c               |  3 ++-
>>>    mm/page_alloc.c            |  5 +++--
>>>    mm/rmap.c                  |  5 +++--
>>>    mm/vmscan.c                |  3 ++-
>>>    8 files changed, 56 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 4da102b74a8c..0b0539f4ee1a 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -333,7 +333,7 @@ static inline int split_huge_page(struct page *page)
>>>    {
>>>        return split_huge_page_to_list_to_order(page, NULL, 0);
>>>    }
>>> -void deferred_split_folio(struct folio *folio);
>>> +void deferred_split_folio(struct folio *folio, bool partially_mapped);
>>>      void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>>            unsigned long address, bool freeze, struct folio *folio);
>>> @@ -502,7 +502,7 @@ static inline int split_huge_page(struct page *page)
>>>    {
>>>        return 0;
>>>    }
>>> -static inline void deferred_split_folio(struct folio *folio) {}
>>> +static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
>>>    #define split_huge_pmd(__vma, __pmd, __address)    \
>>>        do { } while (0)
>>>    diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>>> index 2175ebceb41c..1b3a76710487 100644
>>> --- a/include/linux/page-flags.h
>>> +++ b/include/linux/page-flags.h
>>> @@ -186,6 +186,7 @@ enum pageflags {
>>>        /* At least one page in this folio has the hwpoison flag set */
>>>        PG_has_hwpoisoned = PG_active,
>>>        PG_large_rmappable = PG_workingset, /* anon or file-backed */
>>> +    PG_partially_mapped = PG_reclaim, /* was identified to be partially mapped */
>>>    };
>>>      #define PAGEFLAGS_MASK        ((1UL << NR_PAGEFLAGS) - 1)
>>> @@ -859,8 +860,18 @@ static inline void ClearPageCompound(struct page *page)
>>>        ClearPageHead(page);
>>>    }
>>>    FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
>>> +FOLIO_TEST_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
>>> +/*
>>> + * PG_partially_mapped is protected by deferred_split split_queue_lock,
>>> + * so its safe to use non-atomic set/clear.
>>
>> Just stumbled over that. In my understanding, this assumption is wrong.
>>
>> I don't think anything prevents other PF_ANY (PG_anon_exclusive, PG_PG_hwpoison) / PF_SECOND (PF_has_hwpoisoned) flags from getting modified concurrently I'm afraid.
>>
> Hi David,
> 
> Just to clear my understanding, what you are suggesting could happen in __folio_set/clear_partially_mapped is:
> 1) __folio_set/clear_partially_mapped reads the 2nd page flags (x) where one of the other 2nd page flags is lets say not set.
> 2) One of the other 2nd page flags are set atomically.
> 3) __folio_set/clear_partially_mapped writes x + changes to partially_mapped. However, the change in step 2 to one of the other 2nd page flag is lost.
> 
> Is that correct?

That matches my understanding.

But that would mean we shouldn't have any page flags (first or second 
page) as non atomic?

Yes. We may only use non-atomic variants as long as we can guarantee 
that nobody can concurrently operate on the flags, for example on the 
early folio allocation path or on the folio freeing path.

Observe how the other SECOND users are atomic, PG_anon_exclusive is 
atomic (except on two page freeing paths) and PF_hwpoison is atomic.


> although it would depend if they are being
> changed at the same time point. If you encountered a particular instance of PG_anon_exclusive or PF_has_hwpoisoned being changed at the same point as
> __folio_set/clear_partially_mapped, could you point to it?

Regarding PG_hwpoison, observe how memory_failure() performs the 
TestSetPageHWPoison() + folio_set_has_hwpoisoned() before unmapping the 
pages, without any locking. This can race with pretty much any operation 
that triggers unmapping.

With PG_anon_exclusive it's a bit more complicated, but it's probably 
sufficient if MADV_DONTNEED (setting deferred) races with concurrent 
swapout/mgration (clearing PG_anon_exclusive), whereby both operations 
are not performed under the same PT lock. This can happen after partial 
mremap(), or after fork() when only parts of the large folio were shared 
with the child.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2024-08-30 10:03 ` [PATCH v5 2/6] mm: remap unused subpages to shared zeropage " Usama Arif
  2024-10-23 16:21   ` Zi Yan
@ 2025-09-18  8:53   ` Qun-wei Lin (林群崴)
  2025-09-18  8:56     ` David Hildenbrand
  1 sibling, 1 reply; 33+ messages in thread
From: Qun-wei Lin (林群崴) @ 2025-09-18  8:53 UTC (permalink / raw)
  To: catalin.marinas@arm.com, usamaarif642@gmail.com,
	linux-mm@kvack.org, yuzhao@google.com, akpm@linux-foundation.org
  Cc: corbet@lwn.net, Andrew Yang (楊智強),
	npache@redhat.com, rppt@kernel.org, willy@infradead.org,
	kernel-team@meta.com, david@redhat.com, roman.gushchin@linux.dev,
	hannes@cmpxchg.org, cerasuolodomenico@gmail.com,
	linux-kernel@vger.kernel.org, ryncsn@gmail.com, surenb@google.com,
	riel@surriel.com, shakeel.butt@linux.dev,
	Chinwen Chang (張錦文),
	linux-doc@vger.kernel.org, Casper Li (李中榮),
	ryan.roberts@arm.com, linux-mediatek@lists.infradead.org,
	baohua@kernel.org, kaleshsingh@google.com, zhais@google.com,
	linux-arm-kernel@lists.infradead.org

On Fri, 2024-08-30 at 11:03 +0100, Usama Arif wrote:
> From: Yu Zhao <yuzhao@google.com>
> 
> Here being unused means containing only zeros and inaccessible to
> userspace. When splitting an isolated thp under reclaim or migration,
> the unused subpages can be mapped to the shared zeropage, hence
> saving
> memory. This is particularly helpful when the internal
> fragmentation of a thp is high, i.e. it has many untouched subpages.
> 
> This is also a prerequisite for THP low utilization shrinker which
> will
> be introduced in later patches, where underutilized THPs are split,
> and
> the zero-filled pages are freed saving memory.
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Tested-by: Shuang Zhai <zhais@google.com>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
>  include/linux/rmap.h |  7 ++++-
>  mm/huge_memory.c     |  8 ++---
>  mm/migrate.c         | 72 ++++++++++++++++++++++++++++++++++++++----
> --
>  mm/migrate_device.c  |  4 +--
>  4 files changed, 75 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 91b5935e8485..d5e93e44322e 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -745,7 +745,12 @@ int folio_mkclean(struct folio *);
>  int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages,
> pgoff_t pgoff,
>  		      struct vm_area_struct *vma);
>  
> -void remove_migration_ptes(struct folio *src, struct folio *dst,
> bool locked);
> +enum rmp_flags {
> +	RMP_LOCKED		= 1 << 0,
> +	RMP_USE_SHARED_ZEROPAGE	= 1 << 1,
> +};
> +
> +void remove_migration_ptes(struct folio *src, struct folio *dst, int
> flags);
>  
>  /*
>   * rmap_walk_control: To control rmap traversing for specific needs
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 0c48806ccb9a..af60684e7c70 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3020,7 +3020,7 @@ bool unmap_huge_pmd_locked(struct
> vm_area_struct *vma, unsigned long addr,
>  	return false;
>  }
>  
> -static void remap_page(struct folio *folio, unsigned long nr)
> +static void remap_page(struct folio *folio, unsigned long nr, int
> flags)
>  {
>  	int i = 0;
>  
> @@ -3028,7 +3028,7 @@ static void remap_page(struct folio *folio,
> unsigned long nr)
>  	if (!folio_test_anon(folio))
>  		return;
>  	for (;;) {
> -		remove_migration_ptes(folio, folio, true);
> +		remove_migration_ptes(folio, folio, RMP_LOCKED |
> flags);
>  		i += folio_nr_pages(folio);
>  		if (i >= nr)
>  			break;
> @@ -3240,7 +3240,7 @@ static void __split_huge_page(struct page
> *page, struct list_head *list,
>  
>  	if (nr_dropped)
>  		shmem_uncharge(folio->mapping->host, nr_dropped);
> -	remap_page(folio, nr);
> +	remap_page(folio, nr, PageAnon(head) ?
> RMP_USE_SHARED_ZEROPAGE : 0);
>  
>  	/*
>  	 * set page to its compound_head when split to non order-0
> pages, so
> @@ -3542,7 +3542,7 @@ int split_huge_page_to_list_to_order(struct
> page *page, struct list_head *list,
>  		if (mapping)
>  			xas_unlock(&xas);
>  		local_irq_enable();
> -		remap_page(folio, folio_nr_pages(folio));
> +		remap_page(folio, folio_nr_pages(folio), 0);
>  		ret = -EAGAIN;
>  	}
>  
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6f9c62c746be..d039863e014b 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -204,13 +204,57 @@ bool isolate_folio_to_list(struct folio *folio,
> struct list_head *list)
>  	return true;
>  }
>  
> +static bool try_to_map_unused_to_zeropage(struct
> page_vma_mapped_walk *pvmw,
> +					  struct folio *folio,
> +					  unsigned long idx)
> +{
> +	struct page *page = folio_page(folio, idx);
> +	bool contains_data;
> +	pte_t newpte;
> +	void *addr;
> +
> +	VM_BUG_ON_PAGE(PageCompound(page), page);
> +	VM_BUG_ON_PAGE(!PageAnon(page), page);
> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> +	VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
> +
> +	if (folio_test_mlocked(folio) || (pvmw->vma->vm_flags &
> VM_LOCKED) ||
> +	    mm_forbids_zeropage(pvmw->vma->vm_mm))
> +		return false;
> +
> +	/*
> +	 * The pmd entry mapping the old thp was flushed and the pte
> mapping
> +	 * this subpage has been non present. If the subpage is only
> zero-filled
> +	 * then map it to the shared zeropage.
> +	 */
> +	addr = kmap_local_page(page);
> +	contains_data = memchr_inv(addr, 0, PAGE_SIZE);
> +	kunmap_local(addr);
> +
> +	if (contains_data)
> +		return false;
> +
> +	newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
> +					pvmw->vma->vm_page_prot));
> +	set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte,
> newpte);
> +
> +	dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
> +	return true;
> +}
> +
> +struct rmap_walk_arg {
> +	struct folio *folio;
> +	bool map_unused_to_zeropage;
> +};
> +
>  /*
>   * Restore a potential migration pte to a working pte entry
>   */
>  static bool remove_migration_pte(struct folio *folio,
> -		struct vm_area_struct *vma, unsigned long addr, void
> *old)
> +		struct vm_area_struct *vma, unsigned long addr, void
> *arg)
>  {
> -	DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC |
> PVMW_MIGRATION);
> +	struct rmap_walk_arg *rmap_walk_arg = arg;
> +	DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr,
> PVMW_SYNC | PVMW_MIGRATION);
>  
>  	while (page_vma_mapped_walk(&pvmw)) {
>  		rmap_t rmap_flags = RMAP_NONE;
> @@ -234,6 +278,9 @@ static bool remove_migration_pte(struct folio
> *folio,
>  			continue;
>  		}
>  #endif
> +		if (rmap_walk_arg->map_unused_to_zeropage &&
> +		    try_to_map_unused_to_zeropage(&pvmw, folio,
> idx))
> +			continue;
>  
>  		folio_get(folio);
>  		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
> @@ -312,14 +359,21 @@ static bool remove_migration_pte(struct folio
> *folio,
>   * Get rid of all migration entries and replace them by
>   * references to the indicated page.
>   */
> -void remove_migration_ptes(struct folio *src, struct folio *dst,
> bool locked)
> +void remove_migration_ptes(struct folio *src, struct folio *dst, int
> flags)
>  {
> +	struct rmap_walk_arg rmap_walk_arg = {
> +		.folio = src,
> +		.map_unused_to_zeropage = flags &
> RMP_USE_SHARED_ZEROPAGE,
> +	};
> +
>  	struct rmap_walk_control rwc = {
>  		.rmap_one = remove_migration_pte,
> -		.arg = src,
> +		.arg = &rmap_walk_arg,
>  	};
>  
> -	if (locked)
> +	VM_BUG_ON_FOLIO((flags & RMP_USE_SHARED_ZEROPAGE) && (src !=
> dst), src);
> +
> +	if (flags & RMP_LOCKED)
>  		rmap_walk_locked(dst, &rwc);
>  	else
>  		rmap_walk(dst, &rwc);
> @@ -934,7 +988,7 @@ static int writeout(struct address_space
> *mapping, struct folio *folio)
>  	 * At this point we know that the migration attempt cannot
>  	 * be successful.
>  	 */
> -	remove_migration_ptes(folio, folio, false);
> +	remove_migration_ptes(folio, folio, 0);
>  
>  	rc = mapping->a_ops->writepage(&folio->page, &wbc);
>  
> @@ -1098,7 +1152,7 @@ static void migrate_folio_undo_src(struct folio
> *src,
>  				   struct list_head *ret)
>  {
>  	if (page_was_mapped)
> -		remove_migration_ptes(src, src, false);
> +		remove_migration_ptes(src, src, 0);
>  	/* Drop an anon_vma reference if we took one */
>  	if (anon_vma)
>  		put_anon_vma(anon_vma);
> @@ -1336,7 +1390,7 @@ static int migrate_folio_move(free_folio_t
> put_new_folio, unsigned long private,
>  		lru_add_drain();
>  
>  	if (old_page_state & PAGE_WAS_MAPPED)
> -		remove_migration_ptes(src, dst, false);
> +		remove_migration_ptes(src, dst, 0);
>  
>  out_unlock_both:
>  	folio_unlock(dst);
> @@ -1474,7 +1528,7 @@ static int unmap_and_move_huge_page(new_folio_t
> get_new_folio,
>  
>  	if (page_was_mapped)
>  		remove_migration_ptes(src,
> -			rc == MIGRATEPAGE_SUCCESS ? dst : src,
> false);
> +			rc == MIGRATEPAGE_SUCCESS ? dst : src, 0);
>  
>  unlock_put_anon:
>  	folio_unlock(dst);
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 8d687de88a03..9cf26592ac93 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -424,7 +424,7 @@ static unsigned long
> migrate_device_unmap(unsigned long *src_pfns,
>  			continue;
>  
>  		folio = page_folio(page);
> -		remove_migration_ptes(folio, folio, false);
> +		remove_migration_ptes(folio, folio, 0);
>  
>  		src_pfns[i] = 0;
>  		folio_unlock(folio);
> @@ -840,7 +840,7 @@ void migrate_device_finalize(unsigned long
> *src_pfns,
>  			dst = src;
>  		}
>  
> -		remove_migration_ptes(src, dst, false);
> +		remove_migration_ptes(src, dst, 0);
>  		folio_unlock(src);
>  
>  		if (folio_is_zone_device(src))

Hi,

This patch has been in the mainline for some time, but we recently
discovered an issue when both mTHP and MTE (Memory Tagging Extension)
are enabled.

It seems that remapping to the same zeropage might causes MTE tag
mismatches, since MTE tags are associated with physical addresses.

In Android, the tombstone is as follows:

---
Build fingerprint:
'alps/vext_k6993v1_64/k6993v1_64:16/BP2A.250605.031.A3/mp1cs1ofp41:user
debug/dev-keys'
Revision: '0'
ABI: 'arm64'
Timestamp: 2025-08-12 04:58:28.507086720+0800
Process uptime: 0s
Cmdline: /system/bin/audioserver
pid: 8217, tid: 8882, name: binder:8217_4  >>> /system/bin/audioserver
<<<
uid: 1041
tagged_addr_ctrl: 000000000007fff3 (PR_TAGGED_ADDR_ENABLE,
PR_MTE_TCF_SYNC, mask 0xfffe)
signal 11 (SIGSEGV), code 9 (SEGV_MTESERR), fault addr
0x0a00007055220000
Cause: [MTE]: Buffer Overflow, 14016 bytes into a 23070-byte allocation
at 0x705521c940
    x0  0a0000705521c940  x1  0300006f75210ab0  x2  00000000000022a5 
x3  0a0000705521ffc0
    x4  0300006f75212de5  x5  0a000070552222f5  x6  0000000000005a1e 
x7  0000000000000000
    x8  339c000005a1e11e  x9  00000000f041339c  x10 000000009dd48904 
x11 000000000000ffff
    x12 0000000022b70889  x13 000000004cc0b2ff  x14 0000000000000000 
x15 0000000000000010
    x16 00000071cc5d8fc0  x17 00000071cc54e040  x18 0000006ef7bd4000 
x19 0300006f7520d430
    x20 00000071cc5e0340  x21 0000000000005a1e  x22 0a0000705521c940 
x23 00000000000059b5
    x24 00000000000000b1  x25 0300006f75212e4e  x26 caa20000059b511d 
x27 000000000000001d
    x28 0300006f75212e30  x29 0000006f1fe385f0
    lr  00000071cc54200c  sp  0000006f1fe385c0  pc  00000071cc54e158 
pst 0000000020001000

26 total frames
backtrace:
      #00 pc 000000000006c158 
/apex/com.android.runtime/lib64/bionic/libc.so
(__memcpy_aarch64_simd+280) (BuildId: 1e819f3e369d59be98bee38a8fbd0322)
      #01 pc 0000000000060008 
/apex/com.android.runtime/lib64/bionic/libc.so
(scudo::Allocator<scudo::AndroidNormalConfig,
&scudo_malloc_postinit>::reallocate(void*, unsigned long, unsigned
long)+696) (BuildId: 1e819f3e369d59be98bee38a8fbd0322)
      #02 pc 000000000005fccc 
/apex/com.android.runtime/lib64/bionic/libc.so (scudo_realloc+44)
(BuildId: 1e819f3e369d59be98bee38a8fbd0322)
      #03 pc 000000000005c2cc 
/apex/com.android.runtime/lib64/bionic/libc.so (LimitRealloc(void*,
unsigned long)+124) (BuildId: 1e819f3e369d59be98bee38a8fbd0322)
      #04 pc 0000000000059a90 
/apex/com.android.runtime/lib64/bionic/libc.so (realloc+160) (BuildId:
1e819f3e369d59be98bee38a8fbd0322)
      #05 pc 0000000000011a74  /system/lib64/libutils.so
(android::SharedBuffer::editResize(unsigned long) const+68) (BuildId:
7aa2d71e030a290c8dd28236ba0a838f)
      #06 pc 0000000000011ba8  /system/lib64/libutils.so
(android::String8::real_append(char const*, unsigned long)+88)
(BuildId: 7aa2d71e030a290c8dd28236ba0a838f)
      #07 pc 000000000007b880 
/system/lib64/libaudiopolicycomponents.so
(android::DeviceDescriptor::dump(android::String8*, int, bool)
const+208) (BuildId: 553fefffdca2f3a5dde634e123bd2c81)
      #08 pc 000000000008094c 
/system/lib64/libaudiopolicycomponents.so
(android::DeviceVector::dump(android::String8*, android::String8
const&, int, bool) const+636) (BuildId:
553fefffdca2f3a5dde634e123bd2c81)
      #09 pc 0000000000092ed4 
/system/lib64/libaudiopolicycomponents.so
(android::IOProfile::dump(android::String8*, int) const+980) (BuildId:
553fefffdca2f3a5dde634e123bd2c81)
      #10 pc 000000000008bd7c 
/system/lib64/libaudiopolicycomponents.so
(android::HwModule::dump(android::String8*, int) const+1148) (BuildId:
553fefffdca2f3a5dde634e123bd2c81)
      #11 pc 000000000009044c 
/system/lib64/libaudiopolicycomponents.so
(android::HwModuleCollection::dump(android::String8*) const+508)
(BuildId: 553fefffdca2f3a5dde634e123bd2c81)
      #12 pc 0000000000090134 
/system/lib64/libaudiopolicymanagerdefault.so
(android::AudioPolicyManager::dump(android::String8*) const+3908)
(BuildId: fdba879fc1a0c470759bfeb3d594ab81)
      #13 pc 0000000000092e40 
/system/lib64/libaudiopolicymanagerdefault.so
(android::AudioPolicyManager::dump(int)+80) (BuildId:
fdba879fc1a0c470759bfeb3d594ab81)
      #14 pc 000000000022b218  /system/bin/audioserver
(android::AudioPolicyService::dump(int,
android::Vector<android::String16> const&)+392) (BuildId:
1988c27ce74b125f598a07a93367cfdd)
      #15 pc 000000000022c8cc  /system/bin/audioserver (non-virtual
thunk to android::AudioPolicyService::dump(int,
android::Vector<android::String16> const&)+12) (BuildId:
1988c27ce74b125f598a07a93367cfdd)
      #16 pc 00000000000883f4  /system/lib64/libbinder.so
(android::BBinder::onTransact(unsigned int, android::Parcel const&,
android::Parcel*, unsigned int)+340) (BuildId:
4ace0dcb0135b71ba70b7aaee457d26f)
      #17 pc 000000000003fadc  /system/lib64/audiopolicy-aidl-cpp.so
(android::media::BnAudioPolicyService::onTransact(unsigned int,
android::Parcel const&, android::Parcel*, unsigned int)+19884)
(BuildId: ae185d80e4e54668275f262317dc2d7d)
      #18 pc 000000000022adc4  /system/bin/audioserver
(android::AudioPolicyService::onTransact(unsigned int, android::Parcel
const&, android::Parcel*, unsigned int)+1076) (BuildId:
1988c27ce74b125f598a07a93367cfdd)
      #19 pc 0000000000048adc  /system/lib64/libbinder.so
(android::IPCThreadState::executeCommand(int)+748) (BuildId:
4ace0dcb0135b71ba70b7aaee457d26f)
      #20 pc 0000000000051788  /system/lib64/libbinder.so
(android::IPCThreadState::joinThreadPool(bool)+296) (BuildId:
4ace0dcb0135b71ba70b7aaee457d26f)
      #21 pc 000000000007e528  /system/lib64/libbinder.so
(android::PoolThread::threadLoop()+24) (BuildId:
4ace0dcb0135b71ba70b7aaee457d26f)
      #22 pc 0000000000019268  /system/lib64/libutils.so
(android::Thread::_threadLoop(void*)+248) (BuildId:
7aa2d71e030a290c8dd28236ba0a838f)
      #23 pc 000000000001b994  /system/lib64/libutils.so
(libutil_thread_trampoline(void*)
(.__uniq.226528677032898775202282855395389835431)+20) (BuildId:
7aa2d71e030a290c8dd28236ba0a838f)
      #24 pc 0000000000083c8c 
/apex/com.android.runtime/lib64/bionic/libc.so
(__pthread_start(void*)+236) (BuildId:
1e819f3e369d59be98bee38a8fbd0322)
      #25 pc 00000000000761a0 
/apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64)
(BuildId: 1e819f3e369d59be98bee38a8fbd0322)

Memory tags around the fault address (0xa00007055220000), one tag per
16 bytes:
      0x705521f800: a  a  a  a  a  a  a  a  a  a  a  a  a  a  a  a
      0x705521f900: a  a  a  a  a  a  a  a  a  a  a  a  a  a  a  a
      0x705521fa00: a  a  a  a  a  a  a  a  a  a  a  a  a  a  a  a
      0x705521fb00: a  a  a  a  a  a  a  a  a  a  a  a  a  a  a  a
      0x705521fc00: a  a  a  a  a  a  a  a  a  a  a  a  a  a  a  a
      0x705521fd00: a  a  a  a  a  a  a  a  a  a  a  a  a  a  a  a
      0x705521fe00: a  a  a  a  a  a  a  a  a  a  a  a  a  a  a  a
      0x705521ff00: a  a  a  a  a  a  a  a  a  a  a  a  a  a  a  a
    =>0x7055220000:[0] 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
      0x7055220100: 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
      0x7055220200: 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
      0x7055220300: 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
      0x7055220400: 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
      0x7055220500: 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
      0x7055220600: 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
      0x7055220700: 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
---

Whenever the memory pressure is high, it will happen to any process
with MTE enabled.

Any suggestion is appreciated.

Thanks,
Qun-wei

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2025-09-18  8:53   ` Qun-wei Lin (林群崴)
@ 2025-09-18  8:56     ` David Hildenbrand
  2025-09-18 11:42       ` Usama Arif
  2025-09-18 12:22       ` Lance Yang
  0 siblings, 2 replies; 33+ messages in thread
From: David Hildenbrand @ 2025-09-18  8:56 UTC (permalink / raw)
  To: Qun-wei Lin (林群崴), catalin.marinas@arm.com,
	usamaarif642@gmail.com, linux-mm@kvack.org, yuzhao@google.com,
	akpm@linux-foundation.org
  Cc: corbet@lwn.net, Andrew Yang (楊智強),
	npache@redhat.com, rppt@kernel.org, willy@infradead.org,
	kernel-team@meta.com, roman.gushchin@linux.dev,
	hannes@cmpxchg.org, cerasuolodomenico@gmail.com,
	linux-kernel@vger.kernel.org, ryncsn@gmail.com, surenb@google.com,
	riel@surriel.com, shakeel.butt@linux.dev,
	Chinwen Chang (張錦文),
	linux-doc@vger.kernel.org, Casper Li (李中榮),
	ryan.roberts@arm.com, linux-mediatek@lists.infradead.org,
	baohua@kernel.org, kaleshsingh@google.com, zhais@google.com,
	linux-arm-kernel@lists.infradead.org

On 18.09.25 10:53, Qun-wei Lin (林群崴) wrote:
> On Fri, 2024-08-30 at 11:03 +0100, Usama Arif wrote:
>> From: Yu Zhao <yuzhao@google.com>
>>
>> Here being unused means containing only zeros and inaccessible to
>> userspace. When splitting an isolated thp under reclaim or migration,
>> the unused subpages can be mapped to the shared zeropage, hence
>> saving
>> memory. This is particularly helpful when the internal
>> fragmentation of a thp is high, i.e. it has many untouched subpages.
>>
>> This is also a prerequisite for THP low utilization shrinker which
>> will
>> be introduced in later patches, where underutilized THPs are split,
>> and
>> the zero-filled pages are freed saving memory.
>>
>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>> Tested-by: Shuang Zhai <zhais@google.com>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>>   include/linux/rmap.h |  7 ++++-
>>   mm/huge_memory.c     |  8 ++---
>>   mm/migrate.c         | 72 ++++++++++++++++++++++++++++++++++++++----
>> --
>>   mm/migrate_device.c  |  4 +--
>>   4 files changed, 75 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index 91b5935e8485..d5e93e44322e 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -745,7 +745,12 @@ int folio_mkclean(struct folio *);
>>   int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages,
>> pgoff_t pgoff,
>>   		      struct vm_area_struct *vma);
>>   
>> -void remove_migration_ptes(struct folio *src, struct folio *dst,
>> bool locked);
>> +enum rmp_flags {
>> +	RMP_LOCKED		= 1 << 0,
>> +	RMP_USE_SHARED_ZEROPAGE	= 1 << 1,
>> +};
>> +
>> +void remove_migration_ptes(struct folio *src, struct folio *dst, int
>> flags);
>>   
>>   /*
>>    * rmap_walk_control: To control rmap traversing for specific needs
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 0c48806ccb9a..af60684e7c70 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3020,7 +3020,7 @@ bool unmap_huge_pmd_locked(struct
>> vm_area_struct *vma, unsigned long addr,
>>   	return false;
>>   }
>>   
>> -static void remap_page(struct folio *folio, unsigned long nr)
>> +static void remap_page(struct folio *folio, unsigned long nr, int
>> flags)
>>   {
>>   	int i = 0;
>>   
>> @@ -3028,7 +3028,7 @@ static void remap_page(struct folio *folio,
>> unsigned long nr)
>>   	if (!folio_test_anon(folio))
>>   		return;
>>   	for (;;) {
>> -		remove_migration_ptes(folio, folio, true);
>> +		remove_migration_ptes(folio, folio, RMP_LOCKED |
>> flags);
>>   		i += folio_nr_pages(folio);
>>   		if (i >= nr)
>>   			break;
>> @@ -3240,7 +3240,7 @@ static void __split_huge_page(struct page
>> *page, struct list_head *list,
>>   
>>   	if (nr_dropped)
>>   		shmem_uncharge(folio->mapping->host, nr_dropped);
>> -	remap_page(folio, nr);
>> +	remap_page(folio, nr, PageAnon(head) ?
>> RMP_USE_SHARED_ZEROPAGE : 0);
>>   
>>   	/*
>>   	 * set page to its compound_head when split to non order-0
>> pages, so
>> @@ -3542,7 +3542,7 @@ int split_huge_page_to_list_to_order(struct
>> page *page, struct list_head *list,
>>   		if (mapping)
>>   			xas_unlock(&xas);
>>   		local_irq_enable();
>> -		remap_page(folio, folio_nr_pages(folio));
>> +		remap_page(folio, folio_nr_pages(folio), 0);
>>   		ret = -EAGAIN;
>>   	}
>>   
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 6f9c62c746be..d039863e014b 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -204,13 +204,57 @@ bool isolate_folio_to_list(struct folio *folio,
>> struct list_head *list)
>>   	return true;
>>   }
>>   
>> +static bool try_to_map_unused_to_zeropage(struct
>> page_vma_mapped_walk *pvmw,
>> +					  struct folio *folio,
>> +					  unsigned long idx)
>> +{
>> +	struct page *page = folio_page(folio, idx);
>> +	bool contains_data;
>> +	pte_t newpte;
>> +	void *addr;
>> +
>> +	VM_BUG_ON_PAGE(PageCompound(page), page);
>> +	VM_BUG_ON_PAGE(!PageAnon(page), page);
>> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
>> +	VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
>> +
>> +	if (folio_test_mlocked(folio) || (pvmw->vma->vm_flags &
>> VM_LOCKED) ||
>> +	    mm_forbids_zeropage(pvmw->vma->vm_mm))
>> +		return false;
>> +
>> +	/*
>> +	 * The pmd entry mapping the old thp was flushed and the pte
>> mapping
>> +	 * this subpage has been non present. If the subpage is only
>> zero-filled
>> +	 * then map it to the shared zeropage.
>> +	 */
>> +	addr = kmap_local_page(page);
>> +	contains_data = memchr_inv(addr, 0, PAGE_SIZE);
>> +	kunmap_local(addr);
>> +
>> +	if (contains_data)
>> +		return false;
>> +
>> +	newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
>> +					pvmw->vma->vm_page_prot));
>> +	set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte,
>> newpte);
>> +
>> +	dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
>> +	return true;
>> +}
>> +
>> +struct rmap_walk_arg {
>> +	struct folio *folio;
>> +	bool map_unused_to_zeropage;
>> +};
>> +
>>   /*
>>    * Restore a potential migration pte to a working pte entry
>>    */
>>   static bool remove_migration_pte(struct folio *folio,
>> -		struct vm_area_struct *vma, unsigned long addr, void
>> *old)
>> +		struct vm_area_struct *vma, unsigned long addr, void
>> *arg)
>>   {
>> -	DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC |
>> PVMW_MIGRATION);
>> +	struct rmap_walk_arg *rmap_walk_arg = arg;
>> +	DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr,
>> PVMW_SYNC | PVMW_MIGRATION);
>>   
>>   	while (page_vma_mapped_walk(&pvmw)) {
>>   		rmap_t rmap_flags = RMAP_NONE;
>> @@ -234,6 +278,9 @@ static bool remove_migration_pte(struct folio
>> *folio,
>>   			continue;
>>   		}
>>   #endif
>> +		if (rmap_walk_arg->map_unused_to_zeropage &&
>> +		    try_to_map_unused_to_zeropage(&pvmw, folio,
>> idx))
>> +			continue;
>>   
>>   		folio_get(folio);
>>   		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
>> @@ -312,14 +359,21 @@ static bool remove_migration_pte(struct folio
>> *folio,
>>    * Get rid of all migration entries and replace them by
>>    * references to the indicated page.
>>    */
>> -void remove_migration_ptes(struct folio *src, struct folio *dst,
>> bool locked)
>> +void remove_migration_ptes(struct folio *src, struct folio *dst, int
>> flags)
>>   {
>> +	struct rmap_walk_arg rmap_walk_arg = {
>> +		.folio = src,
>> +		.map_unused_to_zeropage = flags &
>> RMP_USE_SHARED_ZEROPAGE,
>> +	};
>> +
>>   	struct rmap_walk_control rwc = {
>>   		.rmap_one = remove_migration_pte,
>> -		.arg = src,
>> +		.arg = &rmap_walk_arg,
>>   	};
>>   
>> -	if (locked)
>> +	VM_BUG_ON_FOLIO((flags & RMP_USE_SHARED_ZEROPAGE) && (src !=
>> dst), src);
>> +
>> +	if (flags & RMP_LOCKED)
>>   		rmap_walk_locked(dst, &rwc);
>>   	else
>>   		rmap_walk(dst, &rwc);
>> @@ -934,7 +988,7 @@ static int writeout(struct address_space
>> *mapping, struct folio *folio)
>>   	 * At this point we know that the migration attempt cannot
>>   	 * be successful.
>>   	 */
>> -	remove_migration_ptes(folio, folio, false);
>> +	remove_migration_ptes(folio, folio, 0);
>>   
>>   	rc = mapping->a_ops->writepage(&folio->page, &wbc);
>>   
>> @@ -1098,7 +1152,7 @@ static void migrate_folio_undo_src(struct folio
>> *src,
>>   				   struct list_head *ret)
>>   {
>>   	if (page_was_mapped)
>> -		remove_migration_ptes(src, src, false);
>> +		remove_migration_ptes(src, src, 0);
>>   	/* Drop an anon_vma reference if we took one */
>>   	if (anon_vma)
>>   		put_anon_vma(anon_vma);
>> @@ -1336,7 +1390,7 @@ static int migrate_folio_move(free_folio_t
>> put_new_folio, unsigned long private,
>>   		lru_add_drain();
>>   
>>   	if (old_page_state & PAGE_WAS_MAPPED)
>> -		remove_migration_ptes(src, dst, false);
>> +		remove_migration_ptes(src, dst, 0);
>>   
>>   out_unlock_both:
>>   	folio_unlock(dst);
>> @@ -1474,7 +1528,7 @@ static int unmap_and_move_huge_page(new_folio_t
>> get_new_folio,
>>   
>>   	if (page_was_mapped)
>>   		remove_migration_ptes(src,
>> -			rc == MIGRATEPAGE_SUCCESS ? dst : src,
>> false);
>> +			rc == MIGRATEPAGE_SUCCESS ? dst : src, 0);
>>   
>>   unlock_put_anon:
>>   	folio_unlock(dst);
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 8d687de88a03..9cf26592ac93 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -424,7 +424,7 @@ static unsigned long
>> migrate_device_unmap(unsigned long *src_pfns,
>>   			continue;
>>   
>>   		folio = page_folio(page);
>> -		remove_migration_ptes(folio, folio, false);
>> +		remove_migration_ptes(folio, folio, 0);
>>   
>>   		src_pfns[i] = 0;
>>   		folio_unlock(folio);
>> @@ -840,7 +840,7 @@ void migrate_device_finalize(unsigned long
>> *src_pfns,
>>   			dst = src;
>>   		}
>>   
>> -		remove_migration_ptes(src, dst, false);
>> +		remove_migration_ptes(src, dst, 0);
>>   		folio_unlock(src);
>>   
>>   		if (folio_is_zone_device(src))
> 
> Hi,
> 
> This patch has been in the mainline for some time, but we recently
> discovered an issue when both mTHP and MTE (Memory Tagging Extension)
> are enabled.
> 
> It seems that remapping to the same zeropage might causes MTE tag
> mismatches, since MTE tags are associated with physical addresses.

Does this only trigger when the VMA has mte enabled? Maybe we'll have to 
bail out if we detect that mte is enabled.

Also, I wonder how KSM and the shared zeropage works in general with 
that, because I would expect similar issues when we de-duplicate memory?

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2025-09-18  8:56     ` David Hildenbrand
@ 2025-09-18 11:42       ` Usama Arif
  2025-09-18 11:57         ` David Hildenbrand
  2025-09-18 12:22       ` Lance Yang
  1 sibling, 1 reply; 33+ messages in thread
From: Usama Arif @ 2025-09-18 11:42 UTC (permalink / raw)
  To: David Hildenbrand, Qun-wei Lin (林群崴),
	catalin.marinas@arm.com, linux-mm@kvack.org, yuzhao@google.com,
	akpm@linux-foundation.org
  Cc: corbet@lwn.net, Andrew Yang (楊智強),
	npache@redhat.com, rppt@kernel.org, willy@infradead.org,
	kernel-team@meta.com, roman.gushchin@linux.dev,
	hannes@cmpxchg.org, cerasuolodomenico@gmail.com,
	linux-kernel@vger.kernel.org, ryncsn@gmail.com, surenb@google.com,
	riel@surriel.com, shakeel.butt@linux.dev,
	Chinwen Chang (張錦文),
	linux-doc@vger.kernel.org, Casper Li (李中榮),
	ryan.roberts@arm.com, linux-mediatek@lists.infradead.org,
	baohua@kernel.org, kaleshsingh@google.com, zhais@google.com,
	linux-arm-kernel@lists.infradead.org



On 18/09/2025 09:56, David Hildenbrand wrote:
> On 18.09.25 10:53, Qun-wei Lin (林群崴) wrote:
>> On Fri, 2024-08-30 at 11:03 +0100, Usama Arif wrote:
>>> From: Yu Zhao <yuzhao@google.com>
>>>
>>> Here being unused means containing only zeros and inaccessible to
>>> userspace. When splitting an isolated thp under reclaim or migration,
>>> the unused subpages can be mapped to the shared zeropage, hence
>>> saving
>>> memory. This is particularly helpful when the internal
>>> fragmentation of a thp is high, i.e. it has many untouched subpages.
>>>
>>> This is also a prerequisite for THP low utilization shrinker which
>>> will
>>> be introduced in later patches, where underutilized THPs are split,
>>> and
>>> the zero-filled pages are freed saving memory.
>>>
>>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>>> Tested-by: Shuang Zhai <zhais@google.com>
>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>> ---
>>>   include/linux/rmap.h |  7 ++++-
>>>   mm/huge_memory.c     |  8 ++---
>>>   mm/migrate.c         | 72 ++++++++++++++++++++++++++++++++++++++----
>>> -- 
>>>   mm/migrate_device.c  |  4 +--
>>>   4 files changed, 75 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>> index 91b5935e8485..d5e93e44322e 100644
>>> --- a/include/linux/rmap.h
>>> +++ b/include/linux/rmap.h
>>> @@ -745,7 +745,12 @@ int folio_mkclean(struct folio *);
>>>   int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages,
>>> pgoff_t pgoff,
>>>                 struct vm_area_struct *vma);
>>>   -void remove_migration_ptes(struct folio *src, struct folio *dst,
>>> bool locked);
>>> +enum rmp_flags {
>>> +    RMP_LOCKED        = 1 << 0,
>>> +    RMP_USE_SHARED_ZEROPAGE    = 1 << 1,
>>> +};
>>> +
>>> +void remove_migration_ptes(struct folio *src, struct folio *dst, int
>>> flags);
>>>     /*
>>>    * rmap_walk_control: To control rmap traversing for specific needs
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 0c48806ccb9a..af60684e7c70 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3020,7 +3020,7 @@ bool unmap_huge_pmd_locked(struct
>>> vm_area_struct *vma, unsigned long addr,
>>>       return false;
>>>   }
>>>   -static void remap_page(struct folio *folio, unsigned long nr)
>>> +static void remap_page(struct folio *folio, unsigned long nr, int
>>> flags)
>>>   {
>>>       int i = 0;
>>>   @@ -3028,7 +3028,7 @@ static void remap_page(struct folio *folio,
>>> unsigned long nr)
>>>       if (!folio_test_anon(folio))
>>>           return;
>>>       for (;;) {
>>> -        remove_migration_ptes(folio, folio, true);
>>> +        remove_migration_ptes(folio, folio, RMP_LOCKED |
>>> flags);
>>>           i += folio_nr_pages(folio);
>>>           if (i >= nr)
>>>               break;
>>> @@ -3240,7 +3240,7 @@ static void __split_huge_page(struct page
>>> *page, struct list_head *list,
>>>         if (nr_dropped)
>>>           shmem_uncharge(folio->mapping->host, nr_dropped);
>>> -    remap_page(folio, nr);
>>> +    remap_page(folio, nr, PageAnon(head) ?
>>> RMP_USE_SHARED_ZEROPAGE : 0);
>>>         /*
>>>        * set page to its compound_head when split to non order-0
>>> pages, so
>>> @@ -3542,7 +3542,7 @@ int split_huge_page_to_list_to_order(struct
>>> page *page, struct list_head *list,
>>>           if (mapping)
>>>               xas_unlock(&xas);
>>>           local_irq_enable();
>>> -        remap_page(folio, folio_nr_pages(folio));
>>> +        remap_page(folio, folio_nr_pages(folio), 0);
>>>           ret = -EAGAIN;
>>>       }
>>>   diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 6f9c62c746be..d039863e014b 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -204,13 +204,57 @@ bool isolate_folio_to_list(struct folio *folio,
>>> struct list_head *list)
>>>       return true;
>>>   }
>>>   +static bool try_to_map_unused_to_zeropage(struct
>>> page_vma_mapped_walk *pvmw,
>>> +                      struct folio *folio,
>>> +                      unsigned long idx)
>>> +{
>>> +    struct page *page = folio_page(folio, idx);
>>> +    bool contains_data;
>>> +    pte_t newpte;
>>> +    void *addr;
>>> +
>>> +    VM_BUG_ON_PAGE(PageCompound(page), page);
>>> +    VM_BUG_ON_PAGE(!PageAnon(page), page);
>>> +    VM_BUG_ON_PAGE(!PageLocked(page), page);
>>> +    VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
>>> +
>>> +    if (folio_test_mlocked(folio) || (pvmw->vma->vm_flags &
>>> VM_LOCKED) ||
>>> +        mm_forbids_zeropage(pvmw->vma->vm_mm))
>>> +        return false;
>>> +
>>> +    /*
>>> +     * The pmd entry mapping the old thp was flushed and the pte
>>> mapping
>>> +     * this subpage has been non present. If the subpage is only
>>> zero-filled
>>> +     * then map it to the shared zeropage.
>>> +     */
>>> +    addr = kmap_local_page(page);
>>> +    contains_data = memchr_inv(addr, 0, PAGE_SIZE);
>>> +    kunmap_local(addr);
>>> +
>>> +    if (contains_data)
>>> +        return false;
>>> +
>>> +    newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
>>> +                    pvmw->vma->vm_page_prot));
>>> +    set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte,
>>> newpte);
>>> +
>>> +    dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
>>> +    return true;
>>> +}
>>> +
>>> +struct rmap_walk_arg {
>>> +    struct folio *folio;
>>> +    bool map_unused_to_zeropage;
>>> +};
>>> +
>>>   /*
>>>    * Restore a potential migration pte to a working pte entry
>>>    */
>>>   static bool remove_migration_pte(struct folio *folio,
>>> -        struct vm_area_struct *vma, unsigned long addr, void
>>> *old)
>>> +        struct vm_area_struct *vma, unsigned long addr, void
>>> *arg)
>>>   {
>>> -    DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC |
>>> PVMW_MIGRATION);
>>> +    struct rmap_walk_arg *rmap_walk_arg = arg;
>>> +    DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr,
>>> PVMW_SYNC | PVMW_MIGRATION);
>>>         while (page_vma_mapped_walk(&pvmw)) {
>>>           rmap_t rmap_flags = RMAP_NONE;
>>> @@ -234,6 +278,9 @@ static bool remove_migration_pte(struct folio
>>> *folio,
>>>               continue;
>>>           }
>>>   #endif
>>> +        if (rmap_walk_arg->map_unused_to_zeropage &&
>>> +            try_to_map_unused_to_zeropage(&pvmw, folio,
>>> idx))
>>> +            continue;
>>>             folio_get(folio);
>>>           pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
>>> @@ -312,14 +359,21 @@ static bool remove_migration_pte(struct folio
>>> *folio,
>>>    * Get rid of all migration entries and replace them by
>>>    * references to the indicated page.
>>>    */
>>> -void remove_migration_ptes(struct folio *src, struct folio *dst,
>>> bool locked)
>>> +void remove_migration_ptes(struct folio *src, struct folio *dst, int
>>> flags)
>>>   {
>>> +    struct rmap_walk_arg rmap_walk_arg = {
>>> +        .folio = src,
>>> +        .map_unused_to_zeropage = flags &
>>> RMP_USE_SHARED_ZEROPAGE,
>>> +    };
>>> +
>>>       struct rmap_walk_control rwc = {
>>>           .rmap_one = remove_migration_pte,
>>> -        .arg = src,
>>> +        .arg = &rmap_walk_arg,
>>>       };
>>>   -    if (locked)
>>> +    VM_BUG_ON_FOLIO((flags & RMP_USE_SHARED_ZEROPAGE) && (src !=
>>> dst), src);
>>> +
>>> +    if (flags & RMP_LOCKED)
>>>           rmap_walk_locked(dst, &rwc);
>>>       else
>>>           rmap_walk(dst, &rwc);
>>> @@ -934,7 +988,7 @@ static int writeout(struct address_space
>>> *mapping, struct folio *folio)
>>>        * At this point we know that the migration attempt cannot
>>>        * be successful.
>>>        */
>>> -    remove_migration_ptes(folio, folio, false);
>>> +    remove_migration_ptes(folio, folio, 0);
>>>         rc = mapping->a_ops->writepage(&folio->page, &wbc);
>>>   @@ -1098,7 +1152,7 @@ static void migrate_folio_undo_src(struct folio
>>> *src,
>>>                      struct list_head *ret)
>>>   {
>>>       if (page_was_mapped)
>>> -        remove_migration_ptes(src, src, false);
>>> +        remove_migration_ptes(src, src, 0);
>>>       /* Drop an anon_vma reference if we took one */
>>>       if (anon_vma)
>>>           put_anon_vma(anon_vma);
>>> @@ -1336,7 +1390,7 @@ static int migrate_folio_move(free_folio_t
>>> put_new_folio, unsigned long private,
>>>           lru_add_drain();
>>>         if (old_page_state & PAGE_WAS_MAPPED)
>>> -        remove_migration_ptes(src, dst, false);
>>> +        remove_migration_ptes(src, dst, 0);
>>>     out_unlock_both:
>>>       folio_unlock(dst);
>>> @@ -1474,7 +1528,7 @@ static int unmap_and_move_huge_page(new_folio_t
>>> get_new_folio,
>>>         if (page_was_mapped)
>>>           remove_migration_ptes(src,
>>> -            rc == MIGRATEPAGE_SUCCESS ? dst : src,
>>> false);
>>> +            rc == MIGRATEPAGE_SUCCESS ? dst : src, 0);
>>>     unlock_put_anon:
>>>       folio_unlock(dst);
>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index 8d687de88a03..9cf26592ac93 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -424,7 +424,7 @@ static unsigned long
>>> migrate_device_unmap(unsigned long *src_pfns,
>>>               continue;
>>>             folio = page_folio(page);
>>> -        remove_migration_ptes(folio, folio, false);
>>> +        remove_migration_ptes(folio, folio, 0);
>>>             src_pfns[i] = 0;
>>>           folio_unlock(folio);
>>> @@ -840,7 +840,7 @@ void migrate_device_finalize(unsigned long
>>> *src_pfns,
>>>               dst = src;
>>>           }
>>>   -        remove_migration_ptes(src, dst, false);
>>> +        remove_migration_ptes(src, dst, 0);
>>>           folio_unlock(src);
>>>             if (folio_is_zone_device(src))
>>
>> Hi,
>>
>> This patch has been in the mainline for some time, but we recently
>> discovered an issue when both mTHP and MTE (Memory Tagging Extension)
>> are enabled.
>>
>> It seems that remapping to the same zeropage might causes MTE tag
>> mismatches, since MTE tags are associated with physical addresses.
> 
> Does this only trigger when the VMA has mte enabled? Maybe we'll have to bail out if we detect that mte is enabled.
> 

I believe MTE is all or nothing? i.e. all the memory is tagged when enabled, but will let the arm folks confirm.

Yeah unfortunately I think that might be the only way. We cant change the pointers and I dont think there is a way
to mark the memory as "untagged". If we cant remap to zeropage, then there is no point of shrinker. I am guessing
instead of checking at runtime whether mte is enabled when remapping to shared zeropage, we need to ifndef the
shrinker if CONFIG_ARM64_MTE is enabled?

> Also, I wonder how KSM and the shared zeropage works in general with that, because I would expect similar issues when we de-duplicate memory?
> 

Yeah thats a very good point!

Also the initial report mentioned mTHP instead of THP, but I dont think that matters.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2025-09-18 11:42       ` Usama Arif
@ 2025-09-18 11:57         ` David Hildenbrand
  0 siblings, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2025-09-18 11:57 UTC (permalink / raw)
  To: Usama Arif, Qun-wei Lin (林群崴),
	catalin.marinas@arm.com, linux-mm@kvack.org, yuzhao@google.com,
	akpm@linux-foundation.org
  Cc: corbet@lwn.net, Andrew Yang (楊智強),
	npache@redhat.com, rppt@kernel.org, willy@infradead.org,
	kernel-team@meta.com, roman.gushchin@linux.dev,
	hannes@cmpxchg.org, cerasuolodomenico@gmail.com,
	linux-kernel@vger.kernel.org, ryncsn@gmail.com, surenb@google.com,
	riel@surriel.com, shakeel.butt@linux.dev,
	Chinwen Chang (張錦文),
	linux-doc@vger.kernel.org, Casper Li (李中榮),
	ryan.roberts@arm.com, linux-mediatek@lists.infradead.org,
	baohua@kernel.org, kaleshsingh@google.com, zhais@google.com,
	linux-arm-kernel@lists.infradead.org

On 18.09.25 13:42, Usama Arif wrote:
> 
> 
> On 18/09/2025 09:56, David Hildenbrand wrote:
>> On 18.09.25 10:53, Qun-wei Lin (林群崴) wrote:
>>> On Fri, 2024-08-30 at 11:03 +0100, Usama Arif wrote:
>>>> From: Yu Zhao <yuzhao@google.com>
>>>>
>>>> Here being unused means containing only zeros and inaccessible to
>>>> userspace. When splitting an isolated thp under reclaim or migration,
>>>> the unused subpages can be mapped to the shared zeropage, hence
>>>> saving
>>>> memory. This is particularly helpful when the internal
>>>> fragmentation of a thp is high, i.e. it has many untouched subpages.
>>>>
>>>> This is also a prerequisite for THP low utilization shrinker which
>>>> will
>>>> be introduced in later patches, where underutilized THPs are split,
>>>> and
>>>> the zero-filled pages are freed saving memory.
>>>>
>>>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>>>> Tested-by: Shuang Zhai <zhais@google.com>
>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>> ---
>>>>    include/linux/rmap.h |  7 ++++-
>>>>    mm/huge_memory.c     |  8 ++---
>>>>    mm/migrate.c         | 72 ++++++++++++++++++++++++++++++++++++++----
>>>> -- 
>>>>    mm/migrate_device.c  |  4 +--
>>>>    4 files changed, 75 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>> index 91b5935e8485..d5e93e44322e 100644
>>>> --- a/include/linux/rmap.h
>>>> +++ b/include/linux/rmap.h
>>>> @@ -745,7 +745,12 @@ int folio_mkclean(struct folio *);
>>>>    int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages,
>>>> pgoff_t pgoff,
>>>>                  struct vm_area_struct *vma);
>>>>    -void remove_migration_ptes(struct folio *src, struct folio *dst,
>>>> bool locked);
>>>> +enum rmp_flags {
>>>> +    RMP_LOCKED        = 1 << 0,
>>>> +    RMP_USE_SHARED_ZEROPAGE    = 1 << 1,
>>>> +};
>>>> +
>>>> +void remove_migration_ptes(struct folio *src, struct folio *dst, int
>>>> flags);
>>>>      /*
>>>>     * rmap_walk_control: To control rmap traversing for specific needs
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 0c48806ccb9a..af60684e7c70 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -3020,7 +3020,7 @@ bool unmap_huge_pmd_locked(struct
>>>> vm_area_struct *vma, unsigned long addr,
>>>>        return false;
>>>>    }
>>>>    -static void remap_page(struct folio *folio, unsigned long nr)
>>>> +static void remap_page(struct folio *folio, unsigned long nr, int
>>>> flags)
>>>>    {
>>>>        int i = 0;
>>>>    @@ -3028,7 +3028,7 @@ static void remap_page(struct folio *folio,
>>>> unsigned long nr)
>>>>        if (!folio_test_anon(folio))
>>>>            return;
>>>>        for (;;) {
>>>> -        remove_migration_ptes(folio, folio, true);
>>>> +        remove_migration_ptes(folio, folio, RMP_LOCKED |
>>>> flags);
>>>>            i += folio_nr_pages(folio);
>>>>            if (i >= nr)
>>>>                break;
>>>> @@ -3240,7 +3240,7 @@ static void __split_huge_page(struct page
>>>> *page, struct list_head *list,
>>>>          if (nr_dropped)
>>>>            shmem_uncharge(folio->mapping->host, nr_dropped);
>>>> -    remap_page(folio, nr);
>>>> +    remap_page(folio, nr, PageAnon(head) ?
>>>> RMP_USE_SHARED_ZEROPAGE : 0);
>>>>          /*
>>>>         * set page to its compound_head when split to non order-0
>>>> pages, so
>>>> @@ -3542,7 +3542,7 @@ int split_huge_page_to_list_to_order(struct
>>>> page *page, struct list_head *list,
>>>>            if (mapping)
>>>>                xas_unlock(&xas);
>>>>            local_irq_enable();
>>>> -        remap_page(folio, folio_nr_pages(folio));
>>>> +        remap_page(folio, folio_nr_pages(folio), 0);
>>>>            ret = -EAGAIN;
>>>>        }
>>>>    diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index 6f9c62c746be..d039863e014b 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -204,13 +204,57 @@ bool isolate_folio_to_list(struct folio *folio,
>>>> struct list_head *list)
>>>>        return true;
>>>>    }
>>>>    +static bool try_to_map_unused_to_zeropage(struct
>>>> page_vma_mapped_walk *pvmw,
>>>> +                      struct folio *folio,
>>>> +                      unsigned long idx)
>>>> +{
>>>> +    struct page *page = folio_page(folio, idx);
>>>> +    bool contains_data;
>>>> +    pte_t newpte;
>>>> +    void *addr;
>>>> +
>>>> +    VM_BUG_ON_PAGE(PageCompound(page), page);
>>>> +    VM_BUG_ON_PAGE(!PageAnon(page), page);
>>>> +    VM_BUG_ON_PAGE(!PageLocked(page), page);
>>>> +    VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
>>>> +
>>>> +    if (folio_test_mlocked(folio) || (pvmw->vma->vm_flags &
>>>> VM_LOCKED) ||
>>>> +        mm_forbids_zeropage(pvmw->vma->vm_mm))
>>>> +        return false;
>>>> +
>>>> +    /*
>>>> +     * The pmd entry mapping the old thp was flushed and the pte
>>>> mapping
>>>> +     * this subpage has been non present. If the subpage is only
>>>> zero-filled
>>>> +     * then map it to the shared zeropage.
>>>> +     */
>>>> +    addr = kmap_local_page(page);
>>>> +    contains_data = memchr_inv(addr, 0, PAGE_SIZE);
>>>> +    kunmap_local(addr);
>>>> +
>>>> +    if (contains_data)
>>>> +        return false;
>>>> +
>>>> +    newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
>>>> +                    pvmw->vma->vm_page_prot));
>>>> +    set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte,
>>>> newpte);
>>>> +
>>>> +    dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
>>>> +    return true;
>>>> +}
>>>> +
>>>> +struct rmap_walk_arg {
>>>> +    struct folio *folio;
>>>> +    bool map_unused_to_zeropage;
>>>> +};
>>>> +
>>>>    /*
>>>>     * Restore a potential migration pte to a working pte entry
>>>>     */
>>>>    static bool remove_migration_pte(struct folio *folio,
>>>> -        struct vm_area_struct *vma, unsigned long addr, void
>>>> *old)
>>>> +        struct vm_area_struct *vma, unsigned long addr, void
>>>> *arg)
>>>>    {
>>>> -    DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC |
>>>> PVMW_MIGRATION);
>>>> +    struct rmap_walk_arg *rmap_walk_arg = arg;
>>>> +    DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr,
>>>> PVMW_SYNC | PVMW_MIGRATION);
>>>>          while (page_vma_mapped_walk(&pvmw)) {
>>>>            rmap_t rmap_flags = RMAP_NONE;
>>>> @@ -234,6 +278,9 @@ static bool remove_migration_pte(struct folio
>>>> *folio,
>>>>                continue;
>>>>            }
>>>>    #endif
>>>> +        if (rmap_walk_arg->map_unused_to_zeropage &&
>>>> +            try_to_map_unused_to_zeropage(&pvmw, folio,
>>>> idx))
>>>> +            continue;
>>>>              folio_get(folio);
>>>>            pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
>>>> @@ -312,14 +359,21 @@ static bool remove_migration_pte(struct folio
>>>> *folio,
>>>>     * Get rid of all migration entries and replace them by
>>>>     * references to the indicated page.
>>>>     */
>>>> -void remove_migration_ptes(struct folio *src, struct folio *dst,
>>>> bool locked)
>>>> +void remove_migration_ptes(struct folio *src, struct folio *dst, int
>>>> flags)
>>>>    {
>>>> +    struct rmap_walk_arg rmap_walk_arg = {
>>>> +        .folio = src,
>>>> +        .map_unused_to_zeropage = flags &
>>>> RMP_USE_SHARED_ZEROPAGE,
>>>> +    };
>>>> +
>>>>        struct rmap_walk_control rwc = {
>>>>            .rmap_one = remove_migration_pte,
>>>> -        .arg = src,
>>>> +        .arg = &rmap_walk_arg,
>>>>        };
>>>>    -    if (locked)
>>>> +    VM_BUG_ON_FOLIO((flags & RMP_USE_SHARED_ZEROPAGE) && (src !=
>>>> dst), src);
>>>> +
>>>> +    if (flags & RMP_LOCKED)
>>>>            rmap_walk_locked(dst, &rwc);
>>>>        else
>>>>            rmap_walk(dst, &rwc);
>>>> @@ -934,7 +988,7 @@ static int writeout(struct address_space
>>>> *mapping, struct folio *folio)
>>>>         * At this point we know that the migration attempt cannot
>>>>         * be successful.
>>>>         */
>>>> -    remove_migration_ptes(folio, folio, false);
>>>> +    remove_migration_ptes(folio, folio, 0);
>>>>          rc = mapping->a_ops->writepage(&folio->page, &wbc);
>>>>    @@ -1098,7 +1152,7 @@ static void migrate_folio_undo_src(struct folio
>>>> *src,
>>>>                       struct list_head *ret)
>>>>    {
>>>>        if (page_was_mapped)
>>>> -        remove_migration_ptes(src, src, false);
>>>> +        remove_migration_ptes(src, src, 0);
>>>>        /* Drop an anon_vma reference if we took one */
>>>>        if (anon_vma)
>>>>            put_anon_vma(anon_vma);
>>>> @@ -1336,7 +1390,7 @@ static int migrate_folio_move(free_folio_t
>>>> put_new_folio, unsigned long private,
>>>>            lru_add_drain();
>>>>          if (old_page_state & PAGE_WAS_MAPPED)
>>>> -        remove_migration_ptes(src, dst, false);
>>>> +        remove_migration_ptes(src, dst, 0);
>>>>      out_unlock_both:
>>>>        folio_unlock(dst);
>>>> @@ -1474,7 +1528,7 @@ static int unmap_and_move_huge_page(new_folio_t
>>>> get_new_folio,
>>>>          if (page_was_mapped)
>>>>            remove_migration_ptes(src,
>>>> -            rc == MIGRATEPAGE_SUCCESS ? dst : src,
>>>> false);
>>>> +            rc == MIGRATEPAGE_SUCCESS ? dst : src, 0);
>>>>      unlock_put_anon:
>>>>        folio_unlock(dst);
>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>> index 8d687de88a03..9cf26592ac93 100644
>>>> --- a/mm/migrate_device.c
>>>> +++ b/mm/migrate_device.c
>>>> @@ -424,7 +424,7 @@ static unsigned long
>>>> migrate_device_unmap(unsigned long *src_pfns,
>>>>                continue;
>>>>              folio = page_folio(page);
>>>> -        remove_migration_ptes(folio, folio, false);
>>>> +        remove_migration_ptes(folio, folio, 0);
>>>>              src_pfns[i] = 0;
>>>>            folio_unlock(folio);
>>>> @@ -840,7 +840,7 @@ void migrate_device_finalize(unsigned long
>>>> *src_pfns,
>>>>                dst = src;
>>>>            }
>>>>    -        remove_migration_ptes(src, dst, false);
>>>> +        remove_migration_ptes(src, dst, 0);
>>>>            folio_unlock(src);
>>>>              if (folio_is_zone_device(src))
>>>
>>> Hi,
>>>
>>> This patch has been in the mainline for some time, but we recently
>>> discovered an issue when both mTHP and MTE (Memory Tagging Extension)
>>> are enabled.
>>>
>>> It seems that remapping to the same zeropage might causes MTE tag
>>> mismatches, since MTE tags are associated with physical addresses.
>>
>> Does this only trigger when the VMA has mte enabled? Maybe we'll have to bail out if we detect that mte is enabled.
>>
> 
> I believe MTE is all or nothing? i.e. all the memory is tagged when enabled, but will let the arm folks confirm.
> 

I think for a process it's only active when PROT_MTE was called.

In khugepaged, I recall that we use 
copy_mc_user_highpage()->copy_user_highpage() that translates on arm64 
to its custom copy_highpage() where tags are copied as well.

So that's why khugepaged works.

KSM? Hmmm. Not sure how that is fenced off. We'd likely need some hidden 
page_mte_tagged() check somewhere.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2025-09-18  8:56     ` David Hildenbrand
  2025-09-18 11:42       ` Usama Arif
@ 2025-09-18 12:22       ` Lance Yang
  2025-09-18 12:25         ` Lance Yang
  2025-09-18 12:35         ` David Hildenbrand
  1 sibling, 2 replies; 33+ messages in thread
From: Lance Yang @ 2025-09-18 12:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qun-wei Lin (林群崴), catalin.marinas@arm.com,
	usamaarif642@gmail.com, linux-mm@kvack.org, yuzhao@google.com,
	akpm@linux-foundation.org, corbet@lwn.net,
	Andrew Yang (楊智強), npache@redhat.com,
	rppt@kernel.org, willy@infradead.org, kernel-team@meta.com,
	roman.gushchin@linux.dev, hannes@cmpxchg.org,
	cerasuolodomenico@gmail.com, linux-kernel@vger.kernel.org,
	ryncsn@gmail.com, surenb@google.com, riel@surriel.com,
	shakeel.butt@linux.dev, Chinwen Chang (張錦文),
	linux-doc@vger.kernel.org, Casper Li (李中榮),
	ryan.roberts@arm.com, linux-mediatek@lists.infradead.org,
	baohua@kernel.org, kaleshsingh@google.com, zhais@google.com,
	linux-arm-kernel@lists.infradead.org

On Thu, Sep 18, 2025 at 5:21 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 18.09.25 10:53, Qun-wei Lin (林群崴) wrote:
> > On Fri, 2024-08-30 at 11:03 +0100, Usama Arif wrote:
> >> From: Yu Zhao <yuzhao@google.com>
> >>
> >> Here being unused means containing only zeros and inaccessible to
> >> userspace. When splitting an isolated thp under reclaim or migration,
> >> the unused subpages can be mapped to the shared zeropage, hence
> >> saving
> >> memory. This is particularly helpful when the internal
> >> fragmentation of a thp is high, i.e. it has many untouched subpages.
> >>
> >> This is also a prerequisite for THP low utilization shrinker which
> >> will
> >> be introduced in later patches, where underutilized THPs are split,
> >> and
> >> the zero-filled pages are freed saving memory.
> >>
> >> Signed-off-by: Yu Zhao <yuzhao@google.com>
> >> Tested-by: Shuang Zhai <zhais@google.com>
> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >> ---
> >>   include/linux/rmap.h |  7 ++++-
> >>   mm/huge_memory.c     |  8 ++---
> >>   mm/migrate.c         | 72 ++++++++++++++++++++++++++++++++++++++----
> >> --
> >>   mm/migrate_device.c  |  4 +--
> >>   4 files changed, 75 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> >> index 91b5935e8485..d5e93e44322e 100644
> >> --- a/include/linux/rmap.h
> >> +++ b/include/linux/rmap.h
> >> @@ -745,7 +745,12 @@ int folio_mkclean(struct folio *);
> >>   int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages,
> >> pgoff_t pgoff,
> >>                    struct vm_area_struct *vma);
> >>
> >> -void remove_migration_ptes(struct folio *src, struct folio *dst,
> >> bool locked);
> >> +enum rmp_flags {
> >> +    RMP_LOCKED              = 1 << 0,
> >> +    RMP_USE_SHARED_ZEROPAGE = 1 << 1,
> >> +};
> >> +
> >> +void remove_migration_ptes(struct folio *src, struct folio *dst, int
> >> flags);
> >>
> >>   /*
> >>    * rmap_walk_control: To control rmap traversing for specific needs
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 0c48806ccb9a..af60684e7c70 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -3020,7 +3020,7 @@ bool unmap_huge_pmd_locked(struct
> >> vm_area_struct *vma, unsigned long addr,
> >>      return false;
> >>   }
> >>
> >> -static void remap_page(struct folio *folio, unsigned long nr)
> >> +static void remap_page(struct folio *folio, unsigned long nr, int
> >> flags)
> >>   {
> >>      int i = 0;
> >>
> >> @@ -3028,7 +3028,7 @@ static void remap_page(struct folio *folio,
> >> unsigned long nr)
> >>      if (!folio_test_anon(folio))
> >>              return;
> >>      for (;;) {
> >> -            remove_migration_ptes(folio, folio, true);
> >> +            remove_migration_ptes(folio, folio, RMP_LOCKED |
> >> flags);
> >>              i += folio_nr_pages(folio);
> >>              if (i >= nr)
> >>                      break;
> >> @@ -3240,7 +3240,7 @@ static void __split_huge_page(struct page
> >> *page, struct list_head *list,
> >>
> >>      if (nr_dropped)
> >>              shmem_uncharge(folio->mapping->host, nr_dropped);
> >> -    remap_page(folio, nr);
> >> +    remap_page(folio, nr, PageAnon(head) ?
> >> RMP_USE_SHARED_ZEROPAGE : 0);
> >>
> >>      /*
> >>       * set page to its compound_head when split to non order-0
> >> pages, so
> >> @@ -3542,7 +3542,7 @@ int split_huge_page_to_list_to_order(struct
> >> page *page, struct list_head *list,
> >>              if (mapping)
> >>                      xas_unlock(&xas);
> >>              local_irq_enable();
> >> -            remap_page(folio, folio_nr_pages(folio));
> >> +            remap_page(folio, folio_nr_pages(folio), 0);
> >>              ret = -EAGAIN;
> >>      }
> >>
> >> diff --git a/mm/migrate.c b/mm/migrate.c
> >> index 6f9c62c746be..d039863e014b 100644
> >> --- a/mm/migrate.c
> >> +++ b/mm/migrate.c
> >> @@ -204,13 +204,57 @@ bool isolate_folio_to_list(struct folio *folio,
> >> struct list_head *list)
> >>      return true;
> >>   }
> >>
> >> +static bool try_to_map_unused_to_zeropage(struct
> >> page_vma_mapped_walk *pvmw,
> >> +                                      struct folio *folio,
> >> +                                      unsigned long idx)
> >> +{
> >> +    struct page *page = folio_page(folio, idx);
> >> +    bool contains_data;
> >> +    pte_t newpte;
> >> +    void *addr;
> >> +
> >> +    VM_BUG_ON_PAGE(PageCompound(page), page);
> >> +    VM_BUG_ON_PAGE(!PageAnon(page), page);
> >> +    VM_BUG_ON_PAGE(!PageLocked(page), page);
> >> +    VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
> >> +
> >> +    if (folio_test_mlocked(folio) || (pvmw->vma->vm_flags &
> >> VM_LOCKED) ||
> >> +        mm_forbids_zeropage(pvmw->vma->vm_mm))
> >> +            return false;
> >> +
> >> +    /*
> >> +     * The pmd entry mapping the old thp was flushed and the pte
> >> mapping
> >> +     * this subpage has been non present. If the subpage is only
> >> zero-filled
> >> +     * then map it to the shared zeropage.
> >> +     */
> >> +    addr = kmap_local_page(page);
> >> +    contains_data = memchr_inv(addr, 0, PAGE_SIZE);
> >> +    kunmap_local(addr);
> >> +
> >> +    if (contains_data)
> >> +            return false;
> >> +
> >> +    newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
> >> +                                    pvmw->vma->vm_page_prot));
> >> +    set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte,
> >> newpte);
> >> +
> >> +    dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
> >> +    return true;
> >> +}
> >> +
> >> +struct rmap_walk_arg {
> >> +    struct folio *folio;
> >> +    bool map_unused_to_zeropage;
> >> +};
> >> +
> >>   /*
> >>    * Restore a potential migration pte to a working pte entry
> >>    */
> >>   static bool remove_migration_pte(struct folio *folio,
> >> -            struct vm_area_struct *vma, unsigned long addr, void
> >> *old)
> >> +            struct vm_area_struct *vma, unsigned long addr, void
> >> *arg)
> >>   {
> >> -    DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC |
> >> PVMW_MIGRATION);
> >> +    struct rmap_walk_arg *rmap_walk_arg = arg;
> >> +    DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr,
> >> PVMW_SYNC | PVMW_MIGRATION);
> >>
> >>      while (page_vma_mapped_walk(&pvmw)) {
> >>              rmap_t rmap_flags = RMAP_NONE;
> >> @@ -234,6 +278,9 @@ static bool remove_migration_pte(struct folio
> >> *folio,
> >>                      continue;
> >>              }
> >>   #endif
> >> +            if (rmap_walk_arg->map_unused_to_zeropage &&
> >> +                try_to_map_unused_to_zeropage(&pvmw, folio,
> >> idx))
> >> +                    continue;
> >>
> >>              folio_get(folio);
> >>              pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
> >> @@ -312,14 +359,21 @@ static bool remove_migration_pte(struct folio
> >> *folio,
> >>    * Get rid of all migration entries and replace them by
> >>    * references to the indicated page.
> >>    */
> >> -void remove_migration_ptes(struct folio *src, struct folio *dst,
> >> bool locked)
> >> +void remove_migration_ptes(struct folio *src, struct folio *dst, int
> >> flags)
> >>   {
> >> +    struct rmap_walk_arg rmap_walk_arg = {
> >> +            .folio = src,
> >> +            .map_unused_to_zeropage = flags &
> >> RMP_USE_SHARED_ZEROPAGE,
> >> +    };
> >> +
> >>      struct rmap_walk_control rwc = {
> >>              .rmap_one = remove_migration_pte,
> >> -            .arg = src,
> >> +            .arg = &rmap_walk_arg,
> >>      };
> >>
> >> -    if (locked)
> >> +    VM_BUG_ON_FOLIO((flags & RMP_USE_SHARED_ZEROPAGE) && (src !=
> >> dst), src);
> >> +
> >> +    if (flags & RMP_LOCKED)
> >>              rmap_walk_locked(dst, &rwc);
> >>      else
> >>              rmap_walk(dst, &rwc);
> >> @@ -934,7 +988,7 @@ static int writeout(struct address_space
> >> *mapping, struct folio *folio)
> >>       * At this point we know that the migration attempt cannot
> >>       * be successful.
> >>       */
> >> -    remove_migration_ptes(folio, folio, false);
> >> +    remove_migration_ptes(folio, folio, 0);
> >>
> >>      rc = mapping->a_ops->writepage(&folio->page, &wbc);
> >>
> >> @@ -1098,7 +1152,7 @@ static void migrate_folio_undo_src(struct folio
> >> *src,
> >>                                 struct list_head *ret)
> >>   {
> >>      if (page_was_mapped)
> >> -            remove_migration_ptes(src, src, false);
> >> +            remove_migration_ptes(src, src, 0);
> >>      /* Drop an anon_vma reference if we took one */
> >>      if (anon_vma)
> >>              put_anon_vma(anon_vma);
> >> @@ -1336,7 +1390,7 @@ static int migrate_folio_move(free_folio_t
> >> put_new_folio, unsigned long private,
> >>              lru_add_drain();
> >>
> >>      if (old_page_state & PAGE_WAS_MAPPED)
> >> -            remove_migration_ptes(src, dst, false);
> >> +            remove_migration_ptes(src, dst, 0);
> >>
> >>   out_unlock_both:
> >>      folio_unlock(dst);
> >> @@ -1474,7 +1528,7 @@ static int unmap_and_move_huge_page(new_folio_t
> >> get_new_folio,
> >>
> >>      if (page_was_mapped)
> >>              remove_migration_ptes(src,
> >> -                    rc == MIGRATEPAGE_SUCCESS ? dst : src,
> >> false);
> >> +                    rc == MIGRATEPAGE_SUCCESS ? dst : src, 0);
> >>
> >>   unlock_put_anon:
> >>      folio_unlock(dst);
> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> >> index 8d687de88a03..9cf26592ac93 100644
> >> --- a/mm/migrate_device.c
> >> +++ b/mm/migrate_device.c
> >> @@ -424,7 +424,7 @@ static unsigned long
> >> migrate_device_unmap(unsigned long *src_pfns,
> >>                      continue;
> >>
> >>              folio = page_folio(page);
> >> -            remove_migration_ptes(folio, folio, false);
> >> +            remove_migration_ptes(folio, folio, 0);
> >>
> >>              src_pfns[i] = 0;
> >>              folio_unlock(folio);
> >> @@ -840,7 +840,7 @@ void migrate_device_finalize(unsigned long
> >> *src_pfns,
> >>                      dst = src;
> >>              }
> >>
> >> -            remove_migration_ptes(src, dst, false);
> >> +            remove_migration_ptes(src, dst, 0);
> >>              folio_unlock(src);
> >>
> >>              if (folio_is_zone_device(src))
> >
> > Hi,
> >
> > This patch has been in the mainline for some time, but we recently
> > discovered an issue when both mTHP and MTE (Memory Tagging Extension)
> > are enabled.
> >
> > It seems that remapping to the same zeropage might causes MTE tag
> > mismatches, since MTE tags are associated with physical addresses.
>
> Does this only trigger when the VMA has mte enabled? Maybe we'll have to
> bail out if we detect that mte is enabled.

It seems RISC-V also has a similar feature (RISCV_ISA_SUPM) that uses
the same prctl(PR_{GET,SET}_TAGGED_ADDR_CTRL) API.

config RISCV_ISA_SUPM
        bool "Supm extension for userspace pointer masking"
        depends on 64BIT
        default y
        help
          Add support for pointer masking in userspace (Supm) when the
          underlying hardware extension (Smnpm or Ssnpm) is detected at boot.

          If this option is disabled, userspace will be unable to use
          the prctl(PR_{SET,GET}_TAGGED_ADDR_CTRL) API.

I wonder if we should disable the THP shrinker for such architectures that
define PR_SET_TAGGED_ADDR_CTRL (or PR_GET_TAGGED_ADDR_CTRL).

Cheers,
Lance

>
> Also, I wonder how KSM and the shared zeropage works in general with
> that, because I would expect similar issues when we de-duplicate memory?
>
> --
> Cheers
>
> David / dhildenb
>
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2025-09-18 12:22       ` Lance Yang
@ 2025-09-18 12:25         ` Lance Yang
  2025-09-18 12:35         ` David Hildenbrand
  1 sibling, 0 replies; 33+ messages in thread
From: Lance Yang @ 2025-09-18 12:25 UTC (permalink / raw)
  To: Lance Yang
  Cc: David Hildenbrand, Qun-wei Lin (林群崴),
	catalin.marinas@arm.com, usamaarif642@gmail.com,
	linux-mm@kvack.org, yuzhao@google.com, akpm@linux-foundation.org,
	corbet@lwn.net, Andrew Yang (楊智強),
	npache@redhat.com, rppt@kernel.org, willy@infradead.org,
	kernel-team@meta.com, roman.gushchin@linux.dev,
	hannes@cmpxchg.org, cerasuolodomenico@gmail.com,
	linux-kernel@vger.kernel.org, ryncsn@gmail.com, surenb@google.com,
	riel@surriel.com, shakeel.butt@linux.dev,
	Chinwen Chang (張錦文),
	linux-doc@vger.kernel.org, Casper Li (李中榮),
	ryan.roberts@arm.com, linux-mediatek@lists.infradead.org,
	baohua@kernel.org, kaleshsingh@google.com, zhais@google.com,
	linux-arm-kernel@lists.infradead.org

On Thu, Sep 18, 2025 at 8:22 PM Lance Yang <lance.yang@linux.dev> wrote:
>
> On Thu, Sep 18, 2025 at 5:21 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 18.09.25 10:53, Qun-wei Lin (林群崴) wrote:
> > > On Fri, 2024-08-30 at 11:03 +0100, Usama Arif wrote:
> > >> From: Yu Zhao <yuzhao@google.com>
> > >>
> > >> Here being unused means containing only zeros and inaccessible to
> > >> userspace. When splitting an isolated thp under reclaim or migration,
> > >> the unused subpages can be mapped to the shared zeropage, hence
> > >> saving
> > >> memory. This is particularly helpful when the internal
> > >> fragmentation of a thp is high, i.e. it has many untouched subpages.
> > >>
> > >> This is also a prerequisite for THP low utilization shrinker which
> > >> will
> > >> be introduced in later patches, where underutilized THPs are split,
> > >> and
> > >> the zero-filled pages are freed saving memory.
> > >>
> > >> Signed-off-by: Yu Zhao <yuzhao@google.com>
> > >> Tested-by: Shuang Zhai <zhais@google.com>
> > >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> > >> ---
> > >>   include/linux/rmap.h |  7 ++++-
> > >>   mm/huge_memory.c     |  8 ++---
> > >>   mm/migrate.c         | 72 ++++++++++++++++++++++++++++++++++++++----
> > >> --
> > >>   mm/migrate_device.c  |  4 +--
> > >>   4 files changed, 75 insertions(+), 16 deletions(-)
> > >>
> > >> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > >> index 91b5935e8485..d5e93e44322e 100644
> > >> --- a/include/linux/rmap.h
> > >> +++ b/include/linux/rmap.h
> > >> @@ -745,7 +745,12 @@ int folio_mkclean(struct folio *);
> > >>   int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages,
> > >> pgoff_t pgoff,
> > >>                    struct vm_area_struct *vma);
> > >>
> > >> -void remove_migration_ptes(struct folio *src, struct folio *dst,
> > >> bool locked);
> > >> +enum rmp_flags {
> > >> +    RMP_LOCKED              = 1 << 0,
> > >> +    RMP_USE_SHARED_ZEROPAGE = 1 << 1,
> > >> +};
> > >> +
> > >> +void remove_migration_ptes(struct folio *src, struct folio *dst, int
> > >> flags);
> > >>
> > >>   /*
> > >>    * rmap_walk_control: To control rmap traversing for specific needs
> > >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > >> index 0c48806ccb9a..af60684e7c70 100644
> > >> --- a/mm/huge_memory.c
> > >> +++ b/mm/huge_memory.c
> > >> @@ -3020,7 +3020,7 @@ bool unmap_huge_pmd_locked(struct
> > >> vm_area_struct *vma, unsigned long addr,
> > >>      return false;
> > >>   }
> > >>
> > >> -static void remap_page(struct folio *folio, unsigned long nr)
> > >> +static void remap_page(struct folio *folio, unsigned long nr, int
> > >> flags)
> > >>   {
> > >>      int i = 0;
> > >>
> > >> @@ -3028,7 +3028,7 @@ static void remap_page(struct folio *folio,
> > >> unsigned long nr)
> > >>      if (!folio_test_anon(folio))
> > >>              return;
> > >>      for (;;) {
> > >> -            remove_migration_ptes(folio, folio, true);
> > >> +            remove_migration_ptes(folio, folio, RMP_LOCKED |
> > >> flags);
> > >>              i += folio_nr_pages(folio);
> > >>              if (i >= nr)
> > >>                      break;
> > >> @@ -3240,7 +3240,7 @@ static void __split_huge_page(struct page
> > >> *page, struct list_head *list,
> > >>
> > >>      if (nr_dropped)
> > >>              shmem_uncharge(folio->mapping->host, nr_dropped);
> > >> -    remap_page(folio, nr);
> > >> +    remap_page(folio, nr, PageAnon(head) ?
> > >> RMP_USE_SHARED_ZEROPAGE : 0);
> > >>
> > >>      /*
> > >>       * set page to its compound_head when split to non order-0
> > >> pages, so
> > >> @@ -3542,7 +3542,7 @@ int split_huge_page_to_list_to_order(struct
> > >> page *page, struct list_head *list,
> > >>              if (mapping)
> > >>                      xas_unlock(&xas);
> > >>              local_irq_enable();
> > >> -            remap_page(folio, folio_nr_pages(folio));
> > >> +            remap_page(folio, folio_nr_pages(folio), 0);
> > >>              ret = -EAGAIN;
> > >>      }
> > >>
> > >> diff --git a/mm/migrate.c b/mm/migrate.c
> > >> index 6f9c62c746be..d039863e014b 100644
> > >> --- a/mm/migrate.c
> > >> +++ b/mm/migrate.c
> > >> @@ -204,13 +204,57 @@ bool isolate_folio_to_list(struct folio *folio,
> > >> struct list_head *list)
> > >>      return true;
> > >>   }
> > >>
> > >> +static bool try_to_map_unused_to_zeropage(struct
> > >> page_vma_mapped_walk *pvmw,
> > >> +                                      struct folio *folio,
> > >> +                                      unsigned long idx)
> > >> +{
> > >> +    struct page *page = folio_page(folio, idx);
> > >> +    bool contains_data;
> > >> +    pte_t newpte;
> > >> +    void *addr;
> > >> +
> > >> +    VM_BUG_ON_PAGE(PageCompound(page), page);
> > >> +    VM_BUG_ON_PAGE(!PageAnon(page), page);
> > >> +    VM_BUG_ON_PAGE(!PageLocked(page), page);
> > >> +    VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
> > >> +
> > >> +    if (folio_test_mlocked(folio) || (pvmw->vma->vm_flags &
> > >> VM_LOCKED) ||
> > >> +        mm_forbids_zeropage(pvmw->vma->vm_mm))
> > >> +            return false;
> > >> +
> > >> +    /*
> > >> +     * The pmd entry mapping the old thp was flushed and the pte
> > >> mapping
> > >> +     * this subpage has been non present. If the subpage is only
> > >> zero-filled
> > >> +     * then map it to the shared zeropage.
> > >> +     */
> > >> +    addr = kmap_local_page(page);
> > >> +    contains_data = memchr_inv(addr, 0, PAGE_SIZE);
> > >> +    kunmap_local(addr);
> > >> +
> > >> +    if (contains_data)
> > >> +            return false;
> > >> +
> > >> +    newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
> > >> +                                    pvmw->vma->vm_page_prot));
> > >> +    set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte,
> > >> newpte);
> > >> +
> > >> +    dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
> > >> +    return true;
> > >> +}
> > >> +
> > >> +struct rmap_walk_arg {
> > >> +    struct folio *folio;
> > >> +    bool map_unused_to_zeropage;
> > >> +};
> > >> +
> > >>   /*
> > >>    * Restore a potential migration pte to a working pte entry
> > >>    */
> > >>   static bool remove_migration_pte(struct folio *folio,
> > >> -            struct vm_area_struct *vma, unsigned long addr, void
> > >> *old)
> > >> +            struct vm_area_struct *vma, unsigned long addr, void
> > >> *arg)
> > >>   {
> > >> -    DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC |
> > >> PVMW_MIGRATION);
> > >> +    struct rmap_walk_arg *rmap_walk_arg = arg;
> > >> +    DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr,
> > >> PVMW_SYNC | PVMW_MIGRATION);
> > >>
> > >>      while (page_vma_mapped_walk(&pvmw)) {
> > >>              rmap_t rmap_flags = RMAP_NONE;
> > >> @@ -234,6 +278,9 @@ static bool remove_migration_pte(struct folio
> > >> *folio,
> > >>                      continue;
> > >>              }
> > >>   #endif
> > >> +            if (rmap_walk_arg->map_unused_to_zeropage &&
> > >> +                try_to_map_unused_to_zeropage(&pvmw, folio,
> > >> idx))
> > >> +                    continue;
> > >>
> > >>              folio_get(folio);
> > >>              pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
> > >> @@ -312,14 +359,21 @@ static bool remove_migration_pte(struct folio
> > >> *folio,
> > >>    * Get rid of all migration entries and replace them by
> > >>    * references to the indicated page.
> > >>    */
> > >> -void remove_migration_ptes(struct folio *src, struct folio *dst,
> > >> bool locked)
> > >> +void remove_migration_ptes(struct folio *src, struct folio *dst, int
> > >> flags)
> > >>   {
> > >> +    struct rmap_walk_arg rmap_walk_arg = {
> > >> +            .folio = src,
> > >> +            .map_unused_to_zeropage = flags &
> > >> RMP_USE_SHARED_ZEROPAGE,
> > >> +    };
> > >> +
> > >>      struct rmap_walk_control rwc = {
> > >>              .rmap_one = remove_migration_pte,
> > >> -            .arg = src,
> > >> +            .arg = &rmap_walk_arg,
> > >>      };
> > >>
> > >> -    if (locked)
> > >> +    VM_BUG_ON_FOLIO((flags & RMP_USE_SHARED_ZEROPAGE) && (src !=
> > >> dst), src);
> > >> +
> > >> +    if (flags & RMP_LOCKED)
> > >>              rmap_walk_locked(dst, &rwc);
> > >>      else
> > >>              rmap_walk(dst, &rwc);
> > >> @@ -934,7 +988,7 @@ static int writeout(struct address_space
> > >> *mapping, struct folio *folio)
> > >>       * At this point we know that the migration attempt cannot
> > >>       * be successful.
> > >>       */
> > >> -    remove_migration_ptes(folio, folio, false);
> > >> +    remove_migration_ptes(folio, folio, 0);
> > >>
> > >>      rc = mapping->a_ops->writepage(&folio->page, &wbc);
> > >>
> > >> @@ -1098,7 +1152,7 @@ static void migrate_folio_undo_src(struct folio
> > >> *src,
> > >>                                 struct list_head *ret)
> > >>   {
> > >>      if (page_was_mapped)
> > >> -            remove_migration_ptes(src, src, false);
> > >> +            remove_migration_ptes(src, src, 0);
> > >>      /* Drop an anon_vma reference if we took one */
> > >>      if (anon_vma)
> > >>              put_anon_vma(anon_vma);
> > >> @@ -1336,7 +1390,7 @@ static int migrate_folio_move(free_folio_t
> > >> put_new_folio, unsigned long private,
> > >>              lru_add_drain();
> > >>
> > >>      if (old_page_state & PAGE_WAS_MAPPED)
> > >> -            remove_migration_ptes(src, dst, false);
> > >> +            remove_migration_ptes(src, dst, 0);
> > >>
> > >>   out_unlock_both:
> > >>      folio_unlock(dst);
> > >> @@ -1474,7 +1528,7 @@ static int unmap_and_move_huge_page(new_folio_t
> > >> get_new_folio,
> > >>
> > >>      if (page_was_mapped)
> > >>              remove_migration_ptes(src,
> > >> -                    rc == MIGRATEPAGE_SUCCESS ? dst : src,
> > >> false);
> > >> +                    rc == MIGRATEPAGE_SUCCESS ? dst : src, 0);
> > >>
> > >>   unlock_put_anon:
> > >>      folio_unlock(dst);
> > >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > >> index 8d687de88a03..9cf26592ac93 100644
> > >> --- a/mm/migrate_device.c
> > >> +++ b/mm/migrate_device.c
> > >> @@ -424,7 +424,7 @@ static unsigned long
> > >> migrate_device_unmap(unsigned long *src_pfns,
> > >>                      continue;
> > >>
> > >>              folio = page_folio(page);
> > >> -            remove_migration_ptes(folio, folio, false);
> > >> +            remove_migration_ptes(folio, folio, 0);
> > >>
> > >>              src_pfns[i] = 0;
> > >>              folio_unlock(folio);
> > >> @@ -840,7 +840,7 @@ void migrate_device_finalize(unsigned long
> > >> *src_pfns,
> > >>                      dst = src;
> > >>              }
> > >>
> > >> -            remove_migration_ptes(src, dst, false);
> > >> +            remove_migration_ptes(src, dst, 0);
> > >>              folio_unlock(src);
> > >>
> > >>              if (folio_is_zone_device(src))
> > >
> > > Hi,
> > >
> > > This patch has been in the mainline for some time, but we recently
> > > discovered an issue when both mTHP and MTE (Memory Tagging Extension)
> > > are enabled.
> > >
> > > It seems that remapping to the same zeropage might causes MTE tag
> > > mismatches, since MTE tags are associated with physical addresses.
> >
> > Does this only trigger when the VMA has mte enabled? Maybe we'll have to
> > bail out if we detect that mte is enabled.
>
> It seems RISC-V also has a similar feature (RISCV_ISA_SUPM) that uses
> the same prctl(PR_{GET,SET}_TAGGED_ADDR_CTRL) API.
>
> config RISCV_ISA_SUPM
>         bool "Supm extension for userspace pointer masking"
>         depends on 64BIT
>         default y
>         help
>           Add support for pointer masking in userspace (Supm) when the
>           underlying hardware extension (Smnpm or Ssnpm) is detected at boot.
>
>           If this option is disabled, userspace will be unable to use
>           the prctl(PR_{SET,GET}_TAGGED_ADDR_CTRL) API.
>
> I wonder if we should disable the THP shrinker for such architectures that
> define PR_SET_TAGGED_ADDR_CTRL (or PR_GET_TAGGED_ADDR_CTRL).

SET_TAGGED_ADDR_CTRL or GET_TAGGED_ADDR_CTRL

File: kernel/sys.c 114
#ifndef SET_TAGGED_ADDR_CTRL
# define SET_TAGGED_ADDR_CTRL(a) (-EINVAL)
#endif
#ifndef GET_TAGGED_ADDR_CTRL
# define GET_TAGGED_ADDR_CTRL() (-EINVAL)
#endif

Cheers,
Lance

>
> Cheers,
> Lance
>
> >
> > Also, I wonder how KSM and the shared zeropage works in general with
> > that, because I would expect similar issues when we de-duplicate memory?
> >
> > --
> > Cheers
> >
> > David / dhildenb
> >
> >


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2025-09-18 12:22       ` Lance Yang
  2025-09-18 12:25         ` Lance Yang
@ 2025-09-18 12:35         ` David Hildenbrand
  2025-09-19  5:16           ` Lance Yang
  1 sibling, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2025-09-18 12:35 UTC (permalink / raw)
  To: Lance Yang
  Cc: Qun-wei Lin (林群崴), catalin.marinas@arm.com,
	usamaarif642@gmail.com, linux-mm@kvack.org, yuzhao@google.com,
	akpm@linux-foundation.org, corbet@lwn.net,
	Andrew Yang (楊智強), npache@redhat.com,
	rppt@kernel.org, willy@infradead.org, kernel-team@meta.com,
	roman.gushchin@linux.dev, hannes@cmpxchg.org,
	cerasuolodomenico@gmail.com, linux-kernel@vger.kernel.org,
	ryncsn@gmail.com, surenb@google.com, riel@surriel.com,
	shakeel.butt@linux.dev, Chinwen Chang (張錦文),
	linux-doc@vger.kernel.org, Casper Li (李中榮),
	ryan.roberts@arm.com, linux-mediatek@lists.infradead.org,
	baohua@kernel.org, kaleshsingh@google.com, zhais@google.com,
	linux-arm-kernel@lists.infradead.org

On 18.09.25 14:22, Lance Yang wrote:
> On Thu, Sep 18, 2025 at 5:21 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 18.09.25 10:53, Qun-wei Lin (林群崴) wrote:
>>> On Fri, 2024-08-30 at 11:03 +0100, Usama Arif wrote:
>>>> From: Yu Zhao <yuzhao@google.com>
>>>>
>>>> Here being unused means containing only zeros and inaccessible to
>>>> userspace. When splitting an isolated thp under reclaim or migration,
>>>> the unused subpages can be mapped to the shared zeropage, hence
>>>> saving
>>>> memory. This is particularly helpful when the internal
>>>> fragmentation of a thp is high, i.e. it has many untouched subpages.
>>>>
>>>> This is also a prerequisite for THP low utilization shrinker which
>>>> will
>>>> be introduced in later patches, where underutilized THPs are split,
>>>> and
>>>> the zero-filled pages are freed saving memory.
>>>>
>>>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>>>> Tested-by: Shuang Zhai <zhais@google.com>
>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>> ---
>>>>    include/linux/rmap.h |  7 ++++-
>>>>    mm/huge_memory.c     |  8 ++---
>>>>    mm/migrate.c         | 72 ++++++++++++++++++++++++++++++++++++++----
>>>> --
>>>>    mm/migrate_device.c  |  4 +--
>>>>    4 files changed, 75 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>> index 91b5935e8485..d5e93e44322e 100644
>>>> --- a/include/linux/rmap.h
>>>> +++ b/include/linux/rmap.h
>>>> @@ -745,7 +745,12 @@ int folio_mkclean(struct folio *);
>>>>    int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages,
>>>> pgoff_t pgoff,
>>>>                     struct vm_area_struct *vma);
>>>>
>>>> -void remove_migration_ptes(struct folio *src, struct folio *dst,
>>>> bool locked);
>>>> +enum rmp_flags {
>>>> +    RMP_LOCKED              = 1 << 0,
>>>> +    RMP_USE_SHARED_ZEROPAGE = 1 << 1,
>>>> +};
>>>> +
>>>> +void remove_migration_ptes(struct folio *src, struct folio *dst, int
>>>> flags);
>>>>
>>>>    /*
>>>>     * rmap_walk_control: To control rmap traversing for specific needs
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 0c48806ccb9a..af60684e7c70 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -3020,7 +3020,7 @@ bool unmap_huge_pmd_locked(struct
>>>> vm_area_struct *vma, unsigned long addr,
>>>>       return false;
>>>>    }
>>>>
>>>> -static void remap_page(struct folio *folio, unsigned long nr)
>>>> +static void remap_page(struct folio *folio, unsigned long nr, int
>>>> flags)
>>>>    {
>>>>       int i = 0;
>>>>
>>>> @@ -3028,7 +3028,7 @@ static void remap_page(struct folio *folio,
>>>> unsigned long nr)
>>>>       if (!folio_test_anon(folio))
>>>>               return;
>>>>       for (;;) {
>>>> -            remove_migration_ptes(folio, folio, true);
>>>> +            remove_migration_ptes(folio, folio, RMP_LOCKED |
>>>> flags);
>>>>               i += folio_nr_pages(folio);
>>>>               if (i >= nr)
>>>>                       break;
>>>> @@ -3240,7 +3240,7 @@ static void __split_huge_page(struct page
>>>> *page, struct list_head *list,
>>>>
>>>>       if (nr_dropped)
>>>>               shmem_uncharge(folio->mapping->host, nr_dropped);
>>>> -    remap_page(folio, nr);
>>>> +    remap_page(folio, nr, PageAnon(head) ?
>>>> RMP_USE_SHARED_ZEROPAGE : 0);
>>>>
>>>>       /*
>>>>        * set page to its compound_head when split to non order-0
>>>> pages, so
>>>> @@ -3542,7 +3542,7 @@ int split_huge_page_to_list_to_order(struct
>>>> page *page, struct list_head *list,
>>>>               if (mapping)
>>>>                       xas_unlock(&xas);
>>>>               local_irq_enable();
>>>> -            remap_page(folio, folio_nr_pages(folio));
>>>> +            remap_page(folio, folio_nr_pages(folio), 0);
>>>>               ret = -EAGAIN;
>>>>       }
>>>>
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index 6f9c62c746be..d039863e014b 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -204,13 +204,57 @@ bool isolate_folio_to_list(struct folio *folio,
>>>> struct list_head *list)
>>>>       return true;
>>>>    }
>>>>
>>>> +static bool try_to_map_unused_to_zeropage(struct
>>>> page_vma_mapped_walk *pvmw,
>>>> +                                      struct folio *folio,
>>>> +                                      unsigned long idx)
>>>> +{
>>>> +    struct page *page = folio_page(folio, idx);
>>>> +    bool contains_data;
>>>> +    pte_t newpte;
>>>> +    void *addr;
>>>> +
>>>> +    VM_BUG_ON_PAGE(PageCompound(page), page);
>>>> +    VM_BUG_ON_PAGE(!PageAnon(page), page);
>>>> +    VM_BUG_ON_PAGE(!PageLocked(page), page);
>>>> +    VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
>>>> +
>>>> +    if (folio_test_mlocked(folio) || (pvmw->vma->vm_flags &
>>>> VM_LOCKED) ||
>>>> +        mm_forbids_zeropage(pvmw->vma->vm_mm))
>>>> +            return false;
>>>> +
>>>> +    /*
>>>> +     * The pmd entry mapping the old thp was flushed and the pte
>>>> mapping
>>>> +     * this subpage has been non present. If the subpage is only
>>>> zero-filled
>>>> +     * then map it to the shared zeropage.
>>>> +     */
>>>> +    addr = kmap_local_page(page);
>>>> +    contains_data = memchr_inv(addr, 0, PAGE_SIZE);
>>>> +    kunmap_local(addr);
>>>> +
>>>> +    if (contains_data)
>>>> +            return false;
>>>> +
>>>> +    newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
>>>> +                                    pvmw->vma->vm_page_prot));
>>>> +    set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte,
>>>> newpte);
>>>> +
>>>> +    dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
>>>> +    return true;
>>>> +}
>>>> +
>>>> +struct rmap_walk_arg {
>>>> +    struct folio *folio;
>>>> +    bool map_unused_to_zeropage;
>>>> +};
>>>> +
>>>>    /*
>>>>     * Restore a potential migration pte to a working pte entry
>>>>     */
>>>>    static bool remove_migration_pte(struct folio *folio,
>>>> -            struct vm_area_struct *vma, unsigned long addr, void
>>>> *old)
>>>> +            struct vm_area_struct *vma, unsigned long addr, void
>>>> *arg)
>>>>    {
>>>> -    DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC |
>>>> PVMW_MIGRATION);
>>>> +    struct rmap_walk_arg *rmap_walk_arg = arg;
>>>> +    DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr,
>>>> PVMW_SYNC | PVMW_MIGRATION);
>>>>
>>>>       while (page_vma_mapped_walk(&pvmw)) {
>>>>               rmap_t rmap_flags = RMAP_NONE;
>>>> @@ -234,6 +278,9 @@ static bool remove_migration_pte(struct folio
>>>> *folio,
>>>>                       continue;
>>>>               }
>>>>    #endif
>>>> +            if (rmap_walk_arg->map_unused_to_zeropage &&
>>>> +                try_to_map_unused_to_zeropage(&pvmw, folio,
>>>> idx))
>>>> +                    continue;
>>>>
>>>>               folio_get(folio);
>>>>               pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
>>>> @@ -312,14 +359,21 @@ static bool remove_migration_pte(struct folio
>>>> *folio,
>>>>     * Get rid of all migration entries and replace them by
>>>>     * references to the indicated page.
>>>>     */
>>>> -void remove_migration_ptes(struct folio *src, struct folio *dst,
>>>> bool locked)
>>>> +void remove_migration_ptes(struct folio *src, struct folio *dst, int
>>>> flags)
>>>>    {
>>>> +    struct rmap_walk_arg rmap_walk_arg = {
>>>> +            .folio = src,
>>>> +            .map_unused_to_zeropage = flags &
>>>> RMP_USE_SHARED_ZEROPAGE,
>>>> +    };
>>>> +
>>>>       struct rmap_walk_control rwc = {
>>>>               .rmap_one = remove_migration_pte,
>>>> -            .arg = src,
>>>> +            .arg = &rmap_walk_arg,
>>>>       };
>>>>
>>>> -    if (locked)
>>>> +    VM_BUG_ON_FOLIO((flags & RMP_USE_SHARED_ZEROPAGE) && (src !=
>>>> dst), src);
>>>> +
>>>> +    if (flags & RMP_LOCKED)
>>>>               rmap_walk_locked(dst, &rwc);
>>>>       else
>>>>               rmap_walk(dst, &rwc);
>>>> @@ -934,7 +988,7 @@ static int writeout(struct address_space
>>>> *mapping, struct folio *folio)
>>>>        * At this point we know that the migration attempt cannot
>>>>        * be successful.
>>>>        */
>>>> -    remove_migration_ptes(folio, folio, false);
>>>> +    remove_migration_ptes(folio, folio, 0);
>>>>
>>>>       rc = mapping->a_ops->writepage(&folio->page, &wbc);
>>>>
>>>> @@ -1098,7 +1152,7 @@ static void migrate_folio_undo_src(struct folio
>>>> *src,
>>>>                                  struct list_head *ret)
>>>>    {
>>>>       if (page_was_mapped)
>>>> -            remove_migration_ptes(src, src, false);
>>>> +            remove_migration_ptes(src, src, 0);
>>>>       /* Drop an anon_vma reference if we took one */
>>>>       if (anon_vma)
>>>>               put_anon_vma(anon_vma);
>>>> @@ -1336,7 +1390,7 @@ static int migrate_folio_move(free_folio_t
>>>> put_new_folio, unsigned long private,
>>>>               lru_add_drain();
>>>>
>>>>       if (old_page_state & PAGE_WAS_MAPPED)
>>>> -            remove_migration_ptes(src, dst, false);
>>>> +            remove_migration_ptes(src, dst, 0);
>>>>
>>>>    out_unlock_both:
>>>>       folio_unlock(dst);
>>>> @@ -1474,7 +1528,7 @@ static int unmap_and_move_huge_page(new_folio_t
>>>> get_new_folio,
>>>>
>>>>       if (page_was_mapped)
>>>>               remove_migration_ptes(src,
>>>> -                    rc == MIGRATEPAGE_SUCCESS ? dst : src,
>>>> false);
>>>> +                    rc == MIGRATEPAGE_SUCCESS ? dst : src, 0);
>>>>
>>>>    unlock_put_anon:
>>>>       folio_unlock(dst);
>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>> index 8d687de88a03..9cf26592ac93 100644
>>>> --- a/mm/migrate_device.c
>>>> +++ b/mm/migrate_device.c
>>>> @@ -424,7 +424,7 @@ static unsigned long
>>>> migrate_device_unmap(unsigned long *src_pfns,
>>>>                       continue;
>>>>
>>>>               folio = page_folio(page);
>>>> -            remove_migration_ptes(folio, folio, false);
>>>> +            remove_migration_ptes(folio, folio, 0);
>>>>
>>>>               src_pfns[i] = 0;
>>>>               folio_unlock(folio);
>>>> @@ -840,7 +840,7 @@ void migrate_device_finalize(unsigned long
>>>> *src_pfns,
>>>>                       dst = src;
>>>>               }
>>>>
>>>> -            remove_migration_ptes(src, dst, false);
>>>> +            remove_migration_ptes(src, dst, 0);
>>>>               folio_unlock(src);
>>>>
>>>>               if (folio_is_zone_device(src))
>>>
>>> Hi,
>>>
>>> This patch has been in the mainline for some time, but we recently
>>> discovered an issue when both mTHP and MTE (Memory Tagging Extension)
>>> are enabled.
>>>
>>> It seems that remapping to the same zeropage might causes MTE tag
>>> mismatches, since MTE tags are associated with physical addresses.
>>
>> Does this only trigger when the VMA has mte enabled? Maybe we'll have to
>> bail out if we detect that mte is enabled.
> 
> It seems RISC-V also has a similar feature (RISCV_ISA_SUPM) that uses
> the same prctl(PR_{GET,SET}_TAGGED_ADDR_CTRL) API.
> 
> config RISCV_ISA_SUPM
>          bool "Supm extension for userspace pointer masking"
>          depends on 64BIT
>          default y
>          help
>            Add support for pointer masking in userspace (Supm) when the
>            underlying hardware extension (Smnpm or Ssnpm) is detected at boot.
> 
>            If this option is disabled, userspace will be unable to use
>            the prctl(PR_{SET,GET}_TAGGED_ADDR_CTRL) API.
> 
> I wonder if we should disable the THP shrinker for such architectures that

I think where possible we really only want to identify problematic 
(tagged) pages and skip them. And we should either look into fixing KSM 
as well or finding out why KSM is not affected.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2025-09-18 12:35         ` David Hildenbrand
@ 2025-09-19  5:16           ` Lance Yang
  2025-09-19  7:55             ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: Lance Yang @ 2025-09-19  5:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qun-wei Lin (林群崴), catalin.marinas@arm.com,
	usamaarif642@gmail.com, linux-mm@kvack.org, yuzhao@google.com,
	akpm@linux-foundation.org, corbet@lwn.net,
	Andrew Yang (楊智強), npache@redhat.com,
	rppt@kernel.org, willy@infradead.org, kernel-team@meta.com,
	roman.gushchin@linux.dev, hannes@cmpxchg.org,
	cerasuolodomenico@gmail.com, linux-kernel@vger.kernel.org,
	ryncsn@gmail.com, surenb@google.com, riel@surriel.com,
	shakeel.butt@linux.dev, Chinwen Chang (張錦文),
	linux-doc@vger.kernel.org, Casper Li (李中榮),
	ryan.roberts@arm.com, linux-mediatek@lists.infradead.org,
	baohua@kernel.org, kaleshsingh@google.com, zhais@google.com,
	linux-arm-kernel@lists.infradead.org



On 2025/9/18 20:35, David Hildenbrand wrote:
> On 18.09.25 14:22, Lance Yang wrote:
>> On Thu, Sep 18, 2025 at 5:21 PM David Hildenbrand <david@redhat.com> 
>> wrote:
>>>
>>> On 18.09.25 10:53, Qun-wei Lin (林群崴) wrote:
>>>> On Fri, 2024-08-30 at 11:03 +0100, Usama Arif wrote:
>>>>> From: Yu Zhao <yuzhao@google.com>
>>>>>
>>>>> Here being unused means containing only zeros and inaccessible to
>>>>> userspace. When splitting an isolated thp under reclaim or migration,
>>>>> the unused subpages can be mapped to the shared zeropage, hence
>>>>> saving
>>>>> memory. This is particularly helpful when the internal
>>>>> fragmentation of a thp is high, i.e. it has many untouched subpages.
>>>>>
>>>>> This is also a prerequisite for THP low utilization shrinker which
>>>>> will
>>>>> be introduced in later patches, where underutilized THPs are split,
>>>>> and
>>>>> the zero-filled pages are freed saving memory.
>>>>>
>>>>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>>>>> Tested-by: Shuang Zhai <zhais@google.com>
>>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>>> ---
>>>>>    include/linux/rmap.h |  7 ++++-
>>>>>    mm/huge_memory.c     |  8 ++---
>>>>>    mm/migrate.c         | 72 +++++++++++++++++++++++++++++++++++++ 
>>>>> +----
>>>>> -- 
>>>>>    mm/migrate_device.c  |  4 +--
>>>>>    4 files changed, 75 insertions(+), 16 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>>> index 91b5935e8485..d5e93e44322e 100644
>>>>> --- a/include/linux/rmap.h
>>>>> +++ b/include/linux/rmap.h
>>>>> @@ -745,7 +745,12 @@ int folio_mkclean(struct folio *);
>>>>>    int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages,
>>>>> pgoff_t pgoff,
>>>>>                     struct vm_area_struct *vma);
>>>>>
>>>>> -void remove_migration_ptes(struct folio *src, struct folio *dst,
>>>>> bool locked);
>>>>> +enum rmp_flags {
>>>>> +    RMP_LOCKED              = 1 << 0,
>>>>> +    RMP_USE_SHARED_ZEROPAGE = 1 << 1,
>>>>> +};
>>>>> +
>>>>> +void remove_migration_ptes(struct folio *src, struct folio *dst, int
>>>>> flags);
>>>>>
>>>>>    /*
>>>>>     * rmap_walk_control: To control rmap traversing for specific needs
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index 0c48806ccb9a..af60684e7c70 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -3020,7 +3020,7 @@ bool unmap_huge_pmd_locked(struct
>>>>> vm_area_struct *vma, unsigned long addr,
>>>>>       return false;
>>>>>    }
>>>>>
>>>>> -static void remap_page(struct folio *folio, unsigned long nr)
>>>>> +static void remap_page(struct folio *folio, unsigned long nr, int
>>>>> flags)
>>>>>    {
>>>>>       int i = 0;
>>>>>
>>>>> @@ -3028,7 +3028,7 @@ static void remap_page(struct folio *folio,
>>>>> unsigned long nr)
>>>>>       if (!folio_test_anon(folio))
>>>>>               return;
>>>>>       for (;;) {
>>>>> -            remove_migration_ptes(folio, folio, true);
>>>>> +            remove_migration_ptes(folio, folio, RMP_LOCKED |
>>>>> flags);
>>>>>               i += folio_nr_pages(folio);
>>>>>               if (i >= nr)
>>>>>                       break;
>>>>> @@ -3240,7 +3240,7 @@ static void __split_huge_page(struct page
>>>>> *page, struct list_head *list,
>>>>>
>>>>>       if (nr_dropped)
>>>>>               shmem_uncharge(folio->mapping->host, nr_dropped);
>>>>> -    remap_page(folio, nr);
>>>>> +    remap_page(folio, nr, PageAnon(head) ?
>>>>> RMP_USE_SHARED_ZEROPAGE : 0);
>>>>>
>>>>>       /*
>>>>>        * set page to its compound_head when split to non order-0
>>>>> pages, so
>>>>> @@ -3542,7 +3542,7 @@ int split_huge_page_to_list_to_order(struct
>>>>> page *page, struct list_head *list,
>>>>>               if (mapping)
>>>>>                       xas_unlock(&xas);
>>>>>               local_irq_enable();
>>>>> -            remap_page(folio, folio_nr_pages(folio));
>>>>> +            remap_page(folio, folio_nr_pages(folio), 0);
>>>>>               ret = -EAGAIN;
>>>>>       }
>>>>>
>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>> index 6f9c62c746be..d039863e014b 100644
>>>>> --- a/mm/migrate.c
>>>>> +++ b/mm/migrate.c
>>>>> @@ -204,13 +204,57 @@ bool isolate_folio_to_list(struct folio *folio,
>>>>> struct list_head *list)
>>>>>       return true;
>>>>>    }
>>>>>
>>>>> +static bool try_to_map_unused_to_zeropage(struct
>>>>> page_vma_mapped_walk *pvmw,
>>>>> +                                      struct folio *folio,
>>>>> +                                      unsigned long idx)
>>>>> +{
>>>>> +    struct page *page = folio_page(folio, idx);
>>>>> +    bool contains_data;
>>>>> +    pte_t newpte;
>>>>> +    void *addr;
>>>>> +
>>>>> +    VM_BUG_ON_PAGE(PageCompound(page), page);
>>>>> +    VM_BUG_ON_PAGE(!PageAnon(page), page);
>>>>> +    VM_BUG_ON_PAGE(!PageLocked(page), page);
>>>>> +    VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
>>>>> +
>>>>> +    if (folio_test_mlocked(folio) || (pvmw->vma->vm_flags &
>>>>> VM_LOCKED) ||
>>>>> +        mm_forbids_zeropage(pvmw->vma->vm_mm))
>>>>> +            return false;
>>>>> +
>>>>> +    /*
>>>>> +     * The pmd entry mapping the old thp was flushed and the pte
>>>>> mapping
>>>>> +     * this subpage has been non present. If the subpage is only
>>>>> zero-filled
>>>>> +     * then map it to the shared zeropage.
>>>>> +     */
>>>>> +    addr = kmap_local_page(page);
>>>>> +    contains_data = memchr_inv(addr, 0, PAGE_SIZE);
>>>>> +    kunmap_local(addr);
>>>>> +
>>>>> +    if (contains_data)
>>>>> +            return false;
>>>>> +
>>>>> +    newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
>>>>> +                                    pvmw->vma->vm_page_prot));
>>>>> +    set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte,
>>>>> newpte);
>>>>> +
>>>>> +    dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
>>>>> +    return true;
>>>>> +}
>>>>> +
>>>>> +struct rmap_walk_arg {
>>>>> +    struct folio *folio;
>>>>> +    bool map_unused_to_zeropage;
>>>>> +};
>>>>> +
>>>>>    /*
>>>>>     * Restore a potential migration pte to a working pte entry
>>>>>     */
>>>>>    static bool remove_migration_pte(struct folio *folio,
>>>>> -            struct vm_area_struct *vma, unsigned long addr, void
>>>>> *old)
>>>>> +            struct vm_area_struct *vma, unsigned long addr, void
>>>>> *arg)
>>>>>    {
>>>>> -    DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC |
>>>>> PVMW_MIGRATION);
>>>>> +    struct rmap_walk_arg *rmap_walk_arg = arg;
>>>>> +    DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr,
>>>>> PVMW_SYNC | PVMW_MIGRATION);
>>>>>
>>>>>       while (page_vma_mapped_walk(&pvmw)) {
>>>>>               rmap_t rmap_flags = RMAP_NONE;
>>>>> @@ -234,6 +278,9 @@ static bool remove_migration_pte(struct folio
>>>>> *folio,
>>>>>                       continue;
>>>>>               }
>>>>>    #endif
>>>>> +            if (rmap_walk_arg->map_unused_to_zeropage &&
>>>>> +                try_to_map_unused_to_zeropage(&pvmw, folio,
>>>>> idx))
>>>>> +                    continue;
>>>>>
>>>>>               folio_get(folio);
>>>>>               pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
>>>>> @@ -312,14 +359,21 @@ static bool remove_migration_pte(struct folio
>>>>> *folio,
>>>>>     * Get rid of all migration entries and replace them by
>>>>>     * references to the indicated page.
>>>>>     */
>>>>> -void remove_migration_ptes(struct folio *src, struct folio *dst,
>>>>> bool locked)
>>>>> +void remove_migration_ptes(struct folio *src, struct folio *dst, int
>>>>> flags)
>>>>>    {
>>>>> +    struct rmap_walk_arg rmap_walk_arg = {
>>>>> +            .folio = src,
>>>>> +            .map_unused_to_zeropage = flags &
>>>>> RMP_USE_SHARED_ZEROPAGE,
>>>>> +    };
>>>>> +
>>>>>       struct rmap_walk_control rwc = {
>>>>>               .rmap_one = remove_migration_pte,
>>>>> -            .arg = src,
>>>>> +            .arg = &rmap_walk_arg,
>>>>>       };
>>>>>
>>>>> -    if (locked)
>>>>> +    VM_BUG_ON_FOLIO((flags & RMP_USE_SHARED_ZEROPAGE) && (src !=
>>>>> dst), src);
>>>>> +
>>>>> +    if (flags & RMP_LOCKED)
>>>>>               rmap_walk_locked(dst, &rwc);
>>>>>       else
>>>>>               rmap_walk(dst, &rwc);
>>>>> @@ -934,7 +988,7 @@ static int writeout(struct address_space
>>>>> *mapping, struct folio *folio)
>>>>>        * At this point we know that the migration attempt cannot
>>>>>        * be successful.
>>>>>        */
>>>>> -    remove_migration_ptes(folio, folio, false);
>>>>> +    remove_migration_ptes(folio, folio, 0);
>>>>>
>>>>>       rc = mapping->a_ops->writepage(&folio->page, &wbc);
>>>>>
>>>>> @@ -1098,7 +1152,7 @@ static void migrate_folio_undo_src(struct folio
>>>>> *src,
>>>>>                                  struct list_head *ret)
>>>>>    {
>>>>>       if (page_was_mapped)
>>>>> -            remove_migration_ptes(src, src, false);
>>>>> +            remove_migration_ptes(src, src, 0);
>>>>>       /* Drop an anon_vma reference if we took one */
>>>>>       if (anon_vma)
>>>>>               put_anon_vma(anon_vma);
>>>>> @@ -1336,7 +1390,7 @@ static int migrate_folio_move(free_folio_t
>>>>> put_new_folio, unsigned long private,
>>>>>               lru_add_drain();
>>>>>
>>>>>       if (old_page_state & PAGE_WAS_MAPPED)
>>>>> -            remove_migration_ptes(src, dst, false);
>>>>> +            remove_migration_ptes(src, dst, 0);
>>>>>
>>>>>    out_unlock_both:
>>>>>       folio_unlock(dst);
>>>>> @@ -1474,7 +1528,7 @@ static int unmap_and_move_huge_page(new_folio_t
>>>>> get_new_folio,
>>>>>
>>>>>       if (page_was_mapped)
>>>>>               remove_migration_ptes(src,
>>>>> -                    rc == MIGRATEPAGE_SUCCESS ? dst : src,
>>>>> false);
>>>>> +                    rc == MIGRATEPAGE_SUCCESS ? dst : src, 0);
>>>>>
>>>>>    unlock_put_anon:
>>>>>       folio_unlock(dst);
>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>> index 8d687de88a03..9cf26592ac93 100644
>>>>> --- a/mm/migrate_device.c
>>>>> +++ b/mm/migrate_device.c
>>>>> @@ -424,7 +424,7 @@ static unsigned long
>>>>> migrate_device_unmap(unsigned long *src_pfns,
>>>>>                       continue;
>>>>>
>>>>>               folio = page_folio(page);
>>>>> -            remove_migration_ptes(folio, folio, false);
>>>>> +            remove_migration_ptes(folio, folio, 0);
>>>>>
>>>>>               src_pfns[i] = 0;
>>>>>               folio_unlock(folio);
>>>>> @@ -840,7 +840,7 @@ void migrate_device_finalize(unsigned long
>>>>> *src_pfns,
>>>>>                       dst = src;
>>>>>               }
>>>>>
>>>>> -            remove_migration_ptes(src, dst, false);
>>>>> +            remove_migration_ptes(src, dst, 0);
>>>>>               folio_unlock(src);
>>>>>
>>>>>               if (folio_is_zone_device(src))
>>>>
>>>> Hi,
>>>>
>>>> This patch has been in the mainline for some time, but we recently
>>>> discovered an issue when both mTHP and MTE (Memory Tagging Extension)
>>>> are enabled.
>>>>
>>>> It seems that remapping to the same zeropage might causes MTE tag
>>>> mismatches, since MTE tags are associated with physical addresses.
>>>
>>> Does this only trigger when the VMA has mte enabled? Maybe we'll have to
>>> bail out if we detect that mte is enabled.
>>
>> It seems RISC-V also has a similar feature (RISCV_ISA_SUPM) that uses
>> the same prctl(PR_{GET,SET}_TAGGED_ADDR_CTRL) API.
>>
>> config RISCV_ISA_SUPM
>>          bool "Supm extension for userspace pointer masking"
>>          depends on 64BIT
>>          default y
>>          help
>>            Add support for pointer masking in userspace (Supm) when the
>>            underlying hardware extension (Smnpm or Ssnpm) is detected 
>> at boot.
>>
>>            If this option is disabled, userspace will be unable to use
>>            the prctl(PR_{SET,GET}_TAGGED_ADDR_CTRL) API.
>>
>> I wonder if we should disable the THP shrinker for such architectures 
>> that
> 
> I think where possible we really only want to identify problematic 
> (tagged) pages and skip them. And we should either look into fixing KSM 
> as well or finding out why KSM is not affected.

Yeah. Seems like we could introduce a new helper, 
folio_test_mte_tagged(struct
folio *folio). By default, it would return false, and architectures like 
arm64
can override it.

Looking at the code, the PG_mte_tagged flag is not set for regular THP. 
The MTE
status actually comes from the VM_MTE flag in the VMA that maps it.

static inline bool folio_test_hugetlb_mte_tagged(struct folio *folio)
{
	bool ret = test_bit(PG_mte_tagged, &folio->flags.f);

	VM_WARN_ON_ONCE(!folio_test_hugetlb(folio));

	/*
	 * If the folio is tagged, ensure ordering with a likely subsequent
	 * read of the tags.
	 */
	if (ret)
		smp_rmb();
	return ret;
}

static inline bool page_mte_tagged(struct page *page)
{
	bool ret = test_bit(PG_mte_tagged, &page->flags.f);

	VM_WARN_ON_ONCE(folio_test_hugetlb(page_folio(page)));

	/*
	 * If the page is tagged, ensure ordering with a likely subsequent
	 * read of the tags.
	 */
	if (ret)
		smp_rmb();
	return ret;
}

contpte_set_ptes()
	__set_ptes()
		__set_ptes_anysz()
			__sync_cache_and_tags()
				mte_sync_tags()
					set_page_mte_tagged()

Then, having the THP shrinker skip any folios that are identified as
MTE-tagged.
Cheers,
Lance




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2025-09-19  5:16           ` Lance Yang
@ 2025-09-19  7:55             ` David Hildenbrand
  2025-09-19  8:14               ` Lance Yang
  0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2025-09-19  7:55 UTC (permalink / raw)
  To: Lance Yang
  Cc: Qun-wei Lin (林群崴), catalin.marinas@arm.com,
	usamaarif642@gmail.com, linux-mm@kvack.org, yuzhao@google.com,
	akpm@linux-foundation.org, corbet@lwn.net,
	Andrew Yang (楊智強), npache@redhat.com,
	rppt@kernel.org, willy@infradead.org, kernel-team@meta.com,
	roman.gushchin@linux.dev, hannes@cmpxchg.org,
	cerasuolodomenico@gmail.com, linux-kernel@vger.kernel.org,
	ryncsn@gmail.com, surenb@google.com, riel@surriel.com,
	shakeel.butt@linux.dev, Chinwen Chang (張錦文),
	linux-doc@vger.kernel.org, Casper Li (李中榮),
	ryan.roberts@arm.com, linux-mediatek@lists.infradead.org,
	baohua@kernel.org, kaleshsingh@google.com, zhais@google.com,
	linux-arm-kernel@lists.infradead.org

>> I think where possible we really only want to identify problematic
>> (tagged) pages and skip them. And we should either look into fixing KSM
>> as well or finding out why KSM is not affected.
> 
> Yeah. Seems like we could introduce a new helper,
> folio_test_mte_tagged(struct
> folio *folio). By default, it would return false, and architectures like
> arm64
> can override it.

If we add a new helper it should instead express the semantics that we cannot deduplicate.

For THP, I recall that only some pages might be tagged. So likely we want to check per page.

> 
> Looking at the code, the PG_mte_tagged flag is not set for regular THP.

I think it's supported for THP per page. Only for hugetlb we tag the whole thing through the head page instead of individual pages.

> The MTE
> status actually comes from the VM_MTE flag in the VMA that maps it.
> 

During the rmap walk we could check the VMA flag, but there would be no way to just stop the THP shrinker scanning this page early.

> static inline bool folio_test_hugetlb_mte_tagged(struct folio *folio)
> {
> 	bool ret = test_bit(PG_mte_tagged, &folio->flags.f);
> 
> 	VM_WARN_ON_ONCE(!folio_test_hugetlb(folio));
> 
> 	/*
> 	 * If the folio is tagged, ensure ordering with a likely subsequent
> 	 * read of the tags.
> 	 */
> 	if (ret)
> 		smp_rmb();
> 	return ret;
> }
> 
> static inline bool page_mte_tagged(struct page *page)
> {
> 	bool ret = test_bit(PG_mte_tagged, &page->flags.f);
> 
> 	VM_WARN_ON_ONCE(folio_test_hugetlb(page_folio(page)));
> 
> 	/*
> 	 * If the page is tagged, ensure ordering with a likely subsequent
> 	 * read of the tags.
> 	 */
> 	if (ret)
> 		smp_rmb();
> 	return ret;
> }
> 
> contpte_set_ptes()
> 	__set_ptes()
> 		__set_ptes_anysz()
> 			__sync_cache_and_tags()
> 				mte_sync_tags()
> 					set_page_mte_tagged()
> 
> Then, having the THP shrinker skip any folios that are identified as
> MTE-tagged.

Likely we should just do something like (maybe we want better naming)

#ifndef page_is_mergable
#define page_is_mergable(page) (true)
#endif

And for arm64 have it be

#define page_is_mergable(page) (!page_mte_tagged(page))


And then do

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1f0813b956436..1cac9093918d6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4251,7 +4251,8 @@ static bool thp_underused(struct folio *folio)
  
         for (i = 0; i < folio_nr_pages(folio); i++) {
                 kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
-               if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
+               if (page_is_mergable(folio_page(folio, i)) &&
+                   !memchr_inv(kaddr, 0, PAGE_SIZE)) {
                         num_zero_pages++;
                         if (num_zero_pages > khugepaged_max_ptes_none) {
                                 kunmap_local(kaddr);
diff --git a/mm/migrate.c b/mm/migrate.c
index 946253c398072..476a9a9091bd3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -306,6 +306,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
  
         if (PageCompound(page))
                 return false;
+       if (!page_is_mergable(page))
+               return false;
         VM_BUG_ON_PAGE(!PageAnon(page), page);
         VM_BUG_ON_PAGE(!PageLocked(page), page);
         VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);


For KSM, similarly just bail out early. But still wondering if this is already checked
somehow for KSM.

-- 
Cheers

David / dhildenb



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2025-09-19  7:55             ` David Hildenbrand
@ 2025-09-19  8:14               ` Lance Yang
  2025-09-19 10:53                 ` Lance Yang
  0 siblings, 1 reply; 33+ messages in thread
From: Lance Yang @ 2025-09-19  8:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qun-wei Lin (林群崴), catalin.marinas@arm.com,
	usamaarif642@gmail.com, linux-mm@kvack.org, yuzhao@google.com,
	akpm@linux-foundation.org, corbet@lwn.net,
	Andrew Yang (楊智強), npache@redhat.com,
	rppt@kernel.org, willy@infradead.org, kernel-team@meta.com,
	roman.gushchin@linux.dev, hannes@cmpxchg.org,
	cerasuolodomenico@gmail.com, linux-kernel@vger.kernel.org,
	ryncsn@gmail.com, surenb@google.com, riel@surriel.com,
	shakeel.butt@linux.dev, Chinwen Chang (張錦文),
	linux-doc@vger.kernel.org, Casper Li (李中榮),
	ryan.roberts@arm.com, linux-mediatek@lists.infradead.org,
	baohua@kernel.org, kaleshsingh@google.com, zhais@google.com,
	linux-arm-kernel@lists.infradead.org



On 2025/9/19 15:55, David Hildenbrand wrote:
>>> I think where possible we really only want to identify problematic
>>> (tagged) pages and skip them. And we should either look into fixing KSM
>>> as well or finding out why KSM is not affected.
>>
>> Yeah. Seems like we could introduce a new helper,
>> folio_test_mte_tagged(struct
>> folio *folio). By default, it would return false, and architectures like
>> arm64
>> can override it.
> 
> If we add a new helper it should instead express the semantics that we 
> cannot deduplicate.

Agreed.

> 
> For THP, I recall that only some pages might be tagged. So likely we 
> want to check per page.

Yes, a per-page check would be simpler.

> 
>>
>> Looking at the code, the PG_mte_tagged flag is not set for regular THP.
> 
> I think it's supported for THP per page. Only for hugetlb we tag the 
> whole thing through the head page instead of individual pages.

Right. That's exactly what I meant.

> 
>> The MTE
>> status actually comes from the VM_MTE flag in the VMA that maps it.
>>
> 
> During the rmap walk we could check the VMA flag, but there would be no 
> way to just stop the THP shrinker scanning this page early.
> 
>> static inline bool folio_test_hugetlb_mte_tagged(struct folio *folio)
>> {
>>     bool ret = test_bit(PG_mte_tagged, &folio->flags.f);
>>
>>     VM_WARN_ON_ONCE(!folio_test_hugetlb(folio));
>>
>>     /*
>>      * If the folio is tagged, ensure ordering with a likely subsequent
>>      * read of the tags.
>>      */
>>     if (ret)
>>         smp_rmb();
>>     return ret;
>> }
>>
>> static inline bool page_mte_tagged(struct page *page)
>> {
>>     bool ret = test_bit(PG_mte_tagged, &page->flags.f);
>>
>>     VM_WARN_ON_ONCE(folio_test_hugetlb(page_folio(page)));
>>
>>     /*
>>      * If the page is tagged, ensure ordering with a likely subsequent
>>      * read of the tags.
>>      */
>>     if (ret)
>>         smp_rmb();
>>     return ret;
>> }
>>
>> contpte_set_ptes()
>>     __set_ptes()
>>         __set_ptes_anysz()
>>             __sync_cache_and_tags()
>>                 mte_sync_tags()
>>                     set_page_mte_tagged()
>>
>> Then, having the THP shrinker skip any folios that are identified as
>> MTE-tagged.
> 
> Likely we should just do something like (maybe we want better naming)
> 
> #ifndef page_is_mergable
> #define page_is_mergable(page) (true)
> #endif


Maybe something like page_is_optimizable()? Just a thought ;p

> 
> And for arm64 have it be
> 
> #define page_is_mergable(page) (!page_mte_tagged(page))
> 
> 
> And then do
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1f0813b956436..1cac9093918d6 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4251,7 +4251,8 @@ static bool thp_underused(struct folio *folio)
> 
>          for (i = 0; i < folio_nr_pages(folio); i++) {
>                  kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
> -               if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
> +               if (page_is_mergable(folio_page(folio, i)) &&
> +                   !memchr_inv(kaddr, 0, PAGE_SIZE)) {
>                          num_zero_pages++;
>                          if (num_zero_pages > khugepaged_max_ptes_none) {
>                                  kunmap_local(kaddr);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 946253c398072..476a9a9091bd3 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -306,6 +306,8 @@ static bool try_to_map_unused_to_zeropage(struct 
> page_vma_mapped_walk *pvmw,
> 
>          if (PageCompound(page))
>                  return false;
> +       if (!page_is_mergable(page))
> +               return false;
>          VM_BUG_ON_PAGE(!PageAnon(page), page);
>          VM_BUG_ON_PAGE(!PageLocked(page), page);
>          VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);

Looks good to me!

> 
> 
> For KSM, similarly just bail out early. But still wondering if this is 
> already checked
> somehow for KSM.

+1 I'm looking for a machine to test it on.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2025-09-19  8:14               ` Lance Yang
@ 2025-09-19 10:53                 ` Lance Yang
  2025-09-19 12:19                   ` Lance Yang
  0 siblings, 1 reply; 33+ messages in thread
From: Lance Yang @ 2025-09-19 10:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qun-wei Lin (林群崴), catalin.marinas@arm.com,
	usamaarif642@gmail.com, linux-mm@kvack.org, yuzhao@google.com,
	akpm@linux-foundation.org, corbet@lwn.net,
	Andrew Yang (楊智強), npache@redhat.com,
	rppt@kernel.org, willy@infradead.org, kernel-team@meta.com,
	roman.gushchin@linux.dev, hannes@cmpxchg.org,
	cerasuolodomenico@gmail.com, linux-kernel@vger.kernel.org,
	ryncsn@gmail.com, surenb@google.com, riel@surriel.com,
	shakeel.butt@linux.dev, Chinwen Chang (張錦文),
	linux-doc@vger.kernel.org, Casper Li (李中榮),
	ryan.roberts@arm.com, linux-mediatek@lists.infradead.org,
	baohua@kernel.org, kaleshsingh@google.com, zhais@google.com,
	linux-arm-kernel@lists.infradead.org



On 2025/9/19 16:14, Lance Yang wrote:
> 
> 
> On 2025/9/19 15:55, David Hildenbrand wrote:
>>>> I think where possible we really only want to identify problematic
>>>> (tagged) pages and skip them. And we should either look into fixing KSM
>>>> as well or finding out why KSM is not affected.
>>>
>>> Yeah. Seems like we could introduce a new helper,
>>> folio_test_mte_tagged(struct
>>> folio *folio). By default, it would return false, and architectures like
>>> arm64
>>> can override it.
>>
>> If we add a new helper it should instead express the semantics that we 
>> cannot deduplicate.
> 
> Agreed.
> 
>>
>> For THP, I recall that only some pages might be tagged. So likely we 
>> want to check per page.
> 
> Yes, a per-page check would be simpler.
> 
>>
>>>
>>> Looking at the code, the PG_mte_tagged flag is not set for regular THP.
>>
>> I think it's supported for THP per page. Only for hugetlb we tag the 
>> whole thing through the head page instead of individual pages.
> 
> Right. That's exactly what I meant.
> 
>>
>>> The MTE
>>> status actually comes from the VM_MTE flag in the VMA that maps it.
>>>
>>
>> During the rmap walk we could check the VMA flag, but there would be 
>> no way to just stop the THP shrinker scanning this page early.
>>
>>> static inline bool folio_test_hugetlb_mte_tagged(struct folio *folio)
>>> {
>>>     bool ret = test_bit(PG_mte_tagged, &folio->flags.f);
>>>
>>>     VM_WARN_ON_ONCE(!folio_test_hugetlb(folio));
>>>
>>>     /*
>>>      * If the folio is tagged, ensure ordering with a likely subsequent
>>>      * read of the tags.
>>>      */
>>>     if (ret)
>>>         smp_rmb();
>>>     return ret;
>>> }
>>>
>>> static inline bool page_mte_tagged(struct page *page)
>>> {
>>>     bool ret = test_bit(PG_mte_tagged, &page->flags.f);
>>>
>>>     VM_WARN_ON_ONCE(folio_test_hugetlb(page_folio(page)));
>>>
>>>     /*
>>>      * If the page is tagged, ensure ordering with a likely subsequent
>>>      * read of the tags.
>>>      */
>>>     if (ret)
>>>         smp_rmb();
>>>     return ret;
>>> }
>>>
>>> contpte_set_ptes()
>>>     __set_ptes()
>>>         __set_ptes_anysz()
>>>             __sync_cache_and_tags()
>>>                 mte_sync_tags()
>>>                     set_page_mte_tagged()
>>>
>>> Then, having the THP shrinker skip any folios that are identified as
>>> MTE-tagged.
>>
>> Likely we should just do something like (maybe we want better naming)
>>
>> #ifndef page_is_mergable
>> #define page_is_mergable(page) (true)
>> #endif
> 
> 
> Maybe something like page_is_optimizable()? Just a thought ;p
> 
>>
>> And for arm64 have it be
>>
>> #define page_is_mergable(page) (!page_mte_tagged(page))
>>
>>
>> And then do
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 1f0813b956436..1cac9093918d6 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -4251,7 +4251,8 @@ static bool thp_underused(struct folio *folio)
>>
>>          for (i = 0; i < folio_nr_pages(folio); i++) {
>>                  kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
>> -               if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
>> +               if (page_is_mergable(folio_page(folio, i)) &&
>> +                   !memchr_inv(kaddr, 0, PAGE_SIZE)) {
>>                          num_zero_pages++;
>>                          if (num_zero_pages > khugepaged_max_ptes_none) {
>>                                  kunmap_local(kaddr);
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 946253c398072..476a9a9091bd3 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -306,6 +306,8 @@ static bool try_to_map_unused_to_zeropage(struct 
>> page_vma_mapped_walk *pvmw,
>>
>>          if (PageCompound(page))
>>                  return false;
>> +       if (!page_is_mergable(page))
>> +               return false;
>>          VM_BUG_ON_PAGE(!PageAnon(page), page);
>>          VM_BUG_ON_PAGE(!PageLocked(page), page);
>>          VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
> 
> Looks good to me!
> 
>>
>>
>> For KSM, similarly just bail out early. But still wondering if this is 
>> already checked
>> somehow for KSM.
> 
> +1 I'm looking for a machine to test it on.

Interestingly, it seems KSM is already skipping MTE-tagged pages. My test,
running on a v6.8.0 kernel inside QEMU (with MTE enabled), shows no merging
activity for those pages ...



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2025-09-19 10:53                 ` Lance Yang
@ 2025-09-19 12:19                   ` Lance Yang
  2025-09-19 12:44                     ` Lance Yang
  2025-09-19 13:09                     ` David Hildenbrand
  0 siblings, 2 replies; 33+ messages in thread
From: Lance Yang @ 2025-09-19 12:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qun-wei Lin (林群崴), catalin.marinas@arm.com,
	usamaarif642@gmail.com, linux-mm@kvack.org, yuzhao@google.com,
	akpm@linux-foundation.org, corbet@lwn.net,
	Andrew Yang (楊智強), npache@redhat.com,
	rppt@kernel.org, willy@infradead.org, kernel-team@meta.com,
	roman.gushchin@linux.dev, hannes@cmpxchg.org,
	cerasuolodomenico@gmail.com, linux-kernel@vger.kernel.org,
	ryncsn@gmail.com, surenb@google.com, riel@surriel.com,
	shakeel.butt@linux.dev, Chinwen Chang (張錦文),
	linux-doc@vger.kernel.org, Casper Li (李中榮),
	ryan.roberts@arm.com, linux-mediatek@lists.infradead.org,
	baohua@kernel.org, kaleshsingh@google.com, zhais@google.com,
	linux-arm-kernel@lists.infradead.org

Hey David,

I believe I've found the exact reason why KSM skips MTE-tagged pages ;p

> 
> 
> On 2025/9/19 16:14, Lance Yang wrote:
>>
>>
>> On 2025/9/19 15:55, David Hildenbrand wrote:
>>>>> I think where possible we really only want to identify problematic
>>>>> (tagged) pages and skip them. And we should either look into fixing 
>>>>> KSM
>>>>> as well or finding out why KSM is not affected.
>>>>
>>>> Yeah. Seems like we could introduce a new helper,
>>>> folio_test_mte_tagged(struct
>>>> folio *folio). By default, it would return false, and architectures 
>>>> like
>>>> arm64
>>>> can override it.
>>>
>>> If we add a new helper it should instead express the semantics that 
>>> we cannot deduplicate.
>>
>> Agreed.
>>
>>>
>>> For THP, I recall that only some pages might be tagged. So likely we 
>>> want to check per page.
>>
>> Yes, a per-page check would be simpler.
>>
>>>
>>>>
>>>> Looking at the code, the PG_mte_tagged flag is not set for regular THP.
>>>
>>> I think it's supported for THP per page. Only for hugetlb we tag the 
>>> whole thing through the head page instead of individual pages.
>>
>> Right. That's exactly what I meant.
>>
>>>
>>>> The MTE
>>>> status actually comes from the VM_MTE flag in the VMA that maps it.
>>>>
>>>
>>> During the rmap walk we could check the VMA flag, but there would be 
>>> no way to just stop the THP shrinker scanning this page early.
>>>
>>>> static inline bool folio_test_hugetlb_mte_tagged(struct folio *folio)
>>>> {
>>>>     bool ret = test_bit(PG_mte_tagged, &folio->flags.f);
>>>>
>>>>     VM_WARN_ON_ONCE(!folio_test_hugetlb(folio));
>>>>
>>>>     /*
>>>>      * If the folio is tagged, ensure ordering with a likely subsequent
>>>>      * read of the tags.
>>>>      */
>>>>     if (ret)
>>>>         smp_rmb();
>>>>     return ret;
>>>> }
>>>>
>>>> static inline bool page_mte_tagged(struct page *page)
>>>> {
>>>>     bool ret = test_bit(PG_mte_tagged, &page->flags.f);
>>>>
>>>>     VM_WARN_ON_ONCE(folio_test_hugetlb(page_folio(page)));
>>>>
>>>>     /*
>>>>      * If the page is tagged, ensure ordering with a likely subsequent
>>>>      * read of the tags.
>>>>      */
>>>>     if (ret)
>>>>         smp_rmb();
>>>>     return ret;
>>>> }
>>>>
>>>> contpte_set_ptes()
>>>>     __set_ptes()
>>>>         __set_ptes_anysz()
>>>>             __sync_cache_and_tags()
>>>>                 mte_sync_tags()
>>>>                     set_page_mte_tagged()
>>>>
>>>> Then, having the THP shrinker skip any folios that are identified as
>>>> MTE-tagged.
>>>
>>> Likely we should just do something like (maybe we want better naming)
>>>
>>> #ifndef page_is_mergable
>>> #define page_is_mergable(page) (true)
>>> #endif
>>
>>
>> Maybe something like page_is_optimizable()? Just a thought ;p
>>
>>>
>>> And for arm64 have it be
>>>
>>> #define page_is_mergable(page) (!page_mte_tagged(page))
>>>
>>>
>>> And then do
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 1f0813b956436..1cac9093918d6 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -4251,7 +4251,8 @@ static bool thp_underused(struct folio *folio)
>>>
>>>          for (i = 0; i < folio_nr_pages(folio); i++) {
>>>                  kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
>>> -               if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
>>> +               if (page_is_mergable(folio_page(folio, i)) &&
>>> +                   !memchr_inv(kaddr, 0, PAGE_SIZE)) {
>>>                          num_zero_pages++;
>>>                          if (num_zero_pages > 
>>> khugepaged_max_ptes_none) {
>>>                                  kunmap_local(kaddr);
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 946253c398072..476a9a9091bd3 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -306,6 +306,8 @@ static bool try_to_map_unused_to_zeropage(struct 
>>> page_vma_mapped_walk *pvmw,
>>>
>>>          if (PageCompound(page))
>>>                  return false;
>>> +       if (!page_is_mergable(page))
>>> +               return false;
>>>          VM_BUG_ON_PAGE(!PageAnon(page), page);
>>>          VM_BUG_ON_PAGE(!PageLocked(page), page);
>>>          VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
>>
>> Looks good to me!
>>
>>>
>>>
>>> For KSM, similarly just bail out early. But still wondering if this 
>>> is already checked
>>> somehow for KSM.
>>
>> +1 I'm looking for a machine to test it on.
> 
> Interestingly, it seems KSM is already skipping MTE-tagged pages. My test,
> running on a v6.8.0 kernel inside QEMU (with MTE enabled), shows no merging
> activity for those pages ...

KSM's call to pages_identical() ultimately leads to memcmp_pages(). The
arm64 implementation of memcmp_pages() in arch/arm64/kernel/mte.c contains
a specific check that prevents merging in this case.

try_to_merge_one_page()
	-> pages_identical()
		-> !memcmp_pages() Fails!
		-> replace_page()


int memcmp_pages(struct page *page1, struct page *page2)
{
	char *addr1, *addr2;
	int ret;

	addr1 = page_address(page1);
	addr2 = page_address(page2);
	ret = memcmp(addr1, addr2, PAGE_SIZE);

	if (!system_supports_mte() || ret)
		return ret;

	/*
	 * If the page content is identical but at least one of the pages is
	 * tagged, return non-zero to avoid KSM merging. If only one of the
	 * pages is tagged, __set_ptes() may zero or change the tags of the
	 * other page via mte_sync_tags().
	 */
	if (page_mte_tagged(page1) || page_mte_tagged(page2))
		return addr1 != addr2;

	return ret;
}

IIUC, if either page is MTE-tagged, memcmp_pages() intentionally returns
a non-zero value, which in turn causes pages_identical() to return false.

Cheers,
Lance


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2025-09-19 12:19                   ` Lance Yang
@ 2025-09-19 12:44                     ` Lance Yang
  2025-09-19 13:09                     ` David Hildenbrand
  1 sibling, 0 replies; 33+ messages in thread
From: Lance Yang @ 2025-09-19 12:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qun-wei Lin (林群崴), catalin.marinas@arm.com,
	usamaarif642@gmail.com, linux-mm@kvack.org, yuzhao@google.com,
	akpm@linux-foundation.org, corbet@lwn.net,
	Andrew Yang (楊智強), npache@redhat.com,
	rppt@kernel.org, willy@infradead.org, kernel-team@meta.com,
	roman.gushchin@linux.dev, hannes@cmpxchg.org,
	cerasuolodomenico@gmail.com, linux-kernel@vger.kernel.org,
	ryncsn@gmail.com, surenb@google.com, riel@surriel.com,
	shakeel.butt@linux.dev, Chinwen Chang (張錦文),
	linux-doc@vger.kernel.org, Casper Li (李中榮),
	ryan.roberts@arm.com, linux-mediatek@lists.infradead.org,
	baohua@kernel.org, kaleshsingh@google.com, zhais@google.com,
	linux-arm-kernel@lists.infradead.org



On 2025/9/19 20:19, Lance Yang wrote:
> Hey David,
> 
> I believe I've found the exact reason why KSM skips MTE-tagged pages ;p
> 
>>
>>
>> On 2025/9/19 16:14, Lance Yang wrote:
>>>
>>>
>>> On 2025/9/19 15:55, David Hildenbrand wrote:
>>>>>> I think where possible we really only want to identify problematic
>>>>>> (tagged) pages and skip them. And we should either look into 
>>>>>> fixing KSM
>>>>>> as well or finding out why KSM is not affected.
>>>>>
>>>>> Yeah. Seems like we could introduce a new helper,
>>>>> folio_test_mte_tagged(struct
>>>>> folio *folio). By default, it would return false, and architectures 
>>>>> like
>>>>> arm64
>>>>> can override it.
>>>>
>>>> If we add a new helper it should instead express the semantics that 
>>>> we cannot deduplicate.
>>>
>>> Agreed.
>>>
>>>>
>>>> For THP, I recall that only some pages might be tagged. So likely we 
>>>> want to check per page.
>>>
>>> Yes, a per-page check would be simpler.
>>>
>>>>
>>>>>
>>>>> Looking at the code, the PG_mte_tagged flag is not set for regular 
>>>>> THP.
>>>>
>>>> I think it's supported for THP per page. Only for hugetlb we tag the 
>>>> whole thing through the head page instead of individual pages.
>>>
>>> Right. That's exactly what I meant.
>>>
>>>>
>>>>> The MTE
>>>>> status actually comes from the VM_MTE flag in the VMA that maps it.
>>>>>
>>>>
>>>> During the rmap walk we could check the VMA flag, but there would be 
>>>> no way to just stop the THP shrinker scanning this page early.
>>>>
>>>>> static inline bool folio_test_hugetlb_mte_tagged(struct folio *folio)
>>>>> {
>>>>>     bool ret = test_bit(PG_mte_tagged, &folio->flags.f);
>>>>>
>>>>>     VM_WARN_ON_ONCE(!folio_test_hugetlb(folio));
>>>>>
>>>>>     /*
>>>>>      * If the folio is tagged, ensure ordering with a likely 
>>>>> subsequent
>>>>>      * read of the tags.
>>>>>      */
>>>>>     if (ret)
>>>>>         smp_rmb();
>>>>>     return ret;
>>>>> }
>>>>>
>>>>> static inline bool page_mte_tagged(struct page *page)
>>>>> {
>>>>>     bool ret = test_bit(PG_mte_tagged, &page->flags.f);
>>>>>
>>>>>     VM_WARN_ON_ONCE(folio_test_hugetlb(page_folio(page)));
>>>>>
>>>>>     /*
>>>>>      * If the page is tagged, ensure ordering with a likely subsequent
>>>>>      * read of the tags.
>>>>>      */
>>>>>     if (ret)
>>>>>         smp_rmb();
>>>>>     return ret;
>>>>> }
>>>>>
>>>>> contpte_set_ptes()
>>>>>     __set_ptes()
>>>>>         __set_ptes_anysz()
>>>>>             __sync_cache_and_tags()
>>>>>                 mte_sync_tags()
>>>>>                     set_page_mte_tagged()
>>>>>
>>>>> Then, having the THP shrinker skip any folios that are identified as
>>>>> MTE-tagged.
>>>>
>>>> Likely we should just do something like (maybe we want better naming)
>>>>
>>>> #ifndef page_is_mergable
>>>> #define page_is_mergable(page) (true)
>>>> #endif
>>>
>>>
>>> Maybe something like page_is_optimizable()? Just a thought ;p
>>>
>>>>
>>>> And for arm64 have it be
>>>>
>>>> #define page_is_mergable(page) (!page_mte_tagged(page))
>>>>
>>>>
>>>> And then do
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 1f0813b956436..1cac9093918d6 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -4251,7 +4251,8 @@ static bool thp_underused(struct folio *folio)
>>>>
>>>>          for (i = 0; i < folio_nr_pages(folio); i++) {
>>>>                  kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
>>>> -               if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
>>>> +               if (page_is_mergable(folio_page(folio, i)) &&
>>>> +                   !memchr_inv(kaddr, 0, PAGE_SIZE)) {
>>>>                          num_zero_pages++;
>>>>                          if (num_zero_pages > 
>>>> khugepaged_max_ptes_none) {
>>>>                                  kunmap_local(kaddr);
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index 946253c398072..476a9a9091bd3 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -306,6 +306,8 @@ static bool try_to_map_unused_to_zeropage(struct 
>>>> page_vma_mapped_walk *pvmw,
>>>>
>>>>          if (PageCompound(page))
>>>>                  return false;
>>>> +       if (!page_is_mergable(page))
>>>> +               return false;
>>>>          VM_BUG_ON_PAGE(!PageAnon(page), page);
>>>>          VM_BUG_ON_PAGE(!PageLocked(page), page);
>>>>          VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
>>>
>>> Looks good to me!
>>>
>>>>
>>>>
>>>> For KSM, similarly just bail out early. But still wondering if this 
>>>> is already checked
>>>> somehow for KSM.
>>>
>>> +1 I'm looking for a machine to test it on.
>>
>> Interestingly, it seems KSM is already skipping MTE-tagged pages. My 
>> test,
>> running on a v6.8.0 kernel inside QEMU (with MTE enabled), shows no 
>> merging
>> activity for those pages ...
> 
> KSM's call to pages_identical() ultimately leads to memcmp_pages(). The
> arm64 implementation of memcmp_pages() in arch/arm64/kernel/mte.c contains
> a specific check that prevents merging in this case.
> 
> try_to_merge_one_page()
>      -> pages_identical()
>          -> !memcmp_pages() Fails!
>          -> replace_page()

Forgot to add:

memcmp_pages() is also called in other KSM paths, such as
stable_tree_search(), stable_tree_insert(), and
unstable_tree_search_insert(), effectively blocking MTE-tagged
pages from entering either of KSM's trees.

> 
> 
> int memcmp_pages(struct page *page1, struct page *page2)
> {
>      char *addr1, *addr2;
>      int ret;
> 
>      addr1 = page_address(page1);
>      addr2 = page_address(page2);
>      ret = memcmp(addr1, addr2, PAGE_SIZE);
> 
>      if (!system_supports_mte() || ret)
>          return ret;
> 
>      /*
>       * If the page content is identical but at least one of the pages is
>       * tagged, return non-zero to avoid KSM merging. If only one of the
>       * pages is tagged, __set_ptes() may zero or change the tags of the
>       * other page via mte_sync_tags().
>       */
>      if (page_mte_tagged(page1) || page_mte_tagged(page2))
>          return addr1 != addr2;
> 
>      return ret;
> }
> 
> IIUC, if either page is MTE-tagged, memcmp_pages() intentionally returns
> a non-zero value, which in turn causes pages_identical() to return false.
> 
> Cheers,
> Lance



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2025-09-19 12:19                   ` Lance Yang
  2025-09-19 12:44                     ` Lance Yang
@ 2025-09-19 13:09                     ` David Hildenbrand
  2025-09-19 13:24                       ` Lance Yang
  1 sibling, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2025-09-19 13:09 UTC (permalink / raw)
  To: Lance Yang
  Cc: Qun-wei Lin (林群崴), catalin.marinas@arm.com,
	usamaarif642@gmail.com, linux-mm@kvack.org, yuzhao@google.com,
	akpm@linux-foundation.org, corbet@lwn.net,
	Andrew Yang (楊智強), npache@redhat.com,
	rppt@kernel.org, willy@infradead.org, kernel-team@meta.com,
	roman.gushchin@linux.dev, hannes@cmpxchg.org,
	cerasuolodomenico@gmail.com, linux-kernel@vger.kernel.org,
	ryncsn@gmail.com, surenb@google.com, riel@surriel.com,
	shakeel.butt@linux.dev, Chinwen Chang (張錦文),
	linux-doc@vger.kernel.org, Casper Li (李中榮),
	ryan.roberts@arm.com, linux-mediatek@lists.infradead.org,
	baohua@kernel.org, kaleshsingh@google.com, zhais@google.com,
	linux-arm-kernel@lists.infradead.org

On 19.09.25 14:19, Lance Yang wrote:
> Hey David,
> 
> I believe I've found the exact reason why KSM skips MTE-tagged pages ;p
> 
>>
>>
>> On 2025/9/19 16:14, Lance Yang wrote:
>>>
>>>
>>> On 2025/9/19 15:55, David Hildenbrand wrote:
>>>>>> I think where possible we really only want to identify problematic
>>>>>> (tagged) pages and skip them. And we should either look into fixing
>>>>>> KSM
>>>>>> as well or finding out why KSM is not affected.
>>>>>
>>>>> Yeah. Seems like we could introduce a new helper,
>>>>> folio_test_mte_tagged(struct
>>>>> folio *folio). By default, it would return false, and architectures
>>>>> like
>>>>> arm64
>>>>> can override it.
>>>>
>>>> If we add a new helper it should instead express the semantics that
>>>> we cannot deduplicate.
>>>
>>> Agreed.
>>>
>>>>
>>>> For THP, I recall that only some pages might be tagged. So likely we
>>>> want to check per page.
>>>
>>> Yes, a per-page check would be simpler.
>>>
>>>>
>>>>>
>>>>> Looking at the code, the PG_mte_tagged flag is not set for regular THP.
>>>>
>>>> I think it's supported for THP per page. Only for hugetlb we tag the
>>>> whole thing through the head page instead of individual pages.
>>>
>>> Right. That's exactly what I meant.
>>>
>>>>
>>>>> The MTE
>>>>> status actually comes from the VM_MTE flag in the VMA that maps it.
>>>>>
>>>>
>>>> During the rmap walk we could check the VMA flag, but there would be
>>>> no way to just stop the THP shrinker scanning this page early.
>>>>
>>>>> static inline bool folio_test_hugetlb_mte_tagged(struct folio *folio)
>>>>> {
>>>>>      bool ret = test_bit(PG_mte_tagged, &folio->flags.f);
>>>>>
>>>>>      VM_WARN_ON_ONCE(!folio_test_hugetlb(folio));
>>>>>
>>>>>      /*
>>>>>       * If the folio is tagged, ensure ordering with a likely subsequent
>>>>>       * read of the tags.
>>>>>       */
>>>>>      if (ret)
>>>>>          smp_rmb();
>>>>>      return ret;
>>>>> }
>>>>>
>>>>> static inline bool page_mte_tagged(struct page *page)
>>>>> {
>>>>>      bool ret = test_bit(PG_mte_tagged, &page->flags.f);
>>>>>
>>>>>      VM_WARN_ON_ONCE(folio_test_hugetlb(page_folio(page)));
>>>>>
>>>>>      /*
>>>>>       * If the page is tagged, ensure ordering with a likely subsequent
>>>>>       * read of the tags.
>>>>>       */
>>>>>      if (ret)
>>>>>          smp_rmb();
>>>>>      return ret;
>>>>> }
>>>>>
>>>>> contpte_set_ptes()
>>>>>      __set_ptes()
>>>>>          __set_ptes_anysz()
>>>>>              __sync_cache_and_tags()
>>>>>                  mte_sync_tags()
>>>>>                      set_page_mte_tagged()
>>>>>
>>>>> Then, having the THP shrinker skip any folios that are identified as
>>>>> MTE-tagged.
>>>>
>>>> Likely we should just do something like (maybe we want better naming)
>>>>
>>>> #ifndef page_is_mergable
>>>> #define page_is_mergable(page) (true)
>>>> #endif
>>>
>>>
>>> Maybe something like page_is_optimizable()? Just a thought ;p
>>>
>>>>
>>>> And for arm64 have it be
>>>>
>>>> #define page_is_mergable(page) (!page_mte_tagged(page))
>>>>
>>>>
>>>> And then do
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 1f0813b956436..1cac9093918d6 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -4251,7 +4251,8 @@ static bool thp_underused(struct folio *folio)
>>>>
>>>>           for (i = 0; i < folio_nr_pages(folio); i++) {
>>>>                   kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
>>>> -               if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
>>>> +               if (page_is_mergable(folio_page(folio, i)) &&
>>>> +                   !memchr_inv(kaddr, 0, PAGE_SIZE)) {
>>>>                           num_zero_pages++;
>>>>                           if (num_zero_pages >
>>>> khugepaged_max_ptes_none) {
>>>>                                   kunmap_local(kaddr);
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index 946253c398072..476a9a9091bd3 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -306,6 +306,8 @@ static bool try_to_map_unused_to_zeropage(struct
>>>> page_vma_mapped_walk *pvmw,
>>>>
>>>>           if (PageCompound(page))
>>>>                   return false;
>>>> +       if (!page_is_mergable(page))
>>>> +               return false;
>>>>           VM_BUG_ON_PAGE(!PageAnon(page), page);
>>>>           VM_BUG_ON_PAGE(!PageLocked(page), page);
>>>>           VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
>>>
>>> Looks good to me!
>>>
>>>>
>>>>
>>>> For KSM, similarly just bail out early. But still wondering if this
>>>> is already checked
>>>> somehow for KSM.
>>>
>>> +1 I'm looking for a machine to test it on.
>>
>> Interestingly, it seems KSM is already skipping MTE-tagged pages. My test,
>> running on a v6.8.0 kernel inside QEMU (with MTE enabled), shows no merging
>> activity for those pages ...
> 
> KSM's call to pages_identical() ultimately leads to memcmp_pages(). The
> arm64 implementation of memcmp_pages() in arch/arm64/kernel/mte.c contains
> a specific check that prevents merging in this case.
> 
> try_to_merge_one_page()
> 	-> pages_identical()
> 		-> !memcmp_pages() Fails!
> 		-> replace_page()
> 
> 
> int memcmp_pages(struct page *page1, struct page *page2)
> {
> 	char *addr1, *addr2;
> 	int ret;
> 
> 	addr1 = page_address(page1);
> 	addr2 = page_address(page2);
> 	ret = memcmp(addr1, addr2, PAGE_SIZE);
> 
> 	if (!system_supports_mte() || ret)
> 		return ret;
> 
> 	/*
> 	 * If the page content is identical but at least one of the pages is
> 	 * tagged, return non-zero to avoid KSM merging. If only one of the
> 	 * pages is tagged, __set_ptes() may zero or change the tags of the
> 	 * other page via mte_sync_tags().
> 	 */
> 	if (page_mte_tagged(page1) || page_mte_tagged(page2))
> 		return addr1 != addr2;
> 
> 	return ret;
> }
> 
> IIUC, if either page is MTE-tagged, memcmp_pages() intentionally returns
> a non-zero value, which in turn causes pages_identical() to return false.

Cool, so we should likely just use that then in the shrinker code. Can 
you send a fix?

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/6] mm: remap unused subpages to shared zeropage when splitting isolated thp
  2025-09-19 13:09                     ` David Hildenbrand
@ 2025-09-19 13:24                       ` Lance Yang
  0 siblings, 0 replies; 33+ messages in thread
From: Lance Yang @ 2025-09-19 13:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qun-wei Lin (林群崴), catalin.marinas@arm.com,
	usamaarif642@gmail.com, linux-mm@kvack.org, yuzhao@google.com,
	akpm@linux-foundation.org, corbet@lwn.net,
	Andrew Yang (楊智強), npache@redhat.com,
	rppt@kernel.org, willy@infradead.org, kernel-team@meta.com,
	roman.gushchin@linux.dev, hannes@cmpxchg.org,
	cerasuolodomenico@gmail.com, linux-kernel@vger.kernel.org,
	ryncsn@gmail.com, surenb@google.com, riel@surriel.com,
	shakeel.butt@linux.dev, Chinwen Chang (張錦文),
	linux-doc@vger.kernel.org, Casper Li (李中榮),
	ryan.roberts@arm.com, linux-mediatek@lists.infradead.org,
	baohua@kernel.org, kaleshsingh@google.com, zhais@google.com,
	linux-arm-kernel@lists.infradead.org



On 2025/9/19 21:09, David Hildenbrand wrote:
> On 19.09.25 14:19, Lance Yang wrote:
>> Hey David,
>>
>> I believe I've found the exact reason why KSM skips MTE-tagged pages ;p
>>
>>>
>>>
>>> On 2025/9/19 16:14, Lance Yang wrote:
>>>>
>>>>
>>>> On 2025/9/19 15:55, David Hildenbrand wrote:
>>>>>>> I think where possible we really only want to identify problematic
>>>>>>> (tagged) pages and skip them. And we should either look into fixing
>>>>>>> KSM
>>>>>>> as well or finding out why KSM is not affected.
>>>>>>
>>>>>> Yeah. Seems like we could introduce a new helper,
>>>>>> folio_test_mte_tagged(struct
>>>>>> folio *folio). By default, it would return false, and architectures
>>>>>> like
>>>>>> arm64
>>>>>> can override it.
>>>>>
>>>>> If we add a new helper it should instead express the semantics that
>>>>> we cannot deduplicate.
>>>>
>>>> Agreed.
>>>>
>>>>>
>>>>> For THP, I recall that only some pages might be tagged. So likely we
>>>>> want to check per page.
>>>>
>>>> Yes, a per-page check would be simpler.
>>>>
>>>>>
>>>>>>
>>>>>> Looking at the code, the PG_mte_tagged flag is not set for regular 
>>>>>> THP.
>>>>>
>>>>> I think it's supported for THP per page. Only for hugetlb we tag the
>>>>> whole thing through the head page instead of individual pages.
>>>>
>>>> Right. That's exactly what I meant.
>>>>
>>>>>
>>>>>> The MTE
>>>>>> status actually comes from the VM_MTE flag in the VMA that maps it.
>>>>>>
>>>>>
>>>>> During the rmap walk we could check the VMA flag, but there would be
>>>>> no way to just stop the THP shrinker scanning this page early.
>>>>>
>>>>>> static inline bool folio_test_hugetlb_mte_tagged(struct folio *folio)
>>>>>> {
>>>>>>      bool ret = test_bit(PG_mte_tagged, &folio->flags.f);
>>>>>>
>>>>>>      VM_WARN_ON_ONCE(!folio_test_hugetlb(folio));
>>>>>>
>>>>>>      /*
>>>>>>       * If the folio is tagged, ensure ordering with a likely 
>>>>>> subsequent
>>>>>>       * read of the tags.
>>>>>>       */
>>>>>>      if (ret)
>>>>>>          smp_rmb();
>>>>>>      return ret;
>>>>>> }
>>>>>>
>>>>>> static inline bool page_mte_tagged(struct page *page)
>>>>>> {
>>>>>>      bool ret = test_bit(PG_mte_tagged, &page->flags.f);
>>>>>>
>>>>>>      VM_WARN_ON_ONCE(folio_test_hugetlb(page_folio(page)));
>>>>>>
>>>>>>      /*
>>>>>>       * If the page is tagged, ensure ordering with a likely 
>>>>>> subsequent
>>>>>>       * read of the tags.
>>>>>>       */
>>>>>>      if (ret)
>>>>>>          smp_rmb();
>>>>>>      return ret;
>>>>>> }
>>>>>>
>>>>>> contpte_set_ptes()
>>>>>>      __set_ptes()
>>>>>>          __set_ptes_anysz()
>>>>>>              __sync_cache_and_tags()
>>>>>>                  mte_sync_tags()
>>>>>>                      set_page_mte_tagged()
>>>>>>
>>>>>> Then, having the THP shrinker skip any folios that are identified as
>>>>>> MTE-tagged.
>>>>>
>>>>> Likely we should just do something like (maybe we want better naming)
>>>>>
>>>>> #ifndef page_is_mergable
>>>>> #define page_is_mergable(page) (true)
>>>>> #endif
>>>>
>>>>
>>>> Maybe something like page_is_optimizable()? Just a thought ;p
>>>>
>>>>>
>>>>> And for arm64 have it be
>>>>>
>>>>> #define page_is_mergable(page) (!page_mte_tagged(page))
>>>>>
>>>>>
>>>>> And then do
>>>>>
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index 1f0813b956436..1cac9093918d6 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -4251,7 +4251,8 @@ static bool thp_underused(struct folio *folio)
>>>>>
>>>>>           for (i = 0; i < folio_nr_pages(folio); i++) {
>>>>>                   kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
>>>>> -               if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
>>>>> +               if (page_is_mergable(folio_page(folio, i)) &&
>>>>> +                   !memchr_inv(kaddr, 0, PAGE_SIZE)) {
>>>>>                           num_zero_pages++;
>>>>>                           if (num_zero_pages >
>>>>> khugepaged_max_ptes_none) {
>>>>>                                   kunmap_local(kaddr);
>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>> index 946253c398072..476a9a9091bd3 100644
>>>>> --- a/mm/migrate.c
>>>>> +++ b/mm/migrate.c
>>>>> @@ -306,6 +306,8 @@ static bool try_to_map_unused_to_zeropage(struct
>>>>> page_vma_mapped_walk *pvmw,
>>>>>
>>>>>           if (PageCompound(page))
>>>>>                   return false;
>>>>> +       if (!page_is_mergable(page))
>>>>> +               return false;
>>>>>           VM_BUG_ON_PAGE(!PageAnon(page), page);
>>>>>           VM_BUG_ON_PAGE(!PageLocked(page), page);
>>>>>           VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
>>>>
>>>> Looks good to me!
>>>>
>>>>>
>>>>>
>>>>> For KSM, similarly just bail out early. But still wondering if this
>>>>> is already checked
>>>>> somehow for KSM.
>>>>
>>>> +1 I'm looking for a machine to test it on.
>>>
>>> Interestingly, it seems KSM is already skipping MTE-tagged pages. My 
>>> test,
>>> running on a v6.8.0 kernel inside QEMU (with MTE enabled), shows no 
>>> merging
>>> activity for those pages ...
>>
>> KSM's call to pages_identical() ultimately leads to memcmp_pages(). The
>> arm64 implementation of memcmp_pages() in arch/arm64/kernel/mte.c 
>> contains
>> a specific check that prevents merging in this case.
>>
>> try_to_merge_one_page()
>>     -> pages_identical()
>>         -> !memcmp_pages() Fails!
>>         -> replace_page()
>>
>>
>> int memcmp_pages(struct page *page1, struct page *page2)
>> {
>>     char *addr1, *addr2;
>>     int ret;
>>
>>     addr1 = page_address(page1);
>>     addr2 = page_address(page2);
>>     ret = memcmp(addr1, addr2, PAGE_SIZE);
>>
>>     if (!system_supports_mte() || ret)
>>         return ret;
>>
>>     /*
>>      * If the page content is identical but at least one of the pages is
>>      * tagged, return non-zero to avoid KSM merging. If only one of the
>>      * pages is tagged, __set_ptes() may zero or change the tags of the
>>      * other page via mte_sync_tags().
>>      */
>>     if (page_mte_tagged(page1) || page_mte_tagged(page2))
>>         return addr1 != addr2;
>>
>>     return ret;
>> }
>>
>> IIUC, if either page is MTE-tagged, memcmp_pages() intentionally returns
>> a non-zero value, which in turn causes pages_identical() to return false.
> 
> Cool, so we should likely just use that then in the shrinker code. Can 
> you send a fix?

Certainly! I'll get on that ;p

Cheers,
Lance



^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2025-09-19 13:24 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-30 10:03 [PATCH v5 0/6] mm: split underused THPs Usama Arif
2024-08-30 10:03 ` [PATCH v5 1/6] mm: free zapped tail pages when splitting isolated thp Usama Arif
2024-09-05  8:46   ` Hugh Dickins
2024-09-05 10:21     ` Usama Arif
2024-09-05 18:05       ` Hugh Dickins
2024-09-05 19:24         ` Usama Arif
2024-08-30 10:03 ` [PATCH v5 2/6] mm: remap unused subpages to shared zeropage " Usama Arif
2024-10-23 16:21   ` Zi Yan
2024-10-23 16:50     ` Usama Arif
2024-10-23 16:55       ` Zi Yan
2024-10-23 16:56       ` Yu Zhao
2025-09-18  8:53   ` Qun-wei Lin (林群崴)
2025-09-18  8:56     ` David Hildenbrand
2025-09-18 11:42       ` Usama Arif
2025-09-18 11:57         ` David Hildenbrand
2025-09-18 12:22       ` Lance Yang
2025-09-18 12:25         ` Lance Yang
2025-09-18 12:35         ` David Hildenbrand
2025-09-19  5:16           ` Lance Yang
2025-09-19  7:55             ` David Hildenbrand
2025-09-19  8:14               ` Lance Yang
2025-09-19 10:53                 ` Lance Yang
2025-09-19 12:19                   ` Lance Yang
2025-09-19 12:44                     ` Lance Yang
2025-09-19 13:09                     ` David Hildenbrand
2025-09-19 13:24                       ` Lance Yang
2024-08-30 10:03 ` [PATCH v5 3/6] mm: selftest to verify zero-filled pages are mapped to zeropage Usama Arif
2024-08-30 10:03 ` [PATCH v5 4/6] mm: Introduce a pageflag for partially mapped folios Usama Arif
2024-12-11 15:03   ` David Hildenbrand
2024-12-12 10:30     ` Usama Arif
2024-12-12 10:49       ` David Hildenbrand
2024-08-30 10:03 ` [PATCH v5 5/6] mm: split underused THPs Usama Arif
2024-08-30 10:03 ` [PATCH v5 6/6] mm: add sysfs entry to disable splitting " Usama Arif

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).