* [PATCH v2 0/4] mm: split underutilized THPs
@ 2024-08-07 13:46 Usama Arif
2024-08-07 13:46 ` [PATCH v2 1/4] mm: free zapped tail pages when splitting isolated thp Usama Arif
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: Usama Arif @ 2024-08-07 13:46 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Usama Arif
The current upstream default policy for THP is "always". However, Meta
uses "madvise" in production, as the current THP=always policy vastly
overprovisions THPs in sparsely accessed memory areas, resulting in
excessive memory pressure and premature OOM killing.

Using madvise + relying on khugepaged has certain drawbacks over
THP=always. Using madvise hints means THPs aren't "transparent" and
require userspace changes. Waiting for khugepaged to scan memory and
collapse pages into THPs can be slow and unpredictable in terms of
performance (i.e. you don't know when the collapse will happen), while
production environments require predictable performance. If there is
enough memory available, it's better for both performance and
predictability to have a THP from fault time, i.e. THP=always, rather
than to wait for khugepaged to collapse it, and to deal with sparsely
populated THPs when the system is running out of memory.
This patch series is an attempt to mitigate the issue of running out of
memory when THP is always enabled. During runtime, whenever a THP is
being faulted in or collapsed by khugepaged, the THP is added to a
list. Whenever memory reclaim happens, the kernel runs the
deferred_split shrinker, which goes through the list and checks whether
the THP is underutilized, i.e. how many of the base 4K pages of the
entire THP are zero-filled. If this number is above a certain
threshold, the shrinker will attempt to split that THP. Then at remap
time, the pages that were zero-filled are mapped to the shared
zeropage, hence saving memory. This method avoids the downside of
wasting memory in areas where THP is sparsely filled when THP is always
enabled, while still providing the upsides of THPs, like reduced TLB
misses, without having to use madvise.
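The underutilization check described above can be sketched in userspace
(an illustrative Python model, not the kernel code; the actual
thp_underutilized() in patch 4 also exits early in both directions):

```python
PAGE_SIZE = 4096

def is_underutilized(thp: bytes, max_ptes_none: int) -> bool:
    """Return True if more than max_ptes_none of the THP's base
    4K pages are entirely zero-filled (a memchr_inv() analog)."""
    zero_page = bytes(PAGE_SIZE)
    nr_zero = sum(1 for off in range(0, len(thp), PAGE_SIZE)
                  if thp[off:off + PAGE_SIZE] == zero_page)
    return nr_zero > max_ptes_none
```

With max_ptes_none=409, a 2M THP with 511 of its 512 base pages
zero-filled is a split candidate, while a fully touched one is not.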
Meta production workloads that were CPU bound (>99% CPU utilization)
were tested with the THP shrinker. The results after 2 hours are as
follows:
                        | THP=madvise | THP=always    | THP=always
                        |             |               | + shrinker series
                        |             |               | + max_ptes_none=409
-----------------------------------------------------------------------------
Performance improvement |      -      |    +1.8%      |    +1.7%
(over THP=madvise)      |             |               |
-----------------------------------------------------------------------------
Memory usage            |    54.6G    | 58.8G (+7.7%) | 55.9G (+2.4%)
-----------------------------------------------------------------------------
max_ptes_none=409 means that any THP that has more than 409 out of 512
(80%) zero-filled pages will be split.
To test out the patches, the below commands without the shrinker will
invoke the OOM killer immediately and kill stress, but will not fail
with the shrinker:
echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
mkdir /sys/fs/cgroup/test
echo $$ > /sys/fs/cgroup/test/cgroup.procs
echo 20M > /sys/fs/cgroup/test/memory.max
echo 0 > /sys/fs/cgroup/test/memory.swap.max
# allocate twice memory.max for each stress worker and touch 40/512 of
# each THP, i.e. vm-stride 50K.
# With the shrinker, max_ptes_none of 470 and below won't invoke OOM
# killer.
# Without the shrinker, OOM killer is invoked immediately irrespective
# of max_ptes_none value and kills stress.
stress --vm 1 --vm-bytes 40M --vm-stride 50K
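As a sanity check on the stride arithmetic above (an illustrative
calculation, not part of the series): touching one byte every 50 KiB
within a 2 MiB THP dirties roughly 40 of its 512 base pages, well under
the max_ptes_none=450 threshold set above, so the shrinker can reclaim
the rest:

```python
THP_SIZE = 2 * 1024 * 1024   # one PMD-sized THP
STRIDE = 50 * 1024           # stress --vm-stride 50K
PAGE_SIZE = 4096

# Distinct base 4K pages dirtied inside the first THP of the mapping.
touched = {off // PAGE_SIZE for off in range(0, THP_SIZE, STRIDE)}
print(len(touched))          # 41 distinct pages, i.e. ~40/512
```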
v1 -> v2:
- Convert page checks and operations to their folio versions in
__split_huge_page. This means patches 1 and 2 from v1 are no longer
needed. (David Hildenbrand)
- Map to shared zeropage in all cases if the base page is zero-filled.
The uffd selftest was removed.
(David Hildenbrand).
- Rename 'dirty' to 'contains_data' in try_to_map_unused_to_zeropage
(Rik van Riel).
- Use unsigned long instead of uint64_t (kernel test robot).
Alexander Zhu (1):
mm: selftest to verify zero-filled pages are mapped to zeropage
Usama Arif (1):
mm: split underutilized THPs
Yu Zhao (2):
mm: free zapped tail pages when splitting isolated thp
mm: remap unused subpages to shared zeropage when splitting isolated
thp
Documentation/admin-guide/mm/transhuge.rst | 6 +
include/linux/huge_mm.h | 4 +-
include/linux/khugepaged.h | 1 +
include/linux/mm_types.h | 2 +
include/linux/rmap.h | 3 +-
include/linux/vm_event_item.h | 1 +
mm/huge_memory.c | 153 +++++++++++++++---
mm/hugetlb.c | 1 +
mm/internal.h | 4 +-
mm/khugepaged.c | 3 +-
mm/memcontrol.c | 3 +-
mm/migrate.c | 73 +++++++--
mm/migrate_device.c | 4 +-
mm/rmap.c | 2 +-
mm/vmscan.c | 3 +-
mm/vmstat.c | 1 +
.../selftests/mm/split_huge_page_test.c | 71 ++++++++
tools/testing/selftests/mm/vm_util.c | 22 +++
tools/testing/selftests/mm/vm_util.h | 1 +
19 files changed, 321 insertions(+), 37 deletions(-)
--
2.43.5
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v2 1/4] mm: free zapped tail pages when splitting isolated thp
2024-08-07 13:46 [PATCH v2 0/4] mm: split underutilized THPs Usama Arif
@ 2024-08-07 13:46 ` Usama Arif
2024-08-07 19:45 ` Johannes Weiner
2024-08-08 15:56 ` David Hildenbrand
2024-08-07 13:46 ` [PATCH v2 2/4] mm: remap unused subpages to shared zeropage " Usama Arif
` (2 subsequent siblings)
3 siblings, 2 replies; 13+ messages in thread
From: Usama Arif @ 2024-08-07 13:46 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Shuang Zhai, Usama Arif
From: Yu Zhao <yuzhao@google.com>
If a tail page has only two references left, one inherited from the
isolation of its head and the other from lru_add_page_tail() which we
are about to drop, it means this tail page was concurrently zapped.
Then we can safely free it and save page reclaim or migration the
trouble of trying it.
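The refcount reasoning can be modelled in userspace (a simplified
analog of page_ref_freeze(); kernel-side this is a single atomic
cmpxchg with no lock):

```python
import threading

class PageRef:
    """Toy model of a page refcount supporting an atomic freeze."""
    def __init__(self, count: int):
        self._count = count
        self._lock = threading.Lock()

    def freeze(self, expected: int) -> bool:
        # page_ref_freeze() analog: set the refcount to 0 only if it
        # currently equals the expected value, atomically.
        with self._lock:
            if self._count == expected:
                self._count = 0
                return True
            return False
```

With exactly two references left (isolation of the head plus
lru_add_page_tail()), the freeze succeeds and the tail page can be
freed; any extra reference means someone may still use it, and the
freeze fails.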
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Shuang Zhai <zhais@google.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
mm/huge_memory.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0167dc27e365..35c1089d8d61 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2923,7 +2923,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
unsigned int new_nr = 1 << new_order;
int order = folio_order(folio);
unsigned int nr = 1 << order;
+ struct folio_batch free_folios;
+ folio_batch_init(&free_folios);
/* complete memcg works before add pages to LRU */
split_page_memcg(head, order, new_order);
@@ -3007,6 +3009,26 @@ static void __split_huge_page(struct page *page, struct list_head *list,
if (subpage == page)
continue;
folio_unlock(new_folio);
+ /*
+ * If a folio has only two references left, one inherited
+ * from the isolation of its head and the other from
+ * lru_add_page_tail() which we are about to drop, it means this
+ * folio was concurrently zapped. Then we can safely free it
+ * and save page reclaim or migration the trouble of trying it.
+ */
+ if (list && page_ref_freeze(subpage, 2)) {
+ VM_WARN_ON_ONCE_FOLIO(folio_test_lru(new_folio), new_folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_test_large(new_folio), new_folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_mapped(new_folio), new_folio);
+
+ folio_clear_active(new_folio);
+ folio_clear_unevictable(new_folio);
+ if (folio_batch_add(&free_folios, new_folio) == 0) {
+ mem_cgroup_uncharge_folios(&free_folios);
+ free_unref_folios(&free_folios);
+ }
+ continue;
+ }
/*
* Subpages may be freed if there wasn't any mapping
@@ -3017,6 +3039,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
*/
free_page_and_swap_cache(subpage);
}
+
+ if (free_folios.nr) {
+ mem_cgroup_uncharge_folios(&free_folios);
+ free_unref_folios(&free_folios);
+ }
}
/* Racy check whether the huge page can be split */
--
2.43.5
* [PATCH v2 2/4] mm: remap unused subpages to shared zeropage when splitting isolated thp
2024-08-07 13:46 [PATCH v2 0/4] mm: split underutilized THPs Usama Arif
2024-08-07 13:46 ` [PATCH v2 1/4] mm: free zapped tail pages when splitting isolated thp Usama Arif
@ 2024-08-07 13:46 ` Usama Arif
2024-08-07 20:02 ` Johannes Weiner
2024-08-07 13:46 ` [PATCH v2 3/4] mm: selftest to verify zero-filled pages are mapped to zeropage Usama Arif
2024-08-07 13:46 ` [PATCH v2 4/4] mm: split underutilized THPs Usama Arif
3 siblings, 1 reply; 13+ messages in thread
From: Usama Arif @ 2024-08-07 13:46 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Shuang Zhai, Usama Arif
From: Yu Zhao <yuzhao@google.com>
Here being unused means containing only zeros and inaccessible to
userspace. When splitting an isolated thp under reclaim or migration,
the unused subpages can be mapped to the shared zeropage, hence saving
memory. This is particularly helpful when the internal
fragmentation of a thp is high, i.e. it has many untouched subpages.
This is also a prerequisite for the THP low-utilization shrinker
introduced in later patches, where underutilized THPs are split and
the zero-filled pages freed, saving memory.
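The decision made by the patch's try_to_map_unused_to_zeropage() can be
sketched as follows (a userspace model; in the kernel the checks run
under the rmap walk, with the relevant PTE already non-present):

```python
PAGE_SIZE = 4096

def can_map_to_zeropage(page: bytes, mlocked: bool = False,
                        mm_forbids_zeropage: bool = False) -> bool:
    """Model of try_to_map_unused_to_zeropage(): a subpage may be
    backed by the shared zeropage only if it isn't mlocked, the mm
    allows the zeropage, and every byte is zero (memchr_inv() analog)."""
    if mlocked:
        return False
    contains_data = any(page)      # True if any byte is non-zero
    return not (contains_data or mm_forbids_zeropage)
```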
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Shuang Zhai <zhais@google.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
include/linux/rmap.h | 3 +-
mm/huge_memory.c | 8 ++---
mm/migrate.c | 70 +++++++++++++++++++++++++++++++++++++++-----
mm/migrate_device.c | 4 +--
4 files changed, 70 insertions(+), 15 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 0978c64f49d8..1d338466a495 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -745,7 +745,8 @@ int folio_mkclean(struct folio *);
int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
struct vm_area_struct *vma);
-void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked);
+void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked,
+ bool map_unused_to_zeropage);
/*
* rmap_walk_control: To control rmap traversing for specific needs
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 35c1089d8d61..891562665e19 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2775,7 +2775,7 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
return false;
}
-static void remap_page(struct folio *folio, unsigned long nr)
+static void remap_page(struct folio *folio, unsigned long nr, bool map_unused_to_zeropage)
{
int i = 0;
@@ -2783,7 +2783,7 @@ static void remap_page(struct folio *folio, unsigned long nr)
if (!folio_test_anon(folio))
return;
for (;;) {
- remove_migration_ptes(folio, folio, true);
+ remove_migration_ptes(folio, folio, true, map_unused_to_zeropage);
i += folio_nr_pages(folio);
if (i >= nr)
break;
@@ -2993,7 +2993,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
if (nr_dropped)
shmem_uncharge(folio->mapping->host, nr_dropped);
- remap_page(folio, nr);
+ remap_page(folio, nr, PageAnon(head));
/*
* set page to its compound_head when split to non order-0 pages, so
@@ -3287,7 +3287,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
if (mapping)
xas_unlock(&xas);
local_irq_enable();
- remap_page(folio, folio_nr_pages(folio));
+ remap_page(folio, folio_nr_pages(folio), false);
ret = -EAGAIN;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index b273bac0d5ae..151bf1b6204d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -177,13 +177,56 @@ void putback_movable_pages(struct list_head *l)
}
}
+static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
+ struct folio *folio,
+ unsigned long idx)
+{
+ struct page *page = folio_page(folio, idx);
+ bool contains_data;
+ pte_t newpte;
+ void *addr;
+
+ VM_BUG_ON_PAGE(PageCompound(page), page);
+ VM_BUG_ON_PAGE(!PageAnon(page), page);
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
+
+ if (PageMlocked(page) || (pvmw->vma->vm_flags & VM_LOCKED))
+ return false;
+
+ /*
+ * The pmd entry mapping the old thp was flushed and the pte mapping
+ * this subpage has been non present. If the subpage is only zero-filled
+ * then map it to the shared zeropage.
+ */
+ addr = kmap_local_page(page);
+ contains_data = memchr_inv(addr, 0, PAGE_SIZE);
+ kunmap_local(addr);
+
+ if (contains_data || mm_forbids_zeropage(pvmw->vma->vm_mm))
+ return false;
+
+ newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
+ pvmw->vma->vm_page_prot));
+ set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
+
+ dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
+ return true;
+}
+
+struct rmap_walk_arg {
+ struct folio *folio;
+ bool map_unused_to_zeropage;
+};
+
/*
* Restore a potential migration pte to a working pte entry
*/
static bool remove_migration_pte(struct folio *folio,
- struct vm_area_struct *vma, unsigned long addr, void *old)
+ struct vm_area_struct *vma, unsigned long addr, void *arg)
{
- DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
+ struct rmap_walk_arg *rmap_walk_arg = arg;
+ DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
while (page_vma_mapped_walk(&pvmw)) {
rmap_t rmap_flags = RMAP_NONE;
@@ -207,6 +250,9 @@ static bool remove_migration_pte(struct folio *folio,
continue;
}
#endif
+ if (rmap_walk_arg->map_unused_to_zeropage &&
+ try_to_map_unused_to_zeropage(&pvmw, folio, idx))
+ continue;
folio_get(folio);
pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
@@ -285,13 +331,21 @@ static bool remove_migration_pte(struct folio *folio,
* Get rid of all migration entries and replace them by
* references to the indicated page.
*/
-void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked)
+void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked,
+ bool map_unused_to_zeropage)
{
+ struct rmap_walk_arg rmap_walk_arg = {
+ .folio = src,
+ .map_unused_to_zeropage = map_unused_to_zeropage,
+ };
+
struct rmap_walk_control rwc = {
.rmap_one = remove_migration_pte,
- .arg = src,
+ .arg = &rmap_walk_arg,
};
+ VM_BUG_ON_FOLIO(map_unused_to_zeropage && src != dst, src);
+
if (locked)
rmap_walk_locked(dst, &rwc);
else
@@ -904,7 +958,7 @@ static int writeout(struct address_space *mapping, struct folio *folio)
* At this point we know that the migration attempt cannot
* be successful.
*/
- remove_migration_ptes(folio, folio, false);
+ remove_migration_ptes(folio, folio, false, false);
rc = mapping->a_ops->writepage(&folio->page, &wbc);
@@ -1068,7 +1122,7 @@ static void migrate_folio_undo_src(struct folio *src,
struct list_head *ret)
{
if (page_was_mapped)
- remove_migration_ptes(src, src, false);
+ remove_migration_ptes(src, src, false, false);
/* Drop an anon_vma reference if we took one */
if (anon_vma)
put_anon_vma(anon_vma);
@@ -1306,7 +1360,7 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
lru_add_drain();
if (old_page_state & PAGE_WAS_MAPPED)
- remove_migration_ptes(src, dst, false);
+ remove_migration_ptes(src, dst, false, false);
out_unlock_both:
folio_unlock(dst);
@@ -1444,7 +1498,7 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
if (page_was_mapped)
remove_migration_ptes(src,
- rc == MIGRATEPAGE_SUCCESS ? dst : src, false);
+ rc == MIGRATEPAGE_SUCCESS ? dst : src, false, false);
unlock_put_anon:
folio_unlock(dst);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6d66dc1c6ffa..a1630d8e0d95 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -424,7 +424,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
continue;
folio = page_folio(page);
- remove_migration_ptes(folio, folio, false);
+ remove_migration_ptes(folio, folio, false, false);
src_pfns[i] = 0;
folio_unlock(folio);
@@ -837,7 +837,7 @@ void migrate_device_finalize(unsigned long *src_pfns,
src = page_folio(page);
dst = page_folio(newpage);
- remove_migration_ptes(src, dst, false);
+ remove_migration_ptes(src, dst, false, false);
folio_unlock(src);
if (is_zone_device_page(page))
--
2.43.5
* [PATCH v2 3/4] mm: selftest to verify zero-filled pages are mapped to zeropage
2024-08-07 13:46 [PATCH v2 0/4] mm: split underutilized THPs Usama Arif
2024-08-07 13:46 ` [PATCH v2 1/4] mm: free zapped tail pages when splitting isolated thp Usama Arif
2024-08-07 13:46 ` [PATCH v2 2/4] mm: remap unused subpages to shared zeropage " Usama Arif
@ 2024-08-07 13:46 ` Usama Arif
2024-08-07 13:46 ` [PATCH v2 4/4] mm: split underutilized THPs Usama Arif
3 siblings, 0 replies; 13+ messages in thread
From: Usama Arif @ 2024-08-07 13:46 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Alexander Zhu, Usama Arif
From: Alexander Zhu <alexlzhu@fb.com>
When a THP is split, any subpage that is zero-filled will be mapped
to the shared zeropage, hence saving memory. Add a selftest to verify
this by allocating a zero-filled THP and comparing RssAnon before and
after the split.
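The selftest's RssAnon readout (rss_anon() in vm_util.c) amounts to
parsing one line of /proc/self/status; a Python equivalent of its
sscanf("RssAnon:%10lu kB") step, for illustration:

```python
import re

def parse_rss_anon(status: str) -> int:
    """Return RssAnon in kB from /proc/<pid>/status contents."""
    m = re.search(r"^RssAnon:\s*(\d+)\s*kB", status, re.MULTILINE)
    if m is None:
        raise ValueError("RssAnon: line not found")
    return int(m.group(1))
```

After a successful split of an all-zero THP, this value should drop,
which is exactly what verify_rss_anon_split_huge_page_all_zeroes()
asserts.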
Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
.../selftests/mm/split_huge_page_test.c | 71 +++++++++++++++++++
tools/testing/selftests/mm/vm_util.c | 22 ++++++
tools/testing/selftests/mm/vm_util.h | 1 +
3 files changed, 94 insertions(+)
diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c
index e5e8dafc9d94..eb6d1b9fc362 100644
--- a/tools/testing/selftests/mm/split_huge_page_test.c
+++ b/tools/testing/selftests/mm/split_huge_page_test.c
@@ -84,6 +84,76 @@ static void write_debugfs(const char *fmt, ...)
write_file(SPLIT_DEBUGFS, input, ret + 1);
}
+static char *allocate_zero_filled_hugepage(size_t len)
+{
+ char *result;
+ size_t i;
+
+ result = memalign(pmd_pagesize, len);
+ if (!result) {
+ printf("Fail to allocate memory\n");
+ exit(EXIT_FAILURE);
+ }
+
+ madvise(result, len, MADV_HUGEPAGE);
+
+ for (i = 0; i < len; i++)
+ result[i] = (char)0;
+
+ return result;
+}
+
+static void verify_rss_anon_split_huge_page_all_zeroes(char *one_page, int nr_hpages, size_t len)
+{
+ unsigned long rss_anon_before, rss_anon_after;
+ size_t i;
+
+ if (!check_huge_anon(one_page, 4, pmd_pagesize)) {
+ printf("No THP is allocated\n");
+ exit(EXIT_FAILURE);
+ }
+
+ rss_anon_before = rss_anon();
+ if (!rss_anon_before) {
+ printf("No RssAnon is allocated before split\n");
+ exit(EXIT_FAILURE);
+ }
+
+ /* split all THPs */
+ write_debugfs(PID_FMT, getpid(), (uint64_t)one_page,
+ (uint64_t)one_page + len, 0);
+
+ for (i = 0; i < len; i++)
+ if (one_page[i] != (char)0) {
+ printf("%ld byte corrupted\n", i);
+ exit(EXIT_FAILURE);
+ }
+
+ if (!check_huge_anon(one_page, 0, pmd_pagesize)) {
+ printf("Still AnonHugePages not split\n");
+ exit(EXIT_FAILURE);
+ }
+
+ rss_anon_after = rss_anon();
+ if (rss_anon_after >= rss_anon_before) {
+ printf("Incorrect RssAnon value. Before: %ld After: %ld\n",
+ rss_anon_before, rss_anon_after);
+ exit(EXIT_FAILURE);
+ }
+}
+
+void split_pmd_zero_pages(void)
+{
+ char *one_page;
+ int nr_hpages = 4;
+ size_t len = nr_hpages * pmd_pagesize;
+
+ one_page = allocate_zero_filled_hugepage(len);
+ verify_rss_anon_split_huge_page_all_zeroes(one_page, nr_hpages, len);
+ printf("Split zero filled huge pages successful\n");
+ free(one_page);
+}
+
void split_pmd_thp(void)
{
char *one_page;
@@ -431,6 +501,7 @@ int main(int argc, char **argv)
fd_size = 2 * pmd_pagesize;
+ split_pmd_zero_pages();
split_pmd_thp();
split_pte_mapped_thp();
split_file_backed_thp();
diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c
index 5a62530da3b5..d8d0cf04bb57 100644
--- a/tools/testing/selftests/mm/vm_util.c
+++ b/tools/testing/selftests/mm/vm_util.c
@@ -12,6 +12,7 @@
#define PMD_SIZE_FILE_PATH "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
#define SMAP_FILE_PATH "/proc/self/smaps"
+#define STATUS_FILE_PATH "/proc/self/status"
#define MAX_LINE_LENGTH 500
unsigned int __page_size;
@@ -171,6 +172,27 @@ uint64_t read_pmd_pagesize(void)
return strtoul(buf, NULL, 10);
}
+unsigned long rss_anon(void)
+{
+ unsigned long rss_anon = 0;
+ FILE *fp;
+ char buffer[MAX_LINE_LENGTH];
+
+ fp = fopen(STATUS_FILE_PATH, "r");
+ if (!fp)
+ ksft_exit_fail_msg("%s: Failed to open file %s\n", __func__, STATUS_FILE_PATH);
+
+ if (!check_for_pattern(fp, "RssAnon:", buffer, sizeof(buffer)))
+ goto err_out;
+
+ if (sscanf(buffer, "RssAnon:%10lu kB", &rss_anon) != 1)
+ ksft_exit_fail_msg("Reading status error\n");
+
+err_out:
+ fclose(fp);
+ return rss_anon;
+}
+
bool __check_huge(void *addr, char *pattern, int nr_hpages,
uint64_t hpage_size)
{
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index 9007c420d52c..71b75429f4a5 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -39,6 +39,7 @@ unsigned long pagemap_get_pfn(int fd, char *start);
void clear_softdirty(void);
bool check_for_pattern(FILE *fp, const char *pattern, char *buf, size_t len);
uint64_t read_pmd_pagesize(void);
+uint64_t rss_anon(void);
bool check_huge_anon(void *addr, int nr_hpages, uint64_t hpage_size);
bool check_huge_file(void *addr, int nr_hpages, uint64_t hpage_size);
bool check_huge_shmem(void *addr, int nr_hpages, uint64_t hpage_size);
--
2.43.5
* [PATCH v2 4/4] mm: split underutilized THPs
2024-08-07 13:46 [PATCH v2 0/4] mm: split underutilized THPs Usama Arif
` (2 preceding siblings ...)
2024-08-07 13:46 ` [PATCH v2 3/4] mm: selftest to verify zero-filled pages are mapped to zeropage Usama Arif
@ 2024-08-07 13:46 ` Usama Arif
2024-08-08 15:55 ` David Hildenbrand
3 siblings, 1 reply; 13+ messages in thread
From: Usama Arif @ 2024-08-07 13:46 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Usama Arif
This is an attempt to mitigate the issue of running out of memory when
THP is always enabled. During runtime, whenever a THP is being faulted
in (__do_huge_pmd_anonymous_page) or collapsed by khugepaged
(collapse_huge_page), the THP is added to _deferred_list. Whenever
memory reclaim happens in Linux, the kernel runs the deferred_split
shrinker, which goes through the _deferred_list.
If the folio was partially mapped, the shrinker attempts to split it.
A new boolean is added to be able to distinguish between partially
mapped folios and others in the deferred_list at split time in
deferred_split_scan. It's needed because __folio_remove_rmap decrements
the folio mapcount, hence it won't otherwise be possible to distinguish
between partially mapped folios and others in deferred_split_scan.
If folio->_partially_mapped is not set, the shrinker checks whether the
THP is underutilized, i.e. how many of the base 4K pages of the entire
THP are zero-filled. If this number is above a certain threshold
(decided by /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none),
the shrinker will attempt to split that THP. Then at remap time, the
pages that were zero-filled are mapped to the shared zeropage, hence
saving memory.
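The scan done by this patch's thp_underutilized() bails out as soon as
either verdict is certain; a userspace sketch of that double early
exit, assuming the 512-page PMD geometry:

```python
PAGE_SIZE = 4096
HPAGE_PMD_NR = 512  # base pages per PMD-sized THP on x86-64

def thp_underutilized(thp: bytes, max_ptes_none: int) -> bool:
    """Stop as soon as the zero-page count exceeds max_ptes_none
    (underutilized), or the filled-page count proves the threshold
    can no longer be reached (utilized)."""
    zero_page = bytes(PAGE_SIZE)
    num_zero = num_filled = 0
    for off in range(0, len(thp), PAGE_SIZE):
        if thp[off:off + PAGE_SIZE] == zero_page:
            num_zero += 1
            if num_zero > max_ptes_none:
                return True
        else:
            num_filled += 1
            if num_filled >= HPAGE_PMD_NR - max_ptes_none:
                return False
    return False
```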
Suggested-by: Rik van Riel <riel@surriel.com>
Co-authored-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
Documentation/admin-guide/mm/transhuge.rst | 6 ++
include/linux/huge_mm.h | 4 +-
include/linux/khugepaged.h | 1 +
include/linux/mm_types.h | 2 +
include/linux/vm_event_item.h | 1 +
mm/huge_memory.c | 118 ++++++++++++++++++---
mm/hugetlb.c | 1 +
mm/internal.h | 4 +-
mm/khugepaged.c | 3 +-
mm/memcontrol.c | 3 +-
mm/migrate.c | 3 +-
mm/rmap.c | 2 +-
mm/vmscan.c | 3 +-
mm/vmstat.c | 1 +
14 files changed, 130 insertions(+), 22 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 058485daf186..24eec1c03ad8 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -447,6 +447,12 @@ thp_deferred_split_page
splitting it would free up some memory. Pages on split queue are
going to be split under memory pressure.
+thp_underutilized_split_page
+ is incremented when a huge page on the split queue was split
+ because it was underutilized. A THP is underutilized if the
number of zero pages in the THP is above a certain threshold
+ (/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none).
+
thp_split_pmd
is incremented every time a PMD split into table of PTEs.
This can happen, for instance, when application calls mprotect() or
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e25d9ebfdf89..00af84aa88ea 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -321,7 +321,7 @@ static inline int split_huge_page(struct page *page)
{
return split_huge_page_to_list_to_order(page, NULL, 0);
}
-void deferred_split_folio(struct folio *folio);
+void deferred_split_folio(struct folio *folio, bool partially_mapped);
void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long address, bool freeze, struct folio *folio);
@@ -484,7 +484,7 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
-static inline void deferred_split_folio(struct folio *folio) {}
+static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
#define split_huge_pmd(__vma, __pmd, __address) \
do { } while (0)
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index f68865e19b0b..30baae91b225 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -4,6 +4,7 @@
#include <linux/sched/coredump.h> /* MMF_VM_HUGEPAGE */
+extern unsigned int khugepaged_max_ptes_none __read_mostly;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
extern struct attribute_group khugepaged_attr_group;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 485424979254..443026cf763e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -311,6 +311,7 @@ typedef struct {
* @_hugetlb_cgroup_rsvd: Do not use directly, use accessor in hugetlb_cgroup.h.
* @_hugetlb_hwpoison: Do not use directly, call raw_hwp_list_head().
* @_deferred_list: Folios to be split under memory pressure.
+ * @_partially_mapped: Folio was partially mapped.
* @_unused_slab_obj_exts: Placeholder to match obj_exts in struct slab.
*
* A folio is a physically, virtually and logically contiguous set
@@ -393,6 +394,7 @@ struct folio {
unsigned long _head_2a;
/* public: */
struct list_head _deferred_list;
+ bool _partially_mapped;
/* private: the union with struct page is transitional */
};
struct page __page_2;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index aae5c7c5cfb4..bf1470a7a737 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -105,6 +105,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_SPLIT_PAGE,
THP_SPLIT_PAGE_FAILED,
THP_DEFERRED_SPLIT_PAGE,
+ THP_UNDERUTILIZED_SPLIT_PAGE,
THP_SPLIT_PMD,
THP_SCAN_EXCEED_NONE_PTE,
THP_SCAN_EXCEED_SWAP_PTE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 891562665e19..d2008f748e92 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -73,6 +73,7 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
struct shrink_control *sc);
static unsigned long deferred_split_scan(struct shrinker *shrink,
struct shrink_control *sc);
+static bool split_underutilized_thp = true;
static atomic_t huge_zero_refcount;
struct folio *huge_zero_folio __read_mostly;
@@ -438,6 +439,27 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj,
static struct kobj_attribute hpage_pmd_size_attr =
__ATTR_RO(hpage_pmd_size);
+static ssize_t split_underutilized_thp_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%d\n", split_underutilized_thp);
+}
+
+static ssize_t split_underutilized_thp_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err = kstrtobool(buf, &split_underutilized_thp);
+
+ if (err < 0)
+ return err;
+
+ return count;
+}
+
+static struct kobj_attribute split_underutilized_thp_attr = __ATTR(
+ thp_low_util_shrinker, 0644, split_underutilized_thp_show, split_underutilized_thp_store);
+
static struct attribute *hugepage_attr[] = {
&enabled_attr.attr,
&defrag_attr.attr,
@@ -446,6 +468,7 @@ static struct attribute *hugepage_attr[] = {
#ifdef CONFIG_SHMEM
&shmem_enabled_attr.attr,
#endif
+ &split_underutilized_thp_attr.attr,
NULL,
};
@@ -1002,6 +1025,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
mm_inc_nr_ptes(vma->vm_mm);
+ deferred_split_folio(folio, false);
spin_unlock(vmf->ptl);
count_vm_event(THP_FAULT_ALLOC);
count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
@@ -3260,6 +3284,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
* page_deferred_list.
*/
list_del_init(&folio->_deferred_list);
+ folio->_partially_mapped = false;
}
spin_unlock(&ds_queue->split_queue_lock);
if (mapping) {
@@ -3316,11 +3341,12 @@ void __folio_undo_large_rmappable(struct folio *folio)
if (!list_empty(&folio->_deferred_list)) {
ds_queue->split_queue_len--;
list_del_init(&folio->_deferred_list);
+ folio->_partially_mapped = false;
}
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}
-void deferred_split_folio(struct folio *folio)
+void deferred_split_folio(struct folio *folio, bool partially_mapped)
{
struct deferred_split *ds_queue = get_deferred_split_queue(folio);
#ifdef CONFIG_MEMCG
@@ -3335,6 +3361,9 @@ void deferred_split_folio(struct folio *folio)
if (folio_order(folio) <= 1)
return;
+ if (!partially_mapped && !split_underutilized_thp)
+ return;
+
/*
* The try_to_unmap() in page reclaim path might reach here too,
* this may cause a race condition to corrupt deferred split queue.
@@ -3348,14 +3377,14 @@ void deferred_split_folio(struct folio *folio)
if (folio_test_swapcache(folio))
return;
- if (!list_empty(&folio->_deferred_list))
- return;
-
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+ folio->_partially_mapped = partially_mapped;
if (list_empty(&folio->_deferred_list)) {
- if (folio_test_pmd_mappable(folio))
- count_vm_event(THP_DEFERRED_SPLIT_PAGE);
- count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
+ if (partially_mapped) {
+ if (folio_test_pmd_mappable(folio))
+ count_vm_event(THP_DEFERRED_SPLIT_PAGE);
+ count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
+ }
list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
ds_queue->split_queue_len++;
#ifdef CONFIG_MEMCG
@@ -3380,6 +3409,39 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
return READ_ONCE(ds_queue->split_queue_len);
}
+static bool thp_underutilized(struct folio *folio)
+{
+ int num_zero_pages = 0, num_filled_pages = 0;
+ void *kaddr;
+ int i;
+
+ if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
+ return false;
+
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
+ if (memchr_inv(kaddr, 0, PAGE_SIZE) == NULL) {
+ num_zero_pages++;
+ if (num_zero_pages > khugepaged_max_ptes_none) {
+ kunmap_local(kaddr);
+ return true;
+ }
+ } else {
+ /*
+ * Another path for early exit once the number
+ * of non-zero filled pages exceeds threshold.
+ */
+ num_filled_pages++;
+ if (num_filled_pages >= HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+ kunmap_local(kaddr);
+ return false;
+ }
+ }
+ kunmap_local(kaddr);
+ }
+ return false;
+}
+
static unsigned long deferred_split_scan(struct shrinker *shrink,
struct shrink_control *sc)
{
@@ -3404,6 +3466,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
} else {
/* We lost race with folio_put() */
list_del_init(&folio->_deferred_list);
+ folio->_partially_mapped = false;
ds_queue->split_queue_len--;
}
if (!--sc->nr_to_scan)
@@ -3412,18 +3475,45 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+ bool did_split = false;
+ bool underutilized = false;
+
+ if (folio->_partially_mapped)
+ goto split;
+ underutilized = thp_underutilized(folio);
+ if (underutilized)
+ goto split;
+ continue;
+split:
if (!folio_trylock(folio))
- goto next;
- /* split_huge_page() removes page from list on success */
- if (!split_folio(folio))
- split++;
+ continue;
+ did_split = !split_folio(folio);
folio_unlock(folio);
-next:
- folio_put(folio);
+ if (did_split) {
+ /* Splitting removed folio from the list, drop reference here */
+ folio_put(folio);
+ if (underutilized)
+ count_vm_event(THP_UNDERUTILIZED_SPLIT_PAGE);
+ split++;
+ }
}
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
- list_splice_tail(&list, &ds_queue->split_queue);
+ /*
+ * Only add back to the queue if folio->_partially_mapped is set.
+ * If thp_underutilized returns false, or if split_folio fails in
+ * the case it was underutilized, then consider it used and don't
+ * add it back to split_queue.
+ */
+ list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+ if (folio->_partially_mapped)
+ list_move(&folio->_deferred_list, &ds_queue->split_queue);
+ else {
+ list_del_init(&folio->_deferred_list);
+ ds_queue->split_queue_len--;
+ }
+ folio_put(folio);
+ }
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5a32157ca309..df2da47d0637 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1758,6 +1758,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
free_gigantic_folio(folio, huge_page_order(h));
} else {
INIT_LIST_HEAD(&folio->_deferred_list);
+ folio->_partially_mapped = false;
folio_put(folio);
}
}
diff --git a/mm/internal.h b/mm/internal.h
index 7a3bcc6d95e7..d646464ba0d7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -657,8 +657,10 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
atomic_set(&folio->_entire_mapcount, -1);
atomic_set(&folio->_nr_pages_mapped, 0);
atomic_set(&folio->_pincount, 0);
- if (order > 1)
+ if (order > 1) {
INIT_LIST_HEAD(&folio->_deferred_list);
+ folio->_partially_mapped = false;
+ }
}
static inline void prep_compound_tail(struct page *head, int tail_idx)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f3b3db104615..5a434fdbc1ef 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -85,7 +85,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
*
* Note that these are only respected if collapse was initiated by khugepaged.
*/
-static unsigned int khugepaged_max_ptes_none __read_mostly;
+unsigned int khugepaged_max_ptes_none __read_mostly;
static unsigned int khugepaged_max_ptes_swap __read_mostly;
static unsigned int khugepaged_max_ptes_shared __read_mostly;
@@ -1235,6 +1235,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, address, pmd, _pmd);
update_mmu_cache_pmd(vma, address, pmd);
+ deferred_split_folio(folio, false);
spin_unlock(pmd_ptl);
folio = NULL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9b3ef3a70833..bd8facce2a2b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4651,7 +4651,8 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
!folio_test_hugetlb(folio) &&
- !list_empty(&folio->_deferred_list), folio);
+ !list_empty(&folio->_deferred_list) &&
+ folio->_partially_mapped, folio);
/*
* Nobody should be changing or seriously looking at
diff --git a/mm/migrate.c b/mm/migrate.c
index 151bf1b6204d..395f1b0deb45 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1731,7 +1731,8 @@ static int migrate_pages_batch(struct list_head *from,
* use _deferred_list.
*/
if (nr_pages > 2 &&
- !list_empty(&folio->_deferred_list)) {
+ !list_empty(&folio->_deferred_list) &&
+ folio->_partially_mapped) {
if (try_split_folio(folio, split_folios) == 0) {
nr_failed++;
stats->nr_thp_failed += is_thp;
diff --git a/mm/rmap.c b/mm/rmap.c
index 2630bde38640..1b5418121965 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1582,7 +1582,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
*/
if (folio_test_anon(folio) && partially_mapped &&
list_empty(&folio->_deferred_list))
- deferred_split_folio(folio);
+ deferred_split_folio(folio, true);
}
__folio_mod_stat(folio, -nr, -nr_pmdmapped);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c89d0551655e..1bee9b1262f6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1233,7 +1233,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
* Split partially mapped folios right away.
* We can free the unmapped pages without IO.
*/
- if (data_race(!list_empty(&folio->_deferred_list)) &&
+ if (data_race(!list_empty(&folio->_deferred_list) &&
+ folio->_partially_mapped) &&
split_folio_to_list(folio, folio_list))
goto activate_locked;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 5082431dad28..525fad4a1d6d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1367,6 +1367,7 @@ const char * const vmstat_text[] = {
"thp_split_page",
"thp_split_page_failed",
"thp_deferred_split_page",
+ "thp_underutilized_split_page",
"thp_split_pmd",
"thp_scan_exceed_none_pte",
"thp_scan_exceed_swap_pte",
--
2.43.5
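The early-exit scan in the new thp_underutilized() above can be illustrated with a minimal userspace sketch. PAGE_SIZE, HPAGE_PMD_NR, and the max_ptes_none value are stand-ins for the kernel's, and a plain byte loop replaces memchr_inv(); this is only a model of the patch's logic, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define HPAGE_PMD_NR 512		/* 2M THP / 4K base pages */

/* Stand-in for khugepaged_max_ptes_none; a sysfs tunable in the kernel. */
static unsigned int max_ptes_none = 511;

/*
 * Userspace model of thp_underutilized(): scan the base pages of a
 * THP-sized buffer and return true once more than max_ptes_none of
 * them are entirely zero-filled. It mirrors the two early exits in
 * the patch: one when the zero-page count crosses the threshold, and
 * one when enough non-zero pages have been seen that the threshold
 * can no longer be crossed.
 */
static bool thp_underutilized(const unsigned char *thp)
{
	int num_zero_pages = 0, num_filled_pages = 0;
	int i;

	if (max_ptes_none == HPAGE_PMD_NR - 1)
		return false;	/* splitting effectively disabled */

	for (i = 0; i < HPAGE_PMD_NR; i++) {
		const unsigned char *kaddr = thp + (size_t)i * PAGE_SIZE;
		unsigned char acc = 0;
		size_t j;

		/* Stands in for memchr_inv(kaddr, 0, PAGE_SIZE). */
		for (j = 0; j < PAGE_SIZE; j++)
			acc |= kaddr[j];

		if (acc == 0) {
			if (++num_zero_pages > max_ptes_none)
				return true;
		} else {
			if (++num_filled_pages >= HPAGE_PMD_NR - max_ptes_none)
				return false;
		}
	}
	return false;
}
```

With max_ptes_none at 255, a fully zero-filled 2M region is flagged as underutilized, while one with 300 populated base pages bails out early through the second exit.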
* Re: [PATCH v2 1/4] mm: free zapped tail pages when splitting isolated thp
2024-08-07 13:46 ` [PATCH v2 1/4] mm: free zapped tail pages when splitting isolated thp Usama Arif
@ 2024-08-07 19:45 ` Johannes Weiner
2024-08-08 15:56 ` David Hildenbrand
1 sibling, 0 replies; 13+ messages in thread
From: Johannes Weiner @ 2024-08-07 19:45 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, riel, shakeel.butt, roman.gushchin, yuzhao, david,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Shuang Zhai
On Wed, Aug 07, 2024 at 02:46:46PM +0100, Usama Arif wrote:
> From: Yu Zhao <yuzhao@google.com>
>
> If a tail page has only two references left, one inherited from the
> isolation of its head and the other from lru_add_page_tail() which we
> are about to drop, it means this tail page was concurrently zapped.
> Then we can safely free it and save page reclaim or migration the
> trouble of trying it.
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Tested-by: Shuang Zhai <zhais@google.com>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
> mm/huge_memory.c | 27 +++++++++++++++++++++++++++
> 1 file changed, 27 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 0167dc27e365..35c1089d8d61 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2923,7 +2923,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> unsigned int new_nr = 1 << new_order;
> int order = folio_order(folio);
> unsigned int nr = 1 << order;
> + struct folio_batch free_folios;
>
> + folio_batch_init(&free_folios);
> /* complete memcg works before add pages to LRU */
> split_page_memcg(head, order, new_order);
>
> @@ -3007,6 +3009,26 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> if (subpage == page)
> continue;
> folio_unlock(new_folio);
> + /*
> + * If a folio has only two references left, one inherited
> + * from the isolation of its head and the other from
> + * lru_add_page_tail() which we are about to drop, it means this
> + * folio was concurrently zapped. Then we can safely free it
> + * and save page reclaim or migration the trouble of trying it.
> + */
> + if (list && page_ref_freeze(subpage, 2)) {
folio_ref_freeze(new_folio, 2)?
Otherwise looks good to me
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* Re: [PATCH v2 2/4] mm: remap unused subpages to shared zeropage when splitting isolated thp
2024-08-07 13:46 ` [PATCH v2 2/4] mm: remap unused subpages to shared zeropage " Usama Arif
@ 2024-08-07 20:02 ` Johannes Weiner
2024-08-08 15:53 ` Usama Arif
0 siblings, 1 reply; 13+ messages in thread
From: Johannes Weiner @ 2024-08-07 20:02 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, riel, shakeel.butt, roman.gushchin, yuzhao, david,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Shuang Zhai
On Wed, Aug 07, 2024 at 02:46:47PM +0100, Usama Arif wrote:
> @@ -177,13 +177,56 @@ void putback_movable_pages(struct list_head *l)
> }
> }
>
> +static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
> + struct folio *folio,
> + unsigned long idx)
> +{
> + struct page *page = folio_page(folio, idx);
> + bool contains_data;
> + pte_t newpte;
> + void *addr;
> +
> + VM_BUG_ON_PAGE(PageCompound(page), page);
> + VM_BUG_ON_PAGE(!PageAnon(page), page);
> + VM_BUG_ON_PAGE(!PageLocked(page), page);
> + VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
> +
> + if (PageMlocked(page) || (pvmw->vma->vm_flags & VM_LOCKED))
> + return false;
> +
> + /*
> + * The pmd entry mapping the old thp was flushed and the pte mapping
> + * this subpage has been non present. If the subpage is only zero-filled
> + * then map it to the shared zeropage.
> + */
> + addr = kmap_local_page(page);
> + contains_data = memchr_inv(addr, 0, PAGE_SIZE);
> + kunmap_local(addr);
> +
> + if (contains_data || mm_forbids_zeropage(pvmw->vma->vm_mm))
> + return false;
> +
> + newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
> + pvmw->vma->vm_page_prot));
Why not use my_zero_pfn() here? On many configurations this just
returns zero_pfn and avoids the indirection through mem_map.
> @@ -904,7 +958,7 @@ static int writeout(struct address_space *mapping, struct folio *folio)
> * At this point we know that the migration attempt cannot
> * be successful.
> */
> - remove_migration_ptes(folio, folio, false);
> + remove_migration_ptes(folio, folio, false, false);
bool params are not great for callsite readability.
How about a flags parameter and using names?
enum rmp_flags {
RMP_LOCKED = 1 << 0,
RMP_ZEROPAGES = 1 << 1,
}
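The callsite-readability point can be sketched with a small stand-alone stub; remove_migration_ptes_stub() and last_flags are illustrative stand-ins, with the flag names taken from the suggestion above:

```c
#include <assert.h>

/* Flag names as proposed in the review; values are illustrative. */
enum rmp_flags {
	RMP_LOCKED	= 1 << 0,
	RMP_ZEROPAGES	= 1 << 1,
};

/* Stub standing in for remove_migration_ptes(); it just records the
 * flags so a caller can be checked. The real function takes folios. */
static int last_flags;

static void remove_migration_ptes_stub(int flags)
{
	last_flags = flags;
}
```

At a callsite, remove_migration_ptes_stub(RMP_LOCKED | RMP_ZEROPAGES) is self-describing in a way that a trailing (..., true, false) is not, which is the readability argument being made.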
* Re: [PATCH v2 2/4] mm: remap unused subpages to shared zeropage when splitting isolated thp
2024-08-07 20:02 ` Johannes Weiner
@ 2024-08-08 15:53 ` Usama Arif
0 siblings, 0 replies; 13+ messages in thread
From: Usama Arif @ 2024-08-08 15:53 UTC (permalink / raw)
To: Johannes Weiner
Cc: akpm, linux-mm, riel, shakeel.butt, roman.gushchin, yuzhao, david,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Shuang Zhai
On 07/08/2024 21:02, Johannes Weiner wrote:
> On Wed, Aug 07, 2024 at 02:46:47PM +0100, Usama Arif wrote:
>> @@ -177,13 +177,56 @@ void putback_movable_pages(struct list_head *l)
>> }
>> }
>>
>> +static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
>> + struct folio *folio,
>> + unsigned long idx)
>> +{
>> + struct page *page = folio_page(folio, idx);
>> + bool contains_data;
>> + pte_t newpte;
>> + void *addr;
>> +
>> + VM_BUG_ON_PAGE(PageCompound(page), page);
>> + VM_BUG_ON_PAGE(!PageAnon(page), page);
>> + VM_BUG_ON_PAGE(!PageLocked(page), page);
>> + VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
>> +
>> + if (PageMlocked(page) || (pvmw->vma->vm_flags & VM_LOCKED))
>> + return false;
>> +
>> + /*
>> + * The pmd entry mapping the old thp was flushed and the pte mapping
>> + * this subpage has been non present. If the subpage is only zero-filled
>> + * then map it to the shared zeropage.
>> + */
>> + addr = kmap_local_page(page);
>> + contains_data = memchr_inv(addr, 0, PAGE_SIZE);
>> + kunmap_local(addr);
>> +
>> + if (contains_data || mm_forbids_zeropage(pvmw->vma->vm_mm))
>> + return false;
>> +
>> + newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
>> + pvmw->vma->vm_page_prot));
>
> Why not use my_zero_pfn() here? On many configurations this just
> returns zero_pfn and avoids the indirection through mem_map.
>
>> @@ -904,7 +958,7 @@ static int writeout(struct address_space *mapping, struct folio *folio)
>> * At this point we know that the migration attempt cannot
>> * be successful.
>> */
>> - remove_migration_ptes(folio, folio, false);
>> + remove_migration_ptes(folio, folio, false, false);
>
> bool params are not great for callsite readability.
>
> How about a flags parameter and using names?
>
> enum rmp_flags {
> RMP_LOCKED = 1 << 0,
> RMP_ZEROPAGES = 1 << 1,
> }
Thanks! Will include both of the above changes in the next revision.
* Re: [PATCH v2 4/4] mm: split underutilized THPs
2024-08-07 13:46 ` [PATCH v2 4/4] mm: split underutilized THPs Usama Arif
@ 2024-08-08 15:55 ` David Hildenbrand
2024-08-09 10:31 ` Usama Arif
0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2024-08-08 15:55 UTC (permalink / raw)
To: Usama Arif, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 07.08.24 15:46, Usama Arif wrote:
> This is an attempt to mitigate the issue of running out of memory when THP
> is always enabled. During runtime whenever a THP is being faulted in
> (__do_huge_pmd_anonymous_page) or collapsed by khugepaged
> (collapse_huge_page), the THP is added to _deferred_list. Whenever memory
> reclaim happens in linux, the kernel runs the deferred_split
> shrinker which goes through the _deferred_list.
>
> If the folio was partially mapped, the shrinker attempts to split it.
> A new boolean is added to be able to distinguish between partially
> mapped folios and others in the deferred_list at split time in
> deferred_split_scan. Its needed as __folio_remove_rmap decrements
> the folio mapcount elements, hence it won't be possible to distinguish
> between partially mapped folios and others in deferred_split_scan
> without the boolean.
Just so I get this right: Are you saying that we might now add fully
mapped folios to the deferred split queue and that's what you want to
distinguish?
If that's the case, then could we use a bit in folio->_flags_1 instead?
Further, I think you forgot to update at least one instance if a
list_empty(&folio->_deferred_list) check where we want to detect
"partially mapped". Please go over all and see what needs adjustments.
I would actually suggest to split decoupling of "_deferred_list" and
"partially mapped" into a separate preparation patch.
--
Cheers,
David / dhildenb
* Re: [PATCH v2 1/4] mm: free zapped tail pages when splitting isolated thp
2024-08-07 13:46 ` [PATCH v2 1/4] mm: free zapped tail pages when splitting isolated thp Usama Arif
2024-08-07 19:45 ` Johannes Weiner
@ 2024-08-08 15:56 ` David Hildenbrand
1 sibling, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2024-08-08 15:56 UTC (permalink / raw)
To: Usama Arif, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Shuang Zhai
On 07.08.24 15:46, Usama Arif wrote:
> From: Yu Zhao <yuzhao@google.com>
>
> If a tail page has only two references left, one inherited from the
> isolation of its head and the other from lru_add_page_tail() which we
> are about to drop, it means this tail page was concurrently zapped.
> Then we can safely free it and save page reclaim or migration the
> trouble of trying it.
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Tested-by: Shuang Zhai <zhais@google.com>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
> mm/huge_memory.c | 27 +++++++++++++++++++++++++++
> 1 file changed, 27 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 0167dc27e365..35c1089d8d61 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2923,7 +2923,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> unsigned int new_nr = 1 << new_order;
> int order = folio_order(folio);
> unsigned int nr = 1 << order;
> + struct folio_batch free_folios;
>
> + folio_batch_init(&free_folios);
> /* complete memcg works before add pages to LRU */
> split_page_memcg(head, order, new_order);
>
> @@ -3007,6 +3009,26 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> if (subpage == page)
> continue;
> folio_unlock(new_folio);
> + /*
> + * If a folio has only two references left, one inherited
> + * from the isolation of its head and the other from
> + * lru_add_page_tail() which we are about to drop, it means this
> + * folio was concurrently zapped. Then we can safely free it
> + * and save page reclaim or migration the trouble of trying it.
> + */
> + if (list && page_ref_freeze(subpage, 2)) {
> + VM_WARN_ON_ONCE_FOLIO(folio_test_lru(new_folio), new_folio);
> + VM_WARN_ON_ONCE_FOLIO(folio_test_large(new_folio), new_folio);
> + VM_WARN_ON_ONCE_FOLIO(folio_mapped(new_folio), new_folio);
> +
> + folio_clear_active(new_folio);
> + folio_clear_unevictable(new_folio);
> + if (folio_batch_add(&free_folios, folio) == 0) {
nit: "!folio_batch_add(&free_folios, folio)"
Nothing else jumped at me.
--
Cheers,
David / dhildenb
* Re: [PATCH v2 4/4] mm: split underutilized THPs
2024-08-08 15:55 ` David Hildenbrand
@ 2024-08-09 10:31 ` Usama Arif
2024-08-09 13:21 ` David Hildenbrand
0 siblings, 1 reply; 13+ messages in thread
From: Usama Arif @ 2024-08-09 10:31 UTC (permalink / raw)
To: David Hildenbrand, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 08/08/2024 16:55, David Hildenbrand wrote:
> On 07.08.24 15:46, Usama Arif wrote:
>> This is an attempt to mitigate the issue of running out of memory when THP
>> is always enabled. During runtime whenever a THP is being faulted in
>> (__do_huge_pmd_anonymous_page) or collapsed by khugepaged
>> (collapse_huge_page), the THP is added to _deferred_list. Whenever memory
>> reclaim happens in linux, the kernel runs the deferred_split
>> shrinker which goes through the _deferred_list.
>>
>> If the folio was partially mapped, the shrinker attempts to split it.
>> A new boolean is added to be able to distinguish between partially
>> mapped folios and others in the deferred_list at split time in
>> deferred_split_scan. Its needed as __folio_remove_rmap decrements
>> the folio mapcount elements, hence it won't be possible to distinguish
>> between partially mapped folios and others in deferred_split_scan
>> without the boolean.
>
> Just so I get this right: Are you saying that we might now add fully mapped folios to the deferred split queue and that's what you want to distinguish?
Yes
>
> If that's the case, then could we use a bit in folio->_flags_1 instead?
Yes, that's a good idea. Will create the below flag for the next revision:
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5769fe6e4950..5825bd1cf6db 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -189,6 +189,11 @@ enum pageflags {
#define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
+enum folioflags_1 {
+ /* The first 8 bits of folio->_flags_1 are used to keep track of folio order */
+ FOLIO_PARTIALLY_MAPPED = 8, /* folio is partially mapped */
+}
+
#ifndef __GENERATING_BOUNDS_H
and use set/clear/test_bit(FOLIO_PARTIALLY_MAPPED, &folio->_flags_1) in the respective places.
>
> Further, I think you forgot to update at least one instance if a list_empty(&folio->_deferred_list) check where we want to detect "partially mapped". Please go over all and see what needs adjustments.
>
Ah, I think it's the one in free_tail_page_prepare? The way I wrote this patch was by going through all instances of "folio->_deferred_list" and deciding if partially_mapped needs to be set/cleared/tested. I think I missed it when rebasing to mm-unstable. Double-checked now and the only one missing is free_tail_page_prepare ([1] was removed recently by Barry).
[1] https://lore.kernel.org/lkml/20240629234155.53524-1-21cnbao@gmail.com/
Will include the below diff in the next revision.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aae00ba3b3bd..b4e1393cbd4f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -957,8 +957,9 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
break;
case 2:
/* the second tail page: deferred_list overlaps ->mapping */
- if (unlikely(!list_empty(&folio->_deferred_list))) {
- bad_page(page, "on deferred list");
+ if (unlikely(!list_empty(&folio->_deferred_list) &&
+ test_bit(FOLIO_PARTIALLY_MAPPED, &folio->_flags_1))) {
+ bad_page(page, "partially mapped folio on deferred list");
goto out;
}
break;
> I would actually suggest to split decoupling of "_deferred_list" and "partially mapped" into a separate preparation patch.
>
Yes, will do. I will split it into 3 patches, 1st one that introduces FOLIO_PARTIALLY_MAPPED and sets/clear it in the right place without introducing any functional change, 2nd to split underutilized THPs and 3rd to add sysfs entry to enable/disable the shrinker. Should make the patches quite small and easy to review.
* Re: [PATCH v2 4/4] mm: split underutilized THPs
2024-08-09 10:31 ` Usama Arif
@ 2024-08-09 13:21 ` David Hildenbrand
2024-08-09 14:25 ` Usama Arif
0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2024-08-09 13:21 UTC (permalink / raw)
To: Usama Arif, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 09.08.24 12:31, Usama Arif wrote:
>
>
> On 08/08/2024 16:55, David Hildenbrand wrote:
>> On 07.08.24 15:46, Usama Arif wrote:
>>> This is an attempt to mitigate the issue of running out of memory when THP
>>> is always enabled. During runtime whenever a THP is being faulted in
>>> (__do_huge_pmd_anonymous_page) or collapsed by khugepaged
>>> (collapse_huge_page), the THP is added to _deferred_list. Whenever memory
>>> reclaim happens in linux, the kernel runs the deferred_split
>>> shrinker which goes through the _deferred_list.
>>>
>>> If the folio was partially mapped, the shrinker attempts to split it.
>>> A new boolean is added to be able to distinguish between partially
>>> mapped folios and others in the deferred_list at split time in
>>> deferred_split_scan. Its needed as __folio_remove_rmap decrements
>>> the folio mapcount elements, hence it won't be possible to distinguish
>>> between partially mapped folios and others in deferred_split_scan
>>> without the boolean.
>>
>> Just so I get this right: Are you saying that we might now add fully mapped folios to the deferred split queue and that's what you want to distinguish?
>
> Yes
>
>>
>> If that's the case, then could we use a bit in folio->_flags_1 instead?
> Yes, thats a good idea. Will create the below flag for the next revision
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5769fe6e4950..5825bd1cf6db 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -189,6 +189,11 @@ enum pageflags {
>
> #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
>
> +enum folioflags_1 {
> + /* The first 8 bits of folio->_flags_1 are used to keep track of folio order */
> + FOLIO_PARTIALLY_MAPPED = 8, /* folio is partially mapped */
> +}
This might be what you want to achieve:
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a0a29bd092f8..d4722ed60ef8 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -182,6 +182,7 @@ enum pageflags {
/* At least one page in this folio has the hwpoison flag set */
PG_has_hwpoisoned = PG_active,
PG_large_rmappable = PG_workingset, /* anon or file-backed */
+ PG_partially_mapped, /* was identified to be partially mapped */
};
#define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
@@ -861,8 +862,9 @@ static inline void ClearPageCompound(struct page *page)
ClearPageHead(page);
}
FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
+FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
#else
-FOLIO_FLAG_FALSE(large_rmappable)
+FOLIO_FLAG_FALSE(partially_mapped)
#endif
#define PG_head_mask ((1UL << PG_head))
The downside is an atomic op to set/clear, but it should likely not really matter
(initially, the flag will be clear, and we should only ever set it once when
partially unmapping). If it hurts, we can reconsider.
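The set/clear/test pattern referred to here can be approximated in userspace with C11 atomics; the bit position and helper names below are illustrative stand-ins, not the kernel's set_bit()/test_bit() implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in bit position for PG_partially_mapped in the flags word of
 * the folio's second page (FOLIO_SECOND_PAGE); value is arbitrary. */
#define PG_PARTIALLY_MAPPED 20

/* Atomic read-modify-write set, as the kernel's set_bit() does. */
static void set_bit_atomic(int nr, _Atomic unsigned long *flags)
{
	atomic_fetch_or(flags, 1UL << nr);
}

static void clear_bit_atomic(int nr, _Atomic unsigned long *flags)
{
	atomic_fetch_and(flags, ~(1UL << nr));
}

static bool test_bit_atomic(int nr, _Atomic unsigned long *flags)
{
	return atomic_load(flags) & (1UL << nr);
}
```

The cost being discussed is the read-modify-write in set/clear; since the flag is set at most once per folio (when it becomes partially mapped), that cost is off the hot path.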
[...]
>> I would actually suggest to split decoupling of "_deferred_list" and "partially mapped" into a separate preparation patch.
>>
> Yes, will do. I will split it into 3 patches, 1st one that introduces FOLIO_PARTIALLY_MAPPED and sets/clear it in the right place without introducing any functional change, 2nd to split underutilized THPs and 3rd to add sysfs entry to enable/disable the shrinker. Should make the patches quite small and easy to review.
Great! As always, please shout if you disagree with something I propose :)
--
Cheers,
David / dhildenb
* Re: [PATCH v2 4/4] mm: split underutilized THPs
2024-08-09 13:21 ` David Hildenbrand
@ 2024-08-09 14:25 ` Usama Arif
0 siblings, 0 replies; 13+ messages in thread
From: Usama Arif @ 2024-08-09 14:25 UTC (permalink / raw)
To: David Hildenbrand, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 09/08/2024 14:21, David Hildenbrand wrote:
> On 09.08.24 12:31, Usama Arif wrote:
>>
>>
>> On 08/08/2024 16:55, David Hildenbrand wrote:
>>> On 07.08.24 15:46, Usama Arif wrote:
>>>> This is an attempt to mitigate the issue of running out of memory when THP
>>>> is always enabled. During runtime whenever a THP is being faulted in
>>>> (__do_huge_pmd_anonymous_page) or collapsed by khugepaged
>>>> (collapse_huge_page), the THP is added to _deferred_list. Whenever memory
>>>> reclaim happens in linux, the kernel runs the deferred_split
>>>> shrinker which goes through the _deferred_list.
>>>>
>>>> If the folio was partially mapped, the shrinker attempts to split it.
>>>> A new boolean is added to be able to distinguish between partially
>>>> mapped folios and others in the deferred_list at split time in
>>>> deferred_split_scan. Its needed as __folio_remove_rmap decrements
>>>> the folio mapcount elements, hence it won't be possible to distinguish
>>>> between partially mapped folios and others in deferred_split_scan
>>>> without the boolean.
>>>
>>> Just so I get this right: Are you saying that we might now add fully mapped folios to the deferred split queue and that's what you want to distinguish?
>>
>> Yes
>>
>>>
>>> If that's the case, then could we use a bit in folio->_flags_1 instead?
>> Yes, thats a good idea. Will create the below flag for the next revision
>>
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index 5769fe6e4950..5825bd1cf6db 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -189,6 +189,11 @@ enum pageflags {
>> #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
>> +enum folioflags_1 {
>> + /* The first 8 bits of folio->_flags_1 are used to keep track of folio order */
>> + FOLIO_PARTIALLY_MAPPED = 8, /* folio is partially mapped */
>> +}
>
> This might be what you want to achieve:
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index a0a29bd092f8..d4722ed60ef8 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -182,6 +182,7 @@ enum pageflags {
> /* At least one page in this folio has the hwpoison flag set */
> PG_has_hwpoisoned = PG_active,
> PG_large_rmappable = PG_workingset, /* anon or file-backed */
> + PG_partially_mapped, /* was identified to be partially mapped */
> };
>
> #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
> @@ -861,8 +862,9 @@ static inline void ClearPageCompound(struct page *page)
> ClearPageHead(page);
> }
> FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
> +FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
> #else
> -FOLIO_FLAG_FALSE(large_rmappable)
> +FOLIO_FLAG_FALSE(partially_mapped)
> #endif
>
> #define PG_head_mask ((1UL << PG_head))
>
> The downside is an atomic op to set/clear, but it should likely not really matter
> (initially, the flag will be clear, and we should only ever set it once when
> partially unmapping). If it hurts, we can reconsider.
>
> [...]
I was looking for where the bits for flags_1 were specified! I just saw the start of enum pageflags, saw that compound order isn't specified anywhere over there and ignored the end :)
Yes, this is what I wanted to do. Thanks.