* [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse
@ 2024-12-16 16:50 Dev Jain
2024-12-16 16:50 ` [RFC PATCH 01/12] khugepaged: Rename hpage_collapse_scan_pmd() -> ptes() Dev Jain
` (12 more replies)
0 siblings, 13 replies; 74+ messages in thread
From: Dev Jain @ 2024-12-16 16:50 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel, Dev Jain
This patchset extends khugepaged from collapsing only PMD-sized THPs to
collapsing anonymous mTHPs.
mTHPs were introduced to improve memory management by allocating larger chunks
of memory, so as to reduce the number of page faults and TLB misses (via TLB
coalescing), shorten LRU lists, etc. However, the mTHP property is often lost
due to CoW, swap-in/out, or simply because the kernel cannot find enough
physically contiguous memory at fault time. Hence, there is a need to regain
mTHPs in the system asynchronously. This work is an attempt in that direction,
starting with anonymous folios.
The fault handler selects the THP order in a greedy manner; the same approach
is used here, along with the same sysfs interface to control the orders
eligible for collapse. In contrast to PMD collapse, we (hopefully) get rid of
the mmap_write_lock().
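As a rough illustration (this sketch is not part of the diffs; it only uses the
existing helpers from huge_mm.h that patch 07 also uses), the greedy selection
amounts to:

	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
					  TVA_IN_PF | TVA_ENFORCE_SYSFS,
					  BIT(PMD_ORDER + 1) - 1);
	orders = thp_vma_suitable_orders(vma, address, orders);
	for (order = highest_order(orders); orders;
	     order = next_order(&orders, order)) {
		/* try to collapse at 'order'; on failure, fall through
		 * to the next enabled order */
	}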
---------------------------------------------------------
Testing
---------------------------------------------------------
The series has been build-tested on x86_64.
For arm64:
1. mm-selftests: no regressions.
2. Analyzed, with tools/mm/thpmaps, various userspace programs that map large,
aligned VMAs, fault them in as base pages/mTHPs (according to sysfs), and then
madvise() the VMA: khugepaged is able to collapse 100% of the VMAs.
This patchset is rebased on mm-unstable (e7e89af21ffcfd1077ca6d2188de6497db1ad84c).
Some points to be noted:
1. Some stats, like pages_collapsed for khugepaged, have not been extended for
mTHP. I'd welcome suggestions on updating them, or on additions to the sysfs
interface.
2. Please see patch 9 for lock handling.
Dev Jain (12):
khugepaged: Rename hpage_collapse_scan_pmd() -> ptes()
khugepaged: Generalize alloc_charge_folio()
khugepaged: Generalize hugepage_vma_revalidate()
khugepaged: Generalize __collapse_huge_page_swapin()
khugepaged: Generalize __collapse_huge_page_isolate()
khugepaged: Generalize __collapse_huge_page_copy_failed()
khugepaged: Scan PTEs order-wise
khugepaged: Abstract PMD-THP collapse
khugepaged: Introduce vma_collapse_anon_folio()
khugepaged: Skip PTE range if a larger mTHP is already mapped
khugepaged: Enable sysfs to control order of collapse
selftests/mm: khugepaged: Enlighten for mTHP collapse
include/linux/huge_mm.h | 2 +
mm/huge_memory.c | 4 +
mm/khugepaged.c | 445 +++++++++++++++++-------
tools/testing/selftests/mm/khugepaged.c | 5 +-
4 files changed, 319 insertions(+), 137 deletions(-)
--
2.30.2
* [RFC PATCH 01/12] khugepaged: Rename hpage_collapse_scan_pmd() -> ptes()
2024-12-16 16:50 [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse Dev Jain
@ 2024-12-16 16:50 ` Dev Jain
2024-12-17 4:18 ` Matthew Wilcox
2024-12-16 16:50 ` [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio() Dev Jain
` (11 subsequent siblings)
12 siblings, 1 reply; 74+ messages in thread
From: Dev Jain @ 2024-12-16 16:50 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel, Dev Jain
Rename prior to generalizing the collapse function.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 99dc995aac11..95643e6e5f31 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -605,7 +605,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
folio = page_folio(page);
VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
- /* See hpage_collapse_scan_pmd(). */
+ /* See hpage_collapse_scan_ptes(). */
if (folio_likely_mapped_shared(folio)) {
++shared;
if (cc->is_khugepaged &&
@@ -991,7 +991,7 @@ static int check_pmd_still_valid(struct mm_struct *mm,
/*
* Bring missing pages in from swap, to complete THP collapse.
- * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
+ * Only done if hpage_collapse_scan_ptes believes it is worthwhile.
*
* Called and returns without pte mapped or spinlocks held.
* Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
@@ -1263,7 +1263,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
return result;
}
-static int hpage_collapse_scan_pmd(struct mm_struct *mm,
+static int hpage_collapse_scan_ptes(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address, bool *mmap_locked,
struct collapse_control *cc)
@@ -2457,7 +2457,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
mmap_read_unlock(mm);
}
} else {
- *result = hpage_collapse_scan_pmd(mm, vma,
+ *result = hpage_collapse_scan_ptes(mm, vma,
khugepaged_scan.address, &mmap_locked, cc);
}
@@ -2792,7 +2792,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
cc);
fput(file);
} else {
- result = hpage_collapse_scan_pmd(mm, vma, addr,
+ result = hpage_collapse_scan_ptes(mm, vma, addr,
&mmap_locked, cc);
}
if (!mmap_locked)
--
2.30.2
* [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio()
2024-12-16 16:50 [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse Dev Jain
2024-12-16 16:50 ` [RFC PATCH 01/12] khugepaged: Rename hpage_collapse_scan_pmd() -> ptes() Dev Jain
@ 2024-12-16 16:50 ` Dev Jain
2024-12-17 2:51 ` Baolin Wang
` (2 more replies)
2024-12-16 16:50 ` [RFC PATCH 03/12] khugepaged: Generalize hugepage_vma_revalidate() Dev Jain
` (10 subsequent siblings)
12 siblings, 3 replies; 74+ messages in thread
From: Dev Jain @ 2024-12-16 16:50 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel, Dev Jain
Pass order to alloc_charge_folio() and update mTHP statistics.
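(Assuming these counters are wired up like the other per-size anon stats, which
the hunk to anon_stats_attrs below suggests, they become readable from the
per-size stats directories, e.g.
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/anon_collapse_alloc
and anon_collapse_alloc_failed.)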
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
include/linux/huge_mm.h | 2 ++
mm/huge_memory.c | 4 ++++
mm/khugepaged.c | 13 +++++++++----
3 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 93e509b6c00e..8b6d0fed99b3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -119,6 +119,8 @@ enum mthp_stat_item {
MTHP_STAT_ANON_FAULT_ALLOC,
MTHP_STAT_ANON_FAULT_FALLBACK,
MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+ MTHP_STAT_ANON_COLLAPSE_ALLOC,
+ MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED,
MTHP_STAT_ZSWPOUT,
MTHP_STAT_SWPIN,
MTHP_STAT_SWPIN_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2da5520bfe24..2e582fad4c77 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -615,6 +615,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(anon_collapse_alloc, MTHP_STAT_ANON_COLLAPSE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(anon_collapse_alloc_failed, MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED);
DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
@@ -636,6 +638,8 @@ static struct attribute *anon_stats_attrs[] = {
&anon_fault_alloc_attr.attr,
&anon_fault_fallback_attr.attr,
&anon_fault_fallback_charge_attr.attr,
+ &anon_collapse_alloc_attr.attr,
+ &anon_collapse_alloc_failed_attr.attr,
#ifndef CONFIG_SHMEM
&zswpout_attr.attr,
&swpin_attr.attr,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 95643e6e5f31..02cd424b8e48 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1073,21 +1073,26 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
}
static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
- struct collapse_control *cc)
+ int order, struct collapse_control *cc)
{
gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
GFP_TRANSHUGE);
int node = hpage_collapse_find_target_node(cc);
struct folio *folio;
- folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
+ folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
if (!folio) {
*foliop = NULL;
count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+ if (order != HPAGE_PMD_ORDER)
+ count_mthp_stat(order, MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED);
return SCAN_ALLOC_HUGE_PAGE_FAIL;
}
count_vm_event(THP_COLLAPSE_ALLOC);
+ if (order != HPAGE_PMD_ORDER)
+ count_mthp_stat(order, MTHP_STAT_ANON_COLLAPSE_ALLOC);
+
if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
folio_put(folio);
*foliop = NULL;
@@ -1124,7 +1129,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
*/
mmap_read_unlock(mm);
- result = alloc_charge_folio(&folio, mm, cc);
+ result = alloc_charge_folio(&folio, mm, order, cc);
if (result != SCAN_SUCCEED)
goto out_nolock;
@@ -1850,7 +1855,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
- result = alloc_charge_folio(&new_folio, mm, cc);
+ result = alloc_charge_folio(&new_folio, mm, HPAGE_PMD_ORDER, cc);
if (result != SCAN_SUCCEED)
goto out;
--
2.30.2
* [RFC PATCH 03/12] khugepaged: Generalize hugepage_vma_revalidate()
2024-12-16 16:50 [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse Dev Jain
2024-12-16 16:50 ` [RFC PATCH 01/12] khugepaged: Rename hpage_collapse_scan_pmd() -> ptes() Dev Jain
2024-12-16 16:50 ` [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio() Dev Jain
@ 2024-12-16 16:50 ` Dev Jain
2024-12-17 4:21 ` Matthew Wilcox
2024-12-17 16:58 ` Ryan Roberts
2024-12-16 16:50 ` [RFC PATCH 04/12] khugepaged: Generalize __collapse_huge_page_swapin() Dev Jain
` (9 subsequent siblings)
12 siblings, 2 replies; 74+ messages in thread
From: Dev Jain @ 2024-12-16 16:50 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel, Dev Jain
After retaking the lock, we must check that the VMA is still suitable for our
scan order. Hence, generalize hugepage_vma_revalidate().
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 02cd424b8e48..2f0601795471 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -918,7 +918,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
bool expect_anon,
- struct vm_area_struct **vmap,
+ struct vm_area_struct **vmap, int order,
struct collapse_control *cc)
{
struct vm_area_struct *vma;
@@ -931,9 +931,9 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
if (!vma)
return SCAN_VMA_NULL;
- if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
+ if (!thp_vma_suitable_order(vma, address, order))
return SCAN_ADDRESS_RANGE;
- if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
+ if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, order))
return SCAN_VMA_CHECK;
/*
* Anon VMA expected, the address may be unmapped then
@@ -1134,7 +1134,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
goto out_nolock;
mmap_read_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+ result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
if (result != SCAN_SUCCEED) {
mmap_read_unlock(mm);
goto out_nolock;
@@ -1168,7 +1168,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* mmap_lock.
*/
mmap_write_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+ result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
if (result != SCAN_SUCCEED)
goto out_up_write;
/* check if the pmd is still valid */
@@ -2776,7 +2776,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
mmap_read_lock(mm);
mmap_locked = true;
result = hugepage_vma_revalidate(mm, addr, false, &vma,
- cc);
+ HPAGE_PMD_ORDER, cc);
if (result != SCAN_SUCCEED) {
last_fail = result;
goto out_nolock;
--
2.30.2
* [RFC PATCH 04/12] khugepaged: Generalize __collapse_huge_page_swapin()
2024-12-16 16:50 [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse Dev Jain
` (2 preceding siblings ...)
2024-12-16 16:50 ` [RFC PATCH 03/12] khugepaged: Generalize hugepage_vma_revalidate() Dev Jain
@ 2024-12-16 16:50 ` Dev Jain
2024-12-17 4:24 ` Matthew Wilcox
2024-12-16 16:50 ` [RFC PATCH 05/12] khugepaged: Generalize __collapse_huge_page_isolate() Dev Jain
` (8 subsequent siblings)
12 siblings, 1 reply; 74+ messages in thread
From: Dev Jain @ 2024-12-16 16:50 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel, Dev Jain
If any PTE in our scan range is a swap entry, use do_swap_page() to swap in
the corresponding folio.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2f0601795471..f52dae7d5179 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -998,17 +998,17 @@ static int check_pmd_still_valid(struct mm_struct *mm,
*/
static int __collapse_huge_page_swapin(struct mm_struct *mm,
struct vm_area_struct *vma,
- unsigned long haddr, pmd_t *pmd,
- int referenced)
+ unsigned long addr, pmd_t *pmd,
+ int referenced, int order)
{
int swapped_in = 0;
vm_fault_t ret = 0;
- unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
+ unsigned long address, end = addr + ((1UL << order) * PAGE_SIZE);
int result;
pte_t *pte = NULL;
spinlock_t *ptl;
- for (address = haddr; address < end; address += PAGE_SIZE) {
+ for (address = addr; address < end; address += PAGE_SIZE) {
struct vm_fault vmf = {
.vma = vma,
.address = address,
@@ -1153,7 +1153,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* that case. Continuing to collapse causes inconsistency.
*/
result = __collapse_huge_page_swapin(mm, vma, address, pmd,
- referenced);
+ referenced, order);
if (result != SCAN_SUCCEED)
goto out_nolock;
}
--
2.30.2
* [RFC PATCH 05/12] khugepaged: Generalize __collapse_huge_page_isolate()
2024-12-16 16:50 [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse Dev Jain
` (3 preceding siblings ...)
2024-12-16 16:50 ` [RFC PATCH 04/12] khugepaged: Generalize __collapse_huge_page_swapin() Dev Jain
@ 2024-12-16 16:50 ` Dev Jain
2024-12-17 4:32 ` Matthew Wilcox
2024-12-17 17:09 ` Ryan Roberts
2024-12-16 16:50 ` [RFC PATCH 06/12] khugepaged: Generalize __collapse_huge_page_copy_failed() Dev Jain
` (7 subsequent siblings)
12 siblings, 2 replies; 74+ messages in thread
From: Dev Jain @ 2024-12-16 16:50 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel, Dev Jain
Scale down the scan range and the sysfs tunables according to the scan order,
and isolate the folios.
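As a worked example (assuming 4K base pages, so HPAGE_PMD_ORDER = 9, and the
default tunables max_ptes_none = 511 and max_ptes_shared = 256), an order-4
scan uses:

	max_ptes_none   = 511 >> (9 - 4) = 15
	max_ptes_shared = 256 >> (9 - 4) = 8

i.e. the limits shrink in proportion to the 16 PTEs covered by the candidate
order.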
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f52dae7d5179..de044b1f83d4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -564,15 +564,18 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
unsigned long address,
pte_t *pte,
struct collapse_control *cc,
- struct list_head *compound_pagelist)
+ struct list_head *compound_pagelist, int order)
{
- struct page *page = NULL;
- struct folio *folio = NULL;
- pte_t *_pte;
+ unsigned int max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
+ unsigned int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
+ struct folio *folio = NULL;
+ struct page *page = NULL;
bool writable = false;
+ pte_t *_pte;
- for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+
+ for (_pte = pte; _pte < pte + (1UL << order);
_pte++, address += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
if (pte_none(pteval) || (pte_present(pteval) &&
@@ -580,7 +583,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
++none_or_zero;
if (!userfaultfd_armed(vma) &&
(!cc->is_khugepaged ||
- none_or_zero <= khugepaged_max_ptes_none)) {
+ none_or_zero <= max_ptes_none)) {
continue;
} else {
result = SCAN_EXCEED_NONE_PTE;
@@ -609,7 +612,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
if (folio_likely_mapped_shared(folio)) {
++shared;
if (cc->is_khugepaged &&
- shared > khugepaged_max_ptes_shared) {
+ shared > max_ptes_shared) {
result = SCAN_EXCEED_SHARED_PTE;
count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
goto out;
@@ -1200,7 +1203,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
if (pte) {
result = __collapse_huge_page_isolate(vma, address, pte, cc,
- &compound_pagelist);
+ &compound_pagelist, order);
spin_unlock(pte_ptl);
} else {
result = SCAN_PMD_NULL;
--
2.30.2
* [RFC PATCH 06/12] khugepaged: Generalize __collapse_huge_page_copy_failed()
2024-12-16 16:50 [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse Dev Jain
` (4 preceding siblings ...)
2024-12-16 16:50 ` [RFC PATCH 05/12] khugepaged: Generalize __collapse_huge_page_isolate() Dev Jain
@ 2024-12-16 16:50 ` Dev Jain
2024-12-17 17:22 ` Ryan Roberts
2024-12-16 16:51 ` [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise Dev Jain
` (6 subsequent siblings)
12 siblings, 1 reply; 74+ messages in thread
From: Dev Jain @ 2024-12-16 16:50 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel, Dev Jain
Upon failure, we repopulate the PMD in the case of PMD-THP collapse. Hence,
make this logic specific to the PMD case.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index de044b1f83d4..886c76816963 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -766,7 +766,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
pmd_t *pmd,
pmd_t orig_pmd,
struct vm_area_struct *vma,
- struct list_head *compound_pagelist)
+ struct list_head *compound_pagelist, int order)
{
spinlock_t *pmd_ptl;
@@ -776,14 +776,16 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
* pages. Since pages are still isolated and locked here,
* acquiring anon_vma_lock_write is unnecessary.
*/
- pmd_ptl = pmd_lock(vma->vm_mm, pmd);
- pmd_populate(vma->vm_mm, pmd, pmd_pgtable(orig_pmd));
- spin_unlock(pmd_ptl);
+ if (order == HPAGE_PMD_ORDER) {
+ pmd_ptl = pmd_lock(vma->vm_mm, pmd);
+ pmd_populate(vma->vm_mm, pmd, pmd_pgtable(orig_pmd));
+ spin_unlock(pmd_ptl);
+ }
/*
* Release both raw and compound pages isolated
* in __collapse_huge_page_isolate.
*/
- release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
+ release_pte_pages(pte, pte + (1UL << order), compound_pagelist);
}
/*
@@ -834,7 +836,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
compound_pagelist);
else
__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
- compound_pagelist);
+ compound_pagelist, order);
return result;
}
--
2.30.2
* [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise
2024-12-16 16:50 [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse Dev Jain
` (5 preceding siblings ...)
2024-12-16 16:50 ` [RFC PATCH 06/12] khugepaged: Generalize __collapse_huge_page_copy_failed() Dev Jain
@ 2024-12-16 16:51 ` Dev Jain
2024-12-17 18:15 ` Ryan Roberts
2025-01-06 10:04 ` Usama Arif
2024-12-16 16:51 ` [RFC PATCH 08/12] khugepaged: Abstract PMD-THP collapse Dev Jain
` (5 subsequent siblings)
12 siblings, 2 replies; 74+ messages in thread
From: Dev Jain @ 2024-12-16 16:51 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel, Dev Jain
Scan the PTEs order-wise, using the mask of suitable orders for this VMA
derived in conjunction with the sysfs THP settings. Scale down the tunables
accordingly; on collapse failure, drop down to the next enabled order.
Otherwise, try to jump to the highest possible order and start a fresh scan
from there. Note that madvise(MADV_COLLAPSE) has not been generalized.
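The re-selection step after a successful collapse mirrors the hunk below
(count_trailing_zeros() comes from linux/count_zeros.h):

	/* largest order the next address is naturally aligned to... */
	order = count_trailing_zeros(address >> PAGE_SHIFT);
	/* ...dropped to the next enabled order if that one isn't */
	if (!(orders & (1UL << order)))
		order = next_order(&orders, order);

For example (assuming 4K pages), a 64K-aligned next address allows retrying at
(at least) order 4 even if the previous chunk only collapsed at order 2.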
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 84 ++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 69 insertions(+), 15 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 886c76816963..078794aa3335 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -20,6 +20,7 @@
#include <linux/swapops.h>
#include <linux/shmem_fs.h>
#include <linux/ksm.h>
+#include <linux/count_zeros.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -1111,7 +1112,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
}
static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
- int referenced, int unmapped,
+ int referenced, int unmapped, int order,
struct collapse_control *cc)
{
LIST_HEAD(compound_pagelist);
@@ -1278,38 +1279,59 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
unsigned long address, bool *mmap_locked,
struct collapse_control *cc)
{
- pmd_t *pmd;
- pte_t *pte, *_pte;
- int result = SCAN_FAIL, referenced = 0;
- int none_or_zero = 0, shared = 0;
- struct page *page = NULL;
+ unsigned int max_ptes_shared, max_ptes_none, max_ptes_swap;
+ int referenced, shared, none_or_zero, unmapped;
+ unsigned long _address, org_address = address;
struct folio *folio = NULL;
- unsigned long _address;
- spinlock_t *ptl;
- int node = NUMA_NO_NODE, unmapped = 0;
+ struct page *page = NULL;
+ int node = NUMA_NO_NODE;
+ int result = SCAN_FAIL;
bool writable = false;
+ unsigned long orders;
+ pte_t *pte, *_pte;
+ spinlock_t *ptl;
+ pmd_t *pmd;
+ int order;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+ orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+ TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER + 1) - 1);
+ orders = thp_vma_suitable_orders(vma, address, orders);
+ order = highest_order(orders);
+
+ /* MADV_COLLAPSE needs to work irrespective of sysfs setting */
+ if (!cc->is_khugepaged)
+ order = HPAGE_PMD_ORDER;
+
+scan_pte_range:
+
+ max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
+ max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
+ max_ptes_swap = khugepaged_max_ptes_swap >> (HPAGE_PMD_ORDER - order);
+ referenced = 0, shared = 0, none_or_zero = 0, unmapped = 0;
+
+ /* Check pmd after taking mmap lock */
result = find_pmd_or_thp_or_none(mm, address, &pmd);
if (result != SCAN_SUCCEED)
goto out;
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
+
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
if (!pte) {
result = SCAN_PMD_NULL;
goto out;
}
- for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
+ for (_address = address, _pte = pte; _pte < pte + (1UL << order);
_pte++, _address += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
if (is_swap_pte(pteval)) {
++unmapped;
if (!cc->is_khugepaged ||
- unmapped <= khugepaged_max_ptes_swap) {
+ unmapped <= max_ptes_swap) {
/*
* Always be strict with uffd-wp
* enabled swap entries. Please see
@@ -1330,7 +1352,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
++none_or_zero;
if (!userfaultfd_armed(vma) &&
(!cc->is_khugepaged ||
- none_or_zero <= khugepaged_max_ptes_none)) {
+ none_or_zero <= max_ptes_none)) {
continue;
} else {
result = SCAN_EXCEED_NONE_PTE;
@@ -1375,7 +1397,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
if (folio_likely_mapped_shared(folio)) {
++shared;
if (cc->is_khugepaged &&
- shared > khugepaged_max_ptes_shared) {
+ shared > max_ptes_shared) {
result = SCAN_EXCEED_SHARED_PTE;
count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
goto out_unmap;
@@ -1432,7 +1454,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
result = SCAN_PAGE_RO;
} else if (cc->is_khugepaged &&
(!referenced ||
- (unmapped && referenced < HPAGE_PMD_NR / 2))) {
+ (unmapped && referenced < (1UL << order) / 2))) {
result = SCAN_LACK_REFERENCED_PAGE;
} else {
result = SCAN_SUCCEED;
@@ -1441,9 +1463,41 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
pte_unmap_unlock(pte, ptl);
if (result == SCAN_SUCCEED) {
result = collapse_huge_page(mm, address, referenced,
- unmapped, cc);
+ unmapped, order, cc);
/* collapse_huge_page will return with the mmap_lock released */
*mmap_locked = false;
+
+ /* Immediately exit on exhaustion of range */
+ if (_address == org_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
+ goto out;
+ }
+ if (result != SCAN_SUCCEED) {
+
+ /* Go to the next order. */
+ order = next_order(&orders, order);
+ if (order < 2)
+ goto out;
+ goto maybe_mmap_lock;
+ } else {
+ address = _address;
+ pte = _pte;
+
+
+ /* Get highest order possible starting from address */
+ order = count_trailing_zeros(address >> PAGE_SHIFT);
+
+ /* This needs to be present in the mask too */
+ if (!(orders & (1UL << order)))
+ order = next_order(&orders, order);
+ if (order < 2)
+ goto out;
+
+maybe_mmap_lock:
+ if (!(*mmap_locked)) {
+ mmap_read_lock(mm);
+ *mmap_locked = true;
+ }
+ goto scan_pte_range;
}
out:
trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
--
2.30.2
* [RFC PATCH 08/12] khugepaged: Abstract PMD-THP collapse
2024-12-16 16:50 [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse Dev Jain
` (6 preceding siblings ...)
2024-12-16 16:51 ` [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise Dev Jain
@ 2024-12-16 16:51 ` Dev Jain
2024-12-17 19:24 ` Ryan Roberts
2024-12-16 16:51 ` [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio() Dev Jain
` (4 subsequent siblings)
12 siblings, 1 reply; 74+ messages in thread
From: Dev Jain @ 2024-12-16 16:51 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel, Dev Jain
Abstract away taking the mmap_lock exclusively, copying page contents, and
setting the PMD, into vma_collapse_anon_folio_pmd().
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 119 +++++++++++++++++++++++++++---------------------
1 file changed, 66 insertions(+), 53 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 078794aa3335..88beebef773e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1111,58 +1111,17 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
return SCAN_SUCCEED;
}
-static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
- int referenced, int unmapped, int order,
- struct collapse_control *cc)
+static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long address,
+ struct vm_area_struct *vma, struct collapse_control *cc, pmd_t *pmd,
+ struct folio *folio)
{
+ struct mmu_notifier_range range;
+ spinlock_t *pmd_ptl, *pte_ptl;
LIST_HEAD(compound_pagelist);
- pmd_t *pmd, _pmd;
- pte_t *pte;
pgtable_t pgtable;
- struct folio *folio;
- spinlock_t *pmd_ptl, *pte_ptl;
- int result = SCAN_FAIL;
- struct vm_area_struct *vma;
- struct mmu_notifier_range range;
-
- VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-
- /*
- * Before allocating the hugepage, release the mmap_lock read lock.
- * The allocation can take potentially a long time if it involves
- * sync compaction, and we do not need to hold the mmap_lock during
- * that. We will recheck the vma after taking it again in write mode.
- */
- mmap_read_unlock(mm);
-
- result = alloc_charge_folio(&folio, mm, order, cc);
- if (result != SCAN_SUCCEED)
- goto out_nolock;
-
- mmap_read_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
- if (result != SCAN_SUCCEED) {
- mmap_read_unlock(mm);
- goto out_nolock;
- }
-
- result = find_pmd_or_thp_or_none(mm, address, &pmd);
- if (result != SCAN_SUCCEED) {
- mmap_read_unlock(mm);
- goto out_nolock;
- }
-
- if (unmapped) {
- /*
- * __collapse_huge_page_swapin will return with mmap_lock
- * released when it fails. So we jump out_nolock directly in
- * that case. Continuing to collapse causes inconsistency.
- */
- result = __collapse_huge_page_swapin(mm, vma, address, pmd,
- referenced, order);
- if (result != SCAN_SUCCEED)
- goto out_nolock;
- }
+ int result;
+ pmd_t _pmd;
+ pte_t *pte;
mmap_read_unlock(mm);
/*
@@ -1174,7 +1133,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* mmap_lock.
*/
mmap_write_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
+
+ result = hugepage_vma_revalidate(mm, address, true, &vma, HPAGE_PMD_ORDER, cc);
if (result != SCAN_SUCCEED)
goto out_up_write;
/* check if the pmd is still valid */
@@ -1206,7 +1166,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
if (pte) {
result = __collapse_huge_page_isolate(vma, address, pte, cc,
- &compound_pagelist, order);
+ &compound_pagelist, HPAGE_PMD_ORDER);
spin_unlock(pte_ptl);
} else {
result = SCAN_PMD_NULL;
@@ -1262,11 +1222,64 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
deferred_split_folio(folio, false);
spin_unlock(pmd_ptl);
- folio = NULL;
-
result = SCAN_SUCCEED;
out_up_write:
mmap_write_unlock(mm);
+ return result;
+}
+
+static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
+ int referenced, int unmapped, int order,
+ struct collapse_control *cc)
+{
+ struct vm_area_struct *vma;
+ int result = SCAN_FAIL;
+ struct folio *folio;
+ pmd_t *pmd;
+
+ /*
+ * Before allocating the hugepage, release the mmap_lock read lock.
+ * The allocation can take potentially a long time if it involves
+ * sync compaction, and we do not need to hold the mmap_lock during
+ * that. We will recheck the vma after taking it again in write mode.
+ */
+ mmap_read_unlock(mm);
+
+ result = alloc_charge_folio(&folio, mm, order, cc);
+ if (result != SCAN_SUCCEED)
+ goto out_nolock;
+
+ mmap_read_lock(mm);
+ result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
+ if (result != SCAN_SUCCEED) {
+ mmap_read_unlock(mm);
+ goto out_nolock;
+ }
+
+ result = find_pmd_or_thp_or_none(mm, address, &pmd);
+ if (result != SCAN_SUCCEED) {
+ mmap_read_unlock(mm);
+ goto out_nolock;
+ }
+
+ if (unmapped) {
+ /*
+ * __collapse_huge_page_swapin will return with mmap_lock
+ * released when it fails. So we jump out_nolock directly in
+ * that case. Continuing to collapse causes inconsistency.
+ */
+ result = __collapse_huge_page_swapin(mm, vma, address, pmd,
+ referenced, order);
+ if (result != SCAN_SUCCEED)
+ goto out_nolock;
+ }
+
+ if (order == HPAGE_PMD_ORDER)
+ result = vma_collapse_anon_folio_pmd(mm, address, vma, cc, pmd, folio);
+
+ if (result == SCAN_SUCCEED)
+ folio = NULL;
+
out_nolock:
if (folio)
folio_put(folio);
--
2.30.2
* [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
2024-12-16 16:50 [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse Dev Jain
` (7 preceding siblings ...)
2024-12-16 16:51 ` [RFC PATCH 08/12] khugepaged: Abstract PMD-THP collapse Dev Jain
@ 2024-12-16 16:51 ` Dev Jain
2024-12-16 17:06 ` David Hildenbrand
2025-01-06 10:17 ` Usama Arif
2024-12-16 16:51 ` [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped Dev Jain
` (3 subsequent siblings)
12 siblings, 2 replies; 74+ messages in thread
From: Dev Jain @ 2024-12-16 16:51 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel, Dev Jain
In contrast to PMD collapse, we do not need to operate on two levels of the
pagetable simultaneously. Therefore, downgrade the mmap lock from write to read
mode. Still take the anon_vma lock in exclusive mode so as not to waste time in
the rmap path, which is going to fail anyway since the PTEs are about to be
changed. Under the PTL, copy the page contents, clear the PTEs, drop the folio
pins, and (try to) unmap the old folios. Map the new folio by setting its PTEs
with the set_ptes() API.
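In outline, the locking in the new path looks like this (a condensed sketch of
vma_collapse_anon_folio() introduced below, omitting the mmu_notifier calls and
error handling):

	/* mmap read lock already held on entry */
	anon_vma_lock_write(vma->anon_vma);
	pte = pte_offset_map_lock(mm, pmd, address, &pte_ptl);
	/* isolate and lock the source folios */
	anon_vma_unlock_write(vma->anon_vma);
	/* copy contents, clear old PTEs, drop old rmap/refs */
	set_ptes(mm, address, pte, entry, nr_pages);
	pte_unmap_unlock(pte, pte_ptl);
	mmap_read_unlock(mm);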
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
Note: I have been trying hard to get rid of the locks in here: we still take
the PTL around the page copying. Dropping the PTL and re-taking it after the
copy can lead to a deadlock, for example:

khugepaged			madvise(MADV_COLD)
folio_lock()			lock(ptl)
lock(ptl)			folio_lock()

We could create a list of locked folios, drop both locks altogether, take the
PTL, do everything __collapse_huge_page_isolate() does *except* the isolation,
and then try locking the folios again, but that would reduce the efficiency of
khugepaged and almost looks like a forced solution :)
Please note the following discussion if anyone is interested:
https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/
(Apologies for not CCing the mailing list from the start)
mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++++++++++----------
1 file changed, 87 insertions(+), 21 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 88beebef773e..8040b130e677 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -714,24 +714,28 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
struct vm_area_struct *vma,
unsigned long address,
spinlock_t *ptl,
- struct list_head *compound_pagelist)
+ struct list_head *compound_pagelist, int order)
{
struct folio *src, *tmp;
pte_t *_pte;
pte_t pteval;
- for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+ for (_pte = pte; _pte < pte + (1UL << order);
_pte++, address += PAGE_SIZE) {
pteval = ptep_get(_pte);
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
if (is_zero_pfn(pte_pfn(pteval))) {
- /*
- * ptl mostly unnecessary.
- */
- spin_lock(ptl);
- ptep_clear(vma->vm_mm, address, _pte);
- spin_unlock(ptl);
+ if (order == HPAGE_PMD_ORDER) {
+ /*
+ * ptl mostly unnecessary.
+ */
+ spin_lock(ptl);
+ ptep_clear(vma->vm_mm, address, _pte);
+ spin_unlock(ptl);
+ } else {
+ ptep_clear(vma->vm_mm, address, _pte);
+ }
ksm_might_unmap_zero_page(vma->vm_mm, pteval);
}
} else {
@@ -740,15 +744,20 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
src = page_folio(src_page);
if (!folio_test_large(src))
release_pte_folio(src);
- /*
- * ptl mostly unnecessary, but preempt has to
- * be disabled to update the per-cpu stats
- * inside folio_remove_rmap_pte().
- */
- spin_lock(ptl);
- ptep_clear(vma->vm_mm, address, _pte);
- folio_remove_rmap_pte(src, src_page, vma);
- spin_unlock(ptl);
+ if (order == HPAGE_PMD_ORDER) {
+ /*
+ * ptl mostly unnecessary, but preempt has to
+ * be disabled to update the per-cpu stats
+ * inside folio_remove_rmap_pte().
+ */
+ spin_lock(ptl);
+ ptep_clear(vma->vm_mm, address, _pte);
+ folio_remove_rmap_pte(src, src_page, vma);
+ spin_unlock(ptl);
+ } else {
+ ptep_clear(vma->vm_mm, address, _pte);
+ folio_remove_rmap_pte(src, src_page, vma);
+ }
free_page_and_swap_cache(src_page);
}
}
@@ -807,7 +816,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
unsigned long address, spinlock_t *ptl,
- struct list_head *compound_pagelist)
+ struct list_head *compound_pagelist, int order)
{
unsigned int i;
int result = SCAN_SUCCEED;
@@ -815,7 +824,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
/*
* Copying pages' contents is subject to memory poison at any iteration.
*/
- for (i = 0; i < HPAGE_PMD_NR; i++) {
+ for (i = 0; i < (1 << order); i++) {
pte_t pteval = ptep_get(pte + i);
struct page *page = folio_page(folio, i);
unsigned long src_addr = address + i * PAGE_SIZE;
@@ -834,7 +843,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
if (likely(result == SCAN_SUCCEED))
__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
- compound_pagelist);
+ compound_pagelist, order);
else
__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
compound_pagelist, order);
@@ -1196,7 +1205,7 @@ static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long addre
result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
vma, address, pte_ptl,
- &compound_pagelist);
+ &compound_pagelist, HPAGE_PMD_ORDER);
pte_unmap(pte);
if (unlikely(result != SCAN_SUCCEED))
goto out_up_write;
@@ -1228,6 +1237,61 @@ static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long addre
return result;
}
+/* Enter with mmap read lock */
+static int vma_collapse_anon_folio(struct mm_struct *mm, unsigned long address,
+ struct vm_area_struct *vma, struct collapse_control *cc, pmd_t *pmd,
+ struct folio *folio, int order)
+{
+ int result;
+ struct mmu_notifier_range range;
+ spinlock_t *pte_ptl;
+ LIST_HEAD(compound_pagelist);
+ pte_t *pte;
+ pte_t entry;
+ int nr_pages = folio_nr_pages(folio);
+
+ anon_vma_lock_write(vma->anon_vma);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
+ address + (PAGE_SIZE << order));
+ mmu_notifier_invalidate_range_start(&range);
+
+ pte = pte_offset_map_lock(mm, pmd, address, &pte_ptl);
+ if (pte)
+ result = __collapse_huge_page_isolate(vma, address, pte, cc,
+ &compound_pagelist, order);
+ else
+ result = SCAN_PMD_NULL;
+
+ if (unlikely(result != SCAN_SUCCEED))
+ goto out_up_read;
+
+ anon_vma_unlock_write(vma->anon_vma);
+
+ __folio_mark_uptodate(folio);
+ entry = mk_pte(&folio->page, vma->vm_page_prot);
+ entry = maybe_mkwrite(entry, vma);
+
+ result = __collapse_huge_page_copy(pte, folio, pmd, *pmd,
+ vma, address, pte_ptl,
+ &compound_pagelist, order);
+ if (unlikely(result != SCAN_SUCCEED))
+ goto out_up_read;
+
+ folio_ref_add(folio, nr_pages - 1);
+ folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
+ folio_add_lru_vma(folio, vma);
+ deferred_split_folio(folio, false);
+ set_ptes(mm, address, pte, entry, nr_pages);
+ update_mmu_cache_range(NULL, vma, address, pte, nr_pages);
+ pte_unmap_unlock(pte, pte_ptl);
+ mmu_notifier_invalidate_range_end(&range);
+ result = SCAN_SUCCEED;
+
+out_up_read:
+ mmap_read_unlock(mm);
+ return result;
+}
+
static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
int referenced, int unmapped, int order,
struct collapse_control *cc)
@@ -1276,6 +1340,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
if (order == HPAGE_PMD_ORDER)
result = vma_collapse_anon_folio_pmd(mm, address, vma, cc, pmd, folio);
+ else
+ result = vma_collapse_anon_folio(mm, address, vma, cc, pmd, folio, order);
if (result == SCAN_SUCCEED)
folio = NULL;
--
2.30.2
* [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped
2024-12-16 16:50 [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse Dev Jain
` (8 preceding siblings ...)
2024-12-16 16:51 ` [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio() Dev Jain
@ 2024-12-16 16:51 ` Dev Jain
2024-12-18 7:36 ` Ryan Roberts
2024-12-16 16:51 ` [RFC PATCH 11/12] khugepaged: Enable sysfs to control order of collapse Dev Jain
` (2 subsequent siblings)
12 siblings, 1 reply; 74+ messages in thread
From: Dev Jain @ 2024-12-16 16:51 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel, Dev Jain
We may hit a situation wherein a larger folio is already mapped in the range.
It is incorrect to go ahead with the collapse, since some of its pages would be
unmapped, leading to the entire folio getting unmapped. Therefore, skip the
corresponding range.
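For example (assuming 4K pages): if, while scanning for an order-2 collapse, a
PTE turns out to map part of an existing order-4 folio, the scan skips all 16
PTEs of that folio (advancing the address and PTE pointer by 1 << found_order
entries) and then re-decides the order for the remainder of the PMD range,
rather than partially unmapping the larger folio.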
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
In the future, if it is ever required that all folios in the system be of a
specific order, we may split these larger folios.
mm/khugepaged.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8040b130e677..47e7c476b893 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -33,6 +33,7 @@ enum scan_result {
SCAN_PMD_NULL,
SCAN_PMD_NONE,
SCAN_PMD_MAPPED,
+ SCAN_PTE_MAPPED,
SCAN_EXCEED_NONE_PTE,
SCAN_EXCEED_SWAP_PTE,
SCAN_EXCEED_SHARED_PTE,
@@ -609,6 +610,11 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
folio = page_folio(page);
VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
+ if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order) {
+ result = SCAN_PTE_MAPPED;
+ goto out;
+ }
+
/* See hpage_collapse_scan_ptes(). */
if (folio_likely_mapped_shared(folio)) {
++shared;
@@ -1369,6 +1375,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
unsigned long orders;
pte_t *pte, *_pte;
spinlock_t *ptl;
+ int found_order;
pmd_t *pmd;
int order;
@@ -1467,6 +1474,24 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
goto out_unmap;
}
+ found_order = folio_order(folio);
+
+ /*
+ * No point of scanning. Two options: if this folio was hit
+ * somewhere in the middle of the scan, then drop down the
+ * order. Or, completely skip till the end of this folio. The
+ * latter gives us a higher order to start with, with at most
+ * 1 << order PTEs not collapsed; the former may force us
+ * to end up going below order 2 and exiting.
+ */
+ if (order != HPAGE_PMD_ORDER && found_order >= order) {
+ result = SCAN_PTE_MAPPED;
+ _address += (PAGE_SIZE << found_order);
+ _pte += (1UL << found_order);
+ pte_unmap_unlock(pte, ptl);
+ goto decide_order;
+ }
+
/*
* We treat a single page as shared if any part of the THP
* is shared. "False negatives" from
@@ -1550,6 +1575,10 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
if (_address == org_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
goto out;
}
+ /* A larger folio was mapped; it will be skipped in next iteration */
+ if (result == SCAN_PTE_MAPPED)
+ goto decide_order;
+
if (result != SCAN_SUCCEED) {
/* Go to the next order. */
@@ -1558,6 +1587,8 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
goto out;
goto maybe_mmap_lock;
} else {
+
+decide_order:
address = _address;
pte = _pte;
--
2.30.2
* [RFC PATCH 11/12] khugepaged: Enable sysfs to control order of collapse
2024-12-16 16:50 [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse Dev Jain
` (9 preceding siblings ...)
2024-12-16 16:51 ` [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped Dev Jain
@ 2024-12-16 16:51 ` Dev Jain
2024-12-16 16:51 ` [RFC PATCH 12/12] selftests/mm: khugepaged: Enlighten for mTHP collapse Dev Jain
2024-12-16 17:31 ` [RFC PATCH 00/12] khugepaged: Asynchronous " Dev Jain
12 siblings, 0 replies; 74+ messages in thread
From: Dev Jain @ 2024-12-16 16:51 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel, Dev Jain
Activate khugepaged for anonymous collapse even if only a single (non-PMD)
order is enabled. Note that we still scan only VMAs that are PMD-aligned/sized,
for ease of implementation.
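In practice this means khugepaged now starts, and VMAs get registered via
khugepaged_enter_vma(), as soon as any anonymous hugepage size is enabled, e.g.
by writing "always" to
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled on a 4K-page kernel,
whereas previously only the PMD-size control mattered.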
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 37 +++++++++++++++++++------------------
1 file changed, 19 insertions(+), 18 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 47e7c476b893..ffc4d5aef991 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -414,24 +414,20 @@ static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
test_bit(MMF_DISABLE_THP, &mm->flags);
}
-static bool hugepage_pmd_enabled(void)
+static bool thp_enabled(void)
{
/*
* We cover the anon, shmem and the file-backed case here; file-backed
* hugepages, when configured in, are determined by the global control.
- * Anon pmd-sized hugepages are determined by the pmd-size control.
+ * Anon mTHPs are determined by the per-size control.
* Shmem pmd-sized hugepages are also determined by its pmd-size control,
* except when the global shmem_huge is set to SHMEM_HUGE_DENY.
*/
if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
hugepage_global_enabled())
return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_always))
- return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
- return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
- hugepage_global_enabled())
+ if (huge_anon_orders_always || huge_anon_orders_madvise ||
+ (huge_anon_orders_inherit && hugepage_global_enabled()))
return true;
if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
return true;
@@ -474,9 +470,9 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
unsigned long vm_flags)
{
if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
- hugepage_pmd_enabled()) {
- if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
- PMD_ORDER))
+ thp_enabled()) {
+ if (thp_vma_allowable_orders(vma, vm_flags, TVA_ENFORCE_SYSFS,
+ BIT(PMD_ORDER + 1) - 1))
__khugepaged_enter(vma->vm_mm);
}
}
@@ -2586,8 +2582,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
progress++;
break;
}
- if (!thp_vma_allowable_order(vma, vma->vm_flags,
- TVA_ENFORCE_SYSFS, PMD_ORDER)) {
+ if (!thp_vma_allowable_orders(vma, vma->vm_flags,
+ TVA_ENFORCE_SYSFS, BIT(PMD_ORDER + 1) - 1)) {
skip:
progress++;
continue;
@@ -2611,6 +2607,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
khugepaged_scan.address + HPAGE_PMD_SIZE >
hend);
if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
+ if (!thp_vma_allowable_order(vma, vma->vm_flags,
+ TVA_ENFORCE_SYSFS, PMD_ORDER)) {
+ khugepaged_scan.address += HPAGE_PMD_SIZE;
+ continue;
+ }
struct file *file = get_file(vma->vm_file);
pgoff_t pgoff = linear_page_index(vma,
khugepaged_scan.address);
@@ -2689,7 +2690,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
static int khugepaged_has_work(void)
{
- return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
+ return !list_empty(&khugepaged_scan.mm_head) && thp_enabled();
}
static int khugepaged_wait_event(void)
@@ -2762,7 +2763,7 @@ static void khugepaged_wait_work(void)
return;
}
- if (hugepage_pmd_enabled())
+ if (thp_enabled())
wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
}
@@ -2793,7 +2794,7 @@ static void set_recommended_min_free_kbytes(void)
int nr_zones = 0;
unsigned long recommended_min;
- if (!hugepage_pmd_enabled()) {
+ if (!thp_enabled()) {
calculate_min_free_kbytes();
goto update_wmarks;
}
@@ -2843,7 +2844,7 @@ int start_stop_khugepaged(void)
int err = 0;
mutex_lock(&khugepaged_mutex);
- if (hugepage_pmd_enabled()) {
+ if (thp_enabled()) {
if (!khugepaged_thread)
khugepaged_thread = kthread_run(khugepaged, NULL,
"khugepaged");
@@ -2869,7 +2870,7 @@ int start_stop_khugepaged(void)
void khugepaged_min_free_kbytes_update(void)
{
mutex_lock(&khugepaged_mutex);
- if (hugepage_pmd_enabled() && khugepaged_thread)
+ if (thp_enabled() && khugepaged_thread)
set_recommended_min_free_kbytes();
mutex_unlock(&khugepaged_mutex);
}
--
2.30.2
* [RFC PATCH 12/12] selftests/mm: khugepaged: Enlighten for mTHP collapse
2024-12-16 16:50 [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse Dev Jain
` (10 preceding siblings ...)
2024-12-16 16:51 ` [RFC PATCH 11/12] khugepaged: Enable sysfs to control order of collapse Dev Jain
@ 2024-12-16 16:51 ` Dev Jain
2024-12-18 9:03 ` Ryan Roberts
2024-12-16 17:31 ` [RFC PATCH 00/12] khugepaged: Asynchronous " Dev Jain
12 siblings, 1 reply; 74+ messages in thread
From: Dev Jain @ 2024-12-16 16:51 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel, Dev Jain
One of the testcases triggers CoW on the first 255 pages (0-254, 0-indexed)
with max_ptes_shared = 256. This leaves those 255 pages unshared and 257 pages
shared, exceeding the constraint. Suppose we run the test as ./khugepaged -s 2:
khugepaged then starts collapsing the range into order-2 folios, since PMD
collapse fails due to the constraint. When the scan reaches the 254-257 PTE
range, at least one PTE in this range is writable with the other 3 being
read-only, so khugepaged collapses it into an order-2 mTHP, unsharing 3 extra
PTEs. After this we encounter a 4-sized chunk of read-only PTEs, and mTHP
collapse stops according to the scaled constraint, but the number of shared
PTEs has now come under the constraint for PMD-sized THPs. Therefore, the next
khugepaged scan is able to collapse this range into a PMD-mapped hugepage,
causing this subtest to fail. Fix this by reducing the CoW range.
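Concretely (assuming hpage_pmd_nr = 512 and the default max_ptes_shared = 256):
the old code CoWs 255 pages, leaving 257 shared, only one over the limit, so a
single order-2 collapse that unshares 3 extra PTEs drops the count to 254 and
the subsequent PMD-sized scan succeeds. CoWing fault_nr_pages fewer pages (252
when run with -s 2) leaves 260 shared, which stays above the limit even after
such an mTHP collapse.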
Note: the only objective of this patch is to make the test work for the PMD
case; no extension has been made to test mTHP collapse.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
tools/testing/selftests/mm/khugepaged.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index 8a4d34cce36b..143c4ad9f6a1 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -981,6 +981,7 @@ static void collapse_fork_compound(struct collapse_context *c, struct mem_ops *o
static void collapse_max_ptes_shared(struct collapse_context *c, struct mem_ops *ops)
{
int max_ptes_shared = thp_read_num("khugepaged/max_ptes_shared");
+ int fault_nr_pages = is_anon(ops) ? 1 << anon_order : 1;
int wstatus;
void *p;
@@ -997,8 +998,8 @@ static void collapse_max_ptes_shared(struct collapse_context *c, struct mem_ops
fail("Fail");
printf("Trigger CoW on page %d of %d...",
- hpage_pmd_nr - max_ptes_shared - 1, hpage_pmd_nr);
- ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - 1) * page_size);
+ hpage_pmd_nr - max_ptes_shared - fault_nr_pages, hpage_pmd_nr);
+ ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - fault_nr_pages) * page_size);
if (ops->check_huge(p, 0))
success("OK");
else
--
2.30.2
* Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
2024-12-16 16:51 ` [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio() Dev Jain
@ 2024-12-16 17:06 ` David Hildenbrand
2024-12-16 19:08 ` Yang Shi
` (2 more replies)
2025-01-06 10:17 ` Usama Arif
1 sibling, 3 replies; 74+ messages in thread
From: David Hildenbrand @ 2024-12-16 17:06 UTC (permalink / raw)
To: Dev Jain, akpm, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel
On 16.12.24 17:51, Dev Jain wrote:
> In contrast to PMD-collapse, we do not need to operate on two levels of pagetable
> simultaneously. Therefore, downgrade the mmap lock from write to read mode. Still
> take the anon_vma lock in exclusive mode so as to not waste time in the rmap path,
> which is anyways going to fail since the PTEs are going to be changed. Under the PTL,
> copy page contents, clear the PTEs, remove folio pins, and (try to) unmap the
> old folios. Set the PTEs to the new folio using the set_ptes() API.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> Note: I have been trying hard to get rid of the locks in here: we still are
> taking the PTL around the page copying; dropping the PTL and taking it after
> the copying should lead to a deadlock, for example:
> khugepaged madvise(MADV_COLD)
> folio_lock() lock(ptl)
> lock(ptl) folio_lock()
>
> We can create a locked folio list, altogether drop both the locks, take the PTL,
> do everything which __collapse_huge_page_isolate() does *except* the isolation and
> again try locking folios, but then it will reduce efficiency of khugepaged
> and almost looks like a forced solution :)
> Please note the following discussion if anyone is interested:
> https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/
> (Apologies for not CCing the mailing list from the start)
>
> mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 87 insertions(+), 21 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 88beebef773e..8040b130e677 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -714,24 +714,28 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> struct vm_area_struct *vma,
> unsigned long address,
> spinlock_t *ptl,
> - struct list_head *compound_pagelist)
> + struct list_head *compound_pagelist, int order)
> {
> struct folio *src, *tmp;
> pte_t *_pte;
> pte_t pteval;
>
> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> + for (_pte = pte; _pte < pte + (1UL << order);
> _pte++, address += PAGE_SIZE) {
> pteval = ptep_get(_pte);
> if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> if (is_zero_pfn(pte_pfn(pteval))) {
> - /*
> - * ptl mostly unnecessary.
> - */
> - spin_lock(ptl);
> - ptep_clear(vma->vm_mm, address, _pte);
> - spin_unlock(ptl);
> + if (order == HPAGE_PMD_ORDER) {
> + /*
> + * ptl mostly unnecessary.
> + */
> + spin_lock(ptl);
> + ptep_clear(vma->vm_mm, address, _pte);
> + spin_unlock(ptl);
> + } else {
> + ptep_clear(vma->vm_mm, address, _pte);
> + }
> ksm_might_unmap_zero_page(vma->vm_mm, pteval);
> }
> } else {
> @@ -740,15 +744,20 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> src = page_folio(src_page);
> if (!folio_test_large(src))
> release_pte_folio(src);
> - /*
> - * ptl mostly unnecessary, but preempt has to
> - * be disabled to update the per-cpu stats
> - * inside folio_remove_rmap_pte().
> - */
> - spin_lock(ptl);
> - ptep_clear(vma->vm_mm, address, _pte);
> - folio_remove_rmap_pte(src, src_page, vma);
> - spin_unlock(ptl);
> + if (order == HPAGE_PMD_ORDER) {
> + /*
> + * ptl mostly unnecessary, but preempt has to
> + * be disabled to update the per-cpu stats
> + * inside folio_remove_rmap_pte().
> + */
> + spin_lock(ptl);
> + ptep_clear(vma->vm_mm, address, _pte);
> + folio_remove_rmap_pte(src, src_page, vma);
> + spin_unlock(ptl);
> + } else {
> + ptep_clear(vma->vm_mm, address, _pte);
> + folio_remove_rmap_pte(src, src_page, vma);
> + }
As I've talked to Nico about this code recently ... :)
Are you clearing the PTE after the copy succeeded? If so, where is the
TLB flush?
How do you sync against concurrent write access + GUP-fast?
The sequence really must be: (1) clear PTE/PMD + flush TLB (2) check if
there are unexpected page references (e.g., GUP) if so back off (3)
copy page content (4) set updated PTE/PMD.
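As a minimal sketch of that ordering (not the actual patch; collapse_expected_refcount()
is a made-up stand-in for whatever reference check the real code would do):

static bool collapse_one_pte(struct vm_area_struct *vma, unsigned long addr,
			     pte_t *ptep, struct page *src, struct page *dst)
{
	/* (1) unmap + TLB flush, so no CPU can keep writing through this PTE */
	pte_t old = ptep_clear_flush(vma, addr, ptep);

	/* (2) back off if someone (e.g. GUP) still holds an unexpected reference */
	if (folio_ref_count(page_folio(src)) != collapse_expected_refcount(src)) {
		set_pte_at(vma->vm_mm, addr, ptep, old);
		return false;
	}

	/* (3) copy the content now that it can no longer change under us */
	copy_user_highpage(dst, src, addr, vma);

	/* (4) the caller installs the PTE(s) pointing at dst, e.g. via set_ptes() */
	return true;
}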
To Nico, I suggested doing it simple initially, and still clear the
high-level PMD entry + flush under mmap write lock, then re-map the PTE
table after modifying the page table. It's not as efficient, but "harder
to get wrong".
Maybe that's already happening, but I stumbled over this clearing logic
in __collapse_huge_page_copy_succeeded(), so I'm curious.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse
2024-12-16 16:50 [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse Dev Jain
` (11 preceding siblings ...)
2024-12-16 16:51 ` [RFC PATCH 12/12] selftests/mm: khugepaged: Enlighten for mTHP collapse Dev Jain
@ 2024-12-16 17:31 ` Dev Jain
2025-01-02 21:58 ` Nico Pache
12 siblings, 1 reply; 74+ messages in thread
From: Dev Jain @ 2024-12-16 17:31 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Nico Pache
+Nico, apologies, forgot to CC you.
On 16/12/24 10:20 pm, Dev Jain wrote:
> This patchset extends khugepaged from collapsing only PMD-sized THPs to
> collapsing anonymous mTHPs.
>
> mTHPs were introduced in the kernel to improve memory management by allocating
> chunks of larger memory, so as to reduce number of page faults, TLB misses (due
> to TLB coalescing), reduce length of LRU lists, etc. However, the mTHP property
> is often lost due to CoW, swap-in/out, and when the kernel just cannot find
> enough physically contiguous memory to allocate on fault. Henceforth, there is a
> need to regain mTHPs in the system asynchronously. This work is an attempt in
> this direction, starting with anonymous folios.
>
> In the fault handler, we select the THP order in a greedy manner; the same has
> been used here, along with the same sysfs interface to control the order of
> collapse. In contrast to PMD-collapse, we (hopefully) get rid of the mmap_write_lock().
>
> ---------------------------------------------------------
> Testing
> ---------------------------------------------------------
>
> The set has been build tested on x86_64.
> For Aarch64,
> 1. mm-selftests: No regressions.
> 2. Analyzing with tools/mm/thpmaps on different userspace programs mapping
> aligned VMAs of a large size, faulting in basepages/mTHPs (according to sysfs),
> and then madvise()'ing the VMA, khugepaged is able to 100% collapse the VMAs.
>
> This patchset is rebased on mm-unstable (e7e89af21ffcfd1077ca6d2188de6497db1ad84c).
>
> Some points to be noted:
> 1. Some stats like pages_collapsed for khugepaged have not been extended for mTHP.
> I'd welcome suggestions on any updation, or addition to the sysfs interface.
> 2. Please see patch 9 for lock handling.
>
> Dev Jain (12):
> khugepaged: Rename hpage_collapse_scan_pmd() -> ptes()
> khugepaged: Generalize alloc_charge_folio()
> khugepaged: Generalize hugepage_vma_revalidate()
> khugepaged: Generalize __collapse_huge_page_swapin()
> khugepaged: Generalize __collapse_huge_page_isolate()
> khugepaged: Generalize __collapse_huge_page_copy_failed()
> khugepaged: Scan PTEs order-wise
> khugepaged: Abstract PMD-THP collapse
> khugepaged: Introduce vma_collapse_anon_folio()
> khugepaged: Skip PTE range if a larger mTHP is already mapped
> khugepaged: Enable sysfs to control order of collapse
> selftests/mm: khugepaged: Enlighten for mTHP collapse
>
> include/linux/huge_mm.h | 2 +
> mm/huge_memory.c | 4 +
> mm/khugepaged.c | 445 +++++++++++++++++-------
> tools/testing/selftests/mm/khugepaged.c | 5 +-
> 4 files changed, 319 insertions(+), 137 deletions(-)
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
2024-12-16 17:06 ` David Hildenbrand
@ 2024-12-16 19:08 ` Yang Shi
2024-12-17 10:07 ` Dev Jain
2024-12-18 15:59 ` Dev Jain
2 siblings, 0 replies; 74+ messages in thread
From: Yang Shi @ 2024-12-16 19:08 UTC (permalink / raw)
To: David Hildenbrand
Cc: Dev Jain, akpm, willy, kirill.shutemov, ryan.roberts,
anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On Mon, Dec 16, 2024 at 9:09 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 16.12.24 17:51, Dev Jain wrote:
> > In contrast to PMD-collapse, we do not need to operate on two levels of pagetable
> > simultaneously. Therefore, downgrade the mmap lock from write to read mode. Still
> > take the anon_vma lock in exclusive mode so as to not waste time in the rmap path,
> > which is anyways going to fail since the PTEs are going to be changed. Under the PTL,
> > copy page contents, clear the PTEs, remove folio pins, and (try to) unmap the
> > old folios. Set the PTEs to the new folio using the set_ptes() API.
> >
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > ---
> > Note: I have been trying hard to get rid of the locks in here: we still are
> > taking the PTL around the page copying; dropping the PTL and taking it after
> > the copying should lead to a deadlock, for example:
> > khugepaged madvise(MADV_COLD)
> > folio_lock() lock(ptl)
> > lock(ptl) folio_lock()
> >
> > We can create a locked folio list, altogether drop both the locks, take the PTL,
> > do everything which __collapse_huge_page_isolate() does *except* the isolation and
> > again try locking folios, but then it will reduce efficiency of khugepaged
> > and almost looks like a forced solution :)
> > Please note the following discussion if anyone is interested:
> > https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/
> > (Apologies for not CCing the mailing list from the start)
> >
> > mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++++++++++----------
> > 1 file changed, 87 insertions(+), 21 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 88beebef773e..8040b130e677 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -714,24 +714,28 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> > struct vm_area_struct *vma,
> > unsigned long address,
> > spinlock_t *ptl,
> > - struct list_head *compound_pagelist)
> > + struct list_head *compound_pagelist, int order)
> > {
> > struct folio *src, *tmp;
> > pte_t *_pte;
> > pte_t pteval;
> >
> > - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > + for (_pte = pte; _pte < pte + (1UL << order);
> > _pte++, address += PAGE_SIZE) {
> > pteval = ptep_get(_pte);
> > if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > if (is_zero_pfn(pte_pfn(pteval))) {
> > - /*
> > - * ptl mostly unnecessary.
> > - */
> > - spin_lock(ptl);
> > - ptep_clear(vma->vm_mm, address, _pte);
> > - spin_unlock(ptl);
> > + if (order == HPAGE_PMD_ORDER) {
> > + /*
> > + * ptl mostly unnecessary.
> > + */
> > + spin_lock(ptl);
> > + ptep_clear(vma->vm_mm, address, _pte);
> > + spin_unlock(ptl);
> > + } else {
> > + ptep_clear(vma->vm_mm, address, _pte);
> > + }
> > ksm_might_unmap_zero_page(vma->vm_mm, pteval);
> > }
> > } else {
> > @@ -740,15 +744,20 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> > src = page_folio(src_page);
> > if (!folio_test_large(src))
> > release_pte_folio(src);
> > - /*
> > - * ptl mostly unnecessary, but preempt has to
> > - * be disabled to update the per-cpu stats
> > - * inside folio_remove_rmap_pte().
> > - */
> > - spin_lock(ptl);
> > - ptep_clear(vma->vm_mm, address, _pte);
> > - folio_remove_rmap_pte(src, src_page, vma);
> > - spin_unlock(ptl);
> > + if (order == HPAGE_PMD_ORDER) {
> > + /*
> > + * ptl mostly unnecessary, but preempt has to
> > + * be disabled to update the per-cpu stats
> > + * inside folio_remove_rmap_pte().
> > + */
> > + spin_lock(ptl);
> > + ptep_clear(vma->vm_mm, address, _pte);
>
>
>
>
> > + folio_remove_rmap_pte(src, src_page, vma);
> > + spin_unlock(ptl);
I think it is ok not to take the ptl since preemption is disabled at
this point by pte_map(); pte_unmap() is called after the copy.
> > + } else {
> > + ptep_clear(vma->vm_mm, address, _pte);
> > + folio_remove_rmap_pte(src, src_page, vma);
> > + }
>
> As I've talked to Nico about this code recently ... :)
>
> Are you clearing the PTE after the copy succeeded? If so, where is the
> TLB flush?
>
> How do you sync against concurrent write access + GUP-fast?
>
>
> The sequence really must be: (1) clear PTE/PMD + flush TLB (2) check if
> there are unexpected page references (e.g., GUP) if so back off (3)
> copy page content (4) set updated PTE/PMD.
Yeah, IIRC either the PMD is not cleared or tlb_remove_table_sync_one() is
not called, so a concurrent GUP may change the refcount after the
refcount check.
>
> To Nico, I suggested doing it simple initially, and still clear the
> high-level PMD entry + flush under mmap write lock, then re-map the PTE
> table after modifying the page table. It's not as efficient, but "harder
> to get wrong".
>
> Maybe that's already happening, but I stumbled over this clearing logic
> in __collapse_huge_page_copy_succeeded(), so I'm curious.
>
> --
> Cheers,
>
> David / dhildenb
>
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio()
2024-12-16 16:50 ` [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio() Dev Jain
@ 2024-12-17 2:51 ` Baolin Wang
2024-12-17 6:08 ` Dev Jain
2024-12-17 4:17 ` Matthew Wilcox
2024-12-17 6:53 ` Ryan Roberts
2 siblings, 1 reply; 74+ messages in thread
From: Baolin Wang @ 2024-12-17 2:51 UTC (permalink / raw)
To: Dev Jain, akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel
On 2024/12/17 00:50, Dev Jain wrote:
> Pass order to alloc_charge_folio() and update mTHP statistics.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> include/linux/huge_mm.h | 2 ++
> mm/huge_memory.c | 4 ++++
> mm/khugepaged.c | 13 +++++++++----
> 3 files changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 93e509b6c00e..8b6d0fed99b3 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -119,6 +119,8 @@ enum mthp_stat_item {
> MTHP_STAT_ANON_FAULT_ALLOC,
> MTHP_STAT_ANON_FAULT_FALLBACK,
> MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
> + MTHP_STAT_ANON_COLLAPSE_ALLOC,
> + MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED,
> MTHP_STAT_ZSWPOUT,
> MTHP_STAT_SWPIN,
> MTHP_STAT_SWPIN_FALLBACK,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2da5520bfe24..2e582fad4c77 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -615,6 +615,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
> DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
> DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
> DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> +DEFINE_MTHP_STAT_ATTR(anon_collapse_alloc, MTHP_STAT_ANON_COLLAPSE_ALLOC);
> +DEFINE_MTHP_STAT_ATTR(anon_collapse_alloc_failed, MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED);
> DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
> DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
> DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
> @@ -636,6 +638,8 @@ static struct attribute *anon_stats_attrs[] = {
> &anon_fault_alloc_attr.attr,
> &anon_fault_fallback_attr.attr,
> &anon_fault_fallback_charge_attr.attr,
> + &anon_collapse_alloc_attr.attr,
> + &anon_collapse_alloc_failed_attr.attr,
> #ifndef CONFIG_SHMEM
> &zswpout_attr.attr,
> &swpin_attr.attr,
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 95643e6e5f31..02cd424b8e48 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1073,21 +1073,26 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> }
>
> static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> - struct collapse_control *cc)
> + int order, struct collapse_control *cc)
> {
> gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> GFP_TRANSHUGE);
> int node = hpage_collapse_find_target_node(cc);
> struct folio *folio;
>
> - folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
> + folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
> if (!folio) {
> *foliop = NULL;
> count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> + if (order != HPAGE_PMD_ORDER)
> + count_mthp_stat(order, MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED);
> return SCAN_ALLOC_HUGE_PAGE_FAIL;
> }
>
> count_vm_event(THP_COLLAPSE_ALLOC);
> + if (order != HPAGE_PMD_ORDER)
> + count_mthp_stat(order, MTHP_STAT_ANON_COLLAPSE_ALLOC);
File collapse will also call alloc_charge_folio() to allocate a THP, so
using the term '_ANON_' is not suitable for a path shared by both anon and file collapse.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio()
2024-12-16 16:50 ` [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio() Dev Jain
2024-12-17 2:51 ` Baolin Wang
@ 2024-12-17 4:17 ` Matthew Wilcox
2024-12-17 7:09 ` Ryan Roberts
2024-12-17 6:53 ` Ryan Roberts
2 siblings, 1 reply; 74+ messages in thread
From: Matthew Wilcox @ 2024-12-17 4:17 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, david, kirill.shutemov, ryan.roberts, anshuman.khandual,
catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm,
linux-kernel
On Mon, Dec 16, 2024 at 10:20:55PM +0530, Dev Jain wrote:
> static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> - struct collapse_control *cc)
> + int order, struct collapse_control *cc)
unsigned, surely?
> if (!folio) {
> *foliop = NULL;
> count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> + if (order != HPAGE_PMD_ORDER)
> + count_mthp_stat(order, MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED);
i don't understand why we need new statistics here. we already have a
signal that memory allocation failures are preventing collapse from
being successful, why do we care if it's mthp or actual thp?
> count_vm_event(THP_COLLAPSE_ALLOC);
> + if (order != HPAGE_PMD_ORDER)
> + count_mthp_stat(order, MTHP_STAT_ANON_COLLAPSE_ALLOC);
similar question
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 01/12] khugepaged: Rename hpage_collapse_scan_pmd() -> ptes()
2024-12-16 16:50 ` [RFC PATCH 01/12] khugepaged: Rename hpage_collapse_scan_pmd() -> ptes() Dev Jain
@ 2024-12-17 4:18 ` Matthew Wilcox
2024-12-17 5:52 ` Dev Jain
2024-12-17 6:43 ` Ryan Roberts
0 siblings, 2 replies; 74+ messages in thread
From: Matthew Wilcox @ 2024-12-17 4:18 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, david, kirill.shutemov, ryan.roberts, anshuman.khandual,
catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm,
linux-kernel
On Mon, Dec 16, 2024 at 10:20:54PM +0530, Dev Jain wrote:
> -static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> +static int hpage_collapse_scan_ptes(struct mm_struct *mm,
i don't think this is necessary at all. you're scanning a pmd.
you might not be scanning in order to collapse to a pmd, but pmd
is the level you're scanning at.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 03/12] khugepaged: Generalize hugepage_vma_revalidate()
2024-12-16 16:50 ` [RFC PATCH 03/12] khugepaged: Generalize hugepage_vma_revalidate() Dev Jain
@ 2024-12-17 4:21 ` Matthew Wilcox
2024-12-17 16:58 ` Ryan Roberts
1 sibling, 0 replies; 74+ messages in thread
From: Matthew Wilcox @ 2024-12-17 4:21 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, david, kirill.shutemov, ryan.roberts, anshuman.khandual,
catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm,
linux-kernel
On Mon, Dec 16, 2024 at 10:20:56PM +0530, Dev Jain wrote:
> static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> bool expect_anon,
> - struct vm_area_struct **vmap,
> + struct vm_area_struct **vmap, int order,
orders are unsigned. i'm going to stop leaving this feedback here, but
please fix the entire series for this braino.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 04/12] khugepaged: Generalize __collapse_huge_page_swapin()
2024-12-16 16:50 ` [RFC PATCH 04/12] khugepaged: Generalize __collapse_huge_page_swapin() Dev Jain
@ 2024-12-17 4:24 ` Matthew Wilcox
0 siblings, 0 replies; 74+ messages in thread
From: Matthew Wilcox @ 2024-12-17 4:24 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, david, kirill.shutemov, ryan.roberts, anshuman.khandual,
catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm,
linux-kernel
On Mon, Dec 16, 2024 at 10:20:57PM +0530, Dev Jain wrote:
> - unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
> + unsigned long address, end = addr + ((1UL << order) * PAGE_SIZE);
addr + (PAGE_SIZE << order);
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 05/12] khugepaged: Generalize __collapse_huge_page_isolate()
2024-12-16 16:50 ` [RFC PATCH 05/12] khugepaged: Generalize __collapse_huge_page_isolate() Dev Jain
@ 2024-12-17 4:32 ` Matthew Wilcox
2024-12-17 6:41 ` Dev Jain
2024-12-17 17:09 ` Ryan Roberts
1 sibling, 1 reply; 74+ messages in thread
From: Matthew Wilcox @ 2024-12-17 4:32 UTC (permalink / raw)
To: Dev Jain, g
Cc: akpm, david, kirill.shutemov, ryan.roberts, anshuman.khandual,
catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm,
linux-kernel
On Mon, Dec 16, 2024 at 10:20:58PM +0530, Dev Jain wrote:
> {
> - struct page *page = NULL;
> - struct folio *folio = NULL;
> - pte_t *_pte;
> + unsigned int max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
> + unsigned int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> + struct folio *folio = NULL;
> + struct page *page = NULL;
why are you moving variables around unnecessarily?
> bool writable = false;
> + pte_t *_pte;
>
> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> +
> + for (_pte = pte; _pte < pte + (1UL << order);
spurious blank line
also you might want to finish off the page->folio conversion in
this function first; we have a vm_normal_folio() now.
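i.e. something along these lines in the isolate loop, deriving the folio
directly instead of going via vm_normal_page() + page_folio() (rough sketch,
untested):

	folio = vm_normal_folio(vma, address, pteval);
	if (!folio || folio_is_zone_device(folio)) {
		result = SCAN_PAGE_NULL;
		goto out;
	}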
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 01/12] khugepaged: Rename hpage_collapse_scan_pmd() -> ptes()
2024-12-17 4:18 ` Matthew Wilcox
@ 2024-12-17 5:52 ` Dev Jain
2024-12-17 6:43 ` Ryan Roberts
1 sibling, 0 replies; 74+ messages in thread
From: Dev Jain @ 2024-12-17 5:52 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, david, kirill.shutemov, ryan.roberts, anshuman.khandual,
catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm,
linux-kernel
On 17/12/24 9:48 am, Matthew Wilcox wrote:
> On Mon, Dec 16, 2024 at 10:20:54PM +0530, Dev Jain wrote:
>> -static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> +static int hpage_collapse_scan_ptes(struct mm_struct *mm,
> i don't think this is necessary at all. you're scanning a pmd.
> you might not be scanning in order to collapse to a pmd, but pmd
> is the level you're scanning at.
I did that in case, at some point in the review process, we note that we need to
drop the starting scan order even lower, so that we do not skip smaller VMAs. I kind of
agree with you, so until then I have no problem reverting.
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio()
2024-12-17 2:51 ` Baolin Wang
@ 2024-12-17 6:08 ` Dev Jain
0 siblings, 0 replies; 74+ messages in thread
From: Dev Jain @ 2024-12-17 6:08 UTC (permalink / raw)
To: Baolin Wang, akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel
On 17/12/24 8:21 am, Baolin Wang wrote:
>
>
> On 2024/12/17 00:50, Dev Jain wrote:
>> Pass order to alloc_charge_folio() and update mTHP statistics.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> include/linux/huge_mm.h | 2 ++
>> mm/huge_memory.c | 4 ++++
>> mm/khugepaged.c | 13 +++++++++----
>> 3 files changed, 15 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 93e509b6c00e..8b6d0fed99b3 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -119,6 +119,8 @@ enum mthp_stat_item {
>> MTHP_STAT_ANON_FAULT_ALLOC,
>> MTHP_STAT_ANON_FAULT_FALLBACK,
>> MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
>> + MTHP_STAT_ANON_COLLAPSE_ALLOC,
>> + MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED,
>> MTHP_STAT_ZSWPOUT,
>> MTHP_STAT_SWPIN,
>> MTHP_STAT_SWPIN_FALLBACK,
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 2da5520bfe24..2e582fad4c77 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -615,6 +615,8 @@ static struct kobj_attribute _name##_attr =
>> __ATTR_RO(_name)
>> DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
>> DEFINE_MTHP_STAT_ATTR(anon_fault_fallback,
>> MTHP_STAT_ANON_FAULT_FALLBACK);
>> DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge,
>> MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
>> +DEFINE_MTHP_STAT_ATTR(anon_collapse_alloc,
>> MTHP_STAT_ANON_COLLAPSE_ALLOC);
>> +DEFINE_MTHP_STAT_ATTR(anon_collapse_alloc_failed,
>> MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED);
>> DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
>> DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
>> DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
>> @@ -636,6 +638,8 @@ static struct attribute *anon_stats_attrs[] = {
>> &anon_fault_alloc_attr.attr,
>> &anon_fault_fallback_attr.attr,
>> &anon_fault_fallback_charge_attr.attr,
>> + &anon_collapse_alloc_attr.attr,
>> + &anon_collapse_alloc_failed_attr.attr,
>> #ifndef CONFIG_SHMEM
>> &zswpout_attr.attr,
>> &swpin_attr.attr,
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 95643e6e5f31..02cd424b8e48 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1073,21 +1073,26 @@ static int __collapse_huge_page_swapin(struct
>> mm_struct *mm,
>> }
>> static int alloc_charge_folio(struct folio **foliop, struct
>> mm_struct *mm,
>> - struct collapse_control *cc)
>> + int order, struct collapse_control *cc)
>> {
>> gfp_t gfp = (cc->is_khugepaged ?
>> alloc_hugepage_khugepaged_gfpmask() :
>> GFP_TRANSHUGE);
>> int node = hpage_collapse_find_target_node(cc);
>> struct folio *folio;
>> - folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node,
>> &cc->alloc_nmask);
>> + folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
>> if (!folio) {
>> *foliop = NULL;
>> count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
>> + if (order != HPAGE_PMD_ORDER)
>> + count_mthp_stat(order,
>> MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED);
>> return SCAN_ALLOC_HUGE_PAGE_FAIL;
>> }
>> count_vm_event(THP_COLLAPSE_ALLOC);
>> + if (order != HPAGE_PMD_ORDER)
>> + count_mthp_stat(order, MTHP_STAT_ANON_COLLAPSE_ALLOC);
>
> File collapse will also call alloc_charge_folio() to allocate a THP, so
> using the term '_ANON_' is not suitable for a path shared by both anon and file collapse.
But currently file collapse will only allocate a PMD-sized folio, and I have
extended mTHP collapse only to anon, so it makes sense? Although
I get your point in general...
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 05/12] khugepaged: Generalize __collapse_huge_page_isolate()
2024-12-17 4:32 ` Matthew Wilcox
@ 2024-12-17 6:41 ` Dev Jain
2024-12-17 17:14 ` Ryan Roberts
0 siblings, 1 reply; 74+ messages in thread
From: Dev Jain @ 2024-12-17 6:41 UTC (permalink / raw)
To: Matthew Wilcox, g
Cc: akpm, david, kirill.shutemov, ryan.roberts, anshuman.khandual,
catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm,
linux-kernel
On 17/12/24 10:02 am, Matthew Wilcox wrote:
> On Mon, Dec 16, 2024 at 10:20:58PM +0530, Dev Jain wrote:
>> {
>> - struct page *page = NULL;
>> - struct folio *folio = NULL;
>> - pte_t *_pte;
>> + unsigned int max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
>> + unsigned int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>> int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
>> + struct folio *folio = NULL;
>> + struct page *page = NULL;
> why are you moving variables around unnecessarily?
In previous work, I moved code around and David asked me to arrange the declarations
in reverse Xmas tree order. I guess (?) that was not spoiling git history, so if
this feels like it does, I will revert.
>
>> bool writable = false;
>> + pte_t *_pte;
>>
>> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>> +
>> + for (_pte = pte; _pte < pte + (1UL << order);
> spurious blank line
My bad
>
>
> also you might want to finish off the page->folio conversion in
> this function first; we have a vm_normal_folio() now.
I did not add any code before we derive the folio...I'm sorry, I don't get what you mean...
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 01/12] khugepaged: Rename hpage_collapse_scan_pmd() -> ptes()
2024-12-17 4:18 ` Matthew Wilcox
2024-12-17 5:52 ` Dev Jain
@ 2024-12-17 6:43 ` Ryan Roberts
2024-12-17 18:11 ` Zi Yan
1 sibling, 1 reply; 74+ messages in thread
From: Ryan Roberts @ 2024-12-17 6:43 UTC (permalink / raw)
To: Matthew Wilcox, Dev Jain
Cc: akpm, david, kirill.shutemov, anshuman.khandual, catalin.marinas,
cl, vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel
On 17/12/2024 04:18, Matthew Wilcox wrote:
> On Mon, Dec 16, 2024 at 10:20:54PM +0530, Dev Jain wrote:
>> -static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> +static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>
> i don't think this is necessary at all. you're scanning a pmd.
> you might not be scanning in order to collapse to a pmd, but pmd
> is the level you're scanning at.
>
Sorry Matthew, I don't really understand this statement. Prior to the change we
were scanning all PTE entries in a PTE table with the aim of collapsing to a PMD
entry. After the change we are scanning some PTE entries in a PTE table with the
aim of collapsing to either a multi-PTE-mapped folio or a single-PMD-mapped
folio.
So personally I think "scan_pmd" was a misnomer even before the change - we are
scanning the ptes.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio()
2024-12-16 16:50 ` [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio() Dev Jain
2024-12-17 2:51 ` Baolin Wang
2024-12-17 4:17 ` Matthew Wilcox
@ 2024-12-17 6:53 ` Ryan Roberts
2024-12-17 9:06 ` Dev Jain
2 siblings, 1 reply; 74+ messages in thread
From: Ryan Roberts @ 2024-12-17 6:53 UTC (permalink / raw)
To: Dev Jain, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 16/12/2024 16:50, Dev Jain wrote:
> Pass order to alloc_charge_folio() and update mTHP statistics.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> include/linux/huge_mm.h | 2 ++
> mm/huge_memory.c | 4 ++++
> mm/khugepaged.c | 13 +++++++++----
> 3 files changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 93e509b6c00e..8b6d0fed99b3 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -119,6 +119,8 @@ enum mthp_stat_item {
> MTHP_STAT_ANON_FAULT_ALLOC,
> MTHP_STAT_ANON_FAULT_FALLBACK,
> MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
> + MTHP_STAT_ANON_COLLAPSE_ALLOC,
> + MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED,
> MTHP_STAT_ZSWPOUT,
> MTHP_STAT_SWPIN,
> MTHP_STAT_SWPIN_FALLBACK,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2da5520bfe24..2e582fad4c77 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -615,6 +615,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
> DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
> DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
> DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> +DEFINE_MTHP_STAT_ATTR(anon_collapse_alloc, MTHP_STAT_ANON_COLLAPSE_ALLOC);
> +DEFINE_MTHP_STAT_ATTR(anon_collapse_alloc_failed, MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED);
> DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
> DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
> DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
> @@ -636,6 +638,8 @@ static struct attribute *anon_stats_attrs[] = {
> &anon_fault_alloc_attr.attr,
> &anon_fault_fallback_attr.attr,
> &anon_fault_fallback_charge_attr.attr,
> + &anon_collapse_alloc_attr.attr,
> + &anon_collapse_alloc_failed_attr.attr,
> #ifndef CONFIG_SHMEM
> &zswpout_attr.attr,
> &swpin_attr.attr,
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 95643e6e5f31..02cd424b8e48 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1073,21 +1073,26 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> }
>
> static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> - struct collapse_control *cc)
> + int order, struct collapse_control *cc)
> {
> gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> GFP_TRANSHUGE);
> int node = hpage_collapse_find_target_node(cc);
> struct folio *folio;
>
> - folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
> + folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
> if (!folio) {
> *foliop = NULL;
> count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> + if (order != HPAGE_PMD_ORDER)
> + count_mthp_stat(order, MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED);
Bug? We should be calling count_mthp_stat() for all orders, but only calling
count_vm_event(THP_*) for PMD_ORDER, as per pattern laid out by other mTHP stats.
The aim is for existing THP stats (which are implicitly only counting PMD-sized
THP) to continue only to count PMD-sized THP. It's a userspace ABI and we were
scared of the potential to break things if we changed the existing counters'
semantics.
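i.e. roughly (sketch only, keeping this patch's stat name for now):

	if (!folio) {
		*foliop = NULL;
		/* legacy counter stays PMD-only */
		if (order == HPAGE_PMD_ORDER)
			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
		/* mTHP counter covers every order, including PMD */
		count_mthp_stat(order, MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED);
		return SCAN_ALLOC_HUGE_PAGE_FAIL;
	}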
> return SCAN_ALLOC_HUGE_PAGE_FAIL;
> }
>
> count_vm_event(THP_COLLAPSE_ALLOC);
> + if (order != HPAGE_PMD_ORDER)
> + count_mthp_stat(order, MTHP_STAT_ANON_COLLAPSE_ALLOC);
Same problem.
Also, I agree with Baolin that we don't want "anon" in the title. This is a
generic path used for file-backed memory. So once you fix the bug, the new stats
will also be counting the file-backed memory too (although for now, only for
PMD_ORDER).
> +
> if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
> folio_put(folio);
> *foliop = NULL;
> @@ -1124,7 +1129,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> */
> mmap_read_unlock(mm);
>
> - result = alloc_charge_folio(&folio, mm, cc);
> + result = alloc_charge_folio(&folio, mm, order, cc);
Where is order coming from? I'm guessing that's added later, so this patch won't
compile on its own? Perhaps HPAGE_PMD_ORDER for now?
> if (result != SCAN_SUCCEED)
> goto out_nolock;
>
> @@ -1850,7 +1855,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>
> - result = alloc_charge_folio(&new_folio, mm, cc);
> + result = alloc_charge_folio(&new_folio, mm, HPAGE_PMD_ORDER, cc);
> if (result != SCAN_SUCCEED)
> goto out;
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio()
2024-12-17 4:17 ` Matthew Wilcox
@ 2024-12-17 7:09 ` Ryan Roberts
2024-12-17 13:00 ` Zi Yan
2024-12-20 17:41 ` Christoph Lameter (Ampere)
0 siblings, 2 replies; 74+ messages in thread
From: Ryan Roberts @ 2024-12-17 7:09 UTC (permalink / raw)
To: Matthew Wilcox, Dev Jain
Cc: akpm, david, kirill.shutemov, anshuman.khandual, catalin.marinas,
cl, vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel
On 17/12/2024 04:17, Matthew Wilcox wrote:
> On Mon, Dec 16, 2024 at 10:20:55PM +0530, Dev Jain wrote:
>> static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>> - struct collapse_control *cc)
>> + int order, struct collapse_control *cc)
>
> unsigned, surely?
I'm obviously feeling argumentative this morning...
There are plenty of examples of order being signed and unsigned in the code
base... it's a mess. Certainly the mTHP changes up to now have opted for
(signed) int. And get_order(), which I would assume to be the authority, returns
(signed) int.
Personally I prefer int for small positive integers; it's more compact. But if
we're trying to establish a pattern to use unsigned int for all new uses of
order, that's fine too; let's just document it somewhere?
>
>> if (!folio) {
>> *foliop = NULL;
>> count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
>> + if (order != HPAGE_PMD_ORDER)
>> + count_mthp_stat(order, MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED);
>
> i don't understand why we need new statistics here. we already have a
> signal that memory allocation failures are preventing collapse from
> being successful, why do we care if it's mthp or actual thp?
We previously decided that all existing THP stats would continue to only count
PMD-sized THP for fear of breaking userspace in subtle ways, and instead would
introduce new mTHP stats that can count for each order. We already have
MTHP_STAT_ANON_FAULT_ALLOC and MTHP_STAT_ANON_FAULT_FALLBACK (amongst others) so
these new stats fit the pattern well, IMHO.
(except for the bug that I called out in the other mail; we should call
count_mthp_stat() for all orders and call count_vm_event() only for PMD_ORDER).
>
>> count_vm_event(THP_COLLAPSE_ALLOC);
>> + if (order != HPAGE_PMD_ORDER)
>> + count_mthp_stat(order, MTHP_STAT_ANON_COLLAPSE_ALLOC);
>
> similar question
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio()
2024-12-17 6:53 ` Ryan Roberts
@ 2024-12-17 9:06 ` Dev Jain
0 siblings, 0 replies; 74+ messages in thread
From: Dev Jain @ 2024-12-17 9:06 UTC (permalink / raw)
To: Ryan Roberts, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 17/12/24 12:23 pm, Ryan Roberts wrote:
> On 16/12/2024 16:50, Dev Jain wrote:
>> Pass order to alloc_charge_folio() and update mTHP statistics.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> include/linux/huge_mm.h | 2 ++
>> mm/huge_memory.c | 4 ++++
>> mm/khugepaged.c | 13 +++++++++----
>> 3 files changed, 15 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 93e509b6c00e..8b6d0fed99b3 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -119,6 +119,8 @@ enum mthp_stat_item {
>> MTHP_STAT_ANON_FAULT_ALLOC,
>> MTHP_STAT_ANON_FAULT_FALLBACK,
>> MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
>> + MTHP_STAT_ANON_COLLAPSE_ALLOC,
>> + MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED,
>> MTHP_STAT_ZSWPOUT,
>> MTHP_STAT_SWPIN,
>> MTHP_STAT_SWPIN_FALLBACK,
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 2da5520bfe24..2e582fad4c77 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -615,6 +615,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
>> DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
>> DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
>> DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
>> +DEFINE_MTHP_STAT_ATTR(anon_collapse_alloc, MTHP_STAT_ANON_COLLAPSE_ALLOC);
>> +DEFINE_MTHP_STAT_ATTR(anon_collapse_alloc_failed, MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED);
>> DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
>> DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
>> DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
>> @@ -636,6 +638,8 @@ static struct attribute *anon_stats_attrs[] = {
>> &anon_fault_alloc_attr.attr,
>> &anon_fault_fallback_attr.attr,
>> &anon_fault_fallback_charge_attr.attr,
>> + &anon_collapse_alloc_attr.attr,
>> + &anon_collapse_alloc_failed_attr.attr,
>> #ifndef CONFIG_SHMEM
>> &zswpout_attr.attr,
>> &swpin_attr.attr,
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 95643e6e5f31..02cd424b8e48 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1073,21 +1073,26 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>> }
>>
>> static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>> - struct collapse_control *cc)
>> + int order, struct collapse_control *cc)
>> {
>> gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
>> GFP_TRANSHUGE);
>> int node = hpage_collapse_find_target_node(cc);
>> struct folio *folio;
>>
>> - folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
>> + folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
>> if (!folio) {
>> *foliop = NULL;
>> count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
>> + if (order != HPAGE_PMD_ORDER)
>> + count_mthp_stat(order, MTHP_STAT_ANON_COLLAPSE_ALLOC_FAILED);
> Bug? We should be calling count_mthp_stat() for all orders, but only calling
> count_vm_event(THP_*) for PMD_ORDER, as per pattern laid out by other mTHP stats.
Ah okay.
>
> The aim is for existing THP stats (which are implicitly only counting PMD-sized
> THP) to continue only to count PMD-sized THP. It's a userspace ABI and we were
> scared of the potential to break things if we changed the existing counters'
> semantics.
>
>> return SCAN_ALLOC_HUGE_PAGE_FAIL;
>> }
>>
>> count_vm_event(THP_COLLAPSE_ALLOC);
>> + if (order != HPAGE_PMD_ORDER)
>> + count_mthp_stat(order, MTHP_STAT_ANON_COLLAPSE_ALLOC);
> Same problem.
>
> Also, I agree with Baolin that we don't want "anon" in the title. This is a
> generic path used for file-backed memory. So once you fix the bug, the new stats
> will also be counting the file-backed memory too (although for now, only for
> PMD_ORDER).
Sure.
>> +
>> if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
>> folio_put(folio);
>> *foliop = NULL;
>> @@ -1124,7 +1129,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> */
>> mmap_read_unlock(mm);
>>
>> - result = alloc_charge_folio(&folio, mm, cc);
>> + result = alloc_charge_folio(&folio, mm, order, cc);
> Where is order coming from? I'm guessing that's added later, so this patch won't
> compile on its own? Perhaps HPAGE_PMD_ORDER for now?
Okay yes, this won't compile on its own. I'll ensure sequential buildability next time.
>
>> if (result != SCAN_SUCCEED)
>> goto out_nolock;
>>
>> @@ -1850,7 +1855,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>> VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>> VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>>
>> - result = alloc_charge_folio(&new_folio, mm, cc);
>> + result = alloc_charge_folio(&new_folio, mm, HPAGE_PMD_ORDER, cc);
>> if (result != SCAN_SUCCEED)
>> goto out;
>>
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
2024-12-16 17:06 ` David Hildenbrand
2024-12-16 19:08 ` Yang Shi
@ 2024-12-17 10:07 ` Dev Jain
2024-12-17 10:32 ` David Hildenbrand
2024-12-18 15:59 ` Dev Jain
2 siblings, 1 reply; 74+ messages in thread
From: Dev Jain @ 2024-12-17 10:07 UTC (permalink / raw)
To: David Hildenbrand, akpm, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel
On 16/12/24 10:36 pm, David Hildenbrand wrote:
> On 16.12.24 17:51, Dev Jain wrote:
>> In contrast to PMD-collapse, we do not need to operate on two levels
>> of pagetable
>> simultaneously. Therefore, downgrade the mmap lock from write to read
>> mode. Still
>> take the anon_vma lock in exclusive mode so as to not waste time in
>> the rmap path,
>> which is anyways going to fail since the PTEs are going to be
>> changed. Under the PTL,
>> copy page contents, clear the PTEs, remove folio pins, and (try to)
>> unmap the
>> old folios. Set the PTEs to the new folio using the set_ptes() API.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> Note: I have been trying hard to get rid of the locks in here: we
>> still are
>> taking the PTL around the page copying; dropping the PTL and taking
>> it after
>> the copying should lead to a deadlock, for example:
>> khugepaged madvise(MADV_COLD)
>> folio_lock() lock(ptl)
>> lock(ptl) folio_lock()
>>
>> We can create a locked folio list, altogether drop both the locks,
>> take the PTL,
>> do everything which __collapse_huge_page_isolate() does *except* the
>> isolation and
>> again try locking folios, but then it will reduce efficiency of
>> khugepaged
>> and almost looks like a forced solution :)
>> Please note the following discussion if anyone is interested:
>> https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/
>>
>> (Apologies for not CCing the mailing list from the start)
>>
>> mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++++++++++----------
>> 1 file changed, 87 insertions(+), 21 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 88beebef773e..8040b130e677 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -714,24 +714,28 @@ static void
>> __collapse_huge_page_copy_succeeded(pte_t *pte,
>> struct vm_area_struct *vma,
>> unsigned long address,
>> spinlock_t *ptl,
>> - struct list_head *compound_pagelist)
>> + struct list_head *compound_pagelist, int order)
>> {
>> struct folio *src, *tmp;
>> pte_t *_pte;
>> pte_t pteval;
>> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>> + for (_pte = pte; _pte < pte + (1UL << order);
>> _pte++, address += PAGE_SIZE) {
>> pteval = ptep_get(_pte);
>> if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>> if (is_zero_pfn(pte_pfn(pteval))) {
>> - /*
>> - * ptl mostly unnecessary.
>> - */
>> - spin_lock(ptl);
>> - ptep_clear(vma->vm_mm, address, _pte);
>> - spin_unlock(ptl);
>> + if (order == HPAGE_PMD_ORDER) {
>> + /*
>> + * ptl mostly unnecessary.
>> + */
>> + spin_lock(ptl);
>> + ptep_clear(vma->vm_mm, address, _pte);
>> + spin_unlock(ptl);
>> + } else {
>> + ptep_clear(vma->vm_mm, address, _pte);
>> + }
>> ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>> }
>> } else {
>> @@ -740,15 +744,20 @@ static void
>> __collapse_huge_page_copy_succeeded(pte_t *pte,
>> src = page_folio(src_page);
>> if (!folio_test_large(src))
>> release_pte_folio(src);
>> - /*
>> - * ptl mostly unnecessary, but preempt has to
>> - * be disabled to update the per-cpu stats
>> - * inside folio_remove_rmap_pte().
>> - */
>> - spin_lock(ptl);
>> - ptep_clear(vma->vm_mm, address, _pte);
>> - folio_remove_rmap_pte(src, src_page, vma);
>> - spin_unlock(ptl);
>> + if (order == HPAGE_PMD_ORDER) {
>> + /*
>> + * ptl mostly unnecessary, but preempt has to
>> + * be disabled to update the per-cpu stats
>> + * inside folio_remove_rmap_pte().
>> + */
>> + spin_lock(ptl);
>> + ptep_clear(vma->vm_mm, address, _pte);
>
>
>
>
>> + folio_remove_rmap_pte(src, src_page, vma);
>> + spin_unlock(ptl);
>> + } else {
>> + ptep_clear(vma->vm_mm, address, _pte);
>> + folio_remove_rmap_pte(src, src_page, vma);
>> + }
>
> As I've talked to Nico about this code recently ... :)
>
> Are you clearing the PTE after the copy succeeded? If so, where is the
> TLB flush?
>
>> How do you sync against concurrent write access + GUP-fast?
>
>
> The sequence really must be: (1) clear PTE/PMD + flush TLB (2) check
> if there are unexpected page references (e.g., GUP) if so back off (3)
> copy page content (4) set updated PTE/PMD.
Thanks...we need to ensure GUP-fast does not write while we are copying
the contents, so (2) will ensure that GUP-fast sees the cleared PTE and
backs off.
>
> To Nico, I suggested doing it simple initially, and still clear the
> high-level PMD entry + flush under mmap write lock, then re-map the
> PTE table after modifying the page table. It's not as efficient, but
> "harder to get wrong".
>
> Maybe that's already happening, but I stumbled over this clearing
> logic in __collapse_huge_page_copy_succeeded(), so I'm curious.
No, I am not even touching the PMD. I guess the sequence you described
should work? I just need to reverse the copying and PTE clearing order
to implement this sequence.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
2024-12-17 10:07 ` Dev Jain
@ 2024-12-17 10:32 ` David Hildenbrand
2024-12-18 8:35 ` Dev Jain
0 siblings, 1 reply; 74+ messages in thread
From: David Hildenbrand @ 2024-12-17 10:32 UTC (permalink / raw)
To: Dev Jain, akpm, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel
On 17.12.24 11:07, Dev Jain wrote:
>
> On 16/12/24 10:36 pm, David Hildenbrand wrote:
>> On 16.12.24 17:51, Dev Jain wrote:
>>> In contrast to PMD-collapse, we do not need to operate on two levels
>>> of pagetable
>>> simultaneously. Therefore, downgrade the mmap lock from write to read
>>> mode. Still
>>> take the anon_vma lock in exclusive mode so as to not waste time in
>>> the rmap path,
>>> which is anyways going to fail since the PTEs are going to be
>>> changed. Under the PTL,
>>> copy page contents, clear the PTEs, remove folio pins, and (try to)
>>> unmap the
>>> old folios. Set the PTEs to the new folio using the set_ptes() API.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>> Note: I have been trying hard to get rid of the locks in here: we
>>> still are
>>> taking the PTL around the page copying; dropping the PTL and taking
>>> it after
>>> the copying should lead to a deadlock, for example:
>>> khugepaged madvise(MADV_COLD)
>>> folio_lock() lock(ptl)
>>> lock(ptl) folio_lock()
>>>
>>> We can create a locked folio list, altogether drop both the locks,
>>> take the PTL,
>>> do everything which __collapse_huge_page_isolate() does *except* the
>>> isolation and
>>> again try locking folios, but then it will reduce efficiency of
>>> khugepaged
>>> and almost looks like a forced solution :)
>>> Please note the following discussion if anyone is interested:
>>> https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/
>>>
>>> (Apologies for not CCing the mailing list from the start)
>>>
>>> mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++++++++++----------
>>> 1 file changed, 87 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 88beebef773e..8040b130e677 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -714,24 +714,28 @@ static void
>>> __collapse_huge_page_copy_succeeded(pte_t *pte,
>>> struct vm_area_struct *vma,
>>> unsigned long address,
>>> spinlock_t *ptl,
>>> - struct list_head *compound_pagelist)
>>> + struct list_head *compound_pagelist, int order)
>>> {
>>> struct folio *src, *tmp;
>>> pte_t *_pte;
>>> pte_t pteval;
>>> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>>> + for (_pte = pte; _pte < pte + (1UL << order);
>>> _pte++, address += PAGE_SIZE) {
>>> pteval = ptep_get(_pte);
>>> if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>> add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>>> if (is_zero_pfn(pte_pfn(pteval))) {
>>> - /*
>>> - * ptl mostly unnecessary.
>>> - */
>>> - spin_lock(ptl);
>>> - ptep_clear(vma->vm_mm, address, _pte);
>>> - spin_unlock(ptl);
>>> + if (order == HPAGE_PMD_ORDER) {
>>> + /*
>>> + * ptl mostly unnecessary.
>>> + */
>>> + spin_lock(ptl);
>>> + ptep_clear(vma->vm_mm, address, _pte);
>>> + spin_unlock(ptl);
>>> + } else {
>>> + ptep_clear(vma->vm_mm, address, _pte);
>>> + }
>>> ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>>> }
>>> } else {
>>> @@ -740,15 +744,20 @@ static void
>>> __collapse_huge_page_copy_succeeded(pte_t *pte,
>>> src = page_folio(src_page);
>>> if (!folio_test_large(src))
>>> release_pte_folio(src);
>>> - /*
>>> - * ptl mostly unnecessary, but preempt has to
>>> - * be disabled to update the per-cpu stats
>>> - * inside folio_remove_rmap_pte().
>>> - */
>>> - spin_lock(ptl);
>>> - ptep_clear(vma->vm_mm, address, _pte);
>>> - folio_remove_rmap_pte(src, src_page, vma);
>>> - spin_unlock(ptl);
>>> + if (order == HPAGE_PMD_ORDER) {
>>> + /*
>>> + * ptl mostly unnecessary, but preempt has to
>>> + * be disabled to update the per-cpu stats
>>> + * inside folio_remove_rmap_pte().
>>> + */
>>> + spin_lock(ptl);
>>> + ptep_clear(vma->vm_mm, address, _pte);
>>
>>
>>
>>
>>> + folio_remove_rmap_pte(src, src_page, vma);
>>> + spin_unlock(ptl);
>>> + } else {
>>> + ptep_clear(vma->vm_mm, address, _pte);
>>> + folio_remove_rmap_pte(src, src_page, vma);
>>> + }
>>
>> As I've talked to Nico about this code recently ... :)
>>
>> Are you clearing the PTE after the copy succeeded? If so, where is the
>> TLB flush?
>>
>> How do you sync against concurrent write access + GUP-fast?
>>
>>
>> The sequence really must be: (1) clear PTE/PMD + flush TLB (2) check
>> if there are unexpected page references (e.g., GUP) if so back off (3)
>> copy page content (4) set updated PTE/PMD.
>
> Thanks...we need to ensure GUP-fast does not write while we are copying
> the contents, so (2) will ensure that GUP-fast sees the cleared PTE and
> backs off.
Yes, and of course, also that the CPU cannot still be concurrently modifying
the page content while/after you copy it, but before you
unmap+flush.
>>
>> To Nico, I suggested doing it simple initially, and still clear the
>> high-level PMD entry + flush under mmap write lock, then re-map the
>> PTE table after modifying the page table. It's not as efficient, but
>> "harder to get wrong".
>>
>> Maybe that's already happening, but I stumbled over this clearing
>> logic in __collapse_huge_page_copy_succeeded(), so I'm curious.
>
> No, I am not even touching the PMD. I guess the sequence you described
> should work? I just need to reverse the copying and PTE clearing order
> to implement this sequence.
That would work, but you really have to hold the PTL for the whole
period: from when you temporarily clear the PTEs + flush the TLB, while you
copy, until you re-insert the updated ones.
When having to back off (restore the original PTEs), or for copying, you'll
likely need access to the original PTEs, which were already cleared. So
likely you need a temporary copy of the original PTEs somehow.
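Roughly (sketch only; mm, vma, pte, addr, nr and refs_ok are placeholders,
and the PTL is assumed held across the whole window):

	pte_t orig[HPAGE_PMD_NR];	/* scratch copy; only the first nr entries used */
	int i;

	for (i = 0; i < nr; i++)
		orig[i] = ptep_get_and_clear(mm, addr + i * PAGE_SIZE, pte + i);
	flush_tlb_range(vma, addr, addr + nr * PAGE_SIZE);

	if (!refs_ok) {
		/* back off: restore exactly what was there before */
		for (i = 0; i < nr; i++)
			set_pte_at(mm, addr + i * PAGE_SIZE, pte + i, orig[i]);
	}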
That's why temporarily clearing the PMD under the mmap write lock is easier to
implement, at the cost of requiring the mmap lock in write mode like PMD
collapse.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio()
2024-12-17 7:09 ` Ryan Roberts
@ 2024-12-17 13:00 ` Zi Yan
2024-12-20 17:41 ` Christoph Lameter (Ampere)
1 sibling, 0 replies; 74+ messages in thread
From: Zi Yan @ 2024-12-17 13:00 UTC (permalink / raw)
To: Ryan Roberts
Cc: Matthew Wilcox, Dev Jain, akpm, david, kirill.shutemov,
anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, jglisse,
surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao,
linux-mm, linux-kernel
On 17 Dec 2024, at 2:09, Ryan Roberts wrote:
> On 17/12/2024 04:17, Matthew Wilcox wrote:
>> On Mon, Dec 16, 2024 at 10:20:55PM +0530, Dev Jain wrote:
>>> static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>>> - struct collapse_control *cc)
>>> + int order, struct collapse_control *cc)
>>
>> unsigned, surely?
>
> I'm obviously feeling argumentative this morning...
>
> There are plenty of examples of order being signed and unsigned in the code
> base... it's a mess. Certainly the mTHP changes up to now have opted for
> (signed) int. And get_order(), which I would assume to the authority, returns
> (signed) int.
>
> Personally I prefer int for small positive integers; it's more compact. But if
> we're trying to establish a pattern to use unsigned int for all new uses of
> order, that's fine too; let's just document it somewhere?
If unsigned is used, I wonder how to handle the
for (unsigned order = 9; order >= 0; order--) case. We will need a signed order
to make this work, right?
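To spell the hazard out (try_collapse() is just a placeholder):

	/* "order >= 0" is always true for an unsigned type, and 0-- wraps
	 * around to UINT_MAX, so this loop never terminates: */
	for (unsigned int order = 9; order >= 0; order--)
		try_collapse(order);

	/* whereas a signed counter keeps the natural form of the loop: */
	for (int order = 9; order >= 0; order--)
		try_collapse(order);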
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 03/12] khugepaged: Generalize hugepage_vma_revalidate()
2024-12-16 16:50 ` [RFC PATCH 03/12] khugepaged: Generalize hugepage_vma_revalidate() Dev Jain
2024-12-17 4:21 ` Matthew Wilcox
@ 2024-12-17 16:58 ` Ryan Roberts
1 sibling, 0 replies; 74+ messages in thread
From: Ryan Roberts @ 2024-12-17 16:58 UTC (permalink / raw)
To: Dev Jain, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 16/12/2024 16:50, Dev Jain wrote:
> After retaking the lock, it must be checked that the VMA is suitable for our
> scan order. Hence, generalize hugepage_vma_revalidate().
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> mm/khugepaged.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 02cd424b8e48..2f0601795471 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -918,7 +918,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
>
> static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> bool expect_anon,
> - struct vm_area_struct **vmap,
> + struct vm_area_struct **vmap, int order,
> struct collapse_control *cc)
> {
> struct vm_area_struct *vma;
> @@ -931,9 +931,9 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> if (!vma)
> return SCAN_VMA_NULL;
>
> - if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
> + if (!thp_vma_suitable_order(vma, address, order))
> return SCAN_ADDRESS_RANGE;
> - if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
> + if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, order))
> return SCAN_VMA_CHECK;
> /*
> * Anon VMA expected, the address may be unmapped then
> @@ -1134,7 +1134,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> goto out_nolock;
>
> mmap_read_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
> + result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
Some more compilation issues: replace order with HPAGE_PMD_ORDER.
> if (result != SCAN_SUCCEED) {
> mmap_read_unlock(mm);
> goto out_nolock;
> @@ -1168,7 +1168,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> * mmap_lock.
> */
> mmap_write_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
> + result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
and here.
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> @@ -2776,7 +2776,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> mmap_read_lock(mm);
> mmap_locked = true;
> result = hugepage_vma_revalidate(mm, addr, false, &vma,
> - cc);
> + HPAGE_PMD_ORDER, cc);
> if (result != SCAN_SUCCEED) {
> last_fail = result;
> goto out_nolock;
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 05/12] khugepaged: Generalize __collapse_huge_page_isolate()
2024-12-16 16:50 ` [RFC PATCH 05/12] khugepaged: Generalize __collapse_huge_page_isolate() Dev Jain
2024-12-17 4:32 ` Matthew Wilcox
@ 2024-12-17 17:09 ` Ryan Roberts
1 sibling, 0 replies; 74+ messages in thread
From: Ryan Roberts @ 2024-12-17 17:09 UTC (permalink / raw)
To: Dev Jain, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 16/12/2024 16:50, Dev Jain wrote:
> Scale down the scan range and the sysfs tunables according to the scan order,
> and isolate the folios.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> mm/khugepaged.c | 19 +++++++++++--------
> 1 file changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index f52dae7d5179..de044b1f83d4 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -564,15 +564,18 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> unsigned long address,
> pte_t *pte,
> struct collapse_control *cc,
> - struct list_head *compound_pagelist)
> + struct list_head *compound_pagelist, int order)
> {
> - struct page *page = NULL;
> - struct folio *folio = NULL;
> - pte_t *_pte;
> + unsigned int max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
> + unsigned int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
This is implicitly rounding down. I think that's the right thing to do; it's
better to be conservative.
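For example, assuming the usual 4K-page defaults (HPAGE_PMD_ORDER == 9,
khugepaged_max_ptes_none == 511, khugepaged_max_ptes_shared == 256), an order-4
scan ends up with 511 >> 5 == 15 and 256 >> 5 == 8; the shift drops any
fractional part, so the scaled limits only ever err on the strict side.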
> int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> + struct folio *folio = NULL;
> + struct page *page = NULL;
> bool writable = false;
> + pte_t *_pte;
>
> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> +
> + for (_pte = pte; _pte < pte + (1UL << order);
> _pte++, address += PAGE_SIZE) {
> pte_t pteval = ptep_get(_pte);
> if (pte_none(pteval) || (pte_present(pteval) &&
> @@ -580,7 +583,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> ++none_or_zero;
> if (!userfaultfd_armed(vma) &&
> (!cc->is_khugepaged ||
> - none_or_zero <= khugepaged_max_ptes_none)) {
> + none_or_zero <= max_ptes_none)) {
> continue;
> } else {
> result = SCAN_EXCEED_NONE_PTE;
> @@ -609,7 +612,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> if (folio_likely_mapped_shared(folio)) {
> ++shared;
> if (cc->is_khugepaged &&
> - shared > khugepaged_max_ptes_shared) {
> + shared > max_ptes_shared) {
> result = SCAN_EXCEED_SHARED_PTE;
> count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> goto out;
> @@ -1200,7 +1203,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> if (pte) {
> result = __collapse_huge_page_isolate(vma, address, pte, cc,
> - &compound_pagelist);
> + &compound_pagelist, order);
> spin_unlock(pte_ptl);
> } else {
> result = SCAN_PMD_NULL;
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 05/12] khugepaged: Generalize __collapse_huge_page_isolate()
2024-12-17 6:41 ` Dev Jain
@ 2024-12-17 17:14 ` Ryan Roberts
0 siblings, 0 replies; 74+ messages in thread
From: Ryan Roberts @ 2024-12-17 17:14 UTC (permalink / raw)
To: Dev Jain, Matthew Wilcox, g
Cc: akpm, david, kirill.shutemov, anshuman.khandual, catalin.marinas,
cl, vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel
On 17/12/2024 06:41, Dev Jain wrote:
>
> On 17/12/24 10:02 am, Matthew Wilcox wrote:
>> On Mon, Dec 16, 2024 at 10:20:58PM +0530, Dev Jain wrote:
>>> {
>>> - struct page *page = NULL;
>>> - struct folio *folio = NULL;
>>> - pte_t *_pte;
>>> + unsigned int max_ptes_shared = khugepaged_max_ptes_shared >>
>>> (HPAGE_PMD_ORDER - order);
>>> + unsigned int max_ptes_none = khugepaged_max_ptes_none >>
>>> (HPAGE_PMD_ORDER - order);
>>> int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
>>> + struct folio *folio = NULL;
>>> + struct page *page = NULL;
>> why are you moving variables around unnecessarily?
>
> In a previous piece of work, I moved code around and David suggested arranging the
> declarations in reverse Xmas tree order. I guess (?) that was not considered to be
> spoiling git history, so if this feels like that, I will revert.
>
>>
>>> bool writable = false;
>>> + pte_t *_pte;
>>> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>>> +
>>> + for (_pte = pte; _pte < pte + (1UL << order);
>> spurious blank line
>
> My bad
>
>>
>>
>> also you might first want to finish off the page->folio conversion in
>> this function first; we have a vm_normal_folio() now.
>
> I did not add any code before we derive the folio...I'm sorry, I don't get what
> you mean...
>
I think Matthew is suggesting helping out with the page->folio conversion work
while you are working on this function. I think it would amount to a patch that
does something like this:
----8<-----
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ffc4d5aef991..d94e05754140 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -568,7 +568,6 @@ static int __collapse_huge_page_isolate(struct
vm_area_struct *vma,
unsigned int max_ptes_none = khugepaged_max_ptes_none >>
(HPAGE_PMD_ORDER - order);
int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
struct folio *folio = NULL;
- struct page *page = NULL;
bool writable = false;
pte_t *_pte;
@@ -597,13 +596,12 @@ static int __collapse_huge_page_isolate(struct
vm_area_struct *vma,
result = SCAN_PTE_UFFD_WP;
goto out;
}
- page = vm_normal_page(vma, address, pteval);
- if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
+ folio = vm_normal_folio(vma, address, pteval);
+ if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
result = SCAN_PAGE_NULL;
goto out;
}
- folio = page_folio(page);
VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
if (order !=HPAGE_PMD_ORDER && folio_order(folio) >= order) {
----8<-----
^ permalink raw reply related [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 06/12] khugepaged: Generalize __collapse_huge_page_copy_failed()
2024-12-16 16:50 ` [RFC PATCH 06/12] khugepaged: Generalize __collapse_huge_page_copy_failed() Dev Jain
@ 2024-12-17 17:22 ` Ryan Roberts
2024-12-18 8:49 ` Dev Jain
0 siblings, 1 reply; 74+ messages in thread
From: Ryan Roberts @ 2024-12-17 17:22 UTC (permalink / raw)
To: Dev Jain, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 16/12/2024 16:50, Dev Jain wrote:
> Upon failure, we repopulate the PMD in the case of PMD-THP collapse. Hence, make
> this logic specific to the PMD case.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> mm/khugepaged.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index de044b1f83d4..886c76816963 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -766,7 +766,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> pmd_t *pmd,
> pmd_t orig_pmd,
> struct vm_area_struct *vma,
> - struct list_head *compound_pagelist)
> + struct list_head *compound_pagelist, int order)
nit: suggest putting order on its own line.
> {
> spinlock_t *pmd_ptl;
>
> @@ -776,14 +776,16 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> * pages. Since pages are still isolated and locked here,
> * acquiring anon_vma_lock_write is unnecessary.
> */
> - pmd_ptl = pmd_lock(vma->vm_mm, pmd);
> - pmd_populate(vma->vm_mm, pmd, pmd_pgtable(orig_pmd));
> - spin_unlock(pmd_ptl);
> + if (order == HPAGE_PMD_ORDER) {
> + pmd_ptl = pmd_lock(vma->vm_mm, pmd);
> + pmd_populate(vma->vm_mm, pmd, pmd_pgtable(orig_pmd));
> + spin_unlock(pmd_ptl);
> + }
> /*
> * Release both raw and compound pages isolated
> * in __collapse_huge_page_isolate.
> */
> - release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> + release_pte_pages(pte, pte + (1UL << order), compound_pagelist);
> }
Given this function is clearly so geared towards re-establishing the pmd, given
that it takes the *pmd and orig_pmd as params, and given that in the
non-pmd-order case, we only call through to release_pte_pages(), I wonder if
it's better to make the decision at a higher level and either call this function
or release_pte_pages() directly? No strong opinion, just looks a bit weird at
the moment.
>
> /*
> @@ -834,7 +836,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> compound_pagelist);
> else
> __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> - compound_pagelist);
> + compound_pagelist, order);
>
> return result;
> }
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 01/12] khugepaged: Rename hpage_collapse_scan_pmd() -> ptes()
2024-12-17 6:43 ` Ryan Roberts
@ 2024-12-17 18:11 ` Zi Yan
2024-12-17 19:12 ` Ryan Roberts
0 siblings, 1 reply; 74+ messages in thread
From: Zi Yan @ 2024-12-17 18:11 UTC (permalink / raw)
To: Ryan Roberts, Dev Jain
Cc: Matthew Wilcox, akpm, david, kirill.shutemov, anshuman.khandual,
catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
peterx, ioworker0, wangkefeng.wang, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel
On 17 Dec 2024, at 1:43, Ryan Roberts wrote:
> On 17/12/2024 04:18, Matthew Wilcox wrote:
>> On Mon, Dec 16, 2024 at 10:20:54PM +0530, Dev Jain wrote:
>>> -static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>> +static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>
>> i don't think this is necessary at all. you're scanning a pmd.
>> you might not be scanning in order to collapse to a pmd, but pmd
>> is the level you're scanning at.
>>
>
> Sorry Matthew, I don't really understand this statement. Prior to the change we
> were scanning all PTE entries in a PTE table with the aim of collapsing to a PMD
> entry. After the change we are scanning some PTE entries in a PTE table with the
> aim of collapsing to either to a multi-PTE-mapped folio or a single-PMD-mapped
> folio.
>
> So personally I think "scan_pmd" was a misnomer even before the change - we are
> scanning the ptes.
But there is still a lot of scan_pmd code in the function, for example
VM_BUG_ON(address & ~HPAGE_PMD_MASK) and _pte < pte + HPAGE_PMD_NR.
These need to be changed along with the function renaming. If after the change only
a subset of PTEs are scanned within a PMD, maybe a scan_range parameter can be
added.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise
2024-12-16 16:51 ` [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise Dev Jain
@ 2024-12-17 18:15 ` Ryan Roberts
2024-12-18 9:24 ` Dev Jain
2025-01-06 10:04 ` Usama Arif
1 sibling, 1 reply; 74+ messages in thread
From: Ryan Roberts @ 2024-12-17 18:15 UTC (permalink / raw)
To: Dev Jain, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 16/12/2024 16:51, Dev Jain wrote:
> Scan the PTEs order-wise, using the mask of suitable orders for this VMA
> derived in conjunction with sysfs THP settings. Scale down the tunables; in
> case of collapse failure, we drop down to the next order. Otherwise, we try to
> jump to the highest possible order and then start a fresh scan. Note that
> madvise(MADV_COLLAPSE) has not been generalized.
Is there a reason you are not modifying MADV_COLLAPSE? It's really just a
synchronous way to do what khugepaged does asynchronously (isn't it?), so it would
behave the same way in an ideal world.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> mm/khugepaged.c | 84 ++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 69 insertions(+), 15 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 886c76816963..078794aa3335 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -20,6 +20,7 @@
> #include <linux/swapops.h>
> #include <linux/shmem_fs.h>
> #include <linux/ksm.h>
> +#include <linux/count_zeros.h>
>
> #include <asm/tlb.h>
> #include <asm/pgalloc.h>
> @@ -1111,7 +1112,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> }
>
> static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> - int referenced, int unmapped,
> + int referenced, int unmapped, int order,
> struct collapse_control *cc)
> {
> LIST_HEAD(compound_pagelist);
> @@ -1278,38 +1279,59 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
> unsigned long address, bool *mmap_locked,
> struct collapse_control *cc)
> {
> - pmd_t *pmd;
> - pte_t *pte, *_pte;
> - int result = SCAN_FAIL, referenced = 0;
> - int none_or_zero = 0, shared = 0;
> - struct page *page = NULL;
> + unsigned int max_ptes_shared, max_ptes_none, max_ptes_swap;
> + int referenced, shared, none_or_zero, unmapped;
> + unsigned long _address, org_address = address;
nit: Perhaps it's clearer to keep the original address in address and use a
variable, start, for the starting point of each scan?
> struct folio *folio = NULL;
> - unsigned long _address;
> - spinlock_t *ptl;
> - int node = NUMA_NO_NODE, unmapped = 0;
> + struct page *page = NULL;
> + int node = NUMA_NO_NODE;
> + int result = SCAN_FAIL;
> bool writable = false;
> + unsigned long orders;
> + pte_t *pte, *_pte;
> + spinlock_t *ptl;
> + pmd_t *pmd;
> + int order;
>
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> + orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER + 1) - 1);
Perhaps THP_ORDERS_ALL instead of "BIT(PMD_ORDER + 1) - 1"?
> + orders = thp_vma_suitable_orders(vma, address, orders);
> + order = highest_order(orders);
> +
> + /* MADV_COLLAPSE needs to work irrespective of sysfs setting */
> + if (!cc->is_khugepaged)
> + order = HPAGE_PMD_ORDER;
> +
> +scan_pte_range:
> +
> + max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
> + max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> + max_ptes_swap = khugepaged_max_ptes_swap >> (HPAGE_PMD_ORDER - order);
> + referenced = 0, shared = 0, none_or_zero = 0, unmapped = 0;
> +
> + /* Check pmd after taking mmap lock */
> result = find_pmd_or_thp_or_none(mm, address, &pmd);
> if (result != SCAN_SUCCEED)
> goto out;
>
> memset(cc->node_load, 0, sizeof(cc->node_load));
> nodes_clear(cc->alloc_nmask);
> +
> pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> if (!pte) {
> result = SCAN_PMD_NULL;
> goto out;
> }
>
> - for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> + for (_address = address, _pte = pte; _pte < pte + (1UL << order);
> _pte++, _address += PAGE_SIZE) {
> pte_t pteval = ptep_get(_pte);
> if (is_swap_pte(pteval)) {
> ++unmapped;
> if (!cc->is_khugepaged ||
> - unmapped <= khugepaged_max_ptes_swap) {
> + unmapped <= max_ptes_swap) {
> /*
> * Always be strict with uffd-wp
> * enabled swap entries. Please see
> @@ -1330,7 +1352,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
> ++none_or_zero;
> if (!userfaultfd_armed(vma) &&
> (!cc->is_khugepaged ||
> - none_or_zero <= khugepaged_max_ptes_none)) {
> + none_or_zero <= max_ptes_none)) {
> continue;
> } else {
> result = SCAN_EXCEED_NONE_PTE;
> @@ -1375,7 +1397,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
> if (folio_likely_mapped_shared(folio)) {
> ++shared;
> if (cc->is_khugepaged &&
> - shared > khugepaged_max_ptes_shared) {
> + shared > max_ptes_shared) {
> result = SCAN_EXCEED_SHARED_PTE;
> count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> goto out_unmap;
> @@ -1432,7 +1454,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
> result = SCAN_PAGE_RO;
> } else if (cc->is_khugepaged &&
> (!referenced ||
> - (unmapped && referenced < HPAGE_PMD_NR / 2))) {
> + (unmapped && referenced < (1UL << order) / 2))) {
> result = SCAN_LACK_REFERENCED_PAGE;
> } else {
> result = SCAN_SUCCEED;
> @@ -1441,9 +1463,41 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
> pte_unmap_unlock(pte, ptl);
> if (result == SCAN_SUCCEED) {
> result = collapse_huge_page(mm, address, referenced,
> - unmapped, cc);
> + unmapped, order, cc);
> /* collapse_huge_page will return with the mmap_lock released */
> *mmap_locked = false;
> +
> + /* Immediately exit on exhaustion of range */
> + if (_address == org_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
> + goto out;
Looks like this assumes this function is always asked to scan a full PTE table?
Does that mean that you can't handle collapse for VMAs that don't span a whole
PMD entry? I think we will want to support that.
> + }
> + if (result != SCAN_SUCCEED) {
> +
> + /* Go to the next order. */
> + order = next_order(&orders, order);
> + if (order < 2)
This should be:
if (!orders)
I think the return order is undefined when order is the last order in orders.
> + goto out;
> + goto maybe_mmap_lock;
> + } else {
> + address = _address;
> + pte = _pte;
> +
> +
> + /* Get highest order possible starting from address */
> + order = count_trailing_zeros(address >> PAGE_SHIFT);
> +
> + /* This needs to be present in the mask too */
> + if (!(orders & (1UL << order)))
> + order = next_order(&orders, order);
Not quite; if the exact order isn't in the bitmap, this will pick out the
highest order in the bitmap, which may be higher than count_trailing_zeros()
returned. You could do:
order = count_trailing_zeros(address >> PAGE_SHIFT);
orders &= (1UL << order + 1) - 1;
order = next_order(&orders, order);
if (!orders)
goto out;
That will mask out any orders that are bigger than the one returned by
count_trailing_zeros(); next_order() will then return the highest order in the
remaining set.
But even that doesn't quite work, because next_order() is destructive. Once you
arrive on a higher-order address boundary, you want to be able to select a
higher order from the original orders bitmap, but those orders have already
been cleared on a previous trip around the loop.
Perhaps stash orig_orders at the top of the function when you first calculate
it. Then I think this works (totally untested):
order = count_trailing_zeros(address >> PAGE_SHIFT);
orders = orig_orders & (1UL << order + 1) - 1;
order = next_order(&orders, order);
if (!orders)
goto out;
You might want to do something like this for the first go around the loop, but I
think address is currently always at the start of the PMD on entry, so not
needed until that restriction is removed.
> + if (order < 2)
> + goto out;
> +
> +maybe_mmap_lock:
> + if (!(*mmap_locked)) {
> + mmap_read_lock(mm);
Given the lock was already held in read mode on entering this function, then
released by collapse_huge_page(), is it definitely safe to retake this lock and
rerun this function? Is it possible that state which was checked before entering
this function has changed since the lock was released and would now need
re-checking?
> + *mmap_locked = true;
> + }
> + goto scan_pte_range;
> }
> out:
> trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 01/12] khugepaged: Rename hpage_collapse_scan_pmd() -> ptes()
2024-12-17 18:11 ` Zi Yan
@ 2024-12-17 19:12 ` Ryan Roberts
0 siblings, 0 replies; 74+ messages in thread
From: Ryan Roberts @ 2024-12-17 19:12 UTC (permalink / raw)
To: Zi Yan, Dev Jain
Cc: Matthew Wilcox, akpm, david, kirill.shutemov, anshuman.khandual,
catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
peterx, ioworker0, wangkefeng.wang, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel
On 17/12/2024 18:11, Zi Yan wrote:
> On 17 Dec 2024, at 1:43, Ryan Roberts wrote:
>
>> On 17/12/2024 04:18, Matthew Wilcox wrote:
>>> On Mon, Dec 16, 2024 at 10:20:54PM +0530, Dev Jain wrote:
>>>> -static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>> +static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>>
>>> i don't think this is necessary at all. you're scanning a pmd.
>>> you might not be scanning in order to collapse to a pmd, but pmd
>>> is the level you're scanning at.
>>>
>>
>> Sorry Matthew, I don't really understand this statement. Prior to the change we
>> were scanning all PTE entries in a PTE table with the aim of collapsing to a PMD
>> entry. After the change we are scanning some PTE entries in a PTE table with the
>> aim of collapsing to either to a multi-PTE-mapped folio or a single-PMD-mapped
>> folio.
>>
>> So personally I think "scan_pmd" was a misnomer even before the change - we are
>> scanning the ptes.
>
> But there is still a lot of scan_pmd code in the function, for example
> VM_BUG_ON(address & ~HPAGE_PMD_MASK) and _pte < pte + HPAGE_PMD_NR.
> These need to be changed along with the function renaming. If after the change only
> a subset of PTEs are scanned within a PMD, maybe a scan_range parameter can be
> added.
Oh I see; I think your and Matthew's point is that we are scanning a
"PMD-entry's worth of PTEs". Looking at it like that, I guess I understand why
"scan_pmd" makes sense. Fair enough.
>
> Best Regards,
> Yan, Zi
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 08/12] khugepaged: Abstract PMD-THP collapse
2024-12-16 16:51 ` [RFC PATCH 08/12] khugepaged: Abstract PMD-THP collapse Dev Jain
@ 2024-12-17 19:24 ` Ryan Roberts
2024-12-18 9:26 ` Dev Jain
0 siblings, 1 reply; 74+ messages in thread
From: Ryan Roberts @ 2024-12-17 19:24 UTC (permalink / raw)
To: Dev Jain, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 16/12/2024 16:51, Dev Jain wrote:
> Abstract away taking the mmap_lock exclusively, copying page contents, and
> setting the PMD, into vma_collapse_anon_folio_pmd().
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> mm/khugepaged.c | 119 +++++++++++++++++++++++++++---------------------
> 1 file changed, 66 insertions(+), 53 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 078794aa3335..88beebef773e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1111,58 +1111,17 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> return SCAN_SUCCEED;
> }
>
> -static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> - int referenced, int unmapped, int order,
> - struct collapse_control *cc)
> +static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long address,
> + struct vm_area_struct *vma, struct collapse_control *cc, pmd_t *pmd,
> + struct folio *folio)
> {
> + struct mmu_notifier_range range;
> + spinlock_t *pmd_ptl, *pte_ptl;
> LIST_HEAD(compound_pagelist);
> - pmd_t *pmd, _pmd;
> - pte_t *pte;
> pgtable_t pgtable;
> - struct folio *folio;
> - spinlock_t *pmd_ptl, *pte_ptl;
> - int result = SCAN_FAIL;
> - struct vm_area_struct *vma;
> - struct mmu_notifier_range range;
> -
> - VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -
> - /*
> - * Before allocating the hugepage, release the mmap_lock read lock.
> - * The allocation can take potentially a long time if it involves
> - * sync compaction, and we do not need to hold the mmap_lock during
> - * that. We will recheck the vma after taking it again in write mode.
> - */
> - mmap_read_unlock(mm);
> -
> - result = alloc_charge_folio(&folio, mm, order, cc);
> - if (result != SCAN_SUCCEED)
> - goto out_nolock;
> -
> - mmap_read_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
> - if (result != SCAN_SUCCEED) {
> - mmap_read_unlock(mm);
> - goto out_nolock;
> - }
> -
> - result = find_pmd_or_thp_or_none(mm, address, &pmd);
> - if (result != SCAN_SUCCEED) {
> - mmap_read_unlock(mm);
> - goto out_nolock;
> - }
> -
> - if (unmapped) {
> - /*
> - * __collapse_huge_page_swapin will return with mmap_lock
> - * released when it fails. So we jump out_nolock directly in
> - * that case. Continuing to collapse causes inconsistency.
> - */
> - result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> - referenced, order);
> - if (result != SCAN_SUCCEED)
> - goto out_nolock;
> - }
> + int result;
> + pmd_t _pmd;
> + pte_t *pte;
>
> mmap_read_unlock(mm);
> /*
> @@ -1174,7 +1133,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> * mmap_lock.
> */
> mmap_write_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
> +
> + result = hugepage_vma_revalidate(mm, address, true, &vma, HPAGE_PMD_ORDER, cc);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> @@ -1206,7 +1166,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> if (pte) {
> result = __collapse_huge_page_isolate(vma, address, pte, cc,
> - &compound_pagelist, order);
> + &compound_pagelist, HPAGE_PMD_ORDER);
> spin_unlock(pte_ptl);
> } else {
> result = SCAN_PMD_NULL;
> @@ -1262,11 +1222,64 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> deferred_split_folio(folio, false);
> spin_unlock(pmd_ptl);
>
> - folio = NULL;
> -
> result = SCAN_SUCCEED;
> out_up_write:
> mmap_write_unlock(mm);
> + return result;
> +}
> +
> +static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> + int referenced, int unmapped, int order,
> + struct collapse_control *cc)
> +{
> + struct vm_area_struct *vma;
> + int result = SCAN_FAIL;
> + struct folio *folio;
> + pmd_t *pmd;
> +
> + /*
> + * Before allocating the hugepage, release the mmap_lock read lock.
> + * The allocation can take potentially a long time if it involves
> + * sync compaction, and we do not need to hold the mmap_lock during
> + * that. We will recheck the vma after taking it again in write mode.
> + */
> + mmap_read_unlock(mm);
> +
> + result = alloc_charge_folio(&folio, mm, order, cc);
> + if (result != SCAN_SUCCEED)
> + goto out_nolock;
> +
> + mmap_read_lock(mm);
> + result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
> + if (result != SCAN_SUCCEED) {
> + mmap_read_unlock(mm);
> + goto out_nolock;
> + }
> +
> + result = find_pmd_or_thp_or_none(mm, address, &pmd);
> + if (result != SCAN_SUCCEED) {
> + mmap_read_unlock(mm);
> + goto out_nolock;
> + }
> +
> + if (unmapped) {
> + /*
> + * __collapse_huge_page_swapin will return with mmap_lock
> + * released when it fails. So we jump out_nolock directly in
> + * that case. Continuing to collapse causes inconsistency.
> + */
> + result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> + referenced, order);
> + if (result != SCAN_SUCCEED)
> + goto out_nolock;
> + }
> +
> + if (order == HPAGE_PMD_ORDER)
> + result = vma_collapse_anon_folio_pmd(mm, address, vma, cc, pmd, folio);
I think the locking is broken here? collapse_huge_page() used to enter with the
mmap read lock and exit without the lock held at all. After the change, this is
only true for order == HPAGE_PMD_ORDER. For other orders, you exit with the mmap
read lock still held. Perhaps:
else
mmap_read_unlock(mm);
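In context, the tail of collapse_huge_page() would then look something like this
(an untested sketch of the suggestion, not a tested patch):

	if (order == HPAGE_PMD_ORDER)
		result = vma_collapse_anon_folio_pmd(mm, address, vma, cc, pmd, folio);
	else
		mmap_read_unlock(mm);	/* keep the "returns with mmap_lock released" contract */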
> +
> + if (result == SCAN_SUCCEED)
> + folio = NULL;
> +
> out_nolock:
> if (folio)
> folio_put(folio);
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped
2024-12-16 16:51 ` [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped Dev Jain
@ 2024-12-18 7:36 ` Ryan Roberts
2024-12-18 9:34 ` Dev Jain
0 siblings, 1 reply; 74+ messages in thread
From: Ryan Roberts @ 2024-12-18 7:36 UTC (permalink / raw)
To: Dev Jain, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 16/12/2024 16:51, Dev Jain wrote:
> We may hit a situation wherein we have a larger folio mapped. It is incorrect
> to go ahead with the collapse since some pages will be unmapped, leading to
> the entire folio getting unmapped. Therefore, skip the corresponding range.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> In the future, if it is ever required that all the folios in the system be of a
> specific order, we may split these larger folios.
>
> mm/khugepaged.c | 31 +++++++++++++++++++++++++++++++
> 1 file changed, 31 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 8040b130e677..47e7c476b893 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -33,6 +33,7 @@ enum scan_result {
> SCAN_PMD_NULL,
> SCAN_PMD_NONE,
> SCAN_PMD_MAPPED,
> + SCAN_PTE_MAPPED,
> SCAN_EXCEED_NONE_PTE,
> SCAN_EXCEED_SWAP_PTE,
> SCAN_EXCEED_SHARED_PTE,
> @@ -609,6 +610,11 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> folio = page_folio(page);
> VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>
> + if (order !=HPAGE_PMD_ORDER && folio_order(folio) >= order) {
> + result = SCAN_PTE_MAPPED;
> + goto out;
> + }
> +
> /* See hpage_collapse_scan_ptes(). */
> if (folio_likely_mapped_shared(folio)) {
> ++shared;
> @@ -1369,6 +1375,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
> unsigned long orders;
> pte_t *pte, *_pte;
> spinlock_t *ptl;
> + int found_order;
> pmd_t *pmd;
> int order;
>
> @@ -1467,6 +1474,24 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
> goto out_unmap;
> }
>
> + found_order = folio_order(folio);
> +
> + /*
> + * No point of scanning. Two options: if this folio was hit
> + * somewhere in the middle of the scan, then drop down the
> + * order. Or, completely skip till the end of this folio. The
> + * latter gives us a higher order to start with, with atmost
> + * 1 << order PTEs not collapsed; the former may force us
> + * to end up going below order 2 and exiting.
> + */
> + if (order != HPAGE_PMD_ORDER && found_order >= order) {
> + result = SCAN_PTE_MAPPED;
> + _address += (PAGE_SIZE << found_order);
> + _pte += (1UL << found_order);
> + pte_unmap_unlock(pte, ptl);
> + goto decide_order;
> + }
It would be good if you could spell out the desired policy for when khugepaged hits
partially unmapped large folios and unaligned large folios. I think the simple
approach is to always collapse them to fully mapped, aligned folios even if the
resulting order is smaller than the original. But I'm not sure that's definitely
going to always be the best thing.
Regardless, I'm struggling to understand the logic in this patch. Taking the
order of a folio based on having hit one of its pages says nothing about
whether the whole of that folio is mapped, or about its alignment. And it's not
clear to me how we would get to a situation where we are scanning for a lower
order and find a (fully mapped, aligned) folio of higher order in the first place.
Let's assume the desired policy is that khugepaged should always collapse to
naturally aligned large folios. If there happens to be an existing aligned
order-4 folio that is fully mapped, we will identify that for collapse as part
of the scan for order-4. At that point, we should just notice that it is already
an aligned order-4 folio and bypass collapse. Of course we may have already
chosen to collapse it into a higher order, but we should definitely not get to a
lower order before we notice it.
Hmm... I guess if the sysfs thp settings have been changed then things could get
spicy... if order-8 was previously enabled and we have an order-8 folio, then it
gets disabled and khugepaged is scanning for order-4 (which is still enabled)
and hits the order-8; what's the expected policy? Rework it into order-4 folios,
or leave it as a single order-8?
> +
> /*
> * We treat a single page as shared if any part of the THP
> * is shared. "False negatives" from
> @@ -1550,6 +1575,10 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
> if (_address == org_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
> goto out;
> }
> + /* A larger folio was mapped; it will be skipped in next iteration */
> + if (result == SCAN_PTE_MAPPED)
> + goto decide_order;
> +
> if (result != SCAN_SUCCEED) {
>
> /* Go to the next order. */
> @@ -1558,6 +1587,8 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
> goto out;
> goto maybe_mmap_lock;
> } else {
> +
> +decide_order:
> address = _address;
> pte = _pte;
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
2024-12-17 10:32 ` David Hildenbrand
@ 2024-12-18 8:35 ` Dev Jain
2025-01-02 10:08 ` Dev Jain
2025-01-02 11:22 ` David Hildenbrand
0 siblings, 2 replies; 74+ messages in thread
From: Dev Jain @ 2024-12-18 8:35 UTC (permalink / raw)
To: David Hildenbrand, akpm, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel
On 17/12/24 4:02 pm, David Hildenbrand wrote:
> On 17.12.24 11:07, Dev Jain wrote:
>>
>> On 16/12/24 10:36 pm, David Hildenbrand wrote:
>>> On 16.12.24 17:51, Dev Jain wrote:
>>>> In contrast to PMD-collapse, we do not need to operate on two levels
>>>> of pagetable
>>>> simultaneously. Therefore, downgrade the mmap lock from write to read
>>>> mode. Still
>>>> take the anon_vma lock in exclusive mode so as to not waste time in
>>>> the rmap path,
>>>> which is anyways going to fail since the PTEs are going to be
>>>> changed. Under the PTL,
>>>> copy page contents, clear the PTEs, remove folio pins, and (try to)
>>>> unmap the
>>>> old folios. Set the PTEs to the new folio using the set_ptes() API.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>> Note: I have been trying hard to get rid of the locks in here: we
>>>> still are
>>>> taking the PTL around the page copying; dropping the PTL and taking
>>>> it after
>>>> the copying should lead to a deadlock, for example:
>>>> khugepaged madvise(MADV_COLD)
>>>> folio_lock() lock(ptl)
>>>> lock(ptl) folio_lock()
>>>>
>>>> We can create a locked folio list, altogether drop both the locks,
>>>> take the PTL,
>>>> do everything which __collapse_huge_page_isolate() does *except* the
>>>> isolation and
>>>> again try locking folios, but then it will reduce efficiency of
>>>> khugepaged
>>>> and almost looks like a forced solution :)
>>>> Please note the following discussion if anyone is interested:
>>>> https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/
>>>>
>>>>
>>>> (Apologies for not CCing the mailing list from the start)
>>>>
>>>> mm/khugepaged.c | 108
>>>> ++++++++++++++++++++++++++++++++++++++----------
>>>> 1 file changed, 87 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 88beebef773e..8040b130e677 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -714,24 +714,28 @@ static void
>>>> __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>> struct vm_area_struct *vma,
>>>> unsigned long address,
>>>> spinlock_t *ptl,
>>>> - struct list_head *compound_pagelist)
>>>> + struct list_head *compound_pagelist, int
>>>> order)
>>>> {
>>>> struct folio *src, *tmp;
>>>> pte_t *_pte;
>>>> pte_t pteval;
>>>> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>> + for (_pte = pte; _pte < pte + (1UL << order);
>>>> _pte++, address += PAGE_SIZE) {
>>>> pteval = ptep_get(_pte);
>>>> if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>>> add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>>>> if (is_zero_pfn(pte_pfn(pteval))) {
>>>> - /*
>>>> - * ptl mostly unnecessary.
>>>> - */
>>>> - spin_lock(ptl);
>>>> - ptep_clear(vma->vm_mm, address, _pte);
>>>> - spin_unlock(ptl);
>>>> + if (order == HPAGE_PMD_ORDER) {
>>>> + /*
>>>> + * ptl mostly unnecessary.
>>>> + */
>>>> + spin_lock(ptl);
>>>> + ptep_clear(vma->vm_mm, address, _pte);
>>>> + spin_unlock(ptl);
>>>> + } else {
>>>> + ptep_clear(vma->vm_mm, address, _pte);
>>>> + }
>>>> ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>>>> }
>>>> } else {
>>>> @@ -740,15 +744,20 @@ static void
>>>> __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>> src = page_folio(src_page);
>>>> if (!folio_test_large(src))
>>>> release_pte_folio(src);
>>>> - /*
>>>> - * ptl mostly unnecessary, but preempt has to
>>>> - * be disabled to update the per-cpu stats
>>>> - * inside folio_remove_rmap_pte().
>>>> - */
>>>> - spin_lock(ptl);
>>>> - ptep_clear(vma->vm_mm, address, _pte);
>>>> - folio_remove_rmap_pte(src, src_page, vma);
>>>> - spin_unlock(ptl);
>>>> + if (order == HPAGE_PMD_ORDER) {
>>>> + /*
>>>> + * ptl mostly unnecessary, but preempt has to
>>>> + * be disabled to update the per-cpu stats
>>>> + * inside folio_remove_rmap_pte().
>>>> + */
>>>> + spin_lock(ptl);
>>>> + ptep_clear(vma->vm_mm, address, _pte);
>>>
>>>
>>>
>>>
>>>> + folio_remove_rmap_pte(src, src_page, vma);
>>>> + spin_unlock(ptl);
>>>> + } else {
>>>> + ptep_clear(vma->vm_mm, address, _pte);
>>>> + folio_remove_rmap_pte(src, src_page, vma);
>>>> + }
>>>
>>> As I've talked to Nico about this code recently ... :)
>>>
>>> Are you clearing the PTE after the copy succeeded? If so, where is the
>>> TLB flush?
>>>
>>> How do you sync against concurrent write access + GUP-fast?
>>>
>>>
>>> The sequence really must be: (1) clear PTE/PMD + flush TLB (2) check
>>> if there are unexpected page references (e.g., GUP) if so back off (3)
>>> copy page content (4) set updated PTE/PMD.
>>
>> Thanks...we need to ensure GUP-fast does not write when we are copying
>> contents, so (2) will ensure that GUP-fast will see the cleared PTE and
>> back-off.
>
> Yes, and of course, also that the CPU cannot still be concurrently modifying
> the page content while/after you copy it, but before you unmap+flush.
>
>>>
>>> To Nico, I suggested doing it simple initially, and still clear the
>>> high-level PMD entry + flush under mmap write lock, then re-map the
>>> PTE table after modifying the page table. It's not as efficient, but
>>> "harder to get wrong".
>>>
>>> Maybe that's already happening, but I stumbled over this clearing
>>> logic in __collapse_huge_page_copy_succeeded(), so I'm curious.
>>
>> No, I am not even touching the PMD. I guess the sequence you described
>> should work? I just need to reverse the copying and PTE clearing order
>> to implement this sequence.
>
> That would work, but you really have to hold the PTL for the whole
> period: from when you temporarily clear the PTEs + flush the TLB, through
> the copy, until you re-insert the updated ones.
Ignoring the implementation and code churn part :) Is the following algorithm
theoretically correct: (1) Take the PTL, scan the PTEs, isolate and lock the
folios, set the PTEs to migration entries, and check the folio references. This
solves the concurrent-write races. Now we can drop the PTL: no one can write to
the old folios because (1) rmap cannot run and (2) the folio cannot be derived
from the PTE. Note that the migration_entry_wait_on_locked() path can be
scheduled out, so this is not the same as the fault handlers spinning on the
PTL. We can now safely copy the old folios into the new folio, then take the
PTL again: the PTL is available because every page table walker will see a
migration entry and back off. We batch-set the PTEs now and release the folio
locks, letting the fault handlers out of migration_entry_wait_on_locked().
Compared to the old code, the failure points we need to handle are when the
copy fails, or when folio isolation fails at some point; therefore we need to
maintain a list of the old PTEs corresponding to the PTEs that were set to
migration entries.
Note that I had suggested this "setting the PTEs to a global invalid state"
idea in our previous discussion too, but I guess simultaneously working on the
PMD and the PTEs was the main problem there, since the walkers do not take a
lock on the PMD to check whether someone is changing it, when what they are
really interested in is making a change at the PTE level. In fact, leaving all
specifics like racing with a particular page table walker aside, I do not see
why the following claim isn't true:
Claim: The (anon-private) mTHP khugepaged collapse problem is mathematically
equivalent to the (anon-private) page migration problem. The difference is
that in khugepaged we need the VMA to be stable, hence have to take the
mmap_read_lock(), and have to "migrate" to a large folio instead of to
individual pages.
If at all my theory is correct, I'll leave it to the community to decide
whether it's worth going through my brain-rot :)
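To make the ordering concrete, here is a very rough pseudocode rendering of that
idea -- every helper named below (install_migration_ptes(), copy_old_folios(),
restore_original_ptes(), unlock_old_folios()) is a placeholder for logic that
would have to be written, not an existing kernel function:
----8<-----
static int collapse_via_migration_entries(struct vm_area_struct *vma,
					  unsigned long addr, pte_t *pte,
					  spinlock_t *ptl, struct folio *new_folio,
					  int nr_pages, pte_t *orig)
{
	int ret;

	/*
	 * (1) Under the PTL: isolate and lock the folios, check references,
	 * and replace the PTEs with migration entries, stashing the originals
	 * in orig[].
	 */
	spin_lock(ptl);
	ret = install_migration_ptes(vma, addr, pte, nr_pages, orig);
	spin_unlock(ptl);
	if (ret)
		return ret;

	/*
	 * (2) PTL dropped: nobody can reach or write the old folios; faulting
	 * threads sleep in migration_entry_wait_on_locked().
	 */
	ret = copy_old_folios(new_folio, orig, nr_pages, addr, vma);

	/*
	 * (3) Retake the PTL: batch-set the new folio's PTEs, or restore the
	 * stashed originals if the copy failed.
	 */
	spin_lock(ptl);
	if (ret)
		restore_original_ptes(vma, addr, pte, nr_pages, orig);
	else
		set_ptes(vma->vm_mm, addr, pte,
			 mk_pte(folio_page(new_folio, 0), vma->vm_page_prot),
			 nr_pages);
	spin_unlock(ptl);

	/* (4) Unlock the old folios so the waiters can make progress. */
	unlock_old_folios(orig, nr_pages);

	return ret;
}
----8<-----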
>
> When having to back-off (restore original PTEs), or for copying,
> you'll likely need access to the original PTEs, which were already
> cleared. So likely you need a temporary copy of the original PTEs
> somehow.
>
> That's why temporarily clearing the PMD under the mmap write lock is easier
> to implement, at the cost of requiring the mmap lock in write mode
> like PMD collapse.
>
>
So, I understand the following: having some CPU spin on the PTL for a long
time is worse than taking the mmap_write_lock(). The latter blocks this
process
from doing mmap()s, which, to my limited knowledge, is bad for
memory-intensive processes (in line with the fact that the maple tree was
introduced to optimize VMA operations), while the former effectively removes
one unit of computation from the system for a long time.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 06/12] khugepaged: Generalize __collapse_huge_page_copy_failed()
2024-12-17 17:22 ` Ryan Roberts
@ 2024-12-18 8:49 ` Dev Jain
0 siblings, 0 replies; 74+ messages in thread
From: Dev Jain @ 2024-12-18 8:49 UTC (permalink / raw)
To: Ryan Roberts, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 17/12/24 10:52 pm, Ryan Roberts wrote:
> On 16/12/2024 16:50, Dev Jain wrote:
>> Upon failure, we repopulate the PMD in case of PMD-THP collapse. Hence, make
>> this logic specific for PMD case.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> mm/khugepaged.c | 14 ++++++++------
>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index de044b1f83d4..886c76816963 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -766,7 +766,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>> pmd_t *pmd,
>> pmd_t orig_pmd,
>> struct vm_area_struct *vma,
>> - struct list_head *compound_pagelist)
>> + struct list_head *compound_pagelist, int order)
> nit: suggest putting order on its own line.
>
>> {
>> spinlock_t *pmd_ptl;
>>
>> @@ -776,14 +776,16 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>> * pages. Since pages are still isolated and locked here,
>> * acquiring anon_vma_lock_write is unnecessary.
>> */
>> - pmd_ptl = pmd_lock(vma->vm_mm, pmd);
>> - pmd_populate(vma->vm_mm, pmd, pmd_pgtable(orig_pmd));
>> - spin_unlock(pmd_ptl);
>> + if (order == HPAGE_PMD_ORDER) {
>> + pmd_ptl = pmd_lock(vma->vm_mm, pmd);
>> + pmd_populate(vma->vm_mm, pmd, pmd_pgtable(orig_pmd));
>> + spin_unlock(pmd_ptl);
>> + }
>> /*
>> * Release both raw and compound pages isolated
>> * in __collapse_huge_page_isolate.
>> */
>> - release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
>> + release_pte_pages(pte, pte + (1UL << order), compound_pagelist);
>> }
> Given this function is clearly so geared towards re-establishing the pmd, given
> that it takes the *pmd and orig_pmd as params, and given that in the
> non-pmd-order case, we only call through to release_pte_pages(), I wonder if
> it's better to make the decision at a higher level and either call this function
> or release_pte_pages() directly? No strong opinion, just looks a bit weird at
> the moment.
Makes sense, we can probably get rid of this function and let the caller call
reestablish_pmd() or something for the PMD case.
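Something along these lines, presumably -- reestablish_pmd() being just the
hypothetical name from this exchange, not an existing helper:

	} else {
		if (order == HPAGE_PMD_ORDER)
			reestablish_pmd(vma->vm_mm, pmd, orig_pmd);	/* hypothetical */
		release_pte_pages(pte, pte + (1UL << order), compound_pagelist);
	}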
>
>>
>> /*
>> @@ -834,7 +836,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>> compound_pagelist);
>> else
>> __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
>> - compound_pagelist);
>> + compound_pagelist, order);
>>
>> return result;
>> }
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 12/12] selftests/mm: khugepaged: Enlighten for mTHP collapse
2024-12-16 16:51 ` [RFC PATCH 12/12] selftests/mm: khugepaged: Enlighten for mTHP collapse Dev Jain
@ 2024-12-18 9:03 ` Ryan Roberts
2024-12-18 9:50 ` Dev Jain
0 siblings, 1 reply; 74+ messages in thread
From: Ryan Roberts @ 2024-12-18 9:03 UTC (permalink / raw)
To: Dev Jain, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 16/12/2024 16:51, Dev Jain wrote:
> One of the testcases triggers a CoW on the 255th page (0-indexed) with
> max_ptes_shared = 256. This leads to pages 0-254 (255 in number) being unshared,
> and 257 pages remaining shared, exceeding the constraint. Suppose we run the test
> as ./khugepaged -s 2. khugepaged then starts collapsing the range into order-2
> folios, since PMD-collapse will fail due to the constraint.
> When the scan reaches the 254-257 PTE range, because at least one PTE in this range
> is writable, with the other 3 being read-only, khugepaged collapses this into an
> order-2 mTHP, resulting in 3 extra PTEs getting unshared. After this, we encounter
> a 4-sized chunk of read-only PTEs, and mTHP collapse stops according to the scaled
> constraint, but the number of shared PTEs has now come under the constraint for
> PMD-sized THPs. Therefore, the next scan of khugepaged will be able to collapse
> this range into a PMD-mapped hugepage, leading to failure of this subtest. Fix
> this by reducing the CoW range.
Is this description essentially saying that it's now possible to creep towards
collapsing to a full PMD-size block over successive scans due to rounding errors
in the scaling? Or is this just trying an edge case and the problem doesn't
generalize?
>
> Note: The only objective of this patch is to make the test work for the PMD-case;
> no extension has been made for testing for mTHPs.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> tools/testing/selftests/mm/khugepaged.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
> index 8a4d34cce36b..143c4ad9f6a1 100644
> --- a/tools/testing/selftests/mm/khugepaged.c
> +++ b/tools/testing/selftests/mm/khugepaged.c
> @@ -981,6 +981,7 @@ static void collapse_fork_compound(struct collapse_context *c, struct mem_ops *o
> static void collapse_max_ptes_shared(struct collapse_context *c, struct mem_ops *ops)
> {
> int max_ptes_shared = thp_read_num("khugepaged/max_ptes_shared");
> + int fault_nr_pages = is_anon(ops) ? 1 << anon_order : 1;
> int wstatus;
> void *p;
>
> @@ -997,8 +998,8 @@ static void collapse_max_ptes_shared(struct collapse_context *c, struct mem_ops
> fail("Fail");
>
> printf("Trigger CoW on page %d of %d...",
> - hpage_pmd_nr - max_ptes_shared - 1, hpage_pmd_nr);
> - ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - 1) * page_size);
> + hpage_pmd_nr - max_ptes_shared - fault_nr_pages, hpage_pmd_nr);
> + ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - fault_nr_pages) * page_size);
> if (ops->check_huge(p, 0))
> success("OK");
> else
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise
2024-12-17 18:15 ` Ryan Roberts
@ 2024-12-18 9:24 ` Dev Jain
0 siblings, 0 replies; 74+ messages in thread
From: Dev Jain @ 2024-12-18 9:24 UTC (permalink / raw)
To: Ryan Roberts, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 17/12/24 11:45 pm, Ryan Roberts wrote:
> On 16/12/2024 16:51, Dev Jain wrote:
>> Scan the PTEs order-wise, using the mask of suitable orders for this VMA
>> derived in conjunction with sysfs THP settings. Scale down the tunables; in
>> case of collapse failure, we drop down to the next order. Otherwise, we try to
>> jump to the highest possible order and then start a fresh scan. Note that
>> madvise(MADV_COLLAPSE) has not been generalized.
> Is there a reason you are not modifying MADV_COLLAPSE? It's really just a
> synchronous way to do what khugepaged does asynchronously (isn't it?), so it would
> behave the same way in an ideal world.
Correct, but I started running into return value problems for madvise(). For example,
the return value of hpage_collapse_scan_ptes() will be the return value of the last
mTHP scan. In that case, what do we want madvise() to return? If I collapse the range
into multiple 64K mTHPs, then I should still return failure, because otherwise the caller
would logically assume that MADV_COLLAPSE succeeded and therefore that a PMD hugepage is
mapped there. But the caller did end up collapsing memory... and if you return success,
then the khugepaged selftest starts failing. Basically, this will be (kind of?) an ABI
change, and I really didn't want to sway the discussion away from khugepaged, so I just
kept it simple :)
>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> mm/khugepaged.c | 84 ++++++++++++++++++++++++++++++++++++++++---------
>> 1 file changed, 69 insertions(+), 15 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 886c76816963..078794aa3335 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -20,6 +20,7 @@
>> #include <linux/swapops.h>
>> #include <linux/shmem_fs.h>
>> #include <linux/ksm.h>
>> +#include <linux/count_zeros.h>
>>
>> #include <asm/tlb.h>
>> #include <asm/pgalloc.h>
>> @@ -1111,7 +1112,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>> }
>>
>> static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> - int referenced, int unmapped,
>> + int referenced, int unmapped, int order,
>> struct collapse_control *cc)
>> {
>> LIST_HEAD(compound_pagelist);
>> @@ -1278,38 +1279,59 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> unsigned long address, bool *mmap_locked,
>> struct collapse_control *cc)
>> {
>> - pmd_t *pmd;
>> - pte_t *pte, *_pte;
>> - int result = SCAN_FAIL, referenced = 0;
>> - int none_or_zero = 0, shared = 0;
>> - struct page *page = NULL;
>> + unsigned int max_ptes_shared, max_ptes_none, max_ptes_swap;
>> + int referenced, shared, none_or_zero, unmapped;
>> + unsigned long _address, org_address = address;
> nit: Perhaps it's clearer to keep the original address in address and use a
> variable, start, for the starting point of each scan?
Probably...will keep it in mind.
>
>> struct folio *folio = NULL;
>> - unsigned long _address;
>> - spinlock_t *ptl;
>> - int node = NUMA_NO_NODE, unmapped = 0;
>> + struct page *page = NULL;
>> + int node = NUMA_NO_NODE;
>> + int result = SCAN_FAIL;
>> bool writable = false;
>> + unsigned long orders;
>> + pte_t *pte, *_pte;
>> + spinlock_t *ptl;
>> + pmd_t *pmd;
>> + int order;
>>
>> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>
>> + orders = thp_vma_allowable_orders(vma, vma->vm_flags,
>> + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER + 1) - 1);
> Perhaps THP_ORDERS_ALL instead of "BIT(PMD_ORDER + 1) - 1"?
Ah yes, THP_ORDERS_ALL_ANON.
>
>> + orders = thp_vma_suitable_orders(vma, address, orders);
>> + order = highest_order(orders);
>> +
>> + /* MADV_COLLAPSE needs to work irrespective of sysfs setting */
>> + if (!cc->is_khugepaged)
>> + order = HPAGE_PMD_ORDER;
>> +
>> +scan_pte_range:
>> +
>> + max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
>> + max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>> + max_ptes_swap = khugepaged_max_ptes_swap >> (HPAGE_PMD_ORDER - order);
>> + referenced = 0, shared = 0, none_or_zero = 0, unmapped = 0;
>> +
>> + /* Check pmd after taking mmap lock */
>> result = find_pmd_or_thp_or_none(mm, address, &pmd);
>> if (result != SCAN_SUCCEED)
>> goto out;
>>
>> memset(cc->node_load, 0, sizeof(cc->node_load));
>> nodes_clear(cc->alloc_nmask);
>> +
>> pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>> if (!pte) {
>> result = SCAN_PMD_NULL;
>> goto out;
>> }
>>
>> - for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>> + for (_address = address, _pte = pte; _pte < pte + (1UL << order);
>> _pte++, _address += PAGE_SIZE) {
>> pte_t pteval = ptep_get(_pte);
>> if (is_swap_pte(pteval)) {
>> ++unmapped;
>> if (!cc->is_khugepaged ||
>> - unmapped <= khugepaged_max_ptes_swap) {
>> + unmapped <= max_ptes_swap) {
>> /*
>> * Always be strict with uffd-wp
>> * enabled swap entries. Please see
>> @@ -1330,7 +1352,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> ++none_or_zero;
>> if (!userfaultfd_armed(vma) &&
>> (!cc->is_khugepaged ||
>> - none_or_zero <= khugepaged_max_ptes_none)) {
>> + none_or_zero <= max_ptes_none)) {
>> continue;
>> } else {
>> result = SCAN_EXCEED_NONE_PTE;
>> @@ -1375,7 +1397,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> if (folio_likely_mapped_shared(folio)) {
>> ++shared;
>> if (cc->is_khugepaged &&
>> - shared > khugepaged_max_ptes_shared) {
>> + shared > max_ptes_shared) {
>> result = SCAN_EXCEED_SHARED_PTE;
>> count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>> goto out_unmap;
>> @@ -1432,7 +1454,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> result = SCAN_PAGE_RO;
>> } else if (cc->is_khugepaged &&
>> (!referenced ||
>> - (unmapped && referenced < HPAGE_PMD_NR / 2))) {
>> + (unmapped && referenced < (1UL << order) / 2))) {
>> result = SCAN_LACK_REFERENCED_PAGE;
>> } else {
>> result = SCAN_SUCCEED;
>> @@ -1441,9 +1463,41 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> pte_unmap_unlock(pte, ptl);
>> if (result == SCAN_SUCCEED) {
>> result = collapse_huge_page(mm, address, referenced,
>> - unmapped, cc);
>> + unmapped, order, cc);
>> /* collapse_huge_page will return with the mmap_lock released */
>> *mmap_locked = false;
>> +
>> + /* Immediately exit on exhaustion of range */
>> + if (_address == org_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
>> + goto out;
> Looks like this assumes this function is always asked to scan a full PTE table?
> Does that mean that you can't handle collapse for VMAs that don't span a whole
> PMD entry? I think we will want to support that.
Correct. Yes, we should support that, otherwise khugepaged will scan only large VMAs.
Will have to make a change in khugepaged_scan_mm_slot() for the anon case.
>
>> + }
>> + if (result != SCAN_SUCCEED) {
>> +
>> + /* Go to the next order. */
>> + order = next_order(&orders, order);
>> + if (order < 2)
> This should be:
> if (!orders)
>
> I think the return order is undefined when order is the last order in orders.
The return order is -1, from what I could gather from reading the code.
>
>> + goto out;
>> + goto maybe_mmap_lock;
>> + } else {
>> + address = _address;
>> + pte = _pte;
>> +
>> +
>> + /* Get highest order possible starting from address */
>> + order = count_trailing_zeros(address >> PAGE_SHIFT);
>> +
>> + /* This needs to be present in the mask too */
>> + if (!(orders & (1UL << order)))
>> + order = next_order(&orders, order);
> Not quite; if the exact order isn't in the bitmap, this will pick out the
> highest order in the bitmap, which may be higher than count_trailing_zeros()
> returned.
Oh okay, nice catch.
> You could do:
>
> order = count_trailing_zeros(address >> PAGE_SHIFT);
> orders &= (1UL << order + 1) - 1;
> order = next_order(&orders, order);
> if (!orders)
> goto out;
>
> That will mask out any orders that are bigger than the one returned by
> count_trailing_zeros() then next_order() will return the highest order in the
> remaining set.
>
> But even that doesn't quite work because next_order() is destructive. Once you
> arrive on a higher order address boundary, you want to be able to select a
> higher order from the original orders bitmap. But you have lost them on a
> previous go around the loop.
>
> Perhaps stash orig_orders at the top of the function when you first calculate
> it. Then I think this works (totally untested):
>
> order = count_trailing_zeros(address >> PAGE_SHIFT);
> orders = orig_orders & (1UL << order + 1) - 1;
> order = next_order(&orders, order);
> if (!orders)
> goto out;
>
> You might want to do something like this for the first go around the loop, but I
> think address is currently always at the start of the PMD on entry, so not
> needed until that restriction is removed.
Will take a look; we just need an order no greater than the one derived from the trailing
zeroes, and then the first enabled order at or below that in the bitmask; shouldn't
be too complicated.
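i.e. something like this (untested sketch, stashing the original bitmask in orig_orders as you
suggest; count_trailing_zeros() from count_zeros.h, next_order() from huge_mm.h):

	/* clamp to the alignment implied by the current address ... */
	order = count_trailing_zeros(address >> PAGE_SHIFT);
	orders = orig_orders & (BIT(order + 1) - 1);
	/* ... then pick the highest enabled order at or below it */
	if (!(orders & BIT(order)))
		order = next_order(&orders, order);
	if (order < 2)
		goto out;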
>
>> + if (order < 2)
>> + goto out;
>> +
>> +maybe_mmap_lock:
>> + if (!(*mmap_locked)) {
>> + mmap_read_lock(mm);
> Given the lock was already held in read mode on entering this function, then
> released by collapse_huge_page(), is it definitely safe to retake this lock and
> rerun this function? Is it possible that state that was checked before entering
> this function has changed since the lock was released that would now need
> re-checking?
Thanks, what I am missing is a hugepage_vma_revalidate().
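Something like this at the point where we retake the lock, I think (rough, untested sketch):

maybe_mmap_lock:
	if (!(*mmap_locked)) {
		mmap_read_lock(mm);
		*mmap_locked = true;
		/* the VMA may have changed while the lock was dropped */
		result = hugepage_vma_revalidate(mm, address, true, &vma,
						 order, cc);
		if (result != SCAN_SUCCEED)
			goto out;
	}
	goto scan_pte_range;

(find_pmd_or_thp_or_none() is re-done at scan_pte_range anyway.)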
>> + *mmap_locked = true;
>> + }
>> + goto scan_pte_range;
>> }
>> out:
>> trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 08/12] khugepaged: Abstract PMD-THP collapse
2024-12-17 19:24 ` Ryan Roberts
@ 2024-12-18 9:26 ` Dev Jain
0 siblings, 0 replies; 74+ messages in thread
From: Dev Jain @ 2024-12-18 9:26 UTC (permalink / raw)
To: Ryan Roberts, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 18/12/24 12:54 am, Ryan Roberts wrote:
> On 16/12/2024 16:51, Dev Jain wrote:
>> Abstract away taking the mmap_lock exclusively, copying page contents, and
>> setting the PMD, into vma_collapse_anon_folio_pmd().
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> mm/khugepaged.c | 119 +++++++++++++++++++++++++++---------------------
>> 1 file changed, 66 insertions(+), 53 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 078794aa3335..88beebef773e 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1111,58 +1111,17 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>> return SCAN_SUCCEED;
>> }
>>
>> -static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> - int referenced, int unmapped, int order,
>> - struct collapse_control *cc)
>> +static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long address,
>> + struct vm_area_struct *vma, struct collapse_control *cc, pmd_t *pmd,
>> + struct folio *folio)
>> {
>> + struct mmu_notifier_range range;
>> + spinlock_t *pmd_ptl, *pte_ptl;
>> LIST_HEAD(compound_pagelist);
>> - pmd_t *pmd, _pmd;
>> - pte_t *pte;
>> pgtable_t pgtable;
>> - struct folio *folio;
>> - spinlock_t *pmd_ptl, *pte_ptl;
>> - int result = SCAN_FAIL;
>> - struct vm_area_struct *vma;
>> - struct mmu_notifier_range range;
>> -
>> - VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> -
>> - /*
>> - * Before allocating the hugepage, release the mmap_lock read lock.
>> - * The allocation can take potentially a long time if it involves
>> - * sync compaction, and we do not need to hold the mmap_lock during
>> - * that. We will recheck the vma after taking it again in write mode.
>> - */
>> - mmap_read_unlock(mm);
>> -
>> - result = alloc_charge_folio(&folio, mm, order, cc);
>> - if (result != SCAN_SUCCEED)
>> - goto out_nolock;
>> -
>> - mmap_read_lock(mm);
>> - result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
>> - if (result != SCAN_SUCCEED) {
>> - mmap_read_unlock(mm);
>> - goto out_nolock;
>> - }
>> -
>> - result = find_pmd_or_thp_or_none(mm, address, &pmd);
>> - if (result != SCAN_SUCCEED) {
>> - mmap_read_unlock(mm);
>> - goto out_nolock;
>> - }
>> -
>> - if (unmapped) {
>> - /*
>> - * __collapse_huge_page_swapin will return with mmap_lock
>> - * released when it fails. So we jump out_nolock directly in
>> - * that case. Continuing to collapse causes inconsistency.
>> - */
>> - result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>> - referenced, order);
>> - if (result != SCAN_SUCCEED)
>> - goto out_nolock;
>> - }
>> + int result;
>> + pmd_t _pmd;
>> + pte_t *pte;
>>
>> mmap_read_unlock(mm);
>> /*
>> @@ -1174,7 +1133,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> * mmap_lock.
>> */
>> mmap_write_lock(mm);
>> - result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
>> +
>> + result = hugepage_vma_revalidate(mm, address, true, &vma, HPAGE_PMD_ORDER, cc);
>> if (result != SCAN_SUCCEED)
>> goto out_up_write;
>> /* check if the pmd is still valid */
>> @@ -1206,7 +1166,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>> if (pte) {
>> result = __collapse_huge_page_isolate(vma, address, pte, cc,
>> - &compound_pagelist, order);
>> + &compound_pagelist, HPAGE_PMD_ORDER);
>> spin_unlock(pte_ptl);
>> } else {
>> result = SCAN_PMD_NULL;
>> @@ -1262,11 +1222,64 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> deferred_split_folio(folio, false);
>> spin_unlock(pmd_ptl);
>>
>> - folio = NULL;
>> -
>> result = SCAN_SUCCEED;
>> out_up_write:
>> mmap_write_unlock(mm);
>> + return result;
>> +}
>> +
>> +static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> + int referenced, int unmapped, int order,
>> + struct collapse_control *cc)
>> +{
>> + struct vm_area_struct *vma;
>> + int result = SCAN_FAIL;
>> + struct folio *folio;
>> + pmd_t *pmd;
>> +
>> + /*
>> + * Before allocating the hugepage, release the mmap_lock read lock.
>> + * The allocation can take potentially a long time if it involves
>> + * sync compaction, and we do not need to hold the mmap_lock during
>> + * that. We will recheck the vma after taking it again in write mode.
>> + */
>> + mmap_read_unlock(mm);
>> +
>> + result = alloc_charge_folio(&folio, mm, order, cc);
>> + if (result != SCAN_SUCCEED)
>> + goto out_nolock;
>> +
>> + mmap_read_lock(mm);
>> + result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
>> + if (result != SCAN_SUCCEED) {
>> + mmap_read_unlock(mm);
>> + goto out_nolock;
>> + }
>> +
>> + result = find_pmd_or_thp_or_none(mm, address, &pmd);
>> + if (result != SCAN_SUCCEED) {
>> + mmap_read_unlock(mm);
>> + goto out_nolock;
>> + }
>> +
>> + if (unmapped) {
>> + /*
>> + * __collapse_huge_page_swapin will return with mmap_lock
>> + * released when it fails. So we jump out_nolock directly in
>> + * that case. Continuing to collapse causes inconsistency.
>> + */
>> + result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>> + referenced, order);
>> + if (result != SCAN_SUCCEED)
>> + goto out_nolock;
>> + }
>> +
>> + if (order == HPAGE_PMD_ORDER)
>> + result = vma_collapse_anon_folio_pmd(mm, address, vma, cc, pmd, folio);
> I think the locking is broken here? collapse_huge_page() used to enter with the
> mmap read lock and exit without the lock held at all. After the change, this is
> only true for order == HPAGE_PMD_ORDER. For other orders, you exit with the mmap
> read lock still held. Perhaps:
>
> else
> mmap_read_unlock(mm);
I have completely messed up the sequential building of the patches :) I will take care of this next time.
>
>> +
>> + if (result == SCAN_SUCCEED)
>> + folio = NULL;
>> +
>> out_nolock:
>> if (folio)
>> folio_put(folio);
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped
2024-12-18 7:36 ` Ryan Roberts
@ 2024-12-18 9:34 ` Dev Jain
2024-12-19 3:40 ` John Hubbard
0 siblings, 1 reply; 74+ messages in thread
From: Dev Jain @ 2024-12-18 9:34 UTC (permalink / raw)
To: Ryan Roberts, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 18/12/24 1:06 pm, Ryan Roberts wrote:
> On 16/12/2024 16:51, Dev Jain wrote:
>> We may hit a situation wherein we have a larger folio mapped. It is incorrect
>> to go ahead with the collapse since some pages will be unmapped, leading to
>> the entire folio getting unmapped. Therefore, skip the corresponding range.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> In the future, if at all it is required that at some point we want all the folios
>> in the system to be of a specific order, we may split these larger folios.
>>
>> mm/khugepaged.c | 31 +++++++++++++++++++++++++++++++
>> 1 file changed, 31 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 8040b130e677..47e7c476b893 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -33,6 +33,7 @@ enum scan_result {
>> SCAN_PMD_NULL,
>> SCAN_PMD_NONE,
>> SCAN_PMD_MAPPED,
>> + SCAN_PTE_MAPPED,
>> SCAN_EXCEED_NONE_PTE,
>> SCAN_EXCEED_SWAP_PTE,
>> SCAN_EXCEED_SHARED_PTE,
>> @@ -609,6 +610,11 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>> folio = page_folio(page);
>> VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>>
>> + if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order) {
>> + result = SCAN_PTE_MAPPED;
>> + goto out;
>> + }
>> +
>> /* See hpage_collapse_scan_ptes(). */
>> if (folio_likely_mapped_shared(folio)) {
>> ++shared;
>> @@ -1369,6 +1375,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> unsigned long orders;
>> pte_t *pte, *_pte;
>> spinlock_t *ptl;
>> + int found_order;
>> pmd_t *pmd;
>> int order;
>>
>> @@ -1467,6 +1474,24 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> goto out_unmap;
>> }
>>
>> + found_order = folio_order(folio);
>> +
>> + /*
>> + * No point in scanning. Two options: if this folio was hit
>> + * somewhere in the middle of the scan, then drop down the
>> + * order. Or, completely skip till the end of this folio. The
>> + * latter gives us a higher order to start with, with at most
>> + * 1 << order PTEs not collapsed; the former may force us
>> + * to end up going below order 2 and exiting.
>> + */
>> + if (order != HPAGE_PMD_ORDER && found_order >= order) {
>> + result = SCAN_PTE_MAPPED;
>> + _address += (PAGE_SIZE << found_order);
>> + _pte += (1UL << found_order);
>> + pte_unmap_unlock(pte, ptl);
>> + goto decide_order;
>> + }
> It would be good if you can spell out the desired policy when khugepaged hits
> partially unmapped large folios and unaligned large folios. I think the simple
> approach is to always collapse them to fully mapped, aligned folios even if the
> resulting order is smaller than the original. But I'm not sure that's definitely
> going to always be the best thing.
>
> Regardless, I'm struggling to understand the logic in this patch. Taking the
> order of a folio based on having hit one of its pages says nothing about
> whether the whole of that folio is mapped or not, or about its alignment. And it's not
> clear to me how we would get to a situation where we are scanning for a lower
> order and find a (fully mapped, aligned) folio of higher order in the first place.
>
> Let's assume the desired policy is that khugepaged should always collapse to
> naturally aligned large folios. If there happens to be an existing aligned
> order-4 folio that is fully mapped, we will identify that for collapse as part
> of the scan for order-4. At that point, we should just notice that it is already
> an aligned order-4 folio and bypass collapse. Of course we may have already
> chosen to collapse it into a higher order, but we should definitely not get to a
> lower order before we notice it.
>
> Hmm... I guess if the sysfs thp settings have been changed then things could get
> spicy... if order-8 was previously enabled and we have an order-8 folio, then it
> gets disabled and khugepaged is scanning for order-4 (which is still enabled)
> and hits the order-8; what's the expected policy? Rework it into 16 order-4 folios
> or leave it as a single order-8?
Exactly, sorry, I should have made it clear in the patch description that I am
handling the following scenario: there is a long-running system on which we are
using order-8 folios, and now we decide to downgrade to order-4. Will it be a
good idea to take the pain of splitting each order-8 folio into 16 order-4 folios? This
should be a rare situation in the first place, so I have currently decided to ignore the
folios set up by the previous sysfs setting and only focus on collapsing fresh memory.
Thinking again, a sysadmin deciding to downgrade the order of folios should do that in
the hope of reducing internal fragmentation, increasing swap speed, etc., so it makes
sense to shatter large folios... maybe we can have a sysfs tunable for this?
>
>> +
>> /*
>> * We treat a single page as shared if any part of the THP
>> * is shared. "False negatives" from
>> @@ -1550,6 +1575,10 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> if (_address == org_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
>> goto out;
>> }
>> + /* A larger folio was mapped; it will be skipped in next iteration */
>> + if (result == SCAN_PTE_MAPPED)
>> + goto decide_order;
>> +
>> if (result != SCAN_SUCCEED) {
>>
>> /* Go to the next order. */
>> @@ -1558,6 +1587,8 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> goto out;
>> goto maybe_mmap_lock;
>> } else {
>> +
>> +decide_order:
>> address = _address;
>> pte = _pte;
>>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 12/12] selftests/mm: khugepaged: Enlighten for mTHP collapse
2024-12-18 9:03 ` Ryan Roberts
@ 2024-12-18 9:50 ` Dev Jain
2024-12-20 11:05 ` Ryan Roberts
0 siblings, 1 reply; 74+ messages in thread
From: Dev Jain @ 2024-12-18 9:50 UTC (permalink / raw)
To: Ryan Roberts, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 18/12/24 2:33 pm, Ryan Roberts wrote:
> On 16/12/2024 16:51, Dev Jain wrote:
>> One of the testcases triggers a CoW on the 255th page (0-indexing) with
>> max_ptes_shared = 256. This leads to 0-254 pages (255 in number) being unshared,
>> and 257 pages shared, exceeding the constraint. Suppose we run the test as
>> ./khugepaged -s 2. Therefore, khugepaged starts collapsing the range to order-2
>> folios, since PMD-collapse will fail due to the constraint.
>> When the scan reaches 254-257 PTE range, because at least one PTE in this range
>> is writable, with other 3 being read-only, khugepaged collapses this into an
>> order-2 mTHP, resulting in 3 extra PTEs getting unshared. After this, we encounter
>> a 4-sized chunk of read-only PTEs, and mTHP collapse stops according to the scaled
>> constraint, but the number of shared PTEs has now come under the constraint for
>> PMD-sized THPs. Therefore, the next scan of khugepaged will be able to collapse
>> this range into a PMD-mapped hugepage, leading to failure of this subtest. Fix
>> this by reducing the CoW range.
> Is this description essentially saying that it's now possible to creep towards
> collapsing to a full PMD-size block over successive scans due to rounding errors
> in the scaling? Or is this just trying an edge case and the problem doesn't
> generalize?
For this case, max_ptes_shared for order-2 is 256 >> (9 - 2) = 2, without rounding errors. We cannot
really get a rounding problem, because we are rounding down: as we go down the orders we either keep
the restriction the same or make it stricter.
But thinking again, this behaviour may generalize: let us say that the distribution of none PTEs
vs filled PTEs is very skewed for the PMD case. In a local region, this distribution may not be
skewed, and then an mTHP collapse will occur, making that entire region uniform. Over time this
may keep happening, and then the region will become globally uniform enough to come under the PMD
constraint on max_ptes_none, and eventually PMD-collapse will occur. Which raises the question of
whether we want to detach khugepaged orders from the mTHP sysfs settings.
>
>> Note: The only objective of this patch is to make the test work for the PMD-case;
>> no extension has been made for testing for mTHPs.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> tools/testing/selftests/mm/khugepaged.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
>> index 8a4d34cce36b..143c4ad9f6a1 100644
>> --- a/tools/testing/selftests/mm/khugepaged.c
>> +++ b/tools/testing/selftests/mm/khugepaged.c
>> @@ -981,6 +981,7 @@ static void collapse_fork_compound(struct collapse_context *c, struct mem_ops *o
>> static void collapse_max_ptes_shared(struct collapse_context *c, struct mem_ops *ops)
>> {
>> int max_ptes_shared = thp_read_num("khugepaged/max_ptes_shared");
>> + int fault_nr_pages = is_anon(ops) ? 1 << anon_order : 1;
>> int wstatus;
>> void *p;
>>
>> @@ -997,8 +998,8 @@ static void collapse_max_ptes_shared(struct collapse_context *c, struct mem_ops
>> fail("Fail");
>>
>> printf("Trigger CoW on page %d of %d...",
>> - hpage_pmd_nr - max_ptes_shared - 1, hpage_pmd_nr);
>> - ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - 1) * page_size);
>> + hpage_pmd_nr - max_ptes_shared - fault_nr_pages, hpage_pmd_nr);
>> + ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - fault_nr_pages) * page_size);
>> if (ops->check_huge(p, 0))
>> success("OK");
>> else
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
2024-12-16 17:06 ` David Hildenbrand
2024-12-16 19:08 ` Yang Shi
2024-12-17 10:07 ` Dev Jain
@ 2024-12-18 15:59 ` Dev Jain
2 siblings, 0 replies; 74+ messages in thread
From: Dev Jain @ 2024-12-18 15:59 UTC (permalink / raw)
To: David Hildenbrand, akpm, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel
On 16/12/24 10:36 pm, David Hildenbrand wrote:
> On 16.12.24 17:51, Dev Jain wrote:
>> In contrast to PMD-collapse, we do not need to operate on two levels
>> of pagetable
>> simultaneously. Therefore, downgrade the mmap lock from write to read
>> mode. Still
>> take the anon_vma lock in exclusive mode so as to not waste time in
>> the rmap path,
>> which is anyways going to fail since the PTEs are going to be
>> changed. Under the PTL,
>> copy page contents, clear the PTEs, remove folio pins, and (try to)
>> unmap the
>> old folios. Set the PTEs to the new folio using the set_ptes() API.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> Note: I have been trying hard to get rid of the locks in here: we
>> still are
>> taking the PTL around the page copying; dropping the PTL and taking
>> it after
>> the copying should lead to a deadlock, for example:
>> khugepaged madvise(MADV_COLD)
>> folio_lock() lock(ptl)
>> lock(ptl) folio_lock()
>>
>> We can create a locked folio list, altogether drop both the locks,
>> take the PTL,
>> do everything which __collapse_huge_page_isolate() does *except* the
>> isolation and
>> again try locking folios, but then it will reduce efficiency of
>> khugepaged
>> and almost looks like a forced solution :)
>> Please note the following discussion if anyone is interested:
>> https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/
>>
>> (Apologies for not CCing the mailing list from the start)
>>
>> mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++++++++++----------
>> 1 file changed, 87 insertions(+), 21 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 88beebef773e..8040b130e677 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -714,24 +714,28 @@ static void
>> __collapse_huge_page_copy_succeeded(pte_t *pte,
>> struct vm_area_struct *vma,
>> unsigned long address,
>> spinlock_t *ptl,
>> - struct list_head *compound_pagelist)
>> + struct list_head *compound_pagelist, int order)
>> {
>> struct folio *src, *tmp;
>> pte_t *_pte;
>> pte_t pteval;
>> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>> + for (_pte = pte; _pte < pte + (1UL << order);
>> _pte++, address += PAGE_SIZE) {
>> pteval = ptep_get(_pte);
>> if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>> if (is_zero_pfn(pte_pfn(pteval))) {
>> - /*
>> - * ptl mostly unnecessary.
>> - */
>> - spin_lock(ptl);
>> - ptep_clear(vma->vm_mm, address, _pte);
>> - spin_unlock(ptl);
>> + if (order == HPAGE_PMD_ORDER) {
>> + /*
>> + * ptl mostly unnecessary.
>> + */
>> + spin_lock(ptl);
>> + ptep_clear(vma->vm_mm, address, _pte);
>> + spin_unlock(ptl);
>> + } else {
>> + ptep_clear(vma->vm_mm, address, _pte);
>> + }
>> ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>> }
>> } else {
>> @@ -740,15 +744,20 @@ static void
>> __collapse_huge_page_copy_succeeded(pte_t *pte,
>> src = page_folio(src_page);
>> if (!folio_test_large(src))
>> release_pte_folio(src);
>> - /*
>> - * ptl mostly unnecessary, but preempt has to
>> - * be disabled to update the per-cpu stats
>> - * inside folio_remove_rmap_pte().
>> - */
>> - spin_lock(ptl);
>> - ptep_clear(vma->vm_mm, address, _pte);
>> - folio_remove_rmap_pte(src, src_page, vma);
>> - spin_unlock(ptl);
>> + if (order == HPAGE_PMD_ORDER) {
>> + /*
>> + * ptl mostly unnecessary, but preempt has to
>> + * be disabled to update the per-cpu stats
>> + * inside folio_remove_rmap_pte().
>> + */
>> + spin_lock(ptl);
>> + ptep_clear(vma->vm_mm, address, _pte);
>
>
>
>
>> + folio_remove_rmap_pte(src, src_page, vma);
>> + spin_unlock(ptl);
>> + } else {
>> + ptep_clear(vma->vm_mm, address, _pte);
>> + folio_remove_rmap_pte(src, src_page, vma);
>> + }
>
> As I've talked to Nico about this code recently ... :)
>
> Are you clearing the PTE after the copy succeeded? If so, where is the
> TLB flush?
>
> How do you sync against concurrent write access + GUP-fast?
>
>
> The sequence really must be: (1) clear PTE/PMD + flush TLB, (2) check
> if there are unexpected page references (e.g., GUP) and back off if so, (3)
> copy the page content, (4) set the updated PTE/PMD.
>
> To Nico, I suggested keeping it simple initially: still clear the
> high-level PMD entry + flush under the mmap write lock, then re-map the
> PTE table after modifying the page table. It's not as efficient, but
> "harder to get wrong".
Btw, if we are taking the mmap write lock, then we do not require flushing the
PMD entry for mTHP collapse, right?
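Just to check my understanding, a very rough sketch of the ordering you describe, for a single
PTE (existing helpers, but expected_refs/new_page/new_pte are only illustrative names; this is
not what the patch currently does):

	pte_t orig = ptep_clear_flush(vma, addr, ptep);	/* (1) clear + TLB flush */

	/* (2) an unexpected reference means e.g. GUP(-fast) holds a pin: back off */
	if (folio_ref_count(src) != expected_refs) {
		set_pte_at(mm, addr, ptep, orig);
		return SCAN_PAGE_COUNT;
	}

	copy_user_highpage(new_page, page, addr, vma);	/* (3) copy contents */
	set_ptes(mm, addr, ptep, new_pte, 1);		/* (4) map the new folio */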
>
> Maybe that's already happening, but I stumbled over this clearing
> logic in __collapse_huge_page_copy_succeeded(), so I'm curious.
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped
2024-12-18 9:34 ` Dev Jain
@ 2024-12-19 3:40 ` John Hubbard
2024-12-19 3:51 ` Zi Yan
2024-12-19 7:59 ` Dev Jain
0 siblings, 2 replies; 74+ messages in thread
From: John Hubbard @ 2024-12-19 3:40 UTC (permalink / raw)
To: Dev Jain, Ryan Roberts, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, 21cnbao,
linux-mm, linux-kernel
On 12/18/24 1:34 AM, Dev Jain wrote:
> On 18/12/24 1:06 pm, Ryan Roberts wrote:
>> On 16/12/2024 16:51, Dev Jain wrote:
>>> We may hit a situation wherein we have a larger folio mapped. It is incorrect
>>> to go ahead with the collapse since some pages will be unmapped, leading to
>>> the entire folio getting unmapped. Therefore, skip the corresponding range.
...
>> It would be good if you can spell out the desired policy when khugepaged hits
>> partially unmapped large folios and unaligned large folios. I think the simple
>> approach is to always collapse them to fully mapped, aligned folios even if the
>> resulting order is smaller than the original. But I'm not sure that's definitely
>> going to always be the best thing.
>>
>> Regardless, I'm struggling to understand the logic in this patch. Taking the
>> order of a folio based on having hit one of it's pages says anything about
>> whether the whole of that folio is mapped or not or it's alignment. And it's not
>> clear to me how we would get to a situation where we are scanning for a lower
>> order and find a (fully mapped, aligned) folio of higher order in the first place.
>>
>> Let's assume the desired policy is that khugepaged should always collapse to
>> naturally aligned large folios. If there happens to be an existing aligned
>> order-4 folio that is fully mapped, we will identify that for collapse as part
>> of the scan for order-4. At that point, we should just notice that it is already
>> an aligned order-4 folio and bypass collapse. Of course we may have already
>> chosen to collapse it into a higher order, but we should definitely not get to a
>> lower order before we notice it.
>>
>> Hmm... I guess if the sysfs thp settings have been changed then things could get
>> spicy... if order-8 was previously enabled and we have an order-8 folio, then it
>> get's disabled and khugepaged is scanning for order-4 (which is still enabled)
>> then hits the order-8; what's the expected policy? rework into 2 order-4 folios
>> or leave it as as single order-8?
>
> Exactly, sorry, I should have made it clear in the patch description that I am
> handling the following scenario: there is a long running system on which we are
> using order-8 folios, and now we decide to downgrade to order-4. Will it be a
> good idea to take the pain of splitting order-8 to 16 order-4 folios? This should
> be a rare situation in the first place, so I have currently decided to ignore the
> folios set up by the previous sysfs setting and only focus on collapsing fresh memory.
>
> Thinking again, a sys-admin deciding to downgrade order of folios, should do that in
> the hopes of reducing internal fragmentation or increasing swap speed etc, so it makes
> sense to shatter large folios....maybe we can have a sysfs tunable for this?
Maybe we should not support it (at runtime) at all. We are trying to build
systems that don't require incredibly detailed sysadmin involvement, and
this level of tweaking qualifies, thoroughly, as "incredibly detailed
sysadmin micromanagement", imho.
Apologies for not having gone through the series in detail yet, but this
point jumped out at me.
thanks,
--
John Hubbard
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped
2024-12-19 3:40 ` John Hubbard
@ 2024-12-19 3:51 ` Zi Yan
2024-12-19 7:59 ` Dev Jain
1 sibling, 0 replies; 74+ messages in thread
From: Zi Yan @ 2024-12-19 3:51 UTC (permalink / raw)
To: John Hubbard, Dev Jain, Ryan Roberts
Cc: akpm, david, willy, kirill.shutemov, anshuman.khandual,
catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
peterx, ioworker0, wangkefeng.wang, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, 21cnbao, linux-mm, linux-kernel
On 18 Dec 2024, at 22:40, John Hubbard wrote:
> On 12/18/24 1:34 AM, Dev Jain wrote:
>> On 18/12/24 1:06 pm, Ryan Roberts wrote:
>>> On 16/12/2024 16:51, Dev Jain wrote:
>>>> We may hit a situation wherein we have a larger folio mapped. It is incorrect
>>>> to go ahead with the collapse since some pages will be unmapped, leading to
>>>> the entire folio getting unmapped. Therefore, skip the corresponding range.
> ...
>>> It would be good if you can spell out the desired policy when khugepaged hits
>>> partially unmapped large folios and unaligned large folios. I think the simple
>>> approach is to always collapse them to fully mapped, aligned folios even if the
>>> resulting order is smaller than the original. But I'm not sure that's definitely
>>> going to always be the best thing.
>>>
>>> Regardless, I'm struggling to understand the logic in this patch. Taking the
>>> order of a folio based on having hit one of it's pages says anything about
>>> whether the whole of that folio is mapped or not or it's alignment. And it's not
>>> clear to me how we would get to a situation where we are scanning for a lower
>>> order and find a (fully mapped, aligned) folio of higher order in the first place.
>>>
>>> Let's assume the desired policy is that khugepaged should always collapse to
>>> naturally aligned large folios. If there happens to be an existing aligned
>>> order-4 folio that is fully mapped, we will identify that for collapse as part
>>> of the scan for order-4. At that point, we should just notice that it is already
>>> an aligned order-4 folio and bypass collapse. Of course we may have already
>>> chosen to collapse it into a higher order, but we should definitely not get to a
>>> lower order before we notice it.
>>>
>>> Hmm... I guess if the sysfs thp settings have been changed then things could get
>>> spicy... if order-8 was previously enabled and we have an order-8 folio, then it
>>> get's disabled and khugepaged is scanning for order-4 (which is still enabled)
>>> then hits the order-8; what's the expected policy? rework into 2 order-4 folios
>>> or leave it as as single order-8?
>>
>> Exactly, sorry, I should have made it clear in the patch description that I am
>> handling the following scenario: there is a long running system on which we are
>> using order-8 folios, and now we decide to downgrade to order-4. Will it be a
>> good idea to take the pain of splitting order-8 to 16 order-4 folios? This should
>> be a rare situation in the first place, so I have currently decided to ignore the
>> folios set up by the previous sysfs setting and only focus on collapsing fresh memory.
>>
>> Thinking again, a sys-admin deciding to downgrade order of folios, should do that in
>> the hopes of reducing internal fragmentation or increasing swap speed etc, so it makes
>> sense to shatter large folios....maybe we can have a sysfs tunable for this?
>
> Maybe we should not support it (at runtime) at all. We are trying to build
> systems that don't require incredibly detailed sysadmin involvement, and
> this level of tweaking qualifies, thoroughly, as "incredibly detailed
> sysadmin micromanagement", imho.
I agree.
Regarding the sysfs thp settings, we probably want to change their meaning from
"the kernel only allows folios with enabled orders to appear in the system" to
"the kernel actively creates folios with enabled orders". Otherwise, like Ryan
said above, if a sysadmin changes the sysfs thp settings from order-8 only
to order-4 only, does the kernel need to scan all memory to split all order-8 folios?
That sounds like madness.
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped
2024-12-19 3:40 ` John Hubbard
2024-12-19 3:51 ` Zi Yan
@ 2024-12-19 7:59 ` Dev Jain
2024-12-19 8:07 ` Dev Jain
1 sibling, 1 reply; 74+ messages in thread
From: Dev Jain @ 2024-12-19 7:59 UTC (permalink / raw)
To: John Hubbard, Ryan Roberts, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, 21cnbao,
linux-mm, linux-kernel
On 19/12/24 9:10 am, John Hubbard wrote:
> On 12/18/24 1:34 AM, Dev Jain wrote:
>> On 18/12/24 1:06 pm, Ryan Roberts wrote:
>>> On 16/12/2024 16:51, Dev Jain wrote:
>>>> We may hit a situation wherein we have a larger folio mapped. It is
>>>> incorrect
>>>> to go ahead with the collapse since some pages will be unmapped,
>>>> leading to
>>>> the entire folio getting unmapped. Therefore, skip the
>>>> corresponding range.
> ...
>>> It would be good if you can spell out the desired policy when
>>> khugepaged hits
>>> partially unmapped large folios and unaligned large folios. I think
>>> the simple
>>> approach is to always collapse them to fully mapped, aligned folios
>>> even if the
>>> resulting order is smaller than the original. But I'm not sure
>>> that's definitely
>>> going to always be the best thing.
>>>
>>> Regardless, I'm struggling to understand the logic in this patch.
>>> Taking the
>>> order of a folio based on having hit one of it's pages says anything
>>> about
>>> whether the whole of that folio is mapped or not or it's alignment.
>>> And it's not
>>> clear to me how we would get to a situation where we are scanning
>>> for a lower
>>> order and find a (fully mapped, aligned) folio of higher order in
>>> the first place.
>>>
>>> Let's assume the desired policy is that khugepaged should always
>>> collapse to
>>> naturally aligned large folios. If there happens to be an existing
>>> aligned
>>> order-4 folio that is fully mapped, we will identify that for
>>> collapse as part
>>> of the scan for order-4. At that point, we should just notice that
>>> it is already
>>> an aligned order-4 folio and bypass collapse. Of course we may have
>>> already
>>> chosen to collapse it into a higher order, but we should definitely
>>> not get to a
>>> lower order before we notice it.
>>>
>>> Hmm... I guess if the sysfs thp settings have been changed then
>>> things could get
>>> spicy... if order-8 was previously enabled and we have an order-8
>>> folio, then it
>>> get's disabled and khugepaged is scanning for order-4 (which is
>>> still enabled)
>>> then hits the order-8; what's the expected policy? rework into 2
>>> order-4 folios
>>> or leave it as as single order-8?
>>
>> Exactly, sorry, I should have made it clear in the patch description
>> that I am
>> handling the following scenario: there is a long running system on
>> which we are
>> using order-8 folios, and now we decide to downgrade to order-4. Will
>> it be a
>> good idea to take the pain of splitting order-8 to 16 order-4 folios?
>> This should
>> be a rare situation in the first place, so I have currently decided
>> to ignore the
>> folios set up by the previous sysfs setting and only focus on
>> collapsing fresh memory.
>>
>> Thinking again, a sys-admin deciding to downgrade order of folios,
>> should do that in
>> the hopes of reducing internal fragmentation or increasing swap speed
>> etc, so it makes
>> sense to shatter large folios....maybe we can have a sysfs tunable
>> for this?
>
> Maybe we should not support it (at runtime) at all. We are trying to
> build
> systems that don't require incredibly detailed sysadmin involvement, and
> this level of tweaking qualifies, thoroughly, as "incredibly detailed
> sysadmin micromanagement", imho.
Ryan pointed out one thing: what about unaligned, or partially mapped, large
folios? For the previous sysfs settings, it may happen that we have an unaligned
order-8 folio; let us say it got unaligned due to mremap(). Then it is a good
idea to start from the order-4 aligned page and start collapsing memory so
that we can take advantage of the contig bit. Otherwise, if it is a fully mapped,
aligned order-8 folio, then we are already getting the contig-bit advantage anyway,
so collapsing is pointless.
>
> Apologies for not having gone through the series in detail yet, but this
> point jumped out at me.
>
> thanks,
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped
2024-12-19 7:59 ` Dev Jain
@ 2024-12-19 8:07 ` Dev Jain
2024-12-20 11:57 ` Ryan Roberts
0 siblings, 1 reply; 74+ messages in thread
From: Dev Jain @ 2024-12-19 8:07 UTC (permalink / raw)
To: John Hubbard, Ryan Roberts, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, 21cnbao,
linux-mm, linux-kernel
On 19/12/24 1:29 pm, Dev Jain wrote:
>
> On 19/12/24 9:10 am, John Hubbard wrote:
>> On 12/18/24 1:34 AM, Dev Jain wrote:
>>> On 18/12/24 1:06 pm, Ryan Roberts wrote:
>>>> On 16/12/2024 16:51, Dev Jain wrote:
>>>>> We may hit a situation wherein we have a larger folio mapped. It
>>>>> is incorrect
>>>>> to go ahead with the collapse since some pages will be unmapped,
>>>>> leading to
>>>>> the entire folio getting unmapped. Therefore, skip the
>>>>> corresponding range.
>> ...
>>>> It would be good if you can spell out the desired policy when
>>>> khugepaged hits
>>>> partially unmapped large folios and unaligned large folios. I think
>>>> the simple
>>>> approach is to always collapse them to fully mapped, aligned folios
>>>> even if the
>>>> resulting order is smaller than the original. But I'm not sure
>>>> that's definitely
>>>> going to always be the best thing.
>>>>
>>>> Regardless, I'm struggling to understand the logic in this patch.
>>>> Taking the
>>>> order of a folio based on having hit one of it's pages says
>>>> anything about
>>>> whether the whole of that folio is mapped or not or it's alignment.
>>>> And it's not
>>>> clear to me how we would get to a situation where we are scanning
>>>> for a lower
>>>> order and find a (fully mapped, aligned) folio of higher order in
>>>> the first place.
>>>>
>>>> Let's assume the desired policy is that khugepaged should always
>>>> collapse to
>>>> naturally aligned large folios. If there happens to be an existing
>>>> aligned
>>>> order-4 folio that is fully mapped, we will identify that for
>>>> collapse as part
>>>> of the scan for order-4. At that point, we should just notice that
>>>> it is already
>>>> an aligned order-4 folio and bypass collapse. Of course we may have
>>>> already
>>>> chosen to collapse it into a higher order, but we should definitely
>>>> not get to a
>>>> lower order before we notice it.
>>>>
>>>> Hmm... I guess if the sysfs thp settings have been changed then
>>>> things could get
>>>> spicy... if order-8 was previously enabled and we have an order-8
>>>> folio, then it
>>>> get's disabled and khugepaged is scanning for order-4 (which is
>>>> still enabled)
>>>> then hits the order-8; what's the expected policy? rework into 2
>>>> order-4 folios
>>>> or leave it as as single order-8?
>>>
>>> Exactly, sorry, I should have made it clear in the patch description
>>> that I am
>>> handling the following scenario: there is a long running system on
>>> which we are
>>> using order-8 folios, and now we decide to downgrade to order-4.
>>> Will it be a
>>> good idea to take the pain of splitting order-8 to 16 order-4
>>> folios? This should
>>> be a rare situation in the first place, so I have currently decided
>>> to ignore the
>>> folios set up by the previous sysfs setting and only focus on
>>> collapsing fresh memory.
>>>
>>> Thinking again, a sys-admin deciding to downgrade order of folios,
>>> should do that in
>>> the hopes of reducing internal fragmentation or increasing swap
>>> speed etc, so it makes
>>> sense to shatter large folios....maybe we can have a sysfs tunable
>>> for this?
>>
>> Maybe we should not support it (at runtime) at all. We are trying to
>> build
>> systems that don't require incredibly detailed sysadmin involvement, and
>> this level of tweaking qualifies, thoroughly, as "incredibly detailed
>> sysadmin micromanagement", imho.
>
> Ryan pointed out one thing: what about unaligned, or partially mapped
> large
> folios? For the previous sysfs settings, it may happen that we have an
> unaligned
> order-8 folio, let us say it got unaligned due to mremap(). Then it is
> a good
> idea to start from the order-4 aligned page and start collapsing
> memory so
> that we can take advantage of the contig bit. Otherwise if it is a
> fully-mapped
> aligned order-8 folio, then we anyways are abusing the contig bit
> advantage
> so collapsing is pointless.
In fact, in the current code, we are collapsing an unaligned PMD-sized folio into an
aligned PMD-mapped folio; we will not see a block mapping in the PMD, and so we go
ahead with the scan... so the logic should be: skip the scan only if both the VAs and
the PAs are aligned.
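Roughly, skip when something like this holds for the folio hit during the scan (illustrative
check only, on top of the SCAN_PTE_MAPPED handling in this patch):

	if (folio_order(folio) >= order &&
	    IS_ALIGNED(_address, PAGE_SIZE << order) &&
	    IS_ALIGNED(pte_pfn(pteval), 1UL << order)) {
		/* VA and PA already co-aligned at this order: nothing to gain */
		result = SCAN_PTE_MAPPED;
	}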
>>
>> Apologies for not having gone through the series in detail yet, but this
>> point jumped out at me.
>>
>> thanks,
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 12/12] selftests/mm: khugepaged: Enlighten for mTHP collapse
2024-12-18 9:50 ` Dev Jain
@ 2024-12-20 11:05 ` Ryan Roberts
2024-12-30 7:09 ` Dev Jain
0 siblings, 1 reply; 74+ messages in thread
From: Ryan Roberts @ 2024-12-20 11:05 UTC (permalink / raw)
To: Dev Jain, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 18/12/2024 09:50, Dev Jain wrote:
>
> On 18/12/24 2:33 pm, Ryan Roberts wrote:
>> On 16/12/2024 16:51, Dev Jain wrote:
>>> One of the testcases triggers a CoW on the 255th page (0-indexing) with
>>> max_ptes_shared = 256. This leads to 0-254 pages (255 in number) being unshared,
>>> and 257 pages shared, exceeding the constraint. Suppose we run the test as
>>> ./khugepaged -s 2. Therefore, khugepaged starts collapsing the range to order-2
>>> folios, since PMD-collapse will fail due to the constraint.
>>> When the scan reaches 254-257 PTE range, because at least one PTE in this range
>>> is writable, with other 3 being read-only, khugepaged collapses this into an
>>> order-2 mTHP, resulting in 3 extra PTEs getting unshared. After this, we
>>> encounter
>>> a 4-sized chunk of read-only PTEs, and mTHP collapse stops according to the
>>> scaled
>>> constraint, but the number of shared PTEs have now come under the constraint for
>>> PMD-sized THPs. Therefore, the next scan of khugepaged will be able to collapse
>>> this range into a PMD-mapped hugepage, leading to failure of this subtest. Fix
>>> this by reducing the CoW range.
>> Is this description essentially saying that it's now possible to creep towards
>> collapsing to a full PMD-size block over successive scans due to rounding errors
>> in the scaling? Or is this just trying an edge case and the problem doesn't
>> generalize?
>
> For this case, max_ptes_shared for order-2 is 256 >> (9 - 2) = 2, without
> rounding errors. We cannot
> really get a rounding problem because we are rounding down, essentially either
> keep the restriction
> same, or making it stricter, as we go down the orders.
>
> But thinking again, this behaviour may generalize: essentially, let us say that
> the distribution
> of none ptes vs filled ptes is very skewed for the PMD case. In a local region,
> this distribution
> may not be skewed, and then an mTHP collapse will occur, making this entire
> region uniform. Over time this
> may keep happening and then the region will become globally uniform to come
> under the PMD-constraint
> on max_ptes_none, and eventually PMD-collapse will occur. Which may beg the
> question whether we want
> to detach khugepaged orders from mTHP sysfs settings.
We want to avoid new user controls at all costs, I think.
I think an example of the problem you are describing is: Let's say we start off
with all mTHP orders enabled and max_ptes_none is 50% (so 256). We have a 2M VMA
aligned over a single PMD. The first 2 4K pages of the VMA are allocated.
khugepaged will scan this VMA and decide to collapse the first 4 PTEs to a
single order-2 (16K) folio; that's allowed because 50% of the PTEs were none.
But now on the next scan, 50% of the first 8 PTEs are none so it will collapse
to 32K. Then on the next scan it will collapse to 64K, and so on all the way to
2M. So by faulting in 2 pages originally we have now collapsed to 2M despite the
control trying to prevent it, and we have done it in a very inefficient way.
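Spelling out the scaled thresholds for that example (max_ptes_none = 256, i.e. 50%, scaled as
khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order)):

	order 2 (16K):  2 none PTEs allowed of 4     -> 2 mapped PTEs are enough
	order 3 (32K):  4 none PTEs allowed of 8     -> 4 mapped PTEs are enough
	...
	order 9 (2M):   256 none PTEs allowed of 512 -> 256 mapped PTEs are enough

Each collapse leaves its block fully mapped, which is exactly enough to satisfy the threshold
for the next order up, so the creep never stops.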
If max_ptes_none was 75% you would only need every other order enabled (I think?).
In practice perhaps it's not a problem because you are only likely to have 1 or
2 mTHP sizes enabled. But I wonder if we need to think about how to protect from
this "creep"?
Perhaps only consider a large folio for collapse into a larger folio if it
wasn't originally collapsed by khugepaged in the first place? That would need a
folio flag... and I suspect that will cause other edge case issues if we think
about it for 5 mins...
Another way of thinking about it is: if all the same mTHP orders were enabled at
fault time (and the allocation succeeded) we would have allocated the largest
order anyway, so the end states are the same. But the large number of
incremental collapses that khugepaged will perform feels like a problem.
I'm not quite sure what the answer is.
Thanks,
Ryan
>
>>
>>> Note: The only objective of this patch is to make the test work for the PMD-
>>> case;
>>> no extension has been made for testing for mTHPs.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>> tools/testing/selftests/mm/khugepaged.c | 5 +++--
>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/
>>> selftests/mm/khugepaged.c
>>> index 8a4d34cce36b..143c4ad9f6a1 100644
>>> --- a/tools/testing/selftests/mm/khugepaged.c
>>> +++ b/tools/testing/selftests/mm/khugepaged.c
>>> @@ -981,6 +981,7 @@ static void collapse_fork_compound(struct
>>> collapse_context *c, struct mem_ops *o
>>> static void collapse_max_ptes_shared(struct collapse_context *c, struct
>>> mem_ops *ops)
>>> {
>>> int max_ptes_shared = thp_read_num("khugepaged/max_ptes_shared");
>>> + int fault_nr_pages = is_anon(ops) ? 1 << anon_order : 1;
>>> int wstatus;
>>> void *p;
>>> @@ -997,8 +998,8 @@ static void collapse_max_ptes_shared(struct
>>> collapse_context *c, struct mem_ops
>>> fail("Fail");
>>> printf("Trigger CoW on page %d of %d...",
>>> - hpage_pmd_nr - max_ptes_shared - 1, hpage_pmd_nr);
>>> - ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - 1) * page_size);
>>> + hpage_pmd_nr - max_ptes_shared - fault_nr_pages, hpage_pmd_nr);
>>> + ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - fault_nr_pages) *
>>> page_size);
>>> if (ops->check_huge(p, 0))
>>> success("OK");
>>> else
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped
2024-12-19 8:07 ` Dev Jain
@ 2024-12-20 11:57 ` Ryan Roberts
0 siblings, 0 replies; 74+ messages in thread
From: Ryan Roberts @ 2024-12-20 11:57 UTC (permalink / raw)
To: Dev Jain, John Hubbard, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, 21cnbao,
linux-mm, linux-kernel
On 19/12/2024 08:07, Dev Jain wrote:
>
> On 19/12/24 1:29 pm, Dev Jain wrote:
>>
>> On 19/12/24 9:10 am, John Hubbard wrote:
>>> On 12/18/24 1:34 AM, Dev Jain wrote:
>>>> On 18/12/24 1:06 pm, Ryan Roberts wrote:
>>>>> On 16/12/2024 16:51, Dev Jain wrote:
>>>>>> We may hit a situation wherein we have a larger folio mapped. It is incorrect
>>>>>> to go ahead with the collapse since some pages will be unmapped, leading to
>>>>>> the entire folio getting unmapped. Therefore, skip the corresponding range.
>>> ...
>>>>> It would be good if you can spell out the desired policy when khugepaged hits
>>>>> partially unmapped large folios and unaligned large folios. I think the simple
>>>>> approach is to always collapse them to fully mapped, aligned folios even if
>>>>> the
>>>>> resulting order is smaller than the original. But I'm not sure that's
>>>>> definitely
>>>>> going to always be the best thing.
>>>>>
>>>>> Regardless, I'm struggling to understand the logic in this patch. Taking the
>>>>> order of a folio based on having hit one of it's pages says anything about
>>>>> whether the whole of that folio is mapped or not or it's alignment. And
>>>>> it's not
>>>>> clear to me how we would get to a situation where we are scanning for a lower
>>>>> order and find a (fully mapped, aligned) folio of higher order in the first
>>>>> place.
>>>>>
>>>>> Let's assume the desired policy is that khugepaged should always collapse to
>>>>> naturally aligned large folios. If there happens to be an existing aligned
>>>>> order-4 folio that is fully mapped, we will identify that for collapse as part
>>>>> of the scan for order-4. At that point, we should just notice that it is
>>>>> already
>>>>> an aligned order-4 folio and bypass collapse. Of course we may have already
>>>>> chosen to collapse it into a higher order, but we should definitely not get
>>>>> to a
>>>>> lower order before we notice it.
>>>>>
>>>>> Hmm... I guess if the sysfs thp settings have been changed then things
>>>>> could get
>>>>> spicy... if order-8 was previously enabled and we have an order-8 folio,
>>>>> then it
>>>>> get's disabled and khugepaged is scanning for order-4 (which is still enabled)
>>>>> then hits the order-8; what's the expected policy? rework into 2 order-4
>>>>> folios
>>>>> or leave it as as single order-8?
>>>>
>>>> Exactly, sorry, I should have made it clear in the patch description that I am
>>>> handling the following scenario: there is a long running system on which we are
>>>> using order-8 folios, and now we decide to downgrade to order-4. Will it be a
>>>> good idea to take the pain of splitting order-8 to 16 order-4 folios? This
>>>> should
>>>> be a rare situation in the first place, so I have currently decided to
>>>> ignore the
>>>> folios set up by the previous sysfs setting and only focus on collapsing
>>>> fresh memory.
>>>>
>>>> Thinking again, a sys-admin deciding to downgrade order of folios, should do
>>>> that in
>>>> the hopes of reducing internal fragmentation or increasing swap speed etc,
>>>> so it makes
>>>> sense to shatter large folios....maybe we can have a sysfs tunable for this?
>>>
>>> Maybe we should not support it (at runtime) at all. We are trying to build
>>> systems that don't require incredibly detailed sysadmin involvement, and
>>> this level of tweaking qualifies, thoroughly, as "incredibly detailed
>>> sysadmin micromanagement", imho.
Agreed that we definitely don't want any new controls for this.
>>
>> Ryan pointed out one thing: what about unaligned, or partially mapped large
>> folios? For the previous sysfs settings, it may happen that we have an unaligned
>> order-8 folio, let us say it got unaligned due to mremap(). Then it is a good
>> idea to start from the order-4 aligned page and start collapsing memory so
>> that we can take advantage of the contig bit. Otherwise if it is a fully-mapped
>> aligned order-8 folio, then we anyways are abusing the contig bit advantage
>> so collapsing is pointless.
>
>
> In fact, in the current code, we are collapsing an unaligned PMD-size folio to an
> aligned PMD-mapped folio; we will not see a block mapping in the PMD, and go
> ahead with the scan...so the logic should be, skip the scan if the VAs and PAs are
> aligned.
Here is my stab at what the policy should be. It sounds quite complex, but
actually I think the implementation is quite simple.
Memory may initially be mapped partially or fully from large or small folios,
and the mapping may or may not be naturally aligned for the folio order.
khugepaged's goal should always be to end up with the largest aligned orders that
are possible. It will only collapse to an enabled order, and it will never
collapse to an order smaller than or equal to that of the source memory's contiguous
mapping _and_ alignment.
So if, for example, the first (or last) 64K of a 128K folio is mapped on a 64K VA
boundary, then when scanning for order-4 (64K) we would leave that partial mapping
alone. But if either less than 64K was mapped from the 128K folio, or it was
mapped in a way that it was not 64K aligned in VA, then it would be a candidate
for collapse to a new 64K folio.
I think this can all be implemented by remembering if all of the PFNs for the
VA-aligned order that is being scanned are present and congituous and if they
all belong to the same folio. If that is true, and if the first PFN is aligned
on a PA boundary for that order, then skip the collapse and move to the next
block (addr + (PAGE_SIZE << order)).
I think this will give us the properties we want; if we encounter a larger folio
that is fully mapped and aligned, we will leave it alone. If we encounter a mix
of small folios and partially mapped large folios we will collapse. If we
encounter a partial mapping of a larger folio but that partial mapping fills the
block we are scanning in an appropriately aligned manner, we will leave it alone.
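A rough sketch of that check (the helper name and structure here are mine,
not from the series) could look something like this:

static bool range_is_aligned_large_folio(pte_t *pte, int order)
{
	unsigned long nr = 1UL << order;
	unsigned long first_pfn = 0;
	struct folio *folio = NULL;
	unsigned long i;

	for (i = 0; i < nr; i++) {
		pte_t entry = ptep_get(pte + i);

		/* All PTEs must be present... */
		if (!pte_present(entry))
			return false;
		if (i == 0) {
			first_pfn = pte_pfn(entry);
			folio = page_folio(pte_page(entry));
			continue;
		}
		/* ...contiguous, and belonging to the same folio. */
		if (pte_pfn(entry) != first_pfn + i ||
		    page_folio(pte_page(entry)) != folio)
			return false;
	}
	/* Physically aligned for this order -> nothing to gain by collapsing. */
	return IS_ALIGNED(first_pfn, nr);
}

If that returns true, the scan would just move on to
addr + (PAGE_SIZE << order) as described above.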
Thanks,
Ryan
>
>>>
>>> Apologies for not having gone through the series in detail yet, but this
>>> point jumped out at me.
>>>
>>> thanks,
>>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio()
2024-12-17 7:09 ` Ryan Roberts
2024-12-17 13:00 ` Zi Yan
@ 2024-12-20 17:41 ` Christoph Lameter (Ampere)
2024-12-20 17:45 ` Ryan Roberts
1 sibling, 1 reply; 74+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-12-20 17:41 UTC (permalink / raw)
To: Ryan Roberts
Cc: Matthew Wilcox, Dev Jain, akpm, david, kirill.shutemov,
anshuman.khandual, catalin.marinas, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On Tue, 17 Dec 2024, Ryan Roberts wrote:
> We previously decided that all existing THP stats would continue to only count
> PMD-sized THP for fear of breaking userspace in subtle ways, and instead would
> introduce new mTHP stats that can count for each order. We already have
> MTHP_STAT_ANON_FAULT_ALLOC and MTHP_STAT_ANON_FAULT_FALLBACK (amongst others) so
> these new stats fit the pattern well, IMHO.
Could we move all the stats somewhere into sysfs where we can get them by
page order? /proc/vmstat keeps getting new counters.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio()
2024-12-20 17:41 ` Christoph Lameter (Ampere)
@ 2024-12-20 17:45 ` Ryan Roberts
2024-12-20 18:47 ` Christoph Lameter (Ampere)
0 siblings, 1 reply; 74+ messages in thread
From: Ryan Roberts @ 2024-12-20 17:45 UTC (permalink / raw)
To: Christoph Lameter (Ampere)
Cc: Matthew Wilcox, Dev Jain, akpm, david, kirill.shutemov,
anshuman.khandual, catalin.marinas, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 20/12/2024 17:41, Christoph Lameter (Ampere) wrote:
> On Tue, 17 Dec 2024, Ryan Roberts wrote:
>
>> We previously decided that all existing THP stats would continue to only count
>> PMD-sized THP for fear of breaking userspace in subtle ways, and instead would
>> introduce new mTHP stats that can count for each order. We already have
>> MTHP_STAT_ANON_FAULT_ALLOC and MTHP_STAT_ANON_FAULT_FALLBACK (amongst others) so
>> these new stats fit the pattern well, IMHO.
>
> Could we move all the stats somewhere into sysfs where we can get them by
> page order? /proc/vmstat keeps getting new counters.
This is exactly what has been done already for mthp stats. They live at:
/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats/.
So there is a directory per size, and there is a file per stat.
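For anyone wanting to consume those from userspace, a minimal sketch that just
walks the per-size directories and prints one of the stat files
(anon_fault_alloc) would be:

#include <glob.h>
#include <stdio.h>

int main(void)
{
	glob_t g;
	size_t i;

	if (glob("/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/anon_fault_alloc",
		 0, NULL, &g))
		return 1;
	for (i = 0; i < g.gl_pathc; i++) {
		unsigned long long val;
		FILE *f = fopen(g.gl_pathv[i], "r");

		if (f && fscanf(f, "%llu", &val) == 1)
			printf("%s = %llu\n", g.gl_pathv[i], val);
		if (f)
			fclose(f);
	}
	globfree(&g);
	return 0;
}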
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio()
2024-12-20 17:45 ` Ryan Roberts
@ 2024-12-20 18:47 ` Christoph Lameter (Ampere)
2025-01-02 11:21 ` Ryan Roberts
0 siblings, 1 reply; 74+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-12-20 18:47 UTC (permalink / raw)
To: Ryan Roberts
Cc: Matthew Wilcox, Dev Jain, akpm, david, kirill.shutemov,
anshuman.khandual, catalin.marinas, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On Fri, 20 Dec 2024, Ryan Roberts wrote:
> > Could we move all the stats somewhere into sysfs where we can get them by
> > page order? /proc/vmstat keeps getting new counters.
>
> This is exactly what has been done already for mthp stats. They live at:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats/.
>
> So there is a directory per size, and there is a file per stat.
Then let's drop all THP and huge page counters from vmstat.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 12/12] selftests/mm: khugepaged: Enlighten for mTHP collapse
2024-12-20 11:05 ` Ryan Roberts
@ 2024-12-30 7:09 ` Dev Jain
2024-12-30 16:36 ` Zi Yan
0 siblings, 1 reply; 74+ messages in thread
From: Dev Jain @ 2024-12-30 7:09 UTC (permalink / raw)
To: Ryan Roberts, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 20/12/24 4:35 pm, Ryan Roberts wrote:
> On 18/12/2024 09:50, Dev Jain wrote:
>> On 18/12/24 2:33 pm, Ryan Roberts wrote:
>>> On 16/12/2024 16:51, Dev Jain wrote:
>>>> One of the testcases triggers a CoW on the 255th page (0-indexing) with
>>>> max_ptes_shared = 256. This leads to 0-254 pages (255 in number) being unshared,
>>>> and 257 pages shared, exceeding the constraint. Suppose we run the test as
>>>> ./khugepaged -s 2. Therefore, khugepaged starts collapsing the range to order-2
>>>> folios, since PMD-collapse will fail due to the constraint.
>>>> When the scan reaches 254-257 PTE range, because at least one PTE in this range
>>>> is writable, with other 3 being read-only, khugepaged collapses this into an
>>>> order-2 mTHP, resulting in 3 extra PTEs getting unshared. After this, we
>>>> encounter
>>>> a 4-sized chunk of read-only PTEs, and mTHP collapse stops according to the
>>>> scaled
>>>> constraint, but the number of shared PTEs have now come under the constraint for
>>>> PMD-sized THPs. Therefore, the next scan of khugepaged will be able to collapse
>>>> this range into a PMD-mapped hugepage, leading to failure of this subtest. Fix
>>>> this by reducing the CoW range.
>>> Is this description essentially saying that it's now possible to creep towards
>>> collapsing to a full PMD-size block over successive scans due to rounding errors
>>> in the scaling? Or is this just trying an edge case and the problem doesn't
>>> generalize?
>> For this case, max_ptes_shared for order-2 is 256 >> (9 - 2) = 2, without
>> rounding errors. We cannot
>> really get a rounding problem because we are rounding down, essentially either
>> keep the restriction
>> same, or making it stricter, as we go down the orders.
>>
>> But thinking again, this behaviour may generalize: essentially, let us say that
>> the distribution
>> of none ptes vs filled ptes is very skewed for the PMD case. In a local region,
>> this distribution
>> may not be skewed, and then an mTHP collapse will occur, making this entire
>> region uniform. Over time this
>> may keep happening and then the region will become globally uniform to come
>> under the PMD-constraint
>> on max_ptes_none, and eventually PMD-collapse will occur. Which may beg the
>> question whether we want
>> to detach khugepaged orders from mTHP sysfs settings.
> We want to avoid new user controls at all costs, I think.
>
> I think an example of the problem you are describing is: Let's say we start off
> with all mTHP orders enabled and max_ptes_none is 50% (so 256). We have a 2M VMA
> aligned over a single PMD. The first 2 4K pages of the VMA are allocated.
>
> khugepaged will scan this VMA and decide to collapse the first 4 PTEs to a
> single order-2 (16K) folio; that's allowed because 50% of the PTEs were none.
> But now on the next scan, 50% of the first 8 PTEs are none so it will collapse
> to 32K. Then on the next scan it will collapse to 64K, and so on all the way to
> 2M. So by faulting in 2 pages originally we have now collapsed to 2M dispite the
> control trying to prevent it, and we have done it in a very inefficient way.
>
> If max_ptes_none was 75% you would only need every other order enabled (I think?).
>
> In practice perhaps it's not a problem because you are only likely to have 1 or
> 2 mTHP sizes enabled. But I wonder if we need to think about how to protect from
> this "creep"?
>
> Perhaps only consider a large folio for collapse into a larger folio if it
> wasn't originally collapsed by khugepaged in the first place? That would need a
> folio flag... and I suspect that will cause other edge case issues if we think
> about it for 5 mins...
>
> Another way of thinking about it is; if all the same mTHP orders were enabled at
> fault time (and the allocation succeeded) we would have allocated the largest
> order anyway, so the end states are the same. But the large number of
> incremental collapses that khugepaged will perform feel like a problem.
>
> I'm not quite sure what the answer is.
Can't really think of anything else apart from decoupling khugepaged sysfs from mTHP sysfs...
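Just to make the creep concrete: assuming the per-order limit is derived from
the PMD-level max_ptes_none with a right shift (as with the max_ptes_shared
scaling discussed earlier), Ryan's 50% example can be reproduced with a few
lines of C:

#include <stdio.h>

#define PMD_ORDER 9

int main(void)
{
	unsigned int max_ptes_none = 256;	/* 50% of 512 */
	unsigned int mapped = 2;		/* pages faulted in initially */
	int order;

	for (order = 2; order <= PMD_ORDER; order++) {
		unsigned int nr = 1u << order;
		unsigned int limit = max_ptes_none >> (PMD_ORDER - order);
		unsigned int none = nr - mapped;

		if (none > limit)
			break;
		printf("order-%d: %u none <= limit %u -> collapse\n",
		       order, none, limit);
		mapped = nr;	/* the next scan sees a fully mapped block */
	}
	return 0;
}

Every step lands exactly on the limit, so the collapse creeps up one order per
scan all the way to PMD order.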
>
> Thanks,
> Ryan
>
>
>>>> Note: The only objective of this patch is to make the test work for the PMD-
>>>> case;
>>>> no extension has been made for testing for mTHPs.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>> tools/testing/selftests/mm/khugepaged.c | 5 +++--
>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/
>>>> selftests/mm/khugepaged.c
>>>> index 8a4d34cce36b..143c4ad9f6a1 100644
>>>> --- a/tools/testing/selftests/mm/khugepaged.c
>>>> +++ b/tools/testing/selftests/mm/khugepaged.c
>>>> @@ -981,6 +981,7 @@ static void collapse_fork_compound(struct
>>>> collapse_context *c, struct mem_ops *o
>>>> static void collapse_max_ptes_shared(struct collapse_context *c, struct
>>>> mem_ops *ops)
>>>> {
>>>> int max_ptes_shared = thp_read_num("khugepaged/max_ptes_shared");
>>>> + int fault_nr_pages = is_anon(ops) ? 1 << anon_order : 1;
>>>> int wstatus;
>>>> void *p;
>>>> @@ -997,8 +998,8 @@ static void collapse_max_ptes_shared(struct
>>>> collapse_context *c, struct mem_ops
>>>> fail("Fail");
>>>> printf("Trigger CoW on page %d of %d...",
>>>> - hpage_pmd_nr - max_ptes_shared - 1, hpage_pmd_nr);
>>>> - ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - 1) * page_size);
>>>> + hpage_pmd_nr - max_ptes_shared - fault_nr_pages, hpage_pmd_nr);
>>>> + ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - fault_nr_pages) *
>>>> page_size);
>>>> if (ops->check_huge(p, 0))
>>>> success("OK");
>>>> else
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 12/12] selftests/mm: khugepaged: Enlighten for mTHP collapse
2024-12-30 7:09 ` Dev Jain
@ 2024-12-30 16:36 ` Zi Yan
2025-01-02 11:43 ` Ryan Roberts
2025-01-03 10:11 ` Dev Jain
0 siblings, 2 replies; 74+ messages in thread
From: Zi Yan @ 2024-12-30 16:36 UTC (permalink / raw)
To: Dev Jain, Ryan Roberts, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, jglisse,
surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao,
linux-mm, linux-kernel
On Mon Dec 30, 2024 at 2:09 AM EST, Dev Jain wrote:
>
> On 20/12/24 4:35 pm, Ryan Roberts wrote:
> > On 18/12/2024 09:50, Dev Jain wrote:
> >> On 18/12/24 2:33 pm, Ryan Roberts wrote:
> >>> On 16/12/2024 16:51, Dev Jain wrote:
> >>>> One of the testcases triggers a CoW on the 255th page (0-indexing) with
> >>>> max_ptes_shared = 256. This leads to 0-254 pages (255 in number) being unshared,
> >>>> and 257 pages shared, exceeding the constraint. Suppose we run the test as
> >>>> ./khugepaged -s 2. Therefore, khugepaged starts collapsing the range to order-2
> >>>> folios, since PMD-collapse will fail due to the constraint.
> >>>> When the scan reaches 254-257 PTE range, because at least one PTE in this range
> >>>> is writable, with other 3 being read-only, khugepaged collapses this into an
> >>>> order-2 mTHP, resulting in 3 extra PTEs getting unshared. After this, we
> >>>> encounter
> >>>> a 4-sized chunk of read-only PTEs, and mTHP collapse stops according to the
> >>>> scaled
> >>>> constraint, but the number of shared PTEs have now come under the constraint for
> >>>> PMD-sized THPs. Therefore, the next scan of khugepaged will be able to collapse
> >>>> this range into a PMD-mapped hugepage, leading to failure of this subtest. Fix
> >>>> this by reducing the CoW range.
> >>> Is this description essentially saying that it's now possible to creep towards
> >>> collapsing to a full PMD-size block over successive scans due to rounding errors
> >>> in the scaling? Or is this just trying an edge case and the problem doesn't
> >>> generalize?
> >> For this case, max_ptes_shared for order-2 is 256 >> (9 - 2) = 2, without
> >> rounding errors. We cannot
> >> really get a rounding problem because we are rounding down, essentially either
> >> keep the restriction
> >> same, or making it stricter, as we go down the orders.
> >>
> >> But thinking again, this behaviour may generalize: essentially, let us say that
> >> the distribution
> >> of none ptes vs filled ptes is very skewed for the PMD case. In a local region,
> >> this distribution
> >> may not be skewed, and then an mTHP collapse will occur, making this entire
> >> region uniform. Over time this
> >> may keep happening and then the region will become globally uniform to come
> >> under the PMD-constraint
> >> on max_ptes_none, and eventually PMD-collapse will occur. Which may beg the
> >> question whether we want
> >> to detach khugepaged orders from mTHP sysfs settings.
> > We want to avoid new user controls at all costs, I think.
> >
> > I think an example of the problem you are describing is: Let's say we start off
> > with all mTHP orders enabled and max_ptes_none is 50% (so 256). We have a 2M VMA
> > aligned over a single PMD. The first 2 4K pages of the VMA are allocated.
> >
> > khugepaged will scan this VMA and decide to collapse the first 4 PTEs to a
> > single order-2 (16K) folio; that's allowed because 50% of the PTEs were none.
> > But now on the next scan, 50% of the first 8 PTEs are none so it will collapse
> > to 32K. Then on the next scan it will collapse to 64K, and so on all the way to
> > 2M. So by faulting in 2 pages originally we have now collapsed to 2M dispite the
> > control trying to prevent it, and we have done it in a very inefficient way.
> >
> > If max_ptes_none was 75% you would only need every other order enabled (I think?).
> >
> > In practice perhaps it's not a problem because you are only likely to have 1 or
> > 2 mTHP sizes enabled. But I wonder if we need to think about how to protect from
> > this "creep"?
> >
> > Perhaps only consider a large folio for collapse into a larger folio if it
> > wasn't originally collapsed by khugepaged in the first place? That would need a
> > folio flag... and I suspect that will cause other edge case issues if we think
> > about it for 5 mins...
> >
> > Another way of thinking about it is; if all the same mTHP orders were enabled at
> > fault time (and the allocation succeeded) we would have allocated the largest
> > order anyway, so the end states are the same. But the large number of
> > incremental collapses that khugepaged will perform feel like a problem.
> >
> > I'm not quite sure what the answer is.
>
> Can't really think of anything else apart from decoupling khugepaged sysfs from mTHP sysfs...
One (not so effective) workaround is to add a VMA flag to make
khugepaged skip scanning a VMA that khugepaged has collapsed before,
and reset the flag in a future page fault. This would prevent khugepaged
from doing this "creep" collapse behavior until the page tables covered
by the VMA are changed. This is not perfect, since the page faults might
not change the aforementioned region and khugepaged can still perform
the "creep" later.
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
2024-12-18 8:35 ` Dev Jain
@ 2025-01-02 10:08 ` Dev Jain
2025-01-02 11:33 ` David Hildenbrand
2025-01-02 11:22 ` David Hildenbrand
1 sibling, 1 reply; 74+ messages in thread
From: Dev Jain @ 2025-01-02 10:08 UTC (permalink / raw)
To: David Hildenbrand, akpm, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel
On 18/12/24 2:05 pm, Dev Jain wrote:
>
> On 17/12/24 4:02 pm, David Hildenbrand wrote:
>> On 17.12.24 11:07, Dev Jain wrote:
>>>
>>> On 16/12/24 10:36 pm, David Hildenbrand wrote:
>>>> On 16.12.24 17:51, Dev Jain wrote:
>>>>> In contrast to PMD-collapse, we do not need to operate on two levels
>>>>> of pagetable
>>>>> simultaneously. Therefore, downgrade the mmap lock from write to read
>>>>> mode. Still
>>>>> take the anon_vma lock in exclusive mode so as to not waste time in
>>>>> the rmap path,
>>>>> which is anyways going to fail since the PTEs are going to be
>>>>> changed. Under the PTL,
>>>>> copy page contents, clear the PTEs, remove folio pins, and (try to)
>>>>> unmap the
>>>>> old folios. Set the PTEs to the new folio using the set_ptes() API.
>>>>>
>>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>>> ---
>>>>> Note: I have been trying hard to get rid of the locks in here: we
>>>>> still are
>>>>> taking the PTL around the page copying; dropping the PTL and taking
>>>>> it after
>>>>> the copying should lead to a deadlock, for example:
>>>>> khugepaged madvise(MADV_COLD)
>>>>> folio_lock() lock(ptl)
>>>>> lock(ptl) folio_lock()
>>>>>
>>>>> We can create a locked folio list, altogether drop both the locks,
>>>>> take the PTL,
>>>>> do everything which __collapse_huge_page_isolate() does *except* the
>>>>> isolation and
>>>>> again try locking folios, but then it will reduce efficiency of
>>>>> khugepaged
>>>>> and almost looks like a forced solution :)
>>>>> Please note the following discussion if anyone is interested:
>>>>> https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/
>>>>>
>>>>>
>>>>> (Apologies for not CCing the mailing list from the start)
>>>>>
>>>>> mm/khugepaged.c | 108
>>>>> ++++++++++++++++++++++++++++++++++++++----------
>>>>> 1 file changed, 87 insertions(+), 21 deletions(-)
>>>>>
>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>> index 88beebef773e..8040b130e677 100644
>>>>> --- a/mm/khugepaged.c
>>>>> +++ b/mm/khugepaged.c
>>>>> @@ -714,24 +714,28 @@ static void
>>>>> __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>>> struct vm_area_struct *vma,
>>>>> unsigned long address,
>>>>> spinlock_t *ptl,
>>>>> - struct list_head *compound_pagelist)
>>>>> + struct list_head *compound_pagelist, int
>>>>> order)
>>>>> {
>>>>> struct folio *src, *tmp;
>>>>> pte_t *_pte;
>>>>> pte_t pteval;
>>>>> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>>> + for (_pte = pte; _pte < pte + (1UL << order);
>>>>> _pte++, address += PAGE_SIZE) {
>>>>> pteval = ptep_get(_pte);
>>>>> if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>>>> add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>>>>> if (is_zero_pfn(pte_pfn(pteval))) {
>>>>> - /*
>>>>> - * ptl mostly unnecessary.
>>>>> - */
>>>>> - spin_lock(ptl);
>>>>> - ptep_clear(vma->vm_mm, address, _pte);
>>>>> - spin_unlock(ptl);
>>>>> + if (order == HPAGE_PMD_ORDER) {
>>>>> + /*
>>>>> + * ptl mostly unnecessary.
>>>>> + */
>>>>> + spin_lock(ptl);
>>>>> + ptep_clear(vma->vm_mm, address, _pte);
>>>>> + spin_unlock(ptl);
>>>>> + } else {
>>>>> + ptep_clear(vma->vm_mm, address, _pte);
>>>>> + }
>>>>> ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>>>>> }
>>>>> } else {
>>>>> @@ -740,15 +744,20 @@ static void
>>>>> __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>>> src = page_folio(src_page);
>>>>> if (!folio_test_large(src))
>>>>> release_pte_folio(src);
>>>>> - /*
>>>>> - * ptl mostly unnecessary, but preempt has to
>>>>> - * be disabled to update the per-cpu stats
>>>>> - * inside folio_remove_rmap_pte().
>>>>> - */
>>>>> - spin_lock(ptl);
>>>>> - ptep_clear(vma->vm_mm, address, _pte);
>>>>> - folio_remove_rmap_pte(src, src_page, vma);
>>>>> - spin_unlock(ptl);
>>>>> + if (order == HPAGE_PMD_ORDER) {
>>>>> + /*
>>>>> + * ptl mostly unnecessary, but preempt has to
>>>>> + * be disabled to update the per-cpu stats
>>>>> + * inside folio_remove_rmap_pte().
>>>>> + */
>>>>> + spin_lock(ptl);
>>>>> + ptep_clear(vma->vm_mm, address, _pte);
>>>>
>>>>
>>>>
>>>>
>>>>> + folio_remove_rmap_pte(src, src_page, vma);
>>>>> + spin_unlock(ptl);
>>>>> + } else {
>>>>> + ptep_clear(vma->vm_mm, address, _pte);
>>>>> + folio_remove_rmap_pte(src, src_page, vma);
>>>>> + }
>>>>
>>>> As I've talked to Nico about this code recently ... :)
>>>>
>>>> Are you clearing the PTE after the copy succeeded? If so, where is the
>>>> TLB flush?
>>>>
>>>> How do you sync against concurrent write acess + GUP-fast?
>>>>
>>>>
>>>> The sequence really must be: (1) clear PTE/PMD + flush TLB (2) check
>>>> if there are unexpected page references (e.g., GUP) if so back off (3)
>>>> copy page content (4) set updated PTE/PMD.
>>>
>>> Thanks...we need to ensure GUP-fast does not write when we are copying
>>> contents, so (2) will ensure that GUP-fast will see the cleared PTE and
>>> back-off.
>>
>> Yes, and of course, that also the CPU cannot concurrently still
>> modify the page content while/after you copy the page content, but
>> before you unmap+flush.
>>
>>>>
>>>> To Nico, I suggested doing it simple initially, and still clear the
>>>> high-level PMD entry + flush under mmap write lock, then re-map the
>>>> PTE table after modifying the page table. It's not as efficient, but
>>>> "harder to get wrong".
>>>>
>>>> Maybe that's already happening, but I stumbled over this clearing
>>>> logic in __collapse_huge_page_copy_succeeded(), so I'm curious.
>>>
>>> No, I am not even touching the PMD. I guess the sequence you described
>>> should work? I just need to reverse the copying and PTE clearing order
>>> to implement this sequence.
>>
>> That would work, but you really have to hold the PTL for the whole
>> period: from when you temporarily clear PTEs +_ flush the TLB, when
>> you copy, until you re-insert the updated ones.
>
> Ignoring the implementation and code churn part :) Is the following
> algorithm theoretically correct: (1) Take PTL, scan PTEs,
> isolate and lock the folios, set the PTEs to migration entries, check
> folio references. This will solve concurrent write
> access races. Now, we can drop the PTL...no one can write to the old
> folios because (1) rmap cannot run (2) folio from PTE
> cannot be derived. Note that migration_entry_wait_on_locked() path can
> be scheduled out, so this is not the same as the
> fault handlers spinning on the PTL. We can now safely copy old folios
> to new folio, then take the PTL: The PTL is
> available because every pagetable walker will see a migration entry
> and back off. We batch set the PTEs now, and release
> the folio locks, making the fault handlers getting out of
> migration_entry_wait_on_locked(). As compared to the old code,
> the point of failure we need to handle is when copying fails, or at
> some point folio isolation fails...therefore, we need to
> maintain a list of old PTEs corresponding to the PTEs set to migration
> entries.
>
> Note that, I had suggested this "setting the PTEs to a global invalid
> state" thingy in our previous discussion too, but I guess
> simultaneously working on the PMD and PTE was the main problem there,
> since the walkers do not take a lock on the PMD
> to check if someone is changing it, when what they really are
> interested in is to make change at the PTE level. In fact, leaving
> all specifics like racing with a specific pagetable walker etc aside,
> I do not see why the following claim isn't true:
>
> Claim: The (anon-private) mTHP khugepaged collapse problem is
> mathematically equivalent to the (anon-private) page migration problem.
>
> The difference being, in khugepaged we need the VMA to be stable,
> hence have to take the mmap_read_lock(), and have to "migrate"
> to a large folio instead of individual pages.
>
> If at all my theory is correct, I'll leave it to the community to
> decide if it's worth it to go through my brain-rot :)
If more thinking is required on this sequence, I can postpone this to a
future optimization patch.
>
>
>>
>> When having to back-off (restore original PTEs), or for copying,
>> you'll likely need access to the original PTEs, which were already
>> cleared. So likely you need a temporary copy of the original PTEs
>> somehow.
>>
>> That's why temporarily clearing the PMD und mmap write lock is easier
>> to implement, at the cost of requiring the mmap lock in write mode
>> like PMD collapse.
Why do I need to clear the PMD if I am taking the mmap_write_lock() and
operating only on the PTE?
>>
>>
> So, I understand the following: Some CPU spinning on the PTL for a
> long time is worse than taking the mmap_write_lock(). The latter
> blocks this process
> from doing mmap()s, which, in my limited knowledge, is bad for
> memory-intensive processes (aligned with the fact that the maple tree was
> introduced to optimize VMA operations), and the former literally nukes
> one unit of computation from the system for a long time.
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio()
2024-12-20 18:47 ` Christoph Lameter (Ampere)
@ 2025-01-02 11:21 ` Ryan Roberts
0 siblings, 0 replies; 74+ messages in thread
From: Ryan Roberts @ 2025-01-02 11:21 UTC (permalink / raw)
To: Christoph Lameter (Ampere)
Cc: Matthew Wilcox, Dev Jain, akpm, david, kirill.shutemov,
anshuman.khandual, catalin.marinas, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
Sorry I missed this before heading off for Christmas...
On 20/12/2024 18:47, Christoph Lameter (Ampere) wrote:
> On Fri, 20 Dec 2024, Ryan Roberts wrote:
>
>>> Could we move all the stats somewhere into sysfs where we can get them by
>>> page order? /proc/vmstat keeps getting new counters.
>>
>> This is exactly what has been done already for mthp stats. They live at:
>>
>> /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats/.
>>
>> So there is a directory per size, and there is a file per stat.
>
> Then lets drop all THP and huge page counters from vmstat.
Previous discussion concluded that we can't remove counters from vmstat for fear
of breaking user space. So the policy is that vmstat THP entries should remain,
but they continue to only count PMD-sized THP.
The PMD-sized THP counters are effectively duplicated at:
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/
(or whatever your PMD size happens to be).
Thanks,
Ryan
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
2024-12-18 8:35 ` Dev Jain
2025-01-02 10:08 ` Dev Jain
@ 2025-01-02 11:22 ` David Hildenbrand
1 sibling, 0 replies; 74+ messages in thread
From: David Hildenbrand @ 2025-01-02 11:22 UTC (permalink / raw)
To: Dev Jain, akpm, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel
Still on PTO, but replying to this mail :)
>>>>
>>>> To Nico, I suggested doing it simple initially, and still clear the
>>>> high-level PMD entry + flush under mmap write lock, then re-map the
>>>> PTE table after modifying the page table. It's not as efficient, but
>>>> "harder to get wrong".
>>>>
>>>> Maybe that's already happening, but I stumbled over this clearing
>>>> logic in __collapse_huge_page_copy_succeeded(), so I'm curious.
>>>
>>> No, I am not even touching the PMD. I guess the sequence you described
>>> should work? I just need to reverse the copying and PTE clearing order
>>> to implement this sequence.
>>
>> That would work, but you really have to hold the PTL for the whole
>> period: from when you temporarily clear PTEs +_ flush the TLB, when
>> you copy, until you re-insert the updated ones.
>
> Ignoring the implementation and code churn part :) Is the following
> algorithm theoretically correct: (1) Take PTL, scan PTEs,
> isolate and lock the folios, set the PTEs to migration entries, check
> folio references. This will solve concurrent write
> access races.
> Now, we can drop the PTL...no one can write to the old folios
> because (1) rmap cannot run (2) folio from PTE
> cannot be derived. Note that migration_entry_wait_on_locked() path can
> be scheduled out, so this is not the same as the
> fault handlers spinning on the PTL. We can now safely copy old folios to
> new folio, then take the PTL: The PTL is
> available because every pagetable walker will see a migration entry and
> back off. We batch set the PTEs now, and release
> the folio locks, making the fault handlers getting out of
> migration_entry_wait_on_locked(). As compared to the old code,
> the point of failure we need to handle is when copying fails, or at some
> point folio isolation fails...therefore, we need to
> maintain a list of old PTEs corresponding to the PTEs set to migration
> entries.
>
> Note that, I had suggested this "setting the PTEs to a global invalid
> state" thingy in our previous discussion too, but I guess
> simultaneously working on the PMD and PTE was the main problem there,
> since the walkers do not take a lock on the PMD
> to check if someone is changing it, when what they really are interested
> in is to make change at the PTE level. In fact, leaving
> all specifics like racing with a specific pagetable walker etc aside, I
> do not see why the following claim isn't true:
>
> Claim: The (anon-private) mTHP khugepaged collapse problem is
> mathematically equivalent to the (anon-private) page migration problem.
>
> The difference being, in khugepaged we need the VMA to be stable, hence
> have to take the mmap_read_lock(), and have to "migrate"
> to a large folio instead of individual pages.
>
> If at all my theory is correct, I'll leave it to the community to decide
> if it's worth it to go through my brain-rot :)
What we have to achieve is
a) Make sure GUP-fast cannot grab the folio
b) The CPU cannot read/write the folio
c) No "ordinary" page table walkers can grab the folio.
Handling a) and b) involves either invalidating (incl migration entry)
or temporarily clearing (what we do right now for the PMD) the affected
entry and flushing the TLB.
We can use migration entries while the folio is locked; there might be
some devil in the detail, so I would suggest going with something
simpler first, and then try making use of migration entries.
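A minimal sketch of that ordering for a single PTE (the helper name is made
up, and the reference check is simplified -- the real thing also has to
account for the swapcache and for any references we hold ourselves):

static int isolate_src_pte(struct vm_area_struct *vma, unsigned long addr,
			   pte_t *ptep, struct folio *folio, pte_t *orig)
{
	/* (1) invalidate the mapping and flush the TLB for this address */
	*orig = ptep_clear_flush(vma, addr, ptep);

	/* (2) back off if e.g. GUP still holds an unexpected reference */
	if (folio_ref_count(folio) != folio_mapcount(folio)) {
		set_pte_at(vma->vm_mm, addr, ptep, *orig);
		return -EAGAIN;
	}

	/* (3) the caller can now copy the contents and (4) install new PTEs */
	return 0;
}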
>
>
>>
>> When having to back-off (restore original PTEs), or for copying,
>> you'll likely need access to the original PTEs, which were already
>> cleared. So likely you need a temporary copy of the original PTEs
>> somehow.
>>
>> That's why temporarily clearing the PMD und mmap write lock is easier
>> to implement, at the cost of requiring the mmap lock in write mode
>> like PMD collapse.
>>
>>
> So, I understand the following: Some CPU spinning on the PTL for a long
> time is worse than taking the mmap_write_lock(). The latter blocks this
> process
> from doing mmap()s, which, in my limited knowledge, is bad for
> memory-intensive processes (aligned with the fact that the maple tree was
> introduced to optimize VMA operations), and the former literally nukes
> one unit of computation from the system for a long time.
With per-VMA locks, khugepaged grabbing the mmap lock in write mode got
"less" bad, because most page faults can still make progress. But it's
certainly still suboptimal.
Yes, I think having a lot of threads spinning on a PTL for a long time
can be worse than using a sleepable lock in some scenarios;
especially if the PTL spans more than a single page table.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
2025-01-02 10:08 ` Dev Jain
@ 2025-01-02 11:33 ` David Hildenbrand
2025-01-03 8:17 ` Dev Jain
0 siblings, 1 reply; 74+ messages in thread
From: David Hildenbrand @ 2025-01-02 11:33 UTC (permalink / raw)
To: Dev Jain, akpm, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel
>>>
>>> When having to back-off (restore original PTEs), or for copying,
>>> you'll likely need access to the original PTEs, which were already
>>> cleared. So likely you need a temporary copy of the original PTEs
>>> somehow.
>>>
>>> That's why temporarily clearing the PMD und mmap write lock is easier
>>> to implement, at the cost of requiring the mmap lock in write mode
>>> like PMD collapse.
>
> Why do I need to clear the PMD if I am taking the mmap_write_lock() and
> operating only on the PTE?
One approach I proposed to Nico (and I think he has a prototype) is:
a) Take all locks like we do today (mmap in write, vma in write, rmap in
write)
After this step, no "ordinary" page table walkers can run anymore
b) Clear the PMD entry and flush the TLB like we do today
After this step, the CPU can no longer read/write the folios and GUP-fast
cannot run. The PTE table is completely isolated.
c) Now we can work on the (temporarily cleared) PTE table as we please:
isolate folios, lock them, ... without clearing the PTE entries, just
like we do today.
d) Allocate the new folios (we don't have to hold any spinlocks), copy +
replace the affected PTE entries in the isolated PTE table. Similar to
what we do today, except that we don't clear PTEs but instead clear+reset.
e) Unlock+un-isolate + unref the collapsed folios like we do today.
f) Re-map the PTE-table, like we do today when collapse would have failed.
Of course, after taking all locks we have to re-verify that there is
something to collapse (e.g., in d) we also have to check for unexpected
folio references). The backup path is easy: remap the PTE table as no
PTE entries were touched just yet.
Observe that many things are "like we do today".
As soon as we go to read locks + PTE locks, it all gets more complicated
to get it right. Not that it cannot be done, but the above is IMHO a lot
simpler to get right.
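Putting that flow into a sketch (structure only: isolate_source_folios() and
copy_to_new_folio() are invented names, and all revalidation, accounting and
failure detail is omitted):

static int collapse_anon_range(struct mm_struct *mm,
			       struct vm_area_struct *vma,
			       unsigned long haddr, pmd_t *pmd, int order)
{
	spinlock_t *pmd_ptl;
	pmd_t orig_pmd;
	int ret;

	/* a) caller holds mmap write lock, vma lock and anon_vma lock */

	/* b) isolate the PTE table: clear the PMD entry and flush the TLB */
	orig_pmd = pmdp_collapse_flush(vma, haddr, pmd);

	/*
	 * c) + d) work on the isolated table: isolate and lock the source
	 * folios, allocate the new folio, copy, and rewrite the affected
	 * PTE entries.
	 */
	ret = isolate_source_folios(orig_pmd, haddr, order);
	if (!ret)
		ret = copy_to_new_folio(mm, orig_pmd, haddr, order);

	/* e) the helpers above unlock/unref the old folios */

	/* f) re-install the PTE table whether or not the collapse succeeded */
	pmd_ptl = pmd_lock(mm, pmd);
	pmd_populate(mm, pmd, pmd_pgtable(orig_pmd));
	spin_unlock(pmd_ptl);

	return ret;
}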
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 12/12] selftests/mm: khugepaged: Enlighten for mTHP collapse
2024-12-30 16:36 ` Zi Yan
@ 2025-01-02 11:43 ` Ryan Roberts
2025-01-03 10:10 ` Dev Jain
2025-01-03 10:11 ` Dev Jain
1 sibling, 1 reply; 74+ messages in thread
From: Ryan Roberts @ 2025-01-02 11:43 UTC (permalink / raw)
To: Zi Yan, Dev Jain, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, jglisse,
surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao,
linux-mm, linux-kernel
On 30/12/2024 16:36, Zi Yan wrote:
> On Mon Dec 30, 2024 at 2:09 AM EST, Dev Jain wrote:
>>
>> On 20/12/24 4:35 pm, Ryan Roberts wrote:
>>> On 18/12/2024 09:50, Dev Jain wrote:
>>>> On 18/12/24 2:33 pm, Ryan Roberts wrote:
>>>>> On 16/12/2024 16:51, Dev Jain wrote:
>>>>>> One of the testcases triggers a CoW on the 255th page (0-indexing) with
>>>>>> max_ptes_shared = 256. This leads to 0-254 pages (255 in number) being unshared,
>>>>>> and 257 pages shared, exceeding the constraint. Suppose we run the test as
>>>>>> ./khugepaged -s 2. Therefore, khugepaged starts collapsing the range to order-2
>>>>>> folios, since PMD-collapse will fail due to the constraint.
>>>>>> When the scan reaches 254-257 PTE range, because at least one PTE in this range
>>>>>> is writable, with other 3 being read-only, khugepaged collapses this into an
>>>>>> order-2 mTHP, resulting in 3 extra PTEs getting unshared. After this, we
>>>>>> encounter
>>>>>> a 4-sized chunk of read-only PTEs, and mTHP collapse stops according to the
>>>>>> scaled
>>>>>> constraint, but the number of shared PTEs have now come under the constraint for
>>>>>> PMD-sized THPs. Therefore, the next scan of khugepaged will be able to collapse
>>>>>> this range into a PMD-mapped hugepage, leading to failure of this subtest. Fix
>>>>>> this by reducing the CoW range.
>>>>> Is this description essentially saying that it's now possible to creep towards
>>>>> collapsing to a full PMD-size block over successive scans due to rounding errors
>>>>> in the scaling? Or is this just trying an edge case and the problem doesn't
>>>>> generalize?
>>>> For this case, max_ptes_shared for order-2 is 256 >> (9 - 2) = 2, without
>>>> rounding errors. We cannot
>>>> really get a rounding problem because we are rounding down, essentially either
>>>> keep the restriction
>>>> same, or making it stricter, as we go down the orders.
>>>>
>>>> But thinking again, this behaviour may generalize: essentially, let us say that
>>>> the distribution
>>>> of none ptes vs filled ptes is very skewed for the PMD case. In a local region,
>>>> this distribution
>>>> may not be skewed, and then an mTHP collapse will occur, making this entire
>>>> region uniform. Over time this
>>>> may keep happening and then the region will become globally uniform to come
>>>> under the PMD-constraint
>>>> on max_ptes_none, and eventually PMD-collapse will occur. Which may beg the
>>>> question whether we want
>>>> to detach khugepaged orders from mTHP sysfs settings.
>>> We want to avoid new user controls at all costs, I think.
>>>
>>> I think an example of the problem you are describing is: Let's say we start off
>>> with all mTHP orders enabled and max_ptes_none is 50% (so 256). We have a 2M VMA
>>> aligned over a single PMD. The first 2 4K pages of the VMA are allocated.
>>>
>>> khugepaged will scan this VMA and decide to collapse the first 4 PTEs to a
>>> single order-2 (16K) folio; that's allowed because 50% of the PTEs were none.
>>> But now on the next scan, 50% of the first 8 PTEs are none so it will collapse
>>> to 32K. Then on the next scan it will collapse to 64K, and so on all the way to
>>> 2M. So by faulting in 2 pages originally we have now collapsed to 2M dispite the
>>> control trying to prevent it, and we have done it in a very inefficient way.
>>>
>>> If max_ptes_none was 75% you would only need every other order enabled (I think?).
>>>
>>> In practice perhaps it's not a problem because you are only likely to have 1 or
>>> 2 mTHP sizes enabled. But I wonder if we need to think about how to protect from
>>> this "creep"?
>>>
>>> Perhaps only consider a large folio for collapse into a larger folio if it
>>> wasn't originally collapsed by khugepaged in the first place? That would need a
>>> folio flag... and I suspect that will cause other edge case issues if we think
>>> about it for 5 mins...
>>>
>>> Another way of thinking about it is; if all the same mTHP orders were enabled at
>>> fault time (and the allocation succeeded) we would have allocated the largest
>>> order anyway, so the end states are the same. But the large number of
>>> incremental collapses that khugepaged will perform feel like a problem.
>>>
>>> I'm not quite sure what the answer is.
>>
>> Can't really think of anything else apart from decoupling khugepaged sysfs from mTHP sysfs...
>
> One (not so effective) workaround is to add a VMA flag to make
> khugepaged to skip scanning a VMA that khugepaged has collapsed before
> and reset the flag in a future page fault. This would prevent khugepaged
> from doing this "creep" collapse behavior until the page tables covered
> by the VMA is changed. This is not perfect, since the page faults might
> not change the aforementioned region and later khugepaged can still
> perform the "creep".
>
I guess what you really want is a bitmap with a bit per page in the VMA to tell
you whether it exists due to fault or collapse. But... yuk...
It looks like Ubuntu at least is not modifying the default max_ptes_none, which
is 511. So for general purpose distros I guess we won't see this issue in
practice because we will always collapse to the largest enabled size.
So with that in mind, perhaps Zi's suggested single VM flag idea will be good
enough?
Thanks,
Ryan
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse
2024-12-16 17:31 ` [RFC PATCH 00/12] khugepaged: Asynchronous " Dev Jain
@ 2025-01-02 21:58 ` Nico Pache
2025-01-03 7:04 ` Dev Jain
0 siblings, 1 reply; 74+ messages in thread
From: Nico Pache @ 2025-01-02 21:58 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, david, willy, kirill.shutemov, ryan.roberts,
anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On Mon, Dec 16, 2024 at 10:31 AM Dev Jain <dev.jain@arm.com> wrote:
>
> +Nico, apologies, forgot to CC you.
Hey Dev,
Happy New Year!
Thanks! I'm trying to apply/test your patches, but am failing to apply
them because mm-unstable has "unstable" SHA values, which makes applying
them difficult.
Could you share a public git repo to your patches?
Also, have you seen any issues with your patches? My version of
khugepaged mTHP support was mostly done before the holidays, but I
haven't posted it due to some (BAD PAGE) refcount issues when trying
to reclaim pages that I haven't found the cause of yet.
-- Nico
>
> On 16/12/24 10:20 pm, Dev Jain wrote:
> > This patchset extends khugepaged from collapsing only PMD-sized THPs to
> > collapsing anonymous mTHPs.
> >
> > mTHPs were introduced in the kernel to improve memory management by allocating
> > chunks of larger memory, so as to reduce number of page faults, TLB misses (due
> > to TLB coalescing), reduce length of LRU lists, etc. However, the mTHP property
> > is often lost due to CoW, swap-in/out, and when the kernel just cannot find
> > enough physically contiguous memory to allocate on fault. Henceforth, there is a
> > need to regain mTHPs in the system asynchronously. This work is an attempt in
> > this direction, starting with anonymous folios.
> >
> > In the fault handler, we select the THP order in a greedy manner; the same has
> > been used here, along with the same sysfs interface to control the order of
> > collapse. In contrast to PMD-collapse, we (hopefully) get rid of the mmap_write_lock().
> >
> > ---------------------------------------------------------
> > Testing
> > ---------------------------------------------------------
> >
> > The set has been build tested on x86_64.
> > For Aarch64,
> > 1. mm-selftests: No regressions.
> > 2. Analyzing with tools/mm/thpmaps on different userspace programs mapping
> > aligned VMAs of a large size, faulting in basepages/mTHPs (according to sysfs),
> > and then madvise()'ing the VMA, khugepaged is able to 100% collapse the VMAs.
> >
> > This patchset is rebased on mm-unstable (e7e89af21ffcfd1077ca6d2188de6497db1ad84c).
> >
> > Some points to be noted:
> > 1. Some stats like pages_collapsed for khugepaged have not been extended for mTHP.
> > I'd welcome suggestions on any updation, or addition to the sysfs interface.
> > 2. Please see patch 9 for lock handling.
> >
> > Dev Jain (12):
> > khugepaged: Rename hpage_collapse_scan_pmd() -> ptes()
> > khugepaged: Generalize alloc_charge_folio()
> > khugepaged: Generalize hugepage_vma_revalidate()
> > khugepaged: Generalize __collapse_huge_page_swapin()
> > khugepaged: Generalize __collapse_huge_page_isolate()
> > khugepaged: Generalize __collapse_huge_page_copy_failed()
> > khugepaged: Scan PTEs order-wise
> > khugepaged: Abstract PMD-THP collapse
> > khugepaged: Introduce vma_collapse_anon_folio()
> > khugepaged: Skip PTE range if a larger mTHP is already mapped
> > khugepaged: Enable sysfs to control order of collapse
> > selftests/mm: khugepaged: Enlighten for mTHP collapse
> >
> > include/linux/huge_mm.h | 2 +
> > mm/huge_memory.c | 4 +
> > mm/khugepaged.c | 445 +++++++++++++++++-------
> > tools/testing/selftests/mm/khugepaged.c | 5 +-
> > 4 files changed, 319 insertions(+), 137 deletions(-)
> >
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse
2025-01-02 21:58 ` Nico Pache
@ 2025-01-03 7:04 ` Dev Jain
0 siblings, 0 replies; 74+ messages in thread
From: Dev Jain @ 2025-01-03 7:04 UTC (permalink / raw)
To: Nico Pache
Cc: akpm, david, willy, kirill.shutemov, ryan.roberts,
anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 03/01/25 3:28 am, Nico Pache wrote:
> On Mon, Dec 16, 2024 at 10:31 AM Dev Jain <dev.jain@arm.com> wrote:
>> +Nico, apologies, forgot to CC you.
> Hey Dev,
>
> Happy New Year!
Happy New Year to you too!
>
> Thanks! I'm trying to apply/test your patches, but am failing to apply
> them due to mm-unstable which has "unstable" sha values, making
> applying them difficult.
That is strange. This works for me: clone mm from akpm, check out mm-unstable,
hard reset to e7e89af21ffcfd1077ca6d2188de6497db1ad84c, then apply the patches.
> Could you share a public git repo to your patches?
>
> Also, have you seen any issues with your patches? My version of
> khugepaged mTHP support was mostly done before the holidays but I
> haven't posted due to some issues with (BAD PAGE) refcount issues when
> trying to reclaim pages that I haven't found the cause of yet.
>
> -- Nico
I have not found any obvious issues so far with debug configs on :)
>
>> On 16/12/24 10:20 pm, Dev Jain wrote:
>>> This patchset extends khugepaged from collapsing only PMD-sized THPs to
>>> collapsing anonymous mTHPs.
>>>
>>> mTHPs were introduced in the kernel to improve memory management by allocating
>>> chunks of larger memory, so as to reduce number of page faults, TLB misses (due
>>> to TLB coalescing), reduce length of LRU lists, etc. However, the mTHP property
>>> is often lost due to CoW, swap-in/out, and when the kernel just cannot find
>>> enough physically contiguous memory to allocate on fault. Henceforth, there is a
>>> need to regain mTHPs in the system asynchronously. This work is an attempt in
>>> this direction, starting with anonymous folios.
>>>
>>> In the fault handler, we select the THP order in a greedy manner; the same has
>>> been used here, along with the same sysfs interface to control the order of
>>> collapse. In contrast to PMD-collapse, we (hopefully) get rid of the mmap_write_lock().
>>>
>>> ---------------------------------------------------------
>>> Testing
>>> ---------------------------------------------------------
>>>
>>> The set has been build tested on x86_64.
>>> For Aarch64,
>>> 1. mm-selftests: No regressions.
>>> 2. Analyzing with tools/mm/thpmaps on different userspace programs mapping
>>> aligned VMAs of a large size, faulting in basepages/mTHPs (according to sysfs),
>>> and then madvise()'ing the VMA, khugepaged is able to 100% collapse the VMAs.
>>>
>>> This patchset is rebased on mm-unstable (e7e89af21ffcfd1077ca6d2188de6497db1ad84c).
>>>
>>> Some points to be noted:
>>> 1. Some stats like pages_collapsed for khugepaged have not been extended for mTHP.
>>> I'd welcome suggestions on any updation, or addition to the sysfs interface.
>>> 2. Please see patch 9 for lock handling.
>>>
>>> Dev Jain (12):
>>> khugepaged: Rename hpage_collapse_scan_pmd() -> ptes()
>>> khugepaged: Generalize alloc_charge_folio()
>>> khugepaged: Generalize hugepage_vma_revalidate()
>>> khugepaged: Generalize __collapse_huge_page_swapin()
>>> khugepaged: Generalize __collapse_huge_page_isolate()
>>> khugepaged: Generalize __collapse_huge_page_copy_failed()
>>> khugepaged: Scan PTEs order-wise
>>> khugepaged: Abstract PMD-THP collapse
>>> khugepaged: Introduce vma_collapse_anon_folio()
>>> khugepaged: Skip PTE range if a larger mTHP is already mapped
>>> khugepaged: Enable sysfs to control order of collapse
>>> selftests/mm: khugepaged: Enlighten for mTHP collapse
>>>
>>> include/linux/huge_mm.h | 2 +
>>> mm/huge_memory.c | 4 +
>>> mm/khugepaged.c | 445 +++++++++++++++++-------
>>> tools/testing/selftests/mm/khugepaged.c | 5 +-
>>> 4 files changed, 319 insertions(+), 137 deletions(-)
>>>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
2025-01-02 11:33 ` David Hildenbrand
@ 2025-01-03 8:17 ` Dev Jain
0 siblings, 0 replies; 74+ messages in thread
From: Dev Jain @ 2025-01-03 8:17 UTC (permalink / raw)
To: David Hildenbrand, akpm, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel
On 02/01/25 5:03 pm, David Hildenbrand wrote:
>>>>
>>>> When having to back-off (restore original PTEs), or for copying,
>>>> you'll likely need access to the original PTEs, which were already
>>>> cleared. So likely you need a temporary copy of the original PTEs
>>>> somehow.
>>>>
>>>> That's why temporarily clearing the PMD und mmap write lock is easier
>>>> to implement, at the cost of requiring the mmap lock in write mode
>>>> like PMD collapse.
>>
>> Why do I need to clear the PMD if I am taking the mmap_write_lock() and
>> operating only on the PTE?
>
> One approach I proposed to Nico (and I think he has a prototype) is:
>
> a) Take all locks like we do today (mmap in write, vma in write, rmap
> in write)
>
> After this step, no "ordinary" page table walkers can run anymore
>
> b) Clear the PMD entry and flush the TLB like we do today
>
> After this step, neither the CPU can read/write folios nor GUP-fast
> can run. The PTE table is completely isolated.
>
> c) Now we can work on the (temporarily cleared) PTE table as we
> please: isolate folios, lock them, ... without clearing the PTE
> entries, just like we do today.
>
> d) Allocate the new folios (we don't have to hold any spinlocks), copy
> + replace the affected PTE entries in the isolated PTE table. Similar
> to what we do today, except that we don't clear PTEs but instead
> clear+reset.
>
> e) Unlock+un-isolate + unref the collapsed folios like we do today.
>
> f) Re-map the PTE-table, like we do today when collapse would have
> failed.
>
>
> Of course, after taking all locks we have to re-verify that there is
> something to collapse (e.g., in d) we also have to check for
> unexpected folio references). The backup path is easy: remap the PTE
> table as no PTE entries were touched just yet.
>
> Observe that many things are "like we do today".
>
>
> As soon as we go to read locks + PTE locks, it all gets more
> complicated to get it right. Not that it cannot be done, but the above
> is IMHO a lot simpler to get right.
Thanks for the reply. I'll go ahead with the write lock algorithm then.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 12/12] selftests/mm: khugepaged: Enlighten for mTHP collapse
2025-01-02 11:43 ` Ryan Roberts
@ 2025-01-03 10:10 ` Dev Jain
0 siblings, 0 replies; 74+ messages in thread
From: Dev Jain @ 2025-01-03 10:10 UTC (permalink / raw)
To: Ryan Roberts, Zi Yan, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, jglisse,
surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao,
linux-mm, linux-kernel
On 02/01/25 5:13 pm, Ryan Roberts wrote:
> On 30/12/2024 16:36, Zi Yan wrote:
>> On Mon Dec 30, 2024 at 2:09 AM EST, Dev Jain wrote:
>>> On 20/12/24 4:35 pm, Ryan Roberts wrote:
>>>> On 18/12/2024 09:50, Dev Jain wrote:
>>>>> On 18/12/24 2:33 pm, Ryan Roberts wrote:
>>>>>> On 16/12/2024 16:51, Dev Jain wrote:
>>>>>>> One of the testcases triggers a CoW on the 255th page (0-indexing) with
>>>>>>> max_ptes_shared = 256. This leads to 0-254 pages (255 in number) being unshared,
>>>>>>> and 257 pages shared, exceeding the constraint. Suppose we run the test as
>>>>>>> ./khugepaged -s 2. Therefore, khugepaged starts collapsing the range to order-2
>>>>>>> folios, since PMD-collapse will fail due to the constraint.
>>>>>>> When the scan reaches 254-257 PTE range, because at least one PTE in this range
>>>>>>> is writable, with other 3 being read-only, khugepaged collapses this into an
>>>>>>> order-2 mTHP, resulting in 3 extra PTEs getting unshared. After this, we
>>>>>>> encounter
>>>>>>> a 4-sized chunk of read-only PTEs, and mTHP collapse stops according to the
>>>>>>> scaled
>>>>>>> constraint, but the number of shared PTEs have now come under the constraint for
>>>>>>> PMD-sized THPs. Therefore, the next scan of khugepaged will be able to collapse
>>>>>>> this range into a PMD-mapped hugepage, leading to failure of this subtest. Fix
>>>>>>> this by reducing the CoW range.
>>>>>> Is this description essentially saying that it's now possible to creep towards
>>>>>> collapsing to a full PMD-size block over successive scans due to rounding errors
>>>>>> in the scaling? Or is this just trying an edge case and the problem doesn't
>>>>>> generalize?
>>>>> For this case, max_ptes_shared for order-2 is 256 >> (9 - 2) = 2, without
>>>>> rounding errors. We cannot
>>>>> really get a rounding problem because we are rounding down, essentially either
>>>>> keep the restriction
>>>>> same, or making it stricter, as we go down the orders.
>>>>>
>>>>> But thinking again, this behaviour may generalize: essentially, let us say that
>>>>> the distribution
>>>>> of none ptes vs filled ptes is very skewed for the PMD case. In a local region,
>>>>> this distribution
>>>>> may not be skewed, and then an mTHP collapse will occur, making this entire
>>>>> region uniform. Over time this
>>>>> may keep happening and then the region will become globally uniform to come
>>>>> under the PMD-constraint
>>>>> on max_ptes_none, and eventually PMD-collapse will occur. Which may beg the
>>>>> question whether we want
>>>>> to detach khugepaged orders from mTHP sysfs settings.
>>>> We want to avoid new user controls at all costs, I think.
>>>>
>>>> I think an example of the problem you are describing is: Let's say we start off
>>>> with all mTHP orders enabled and max_ptes_none is 50% (so 256). We have a 2M VMA
>>>> aligned over a single PMD. The first 2 4K pages of the VMA are allocated.
>>>>
>>>> khugepaged will scan this VMA and decide to collapse the first 4 PTEs to a
>>>> single order-2 (16K) folio; that's allowed because 50% of the PTEs were none.
>>>> But now on the next scan, 50% of the first 8 PTEs are none so it will collapse
>>>> to 32K. Then on the next scan it will collapse to 64K, and so on all the way to
>>>> 2M. So by faulting in 2 pages originally we have now collapsed to 2M dispite the
>>>> control trying to prevent it, and we have done it in a very inefficient way.
>>>>
>>>> If max_ptes_none was 75% you would only need every other order enabled (I think?).
>>>>
>>>> In practice perhaps it's not a problem because you are only likely to have 1 or
>>>> 2 mTHP sizes enabled. But I wonder if we need to think about how to protect from
>>>> this "creep"?
>>>>
>>>> Perhaps only consider a large folio for collapse into a larger folio if it
>>>> wasn't originally collapsed by khugepaged in the first place? That would need a
>>>> folio flag... and I suspect that will cause other edge case issues if we think
>>>> about it for 5 mins...
>>>>
>>>> Another way of thinking about it is; if all the same mTHP orders were enabled at
>>>> fault time (and the allocation succeeded) we would have allocated the largest
>>>> order anyway, so the end states are the same. But the large number of
>>>> incremental collapses that khugepaged will perform feels like a problem.
>>>>
>>>> I'm not quite sure what the answer is.
>>> Can't really think of anything else apart from decoupling khugepaged sysfs from mTHP sysfs...
>> One (not so effective) workaround is to add a VMA flag to make
>> khugepaged skip scanning a VMA that khugepaged has collapsed before
>> and reset the flag in a future page fault. This would prevent khugepaged
>> from doing this "creep" collapse behavior until the page tables covered
>> by the VMA are changed. This is not perfect, since the page faults might
>> not change the aforementioned region and later khugepaged can still
>> perform the "creep".
>>
> I guess what you really want is a bitmap with a bit per page in the VMA to tell
> you whether it exists due to fault or collapse. But... yuk...
>
> It looks like Ubuntu at least is not modifying the default max_ptes_none, which
> is 511. So for general purpose distros I guess we won't see this issue in
> practice because we will always collapse to the largest enabled size.
We will still see this problem for max_ptes_shared = 256...
We can set this to 255, so that the fraction progresses as follows:
255/512 -> 127/256 -> 63/128 -> 31/64 -> 15/32 -> 7/16 -> 3/8 -> 1/4 -> 0/2
This is the best possible fractional decrease we can get, since we always end
up on an odd number and lose 1 due to rounding.
The only real solution seems to be to track whether the PTE/page we have is original
or collapsed.
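For reference, a tiny user-space sketch (not part of the series; it just applies
the same "x >> (HPAGE_PMD_ORDER - order)" scaling the patches use, assuming 4K
base pages) reproduces the sequence above:

#include <stdio.h>

#define HPAGE_PMD_ORDER 9	/* 4K base pages, 2M PMD */

int main(void)
{
	unsigned int max_ptes_shared = 255;
	int order;

	/* Orders 9 (PMD) down to 1, matching the fractions listed above. */
	for (order = HPAGE_PMD_ORDER; order >= 1; order--) {
		unsigned int scaled = max_ptes_shared >> (HPAGE_PMD_ORDER - order);

		printf("order %d: %u/%u\n", order, scaled, 1u << order);
	}
	return 0;
}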
>
> So with that in mind, perhaps Zi's suggested single VM flag idea will be good
> enough?
>
> Thanks,
> Ryan
>
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 12/12] selftests/mm: khugepaged: Enlighten for mTHP collapse
2024-12-30 16:36 ` Zi Yan
2025-01-02 11:43 ` Ryan Roberts
@ 2025-01-03 10:11 ` Dev Jain
1 sibling, 0 replies; 74+ messages in thread
From: Dev Jain @ 2025-01-03 10:11 UTC (permalink / raw)
To: Zi Yan, Ryan Roberts, akpm, david, willy, kirill.shutemov
Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, jglisse,
surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao,
linux-mm, linux-kernel
On 30/12/24 10:06 pm, Zi Yan wrote:
> On Mon Dec 30, 2024 at 2:09 AM EST, Dev Jain wrote:
>> On 20/12/24 4:35 pm, Ryan Roberts wrote:
>>> On 18/12/2024 09:50, Dev Jain wrote:
>>>> On 18/12/24 2:33 pm, Ryan Roberts wrote:
>>>>> On 16/12/2024 16:51, Dev Jain wrote:
>>>>>> One of the testcases triggers a CoW on the 255th page (0-indexing) with
>>>>>> max_ptes_shared = 256. This leads to 0-254 pages (255 in number) being unshared,
>>>>>> and 257 pages shared, exceeding the constraint. Suppose we run the test as
>>>>>> ./khugepaged -s 2. Therefore, khugepaged starts collapsing the range to order-2
>>>>>> folios, since PMD-collapse will fail due to the constraint.
>>>>>> When the scan reaches 254-257 PTE range, because at least one PTE in this range
>>>>>> is writable, with other 3 being read-only, khugepaged collapses this into an
>>>>>> order-2 mTHP, resulting in 3 extra PTEs getting unshared. After this, we
>>>>>> encounter a 4-sized chunk of read-only PTEs, and mTHP collapse stops according
>>>>>> to the scaled constraint, but the number of shared PTEs has now come under the constraint for
>>>>>> PMD-sized THPs. Therefore, the next scan of khugepaged will be able to collapse
>>>>>> this range into a PMD-mapped hugepage, leading to failure of this subtest. Fix
>>>>>> this by reducing the CoW range.
>>>>> Is this description essentially saying that it's now possible to creep towards
>>>>> collapsing to a full PMD-size block over successive scans due to rounding errors
>>>>> in the scaling? Or is this just trying an edge case and the problem doesn't
>>>>> generalize?
>>>> For this case, max_ptes_shared for order-2 is 256 >> (9 - 2) = 2, without
>>>> rounding errors. We cannot really get a rounding problem because we are rounding
>>>> down, essentially either keeping the restriction the same or making it stricter
>>>> as we go down the orders.
>>>>
>>>> But thinking again, this behaviour may generalize: essentially, let us say that
>>>> the distribution of none ptes vs filled ptes is very skewed for the PMD case. In
>>>> a local region, this distribution may not be skewed, and then an mTHP collapse
>>>> will occur, making this entire region uniform. Over time this may keep happening
>>>> and then the region will become globally uniform, coming under the PMD constraint
>>>> on max_ptes_none, and eventually PMD-collapse will occur, which raises the question
>>>> of whether we want to detach khugepaged orders from mTHP sysfs settings.
>>> We want to avoid new user controls at all costs, I think.
>>>
>>> I think an example of the problem you are describing is: Let's say we start off
>>> with all mTHP orders enabled and max_ptes_none is 50% (so 256). We have a 2M VMA
>>> aligned over a single PMD. The first 2 4K pages of the VMA are allocated.
>>>
>>> khugepaged will scan this VMA and decide to collapse the first 4 PTEs to a
>>> single order-2 (16K) folio; that's allowed because 50% of the PTEs were none.
>>> But now on the next scan, 50% of the first 8 PTEs are none so it will collapse
>>> to 32K. Then on the next scan it will collapse to 64K, and so on all the way to
>>> 2M. So by faulting in 2 pages originally we have now collapsed to 2M despite the
>>> control trying to prevent it, and we have done it in a very inefficient way.
>>>
>>> If max_ptes_none was 75% you would only need every other order enabled (I think?).
>>>
>>> In practice perhaps it's not a problem because you are only likely to have 1 or
>>> 2 mTHP sizes enabled. But I wonder if we need to think about how to protect from
>>> this "creep"?
>>>
>>> Perhaps only consider a large folio for collapse into a larger folio if it
>>> wasn't originally collapsed by khugepaged in the first place? That would need a
>>> folio flag... and I suspect that will cause other edge case issues if we think
>>> about it for 5 mins...
>>>
>>> Another way of thinking about it is; if all the same mTHP orders were enabled at
>>> fault time (and the allocation succeeded) we would have allocated the largest
>>> order anyway, so the end states are the same. But the large number of
>>> incremental collapses that khugepaged will perform feels like a problem.
>>>
>>> I'm not quite sure what the answer is.
>> Can't really think of anything else apart from decoupling khugepaged sysfs from mTHP sysfs...
> One (not so effective) workaround is to add a VMA flag to make
> khugepaged skip scanning a VMA that khugepaged has collapsed before
> and reset the flag in a future page fault.
Currently the code scans the VMA, and if it is able to collapse the first PMD-sized
chunk, it will go to the next VMA: if (!mmap_locked) -> goto breakouterloop_mmap_lock.
So in your method, suppose we have a VMA of 200MB; then only the first 2MB chunk will
be scanned for this VMA, and the remaining 198MB will never be scanned.
So it boils down to remembering which ranges (not abstractable by the complete VMA)
got collapsed and skipping them, which basically leads us to maintaining a folio flag...
I wonder why we are going to the next VMA immediately after getting the first collapse
success. I guess it is so that khugepaged does not trouble the same process for too long
and gives a chance to other processes? If we can change this behaviour and scan the complete
VMA and then mark it as collapsed, then your method works, but again, as you mention below,
the page fault may operate somewhere else in the VMA...
> This would prevent khugepaged
> from doing this "creep" collapse behavior until the page tables covered
> by the VMA are changed. This is not perfect, since the page faults might
> not change the aforementioned region and later khugepaged can still
> perform the "creep".
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise
2024-12-16 16:51 ` [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise Dev Jain
2024-12-17 18:15 ` Ryan Roberts
@ 2025-01-06 10:04 ` Usama Arif
2025-01-07 7:17 ` Dev Jain
1 sibling, 1 reply; 74+ messages in thread
From: Usama Arif @ 2025-01-06 10:04 UTC (permalink / raw)
To: Dev Jain, akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Johannes Weiner
On 16/12/2024 16:51, Dev Jain wrote:
> Scan the PTEs order-wise, using the mask of suitable orders for this VMA
> derived in conjunction with sysfs THP settings. Scale down the tunables; in
> case of collapse failure, we drop down to the next order. Otherwise, we try to
> jump to the highest possible order and then start a fresh scan. Note that
> madvise(MADV_COLLAPSE) has not been generalized.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> mm/khugepaged.c | 84 ++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 69 insertions(+), 15 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 886c76816963..078794aa3335 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -20,6 +20,7 @@
> #include <linux/swapops.h>
> #include <linux/shmem_fs.h>
> #include <linux/ksm.h>
> +#include <linux/count_zeros.h>
>
> #include <asm/tlb.h>
> #include <asm/pgalloc.h>
> @@ -1111,7 +1112,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> }
>
> static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> - int referenced, int unmapped,
> + int referenced, int unmapped, int order,
> struct collapse_control *cc)
> {
> LIST_HEAD(compound_pagelist);
> @@ -1278,38 +1279,59 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
> unsigned long address, bool *mmap_locked,
> struct collapse_control *cc)
> {
> - pmd_t *pmd;
> - pte_t *pte, *_pte;
> - int result = SCAN_FAIL, referenced = 0;
> - int none_or_zero = 0, shared = 0;
> - struct page *page = NULL;
> + unsigned int max_ptes_shared, max_ptes_none, max_ptes_swap;
> + int referenced, shared, none_or_zero, unmapped;
> + unsigned long _address, org_address = address;
> struct folio *folio = NULL;
> - unsigned long _address;
> - spinlock_t *ptl;
> - int node = NUMA_NO_NODE, unmapped = 0;
> + struct page *page = NULL;
> + int node = NUMA_NO_NODE;
> + int result = SCAN_FAIL;
> bool writable = false;
> + unsigned long orders;
> + pte_t *pte, *_pte;
> + spinlock_t *ptl;
> + pmd_t *pmd;
> + int order;
>
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> + orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER + 1) - 1);
> + orders = thp_vma_suitable_orders(vma, address, orders);
> + order = highest_order(orders);
> +
> + /* MADV_COLLAPSE needs to work irrespective of sysfs setting */
> + if (!cc->is_khugepaged)
> + order = HPAGE_PMD_ORDER;
> +
> +scan_pte_range:
> +
> + max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
> + max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> + max_ptes_swap = khugepaged_max_ptes_swap >> (HPAGE_PMD_ORDER - order);
> + referenced = 0, shared = 0, none_or_zero = 0, unmapped = 0;
> +
Hi Dev,
Thanks for the patches.
Looking at the above code, I imagine you are planning to use the max_ptes_none, max_ptes_shared and
max_ptes_swap that are used for PMD THPs for all mTHP sizes?
I think this can be a bit confusing for users who aren't familiar with kernel code, as the default
values are for PMD THPs; e.g. max_ptes_none is 511, and the user might not know that it is going
to be scaled down for lower-order THPs.
Another thing is, what if these parameters have different optimal values than the scaled-down versions
for mTHP?
The other option is to introduce these parameters as new sysfs entries per mTHP size. These parameters
can be very difficult to tune (and are usually left at their default values), so I don't think it's a
good idea to introduce new sysfs parameters, but just something to think about.
Regards,
Usama
> + /* Check pmd after taking mmap lock */
> result = find_pmd_or_thp_or_none(mm, address, &pmd);
> if (result != SCAN_SUCCEED)
> goto out;
>
> memset(cc->node_load, 0, sizeof(cc->node_load));
> nodes_clear(cc->alloc_nmask);
> +
> pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> if (!pte) {
> result = SCAN_PMD_NULL;
> goto out;
> }
>
> - for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> + for (_address = address, _pte = pte; _pte < pte + (1UL << order);
> _pte++, _address += PAGE_SIZE) {
> pte_t pteval = ptep_get(_pte);
> if (is_swap_pte(pteval)) {
> ++unmapped;
> if (!cc->is_khugepaged ||
> - unmapped <= khugepaged_max_ptes_swap) {
> + unmapped <= max_ptes_swap) {
> /*
> * Always be strict with uffd-wp
> * enabled swap entries. Please see
> @@ -1330,7 +1352,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
> ++none_or_zero;
> if (!userfaultfd_armed(vma) &&
> (!cc->is_khugepaged ||
> - none_or_zero <= khugepaged_max_ptes_none)) {
> + none_or_zero <= max_ptes_none)) {
> continue;
> } else {
> result = SCAN_EXCEED_NONE_PTE;
> @@ -1375,7 +1397,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
> if (folio_likely_mapped_shared(folio)) {
> ++shared;
> if (cc->is_khugepaged &&
> - shared > khugepaged_max_ptes_shared) {
> + shared > max_ptes_shared) {
> result = SCAN_EXCEED_SHARED_PTE;
> count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> goto out_unmap;
> @@ -1432,7 +1454,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
> result = SCAN_PAGE_RO;
> } else if (cc->is_khugepaged &&
> (!referenced ||
> - (unmapped && referenced < HPAGE_PMD_NR / 2))) {
> + (unmapped && referenced < (1UL << order) / 2))) {
> result = SCAN_LACK_REFERENCED_PAGE;
> } else {
> result = SCAN_SUCCEED;
> @@ -1441,9 +1463,41 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
> pte_unmap_unlock(pte, ptl);
> if (result == SCAN_SUCCEED) {
> result = collapse_huge_page(mm, address, referenced,
> - unmapped, cc);
> + unmapped, order, cc);
> /* collapse_huge_page will return with the mmap_lock released */
> *mmap_locked = false;
> +
> + /* Immediately exit on exhaustion of range */
> + if (_address == org_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
> + goto out;
> + }
> + if (result != SCAN_SUCCEED) {
> +
> + /* Go to the next order. */
> + order = next_order(&orders, order);
> + if (order < 2)
> + goto out;
> + goto maybe_mmap_lock;
> + } else {
> + address = _address;
> + pte = _pte;
> +
> +
> + /* Get highest order possible starting from address */
> + order = count_trailing_zeros(address >> PAGE_SHIFT);
> +
> + /* This needs to be present in the mask too */
> + if (!(orders & (1UL << order)))
> + order = next_order(&orders, order);
> + if (order < 2)
> + goto out;
> +
> +maybe_mmap_lock:
> + if (!(*mmap_locked)) {
> + mmap_read_lock(mm);
> + *mmap_locked = true;
> + }
> + goto scan_pte_range;
> }
> out:
> trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
2024-12-16 16:51 ` [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio() Dev Jain
2024-12-16 17:06 ` David Hildenbrand
@ 2025-01-06 10:17 ` Usama Arif
2025-01-07 8:12 ` Dev Jain
1 sibling, 1 reply; 74+ messages in thread
From: Usama Arif @ 2025-01-06 10:17 UTC (permalink / raw)
To: Dev Jain, akpm, david, willy, kirill.shutemov, Johannes Weiner
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Hugh Dickins
On 16/12/2024 16:51, Dev Jain wrote:
> In contrast to PMD-collapse, we do not need to operate on two levels of pagetable
> simultaneously. Therefore, downgrade the mmap lock from write to read mode. Still
> take the anon_vma lock in exclusive mode so as to not waste time in the rmap path,
> which is anyway going to fail since the PTEs are going to be changed. Under the PTL,
> copy page contents, clear the PTEs, remove folio pins, and (try to) unmap the
> old folios. Set the PTEs to the new folio using the set_ptes() API.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> Note: I have been trying hard to get rid of the locks in here: we still are
> taking the PTL around the page copying; dropping the PTL and taking it after
> the copying should lead to a deadlock, for example:
> khugepaged                   madvise(MADV_COLD)
> folio_lock()                 lock(ptl)
> lock(ptl)                    folio_lock()
>
> We can create a locked folio list, altogether drop both the locks, take the PTL,
> do everything which __collapse_huge_page_isolate() does *except* the isolation and
> again try locking folios, but then it will reduce efficiency of khugepaged
> and almost looks like a forced solution :)
> Please note the following discussion if anyone is interested:
> https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/
> (Apologies for not CCing the mailing list from the start)
>
> mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 87 insertions(+), 21 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 88beebef773e..8040b130e677 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -714,24 +714,28 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> struct vm_area_struct *vma,
> unsigned long address,
> spinlock_t *ptl,
> - struct list_head *compound_pagelist)
> + struct list_head *compound_pagelist, int order)
> {
> struct folio *src, *tmp;
> pte_t *_pte;
> pte_t pteval;
>
> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> + for (_pte = pte; _pte < pte + (1UL << order);
> _pte++, address += PAGE_SIZE) {
> pteval = ptep_get(_pte);
> if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> if (is_zero_pfn(pte_pfn(pteval))) {
> - /*
> - * ptl mostly unnecessary.
> - */
> - spin_lock(ptl);
> - ptep_clear(vma->vm_mm, address, _pte);
> - spin_unlock(ptl);
> + if (order == HPAGE_PMD_ORDER) {
> + /*
> + * ptl mostly unnecessary.
> + */
> + spin_lock(ptl);
> + ptep_clear(vma->vm_mm, address, _pte);
> + spin_unlock(ptl);
> + } else {
> + ptep_clear(vma->vm_mm, address, _pte);
> + }
> ksm_might_unmap_zero_page(vma->vm_mm, pteval);
> }
> } else {
> @@ -740,15 +744,20 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> src = page_folio(src_page);
> if (!folio_test_large(src))
> release_pte_folio(src);
> - /*
> - * ptl mostly unnecessary, but preempt has to
> - * be disabled to update the per-cpu stats
> - * inside folio_remove_rmap_pte().
> - */
> - spin_lock(ptl);
> - ptep_clear(vma->vm_mm, address, _pte);
> - folio_remove_rmap_pte(src, src_page, vma);
> - spin_unlock(ptl);
> + if (order == HPAGE_PMD_ORDER) {
> + /*
> + * ptl mostly unnecessary, but preempt has to
> + * be disabled to update the per-cpu stats
> + * inside folio_remove_rmap_pte().
> + */
> + spin_lock(ptl);
> + ptep_clear(vma->vm_mm, address, _pte);
> + folio_remove_rmap_pte(src, src_page, vma);
> + spin_unlock(ptl);
> + } else {
> + ptep_clear(vma->vm_mm, address, _pte);
> + folio_remove_rmap_pte(src, src_page, vma);
> + }
> free_page_and_swap_cache(src_page);
> }
> }
> @@ -807,7 +816,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> unsigned long address, spinlock_t *ptl,
> - struct list_head *compound_pagelist)
> + struct list_head *compound_pagelist, int order)
> {
> unsigned int i;
> int result = SCAN_SUCCEED;
> @@ -815,7 +824,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> /*
> * Copying pages' contents is subject to memory poison at any iteration.
> */
> - for (i = 0; i < HPAGE_PMD_NR; i++) {
> + for (i = 0; i < (1 << order); i++) {
> pte_t pteval = ptep_get(pte + i);
> struct page *page = folio_page(folio, i);
> unsigned long src_addr = address + i * PAGE_SIZE;
> @@ -834,7 +843,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>
> if (likely(result == SCAN_SUCCEED))
> __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> - compound_pagelist);
> + compound_pagelist, order);
> else
> __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> compound_pagelist, order);
> @@ -1196,7 +1205,7 @@ static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long addre
>
> result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> vma, address, pte_ptl,
> - &compound_pagelist);
> + &compound_pagelist, HPAGE_PMD_ORDER);
> pte_unmap(pte);
> if (unlikely(result != SCAN_SUCCEED))
> goto out_up_write;
> @@ -1228,6 +1237,61 @@ static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long addre
> return result;
> }
>
> +/* Enter with mmap read lock */
> +static int vma_collapse_anon_folio(struct mm_struct *mm, unsigned long address,
> + struct vm_area_struct *vma, struct collapse_control *cc, pmd_t *pmd,
> + struct folio *folio, int order)
> +{
> + int result;
> + struct mmu_notifier_range range;
> + spinlock_t *pte_ptl;
> + LIST_HEAD(compound_pagelist);
> + pte_t *pte;
> + pte_t entry;
> + int nr_pages = folio_nr_pages(folio);
> +
> + anon_vma_lock_write(vma->anon_vma);
> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> + address + (PAGE_SIZE << order));
> + mmu_notifier_invalidate_range_start(&range);
> +
> + pte = pte_offset_map_lock(mm, pmd, address, &pte_ptl);
> + if (pte)
> + result = __collapse_huge_page_isolate(vma, address, pte, cc,
> + &compound_pagelist, order);
> + else
> + result = SCAN_PMD_NULL;
> +
> + if (unlikely(result != SCAN_SUCCEED))
> + goto out_up_read;
> +
> + anon_vma_unlock_write(vma->anon_vma);
> +
> + __folio_mark_uptodate(folio);
> + entry = mk_pte(&folio->page, vma->vm_page_prot);
> + entry = maybe_mkwrite(entry, vma);
> +
> + result = __collapse_huge_page_copy(pte, folio, pmd, *pmd,
> + vma, address, pte_ptl,
> + &compound_pagelist, order);
> + if (unlikely(result != SCAN_SUCCEED))
> + goto out_up_read;
> +
> + folio_ref_add(folio, nr_pages - 1);
> + folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> + folio_add_lru_vma(folio, vma);
> + deferred_split_folio(folio, false);
Hi Dev,
You are adding the lower-order folios to the deferred split queue,
but you haven't changed the THP shrinker to take this into account.
Under memory pressure you will be doing a lot of work checking the contents of
all mTHP pages, which will be wasted unless you change the shrinker; something
like the below (unbuilt, untested) might work:
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c89aed1510f1..f9586df40f67 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3788,7 +3788,7 @@ static bool thp_underused(struct folio *folio)
kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
num_zero_pages++;
- if (num_zero_pages > khugepaged_max_ptes_none) {
+ if (num_zero_pages > khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - folio_order(folio))) {
kunmap_local(kaddr);
return true;
}
The question is, do we want the shrinker to be run for lower-order mTHPs? It can consume
a lot of CPU cycles and not be as useful as it is for PMD-order THPs. So instead of the above, we could
disable the THP shrinker for lower orders?
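If that second option were taken, one minimal way to express it might be an early
return in thp_underused(). The fragment below is only an unbuilt, untested sketch;
the placement of the check is assumed rather than taken from this series:

static bool thp_underused(struct folio *folio)
{
	/*
	 * Sketch: only treat PMD-sized THPs as shrinker candidates; bail
	 * out for smaller mTHPs so the shrinker does not spend cycles
	 * scanning the contents of khugepaged-collapsed folios.
	 */
	if (folio_order(folio) != HPAGE_PMD_ORDER)
		return false;

	/* ... existing zero-page content scan continues unchanged ... */
}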
> + set_ptes(mm, address, pte, entry, nr_pages);
> + update_mmu_cache_range(NULL, vma, address, pte, nr_pages);
> + pte_unmap_unlock(pte, pte_ptl);
> + mmu_notifier_invalidate_range_end(&range);
> + result = SCAN_SUCCEED;
> +
> +out_up_read:
> + mmap_read_unlock(mm);
> + return result;
> +}
> +
> static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> int referenced, int unmapped, int order,
> struct collapse_control *cc)
> @@ -1276,6 +1340,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>
> if (order == HPAGE_PMD_ORDER)
> result = vma_collapse_anon_folio_pmd(mm, address, vma, cc, pmd, folio);
> + else
> + result = vma_collapse_anon_folio(mm, address, vma, cc, pmd, folio, order);
>
> if (result == SCAN_SUCCEED)
> folio = NULL;
^ permalink raw reply related [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise
2025-01-06 10:04 ` Usama Arif
@ 2025-01-07 7:17 ` Dev Jain
0 siblings, 0 replies; 74+ messages in thread
From: Dev Jain @ 2025-01-07 7:17 UTC (permalink / raw)
To: Usama Arif, akpm, david, willy, kirill.shutemov
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Johannes Weiner
On 06/01/25 3:34 pm, Usama Arif wrote:
>
> On 16/12/2024 16:51, Dev Jain wrote:
>> Scan the PTEs order-wise, using the mask of suitable orders for this VMA
>> derived in conjunction with sysfs THP settings. Scale down the tunables; in
>> case of collapse failure, we drop down to the next order. Otherwise, we try to
>> jump to the highest possible order and then start a fresh scan. Note that
>> madvise(MADV_COLLAPSE) has not been generalized.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> mm/khugepaged.c | 84 ++++++++++++++++++++++++++++++++++++++++---------
>> 1 file changed, 69 insertions(+), 15 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 886c76816963..078794aa3335 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -20,6 +20,7 @@
>> #include <linux/swapops.h>
>> #include <linux/shmem_fs.h>
>> #include <linux/ksm.h>
>> +#include <linux/count_zeros.h>
>>
>> #include <asm/tlb.h>
>> #include <asm/pgalloc.h>
>> @@ -1111,7 +1112,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>> }
>>
>> static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> - int referenced, int unmapped,
>> + int referenced, int unmapped, int order,
>> struct collapse_control *cc)
>> {
>> LIST_HEAD(compound_pagelist);
>> @@ -1278,38 +1279,59 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> unsigned long address, bool *mmap_locked,
>> struct collapse_control *cc)
>> {
>> - pmd_t *pmd;
>> - pte_t *pte, *_pte;
>> - int result = SCAN_FAIL, referenced = 0;
>> - int none_or_zero = 0, shared = 0;
>> - struct page *page = NULL;
>> + unsigned int max_ptes_shared, max_ptes_none, max_ptes_swap;
>> + int referenced, shared, none_or_zero, unmapped;
>> + unsigned long _address, org_address = address;
>> struct folio *folio = NULL;
>> - unsigned long _address;
>> - spinlock_t *ptl;
>> - int node = NUMA_NO_NODE, unmapped = 0;
>> + struct page *page = NULL;
>> + int node = NUMA_NO_NODE;
>> + int result = SCAN_FAIL;
>> bool writable = false;
>> + unsigned long orders;
>> + pte_t *pte, *_pte;
>> + spinlock_t *ptl;
>> + pmd_t *pmd;
>> + int order;
>>
>> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>
>> + orders = thp_vma_allowable_orders(vma, vma->vm_flags,
>> + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER + 1) - 1);
>> + orders = thp_vma_suitable_orders(vma, address, orders);
>> + order = highest_order(orders);
>> +
>> + /* MADV_COLLAPSE needs to work irrespective of sysfs setting */
>> + if (!cc->is_khugepaged)
>> + order = HPAGE_PMD_ORDER;
>> +
>> +scan_pte_range:
>> +
>> + max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
>> + max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>> + max_ptes_swap = khugepaged_max_ptes_swap >> (HPAGE_PMD_ORDER - order);
>> + referenced = 0, shared = 0, none_or_zero = 0, unmapped = 0;
>> +
> Hi Dev,
>
> Thanks for the patches.
>
> Looking at the above code, I imagine you are planning to use the max_ptes_none, max_ptes_shared and
> max_ptes_swap that are used for PMD THPs for all mTHP sizes?
>
> I think this can be a bit confusing for users who aren't familiar with kernel code, as the default
> values are for PMD THPs; e.g. max_ptes_none is 511, and the user might not know that it is going
> to be scaled down for lower-order THPs.
That makes sense.
>
> Another thing is, what if these parameters have different optimal values than the scaled-down versions
> for mTHP?
By optimal, here we mean how much the sysadmin wants khugepaged to succeed. If I want its success so badly
that I am ready to collapse for a single filled entry, then this correspondence holds true for the
scaled-down version. There may be off-by-one errors but, well, they are off-by-one errors :)
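To make that concrete, here is a small user-space sketch (not part of the series;
it assumes 4K base pages and the same right-shift scaling used in these patches)
that prints how many PTEs each order needs filled for a few max_ptes_none settings:

#include <stdio.h>

#define HPAGE_PMD_ORDER 9	/* 4K base pages, 2M PMD */

int main(void)
{
	unsigned int settings[] = { 511, 256, 255 };
	unsigned int i;
	int order;

	for (i = 0; i < sizeof(settings) / sizeof(settings[0]); i++) {
		unsigned int max_ptes_none = settings[i];

		printf("max_ptes_none = %u\n", max_ptes_none);
		for (order = HPAGE_PMD_ORDER; order >= 2; order--) {
			unsigned int nr_ptes = 1u << order;
			unsigned int scaled = max_ptes_none >> (HPAGE_PMD_ORDER - order);

			/* Collapse is allowed when none_or_zero <= scaled. */
			printf("  order %d: at least %u of %u PTEs must be filled\n",
			       order, nr_ptes - scaled, nr_ptes);
		}
	}
	return 0;
}

With 511 every order needs just one filled PTE, so the "collapse for a single
filled entry" correspondence holds exactly; with 256 every order needs 50% filled;
with 255 the required fraction creeps up from ~50% at order 9 to 75% at order 2,
which is where the rounding/off-by-one shows.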
>
> The other option is to introduce these parameters as new sysfs entries per mTHP size. These parameters
> can be very difficult to tune (and are usually left at their default values), so I don't think it's a
> good idea to introduce new sysfs parameters, but just something to think about.
Nonetheless you have a valid question, and I am not really sure how to go about this. If we are against
new sysfs entries, then the only option that follows is to scale down, and the only way the user will
know that this is happening is through the kernel documentation.
>
> Regards,
> Usama
>
>> + /* Check pmd after taking mmap lock */
>> result = find_pmd_or_thp_or_none(mm, address, &pmd);
>> if (result != SCAN_SUCCEED)
>> goto out;
>>
>> memset(cc->node_load, 0, sizeof(cc->node_load));
>> nodes_clear(cc->alloc_nmask);
>> +
>> pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>> if (!pte) {
>> result = SCAN_PMD_NULL;
>> goto out;
>> }
>>
>> - for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>> + for (_address = address, _pte = pte; _pte < pte + (1UL << order);
>> _pte++, _address += PAGE_SIZE) {
>> pte_t pteval = ptep_get(_pte);
>> if (is_swap_pte(pteval)) {
>> ++unmapped;
>> if (!cc->is_khugepaged ||
>> - unmapped <= khugepaged_max_ptes_swap) {
>> + unmapped <= max_ptes_swap) {
>> /*
>> * Always be strict with uffd-wp
>> * enabled swap entries. Please see
>> @@ -1330,7 +1352,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> ++none_or_zero;
>> if (!userfaultfd_armed(vma) &&
>> (!cc->is_khugepaged ||
>> - none_or_zero <= khugepaged_max_ptes_none)) {
>> + none_or_zero <= max_ptes_none)) {
>> continue;
>> } else {
>> result = SCAN_EXCEED_NONE_PTE;
>> @@ -1375,7 +1397,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> if (folio_likely_mapped_shared(folio)) {
>> ++shared;
>> if (cc->is_khugepaged &&
>> - shared > khugepaged_max_ptes_shared) {
>> + shared > max_ptes_shared) {
>> result = SCAN_EXCEED_SHARED_PTE;
>> count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>> goto out_unmap;
>> @@ -1432,7 +1454,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> result = SCAN_PAGE_RO;
>> } else if (cc->is_khugepaged &&
>> (!referenced ||
>> - (unmapped && referenced < HPAGE_PMD_NR / 2))) {
>> + (unmapped && referenced < (1UL << order) / 2))) {
>> result = SCAN_LACK_REFERENCED_PAGE;
>> } else {
>> result = SCAN_SUCCEED;
>> @@ -1441,9 +1463,41 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> pte_unmap_unlock(pte, ptl);
>> if (result == SCAN_SUCCEED) {
>> result = collapse_huge_page(mm, address, referenced,
>> - unmapped, cc);
>> + unmapped, order, cc);
>> /* collapse_huge_page will return with the mmap_lock released */
>> *mmap_locked = false;
>> +
>> + /* Immediately exit on exhaustion of range */
>> + if (_address == org_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
>> + goto out;
>> + }
>> + if (result != SCAN_SUCCEED) {
>> +
>> + /* Go to the next order. */
>> + order = next_order(&orders, order);
>> + if (order < 2)
>> + goto out;
>> + goto maybe_mmap_lock;
>> + } else {
>> + address = _address;
>> + pte = _pte;
>> +
>> +
>> + /* Get highest order possible starting from address */
>> + order = count_trailing_zeros(address >> PAGE_SHIFT);
>> +
>> + /* This needs to be present in the mask too */
>> + if (!(orders & (1UL << order)))
>> + order = next_order(&orders, order);
>> + if (order < 2)
>> + goto out;
>> +
>> +maybe_mmap_lock:
>> + if (!(*mmap_locked)) {
>> + mmap_read_lock(mm);
>> + *mmap_locked = true;
>> + }
>> + goto scan_pte_range;
>> }
>> out:
>> trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
2025-01-06 10:17 ` Usama Arif
@ 2025-01-07 8:12 ` Dev Jain
0 siblings, 0 replies; 74+ messages in thread
From: Dev Jain @ 2025-01-07 8:12 UTC (permalink / raw)
To: Usama Arif, akpm, david, willy, kirill.shutemov, Johannes Weiner
Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel
On 06/01/25 3:47 pm, Usama Arif wrote:
>
> On 16/12/2024 16:51, Dev Jain wrote:
>> In contrast to PMD-collapse, we do not need to operate on two levels of pagetable
>> simultaneously. Therefore, downgrade the mmap lock from write to read mode. Still
>> take the anon_vma lock in exclusive mode so as to not waste time in the rmap path,
>> which is anyway going to fail since the PTEs are going to be changed. Under the PTL,
>> copy page contents, clear the PTEs, remove folio pins, and (try to) unmap the
>> old folios. Set the PTEs to the new folio using the set_ptes() API.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> Note: I have been trying hard to get rid of the locks in here: we still are
>> taking the PTL around the page copying; dropping the PTL and taking it after
>> the copying should lead to a deadlock, for example:
>> khugepaged                   madvise(MADV_COLD)
>> folio_lock()                 lock(ptl)
>> lock(ptl)                    folio_lock()
>>
>> We can create a locked folio list, altogether drop both the locks, take the PTL,
>> do everything which __collapse_huge_page_isolate() does *except* the isolation and
>> again try locking folios, but then it will reduce efficiency of khugepaged
>> and almost looks like a forced solution :)
>> Please note the following discussion if anyone is interested:
>> https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/
>> (Apologies for not CCing the mailing list from the start)
>>
>> mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++++++++++----------
>> 1 file changed, 87 insertions(+), 21 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 88beebef773e..8040b130e677 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -714,24 +714,28 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>> struct vm_area_struct *vma,
>> unsigned long address,
>> spinlock_t *ptl,
>> - struct list_head *compound_pagelist)
>> + struct list_head *compound_pagelist, int order)
>> {
>> struct folio *src, *tmp;
>> pte_t *_pte;
>> pte_t pteval;
>>
>> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>> + for (_pte = pte; _pte < pte + (1UL << order);
>> _pte++, address += PAGE_SIZE) {
>> pteval = ptep_get(_pte);
>> if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>> if (is_zero_pfn(pte_pfn(pteval))) {
>> - /*
>> - * ptl mostly unnecessary.
>> - */
>> - spin_lock(ptl);
>> - ptep_clear(vma->vm_mm, address, _pte);
>> - spin_unlock(ptl);
>> + if (order == HPAGE_PMD_ORDER) {
>> + /*
>> + * ptl mostly unnecessary.
>> + */
>> + spin_lock(ptl);
>> + ptep_clear(vma->vm_mm, address, _pte);
>> + spin_unlock(ptl);
>> + } else {
>> + ptep_clear(vma->vm_mm, address, _pte);
>> + }
>> ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>> }
>> } else {
>> @@ -740,15 +744,20 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>> src = page_folio(src_page);
>> if (!folio_test_large(src))
>> release_pte_folio(src);
>> - /*
>> - * ptl mostly unnecessary, but preempt has to
>> - * be disabled to update the per-cpu stats
>> - * inside folio_remove_rmap_pte().
>> - */
>> - spin_lock(ptl);
>> - ptep_clear(vma->vm_mm, address, _pte);
>> - folio_remove_rmap_pte(src, src_page, vma);
>> - spin_unlock(ptl);
>> + if (order == HPAGE_PMD_ORDER) {
>> + /*
>> + * ptl mostly unnecessary, but preempt has to
>> + * be disabled to update the per-cpu stats
>> + * inside folio_remove_rmap_pte().
>> + */
>> + spin_lock(ptl);
>> + ptep_clear(vma->vm_mm, address, _pte);
>> + folio_remove_rmap_pte(src, src_page, vma);
>> + spin_unlock(ptl);
>> + } else {
>> + ptep_clear(vma->vm_mm, address, _pte);
>> + folio_remove_rmap_pte(src, src_page, vma);
>> + }
>> free_page_and_swap_cache(src_page);
>> }
>> }
>> @@ -807,7 +816,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>> static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>> pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>> unsigned long address, spinlock_t *ptl,
>> - struct list_head *compound_pagelist)
>> + struct list_head *compound_pagelist, int order)
>> {
>> unsigned int i;
>> int result = SCAN_SUCCEED;
>> @@ -815,7 +824,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>> /*
>> * Copying pages' contents is subject to memory poison at any iteration.
>> */
>> - for (i = 0; i < HPAGE_PMD_NR; i++) {
>> + for (i = 0; i < (1 << order); i++) {
>> pte_t pteval = ptep_get(pte + i);
>> struct page *page = folio_page(folio, i);
>> unsigned long src_addr = address + i * PAGE_SIZE;
>> @@ -834,7 +843,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>>
>> if (likely(result == SCAN_SUCCEED))
>> __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
>> - compound_pagelist);
>> + compound_pagelist, order);
>> else
>> __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
>> compound_pagelist, order);
>> @@ -1196,7 +1205,7 @@ static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long addre
>>
>> result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>> vma, address, pte_ptl,
>> - &compound_pagelist);
>> + &compound_pagelist, HPAGE_PMD_ORDER);
>> pte_unmap(pte);
>> if (unlikely(result != SCAN_SUCCEED))
>> goto out_up_write;
>> @@ -1228,6 +1237,61 @@ static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long addre
>> return result;
>> }
>>
>> +/* Enter with mmap read lock */
>> +static int vma_collapse_anon_folio(struct mm_struct *mm, unsigned long address,
>> + struct vm_area_struct *vma, struct collapse_control *cc, pmd_t *pmd,
>> + struct folio *folio, int order)
>> +{
>> + int result;
>> + struct mmu_notifier_range range;
>> + spinlock_t *pte_ptl;
>> + LIST_HEAD(compound_pagelist);
>> + pte_t *pte;
>> + pte_t entry;
>> + int nr_pages = folio_nr_pages(folio);
>> +
>> + anon_vma_lock_write(vma->anon_vma);
>> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>> + address + (PAGE_SIZE << order));
>> + mmu_notifier_invalidate_range_start(&range);
>> +
>> + pte = pte_offset_map_lock(mm, pmd, address, &pte_ptl);
>> + if (pte)
>> + result = __collapse_huge_page_isolate(vma, address, pte, cc,
>> + &compound_pagelist, order);
>> + else
>> + result = SCAN_PMD_NULL;
>> +
>> + if (unlikely(result != SCAN_SUCCEED))
>> + goto out_up_read;
>> +
>> + anon_vma_unlock_write(vma->anon_vma);
>> +
>> + __folio_mark_uptodate(folio);
>> + entry = mk_pte(&folio->page, vma->vm_page_prot);
>> + entry = maybe_mkwrite(entry, vma);
>> +
>> + result = __collapse_huge_page_copy(pte, folio, pmd, *pmd,
>> + vma, address, pte_ptl,
>> + &compound_pagelist, order);
>> + if (unlikely(result != SCAN_SUCCEED))
>> + goto out_up_read;
>> +
>> + folio_ref_add(folio, nr_pages - 1);
>> + folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
>> + folio_add_lru_vma(folio, vma);
>> + deferred_split_folio(folio, false);
> Hi Dev,
>
> You are adding the lower-order folios to the deferred split queue,
> but you haven't changed the THP shrinker to take this into account.
Thanks for the observation!
>
> Under memory pressure you will be doing a lot of work checking the contents of
> all mTHP pages, which will be wasted unless you change the shrinker; something
> like the below (unbuilt, untested) might work:
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c89aed1510f1..f9586df40f67 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3788,7 +3788,7 @@ static bool thp_underused(struct folio *folio)
> kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
> if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
> num_zero_pages++;
> - if (num_zero_pages > khugepaged_max_ptes_none) {
> + if (num_zero_pages > khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - folio_order(folio))) {
> kunmap_local(kaddr);
> return true;
> }
>
>
> The question is, do we want the shrinker to be run for lower-order mTHPs? It can consume
> a lot of CPU cycles and not be as useful as it is for PMD-order THPs. So instead of the above, we could
> disable the THP shrinker for lower orders?
Your suggestion makes sense to me. Freeing two order-(PMD_ORDER - 1) mTHPs gives us the same amount
of memory as freeing one PMD THP, with the downside being one extra iteration, one extra freeing,
a longer deferred list, etc.; basically more overhead.
>
>
>> + set_ptes(mm, address, pte, entry, nr_pages);
>> + update_mmu_cache_range(NULL, vma, address, pte, nr_pages);
>> + pte_unmap_unlock(pte, pte_ptl);
>> + mmu_notifier_invalidate_range_end(&range);
>> + result = SCAN_SUCCEED;
>> +
>> +out_up_read:
>> + mmap_read_unlock(mm);
>> + return result;
>> +}
>> +
>> static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> int referenced, int unmapped, int order,
>> struct collapse_control *cc)
>> @@ -1276,6 +1340,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>
>> if (order == HPAGE_PMD_ORDER)
>> result = vma_collapse_anon_folio_pmd(mm, address, vma, cc, pmd, folio);
>> + else
>> + result = vma_collapse_anon_folio(mm, address, vma, cc, pmd, folio, order);
>>
>> if (result == SCAN_SUCCEED)
>> folio = NULL;
^ permalink raw reply [flat|nested] 74+ messages in thread
end of thread, other threads:[~2025-01-07 8:13 UTC | newest]
Thread overview: 74+ messages
2024-12-16 16:50 [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse Dev Jain
2024-12-16 16:50 ` [RFC PATCH 01/12] khugepaged: Rename hpage_collapse_scan_pmd() -> ptes() Dev Jain
2024-12-17 4:18 ` Matthew Wilcox
2024-12-17 5:52 ` Dev Jain
2024-12-17 6:43 ` Ryan Roberts
2024-12-17 18:11 ` Zi Yan
2024-12-17 19:12 ` Ryan Roberts
2024-12-16 16:50 ` [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio() Dev Jain
2024-12-17 2:51 ` Baolin Wang
2024-12-17 6:08 ` Dev Jain
2024-12-17 4:17 ` Matthew Wilcox
2024-12-17 7:09 ` Ryan Roberts
2024-12-17 13:00 ` Zi Yan
2024-12-20 17:41 ` Christoph Lameter (Ampere)
2024-12-20 17:45 ` Ryan Roberts
2024-12-20 18:47 ` Christoph Lameter (Ampere)
2025-01-02 11:21 ` Ryan Roberts
2024-12-17 6:53 ` Ryan Roberts
2024-12-17 9:06 ` Dev Jain
2024-12-16 16:50 ` [RFC PATCH 03/12] khugepaged: Generalize hugepage_vma_revalidate() Dev Jain
2024-12-17 4:21 ` Matthew Wilcox
2024-12-17 16:58 ` Ryan Roberts
2024-12-16 16:50 ` [RFC PATCH 04/12] khugepaged: Generalize __collapse_huge_page_swapin() Dev Jain
2024-12-17 4:24 ` Matthew Wilcox
2024-12-16 16:50 ` [RFC PATCH 05/12] khugepaged: Generalize __collapse_huge_page_isolate() Dev Jain
2024-12-17 4:32 ` Matthew Wilcox
2024-12-17 6:41 ` Dev Jain
2024-12-17 17:14 ` Ryan Roberts
2024-12-17 17:09 ` Ryan Roberts
2024-12-16 16:50 ` [RFC PATCH 06/12] khugepaged: Generalize __collapse_huge_page_copy_failed() Dev Jain
2024-12-17 17:22 ` Ryan Roberts
2024-12-18 8:49 ` Dev Jain
2024-12-16 16:51 ` [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise Dev Jain
2024-12-17 18:15 ` Ryan Roberts
2024-12-18 9:24 ` Dev Jain
2025-01-06 10:04 ` Usama Arif
2025-01-07 7:17 ` Dev Jain
2024-12-16 16:51 ` [RFC PATCH 08/12] khugepaged: Abstract PMD-THP collapse Dev Jain
2024-12-17 19:24 ` Ryan Roberts
2024-12-18 9:26 ` Dev Jain
2024-12-16 16:51 ` [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio() Dev Jain
2024-12-16 17:06 ` David Hildenbrand
2024-12-16 19:08 ` Yang Shi
2024-12-17 10:07 ` Dev Jain
2024-12-17 10:32 ` David Hildenbrand
2024-12-18 8:35 ` Dev Jain
2025-01-02 10:08 ` Dev Jain
2025-01-02 11:33 ` David Hildenbrand
2025-01-03 8:17 ` Dev Jain
2025-01-02 11:22 ` David Hildenbrand
2024-12-18 15:59 ` Dev Jain
2025-01-06 10:17 ` Usama Arif
2025-01-07 8:12 ` Dev Jain
2024-12-16 16:51 ` [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped Dev Jain
2024-12-18 7:36 ` Ryan Roberts
2024-12-18 9:34 ` Dev Jain
2024-12-19 3:40 ` John Hubbard
2024-12-19 3:51 ` Zi Yan
2024-12-19 7:59 ` Dev Jain
2024-12-19 8:07 ` Dev Jain
2024-12-20 11:57 ` Ryan Roberts
2024-12-16 16:51 ` [RFC PATCH 11/12] khugepaged: Enable sysfs to control order of collapse Dev Jain
2024-12-16 16:51 ` [RFC PATCH 12/12] selftests/mm: khugepaged: Enlighten for mTHP collapse Dev Jain
2024-12-18 9:03 ` Ryan Roberts
2024-12-18 9:50 ` Dev Jain
2024-12-20 11:05 ` Ryan Roberts
2024-12-30 7:09 ` Dev Jain
2024-12-30 16:36 ` Zi Yan
2025-01-02 11:43 ` Ryan Roberts
2025-01-03 10:10 ` Dev Jain
2025-01-03 10:11 ` Dev Jain
2024-12-16 17:31 ` [RFC PATCH 00/12] khugepaged: Asynchronous " Dev Jain
2025-01-02 21:58 ` Nico Pache
2025-01-03 7:04 ` Dev Jain