linux-mm.kvack.org archive mirror
* [RFC PATCH 0/3] Fix and refactor do_{huge_pmd_}numa_page()
@ 2024-07-12  2:44 Zi Yan
  2024-07-12  2:44 ` [RFC PATCH 1/3] memory tiering: read last_cpupid correctly in do_huge_pmd_numa_page() Zi Yan
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Zi Yan @ 2024-07-12  2:44 UTC (permalink / raw)
  To: David Hildenbrand, Huang, Ying, linux-mm
  Cc: Zi Yan, Andrew Morton, Baolin Wang, linux-kernel

From: Zi Yan <ziy@nvidia.com>

The first patch adds the missing sysctl_numa_balancing_mode check. The
second patch introduces folio_has_cpupid() to replace the open-coded folio
last_cpupid checks. The third patch consolidates code in do_numa_page() and
do_huge_pmd_numa_page() by moving more common code into numa_migrate_prep()
(renamed to numa_migrate_check() in the patch).

The RFC is mainly for the third patch, which changes the original code
behavior:
1. do_huge_pmd_numa_page() did not check shared folios to set TNF_SHARED.
2. do_huge_pmd_numa_page() did not check and skip zone device folios.



Zi Yan (3):
  memory tiering: read last_cpupid correctly in do_huge_pmd_numa_page()
  memory tiering: introduce folio_has_cpupid() check
  mm/migrate: move common code to numa_migrate_check (was
    numa_migrate_prep)

 include/linux/memory-tiers.h |  8 ++++
 kernel/sched/fair.c          |  3 +-
 mm/huge_memory.c             | 31 +++++---------
 mm/internal.h                |  5 ++-
 mm/memory-tiers.c            | 17 ++++++++
 mm/memory.c                  | 82 +++++++++++++++++-------------------
 mm/mprotect.c                |  3 +-
 7 files changed, 80 insertions(+), 69 deletions(-)

-- 
2.43.0



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC PATCH 1/3] memory tiering: read last_cpupid correctly in do_huge_pmd_numa_page()
  2024-07-12  2:44 [RFC PATCH 0/3] Fix and refactor do_{huge_pmd_}numa_page() Zi Yan
@ 2024-07-12  2:44 ` Zi Yan
  2024-07-12  3:22   ` Huang, Ying
                     ` (2 more replies)
  2024-07-12  2:44 ` [RFC PATCH 2/3] memory tiering: introduce folio_has_cpupid() check Zi Yan
  2024-07-12  2:44 ` [RFC PATCH 3/3] mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep) Zi Yan
  2 siblings, 3 replies; 18+ messages in thread
From: Zi Yan @ 2024-07-12  2:44 UTC (permalink / raw)
  To: David Hildenbrand, Huang, Ying, linux-mm
  Cc: Zi Yan, Andrew Morton, Baolin Wang, linux-kernel

From: Zi Yan <ziy@nvidia.com>

last_cpupid is only available when memory tiering is off or the folio
is in a toptier node. Complete the check so that last_cpupid is read
when it is available.

Before the fix, the default last_cpupid was used instead of the actual
value even when memory tiering mode was turned off at runtime. This can
prevent task_numa_fault() from getting the right NUMA fault stats, but
should not cause any crash. Users might see performance changes after
the fix.

Fixes: 33024536bafd ("memory tiering: hot page selection with hint page fault latency")
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d7c84480f1a4..07d9dde4ca33 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1705,7 +1705,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	 * For memory tiering mode, cpupid of slow memory page is used
 	 * to record page access time.  So use default value.
 	 */
-	if (node_is_toptier(nid))
+	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
+	    node_is_toptier(nid))
 		last_cpupid = folio_last_cpupid(folio);
 	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
 	if (target_nid == NUMA_NO_NODE)
-- 
2.43.0




* [RFC PATCH 2/3] memory tiering: introduce folio_has_cpupid() check
  2024-07-12  2:44 [RFC PATCH 0/3] Fix and refactor do_{huge_pmd_}numa_page() Zi Yan
  2024-07-12  2:44 ` [RFC PATCH 1/3] memory tiering: read last_cpupid correctly in do_huge_pmd_numa_page() Zi Yan
@ 2024-07-12  2:44 ` Zi Yan
  2024-07-12  6:27   ` Huang, Ying
  2024-07-12  2:44 ` [RFC PATCH 3/3] mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep) Zi Yan
  2 siblings, 1 reply; 18+ messages in thread
From: Zi Yan @ 2024-07-12  2:44 UTC (permalink / raw)
  To: David Hildenbrand, Huang, Ying, linux-mm
  Cc: Zi Yan, Andrew Morton, Baolin Wang, linux-kernel

From: Zi Yan <ziy@nvidia.com>

Instead of an open-coded check for whether memory tiering mode is on and a
folio is in top-tier memory, use a function to encapsulate the check.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/memory-tiers.h |  8 ++++++++
 kernel/sched/fair.c          |  3 +--
 mm/huge_memory.c             |  6 ++----
 mm/memory-tiers.c            | 17 +++++++++++++++++
 mm/memory.c                  |  3 +--
 mm/mprotect.c                |  3 +--
 6 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 0dc0cf2863e2..10c127d461c4 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -73,6 +73,10 @@ static inline bool node_is_toptier(int node)
 }
 #endif
 
+
+bool folio_has_cpupid(struct folio *folio);
+
+
 #else
 
 #define numa_demotion_enabled	false
@@ -151,5 +155,9 @@ static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist,
 static inline void mt_put_memory_types(struct list_head *memory_types)
 {
 }
+static inline bool folio_has_cpupid(struct folio *folio)
+{
+	return true;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a5b1ae0aa55..03de808cb3cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1840,8 +1840,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 	 * The pages in slow memory node should be migrated according
 	 * to hot/cold instead of private/shared.
 	 */
-	if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
-	    !node_is_toptier(src_nid)) {
+	if (!folio_has_cpupid(folio)) {
 		struct pglist_data *pgdat;
 		unsigned long rate_limit;
 		unsigned int latency, th, def_th;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 07d9dde4ca33..8c11d6da4b36 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1705,8 +1705,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	 * For memory tiering mode, cpupid of slow memory page is used
 	 * to record page access time.  So use default value.
 	 */
-	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
-	    node_is_toptier(nid))
+	if (folio_has_cpupid(folio))
 		last_cpupid = folio_last_cpupid(folio);
 	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
 	if (target_nid == NUMA_NO_NODE)
@@ -2059,8 +2058,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    toptier)
 			goto unlock;
 
-		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
-		    !toptier)
+		if (!folio_has_cpupid(folio))
 			folio_xchg_access_time(folio,
 					       jiffies_to_msecs(jiffies));
 	}
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 4775b3a3dabe..7f0360d4e3a0 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -6,6 +6,7 @@
 #include <linux/memory.h>
 #include <linux/memory-tiers.h>
 #include <linux/notifier.h>
+#include <linux/sched/sysctl.h>
 
 #include "internal.h"
 
@@ -50,6 +51,22 @@ static const struct bus_type memory_tier_subsys = {
 	.dev_name = "memory_tier",
 };
 
+/**
+ * folio_has_cpupid - check if a folio has cpupid information
+ * @folio: folio to check
+ *
+ * A folio's _last_cpupid field is repurposed by memory tiering: in memory
+ * tiering mode, the field of a slow memory (non-toptier) folio records the
+ * page access time instead of a cpupid.
+ *
+ * Return: true if the folio's _last_cpupid field holds a valid cpupid
+ */
+bool folio_has_cpupid(struct folio *folio)
+{
+	return !(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
+	       node_is_toptier(folio_nid(folio));
+}
+
 #ifdef CONFIG_MIGRATION
 static int top_tier_adistance;
 /*
diff --git a/mm/memory.c b/mm/memory.c
index dceb62f3fa34..96c2f5b3d796 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5344,8 +5344,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	 * For memory tiering mode, cpupid of slow memory page is used
 	 * to record page access time.  So use default value.
 	 */
-	if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
-	    !node_is_toptier(nid))
+	if (!folio_has_cpupid(folio))
 		last_cpupid = (-1 & LAST_CPUPID_MASK);
 	else
 		last_cpupid = folio_last_cpupid(folio);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 222ab434da54..787c3c2bf1b6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -161,8 +161,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
 				    toptier)
 					continue;
-				if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
-				    !toptier)
+				if (!folio_has_cpupid(folio))
 					folio_xchg_access_time(folio,
 						jiffies_to_msecs(jiffies));
 			}
-- 
2.43.0




* [RFC PATCH 3/3] mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep)
  2024-07-12  2:44 [RFC PATCH 0/3] Fix and refactor do_{huge_pmd_}numa_page() Zi Yan
  2024-07-12  2:44 ` [RFC PATCH 1/3] memory tiering: read last_cpupid correctly in do_huge_pmd_numa_page() Zi Yan
  2024-07-12  2:44 ` [RFC PATCH 2/3] memory tiering: introduce folio_has_cpupid() check Zi Yan
@ 2024-07-12  2:44 ` Zi Yan
  2024-07-18  8:36   ` Huang, Ying
  2 siblings, 1 reply; 18+ messages in thread
From: Zi Yan @ 2024-07-12  2:44 UTC (permalink / raw)
  To: David Hildenbrand, Huang, Ying, linux-mm
  Cc: Zi Yan, Andrew Morton, Baolin Wang, linux-kernel

From: Zi Yan <ziy@nvidia.com>

do_numa_page() and do_huge_pmd_numa_page() share a lot of common code. To
reduce redundancy, move the common code into numa_migrate_prep() and rename
the function to numa_migrate_check() to reflect its functionality.

There are some differences between do_numa_page() and
do_huge_pmd_numa_page() before the code move:

1. do_huge_pmd_numa_page() did not check shared folios to set TNF_SHARED.
2. do_huge_pmd_numa_page() did not check and skip zone device folios.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 28 ++++++-----------
 mm/internal.h    |  5 +--
 mm/memory.c      | 81 +++++++++++++++++++++++-------------------------
 3 files changed, 52 insertions(+), 62 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8c11d6da4b36..66d67d13e0dc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1670,10 +1670,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	pmd_t pmd;
 	struct folio *folio;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
-	int nid = NUMA_NO_NODE;
-	int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
+	int target_nid = NUMA_NO_NODE;
+	int last_cpupid = (-1 & LAST_CPUPID_MASK);
 	bool writable = false;
-	int flags = 0;
+	int flags = 0, nr_pages;
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
 	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
@@ -1693,21 +1693,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 		writable = true;
 
 	folio = vm_normal_folio_pmd(vma, haddr, pmd);
-	if (!folio)
+	if (!folio || folio_is_zone_device(folio))
 		goto out_map;
 
-	/* See similar comment in do_numa_page for explanation */
-	if (!writable)
-		flags |= TNF_NO_GROUP;
+	nr_pages = folio_nr_pages(folio);
 
-	nid = folio_nid(folio);
-	/*
-	 * For memory tiering mode, cpupid of slow memory page is used
-	 * to record page access time.  So use default value.
-	 */
-	if (folio_has_cpupid(folio))
-		last_cpupid = folio_last_cpupid(folio);
-	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
+	target_nid = numa_migrate_check(folio, vmf, haddr, writable,
+			&flags, &last_cpupid);
 	if (target_nid == NUMA_NO_NODE)
 		goto out_map;
 	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
@@ -1720,8 +1712,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 
 	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
 		flags |= TNF_MIGRATED;
-		nid = target_nid;
 	} else {
+		target_nid = NUMA_NO_NODE;
 		flags |= TNF_MIGRATE_FAIL;
 		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
 		if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
@@ -1732,8 +1724,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	}
 
 out:
-	if (nid != NUMA_NO_NODE)
-		task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
+	if (target_nid != NUMA_NO_NODE)
+		task_numa_fault(last_cpupid, target_nid, nr_pages, flags);
 
 	return 0;
 
diff --git a/mm/internal.h b/mm/internal.h
index b4d86436565b..7782b7bb3383 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1217,8 +1217,9 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
 
 void __vunmap_range_noflush(unsigned long start, unsigned long end);
 
-int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
-		      unsigned long addr, int page_nid, int *flags);
+int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
+		      unsigned long addr, bool writable,
+		      int *flags, int *last_cpupid);
 
 void free_zone_device_folio(struct folio *folio);
 int migrate_device_coherent_page(struct page *page);
diff --git a/mm/memory.c b/mm/memory.c
index 96c2f5b3d796..a252c0f13755 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5209,16 +5209,42 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 	return ret;
 }
 
-int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
-		      unsigned long addr, int page_nid, int *flags)
+int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
+		      unsigned long addr, bool writable,
+		      int *flags, int *last_cpupid)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
+	/*
+	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
+	 * much anyway since they can be in shared cache state. This misses
+	 * the case where a mapping is writable but the process never writes
+	 * to it but pte_write gets cleared during protection updates and
+	 * pte_dirty has unpredictable behaviour between PTE scan updates,
+	 * background writeback, dirty balancing and application behaviour.
+	 */
+	if (!writable)
+		*flags |= TNF_NO_GROUP;
+
+	/*
+	 * Flag if the folio is shared between multiple address spaces. This
+	 * is later used when determining whether to group tasks together
+	 */
+	if (folio_likely_mapped_shared(folio) && (vma->vm_flags & VM_SHARED))
+		*flags |= TNF_SHARED;
+
+	/*
+	 * For memory tiering mode, cpupid of slow memory page is used
+	 * to record page access time.
+	 */
+	if (folio_has_cpupid(folio))
+		*last_cpupid = folio_last_cpupid(folio);
+
 	/* Record the current PID acceesing VMA */
 	vma_set_access_pid_bit(vma);
 
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (page_nid == numa_node_id()) {
+	if (folio_nid(folio) == numa_node_id()) {
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 		*flags |= TNF_FAULT_LOCAL;
 	}
@@ -5284,12 +5310,11 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio = NULL;
-	int nid = NUMA_NO_NODE;
+	int target_nid = NUMA_NO_NODE;
 	bool writable = false, ignore_writable = false;
 	bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
-	int last_cpupid;
-	int target_nid;
-	pte_t pte, old_pte;
+	int last_cpupid = (-1 & LAST_CPUPID_MASK);
+	pte_t pte, old_pte = vmf->orig_pte;
 	int flags = 0, nr_pages;
 
 	/*
@@ -5297,10 +5322,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	 * table lock, that its contents have not changed during fault handling.
 	 */
 	spin_lock(vmf->ptl);
-	/* Read the live PTE from the page tables: */
-	old_pte = ptep_get(vmf->pte);
-
-	if (unlikely(!pte_same(old_pte, vmf->orig_pte))) {
+	if (unlikely(!pte_same(old_pte, *vmf->pte))) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		goto out;
 	}
@@ -5320,35 +5342,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	if (!folio || folio_is_zone_device(folio))
 		goto out_map;
 
-	/*
-	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
-	 * much anyway since they can be in shared cache state. This misses
-	 * the case where a mapping is writable but the process never writes
-	 * to it but pte_write gets cleared during protection updates and
-	 * pte_dirty has unpredictable behaviour between PTE scan updates,
-	 * background writeback, dirty balancing and application behaviour.
-	 */
-	if (!writable)
-		flags |= TNF_NO_GROUP;
-
-	/*
-	 * Flag if the folio is shared between multiple address spaces. This
-	 * is later used when determining whether to group tasks together
-	 */
-	if (folio_likely_mapped_shared(folio) && (vma->vm_flags & VM_SHARED))
-		flags |= TNF_SHARED;
-
-	nid = folio_nid(folio);
 	nr_pages = folio_nr_pages(folio);
-	/*
-	 * For memory tiering mode, cpupid of slow memory page is used
-	 * to record page access time.  So use default value.
-	 */
-	if (!folio_has_cpupid(folio))
-		last_cpupid = (-1 & LAST_CPUPID_MASK);
-	else
-		last_cpupid = folio_last_cpupid(folio);
-	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
+
+	target_nid = numa_migrate_check(folio, vmf, vmf->address, writable,
+			&flags, &last_cpupid);
 	if (target_nid == NUMA_NO_NODE)
 		goto out_map;
 	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
@@ -5362,9 +5359,9 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 
 	/* Migrate to the requested node */
 	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
-		nid = target_nid;
 		flags |= TNF_MIGRATED;
 	} else {
+		target_nid = NUMA_NO_NODE;
 		flags |= TNF_MIGRATE_FAIL;
 		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
 					       vmf->address, &vmf->ptl);
@@ -5378,8 +5375,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	}
 
 out:
-	if (nid != NUMA_NO_NODE)
-		task_numa_fault(last_cpupid, nid, nr_pages, flags);
+	if (target_nid != NUMA_NO_NODE)
+		task_numa_fault(last_cpupid, target_nid, nr_pages, flags);
 	return 0;
 out_map:
 	/*
-- 
2.43.0




* Re: [RFC PATCH 1/3] memory tiering: read last_cpupid correctly in do_huge_pmd_numa_page()
  2024-07-12  2:44 ` [RFC PATCH 1/3] memory tiering: read last_cpupid correctly in do_huge_pmd_numa_page() Zi Yan
@ 2024-07-12  3:22   ` Huang, Ying
  2024-07-12  4:01   ` Baolin Wang
  2024-07-13  1:13   ` David Hildenbrand
  2 siblings, 0 replies; 18+ messages in thread
From: Huang, Ying @ 2024-07-12  3:22 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, Zi Yan, Andrew Morton, Baolin Wang,
	linux-kernel

Zi Yan <zi.yan@sent.com> writes:

> From: Zi Yan <ziy@nvidia.com>
>
> last_cpupid is only available when memory tiering is off or the folio
> is in a toptier node. Complete the check so that last_cpupid is read
> when it is available.
>
> Before the fix, the default last_cpupid was used instead of the actual
> value even when memory tiering mode was turned off at runtime. This can
> prevent task_numa_fault() from getting the right NUMA fault stats, but
> should not cause any crash. Users might see performance changes after
> the fix.
>
> Fixes: 33024536bafd ("memory tiering: hot page selection with hint page fault latency")
> Signed-off-by: Zi Yan <ziy@nvidia.com>

Good catch!  Thanks!

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

> ---
>  mm/huge_memory.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d7c84480f1a4..07d9dde4ca33 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1705,7 +1705,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  	 * For memory tiering mode, cpupid of slow memory page is used
>  	 * to record page access time.  So use default value.
>  	 */
> -	if (node_is_toptier(nid))
> +	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
> +	    node_is_toptier(nid))
>  		last_cpupid = folio_last_cpupid(folio);
>  	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>  	if (target_nid == NUMA_NO_NODE)



* Re: [RFC PATCH 1/3] memory tiering: read last_cpupid correctly in do_huge_pmd_numa_page()
  2024-07-12  2:44 ` [RFC PATCH 1/3] memory tiering: read last_cpupid correctly in do_huge_pmd_numa_page() Zi Yan
  2024-07-12  3:22   ` Huang, Ying
@ 2024-07-12  4:01   ` Baolin Wang
  2024-07-13  1:13   ` David Hildenbrand
  2 siblings, 0 replies; 18+ messages in thread
From: Baolin Wang @ 2024-07-12  4:01 UTC (permalink / raw)
  To: Zi Yan, David Hildenbrand, Huang, Ying, linux-mm
  Cc: Andrew Morton, linux-kernel



On 2024/7/12 10:44, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> last_cpupid is only available when memory tiering is off or the folio
> is in a toptier node. Complete the check so that last_cpupid is read
> when it is available.
> 
> Before the fix, the default last_cpupid was used instead of the actual
> value even when memory tiering mode was turned off at runtime. This can
> prevent task_numa_fault() from getting the right NUMA fault stats, but
> should not cause any crash. Users might see performance changes after
> the fix.
> 
> Fixes: 33024536bafd ("memory tiering: hot page selection with hint page fault latency")
> Signed-off-by: Zi Yan <ziy@nvidia.com>

LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

> ---
>   mm/huge_memory.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d7c84480f1a4..07d9dde4ca33 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1705,7 +1705,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>   	 * For memory tiering mode, cpupid of slow memory page is used
>   	 * to record page access time.  So use default value.
>   	 */
> -	if (node_is_toptier(nid))
> +	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
> +	    node_is_toptier(nid))
>   		last_cpupid = folio_last_cpupid(folio);
>   	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>   	if (target_nid == NUMA_NO_NODE)



* Re: [RFC PATCH 2/3] memory tiering: introduce folio_has_cpupid() check
  2024-07-12  2:44 ` [RFC PATCH 2/3] memory tiering: introduce folio_has_cpupid() check Zi Yan
@ 2024-07-12  6:27   ` Huang, Ying
  0 siblings, 0 replies; 18+ messages in thread
From: Huang, Ying @ 2024-07-12  6:27 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, Zi Yan, Andrew Morton, Baolin Wang,
	linux-kernel

Zi Yan <zi.yan@sent.com> writes:

> From: Zi Yan <ziy@nvidia.com>
>
> Instead of an open-coded check for whether memory tiering mode is on and a
> folio is in top-tier memory, use a function to encapsulate the check.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>

LGTM, Thanks!  Feel free to add

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

in your future versions.

> ---
>  include/linux/memory-tiers.h |  8 ++++++++
>  kernel/sched/fair.c          |  3 +--
>  mm/huge_memory.c             |  6 ++----
>  mm/memory-tiers.c            | 17 +++++++++++++++++
>  mm/memory.c                  |  3 +--
>  mm/mprotect.c                |  3 +--
>  6 files changed, 30 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 0dc0cf2863e2..10c127d461c4 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -73,6 +73,10 @@ static inline bool node_is_toptier(int node)
>  }
>  #endif
>  
> +
> +bool folio_has_cpupid(struct folio *folio);
> +
> +
>  #else
>  
>  #define numa_demotion_enabled	false
> @@ -151,5 +155,9 @@ static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist,
>  static inline void mt_put_memory_types(struct list_head *memory_types)
>  {
>  }
> +static inline bool folio_has_cpupid(struct folio *folio)
> +{
> +	return true;
> +}
>  #endif	/* CONFIG_NUMA */
>  #endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a5b1ae0aa55..03de808cb3cc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1840,8 +1840,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>  	 * The pages in slow memory node should be migrated according
>  	 * to hot/cold instead of private/shared.
>  	 */
> -	if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
> -	    !node_is_toptier(src_nid)) {
> +	if (!folio_has_cpupid(folio)) {
>  		struct pglist_data *pgdat;
>  		unsigned long rate_limit;
>  		unsigned int latency, th, def_th;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 07d9dde4ca33..8c11d6da4b36 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1705,8 +1705,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  	 * For memory tiering mode, cpupid of slow memory page is used
>  	 * to record page access time.  So use default value.
>  	 */
> -	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
> -	    node_is_toptier(nid))
> +	if (folio_has_cpupid(folio))
>  		last_cpupid = folio_last_cpupid(folio);
>  	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>  	if (target_nid == NUMA_NO_NODE)
> @@ -2059,8 +2058,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		    toptier)
>  			goto unlock;
>  
> -		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
> -		    !toptier)
> +		if (!folio_has_cpupid(folio))
>  			folio_xchg_access_time(folio,
>  					       jiffies_to_msecs(jiffies));
>  	}
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 4775b3a3dabe..7f0360d4e3a0 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -6,6 +6,7 @@
>  #include <linux/memory.h>
>  #include <linux/memory-tiers.h>
>  #include <linux/notifier.h>
> +#include <linux/sched/sysctl.h>
>  
>  #include "internal.h"
>  
> @@ -50,6 +51,22 @@ static const struct bus_type memory_tier_subsys = {
>  	.dev_name = "memory_tier",
>  };
>  
> +/**
> + * folio_has_cpupid - check if a folio has cpupid information
> + * @folio: folio to check
> + *
> + * A folio's _last_cpupid field is repurposed by memory tiering: in memory
> + * tiering mode, the field of a slow memory (non-toptier) folio records the
> + * page access time instead of a cpupid.
> + *
> + * Return: true if the folio's _last_cpupid field holds a valid cpupid
> + */
> +bool folio_has_cpupid(struct folio *folio)
> +{
> +	return !(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
> +	       node_is_toptier(folio_nid(folio));
> +}
> +
>  #ifdef CONFIG_MIGRATION
>  static int top_tier_adistance;
>  /*
> diff --git a/mm/memory.c b/mm/memory.c
> index dceb62f3fa34..96c2f5b3d796 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5344,8 +5344,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	 * For memory tiering mode, cpupid of slow memory page is used
>  	 * to record page access time.  So use default value.
>  	 */
> -	if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
> -	    !node_is_toptier(nid))
> +	if (!folio_has_cpupid(folio))
>  		last_cpupid = (-1 & LAST_CPUPID_MASK);
>  	else
>  		last_cpupid = folio_last_cpupid(folio);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 222ab434da54..787c3c2bf1b6 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -161,8 +161,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
>  				    toptier)
>  					continue;
> -				if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
> -				    !toptier)
> +				if (!folio_has_cpupid(folio))
>  					folio_xchg_access_time(folio,
>  						jiffies_to_msecs(jiffies));
>  			}



* Re: [RFC PATCH 1/3] memory tiering: read last_cpupid correctly in do_huge_pmd_numa_page()
  2024-07-12  2:44 ` [RFC PATCH 1/3] memory tiering: read last_cpupid correctly in do_huge_pmd_numa_page() Zi Yan
  2024-07-12  3:22   ` Huang, Ying
  2024-07-12  4:01   ` Baolin Wang
@ 2024-07-13  1:13   ` David Hildenbrand
  2024-07-13  1:18     ` Zi Yan
  2 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2024-07-13  1:13 UTC (permalink / raw)
  To: Zi Yan, Huang, Ying, linux-mm; +Cc: Andrew Morton, Baolin Wang, linux-kernel

On 12.07.24 04:44, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> last_cpupid is only available when memory tiering is off or the folio
> is in a toptier node. Complete the check so that last_cpupid is read
> when it is available.
> 
> Before the fix, the default last_cpupid was used instead of the actual
> value even when memory tiering mode was turned off at runtime. This can
> prevent task_numa_fault() from getting the right NUMA fault stats, but
> should not cause any crash. Users might see performance changes after
> the fix.
> 
> Fixes: 33024536bafd ("memory tiering: hot page selection with hint page fault latency")
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   mm/huge_memory.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d7c84480f1a4..07d9dde4ca33 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1705,7 +1705,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>   	 * For memory tiering mode, cpupid of slow memory page is used
>   	 * to record page access time.  So use default value.
>   	 */
> -	if (node_is_toptier(nid))
> +	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
> +	    node_is_toptier(nid))
>   		last_cpupid = folio_last_cpupid(folio);
>   	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>   	if (target_nid == NUMA_NO_NODE)

Reported-by: ...
Closes: ...

If it applies ;)

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb




* Re: [RFC PATCH 1/3] memory tiering: read last_cpupid correctly in do_huge_pmd_numa_page()
  2024-07-13  1:13   ` David Hildenbrand
@ 2024-07-13  1:18     ` Zi Yan
  2024-07-13  1:23       ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Zi Yan @ 2024-07-13  1:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Huang, Ying, linux-mm, Andrew Morton, Baolin Wang, linux-kernel


On 12 Jul 2024, at 21:13, David Hildenbrand wrote:

> On 12.07.24 04:44, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> last_cpupid is only available when memory tiering is off or the folio
>> is in a toptier node. Complete the check so that last_cpupid is read
>> when it is available.
>>
>> Before the fix, the default last_cpupid was used instead of the actual
>> value even when memory tiering mode was turned off at runtime. This can
>> prevent task_numa_fault() from getting the right NUMA fault stats, but
>> should not cause any crash. Users might see performance changes after
>> the fix.
>>
>> Fixes: 33024536bafd ("memory tiering: hot page selection with hint page fault latency")
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>   mm/huge_memory.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index d7c84480f1a4..07d9dde4ca33 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1705,7 +1705,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>   	 * For memory tiering mode, cpupid of slow memory page is used
>>   	 * to record page access time.  So use default value.
>>   	 */
>> -	if (node_is_toptier(nid))
>> +	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
>> +	    node_is_toptier(nid))
>>   		last_cpupid = folio_last_cpupid(folio);
>>   	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>>   	if (target_nid == NUMA_NO_NODE)
>
> Reported-by: ...

Reported-by: David Hildenbrand <david@redhat.com>

I suppose your email[1] reports the issue based on code inspection.

> Closes: ...

Closes: https://lore.kernel.org/linux-mm/9af34a6b-ca56-4a64-8aa6-ade65f109288@redhat.com/

Will add them in the next version.

>
> If it applies ;)
>
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks.


[1] https://lore.kernel.org/linux-mm/9af34a6b-ca56-4a64-8aa6-ade65f109288@redhat.com/

--
Best Regards,
Yan, Zi



* Re: [RFC PATCH 1/3] memory tiering: read last_cpupid correctly in do_huge_pmd_numa_page()
  2024-07-13  1:18     ` Zi Yan
@ 2024-07-13  1:23       ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2024-07-13  1:23 UTC (permalink / raw)
  To: Zi Yan; +Cc: Huang, Ying, linux-mm, Andrew Morton, Baolin Wang, linux-kernel

On 13.07.24 03:18, Zi Yan wrote:
> On 12 Jul 2024, at 21:13, David Hildenbrand wrote:
> 
>> On 12.07.24 04:44, Zi Yan wrote:
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> last_cpupid is only available when memory tiering is off or the folio
>>> is in a toptier node. Complete the check to read last_cpupid when it is
>>> available.
>>>
>>> Before the fix, the default last_cpupid will be used instead of the
>>> actual value even if memory tiering mode is turned off at runtime. This
>>> can prevent task_numa_fault() from getting the right NUMA fault stats,
>>> but should not cause any crash. Users might see performance changes
>>> after the fix.
>>>
>>> Fixes: 33024536bafd ("memory tiering: hot page selection with hint page fault latency")
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>>    mm/huge_memory.c | 3 ++-
>>>    1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index d7c84480f1a4..07d9dde4ca33 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1705,7 +1705,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>    	 * For memory tiering mode, cpupid of slow memory page is used
>>>    	 * to record page access time.  So use default value.
>>>    	 */
>>> -	if (node_is_toptier(nid))
>>> +	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
>>> +	    node_is_toptier(nid))
>>>    		last_cpupid = folio_last_cpupid(folio);
>>>    	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>>>    	if (target_nid == NUMA_NO_NODE)
>>
>> Reported-by: ...
> 
> Reported-by: David Hildenbrand <david@redhat.com>
> 
> I suppose your email[1] reports the issue based on code inspection.


Yes, thanks for taking care of the fix and cleanups!

-- 
Cheers,

David / dhildenb




* Re: [RFC PATCH 3/3] mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep)
  2024-07-12  2:44 ` [RFC PATCH 3/3] mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep) Zi Yan
@ 2024-07-18  8:36   ` Huang, Ying
  2024-07-18 14:40     ` Zi Yan
  2024-07-19 20:19     ` Zi Yan
  0 siblings, 2 replies; 18+ messages in thread
From: Huang, Ying @ 2024-07-18  8:36 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, Zi Yan, Andrew Morton, Baolin Wang,
	linux-kernel

Zi Yan <zi.yan@sent.com> writes:

> From: Zi Yan <ziy@nvidia.com>
>
> do_numa_page() and do_huge_pmd_numa_page() share a lot of common code. To
> reduce redundancy, move common code to numa_migrate_prep() and rename
> the function to numa_migrate_check() to reflect its functionality.
>
> There is some code difference between do_numa_page() and
> do_huge_pmd_numa_page() before the code move:
>
> 1. do_huge_pmd_numa_page() did not check shared folios to set TNF_SHARED.
> 2. do_huge_pmd_numa_page() did not check and skip zone device folios.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  mm/huge_memory.c | 28 ++++++-----------
>  mm/internal.h    |  5 +--
>  mm/memory.c      | 81 +++++++++++++++++++++++-------------------------
>  3 files changed, 52 insertions(+), 62 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8c11d6da4b36..66d67d13e0dc 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1670,10 +1670,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  	pmd_t pmd;
>  	struct folio *folio;
>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> -	int nid = NUMA_NO_NODE;
> -	int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
> +	int target_nid = NUMA_NO_NODE;
> +	int last_cpupid = (-1 & LAST_CPUPID_MASK);
>  	bool writable = false;
> -	int flags = 0;
> +	int flags = 0, nr_pages;
>  
>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>  	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
> @@ -1693,21 +1693,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  		writable = true;
>  
>  	folio = vm_normal_folio_pmd(vma, haddr, pmd);
> -	if (!folio)
> +	if (!folio || folio_is_zone_device(folio))

This change appears unrelated.  Can we put it in a separate patch?

IIUC, this isn't necessary even in do_numa_page(), because
folio_is_zone_device() has already been checked in change_pte_range().
But it doesn't hurt either.

>  		goto out_map;
>  
> -	/* See similar comment in do_numa_page for explanation */
> -	if (!writable)
> -		flags |= TNF_NO_GROUP;
> +	nr_pages = folio_nr_pages(folio);
>  
> -	nid = folio_nid(folio);
> -	/*
> -	 * For memory tiering mode, cpupid of slow memory page is used
> -	 * to record page access time.  So use default value.
> -	 */
> -	if (folio_has_cpupid(folio))
> -		last_cpupid = folio_last_cpupid(folio);
> -	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
> +	target_nid = numa_migrate_check(folio, vmf, haddr, writable,
> +			&flags, &last_cpupid);
>  	if (target_nid == NUMA_NO_NODE)
>  		goto out_map;
>  	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
> @@ -1720,8 +1712,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  
>  	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
>  		flags |= TNF_MIGRATED;
> -		nid = target_nid;
>  	} else {
> +		target_nid = NUMA_NO_NODE;
>  		flags |= TNF_MIGRATE_FAIL;
>  		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>  		if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
> @@ -1732,8 +1724,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  	}
>  
>  out:
> -	if (nid != NUMA_NO_NODE)
> -		task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
> +	if (target_nid != NUMA_NO_NODE)
> +		task_numa_fault(last_cpupid, target_nid, nr_pages, flags);

This appears to be a behavior change.  IIUC, there are 2 possible issues.

1) if migrate_misplaced_folio() fails, folio_nid() should be used as
the nid.  "target_nid" is a confusing variable name here, because
folio_nid() is what is actually needed.

2) if !pmd_same(), task_numa_fault() should be skipped.  The original
code is buggy.

Similar issues for do_numa_page().

If my understanding is correct, we should implement a separate patch
to fix 2) above.  And that may need to be backported.
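
To make this concrete, the accounting decision at the "out:" label can be
modeled roughly like this (an untested userspace sketch of the intended
logic, not kernel code; NUMA_NO_NODE and the helper name are stand-ins):

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_NO_NODE (-1)

/*
 * Toy model of which nid should reach task_numa_fault() at "out:":
 * - on successful migration, account against target_nid;
 * - on failed migration with the PTE/PMD unchanged, fall back to
 *   the folio's current node (what folio_nid() would return);
 * - if the PTE/PMD changed under us, skip accounting entirely.
 */
static int numa_fault_nid(bool migrated, bool pte_pmd_same,
			  int folio_nid, int target_nid)
{
	if (migrated)
		return target_nid;	/* TNF_MIGRATED */
	if (!pte_pmd_same)
		return NUMA_NO_NODE;	/* skip task_numa_fault() */
	return folio_nid;		/* TNF_MIGRATE_FAIL */
}
```

The caller would then invoke task_numa_fault() only when the returned nid
is not NUMA_NO_NODE.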

>  
>  	return 0;
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index b4d86436565b..7782b7bb3383 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1217,8 +1217,9 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
>  
>  void __vunmap_range_noflush(unsigned long start, unsigned long end);
>  
> -int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
> -		      unsigned long addr, int page_nid, int *flags);
> +int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
> +		      unsigned long addr, bool writable,
> +		      int *flags, int *last_cpupid);
>  
>  void free_zone_device_folio(struct folio *folio);
>  int migrate_device_coherent_page(struct page *page);
> diff --git a/mm/memory.c b/mm/memory.c
> index 96c2f5b3d796..a252c0f13755 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5209,16 +5209,42 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
>  	return ret;
>  }
>  
> -int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
> -		      unsigned long addr, int page_nid, int *flags)
> +int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
> +		      unsigned long addr, bool writable,
> +		      int *flags, int *last_cpupid)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  
> +	/*
> +	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
> +	 * much anyway since they can be in shared cache state. This misses
> +	 * the case where a mapping is writable but the process never writes
> +	 * to it but pte_write gets cleared during protection updates and
> +	 * pte_dirty has unpredictable behaviour between PTE scan updates,
> +	 * background writeback, dirty balancing and application behaviour.
> +	 */
> +	if (!writable)
> +		*flags |= TNF_NO_GROUP;
> +
> +	/*
> +	 * Flag if the folio is shared between multiple address spaces. This
> +	 * is later used when determining whether to group tasks together
> +	 */
> +	if (folio_likely_mapped_shared(folio) && (vma->vm_flags & VM_SHARED))
> +		*flags |= TNF_SHARED;
> +
> +	/*
> +	 * For memory tiering mode, cpupid of slow memory page is used
> +	 * to record page access time.
> +	 */
> +	if (folio_has_cpupid(folio))
> +		*last_cpupid = folio_last_cpupid(folio);
> +
>  	/* Record the current PID acceesing VMA */
>  	vma_set_access_pid_bit(vma);
>  
>  	count_vm_numa_event(NUMA_HINT_FAULTS);
> -	if (page_nid == numa_node_id()) {
> +	if (folio_nid(folio) == numa_node_id()) {
>  		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
>  		*flags |= TNF_FAULT_LOCAL;
>  	}
> @@ -5284,12 +5310,11 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct folio *folio = NULL;
> -	int nid = NUMA_NO_NODE;
> +	int target_nid = NUMA_NO_NODE;
>  	bool writable = false, ignore_writable = false;
>  	bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
> -	int last_cpupid;
> -	int target_nid;
> -	pte_t pte, old_pte;
> +	int last_cpupid = (-1 & LAST_CPUPID_MASK);
> +	pte_t pte, old_pte = vmf->orig_pte;
>  	int flags = 0, nr_pages;
>  
>  	/*
> @@ -5297,10 +5322,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	 * table lock, that its contents have not changed during fault handling.
>  	 */
>  	spin_lock(vmf->ptl);
> -	/* Read the live PTE from the page tables: */
> -	old_pte = ptep_get(vmf->pte);
> -
> -	if (unlikely(!pte_same(old_pte, vmf->orig_pte))) {
> +	if (unlikely(!pte_same(old_pte, *vmf->pte))) {
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>  		goto out;
>  	}
> @@ -5320,35 +5342,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	if (!folio || folio_is_zone_device(folio))
>  		goto out_map;
>  
> -	/*
> -	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
> -	 * much anyway since they can be in shared cache state. This misses
> -	 * the case where a mapping is writable but the process never writes
> -	 * to it but pte_write gets cleared during protection updates and
> -	 * pte_dirty has unpredictable behaviour between PTE scan updates,
> -	 * background writeback, dirty balancing and application behaviour.
> -	 */
> -	if (!writable)
> -		flags |= TNF_NO_GROUP;
> -
> -	/*
> -	 * Flag if the folio is shared between multiple address spaces. This
> -	 * is later used when determining whether to group tasks together
> -	 */
> -	if (folio_likely_mapped_shared(folio) && (vma->vm_flags & VM_SHARED))
> -		flags |= TNF_SHARED;
> -
> -	nid = folio_nid(folio);
>  	nr_pages = folio_nr_pages(folio);
> -	/*
> -	 * For memory tiering mode, cpupid of slow memory page is used
> -	 * to record page access time.  So use default value.
> -	 */
> -	if (!folio_has_cpupid(folio))
> -		last_cpupid = (-1 & LAST_CPUPID_MASK);
> -	else
> -		last_cpupid = folio_last_cpupid(folio);
> -	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
> +
> +	target_nid = numa_migrate_check(folio, vmf, vmf->address, writable,
> +			&flags, &last_cpupid);
>  	if (target_nid == NUMA_NO_NODE)
>  		goto out_map;
>  	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
> @@ -5362,9 +5359,9 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  
>  	/* Migrate to the requested node */
>  	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
> -		nid = target_nid;
>  		flags |= TNF_MIGRATED;
>  	} else {
> +		target_nid = NUMA_NO_NODE;
>  		flags |= TNF_MIGRATE_FAIL;
>  		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>  					       vmf->address, &vmf->ptl);
> @@ -5378,8 +5375,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	}
>  
>  out:
> -	if (nid != NUMA_NO_NODE)
> -		task_numa_fault(last_cpupid, nid, nr_pages, flags);
> +	if (target_nid != NUMA_NO_NODE)
> +		task_numa_fault(last_cpupid, target_nid, nr_pages, flags);
>  	return 0;
>  out_map:
>  	/*

--
Best Regards,
Huang, Ying



* Re: [RFC PATCH 3/3] mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep)
  2024-07-18  8:36   ` Huang, Ying
@ 2024-07-18 14:40     ` Zi Yan
  2024-07-19 20:19     ` Zi Yan
  1 sibling, 0 replies; 18+ messages in thread
From: Zi Yan @ 2024-07-18 14:40 UTC (permalink / raw)
  To: Huang, Ying
  Cc: David Hildenbrand, linux-mm, Andrew Morton, Baolin Wang,
	linux-kernel


On 18 Jul 2024, at 4:36, Huang, Ying wrote:

> Zi Yan <zi.yan@sent.com> writes:
>
>> From: Zi Yan <ziy@nvidia.com>
>>
>> do_numa_page() and do_huge_pmd_numa_page() share a lot of common code. To
>> reduce redundancy, move common code to numa_migrate_prep() and rename
>> the function to numa_migrate_check() to reflect its functionality.
>>
>> There is some code difference between do_numa_page() and
>> do_huge_pmd_numa_page() before the code move:
>>
>> 1. do_huge_pmd_numa_page() did not check shared folios to set TNF_SHARED.
>> 2. do_huge_pmd_numa_page() did not check and skip zone device folios.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  mm/huge_memory.c | 28 ++++++-----------
>>  mm/internal.h    |  5 +--
>>  mm/memory.c      | 81 +++++++++++++++++++++++-------------------------
>>  3 files changed, 52 insertions(+), 62 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 8c11d6da4b36..66d67d13e0dc 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1670,10 +1670,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>  	pmd_t pmd;
>>  	struct folio *folio;
>>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>> -	int nid = NUMA_NO_NODE;
>> -	int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
>> +	int target_nid = NUMA_NO_NODE;
>> +	int last_cpupid = (-1 & LAST_CPUPID_MASK);
>>  	bool writable = false;
>> -	int flags = 0;
>> +	int flags = 0, nr_pages;
>>
>>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>  	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>> @@ -1693,21 +1693,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>  		writable = true;
>>
>>  	folio = vm_normal_folio_pmd(vma, haddr, pmd);
>> -	if (!folio)
>> +	if (!folio || folio_is_zone_device(folio))
>
> This change appears unrelated.  Can we put it in a separate patch?
>
> IIUC, this isn't necessary even in do_numa_page(), because
> folio_is_zone_device() has already been checked in change_pte_range().
> But it doesn't hurt either.

OK, will make this change in a separate patch.

>
>>  		goto out_map;
>>
>> -	/* See similar comment in do_numa_page for explanation */
>> -	if (!writable)
>> -		flags |= TNF_NO_GROUP;
>> +	nr_pages = folio_nr_pages(folio);
>>
>> -	nid = folio_nid(folio);
>> -	/*
>> -	 * For memory tiering mode, cpupid of slow memory page is used
>> -	 * to record page access time.  So use default value.
>> -	 */
>> -	if (folio_has_cpupid(folio))
>> -		last_cpupid = folio_last_cpupid(folio);
>> -	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>> +	target_nid = numa_migrate_check(folio, vmf, haddr, writable,
>> +			&flags, &last_cpupid);
>>  	if (target_nid == NUMA_NO_NODE)
>>  		goto out_map;
>>  	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>> @@ -1720,8 +1712,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>
>>  	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
>>  		flags |= TNF_MIGRATED;
>> -		nid = target_nid;
>>  	} else {
>> +		target_nid = NUMA_NO_NODE;
>>  		flags |= TNF_MIGRATE_FAIL;
>>  		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>  		if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>> @@ -1732,8 +1724,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>  	}
>>
>>  out:
>> -	if (nid != NUMA_NO_NODE)
>> -		task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
>> +	if (target_nid != NUMA_NO_NODE)
>> +		task_numa_fault(last_cpupid, target_nid, nr_pages, flags);
>
> This appears to be a behavior change.  IIUC, there are 2 possible issues.
>
> 1) if migrate_misplaced_folio() fails, folio_nid() should be used as
> the nid.  "target_nid" is a confusing variable name here, because
> folio_nid() is what is actually needed.

Right. Will fix it in my code.

>
> 2) if !pmd_same(), task_numa_fault() should be skipped.  The original
> code is buggy.
>
> Similar issues for do_numa_page().
>
> If my understanding is correct, we should implement a separate patch
> to fix 2) above.  And that may need to be backported.

Got it. Here is my plan:

1. Send the first two patches in this series separately, since they
are separate fixes. They can be picked up right now.
2. Send a separate patch to fix 2) above with cc: stable.
3. Clean up this patch and send it separately.

Thank you for the review. :)

>
>>
>>  	return 0;
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index b4d86436565b..7782b7bb3383 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -1217,8 +1217,9 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
>>
>>  void __vunmap_range_noflush(unsigned long start, unsigned long end);
>>
>> -int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
>> -		      unsigned long addr, int page_nid, int *flags);
>> +int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
>> +		      unsigned long addr, bool writable,
>> +		      int *flags, int *last_cpupid);
>>
>>  void free_zone_device_folio(struct folio *folio);
>>  int migrate_device_coherent_page(struct page *page);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 96c2f5b3d796..a252c0f13755 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5209,16 +5209,42 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
>>  	return ret;
>>  }
>>
>> -int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
>> -		      unsigned long addr, int page_nid, int *flags)
>> +int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
>> +		      unsigned long addr, bool writable,
>> +		      int *flags, int *last_cpupid)
>>  {
>>  	struct vm_area_struct *vma = vmf->vma;
>>
>> +	/*
>> +	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
>> +	 * much anyway since they can be in shared cache state. This misses
>> +	 * the case where a mapping is writable but the process never writes
>> +	 * to it but pte_write gets cleared during protection updates and
>> +	 * pte_dirty has unpredictable behaviour between PTE scan updates,
>> +	 * background writeback, dirty balancing and application behaviour.
>> +	 */
>> +	if (!writable)
>> +		*flags |= TNF_NO_GROUP;
>> +
>> +	/*
>> +	 * Flag if the folio is shared between multiple address spaces. This
>> +	 * is later used when determining whether to group tasks together
>> +	 */
>> +	if (folio_likely_mapped_shared(folio) && (vma->vm_flags & VM_SHARED))
>> +		*flags |= TNF_SHARED;
>> +
>> +	/*
>> +	 * For memory tiering mode, cpupid of slow memory page is used
>> +	 * to record page access time.
>> +	 */
>> +	if (folio_has_cpupid(folio))
>> +		*last_cpupid = folio_last_cpupid(folio);
>> +
>>  	/* Record the current PID acceesing VMA */
>>  	vma_set_access_pid_bit(vma);
>>
>>  	count_vm_numa_event(NUMA_HINT_FAULTS);
>> -	if (page_nid == numa_node_id()) {
>> +	if (folio_nid(folio) == numa_node_id()) {
>>  		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
>>  		*flags |= TNF_FAULT_LOCAL;
>>  	}
>> @@ -5284,12 +5310,11 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>  {
>>  	struct vm_area_struct *vma = vmf->vma;
>>  	struct folio *folio = NULL;
>> -	int nid = NUMA_NO_NODE;
>> +	int target_nid = NUMA_NO_NODE;
>>  	bool writable = false, ignore_writable = false;
>>  	bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
>> -	int last_cpupid;
>> -	int target_nid;
>> -	pte_t pte, old_pte;
>> +	int last_cpupid = (-1 & LAST_CPUPID_MASK);
>> +	pte_t pte, old_pte = vmf->orig_pte;
>>  	int flags = 0, nr_pages;
>>
>>  	/*
>> @@ -5297,10 +5322,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>  	 * table lock, that its contents have not changed during fault handling.
>>  	 */
>>  	spin_lock(vmf->ptl);
>> -	/* Read the live PTE from the page tables: */
>> -	old_pte = ptep_get(vmf->pte);
>> -
>> -	if (unlikely(!pte_same(old_pte, vmf->orig_pte))) {
>> +	if (unlikely(!pte_same(old_pte, *vmf->pte))) {
>>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  		goto out;
>>  	}
>> @@ -5320,35 +5342,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>  	if (!folio || folio_is_zone_device(folio))
>>  		goto out_map;
>>
>> -	/*
>> -	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
>> -	 * much anyway since they can be in shared cache state. This misses
>> -	 * the case where a mapping is writable but the process never writes
>> -	 * to it but pte_write gets cleared during protection updates and
>> -	 * pte_dirty has unpredictable behaviour between PTE scan updates,
>> -	 * background writeback, dirty balancing and application behaviour.
>> -	 */
>> -	if (!writable)
>> -		flags |= TNF_NO_GROUP;
>> -
>> -	/*
>> -	 * Flag if the folio is shared between multiple address spaces. This
>> -	 * is later used when determining whether to group tasks together
>> -	 */
>> -	if (folio_likely_mapped_shared(folio) && (vma->vm_flags & VM_SHARED))
>> -		flags |= TNF_SHARED;
>> -
>> -	nid = folio_nid(folio);
>>  	nr_pages = folio_nr_pages(folio);
>> -	/*
>> -	 * For memory tiering mode, cpupid of slow memory page is used
>> -	 * to record page access time.  So use default value.
>> -	 */
>> -	if (!folio_has_cpupid(folio))
>> -		last_cpupid = (-1 & LAST_CPUPID_MASK);
>> -	else
>> -		last_cpupid = folio_last_cpupid(folio);
>> -	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
>> +
>> +	target_nid = numa_migrate_check(folio, vmf, vmf->address, writable,
>> +			&flags, &last_cpupid);
>>  	if (target_nid == NUMA_NO_NODE)
>>  		goto out_map;
>>  	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>> @@ -5362,9 +5359,9 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>
>>  	/* Migrate to the requested node */
>>  	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
>> -		nid = target_nid;
>>  		flags |= TNF_MIGRATED;
>>  	} else {
>> +		target_nid = NUMA_NO_NODE;
>>  		flags |= TNF_MIGRATE_FAIL;
>>  		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>>  					       vmf->address, &vmf->ptl);
>> @@ -5378,8 +5375,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>  	}
>>
>>  out:
>> -	if (nid != NUMA_NO_NODE)
>> -		task_numa_fault(last_cpupid, nid, nr_pages, flags);
>> +	if (target_nid != NUMA_NO_NODE)
>> +		task_numa_fault(last_cpupid, target_nid, nr_pages, flags);
>>  	return 0;
>>  out_map:
>>  	/*
>
> --
> Best Regards,
> Huang, Ying


Best Regards,
Yan, Zi



* Re: [RFC PATCH 3/3] mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep)
  2024-07-18  8:36   ` Huang, Ying
  2024-07-18 14:40     ` Zi Yan
@ 2024-07-19 20:19     ` Zi Yan
  2024-07-22  1:47       ` Huang, Ying
  1 sibling, 1 reply; 18+ messages in thread
From: Zi Yan @ 2024-07-19 20:19 UTC (permalink / raw)
  To: Huang, Ying
  Cc: David Hildenbrand, linux-mm, Andrew Morton, Baolin Wang,
	linux-kernel


On 18 Jul 2024, at 4:36, Huang, Ying wrote:

> Zi Yan <zi.yan@sent.com> writes:
>
>> From: Zi Yan <ziy@nvidia.com>
>>
>> do_numa_page() and do_huge_pmd_numa_page() share a lot of common code. To
>> reduce redundancy, move common code to numa_migrate_prep() and rename
>> the function to numa_migrate_check() to reflect its functionality.
>>
>> There is some code difference between do_numa_page() and
>> do_huge_pmd_numa_page() before the code move:
>>
>> 1. do_huge_pmd_numa_page() did not check shared folios to set TNF_SHARED.
>> 2. do_huge_pmd_numa_page() did not check and skip zone device folios.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  mm/huge_memory.c | 28 ++++++-----------
>>  mm/internal.h    |  5 +--
>>  mm/memory.c      | 81 +++++++++++++++++++++++-------------------------
>>  3 files changed, 52 insertions(+), 62 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 8c11d6da4b36..66d67d13e0dc 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1670,10 +1670,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>  	pmd_t pmd;
>>  	struct folio *folio;
>>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>> -	int nid = NUMA_NO_NODE;
>> -	int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
>> +	int target_nid = NUMA_NO_NODE;
>> +	int last_cpupid = (-1 & LAST_CPUPID_MASK);
>>  	bool writable = false;
>> -	int flags = 0;
>> +	int flags = 0, nr_pages;
>>
>>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>  	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>> @@ -1693,21 +1693,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>  		writable = true;
>>
>>  	folio = vm_normal_folio_pmd(vma, haddr, pmd);
>> -	if (!folio)
>> +	if (!folio || folio_is_zone_device(folio))
>
> This change appears unrelated.  Can we put it in a separate patch?
>
> IIUC, this isn't necessary even in do_numa_page(), because
> folio_is_zone_device() has already been checked in change_pte_range().
> But it doesn't hurt either.
>
>>  		goto out_map;
>>
>> -	/* See similar comment in do_numa_page for explanation */
>> -	if (!writable)
>> -		flags |= TNF_NO_GROUP;
>> +	nr_pages = folio_nr_pages(folio);
>>
>> -	nid = folio_nid(folio);
>> -	/*
>> -	 * For memory tiering mode, cpupid of slow memory page is used
>> -	 * to record page access time.  So use default value.
>> -	 */
>> -	if (folio_has_cpupid(folio))
>> -		last_cpupid = folio_last_cpupid(folio);
>> -	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>> +	target_nid = numa_migrate_check(folio, vmf, haddr, writable,
>> +			&flags, &last_cpupid);
>>  	if (target_nid == NUMA_NO_NODE)
>>  		goto out_map;
>>  	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>> @@ -1720,8 +1712,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>
>>  	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
>>  		flags |= TNF_MIGRATED;
>> -		nid = target_nid;
>>  	} else {
>> +		target_nid = NUMA_NO_NODE;
>>  		flags |= TNF_MIGRATE_FAIL;
>>  		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>  		if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>> @@ -1732,8 +1724,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>  	}
>>
>>  out:
>> -	if (nid != NUMA_NO_NODE)
>> -		task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
>> +	if (target_nid != NUMA_NO_NODE)
>> +		task_numa_fault(last_cpupid, target_nid, nr_pages, flags);
>
> This appears to be a behavior change.  IIUC, there are 2 possible issues.
>
> 1) if migrate_misplaced_folio() fails, folio_nid() should be used as
> the nid.  "target_nid" is a confusing variable name here, because
> folio_nid() is what is actually needed.
>
> 2) if !pmd_same(), task_numa_fault() should be skipped.  The original
> code is buggy.
>
> Similar issues for do_numa_page().
>
> If my understanding is correct, we should implement a separate patch
> to fix 2) above.  And that may need to be backported.

Hmm, the original code seems OK after I checked the implementation.
There are two possible !pte_same()/!pmd_same() locations:
1) at the beginning of do_numa_page() and do_huge_pmd_numa_page(): if the
faulted PTE/PMD changed before the folio can be checked, task_numa_fault()
should not be called.

2) when migrate_misplaced_folio() failed and the PTE/PMD changed, but the
folio has already been determined and checked: task_numa_fault() should be
called even if !pte_same()/!pmd_same().

Let me know if I get this wrong. Thanks.
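
As a toy model of the distinction between the two windows (a userspace
sketch, not kernel code; the enum names are made up for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Where the !pte_same()/!pmd_same() recheck fires (made-up names). */
enum recheck_site {
	BEFORE_FOLIO_CHECK,	/* entry of do_numa_page()/do_huge_pmd_numa_page() */
	AFTER_MIGRATE_FAIL	/* remap path after migrate_misplaced_folio() failed */
};

/*
 * 1) If the PTE/PMD changed before the folio was even looked at,
 *    nothing was sampled, so task_numa_fault() should be skipped.
 * 2) If it changed only after a failed migration attempt, the hint
 *    fault was real, so task_numa_fault() should still be called.
 */
static bool should_call_task_numa_fault(enum recheck_site site)
{
	return site == AFTER_MIGRATE_FAIL;
}
```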


Best Regards,
Yan, Zi



* Re: [RFC PATCH 3/3] mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep)
  2024-07-19 20:19     ` Zi Yan
@ 2024-07-22  1:47       ` Huang, Ying
  2024-07-22 14:01         ` Zi Yan
  0 siblings, 1 reply; 18+ messages in thread
From: Huang, Ying @ 2024-07-22  1:47 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, Andrew Morton, Baolin Wang,
	linux-kernel

Zi Yan <ziy@nvidia.com> writes:

> On 18 Jul 2024, at 4:36, Huang, Ying wrote:
>
>> Zi Yan <zi.yan@sent.com> writes:
>>
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> do_numa_page() and do_huge_pmd_numa_page() share a lot of common code. To
>>> reduce redundancy, move common code to numa_migrate_prep() and rename
>>> the function to numa_migrate_check() to reflect its functionality.
>>>
>>> There is some code difference between do_numa_page() and
>>> do_huge_pmd_numa_page() before the code move:
>>>
>>> 1. do_huge_pmd_numa_page() did not check shared folios to set TNF_SHARED.
>>> 2. do_huge_pmd_numa_page() did not check and skip zone device folios.
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>>  mm/huge_memory.c | 28 ++++++-----------
>>>  mm/internal.h    |  5 +--
>>>  mm/memory.c      | 81 +++++++++++++++++++++++-------------------------
>>>  3 files changed, 52 insertions(+), 62 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 8c11d6da4b36..66d67d13e0dc 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1670,10 +1670,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>  	pmd_t pmd;
>>>  	struct folio *folio;
>>>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>>> -	int nid = NUMA_NO_NODE;
>>> -	int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
>>> +	int target_nid = NUMA_NO_NODE;
>>> +	int last_cpupid = (-1 & LAST_CPUPID_MASK);
>>>  	bool writable = false;
>>> -	int flags = 0;
>>> +	int flags = 0, nr_pages;
>>>
>>>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>>  	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>>> @@ -1693,21 +1693,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>  		writable = true;
>>>
>>>  	folio = vm_normal_folio_pmd(vma, haddr, pmd);
>>> -	if (!folio)
>>> +	if (!folio || folio_is_zone_device(folio))
>>
>> This change appears unrelated.  Can we put it in a separate patch?
>>
>> IIUC, this isn't necessary even in do_numa_page(), because
>> folio_is_zone_device() has already been checked in change_pte_range().
>> But it doesn't hurt either.
>>
>>>  		goto out_map;
>>>
>>> -	/* See similar comment in do_numa_page for explanation */
>>> -	if (!writable)
>>> -		flags |= TNF_NO_GROUP;
>>> +	nr_pages = folio_nr_pages(folio);
>>>
>>> -	nid = folio_nid(folio);
>>> -	/*
>>> -	 * For memory tiering mode, cpupid of slow memory page is used
>>> -	 * to record page access time.  So use default value.
>>> -	 */
>>> -	if (folio_has_cpupid(folio))
>>> -		last_cpupid = folio_last_cpupid(folio);
>>> -	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>>> +	target_nid = numa_migrate_check(folio, vmf, haddr, writable,
>>> +			&flags, &last_cpupid);
>>>  	if (target_nid == NUMA_NO_NODE)
>>>  		goto out_map;
>>>  	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>>> @@ -1720,8 +1712,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>
>>>  	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
>>>  		flags |= TNF_MIGRATED;
>>> -		nid = target_nid;
>>>  	} else {
>>> +		target_nid = NUMA_NO_NODE;
>>>  		flags |= TNF_MIGRATE_FAIL;
>>>  		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>>  		if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>>> @@ -1732,8 +1724,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>  	}
>>>
>>>  out:
>>> -	if (nid != NUMA_NO_NODE)
>>> -		task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
>>> +	if (target_nid != NUMA_NO_NODE)
>>> +		task_numa_fault(last_cpupid, target_nid, nr_pages, flags);
>>
>> This appears to be a behavior change.  IIUC, there are 2 possible issues.
>>
>> 1) if migrate_misplaced_folio() fails, folio_nid() should be used as
>> nid.  "target_nid" as variable name here is confusing, because
>> folio_nid() is needed in fact.
>>
>> 2) if !pmd_same(), task_numa_fault() should be skipped.  The original
>> code is buggy.
>>
>> Similar issues for do_numa_page().
>>
>> If my understanding were correct, we should implement a separate patch
>> to fix 2) above.  And that may need to be backported.
>
> Hmm, the original code seems OK after I checked the implementation.
> There are two possible !pte_same()/!pmd_same() locations:
> 1) at the beginning of do_numa_page() and do_huge_pmd_numa_page(), when the
> faulted PTE/PMD changed before the folio could be checked: task_numa_fault()
> should not be called.

Yes.

> 2) when migrate_misplaced_folio() failed and the PTE/PMD changed, but the
> folio had already been determined and checked: task_numa_fault() should be
> called even if !pte_same()/!pmd_same().

IIUC, if !pte_same()/!pmd_same(), the fault has been processed on
another CPU.  For example, do_numa_page()/do_huge_pmd_numa_page() has
been called on another CPU and task_numa_fault() has been called for the
PTE/PMD already.

> Let me know if I get this wrong. Thanks.
>

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/3] mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep)
  2024-07-22  1:47       ` Huang, Ying
@ 2024-07-22 14:01         ` Zi Yan
  2024-07-22 15:21           ` Zi Yan
  0 siblings, 1 reply; 18+ messages in thread
From: Zi Yan @ 2024-07-22 14:01 UTC (permalink / raw)
  To: Huang, Ying
  Cc: David Hildenbrand, linux-mm, Andrew Morton, Baolin Wang,
	linux-kernel, Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 5480 bytes --]

On 21 Jul 2024, at 21:47, Huang, Ying wrote:

> Zi Yan <ziy@nvidia.com> writes:
>
>> On 18 Jul 2024, at 4:36, Huang, Ying wrote:
>>
>>> Zi Yan <zi.yan@sent.com> writes:
>>>
>>>> From: Zi Yan <ziy@nvidia.com>
>>>>
>>>> do_numa_page() and do_huge_pmd_numa_page() share a lot of common code. To
>>>> reduce redundancy, move common code to numa_migrate_prep() and rename
>>>> the function to numa_migrate_check() to reflect its functionality.
>>>>
>>>> There is some code difference between do_numa_page() and
>>>> do_huge_pmd_numa_page() before the code move:
>>>>
>>>> 1. do_huge_pmd_numa_page() did not check shared folios to set TNF_SHARED.
>>>> 2. do_huge_pmd_numa_page() did not check and skip zone device folios.
>>>>
>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>> ---
>>>>  mm/huge_memory.c | 28 ++++++-----------
>>>>  mm/internal.h    |  5 +--
>>>>  mm/memory.c      | 81 +++++++++++++++++++++++-------------------------
>>>>  3 files changed, 52 insertions(+), 62 deletions(-)
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 8c11d6da4b36..66d67d13e0dc 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -1670,10 +1670,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>  	pmd_t pmd;
>>>>  	struct folio *folio;
>>>>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>>>> -	int nid = NUMA_NO_NODE;
>>>> -	int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
>>>> +	int target_nid = NUMA_NO_NODE;
>>>> +	int last_cpupid = (-1 & LAST_CPUPID_MASK);
>>>>  	bool writable = false;
>>>> -	int flags = 0;
>>>> +	int flags = 0, nr_pages;
>>>>
>>>>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>>>  	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>>>> @@ -1693,21 +1693,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>  		writable = true;
>>>>
>>>>  	folio = vm_normal_folio_pmd(vma, haddr, pmd);
>>>> -	if (!folio)
>>>> +	if (!folio || folio_is_zone_device(folio))
>>>
>>> This change appears unrelated.  Can we put it in a separate patch?
>>>
>>> IIUC, this isn't necessary even in do_numa_page()?  Because in
>>> change_pte_range(), folio_is_zone_device() has been checked already.
>>> But It doesn't hurt too.
>>>
>>>>  		goto out_map;
>>>>
>>>> -	/* See similar comment in do_numa_page for explanation */
>>>> -	if (!writable)
>>>> -		flags |= TNF_NO_GROUP;
>>>> +	nr_pages = folio_nr_pages(folio);
>>>>
>>>> -	nid = folio_nid(folio);
>>>> -	/*
>>>> -	 * For memory tiering mode, cpupid of slow memory page is used
>>>> -	 * to record page access time.  So use default value.
>>>> -	 */
>>>> -	if (folio_has_cpupid(folio))
>>>> -		last_cpupid = folio_last_cpupid(folio);
>>>> -	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>>>> +	target_nid = numa_migrate_check(folio, vmf, haddr, writable,
>>>> +			&flags, &last_cpupid);
>>>>  	if (target_nid == NUMA_NO_NODE)
>>>>  		goto out_map;
>>>>  	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>>>> @@ -1720,8 +1712,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>
>>>>  	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
>>>>  		flags |= TNF_MIGRATED;
>>>> -		nid = target_nid;
>>>>  	} else {
>>>> +		target_nid = NUMA_NO_NODE;
>>>>  		flags |= TNF_MIGRATE_FAIL;
>>>>  		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>>>  		if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>>>> @@ -1732,8 +1724,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>  	}
>>>>
>>>>  out:
>>>> -	if (nid != NUMA_NO_NODE)
>>>> -		task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
>>>> +	if (target_nid != NUMA_NO_NODE)
>>>> +		task_numa_fault(last_cpupid, target_nid, nr_pages, flags);
>>>
>>> This appears a behavior change.  IIUC, there are 2 possible issues.
>>>
>>> 1) if migrate_misplaced_folio() fails, folio_nid() should be used as
>>> nid.  "target_nid" as variable name here is confusing, because
>>> folio_nid() is needed in fact.
>>>
>>> 2) if !pmd_same(), task_numa_fault() should be skipped.  The original
>>> code is buggy.
>>>
>>> Similar issues for do_numa_page().
>>>
>>> If my understanding were correct, we should implement a separate patch
>>> to fix 2) above.  And that may need to be backported.
>>
>> Hmm, the original code seems OK after I checked the implementation.
>> There are two possible !pte_same()/!pmd_same() locations:
>> 1) at the beginning of do_numa_page() and do_huge_pmd_numa_page() and the faulted
>> PTE/PMD changed before the folio can be checked, task_numa_fault() should not be
>> called.
>
> Yes.
>
>> 2) when migrate_misplaced_folio() failed and the PTE/PMD changed, but the folio
>> has been determined and checked. task_numa_fault() should be called even if
>> !pte_same()/!pmd_same(),
>
> IIUC, if !pte_same()/!pmd_same(), the fault has been processed on
> another CPU.  For example, do_numa_page()/do_huge_pmd_numa_page() has
> been called on another CPU and task_numa_fault() has been called for the
> PTE/PMD already.

Hmm, this behavior at least dates back to 2015 at
commit 074c238177a7 ("mm: numa: slow PTE scan rate if migration failures occur").
So cc Mel.

The code is https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/memory.c?id=074c238177a75f5e79af3b2cb6a84e54823ef950#n3102. I have not checked older
commits.

I wonder how far we should trace back.


Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/3] mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep)
  2024-07-22 14:01         ` Zi Yan
@ 2024-07-22 15:21           ` Zi Yan
  2024-07-23  1:16             ` Huang, Ying
  0 siblings, 1 reply; 18+ messages in thread
From: Zi Yan @ 2024-07-22 15:21 UTC (permalink / raw)
  To: Huang, Ying
  Cc: David Hildenbrand, linux-mm, Andrew Morton, Baolin Wang,
	linux-kernel, Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 6139 bytes --]

On 22 Jul 2024, at 10:01, Zi Yan wrote:

> On 21 Jul 2024, at 21:47, Huang, Ying wrote:
>
>> Zi Yan <ziy@nvidia.com> writes:
>>
>>> On 18 Jul 2024, at 4:36, Huang, Ying wrote:
>>>
>>>> Zi Yan <zi.yan@sent.com> writes:
>>>>
>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>
>>>>> do_numa_page() and do_huge_pmd_numa_page() share a lot of common code. To
>>>>> reduce redundancy, move common code to numa_migrate_prep() and rename
>>>>> the function to numa_migrate_check() to reflect its functionality.
>>>>>
>>>>> There is some code difference between do_numa_page() and
>>>>> do_huge_pmd_numa_page() before the code move:
>>>>>
>>>>> 1. do_huge_pmd_numa_page() did not check shared folios to set TNF_SHARED.
>>>>> 2. do_huge_pmd_numa_page() did not check and skip zone device folios.
>>>>>
>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>> ---
>>>>>  mm/huge_memory.c | 28 ++++++-----------
>>>>>  mm/internal.h    |  5 +--
>>>>>  mm/memory.c      | 81 +++++++++++++++++++++++-------------------------
>>>>>  3 files changed, 52 insertions(+), 62 deletions(-)
>>>>>
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index 8c11d6da4b36..66d67d13e0dc 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -1670,10 +1670,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>  	pmd_t pmd;
>>>>>  	struct folio *folio;
>>>>>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>>>>> -	int nid = NUMA_NO_NODE;
>>>>> -	int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
>>>>> +	int target_nid = NUMA_NO_NODE;
>>>>> +	int last_cpupid = (-1 & LAST_CPUPID_MASK);
>>>>>  	bool writable = false;
>>>>> -	int flags = 0;
>>>>> +	int flags = 0, nr_pages;
>>>>>
>>>>>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>>>>  	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>>>>> @@ -1693,21 +1693,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>  		writable = true;
>>>>>
>>>>>  	folio = vm_normal_folio_pmd(vma, haddr, pmd);
>>>>> -	if (!folio)
>>>>> +	if (!folio || folio_is_zone_device(folio))
>>>>
>>>> This change appears unrelated.  Can we put it in a separate patch?
>>>>
>>>> IIUC, this isn't necessary even in do_numa_page()?  Because in
>>>> change_pte_range(), folio_is_zone_device() has been checked already.
>>>> But It doesn't hurt too.
>>>>
>>>>>  		goto out_map;
>>>>>
>>>>> -	/* See similar comment in do_numa_page for explanation */
>>>>> -	if (!writable)
>>>>> -		flags |= TNF_NO_GROUP;
>>>>> +	nr_pages = folio_nr_pages(folio);
>>>>>
>>>>> -	nid = folio_nid(folio);
>>>>> -	/*
>>>>> -	 * For memory tiering mode, cpupid of slow memory page is used
>>>>> -	 * to record page access time.  So use default value.
>>>>> -	 */
>>>>> -	if (folio_has_cpupid(folio))
>>>>> -		last_cpupid = folio_last_cpupid(folio);
>>>>> -	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>>>>> +	target_nid = numa_migrate_check(folio, vmf, haddr, writable,
>>>>> +			&flags, &last_cpupid);
>>>>>  	if (target_nid == NUMA_NO_NODE)
>>>>>  		goto out_map;
>>>>>  	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>>>>> @@ -1720,8 +1712,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>
>>>>>  	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
>>>>>  		flags |= TNF_MIGRATED;
>>>>> -		nid = target_nid;
>>>>>  	} else {
>>>>> +		target_nid = NUMA_NO_NODE;
>>>>>  		flags |= TNF_MIGRATE_FAIL;
>>>>>  		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>>>>  		if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>>>>> @@ -1732,8 +1724,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>  	}
>>>>>
>>>>>  out:
>>>>> -	if (nid != NUMA_NO_NODE)
>>>>> -		task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
>>>>> +	if (target_nid != NUMA_NO_NODE)
>>>>> +		task_numa_fault(last_cpupid, target_nid, nr_pages, flags);
>>>>
>>>> This appears a behavior change.  IIUC, there are 2 possible issues.
>>>>
>>>> 1) if migrate_misplaced_folio() fails, folio_nid() should be used as
>>>> nid.  "target_nid" as variable name here is confusing, because
>>>> folio_nid() is needed in fact.
>>>>
>>>> 2) if !pmd_same(), task_numa_fault() should be skipped.  The original
>>>> code is buggy.
>>>>
>>>> Similar issues for do_numa_page().
>>>>
>>>> If my understanding were correct, we should implement a separate patch
>>>> to fix 2) above.  And that may need to be backported.
>>>
>>> Hmm, the original code seems OK after I checked the implementation.
>>> There are two possible !pte_same()/!pmd_same() locations:
>>> 1) at the beginning of do_numa_page() and do_huge_pmd_numa_page() and the faulted
>>> PTE/PMD changed before the folio can be checked, task_numa_fault() should not be
>>> called.
>>
>> Yes.
>>
>>> 2) when migrate_misplaced_folio() failed and the PTE/PMD changed, but the folio
>>> has been determined and checked. task_numa_fault() should be called even if
>>> !pte_same()/!pmd_same(),
>>
>> IIUC, if !pte_same()/!pmd_same(), the fault has been processed on
>> another CPU.  For example, do_numa_page()/do_huge_pmd_numa_page() has
>> been called on another CPU and task_numa_fault() has been called for the
>> PTE/PMD already.
>
> Hmm, this behavior at least dates back to 2015 at
> commit 074c238177a7 ("mm: numa: slow PTE scan rate if migration failures occur”).
> So cc Mel.
>
> The code is https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/memory.c?id=074c238177a75f5e79af3b2cb6a84e54823ef950#n3102. I have not checked older
> commits.
>
> I wonder how far we should trace back.

OK, I found the commit where the task_numa_fault() policy settled:
8191acbd30c7 ("mm: numa: Sanitize task_numa_fault() callsites").

It says:
"So modify all three sites to always account; we did after all receive
the fault; and always account to where the page is after migration,
regardless of success.", where the three call sites were:
do_huge_pmd_numa_page(), do_numa_page(), and do_pmd_numa_page().

The current code still follows what that commit log describes.


Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/3] mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep)
  2024-07-22 15:21           ` Zi Yan
@ 2024-07-23  1:16             ` Huang, Ying
  2024-07-23  1:43               ` Zi Yan
  0 siblings, 1 reply; 18+ messages in thread
From: Huang, Ying @ 2024-07-23  1:16 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, Andrew Morton, Baolin Wang,
	linux-kernel, Mel Gorman

Zi Yan <ziy@nvidia.com> writes:

> On 22 Jul 2024, at 10:01, Zi Yan wrote:
>
>> On 21 Jul 2024, at 21:47, Huang, Ying wrote:
>>
>>> Zi Yan <ziy@nvidia.com> writes:
>>>
>>>> On 18 Jul 2024, at 4:36, Huang, Ying wrote:
>>>>
>>>>> Zi Yan <zi.yan@sent.com> writes:
>>>>>
>>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>>
>>>>>> do_numa_page() and do_huge_pmd_numa_page() share a lot of common code. To
>>>>>> reduce redundancy, move common code to numa_migrate_prep() and rename
>>>>>> the function to numa_migrate_check() to reflect its functionality.
>>>>>>
>>>>>> There is some code difference between do_numa_page() and
>>>>>> do_huge_pmd_numa_page() before the code move:
>>>>>>
>>>>>> 1. do_huge_pmd_numa_page() did not check shared folios to set TNF_SHARED.
>>>>>> 2. do_huge_pmd_numa_page() did not check and skip zone device folios.
>>>>>>
>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>>> ---
>>>>>>  mm/huge_memory.c | 28 ++++++-----------
>>>>>>  mm/internal.h    |  5 +--
>>>>>>  mm/memory.c      | 81 +++++++++++++++++++++++-------------------------
>>>>>>  3 files changed, 52 insertions(+), 62 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index 8c11d6da4b36..66d67d13e0dc 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -1670,10 +1670,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>>  	pmd_t pmd;
>>>>>>  	struct folio *folio;
>>>>>>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>>>>>> -	int nid = NUMA_NO_NODE;
>>>>>> -	int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
>>>>>> +	int target_nid = NUMA_NO_NODE;
>>>>>> +	int last_cpupid = (-1 & LAST_CPUPID_MASK);
>>>>>>  	bool writable = false;
>>>>>> -	int flags = 0;
>>>>>> +	int flags = 0, nr_pages;
>>>>>>
>>>>>>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>>>>>  	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>>>>>> @@ -1693,21 +1693,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>>  		writable = true;
>>>>>>
>>>>>>  	folio = vm_normal_folio_pmd(vma, haddr, pmd);
>>>>>> -	if (!folio)
>>>>>> +	if (!folio || folio_is_zone_device(folio))
>>>>>
>>>>> This change appears unrelated.  Can we put it in a separate patch?
>>>>>
>>>>> IIUC, this isn't necessary even in do_numa_page()?  Because in
>>>>> change_pte_range(), folio_is_zone_device() has been checked already.
>>>>> But It doesn't hurt too.
>>>>>
>>>>>>  		goto out_map;
>>>>>>
>>>>>> -	/* See similar comment in do_numa_page for explanation */
>>>>>> -	if (!writable)
>>>>>> -		flags |= TNF_NO_GROUP;
>>>>>> +	nr_pages = folio_nr_pages(folio);
>>>>>>
>>>>>> -	nid = folio_nid(folio);
>>>>>> -	/*
>>>>>> -	 * For memory tiering mode, cpupid of slow memory page is used
>>>>>> -	 * to record page access time.  So use default value.
>>>>>> -	 */
>>>>>> -	if (folio_has_cpupid(folio))
>>>>>> -		last_cpupid = folio_last_cpupid(folio);
>>>>>> -	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>>>>>> +	target_nid = numa_migrate_check(folio, vmf, haddr, writable,
>>>>>> +			&flags, &last_cpupid);
>>>>>>  	if (target_nid == NUMA_NO_NODE)
>>>>>>  		goto out_map;
>>>>>>  	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>>>>>> @@ -1720,8 +1712,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>>
>>>>>>  	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
>>>>>>  		flags |= TNF_MIGRATED;
>>>>>> -		nid = target_nid;
>>>>>>  	} else {
>>>>>> +		target_nid = NUMA_NO_NODE;
>>>>>>  		flags |= TNF_MIGRATE_FAIL;
>>>>>>  		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>>>>>  		if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>>>>>> @@ -1732,8 +1724,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>>  	}
>>>>>>
>>>>>>  out:
>>>>>> -	if (nid != NUMA_NO_NODE)
>>>>>> -		task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
>>>>>> +	if (target_nid != NUMA_NO_NODE)
>>>>>> +		task_numa_fault(last_cpupid, target_nid, nr_pages, flags);
>>>>>
>>>>> This appears a behavior change.  IIUC, there are 2 possible issues.
>>>>>
>>>>> 1) if migrate_misplaced_folio() fails, folio_nid() should be used as
>>>>> nid.  "target_nid" as variable name here is confusing, because
>>>>> folio_nid() is needed in fact.
>>>>>
>>>>> 2) if !pmd_same(), task_numa_fault() should be skipped.  The original
>>>>> code is buggy.
>>>>>
>>>>> Similar issues for do_numa_page().
>>>>>
>>>>> If my understanding were correct, we should implement a separate patch
>>>>> to fix 2) above.  And that may need to be backported.
>>>>
>>>> Hmm, the original code seems OK after I checked the implementation.
>>>> There are two possible !pte_same()/!pmd_same() locations:
>>>> 1) at the beginning of do_numa_page() and do_huge_pmd_numa_page() and the faulted
>>>> PTE/PMD changed before the folio can be checked, task_numa_fault() should not be
>>>> called.
>>>
>>> Yes.
>>>
>>>> 2) when migrate_misplaced_folio() failed and the PTE/PMD changed, but the folio
>>>> has been determined and checked. task_numa_fault() should be called even if
>>>> !pte_same()/!pmd_same(),
>>>
>>> IIUC, if !pte_same()/!pmd_same(), the fault has been processed on
>>> another CPU.  For example, do_numa_page()/do_huge_pmd_numa_page() has
>>> been called on another CPU and task_numa_fault() has been called for the
>>> PTE/PMD already.
>>
>> Hmm, this behavior at least dates back to 2015 at
>> commit 074c238177a7 ("mm: numa: slow PTE scan rate if migration failures occur”).
>> So cc Mel.
>>
>> The code is https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/memory.c?id=074c238177a75f5e79af3b2cb6a84e54823ef950#n3102. I have not checked older
>> commits.
>>
>> I wonder how far we should trace back.
>
> OK, I find the commit where task_numa_fault policy settled:
> 8191acbd30c7 ("mm: numa: Sanitize task_numa_fault() callsites”).
>
> It says:
> “So modify all three sites to always account; we did after all receive
> the fault; and always account to where the page is after migration,
> regardless of success.“, where the three call sites were:
> do_huge_pmd_numa_page(), do_numa_page(), and do_pmd_numa_page().
>
> The current code still follows what the commit log does.

Per my understanding, the issue was introduced in commit b99a342d4f11
("NUMA balancing: reduce TLB flush via delaying mapping on hint page
fault").  Before that, the PTE was restored before migration, so
task_numa_fault() should be called on migration failure too.  After
that commit, the PTE is restored only after migration fails; if the PTE
has been changed by someone else by then, that someone should have
called task_numa_fault() if necessary, and we shouldn't call it again.

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/3] mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep)
  2024-07-23  1:16             ` Huang, Ying
@ 2024-07-23  1:43               ` Zi Yan
  0 siblings, 0 replies; 18+ messages in thread
From: Zi Yan @ 2024-07-23  1:43 UTC (permalink / raw)
  To: Huang, Ying
  Cc: David Hildenbrand, linux-mm, Andrew Morton, Baolin Wang,
	linux-kernel, Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 7246 bytes --]

On Mon Jul 22, 2024 at 9:16 PM EDT, Huang, Ying wrote:
> Zi Yan <ziy@nvidia.com> writes:
>
> > On 22 Jul 2024, at 10:01, Zi Yan wrote:
> >
> >> On 21 Jul 2024, at 21:47, Huang, Ying wrote:
> >>
> >>> Zi Yan <ziy@nvidia.com> writes:
> >>>
> >>>> On 18 Jul 2024, at 4:36, Huang, Ying wrote:
> >>>>
> >>>>> Zi Yan <zi.yan@sent.com> writes:
> >>>>>
> >>>>>> From: Zi Yan <ziy@nvidia.com>
> >>>>>>
> >>>>>> do_numa_page() and do_huge_pmd_numa_page() share a lot of common code. To
> >>>>>> reduce redundancy, move common code to numa_migrate_prep() and rename
> >>>>>> the function to numa_migrate_check() to reflect its functionality.
> >>>>>>
> >>>>>> There is some code difference between do_numa_page() and
> >>>>>> do_huge_pmd_numa_page() before the code move:
> >>>>>>
> >>>>>> 1. do_huge_pmd_numa_page() did not check shared folios to set TNF_SHARED.
> >>>>>> 2. do_huge_pmd_numa_page() did not check and skip zone device folios.
> >>>>>>
> >>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >>>>>> ---
> >>>>>>  mm/huge_memory.c | 28 ++++++-----------
> >>>>>>  mm/internal.h    |  5 +--
> >>>>>>  mm/memory.c      | 81 +++++++++++++++++++++++-------------------------
> >>>>>>  3 files changed, 52 insertions(+), 62 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>>>>> index 8c11d6da4b36..66d67d13e0dc 100644
> >>>>>> --- a/mm/huge_memory.c
> >>>>>> +++ b/mm/huge_memory.c
> >>>>>> @@ -1670,10 +1670,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> >>>>>>  	pmd_t pmd;
> >>>>>>  	struct folio *folio;
> >>>>>>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> >>>>>> -	int nid = NUMA_NO_NODE;
> >>>>>> -	int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
> >>>>>> +	int target_nid = NUMA_NO_NODE;
> >>>>>> +	int last_cpupid = (-1 & LAST_CPUPID_MASK);
> >>>>>>  	bool writable = false;
> >>>>>> -	int flags = 0;
> >>>>>> +	int flags = 0, nr_pages;
> >>>>>>
> >>>>>>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> >>>>>>  	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
> >>>>>> @@ -1693,21 +1693,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> >>>>>>  		writable = true;
> >>>>>>
> >>>>>>  	folio = vm_normal_folio_pmd(vma, haddr, pmd);
> >>>>>> -	if (!folio)
> >>>>>> +	if (!folio || folio_is_zone_device(folio))
> >>>>>
> >>>>> This change appears unrelated.  Can we put it in a separate patch?
> >>>>>
> >>>>> IIUC, this isn't necessary even in do_numa_page()?  Because in
> >>>>> change_pte_range(), folio_is_zone_device() has been checked already.
> >>>>> But It doesn't hurt too.
> >>>>>
> >>>>>>  		goto out_map;
> >>>>>>
> >>>>>> -	/* See similar comment in do_numa_page for explanation */
> >>>>>> -	if (!writable)
> >>>>>> -		flags |= TNF_NO_GROUP;
> >>>>>> +	nr_pages = folio_nr_pages(folio);
> >>>>>>
> >>>>>> -	nid = folio_nid(folio);
> >>>>>> -	/*
> >>>>>> -	 * For memory tiering mode, cpupid of slow memory page is used
> >>>>>> -	 * to record page access time.  So use default value.
> >>>>>> -	 */
> >>>>>> -	if (folio_has_cpupid(folio))
> >>>>>> -		last_cpupid = folio_last_cpupid(folio);
> >>>>>> -	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
> >>>>>> +	target_nid = numa_migrate_check(folio, vmf, haddr, writable,
> >>>>>> +			&flags, &last_cpupid);
> >>>>>>  	if (target_nid == NUMA_NO_NODE)
> >>>>>>  		goto out_map;
> >>>>>>  	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
> >>>>>> @@ -1720,8 +1712,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> >>>>>>
> >>>>>>  	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
> >>>>>>  		flags |= TNF_MIGRATED;
> >>>>>> -		nid = target_nid;
> >>>>>>  	} else {
> >>>>>> +		target_nid = NUMA_NO_NODE;
> >>>>>>  		flags |= TNF_MIGRATE_FAIL;
> >>>>>>  		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> >>>>>>  		if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
> >>>>>> @@ -1732,8 +1724,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> >>>>>>  	}
> >>>>>>
> >>>>>>  out:
> >>>>>> -	if (nid != NUMA_NO_NODE)
> >>>>>> -		task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
> >>>>>> +	if (target_nid != NUMA_NO_NODE)
> >>>>>> +		task_numa_fault(last_cpupid, target_nid, nr_pages, flags);
> >>>>>
> >>>>> This appears a behavior change.  IIUC, there are 2 possible issues.
> >>>>>
> >>>>> 1) if migrate_misplaced_folio() fails, folio_nid() should be used as
> >>>>> nid.  "target_nid" as variable name here is confusing, because
> >>>>> folio_nid() is needed in fact.
> >>>>>
> >>>>> 2) if !pmd_same(), task_numa_fault() should be skipped.  The original
> >>>>> code is buggy.
> >>>>>
> >>>>> Similar issues for do_numa_page().
> >>>>>
> >>>>> If my understanding were correct, we should implement a separate patch
> >>>>> to fix 2) above.  And that may need to be backported.
> >>>>
> >>>> Hmm, the original code seems OK after I checked the implementation.
> >>>> There are two possible !pte_same()/!pmd_same() locations:
> >>>> 1) at the beginning of do_numa_page() and do_huge_pmd_numa_page() and the faulted
> >>>> PTE/PMD changed before the folio can be checked, task_numa_fault() should not be
> >>>> called.
> >>>
> >>> Yes.
> >>>
> >>>> 2) when migrate_misplaced_folio() failed and the PTE/PMD changed, but the folio
> >>>> has been determined and checked. task_numa_fault() should be called even if
> >>>> !pte_same()/!pmd_same(),
> >>>
> >>> IIUC, if !pte_same()/!pmd_same(), the fault has been processed on
> >>> another CPU.  For example, do_numa_page()/do_huge_pmd_numa_page() has
> >>> been called on another CPU and task_numa_fault() has been called for the
> >>> PTE/PMD already.
> >>
> >> Hmm, this behavior at least dates back to 2015 at
> >> commit 074c238177a7 ("mm: numa: slow PTE scan rate if migration failures occur”).
> >> So cc Mel.
> >>
> >> The code is https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/memory.c?id=074c238177a75f5e79af3b2cb6a84e54823ef950#n3102. I have not checked older
> >> commits.
> >>
> >> I wonder how far we should trace back.
> >
> > OK, I find the commit where task_numa_fault policy settled:
> > 8191acbd30c7 ("mm: numa: Sanitize task_numa_fault() callsites”).
> >
> > It says:
> > “So modify all three sites to always account; we did after all receive
> > the fault; and always account to where the page is after migration,
> > regardless of success.“, where the three call sites were:
> > do_huge_pmd_numa_page(), do_numa_page(), and do_pmd_numa_page().
> >
> > The current code still follows what the commit log does.
>
> Per my understanding, the issue is introduced in commit b99a342d4f11
> ("NUMA balancing: reduce TLB flush via delaying mapping on hint page
> fault").  Before that, the PTE is restored before migration, so
> task_numa_fault() should be called for migration failure too.  After
> that, the PTE is restored after migration failure, if the PTE has been
> changed by someone else, someone else should have called
> task_numa_fault() if necessary, we shouldn't call it again.

You are right. Will fix the issue. Thank you for the explanation.

-- 
Best Regards,
Yan, Zi


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-07-23  1:43 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-12  2:44 [RFC PATCH 0/3] Fix and refactor do_{huge_pmd_}numa_page() Zi Yan
2024-07-12  2:44 ` [RFC PATCH 1/3] memory tiering: read last_cpupid correctly in do_huge_pmd_numa_page() Zi Yan
2024-07-12  3:22   ` Huang, Ying
2024-07-12  4:01   ` Baolin Wang
2024-07-13  1:13   ` David Hildenbrand
2024-07-13  1:18     ` Zi Yan
2024-07-13  1:23       ` David Hildenbrand
2024-07-12  2:44 ` [RFC PATCH 2/3] memory tiering: introduce folio_has_cpupid() check Zi Yan
2024-07-12  6:27   ` Huang, Ying
2024-07-12  2:44 ` [RFC PATCH 3/3] mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep) Zi Yan
2024-07-18  8:36   ` Huang, Ying
2024-07-18 14:40     ` Zi Yan
2024-07-19 20:19     ` Zi Yan
2024-07-22  1:47       ` Huang, Ying
2024-07-22 14:01         ` Zi Yan
2024-07-22 15:21           ` Zi Yan
2024-07-23  1:16             ` Huang, Ying
2024-07-23  1:43               ` Zi Yan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).