* [PATCH v1 0/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
@ 2024-06-20 21:29 David Hildenbrand
  2024-06-20 21:29 ` [PATCH v1 1/2] mm/migrate: make migrate_misplaced_folio() return 0 on success David Hildenbrand
  2024-06-20 21:29 ` [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL David Hildenbrand
  0 siblings, 2 replies; 24+ messages in thread
From: David Hildenbrand @ 2024-06-20 21:29 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, David Hildenbrand, Andrew Morton

While working on moving some mapcount-related checks -- especially
folio_likely_mapped_shared() invocations -- under the PTL, where we are
sure the folio will remain mapped and, consequently, that the values
are actually meaningful in the current MM context, it turned out that
there are not that many problematic bits to sort out.

This series tackles the NUMA hinting fault handling: we now perform the
checks and folio isolation under the PTL, with the nice side effect
that we have to take fewer temporary folio references.
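
In a condensed sketch (using the helpers this series introduces; the
patches below are authoritative), the change to the reference handling
looks roughly like this:

	/* Before: numa_migrate_prep() took a temporary reference that
	 * every failure path had to drop again: */
	folio_get(folio);
	/* ... PTL dropped ... */
	migrate_misplaced_folio(folio, vma, target_nid);

	/* After: checks + isolation happen under the PTL; only a
	 * successful LRU isolation takes a reference: */
	if (migrate_misplaced_folio_prepare(folio, vma, target_nid))
		goto out_map;	/* no temporary reference taken */
	spin_unlock(vmf->ptl);
	migrate_misplaced_folio(folio, vma, target_nid);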

Did a quick test on a 2-socket system; NUMA hinting+migration on PTEs
and PMDs seems to continue working as expected.

Cc: Andrew Morton <akpm@linux-foundation.org>

David Hildenbrand (2):
  mm/migrate: make migrate_misplaced_folio() return 0 on success
  mm/migrate: move NUMA hinting fault folio isolation + checks under PTL

 include/linux/migrate.h |  7 ++++
 mm/huge_memory.c        | 13 ++++---
 mm/memory.c             | 11 +++---
 mm/migrate.c            | 81 +++++++++++++++++++----------------------
 4 files changed, 58 insertions(+), 54 deletions(-)


base-commit: a53138cdbe3ea8875405e96fa9cde64e24f4f9e1
-- 
2.45.2




* [PATCH v1 1/2] mm/migrate: make migrate_misplaced_folio() return 0 on success
  2024-06-20 21:29 [PATCH v1 0/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL David Hildenbrand
@ 2024-06-20 21:29 ` David Hildenbrand
  2024-06-21  1:40   ` Zi Yan
                     ` (2 more replies)
  2024-06-20 21:29 ` [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL David Hildenbrand
  1 sibling, 3 replies; 24+ messages in thread
From: David Hildenbrand @ 2024-06-20 21:29 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, David Hildenbrand, Andrew Morton

Let's just return 0 on success, which is less confusing.

... especially because we got it wrong in the migrate.h stub where we
have "return -EAGAIN; /* can't migrate now */" instead of "return 0;".
Likely this wrong return value doesn't currently matter, but it
certainly adds confusion.

We'll add migrate_misplaced_folio_prepare() next, where we want to use
the same "return 0 on success" approach, so let's just clean this up.
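
For illustration (a minimal sketch of the two calling conventions; the
hunks below are the real change):

	/* Old: nonzero meant "migration succeeded" -- yet the
	 * !CONFIG_NUMA_BALANCING stub returned -EAGAIN, i.e. nonzero,
	 * which would read as success: */
	if (migrate_misplaced_folio(folio, vma, target_nid))
		flags |= TNF_MIGRATED;

	/* New: 0 means success, matching migrate_pages() and most of
	 * mm/migrate.c: */
	if (!migrate_misplaced_folio(folio, vma, target_nid))
		flags |= TNF_MIGRATED;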

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 5 ++---
 mm/memory.c      | 2 +-
 mm/migrate.c     | 4 ++--
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0fffaa58a47a..fc27dabcd8e3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1652,7 +1652,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	int nid = NUMA_NO_NODE;
 	int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
-	bool migrated = false, writable = false;
+	bool writable = false;
 	int flags = 0;
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
@@ -1696,8 +1696,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	spin_unlock(vmf->ptl);
 	writable = false;
 
-	migrated = migrate_misplaced_folio(folio, vma, target_nid);
-	if (migrated) {
+	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
 		flags |= TNF_MIGRATED;
 		nid = target_nid;
 	} else {
diff --git a/mm/memory.c b/mm/memory.c
index 00728ea95583..118660de5bcc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5354,7 +5354,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	ignore_writable = true;
 
 	/* Migrate to the requested node */
-	if (migrate_misplaced_folio(folio, vma, target_nid)) {
+	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
 		nid = target_nid;
 		flags |= TNF_MIGRATED;
 	} else {
diff --git a/mm/migrate.c b/mm/migrate.c
index 781979567f64..0307b54879a0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2629,11 +2629,11 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
 					    nr_succeeded);
 	}
 	BUG_ON(!list_empty(&migratepages));
-	return isolated;
+	return isolated ? 0 : -EAGAIN;
 
 out:
 	folio_put(folio);
-	return 0;
+	return -EAGAIN;
 }
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_NUMA */
-- 
2.45.2




* [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-06-20 21:29 [PATCH v1 0/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL David Hildenbrand
  2024-06-20 21:29 ` [PATCH v1 1/2] mm/migrate: make migrate_misplaced_folio() return 0 on success David Hildenbrand
@ 2024-06-20 21:29 ` David Hildenbrand
  2024-06-21  2:05   ` Zi Yan
                     ` (4 more replies)
  1 sibling, 5 replies; 24+ messages in thread
From: David Hildenbrand @ 2024-06-20 21:29 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, David Hildenbrand, Andrew Morton

Currently we always take a folio reference even if migration will not
even be tried or isolation failed, requiring us to grab+drop an additional
reference.

Further, we end up calling folio_likely_mapped_shared() while the folio
might have already been unmapped, because after we dropped the PTL, that
can easily happen. We want to stop touching mapcounts and friends from
such context, and only call folio_likely_mapped_shared() while the folio
is still mapped: mapcount information is pretty much stale and unreliable
otherwise.

So let's move checks into numamigrate_isolate_folio(), rename that
function to migrate_misplaced_folio_prepare(), and call that function
from callsites where we call migrate_misplaced_folio(), but still with
the PTL held.

We can now stop taking temporary folio references, and really only take
a reference if folio isolation succeeded. Doing the
folio_likely_mapped_shared() + golio isolation under PT lock is now similar
to how we handle MADV_PAGEOUT.

While at it, combine the folio_is_file_lru() checks.
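
The resulting caller pattern (a condensed sketch of the huge_memory.c
hunk below; the memory.c caller is analogous) keeps the checks and the
isolation under the PT lock and only drops the lock afterwards:

	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
	/* ... */
	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
	if (target_nid == NUMA_NO_NODE)
		goto out_map;
	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
		flags |= TNF_MIGRATE_FAIL;
		goto out_map;
	}
	/* The folio is isolated; isolation code holds a folio reference. */
	spin_unlock(vmf->ptl);
	migrate_misplaced_folio(folio, vma, target_nid);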

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/migrate.h |  7 ++++
 mm/huge_memory.c        |  8 ++--
 mm/memory.c             |  9 +++--
 mm/migrate.c            | 81 +++++++++++++++++++----------------------
 4 files changed, 55 insertions(+), 50 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f9d92482d117..644be30b69c8 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -139,9 +139,16 @@ const struct movable_operations *page_movable_ops(struct page *page)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
+int migrate_misplaced_folio_prepare(struct folio *folio,
+		struct vm_area_struct *vma, int node);
 int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
 			   int node);
 #else
+static inline int migrate_misplaced_folio_prepare(struct folio *folio,
+		struct vm_area_struct *vma, int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 static inline int migrate_misplaced_folio(struct folio *folio,
 					 struct vm_area_struct *vma, int node)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fc27dabcd8e3..4b2817bb2c7d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1688,11 +1688,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	if (node_is_toptier(nid))
 		last_cpupid = folio_last_cpupid(folio);
 	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
-	if (target_nid == NUMA_NO_NODE) {
-		folio_put(folio);
+	if (target_nid == NUMA_NO_NODE)
+		goto out_map;
+	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
+		flags |= TNF_MIGRATE_FAIL;
 		goto out_map;
 	}
-
+	/* The folio is isolated and isolation code holds a folio reference. */
 	spin_unlock(vmf->ptl);
 	writable = false;
 
diff --git a/mm/memory.c b/mm/memory.c
index 118660de5bcc..4fd1ecfced4d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5207,8 +5207,6 @@ int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
 {
 	struct vm_area_struct *vma = vmf->vma;
 
-	folio_get(folio);
-
 	/* Record the current PID acceesing VMA */
 	vma_set_access_pid_bit(vma);
 
@@ -5345,10 +5343,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	else
 		last_cpupid = folio_last_cpupid(folio);
 	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
-	if (target_nid == NUMA_NO_NODE) {
-		folio_put(folio);
+	if (target_nid == NUMA_NO_NODE)
+		goto out_map;
+	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
+		flags |= TNF_MIGRATE_FAIL;
 		goto out_map;
 	}
+	/* The folio is isolated and isolation code holds a folio reference. */
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	writable = false;
 	ignore_writable = true;
diff --git a/mm/migrate.c b/mm/migrate.c
index 0307b54879a0..27f070f64f27 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2530,9 +2530,37 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
 	return __folio_alloc_node(gfp, order, nid);
 }
 
-static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio)
+/*
+ * Prepare for calling migrate_misplaced_folio() by isolating the folio if
+ * permitted. Must be called with the PTL still held.
+ */
+int migrate_misplaced_folio_prepare(struct folio *folio,
+		struct vm_area_struct *vma, int node)
 {
 	int nr_pages = folio_nr_pages(folio);
+	pg_data_t *pgdat = NODE_DATA(node);
+
+	if (folio_is_file_lru(folio)) {
+		/*
+		 * Do not migrate file folios that are mapped in multiple
+		 * processes with execute permissions as they are probably
+		 * shared libraries.
+		 *
+		 * See folio_likely_mapped_shared() on possible imprecision
+		 * when we cannot easily detect if a folio is shared.
+		 */
+		if ((vma->vm_flags & VM_EXEC) &&
+		    folio_likely_mapped_shared(folio))
+			return -EACCES;
+
+		/*
+		 * Do not migrate dirty folios as not all filesystems can move
+		 * dirty folios in MIGRATE_ASYNC mode which is a waste of
+		 * cycles.
+		 */
+		if (folio_test_dirty(folio))
+			return -EAGAIN;
+	}
 
 	/* Avoid migrating to a node that is nearly full */
 	if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
@@ -2550,65 +2578,37 @@ static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio)
 		 * further.
 		 */
 		if (z < 0)
-			return 0;
+			return -EAGAIN;
 
 		wakeup_kswapd(pgdat->node_zones + z, 0,
 			      folio_order(folio), ZONE_MOVABLE);
-		return 0;
+		return -EAGAIN;
 	}
 
 	if (!folio_isolate_lru(folio))
-		return 0;
+		return -EAGAIN;
 
 	node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio),
 			    nr_pages);
-
-	/*
-	 * Isolating the folio has taken another reference, so the
-	 * caller's reference can be safely dropped without the folio
-	 * disappearing underneath us during migration.
-	 */
-	folio_put(folio);
-	return 1;
+	return 0;
 }
 
 /*
  * Attempt to migrate a misplaced folio to the specified destination
- * node. Caller is expected to have an elevated reference count on
- * the folio that will be dropped by this function before returning.
+ * node. Caller is expected to have isolated the folio by calling
+ * migrate_misplaced_folio_prepare(), which will result in an
+ * elevated reference count on the folio. This function will un-isolate the
+ * folio, dropping the folio reference before returning.
  */
 int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
 			    int node)
 {
 	pg_data_t *pgdat = NODE_DATA(node);
-	int isolated;
 	int nr_remaining;
 	unsigned int nr_succeeded;
 	LIST_HEAD(migratepages);
 	int nr_pages = folio_nr_pages(folio);
 
-	/*
-	 * Don't migrate file folios that are mapped in multiple processes
-	 * with execute permissions as they are probably shared libraries.
-	 *
-	 * See folio_likely_mapped_shared() on possible imprecision when we
-	 * cannot easily detect if a folio is shared.
-	 */
-	if (folio_likely_mapped_shared(folio) && folio_is_file_lru(folio) &&
-	    (vma->vm_flags & VM_EXEC))
-		goto out;
-
-	/*
-	 * Also do not migrate dirty folios as not all filesystems can move
-	 * dirty folios in MIGRATE_ASYNC mode which is a waste of cycles.
-	 */
-	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
-		goto out;
-
-	isolated = numamigrate_isolate_folio(pgdat, folio);
-	if (!isolated)
-		goto out;
-
 	list_add(&folio->lru, &migratepages);
 	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_folio,
 				     NULL, node, MIGRATE_ASYNC,
@@ -2620,7 +2620,6 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
 					folio_is_file_lru(folio), -nr_pages);
 			folio_putback_lru(folio);
 		}
-		isolated = 0;
 	}
 	if (nr_succeeded) {
 		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
@@ -2629,11 +2628,7 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
 					    nr_succeeded);
 	}
 	BUG_ON(!list_empty(&migratepages));
-	return isolated ? 0 : -EAGAIN;
-
-out:
-	folio_put(folio);
-	return -EAGAIN;
+	return nr_remaining ? -EAGAIN : 0;
 }
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_NUMA */
-- 
2.45.2




* Re: [PATCH v1 1/2] mm/migrate: make migrate_misplaced_folio() return 0 on success
  2024-06-20 21:29 ` [PATCH v1 1/2] mm/migrate: make migrate_misplaced_folio() return 0 on success David Hildenbrand
@ 2024-06-21  1:40   ` Zi Yan
  2024-06-21  3:39   ` Baolin Wang
  2024-07-01  7:36   ` Huang, Ying
  2 siblings, 0 replies; 24+ messages in thread
From: Zi Yan @ 2024-06-21  1:40 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: linux-kernel, linux-mm, Andrew Morton

On 20 Jun 2024, at 17:29, David Hildenbrand wrote:

> Let's just return 0 on success, which is less confusing.
>
> ... especially because we got it wrong in the migrate.h stub where we
> have "return -EAGAIN; /* can't migrate now */" instead of "return 0;".
> Likely this wrong return value doesn't currently matter, but it
> certainly adds confusion.
>
> We'll add migrate_misplaced_folio_prepare() next, where we want to use
> the same "return 0 on success" approach, so let's just clean this up.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/huge_memory.c | 5 ++---
>  mm/memory.c      | 2 +-
>  mm/migrate.c     | 4 ++--
>  3 files changed, 5 insertions(+), 6 deletions(-)
>

LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi


* Re: [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-06-20 21:29 ` [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL David Hildenbrand
@ 2024-06-21  2:05   ` Zi Yan
  2024-06-21  7:32     ` David Hildenbrand
  2024-06-21  4:07   ` Baolin Wang
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Zi Yan @ 2024-06-21  2:05 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: linux-kernel, linux-mm, Andrew Morton

On 20 Jun 2024, at 17:29, David Hildenbrand wrote:

> Currently we always take a folio reference even if migration will not
> even be tried or isolation failed, requiring us to grab+drop an additional
> reference.
>
> Further, we end up calling folio_likely_mapped_shared() while the folio
> might have already been unmapped, because after we dropped the PTL, that
> can easily happen. We want to stop touching mapcounts and friends from
> such context, and only call folio_likely_mapped_shared() while the folio
> is still mapped: mapcount information is pretty much stale and unreliable
> otherwise.
>
> So let's move checks into numamigrate_isolate_folio(), rename that
> function to migrate_misplaced_folio_prepare(), and call that function
> from callsites where we call migrate_misplaced_folio(), but still with
> the PTL held.
>
> We can now stop taking temporary folio references, and really only take
> a reference if folio isolation succeeded. Doing the
> folio_likely_mapped_shared() + golio isolation under PT lock is now similar
> to how we handle MADV_PAGEOUT.
>
> While at it, combine the folio_is_file_lru() checks.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/migrate.h |  7 ++++
>  mm/huge_memory.c        |  8 ++--
>  mm/memory.c             |  9 +++--
>  mm/migrate.c            | 81 +++++++++++++++++++----------------------
>  4 files changed, 55 insertions(+), 50 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index f9d92482d117..644be30b69c8 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -139,9 +139,16 @@ const struct movable_operations *page_movable_ops(struct page *page)
>  }
>
>  #ifdef CONFIG_NUMA_BALANCING
> +int migrate_misplaced_folio_prepare(struct folio *folio,
> +		struct vm_area_struct *vma, int node);
>  int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>  			   int node);
>  #else
> +static inline int migrate_misplaced_folio_prepare(struct folio *folio,
> +		struct vm_area_struct *vma, int node)
> +{
> +	return -EAGAIN; /* can't migrate now */
> +}
>  static inline int migrate_misplaced_folio(struct folio *folio,
>  					 struct vm_area_struct *vma, int node)
>  {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fc27dabcd8e3..4b2817bb2c7d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1688,11 +1688,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  	if (node_is_toptier(nid))
>  		last_cpupid = folio_last_cpupid(folio);
>  	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
> -	if (target_nid == NUMA_NO_NODE) {
> -		folio_put(folio);
> +	if (target_nid == NUMA_NO_NODE)
> +		goto out_map;
> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
> +		flags |= TNF_MIGRATE_FAIL;
>  		goto out_map;
>  	}
> -
> +	/* The folio is isolated and isolation code holds a folio reference. */
>  	spin_unlock(vmf->ptl);
>  	writable = false;
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 118660de5bcc..4fd1ecfced4d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5207,8 +5207,6 @@ int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>
> -	folio_get(folio);
> -
>  	/* Record the current PID acceesing VMA */
>  	vma_set_access_pid_bit(vma);

Without taking a reference here, is it safe to call mpol_misplaced() on the folio
at the return? Can this folio be freed?

--
Best Regards,
Yan, Zi


* Re: [PATCH v1 1/2] mm/migrate: make migrate_misplaced_folio() return 0 on success
  2024-06-20 21:29 ` [PATCH v1 1/2] mm/migrate: make migrate_misplaced_folio() return 0 on success David Hildenbrand
  2024-06-21  1:40   ` Zi Yan
@ 2024-06-21  3:39   ` Baolin Wang
  2024-07-01  7:36   ` Huang, Ying
  2 siblings, 0 replies; 24+ messages in thread
From: Baolin Wang @ 2024-06-21  3:39 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel; +Cc: linux-mm, Andrew Morton



On 2024/6/21 05:29, David Hildenbrand wrote:
> Let's just return 0 on success, which is less confusing.
> 
> ... especially because we got it wrong in the migrate.h stub where we
> have "return -EAGAIN; /* can't migrate now */" instead of "return 0;".
> Likely this wrong return value doesn't currently matter, but it
> certainly adds confusion.
> 
> We'll add migrate_misplaced_folio_prepare() next, where we want to use
> the same "return 0 on success" approach, so let's just clean this up.

LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   mm/huge_memory.c | 5 ++---
>   mm/memory.c      | 2 +-
>   mm/migrate.c     | 4 ++--
>   3 files changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 0fffaa58a47a..fc27dabcd8e3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1652,7 +1652,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>   	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>   	int nid = NUMA_NO_NODE;
>   	int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
> -	bool migrated = false, writable = false;
> +	bool writable = false;
>   	int flags = 0;
>   
>   	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> @@ -1696,8 +1696,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>   	spin_unlock(vmf->ptl);
>   	writable = false;
>   
> -	migrated = migrate_misplaced_folio(folio, vma, target_nid);
> -	if (migrated) {
> +	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
>   		flags |= TNF_MIGRATED;
>   		nid = target_nid;
>   	} else {
> diff --git a/mm/memory.c b/mm/memory.c
> index 00728ea95583..118660de5bcc 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5354,7 +5354,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>   	ignore_writable = true;
>   
>   	/* Migrate to the requested node */
> -	if (migrate_misplaced_folio(folio, vma, target_nid)) {
> +	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
>   		nid = target_nid;
>   		flags |= TNF_MIGRATED;
>   	} else {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 781979567f64..0307b54879a0 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2629,11 +2629,11 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>   					    nr_succeeded);
>   	}
>   	BUG_ON(!list_empty(&migratepages));
> -	return isolated;
> +	return isolated ? 0 : -EAGAIN;
>   
>   out:
>   	folio_put(folio);
> -	return 0;
> +	return -EAGAIN;
>   }
>   #endif /* CONFIG_NUMA_BALANCING */
>   #endif /* CONFIG_NUMA */



* Re: [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-06-20 21:29 ` [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL David Hildenbrand
  2024-06-21  2:05   ` Zi Yan
@ 2024-06-21  4:07   ` Baolin Wang
  2024-06-21  7:31     ` David Hildenbrand
  2024-06-21 13:44   ` Zi Yan
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Baolin Wang @ 2024-06-21  4:07 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel; +Cc: linux-mm, Andrew Morton



On 2024/6/21 05:29, David Hildenbrand wrote:
> Currently we always take a folio reference even if migration will not
> even be tried or isolation failed, requiring us to grab+drop an additional
> reference.
> 
> Further, we end up calling folio_likely_mapped_shared() while the folio
> might have already been unmapped, because after we dropped the PTL, that
> can easily happen. We want to stop touching mapcounts and friends from
> such context, and only call folio_likely_mapped_shared() while the folio
> is still mapped: mapcount information is pretty much stale and unreliable
> otherwise.
> 
> So let's move checks into numamigrate_isolate_folio(), rename that
> function to migrate_misplaced_folio_prepare(), and call that function
> from callsites where we call migrate_misplaced_folio(), but still with
> the PTL held.
> 
> We can now stop taking temporary folio references, and really only take
> a reference if folio isolation succeeded. Doing the
> folio_likely_mapped_shared() + golio isolation under PT lock is now similar
> to how we handle MADV_PAGEOUT.

s/golio/folio

Makes sense to me. Feel free to add:
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

> 
> While at it, combine the folio_is_file_lru() checks.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   include/linux/migrate.h |  7 ++++
>   mm/huge_memory.c        |  8 ++--
>   mm/memory.c             |  9 +++--
>   mm/migrate.c            | 81 +++++++++++++++++++----------------------
>   4 files changed, 55 insertions(+), 50 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index f9d92482d117..644be30b69c8 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -139,9 +139,16 @@ const struct movable_operations *page_movable_ops(struct page *page)
>   }
>   
>   #ifdef CONFIG_NUMA_BALANCING
> +int migrate_misplaced_folio_prepare(struct folio *folio,
> +		struct vm_area_struct *vma, int node);
>   int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>   			   int node);
>   #else
> +static inline int migrate_misplaced_folio_prepare(struct folio *folio,
> +		struct vm_area_struct *vma, int node)
> +{
> +	return -EAGAIN; /* can't migrate now */
> +}
>   static inline int migrate_misplaced_folio(struct folio *folio,
>   					 struct vm_area_struct *vma, int node)
>   {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fc27dabcd8e3..4b2817bb2c7d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1688,11 +1688,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>   	if (node_is_toptier(nid))
>   		last_cpupid = folio_last_cpupid(folio);
>   	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
> -	if (target_nid == NUMA_NO_NODE) {
> -		folio_put(folio);
> +	if (target_nid == NUMA_NO_NODE)
> +		goto out_map;
> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
> +		flags |= TNF_MIGRATE_FAIL;
>   		goto out_map;
>   	}
> -
> +	/* The folio is isolated and isolation code holds a folio reference. */
>   	spin_unlock(vmf->ptl);
>   	writable = false;
>   
> diff --git a/mm/memory.c b/mm/memory.c
> index 118660de5bcc..4fd1ecfced4d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5207,8 +5207,6 @@ int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
>   {
>   	struct vm_area_struct *vma = vmf->vma;
>   
> -	folio_get(folio);
> -
>   	/* Record the current PID acceesing VMA */
>   	vma_set_access_pid_bit(vma);
>   
> @@ -5345,10 +5343,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>   	else
>   		last_cpupid = folio_last_cpupid(folio);
>   	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
> -	if (target_nid == NUMA_NO_NODE) {
> -		folio_put(folio);
> +	if (target_nid == NUMA_NO_NODE)
> +		goto out_map;
> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
> +		flags |= TNF_MIGRATE_FAIL;
>   		goto out_map;
>   	}
> +	/* The folio is isolated and isolation code holds a folio reference. */
>   	pte_unmap_unlock(vmf->pte, vmf->ptl);
>   	writable = false;
>   	ignore_writable = true;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 0307b54879a0..27f070f64f27 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2530,9 +2530,37 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
>   	return __folio_alloc_node(gfp, order, nid);
>   }
>   
> -static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio)
> +/*
> + * Prepare for calling migrate_misplaced_folio() by isolating the folio if
> + * permitted. Must be called with the PTL still held.
> + */
> +int migrate_misplaced_folio_prepare(struct folio *folio,
> +		struct vm_area_struct *vma, int node)
>   {
>   	int nr_pages = folio_nr_pages(folio);
> +	pg_data_t *pgdat = NODE_DATA(node);
> +
> +	if (folio_is_file_lru(folio)) {
> +		/*
> +		 * Do not migrate file folios that are mapped in multiple
> +		 * processes with execute permissions as they are probably
> +		 * shared libraries.
> +		 *
> +		 * See folio_likely_mapped_shared() on possible imprecision
> +		 * when we cannot easily detect if a folio is shared.
> +		 */
> +		if ((vma->vm_flags & VM_EXEC) &&
> +		    folio_likely_mapped_shared(folio))
> +			return -EACCES;
> +
> +		/*
> +		 * Do not migrate dirty folios as not all filesystems can move
> +		 * dirty folios in MIGRATE_ASYNC mode which is a waste of
> +		 * cycles.
> +		 */
> +		if (folio_test_dirty(folio))
> +			return -EAGAIN;
> +	}
>   
>   	/* Avoid migrating to a node that is nearly full */
>   	if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
> @@ -2550,65 +2578,37 @@ static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio)
>   		 * further.
>   		 */
>   		if (z < 0)
> -			return 0;
> +			return -EAGAIN;
>   
>   		wakeup_kswapd(pgdat->node_zones + z, 0,
>   			      folio_order(folio), ZONE_MOVABLE);
> -		return 0;
> +		return -EAGAIN;
>   	}
>   
>   	if (!folio_isolate_lru(folio))
> -		return 0;
> +		return -EAGAIN;
>   
>   	node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio),
>   			    nr_pages);
> -
> -	/*
> -	 * Isolating the folio has taken another reference, so the
> -	 * caller's reference can be safely dropped without the folio
> -	 * disappearing underneath us during migration.
> -	 */
> -	folio_put(folio);
> -	return 1;
> +	return 0;
>   }

(just a nit: returning boolean seems more readable)

>   
>   /*
>    * Attempt to migrate a misplaced folio to the specified destination
> - * node. Caller is expected to have an elevated reference count on
> - * the folio that will be dropped by this function before returning.
> + * node. Caller is expected to have isolated the folio by calling
> + * migrate_misplaced_folio_prepare(), which will result in an
> + * elevated reference count on the folio. This function will un-isolate the
> + * folio, dropping the folio reference before returning.
>    */
>   int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>   			    int node)
>   {
>   	pg_data_t *pgdat = NODE_DATA(node);
> -	int isolated;
>   	int nr_remaining;
>   	unsigned int nr_succeeded;
>   	LIST_HEAD(migratepages);
>   	int nr_pages = folio_nr_pages(folio);
>   
> -	/*
> -	 * Don't migrate file folios that are mapped in multiple processes
> -	 * with execute permissions as they are probably shared libraries.
> -	 *
> -	 * See folio_likely_mapped_shared() on possible imprecision when we
> -	 * cannot easily detect if a folio is shared.
> -	 */
> -	if (folio_likely_mapped_shared(folio) && folio_is_file_lru(folio) &&
> -	    (vma->vm_flags & VM_EXEC))
> -		goto out;
> -
> -	/*
> -	 * Also do not migrate dirty folios as not all filesystems can move
> -	 * dirty folios in MIGRATE_ASYNC mode which is a waste of cycles.
> -	 */
> -	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
> -		goto out;
> -
> -	isolated = numamigrate_isolate_folio(pgdat, folio);
> -	if (!isolated)
> -		goto out;
> -
>   	list_add(&folio->lru, &migratepages);
>   	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_folio,
>   				     NULL, node, MIGRATE_ASYNC,
> @@ -2620,7 +2620,6 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>   					folio_is_file_lru(folio), -nr_pages);
>   			folio_putback_lru(folio);
>   		}
> -		isolated = 0;
>   	}
>   	if (nr_succeeded) {
>   		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
> @@ -2629,11 +2628,7 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>   					    nr_succeeded);
>   	}
>   	BUG_ON(!list_empty(&migratepages));
> -	return isolated ? 0 : -EAGAIN;
> -
> -out:
> -	folio_put(folio);
> -	return -EAGAIN;
> +	return nr_remaining ? -EAGAIN : 0;
>   }
>   #endif /* CONFIG_NUMA_BALANCING */
>   #endif /* CONFIG_NUMA */



* Re: [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-06-21  4:07   ` Baolin Wang
@ 2024-06-21  7:31     ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2024-06-21  7:31 UTC (permalink / raw)
  To: Baolin Wang, linux-kernel; +Cc: linux-mm, Andrew Morton

On 21.06.24 06:07, Baolin Wang wrote:
> 
> 
> On 2024/6/21 05:29, David Hildenbrand wrote:
>> Currently we always take a folio reference even if migration will not
>> even be tried or isolation failed, requiring us to grab+drop an additional
>> reference.
>>
>> Further, we end up calling folio_likely_mapped_shared() while the folio
>> might have already been unmapped, because after we dropped the PTL, that
>> can easily happen. We want to stop touching mapcounts and friends from
>> such context, and only call folio_likely_mapped_shared() while the folio
>> is still mapped: mapcount information is pretty much stale and unreliable
>> otherwise.
>>
>> So let's move checks into numamigrate_isolate_folio(), rename that
>> function to migrate_misplaced_folio_prepare(), and call that function
>> from callsites where we call migrate_misplaced_folio(), but still with
>> the PTL held.
>>
>> We can now stop taking temporary folio references, and really only take
>> a reference if folio isolation succeeded. Doing the
>> folio_likely_mapped_shared() + golio isolation under PT lock is now similar
>> to how we handle MADV_PAGEOUT.
> 
> s/golio/folio
> 
> Make sense to me. Feel free to add:
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>


Thanks!

[...]

>>    	node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio),
>>    			    nr_pages);
>> -
>> -	/*
>> -	 * Isolating the folio has taken another reference, so the
>> -	 * caller's reference can be safely dropped without the folio
>> -	 * disappearing underneath us during migration.
>> -	 */
>> -	folio_put(folio);
>> -	return 1;
>> +	return 0;
>>    }
> 
> (just a nit: returning boolean seems more readable)

"return true" on success really is nasty in a code base where most 
functions return "0" on success. Like most functions in mm/migrate.c -- 
like migrate_pages() -- do.

Thanks!

-- 
Cheers,

David / dhildenb




* Re: [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-06-21  2:05   ` Zi Yan
@ 2024-06-21  7:32     ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2024-06-21  7:32 UTC (permalink / raw)
  To: Zi Yan; +Cc: linux-kernel, linux-mm, Andrew Morton

On 21.06.24 04:05, Zi Yan wrote:
> On 20 Jun 2024, at 17:29, David Hildenbrand wrote:
> 
>> Currently we always take a folio reference even if migration will not
>> even be tried or isolation failed, requiring us to grab+drop an additional
>> reference.
>>
>> Further, we end up calling folio_likely_mapped_shared() while the folio
>> might have already been unmapped, because after we dropped the PTL, that
>> can easily happen. We want to stop touching mapcounts and friends from
>> such context, and only call folio_likely_mapped_shared() while the folio
>> is still mapped: mapcount information is pretty much stale and unreliable
>> otherwise.
>>
>> So let's move checks into numamigrate_isolate_folio(), rename that
>> function to migrate_misplaced_folio_prepare(), and call that function
>> from callsites where we call migrate_misplaced_folio(), but still with
>> the PTL held.
>>
>> We can now stop taking temporary folio references, and really only take
>> a reference if folio isolation succeeded. Doing the
>> folio_likely_mapped_shared() + golio isolation under PT lock is now similar
>> to how we handle MADV_PAGEOUT.
>>
>> While at it, combine the folio_is_file_lru() checks.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   include/linux/migrate.h |  7 ++++
>>   mm/huge_memory.c        |  8 ++--
>>   mm/memory.c             |  9 +++--
>>   mm/migrate.c            | 81 +++++++++++++++++++----------------------
>>   4 files changed, 55 insertions(+), 50 deletions(-)
>>
>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>> index f9d92482d117..644be30b69c8 100644
>> --- a/include/linux/migrate.h
>> +++ b/include/linux/migrate.h
>> @@ -139,9 +139,16 @@ const struct movable_operations *page_movable_ops(struct page *page)
>>   }
>>
>>   #ifdef CONFIG_NUMA_BALANCING
>> +int migrate_misplaced_folio_prepare(struct folio *folio,
>> +		struct vm_area_struct *vma, int node);
>>   int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>>   			   int node);
>>   #else
>> +static inline int migrate_misplaced_folio_prepare(struct folio *folio,
>> +		struct vm_area_struct *vma, int node)
>> +{
>> +	return -EAGAIN; /* can't migrate now */
>> +}
>>   static inline int migrate_misplaced_folio(struct folio *folio,
>>   					 struct vm_area_struct *vma, int node)
>>   {
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index fc27dabcd8e3..4b2817bb2c7d 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1688,11 +1688,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>   	if (node_is_toptier(nid))
>>   		last_cpupid = folio_last_cpupid(folio);
>>   	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>> -	if (target_nid == NUMA_NO_NODE) {
>> -		folio_put(folio);
>> +	if (target_nid == NUMA_NO_NODE)
>> +		goto out_map;
>> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>> +		flags |= TNF_MIGRATE_FAIL;
>>   		goto out_map;
>>   	}
>> -
>> +	/* The folio is isolated and isolation code holds a folio reference. */
>>   	spin_unlock(vmf->ptl);
>>   	writable = false;
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 118660de5bcc..4fd1ecfced4d 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5207,8 +5207,6 @@ int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
>>   {
>>   	struct vm_area_struct *vma = vmf->vma;
>>
>> -	folio_get(folio);
>> -
>>   	/* Record the current PID acceesing VMA */
>>   	vma_set_access_pid_bit(vma);
> 
> Without taking a reference here, is it safe to call mpol_misplaced() on the folio
> at the return? Can this folio be freed?

As long as we are holding the PTL, the folio cannot get unmapped and
consequently cannot get freed.

Note that if that were the case (the folio being able to get freed
concurrently), even the folio_get() would have been wrong: if it could
get freed concurrently, we would have to use folio_try_get*().
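
In other words (an illustrative sketch, not code from this series):

	/* Under the PTL the folio is known to be mapped, so its
	 * refcount is guaranteed to be > 0 and a plain get is fine: */
	spin_lock(ptl);
	folio_get(folio);
	spin_unlock(ptl);

	/* Without such a guarantee, the folio could reach refcount 0
	 * and get freed at any time; only the try-variant is safe: */
	if (!folio_try_get(folio))
		return;		/* folio was concurrently freed */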

Thanks!

-- 
Cheers,

David / dhildenb




* Re: [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-06-20 21:29 ` [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL David Hildenbrand
  2024-06-21  2:05   ` Zi Yan
  2024-06-21  4:07   ` Baolin Wang
@ 2024-06-21 13:44   ` Zi Yan
  2024-06-21 20:18     ` David Hildenbrand
  2024-06-21 17:47   ` Donet Tom
  2024-06-26 16:22   ` David Hildenbrand
  4 siblings, 1 reply; 24+ messages in thread
From: Zi Yan @ 2024-06-21 13:44 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: linux-kernel, linux-mm, Andrew Morton

On 20 Jun 2024, at 17:29, David Hildenbrand wrote:

> Currently we always take a folio reference even if migration will not
> even be tried or isolation failed, requiring us to grab+drop an additional
> reference.
>
> Further, we end up calling folio_likely_mapped_shared() while the folio
> might have already been unmapped, because after we dropped the PTL, that
> can easily happen. We want to stop touching mapcounts and friends from
> such context, and only call folio_likely_mapped_shared() while the folio
> is still mapped: mapcount information is pretty much stale and unreliable
> otherwise.
>
> So let's move checks into numamigrate_isolate_folio(), rename that
> function to migrate_misplaced_folio_prepare(), and call that function
> from callsites where we call migrate_misplaced_folio(), but still with
> the PTL held.
>
> We can now stop taking temporary folio references, and really only take
> a reference if folio isolation succeeded. Doing the
> folio_likely_mapped_shared() + golio isolation under PT lock is now similar
> to how we handle MADV_PAGEOUT.
>
> While at it, combine the folio_is_file_lru() checks.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/migrate.h |  7 ++++
>  mm/huge_memory.c        |  8 ++--
>  mm/memory.c             |  9 +++--
>  mm/migrate.c            | 81 +++++++++++++++++++----------------------
>  4 files changed, 55 insertions(+), 50 deletions(-)

LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>

One nit below:

<snip>

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fc27dabcd8e3..4b2817bb2c7d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1688,11 +1688,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  	if (node_is_toptier(nid))
>  		last_cpupid = folio_last_cpupid(folio);
>  	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
> -	if (target_nid == NUMA_NO_NODE) {
> -		folio_put(folio);
> +	if (target_nid == NUMA_NO_NODE)
> +		goto out_map;
> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
> +		flags |= TNF_MIGRATE_FAIL;
>  		goto out_map;
>  	}
> -
> +	/* The folio is isolated and isolation code holds a folio reference. */
>  	spin_unlock(vmf->ptl);
>  	writable = false;
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 118660de5bcc..4fd1ecfced4d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c

<snip>

> @@ -5345,10 +5343,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	else
>  		last_cpupid = folio_last_cpupid(folio);
>  	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
> -	if (target_nid == NUMA_NO_NODE) {
> -		folio_put(folio);
> +	if (target_nid == NUMA_NO_NODE)
> +		goto out_map;
> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
> +		flags |= TNF_MIGRATE_FAIL;
>  		goto out_map;
>  	}

These two call sites repeat the same code; maybe just merge the ifs
into numa_migrate_prep(). Feel free to ignore this if you are not
going to send another version. :)
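
Something like this purely illustrative sketch (the helper name is
made up):

	static int numa_migrate_prep_and_isolate(struct folio *folio,
			struct vm_fault *vmf, unsigned long addr, int nid,
			int *flags)
	{
		int target_nid = numa_migrate_prep(folio, vmf, addr, nid,
						   flags);

		if (target_nid == NUMA_NO_NODE)
			return NUMA_NO_NODE;
		if (migrate_misplaced_folio_prepare(folio, vmf->vma,
						    target_nid)) {
			*flags |= TNF_MIGRATE_FAIL;
			return NUMA_NO_NODE;
		}
		return target_nid;
	}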

--
Best Regards,
Yan, Zi


* Re: [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-06-20 21:29 ` [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL David Hildenbrand
                     ` (2 preceding siblings ...)
  2024-06-21 13:44   ` Zi Yan
@ 2024-06-21 17:47   ` Donet Tom
  2024-06-21 20:14     ` David Hildenbrand
  2024-06-26 16:22   ` David Hildenbrand
  4 siblings, 1 reply; 24+ messages in thread
From: Donet Tom @ 2024-06-21 17:47 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel; +Cc: linux-mm, Andrew Morton


On 6/21/24 02:59, David Hildenbrand wrote:
> Currently we always take a folio reference even if migration will not
> even be tried or isolation failed, requiring us to grab+drop an additional
> reference.
>
> Further, we end up calling folio_likely_mapped_shared() while the folio
> might have already been unmapped, because after we dropped the PTL, that
> can easily happen. We want to stop touching mapcounts and friends from
> such context, and only call folio_likely_mapped_shared() while the folio
> is still mapped: mapcount information is pretty much stale and unreliable
> otherwise.
>
> So let's move checks into numamigrate_isolate_folio(), rename that
> function to migrate_misplaced_folio_prepare(), and call that function
> from callsites where we call migrate_misplaced_folio(), but still with
> the PTL held.
>
> We can now stop taking temporary folio references, and really only take
> a reference if folio isolation succeeded. Doing the
> folio_likely_mapped_shared() + golio isolation under PT lock is now similar
> to how we handle MADV_PAGEOUT.
>
> While at it, combine the folio_is_file_lru() checks.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   include/linux/migrate.h |  7 ++++
>   mm/huge_memory.c        |  8 ++--
>   mm/memory.c             |  9 +++--
>   mm/migrate.c            | 81 +++++++++++++++++++----------------------
>   4 files changed, 55 insertions(+), 50 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index f9d92482d117..644be30b69c8 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -139,9 +139,16 @@ const struct movable_operations *page_movable_ops(struct page *page)
>   }
>   
>   #ifdef CONFIG_NUMA_BALANCING
> +int migrate_misplaced_folio_prepare(struct folio *folio,
> +		struct vm_area_struct *vma, int node);
>   int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>   			   int node);
>   #else
> +static inline int migrate_misplaced_folio_prepare(struct folio *folio,
> +		struct vm_area_struct *vma, int node)
> +{
> +	return -EAGAIN; /* can't migrate now */
> +}
>   static inline int migrate_misplaced_folio(struct folio *folio,
>   					 struct vm_area_struct *vma, int node)
>   {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fc27dabcd8e3..4b2817bb2c7d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1688,11 +1688,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>   	if (node_is_toptier(nid))
>   		last_cpupid = folio_last_cpupid(folio);
>   	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
> -	if (target_nid == NUMA_NO_NODE) {
> -		folio_put(folio);
> +	if (target_nid == NUMA_NO_NODE)
> +		goto out_map;
> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
> +		flags |= TNF_MIGRATE_FAIL;
>   		goto out_map;
>   	}
> -
> +	/* The folio is isolated and isolation code holds a folio reference. */
>   	spin_unlock(vmf->ptl);
>   	writable = false;
>   
> diff --git a/mm/memory.c b/mm/memory.c
> index 118660de5bcc..4fd1ecfced4d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5207,8 +5207,6 @@ int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
>   {
>   	struct vm_area_struct *vma = vmf->vma;
>   
> -	folio_get(folio);
> -
>   	/* Record the current PID acceesing VMA */
>   	vma_set_access_pid_bit(vma);
>   
> @@ -5345,10 +5343,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>   	else
>   		last_cpupid = folio_last_cpupid(folio);
>   	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
> -	if (target_nid == NUMA_NO_NODE) {
> -		folio_put(folio);
> +	if (target_nid == NUMA_NO_NODE)
> +		goto out_map;
> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
> +		flags |= TNF_MIGRATE_FAIL;
>   		goto out_map;
>   	}
> +	/* The folio is isolated and isolation code holds a folio reference. */
>   	pte_unmap_unlock(vmf->pte, vmf->ptl);
>   	writable = false;
>   	ignore_writable = true;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 0307b54879a0..27f070f64f27 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2530,9 +2530,37 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
>   	return __folio_alloc_node(gfp, order, nid);
>   }
>   
> -static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio)
> +/*
> + * Prepare for calling migrate_misplaced_folio() by isolating the folio if
> + * permitted. Must be called with the PTL still held.
> + */
> +int migrate_misplaced_folio_prepare(struct folio *folio,
> +		struct vm_area_struct *vma, int node)
>   {
>   	int nr_pages = folio_nr_pages(folio);
> +	pg_data_t *pgdat = NODE_DATA(node);
> +
> +	if (folio_is_file_lru(folio)) {
> +		/*
> +		 * Do not migrate file folios that are mapped in multiple
> +		 * processes with execute permissions as they are probably
> +		 * shared libraries.
> +		 *
> +		 * See folio_likely_mapped_shared() on possible imprecision
> +		 * when we cannot easily detect if a folio is shared.
> +		 */
> +		if ((vma->vm_flags & VM_EXEC) &&
> +		    folio_likely_mapped_shared(folio))
> +			return -EACCES;
> +
> +		/*
> +		 * Do not migrate dirty folios as not all filesystems can move
> +		 * dirty folios in MIGRATE_ASYNC mode which is a waste of
> +		 * cycles.
> +		 */
> +		if (folio_test_dirty(folio))
> +			return -EAGAIN;
> +	}
>   
>   	/* Avoid migrating to a node that is nearly full */
>   	if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
> @@ -2550,65 +2578,37 @@ static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio)
>   		 * further.
>   		 */
>   		if (z < 0)
> -			return 0;
> +			return -EAGAIN;
>   
>   		wakeup_kswapd(pgdat->node_zones + z, 0,
>   			      folio_order(folio), ZONE_MOVABLE);
> -		return 0;
> +		return -EAGAIN;
>   	}
>   
>   	if (!folio_isolate_lru(folio))
> -		return 0;
> +		return -EAGAIN;
>   
>   	node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio),
>   			    nr_pages);
> -
> -	/*
> -	 * Isolating the folio has taken another reference, so the
> -	 * caller's reference can be safely dropped without the folio
> -	 * disappearing underneath us during migration.
> -	 */
> -	folio_put(folio);
> -	return 1;
> +	return 0;
>   }
>   
>   /*
>    * Attempt to migrate a misplaced folio to the specified destination
> - * node. Caller is expected to have an elevated reference count on
> - * the folio that will be dropped by this function before returning.
> + * node. Caller is expected to have isolated the folio by calling
> + * migrate_misplaced_folio_prepare(), which will result in an
> + * elevated reference count on the folio. This function will un-isolate the
> + * folio, dropping the folio reference before returning.
>    */
>   int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>   			    int node)
>   {
>   	pg_data_t *pgdat = NODE_DATA(node);
> -	int isolated;
>   	int nr_remaining;
>   	unsigned int nr_succeeded;
>   	LIST_HEAD(migratepages);
>   	int nr_pages = folio_nr_pages(folio);
>   
> -	/*
> -	 * Don't migrate file folios that are mapped in multiple processes
> -	 * with execute permissions as they are probably shared libraries.
> -	 *
> -	 * See folio_likely_mapped_shared() on possible imprecision when we
> -	 * cannot easily detect if a folio is shared.
> -	 */
> -	if (folio_likely_mapped_shared(folio) && folio_is_file_lru(folio) &&
> -	    (vma->vm_flags & VM_EXEC))
> -		goto out;
> -
> -	/*
> -	 * Also do not migrate dirty folios as not all filesystems can move
> -	 * dirty folios in MIGRATE_ASYNC mode which is a waste of cycles.
> -	 */
> -	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
> -		goto out;
> -
> -	isolated = numamigrate_isolate_folio(pgdat, folio);
> -	if (!isolated)
> -		goto out;
> -
>   	list_add(&folio->lru, &migratepages);
>   	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_folio,
>   				     NULL, node, MIGRATE_ASYNC,
> @@ -2620,7 +2620,6 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>   					folio_is_file_lru(folio), -nr_pages);
>   			folio_putback_lru(folio);
>   		}
> -		isolated = 0;
>   	}
>   	if (nr_succeeded) {
>   		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
> @@ -2629,11 +2628,7 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>   					    nr_succeeded);
>   	}
>   	BUG_ON(!list_empty(&migratepages));
> -	return isolated ? 0 : -EAGAIN;
> -
> -out:
> -	folio_put(folio);
> -	return -EAGAIN;
> +	return nr_remaining ? -EAGAIN : 0;
>   }
>   #endif /* CONFIG_NUMA_BALANCING */
>   #endif /* CONFIG_NUMA */

Hi David,

I have tested these patches on my system, which has 2 DRAM nodes and
1 PMEM node. I tested page migration between the DRAM nodes and page
promotion from PMEM to DRAM. Both are working fine.

Below are the results.

Migration Results
=================
numa_pte_updates 18977
numa_huge_pte_updates 0
numa_hint_faults 18504
numa_hint_faults_local 2116
numa_pages_migrated 16388



Promotion Results
=================
nr_sec_page_table_pages 0
nr_iommu_pages 0
nr_swapcached 0
pgpromote_success 16386
pgpromote_candidate 0


Tested-by: Donet Tom <donettom@linux.ibm.com>


Thanks
Donet



* Re: [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-06-21 17:47   ` Donet Tom
@ 2024-06-21 20:14     ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2024-06-21 20:14 UTC (permalink / raw)
  To: Donet Tom, linux-kernel; +Cc: linux-mm, Andrew Morton

On 21.06.24 19:47, Donet Tom wrote:
> 
> On 6/21/24 02:59, David Hildenbrand wrote:
>> Currently we always take a folio reference even if migration will not
>> even be tried or isolation failed, requiring us to grab+drop an additional
>> reference.
>>
>> Further, we end up calling folio_likely_mapped_shared() while the folio
>> might have already been unmapped, because after we dropped the PTL, that
>> can easily happen. We want to stop touching mapcounts and friends from
>> such context, and only call folio_likely_mapped_shared() while the folio
>> is still mapped: mapcount information is pretty much stale and unreliable
>> otherwise.
>>
>> So let's move checks into numamigrate_isolate_folio(), rename that
>> function to migrate_misplaced_folio_prepare(), and call that function
>> from callsites where we call migrate_misplaced_folio(), but still with
>> the PTL held.
>>
>> We can now stop taking temporary folio references, and really only take
>> a reference if folio isolation succeeded. Doing the
>> folio_likely_mapped_shared() + golio isolation under PT lock is now similar
>> to how we handle MADV_PAGEOUT.
>>
>> While at it, combine the folio_is_file_lru() checks.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>    include/linux/migrate.h |  7 ++++
>>    mm/huge_memory.c        |  8 ++--
>>    mm/memory.c             |  9 +++--
>>    mm/migrate.c            | 81 +++++++++++++++++++----------------------
>>    4 files changed, 55 insertions(+), 50 deletions(-)
>>
>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>> index f9d92482d117..644be30b69c8 100644
>> --- a/include/linux/migrate.h
>> +++ b/include/linux/migrate.h
>> @@ -139,9 +139,16 @@ const struct movable_operations *page_movable_ops(struct page *page)
>>    }
>>    
>>    #ifdef CONFIG_NUMA_BALANCING
>> +int migrate_misplaced_folio_prepare(struct folio *folio,
>> +		struct vm_area_struct *vma, int node);
>>    int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>>    			   int node);
>>    #else
>> +static inline int migrate_misplaced_folio_prepare(struct folio *folio,
>> +		struct vm_area_struct *vma, int node)
>> +{
>> +	return -EAGAIN; /* can't migrate now */
>> +}
>>    static inline int migrate_misplaced_folio(struct folio *folio,
>>    					 struct vm_area_struct *vma, int node)
>>    {
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index fc27dabcd8e3..4b2817bb2c7d 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1688,11 +1688,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>    	if (node_is_toptier(nid))
>>    		last_cpupid = folio_last_cpupid(folio);
>>    	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>> -	if (target_nid == NUMA_NO_NODE) {
>> -		folio_put(folio);
>> +	if (target_nid == NUMA_NO_NODE)
>> +		goto out_map;
>> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>> +		flags |= TNF_MIGRATE_FAIL;
>>    		goto out_map;
>>    	}
>> -
>> +	/* The folio is isolated and isolation code holds a folio reference. */
>>    	spin_unlock(vmf->ptl);
>>    	writable = false;
>>    
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 118660de5bcc..4fd1ecfced4d 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5207,8 +5207,6 @@ int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
>>    {
>>    	struct vm_area_struct *vma = vmf->vma;
>>    
>> -	folio_get(folio);
>> -
>>    	/* Record the current PID acceesing VMA */
>>    	vma_set_access_pid_bit(vma);
>>    
>> @@ -5345,10 +5343,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>    	else
>>    		last_cpupid = folio_last_cpupid(folio);
>>    	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
>> -	if (target_nid == NUMA_NO_NODE) {
>> -		folio_put(folio);
>> +	if (target_nid == NUMA_NO_NODE)
>> +		goto out_map;
>> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>> +		flags |= TNF_MIGRATE_FAIL;
>>    		goto out_map;
>>    	}
>> +	/* The folio is isolated and isolation code holds a folio reference. */
>>    	pte_unmap_unlock(vmf->pte, vmf->ptl);
>>    	writable = false;
>>    	ignore_writable = true;
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 0307b54879a0..27f070f64f27 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -2530,9 +2530,37 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
>>    	return __folio_alloc_node(gfp, order, nid);
>>    }
>>    
>> -static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio)
>> +/*
>> + * Prepare for calling migrate_misplaced_folio() by isolating the folio if
>> + * permitted. Must be called with the PTL still held.
>> + */
>> +int migrate_misplaced_folio_prepare(struct folio *folio,
>> +		struct vm_area_struct *vma, int node)
>>    {
>>    	int nr_pages = folio_nr_pages(folio);
>> +	pg_data_t *pgdat = NODE_DATA(node);
>> +
>> +	if (folio_is_file_lru(folio)) {
>> +		/*
>> +		 * Do not migrate file folios that are mapped in multiple
>> +		 * processes with execute permissions as they are probably
>> +		 * shared libraries.
>> +		 *
>> +		 * See folio_likely_mapped_shared() on possible imprecision
>> +		 * when we cannot easily detect if a folio is shared.
>> +		 */
>> +		if ((vma->vm_flags & VM_EXEC) &&
>> +		    folio_likely_mapped_shared(folio))
>> +			return -EACCES;
>> +
>> +		/*
>> +		 * Do not migrate dirty folios as not all filesystems can move
>> +		 * dirty folios in MIGRATE_ASYNC mode which is a waste of
>> +		 * cycles.
>> +		 */
>> +		if (folio_test_dirty(folio))
>> +			return -EAGAIN;
>> +	}
>>    
>>    	/* Avoid migrating to a node that is nearly full */
>>    	if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
>> @@ -2550,65 +2578,37 @@ static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio)
>>    		 * further.
>>    		 */
>>    		if (z < 0)
>> -			return 0;
>> +			return -EAGAIN;
>>    
>>    		wakeup_kswapd(pgdat->node_zones + z, 0,
>>    			      folio_order(folio), ZONE_MOVABLE);
>> -		return 0;
>> +		return -EAGAIN;
>>    	}
>>    
>>    	if (!folio_isolate_lru(folio))
>> -		return 0;
>> +		return -EAGAIN;
>>    
>>    	node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio),
>>    			    nr_pages);
>> -
>> -	/*
>> -	 * Isolating the folio has taken another reference, so the
>> -	 * caller's reference can be safely dropped without the folio
>> -	 * disappearing underneath us during migration.
>> -	 */
>> -	folio_put(folio);
>> -	return 1;
>> +	return 0;
>>    }
>>    
>>    /*
>>     * Attempt to migrate a misplaced folio to the specified destination
>> - * node. Caller is expected to have an elevated reference count on
>> - * the folio that will be dropped by this function before returning.
>> + * node. Caller is expected to have isolated the folio by calling
>> + * migrate_misplaced_folio_prepare(), which will result in an
>> + * elevated reference count on the folio. This function will un-isolate the
>> + * folio, dereferencing the folio before returning.
>>     */
>>    int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>>    			    int node)
>>    {
>>    	pg_data_t *pgdat = NODE_DATA(node);
>> -	int isolated;
>>    	int nr_remaining;
>>    	unsigned int nr_succeeded;
>>    	LIST_HEAD(migratepages);
>>    	int nr_pages = folio_nr_pages(folio);
>>    
>> -	/*
>> -	 * Don't migrate file folios that are mapped in multiple processes
>> -	 * with execute permissions as they are probably shared libraries.
>> -	 *
>> -	 * See folio_likely_mapped_shared() on possible imprecision when we
>> -	 * cannot easily detect if a folio is shared.
>> -	 */
>> -	if (folio_likely_mapped_shared(folio) && folio_is_file_lru(folio) &&
>> -	    (vma->vm_flags & VM_EXEC))
>> -		goto out;
>> -
>> -	/*
>> -	 * Also do not migrate dirty folios as not all filesystems can move
>> -	 * dirty folios in MIGRATE_ASYNC mode which is a waste of cycles.
>> -	 */
>> -	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
>> -		goto out;
>> -
>> -	isolated = numamigrate_isolate_folio(pgdat, folio);
>> -	if (!isolated)
>> -		goto out;
>> -
>>    	list_add(&folio->lru, &migratepages);
>>    	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_folio,
>>    				     NULL, node, MIGRATE_ASYNC,
>> @@ -2620,7 +2620,6 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>>    					folio_is_file_lru(folio), -nr_pages);
>>    			folio_putback_lru(folio);
>>    		}
>> -		isolated = 0;
>>    	}
>>    	if (nr_succeeded) {
>>    		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
>> @@ -2629,11 +2628,7 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>>    					    nr_succeeded);
>>    	}
>>    	BUG_ON(!list_empty(&migratepages));
>> -	return isolated ? 0 : -EAGAIN;
>> -
>> -out:
>> -	folio_put(folio);
>> -	return -EAGAIN;
>> +	return nr_remaining ? -EAGAIN : 0;
>>    }
>>    #endif /* CONFIG_NUMA_BALANCING */
>>    #endif /* CONFIG_NUMA */
> 
> Hi David,
> 
> I have tested these patches on my system. My system has 2 DRAM nodes and
> 1 PMEM node.
> I tested page migration between DRAM nodes and page promotion from PMEM
> to DRAM.
> Both are working fine.
> 
> below are the results.
> 
> Migration results
> =============
> numa_pte_updates 18977
> numa_huge_pte_updates 0
> numa_hint_faults 18504
> numa_hint_faults_local 2116
> numa_pages_migrated 16388
> 
> 
> 
> Promotion Results
> ===============
> nr_sec_page_table_pages 0
> nr_iommu_pages 0
> nr_swapcached 0
> pgpromote_success 16386
> pgpromote_candidate 0
> 
> 
> Tested-by: Donet Tom <donettom@linux.ibm.com>

Oh, what a pleasant surprise, thanks a bunch for the fast testing!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-06-21 13:44   ` Zi Yan
@ 2024-06-21 20:18     ` David Hildenbrand
  2024-06-21 20:48       ` Zi Yan
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2024-06-21 20:18 UTC (permalink / raw)
  To: Zi Yan; +Cc: linux-kernel, linux-mm, Andrew Morton

On 21.06.24 15:44, Zi Yan wrote:
> On 20 Jun 2024, at 17:29, David Hildenbrand wrote:
> 
>> Currently we always take a folio reference even if migration will not
>> even be tried or isolation failed, requiring us to grab+drop an additional
>> reference.
>>
>> Further, we end up calling folio_likely_mapped_shared() while the folio
>> might have already been unmapped, because after we dropped the PTL, that
>> can easily happen. We want to stop touching mapcounts and friends from
>> such context, and only call folio_likely_mapped_shared() while the folio
>> is still mapped: mapcount information is pretty much stale and unreliable
>> otherwise.
>>
>> So let's move checks into numamigrate_isolate_folio(), rename that
>> function to migrate_misplaced_folio_prepare(), and call that function
>> from callsites where we call migrate_misplaced_folio(), but still with
>> the PTL held.
>>
>> We can now stop taking temporary folio references, and really only take
>> a reference if folio isolation succeeded. Doing the
>> folio_likely_mapped_shared() + folio isolation under PT lock is now similar
>> to how we handle MADV_PAGEOUT.
>>
>> While at it, combine the folio_is_file_lru() checks.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   include/linux/migrate.h |  7 ++++
>>   mm/huge_memory.c        |  8 ++--
>>   mm/memory.c             |  9 +++--
>>   mm/migrate.c            | 81 +++++++++++++++++++----------------------
>>   4 files changed, 55 insertions(+), 50 deletions(-)
> 
> LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>
> 
> One nit below:
> 
> <snip>
> 
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index fc27dabcd8e3..4b2817bb2c7d 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1688,11 +1688,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>   	if (node_is_toptier(nid))
>>   		last_cpupid = folio_last_cpupid(folio);
>>   	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>> -	if (target_nid == NUMA_NO_NODE) {
>> -		folio_put(folio);
>> +	if (target_nid == NUMA_NO_NODE)
>> +		goto out_map;
>> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>> +		flags |= TNF_MIGRATE_FAIL;
>>   		goto out_map;
>>   	}
>> -
>> +	/* The folio is isolated and isolation code holds a folio reference. */
>>   	spin_unlock(vmf->ptl);
>>   	writable = false;
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 118660de5bcc..4fd1ecfced4d 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
> 
> <snip>
> 
>> @@ -5345,10 +5343,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>   	else
>>   		last_cpupid = folio_last_cpupid(folio);
>>   	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
>> -	if (target_nid == NUMA_NO_NODE) {
>> -		folio_put(folio);
>> +	if (target_nid == NUMA_NO_NODE)
>> +		goto out_map;
>> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>> +		flags |= TNF_MIGRATE_FAIL;
>>   		goto out_map;
>>   	}
> 
> These two locations are repeated code, maybe just merge the ifs into
> numa_migrate_prep(). Feel free to ignore if you are not going to send
> another version. :)

I went back and forth a couple of times and

a) Didn't want to move numa_migrate_prep() into
    migrate_misplaced_folio_prepare(), because having that code in
    mm/migrate.c felt a bit odd.

b) Didn't want to move migrate_misplaced_folio_prepare() because I enjoy
    seeing the migrate_misplaced_folio_prepare() and
    migrate_misplaced_folio() calls in the same caller context.

I also considered renaming numa_migrate_prep(), but wasn't really able 
to come up with a good name.

But maybe a) is not too bad?

We'd have migrate_misplaced_folio_prepare() consume &flags and 
&target_nid, and perform the "flags |= TNF_MIGRATE_FAIL;" internally.
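
Roughly like this (hypothetical signature, untested, just to illustrate
the idea):

	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
	if (migrate_misplaced_folio_prepare(folio, vma, &target_nid, &flags))
		goto out_map;

The helper would return an error for the NUMA_NO_NODE case and set
TNF_MIGRATE_FAIL itself when folio isolation fails.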

What would be your take?

> 
> --
> Best Regards,
> Yan, Zi

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-06-21 20:18     ` David Hildenbrand
@ 2024-06-21 20:48       ` Zi Yan
  2024-06-26 16:49         ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Zi Yan @ 2024-06-21 20:48 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: linux-kernel, linux-mm, Andrew Morton

On 21 Jun 2024, at 16:18, David Hildenbrand wrote:

> On 21.06.24 15:44, Zi Yan wrote:
>> On 20 Jun 2024, at 17:29, David Hildenbrand wrote:
>>
>>> Currently we always take a folio reference even if migration will not
>>> even be tried or isolation failed, requiring us to grab+drop an additional
>>> reference.
>>>
>>> Further, we end up calling folio_likely_mapped_shared() while the folio
>>> might have already been unmapped, because after we dropped the PTL, that
>>> can easily happen. We want to stop touching mapcounts and friends from
>>> such context, and only call folio_likely_mapped_shared() while the folio
>>> is still mapped: mapcount information is pretty much stale and unreliable
>>> otherwise.
>>>
>>> So let's move checks into numamigrate_isolate_folio(), rename that
>>> function to migrate_misplaced_folio_prepare(), and call that function
>>> from callsites where we call migrate_misplaced_folio(), but still with
>>> the PTL held.
>>>
>>> We can now stop taking temporary folio references, and really only take
>>> a reference if folio isolation succeeded. Doing the
>>> folio_likely_mapped_shared() + folio isolation under PT lock is now similar
>>> to how we handle MADV_PAGEOUT.
>>>
>>> While at it, combine the folio_is_file_lru() checks.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   include/linux/migrate.h |  7 ++++
>>>   mm/huge_memory.c        |  8 ++--
>>>   mm/memory.c             |  9 +++--
>>>   mm/migrate.c            | 81 +++++++++++++++++++----------------------
>>>   4 files changed, 55 insertions(+), 50 deletions(-)
>>
>> LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>
>>
>> One nit below:
>>
>> <snip>
>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index fc27dabcd8e3..4b2817bb2c7d 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1688,11 +1688,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>   	if (node_is_toptier(nid))
>>>   		last_cpupid = folio_last_cpupid(folio);
>>>   	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>>> -	if (target_nid == NUMA_NO_NODE) {
>>> -		folio_put(folio);
>>> +	if (target_nid == NUMA_NO_NODE)
>>> +		goto out_map;
>>> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>>> +		flags |= TNF_MIGRATE_FAIL;
>>>   		goto out_map;
>>>   	}
>>> -
>>> +	/* The folio is isolated and isolation code holds a folio reference. */
>>>   	spin_unlock(vmf->ptl);
>>>   	writable = false;
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 118660de5bcc..4fd1ecfced4d 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>
>> <snip>
>>
>>> @@ -5345,10 +5343,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>>   	else
>>>   		last_cpupid = folio_last_cpupid(folio);
>>>   	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
>>> -	if (target_nid == NUMA_NO_NODE) {
>>> -		folio_put(folio);
>>> +	if (target_nid == NUMA_NO_NODE)
>>> +		goto out_map;
>>> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>>> +		flags |= TNF_MIGRATE_FAIL;
>>>   		goto out_map;
>>>   	}
>>
>> These two locations are repeated code, maybe just merge the ifs into
>> numa_migrate_prep(). Feel free to ignore if you are not going to send
>> another version. :)
>
> I went back and forth a couple of times and
>
> a) Didn't want to move numa_migrate_prep() into
>    migrate_misplaced_folio_prepare(), because having that code in
>    mm/migrate.c felt a bit odd.

I agree after checking the actual code, since the code is just
updating NUMA fault stats and checking where the folio should be.

>
> b) Didn't want to move migrate_misplaced_folio_prepare() because I enjoy
>    seeing the migrate_misplaced_folio_prepare() and
>    migrate_misplaced_folio() calls in the same callercontext.
>
> I also considered renaming numa_migrate_prep(), but wasn't really able to come up with a good name.

How about numa_migrate_check()? Since it tells whether a folio should be
migrated or not.

>
> But maybe a) is not too bad?
>
> We'd have migrate_misplaced_folio_prepare() consume &flags and &target_nid, and perform the "flags |= TNF_MIGRATE_FAIL;" internally.
>
> What would be your take?

I would either rename numa_migrate_prep() or just do nothing. I have to admit
that the "prep" and "prepare" in both function names motivated me to propose
the merge, but now the actual code tells me they should be separate.

--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-06-20 21:29 ` [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL David Hildenbrand
                     ` (3 preceding siblings ...)
  2024-06-21 17:47   ` Donet Tom
@ 2024-06-26 16:22   ` David Hildenbrand
  2024-06-27  6:00     ` Donet Tom
  4 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2024-06-26 16:22 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Andrew Morton, Donet Tom

On 20.06.24 23:29, David Hildenbrand wrote:
> Currently we always take a folio reference even if migration will not
> even be tried or isolation failed, requiring us to grab+drop an additional
> reference.
> 
> Further, we end up calling folio_likely_mapped_shared() while the folio
> might have already been unmapped, because after we dropped the PTL, that
> can easily happen. We want to stop touching mapcounts and friends from
> such context, and only call folio_likely_mapped_shared() while the folio
> is still mapped: mapcount information is pretty much stale and unreliable
> otherwise.
> 
> So let's move checks into numamigrate_isolate_folio(), rename that
> function to migrate_misplaced_folio_prepare(), and call that function
> from callsites where we call migrate_misplaced_folio(), but still with
> the PTL held.
> 
> We can now stop taking temporary folio references, and really only take
> a reference if folio isolation succeeded. Doing the
> folio_likely_mapped_shared() + golio isolation under PT lock is now similar
> to how we handle MADV_PAGEOUT.
> 
> While at it, combine the folio_is_file_lru() checks.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---

Donet just reported an issue. I suspect this fixes it -- in any case, this is
the right thing to do.

 From 0833b9896e98c8d88c521609c811a220d14a2e7e Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Wed, 26 Jun 2024 18:14:44 +0200
Subject: [PATCH] Fixup: mm/migrate: move NUMA hinting fault folio isolation +
  checks under PTL

Donet reports an issue during NUMA migration we haven't seen previously:

	[   71.422804] list_del corruption, c00c00000061b3c8->next is
	LIST_POISON1 (5deadbeef0000100)
	[   71.422839] ------------[ cut here ]------------
	[   71.422843] kernel BUG at lib/list_debug.c:56!
	[   71.422850] Oops: Exception in kernel mode, sig: 5 [#1]

We forgot to convert one "return 0;" to return an error instead from
migrate_misplaced_folio_prepare() in case the target node is nearly
full. With the new calling convention, 0 means "isolated successfully",
so the caller went on to migrate a folio that was never actually
isolated from the LRU, resulting in the list corruption above.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  mm/migrate.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 8beedbb42a93..9ed43c1eea5e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2564,7 +2564,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
  		int z;
  
  		if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING))
-			return 0;
+			return -EAGAIN;
  		for (z = pgdat->nr_zones - 1; z >= 0; z--) {
  			if (managed_zone(pgdat->node_zones + z))
  				break;

base-commit: 4b17ce353e02b47b00e2fe87b862f09e8b9a47e6
-- 
2.45.2


-- 
Cheers,

David / dhildenb



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-06-21 20:48       ` Zi Yan
@ 2024-06-26 16:49         ` David Hildenbrand
  2024-06-26 17:37           ` Zi Yan
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2024-06-26 16:49 UTC (permalink / raw)
  To: Zi Yan; +Cc: linux-kernel, linux-mm, Andrew Morton

On 21.06.24 22:48, Zi Yan wrote:
> On 21 Jun 2024, at 16:18, David Hildenbrand wrote:
> 
>> On 21.06.24 15:44, Zi Yan wrote:
>>> On 20 Jun 2024, at 17:29, David Hildenbrand wrote:
>>>
>>>> Currently we always take a folio reference even if migration will not
>>>> even be tried or isolation failed, requiring us to grab+drop an additional
>>>> reference.
>>>>
>>>> Further, we end up calling folio_likely_mapped_shared() while the folio
>>>> might have already been unmapped, because after we dropped the PTL, that
>>>> can easily happen. We want to stop touching mapcounts and friends from
>>>> such context, and only call folio_likely_mapped_shared() while the folio
>>>> is still mapped: mapcount information is pretty much stale and unreliable
>>>> otherwise.
>>>>
>>>> So let's move checks into numamigrate_isolate_folio(), rename that
>>>> function to migrate_misplaced_folio_prepare(), and call that function
>>>> from callsites where we call migrate_misplaced_folio(), but still with
>>>> the PTL held.
>>>>
>>>> We can now stop taking temporary folio references, and really only take
>>>> a reference if folio isolation succeeded. Doing the
>>>> folio_likely_mapped_shared() + folio isolation under PT lock is now similar
>>>> to how we handle MADV_PAGEOUT.
>>>>
>>>> While at it, combine the folio_is_file_lru() checks.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    include/linux/migrate.h |  7 ++++
>>>>    mm/huge_memory.c        |  8 ++--
>>>>    mm/memory.c             |  9 +++--
>>>>    mm/migrate.c            | 81 +++++++++++++++++++----------------------
>>>>    4 files changed, 55 insertions(+), 50 deletions(-)
>>>
>>> LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>
>>>
>>> One nit below:
>>>
>>> <snip>
>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index fc27dabcd8e3..4b2817bb2c7d 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -1688,11 +1688,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>    	if (node_is_toptier(nid))
>>>>    		last_cpupid = folio_last_cpupid(folio);
>>>>    	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>>>> -	if (target_nid == NUMA_NO_NODE) {
>>>> -		folio_put(folio);
>>>> +	if (target_nid == NUMA_NO_NODE)
>>>> +		goto out_map;
>>>> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>>>> +		flags |= TNF_MIGRATE_FAIL;
>>>>    		goto out_map;
>>>>    	}
>>>> -
>>>> +	/* The folio is isolated and isolation code holds a folio reference. */
>>>>    	spin_unlock(vmf->ptl);
>>>>    	writable = false;
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 118660de5bcc..4fd1ecfced4d 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>
>>> <snip>
>>>
>>>> @@ -5345,10 +5343,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>>>    	else
>>>>    		last_cpupid = folio_last_cpupid(folio);
>>>>    	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
>>>> -	if (target_nid == NUMA_NO_NODE) {
>>>> -		folio_put(folio);
>>>> +	if (target_nid == NUMA_NO_NODE)
>>>> +		goto out_map;
>>>> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>>>> +		flags |= TNF_MIGRATE_FAIL;
>>>>    		goto out_map;
>>>>    	}
>>>
>>> These two locations are repeated code, maybe just merge the ifs into
>>> numa_migrate_prep(). Feel free to ignore if you are not going to send
>>> another version. :)
>>
>> I went back and forth a couple of times and
>>
>> a) Didn't want to move numa_migrate_prep() into
>>     migrate_misplaced_folio_prepare(), because having that code in
>>     mm/migrate.c felt a bit odd.
> 
> I agree after checking the actual code, since the code is just
> updating NUMA fault stats and checking where the folio should be.
> 
>>
>> b) Didn't want to move migrate_misplaced_folio_prepare() because I enjoy
>>     seeing the migrate_misplaced_folio_prepare() and
>>     migrate_misplaced_folio() calls in the same caller context.
>>
>> I also considered renaming numa_migrate_prep(), but wasn't really able to come up with a good name.
> 
> How about numa_migrate_check()? Since it tells whether a folio should be
> migrated or not.
> 
>>
>> But maybe a) is not too bad?
>>
>> We'd have migrate_misplaced_folio_prepare() consume &flags and &target_nid, and perform the "flags |= TNF_MIGRATE_FAIL;" internally.
>>
>> What would be your take?
> 
> I would either rename numa_migrate_prep() or just do nothing. I have to admit
> that the "prep" and "prepare" in both function names motivated me to propose
> the merge, but now the actual code tells me they should be separate.

Let's leave it like that for now. Renaming to numa_migrate_check() makes 
sense, and likely moving more numa handling stuff in there.

But I have yet to figure out why some of the memory.c vs. huge_memory.c
code differences exist, so we can unify them.

For example, why did 33024536bafd9 introduce slightly different
last_cpupid handling in do_huge_pmd_numa_page()? It looks like there is
some subtle difference in how NUMA_BALANCING_MEMORY_TIERING is handled.
Maybe I am missing something obvious. :)
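
Concretely, as far as I can tell from the hunks above, the difference
boils down to (just a sketch, not the exact code):

	/* do_huge_pmd_numa_page(): */
	if (node_is_toptier(nid))
		last_cpupid = folio_last_cpupid(folio);

	/* do_numa_page() effectively: */
	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
	    node_is_toptier(nid))
		last_cpupid = folio_last_cpupid(folio);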

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-06-26 16:49         ` David Hildenbrand
@ 2024-06-26 17:37           ` Zi Yan
  2024-07-01  8:32             ` Huang, Ying
  0 siblings, 1 reply; 24+ messages in thread
From: Zi Yan @ 2024-06-26 17:37 UTC (permalink / raw)
  To: David Hildenbrand, Huang Ying; +Cc: linux-kernel, linux-mm, Andrew Morton

On Wed Jun 26, 2024 at 12:49 PM EDT, David Hildenbrand wrote:
> On 21.06.24 22:48, Zi Yan wrote:
> > On 21 Jun 2024, at 16:18, David Hildenbrand wrote:
> > 
> >> On 21.06.24 15:44, Zi Yan wrote:
> >>> On 20 Jun 2024, at 17:29, David Hildenbrand wrote:
> >>>
> >>>> Currently we always take a folio reference even if migration will not
> >>>> even be tried or isolation failed, requiring us to grab+drop an additional
> >>>> reference.
> >>>>
> >>>> Further, we end up calling folio_likely_mapped_shared() while the folio
> >>>> might have already been unmapped, because after we dropped the PTL, that
> >>>> can easily happen. We want to stop touching mapcounts and friends from
> >>>> such context, and only call folio_likely_mapped_shared() while the folio
> >>>> is still mapped: mapcount information is pretty much stale and unreliable
> >>>> otherwise.
> >>>>
> >>>> So let's move checks into numamigrate_isolate_folio(), rename that
> >>>> function to migrate_misplaced_folio_prepare(), and call that function
> >>>> from callsites where we call migrate_misplaced_folio(), but still with
> >>>> the PTL held.
> >>>>
> >>>> We can now stop taking temporary folio references, and really only take
> >>>> a reference if folio isolation succeeded. Doing the
> >>>> folio_likely_mapped_shared() + folio isolation under PT lock is now similar
> >>>> to how we handle MADV_PAGEOUT.
> >>>>
> >>>> While at it, combine the folio_is_file_lru() checks.
> >>>>
> >>>> Signed-off-by: David Hildenbrand <david@redhat.com>
> >>>> ---
> >>>>    include/linux/migrate.h |  7 ++++
> >>>>    mm/huge_memory.c        |  8 ++--
> >>>>    mm/memory.c             |  9 +++--
> >>>>    mm/migrate.c            | 81 +++++++++++++++++++----------------------
> >>>>    4 files changed, 55 insertions(+), 50 deletions(-)
> >>>
> >>> LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>
> >>>
> >>> One nit below:
> >>>
> >>> <snip>
> >>>
> >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>>> index fc27dabcd8e3..4b2817bb2c7d 100644
> >>>> --- a/mm/huge_memory.c
> >>>> +++ b/mm/huge_memory.c
> >>>> @@ -1688,11 +1688,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> >>>>    	if (node_is_toptier(nid))
> >>>>    		last_cpupid = folio_last_cpupid(folio);
> >>>>    	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
> >>>> -	if (target_nid == NUMA_NO_NODE) {
> >>>> -		folio_put(folio);
> >>>> +	if (target_nid == NUMA_NO_NODE)
> >>>> +		goto out_map;
> >>>> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
> >>>> +		flags |= TNF_MIGRATE_FAIL;
> >>>>    		goto out_map;
> >>>>    	}
> >>>> -
> >>>> +	/* The folio is isolated and isolation code holds a folio reference. */
> >>>>    	spin_unlock(vmf->ptl);
> >>>>    	writable = false;
> >>>>
> >>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>> index 118660de5bcc..4fd1ecfced4d 100644
> >>>> --- a/mm/memory.c
> >>>> +++ b/mm/memory.c
> >>>
> >>> <snip>
> >>>
> >>>> @@ -5345,10 +5343,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> >>>>    	else
> >>>>    		last_cpupid = folio_last_cpupid(folio);
> >>>>    	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
> >>>> -	if (target_nid == NUMA_NO_NODE) {
> >>>> -		folio_put(folio);
> >>>> +	if (target_nid == NUMA_NO_NODE)
> >>>> +		goto out_map;
> >>>> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
> >>>> +		flags |= TNF_MIGRATE_FAIL;
> >>>>    		goto out_map;
> >>>>    	}
> >>>
> >>> These two locations are repeated code, maybe just merge the ifs into
> >>> numa_migrate_prep(). Feel free to ignore if you are not going to send
> >>> another version. :)
> >>
> >> I went back and forth a couple of times and
> >>
> >> a) Didn't want to move numa_migrate_prep() into
> >>     migrate_misplaced_folio_prepare(), because having that code in
> >>     mm/migrate.c felt a bit odd.
> > 
> > I agree after checking the actual code, since the code is just
> > updating NUMA fault stats and checking where the folio should be.
> > 
> >>
> >> b) Didn't want to move migrate_misplaced_folio_prepare() because I enjoy
> >>     seeing the migrate_misplaced_folio_prepare() and
> >>     migrate_misplaced_folio() calls in the same caller context.
> >>
> >> I also considered renaming numa_migrate_prep(), but wasn't really able to come up with a good name.
> > 
> > How about numa_migrate_check()? Since it tells whether a folio should be
> > migrated or not.
> > 
> >>
> >> But maybe a) is not too bad?
> >>
> >> We'd have migrate_misplaced_folio_prepare() consume &flags and &target_nid, and perform the "flags |= TNF_MIGRATE_FAIL;" internally.
> >>
> >> What would be your take?
> > 
> > I would either rename numa_migrate_prep() or just do nothing. I have to admit
> > that the "prep" and "prepare" in both function names motivated me to propose
> > the merge, but now the actual code tells me they should be separate.
>
> Let's leave it like that for now. Renaming to numa_migrate_check() makes 
> sense, and likely moving more numa handling stuff in there.
>
> But I have yet to figure out why some of the memory.c vs. huge_memory.c
> code differences exist, so we can unify them.
>
> For example, why did 33024536bafd9 introduce slightly different
> last_cpupid handling in do_huge_pmd_numa_page()? It looks like there is
> some subtle difference in how NUMA_BALANCING_MEMORY_TIERING is handled.
> Maybe I am missing something obvious. :)

It seems to me that a sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING
check is missing in do_huge_pmd_numa_page(). So the

if (node_is_toptier(nid))

should be

if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
node_is_toptier(nid))

to be consistent with other checks. Add Ying to confirm.

I also think a function like

bool folio_has_cpupid(struct folio *folio)
{
    return !(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
           node_is_toptier(folio_nid(folio));
}

would be better than the existing checks.
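
Call sites could then simply do, e.g.:

	if (folio_has_cpupid(folio))
		last_cpupid = folio_last_cpupid(folio);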

-- 
Best Regards,
Yan, Zi



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-06-26 16:22   ` David Hildenbrand
@ 2024-06-27  6:00     ` Donet Tom
  0 siblings, 0 replies; 24+ messages in thread
From: Donet Tom @ 2024-06-27  6:00 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel; +Cc: linux-mm, Andrew Morton


On 6/26/24 21:52, David Hildenbrand wrote:
> On 20.06.24 23:29, David Hildenbrand wrote:
>> Currently we always take a folio reference even if migration will not
>> even be tried or isolation failed, requiring us to grab+drop an 
>> additional
>> reference.
>>
>> Further, we end up calling folio_likely_mapped_shared() while the folio
>> might have already been unmapped, because after we dropped the PTL, that
>> can easily happen. We want to stop touching mapcounts and friends from
>> such context, and only call folio_likely_mapped_shared() while the folio
>> is still mapped: mapcount information is pretty much stale and 
>> unreliable
>> otherwise.
>>
>> So let's move checks into numamigrate_isolate_folio(), rename that
>> function to migrate_misplaced_folio_prepare(), and call that function
>> from callsites where we call migrate_misplaced_folio(), but still with
>> the PTL held.
>>
>> We can now stop taking temporary folio references, and really only take
>> a reference if folio isolation succeeded. Doing the
>> folio_likely_mapped_shared() + folio isolation under PT lock is now
>> similar
>> to how we handle MADV_PAGEOUT.
>>
>> While at it, combine the folio_is_file_lru() checks.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>
> Donet just reported an issue. I suspect this fixes it -- in any case, 
> this is
> the right thing to do.
>
> From 0833b9896e98c8d88c521609c811a220d14a2e7e Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Wed, 26 Jun 2024 18:14:44 +0200
> Subject: [PATCH] Fixup: mm/migrate: move NUMA hinting fault folio 
> isolation +
>  checks under PTL
>
> Donet reports an issue during NUMA migration we haven't seen previously:
>
>     [   71.422804] list_del corruption, c00c00000061b3c8->next is
>     LIST_POISON1 (5deadbeef0000100)
>     [   71.422839] ------------[ cut here ]------------
>     [   71.422843] kernel BUG at lib/list_debug.c:56!
>     [   71.422850] Oops: Exception in kernel mode, sig: 5 [#1]
>
> We forgot to convert one "return 0;" to return an error instead from
> migrate_misplaced_folio_prepare() in case the target node is nearly
> full.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/migrate.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8beedbb42a93..9ed43c1eea5e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2564,7 +2564,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
>          int z;
>
>          if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING))
> -            return 0;
> +            return -EAGAIN;
>          for (z = pgdat->nr_zones - 1; z >= 0; z--) {
>              if (managed_zone(pgdat->node_zones + z))
>                  break;
Hi David

I tested with this patch. The issue is resolved; I am no longer seeing
the kernel panic.

I also tested page migration. It is working fine.

numa_pte_updates 1262330
numa_huge_pte_updates 0
numa_hint_faults 925797
numa_hint_faults_local 3780
numa_pages_migrated 327930
pgmigrate_success 822530


Thanks
Donet


>
> base-commit: 4b17ce353e02b47b00e2fe87b862f09e8b9a47e6


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 1/2] mm/migrate: make migrate_misplaced_folio() return 0 on success
  2024-06-20 21:29 ` [PATCH v1 1/2] mm/migrate: make migrate_misplaced_folio() return 0 on success David Hildenbrand
  2024-06-21  1:40   ` Zi Yan
  2024-06-21  3:39   ` Baolin Wang
@ 2024-07-01  7:36   ` Huang, Ying
  2024-07-01  7:44     ` David Hildenbrand
  2 siblings, 1 reply; 24+ messages in thread
From: Huang, Ying @ 2024-07-01  7:36 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: linux-kernel, linux-mm, Andrew Morton

David Hildenbrand <david@redhat.com> writes:

> Let's just return 0 on success, which is less confusing.
>
> ... especially because we got it wrong in the migrate.h stub where we
> have "return -EAGAIN; /* can't migrate now */" instead of "return 0;".
> Likely this wrong return value doesn't currently matter, but it
> certainly adds confusion.
>
> We'll add migrate_misplaced_folio_prepare() next, where we want to use
> the same "return 0 on success" approach, so let's just clean this up.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/huge_memory.c | 5 ++---
>  mm/memory.c      | 2 +-
>  mm/migrate.c     | 4 ++--
>  3 files changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 0fffaa58a47a..fc27dabcd8e3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1652,7 +1652,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>  	int nid = NUMA_NO_NODE;
>  	int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
> -	bool migrated = false, writable = false;
> +	bool writable = false;
>  	int flags = 0;
>  
>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> @@ -1696,8 +1696,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  	spin_unlock(vmf->ptl);
>  	writable = false;
>  
> -	migrated = migrate_misplaced_folio(folio, vma, target_nid);
> -	if (migrated) {
> +	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
>  		flags |= TNF_MIGRATED;
>  		nid = target_nid;
>  	} else {
> diff --git a/mm/memory.c b/mm/memory.c
> index 00728ea95583..118660de5bcc 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5354,7 +5354,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	ignore_writable = true;
>  
>  	/* Migrate to the requested node */
> -	if (migrate_misplaced_folio(folio, vma, target_nid)) {
> +	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
>  		nid = target_nid;
>  		flags |= TNF_MIGRATED;
>  	} else {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 781979567f64..0307b54879a0 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2629,11 +2629,11 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>  					    nr_succeeded);
>  	}
>  	BUG_ON(!list_empty(&migratepages));
> -	return isolated;
> +	return isolated ? 0 : -EAGAIN;

Is it good to use -EAGAIN as error code always?  At least if
nr_remaining < 0, we can use that?
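
For example (only a sketch, to show what I mean):

	if (nr_remaining < 0)
		return nr_remaining;
	return nr_remaining ? -EAGAIN : 0;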

Otherwise, this LGTM.  Thanks!

>  
>  out:
>  	folio_put(folio);
> -	return 0;
> +	return -EAGAIN;
>  }
>  #endif /* CONFIG_NUMA_BALANCING */
>  #endif /* CONFIG_NUMA */

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 1/2] mm/migrate: make migrate_misplaced_folio() return 0 on success
  2024-07-01  7:36   ` Huang, Ying
@ 2024-07-01  7:44     ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2024-07-01  7:44 UTC (permalink / raw)
  To: Huang, Ying; +Cc: linux-kernel, linux-mm, Andrew Morton

On 01.07.24 09:36, Huang, Ying wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> Let's just return 0 on success, which is less confusing.
>>
>> ... especially because we got it wrong in the migrate.h stub where we
>> have "return -EAGAIN; /* can't migrate now */" instead of "return 0;".
>> Likely this wrong return value doesn't currently matter, but it
>> certainly adds confusion.
>>
>> We'll add migrate_misplaced_folio_prepare() next, where we want to use
>> the same "return 0 on success" approach, so let's just clean this up.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/huge_memory.c | 5 ++---
>>   mm/memory.c      | 2 +-
>>   mm/migrate.c     | 4 ++--
>>   3 files changed, 5 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 0fffaa58a47a..fc27dabcd8e3 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1652,7 +1652,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>   	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>>   	int nid = NUMA_NO_NODE;
>>   	int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
>> -	bool migrated = false, writable = false;
>> +	bool writable = false;
>>   	int flags = 0;
>>   
>>   	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>> @@ -1696,8 +1696,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>   	spin_unlock(vmf->ptl);
>>   	writable = false;
>>   
>> -	migrated = migrate_misplaced_folio(folio, vma, target_nid);
>> -	if (migrated) {
>> +	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
>>   		flags |= TNF_MIGRATED;
>>   		nid = target_nid;
>>   	} else {
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 00728ea95583..118660de5bcc 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5354,7 +5354,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>   	ignore_writable = true;
>>   
>>   	/* Migrate to the requested node */
>> -	if (migrate_misplaced_folio(folio, vma, target_nid)) {
>> +	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
>>   		nid = target_nid;
>>   		flags |= TNF_MIGRATED;
>>   	} else {
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 781979567f64..0307b54879a0 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -2629,11 +2629,11 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>>   					    nr_succeeded);
>>   	}
>>   	BUG_ON(!list_empty(&migratepages));
>> -	return isolated;
>> +	return isolated ? 0 : -EAGAIN;
> 
> Is it good to use -EAGAIN as error code always?  At least if
> nr_remaining < 0, we can use that?

Thanks for the review. I kept it simple for now, because the 
migrate_misplaced_folio_prepare() stub has:

	return -EAGAIN; /* can't migrate now */

and it doesn't really make a difference right now which exact return 
value we use.

(maybe some cases, like the stub, should return -EINVAL or -EOPNOTSUPP 
once we start caring about the exact value)
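
For illustration only, the stub could then become something like (a
sketch, not part of this series):

	static inline int migrate_misplaced_folio_prepare(struct folio *folio,
			struct vm_area_struct *vma, int node)
	{
		/* NUMA balancing is compiled out; migration can never work. */
		return -EOPNOTSUPP;
	}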

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-06-26 17:37           ` Zi Yan
@ 2024-07-01  8:32             ` Huang, Ying
  2024-07-01 13:50               ` Zi Yan
  0 siblings, 1 reply; 24+ messages in thread
From: Huang, Ying @ 2024-07-01  8:32 UTC (permalink / raw)
  To: Zi Yan; +Cc: David Hildenbrand, linux-kernel, linux-mm, Andrew Morton

"Zi Yan" <ziy@nvidia.com> writes:

> On Wed Jun 26, 2024 at 12:49 PM EDT, David Hildenbrand wrote:
>> On 21.06.24 22:48, Zi Yan wrote:
>> > On 21 Jun 2024, at 16:18, David Hildenbrand wrote:
>> > 
>> >> On 21.06.24 15:44, Zi Yan wrote:
>> >>> On 20 Jun 2024, at 17:29, David Hildenbrand wrote:
>> >>>
>> >>>> Currently we always take a folio reference even if migration will not
>> >>>> even be tried or isolation failed, requiring us to grab+drop an additional
>> >>>> reference.
>> >>>>
>> >>>> Further, we end up calling folio_likely_mapped_shared() while the folio
>> >>>> might have already been unmapped, because after we dropped the PTL, that
>> >>>> can easily happen. We want to stop touching mapcounts and friends from
>> >>>> such context, and only call folio_likely_mapped_shared() while the folio
>> >>>> is still mapped: mapcount information is pretty much stale and unreliable
>> >>>> otherwise.
>> >>>>
>> >>>> So let's move checks into numamigrate_isolate_folio(), rename that
>> >>>> function to migrate_misplaced_folio_prepare(), and call that function
>> >>>> from callsites where we call migrate_misplaced_folio(), but still with
>> >>>> the PTL held.
>> >>>>
>> >>>> We can now stop taking temporary folio references, and really only take
>> >>>> a reference if folio isolation succeeded. Doing the
>> >>>> folio_likely_mapped_shared() + folio isolation under PT lock is now similar
>> >>>> to how we handle MADV_PAGEOUT.
>> >>>>
>> >>>> While at it, combine the folio_is_file_lru() checks.
>> >>>>
>> >>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> >>>> ---
>> >>>>    include/linux/migrate.h |  7 ++++
>> >>>>    mm/huge_memory.c        |  8 ++--
>> >>>>    mm/memory.c             |  9 +++--
>> >>>>    mm/migrate.c            | 81 +++++++++++++++++++----------------------
>> >>>>    4 files changed, 55 insertions(+), 50 deletions(-)
>> >>>
>> >>> LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>
>> >>>
>> >>> One nit below:
>> >>>
>> >>> <snip>
>> >>>
>> >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> >>>> index fc27dabcd8e3..4b2817bb2c7d 100644
>> >>>> --- a/mm/huge_memory.c
>> >>>> +++ b/mm/huge_memory.c
>> >>>> @@ -1688,11 +1688,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>> >>>>    	if (node_is_toptier(nid))
>> >>>>    		last_cpupid = folio_last_cpupid(folio);
>> >>>>    	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>> >>>> -	if (target_nid == NUMA_NO_NODE) {
>> >>>> -		folio_put(folio);
>> >>>> +	if (target_nid == NUMA_NO_NODE)
>> >>>> +		goto out_map;
>> >>>> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>> >>>> +		flags |= TNF_MIGRATE_FAIL;
>> >>>>    		goto out_map;
>> >>>>    	}
>> >>>> -
>> >>>> +	/* The folio is isolated and isolation code holds a folio reference. */
>> >>>>    	spin_unlock(vmf->ptl);
>> >>>>    	writable = false;
>> >>>>
>> >>>> diff --git a/mm/memory.c b/mm/memory.c
>> >>>> index 118660de5bcc..4fd1ecfced4d 100644
>> >>>> --- a/mm/memory.c
>> >>>> +++ b/mm/memory.c
>> >>>
>> >>> <snip>
>> >>>
>> >>>> @@ -5345,10 +5343,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>> >>>>    	else
>> >>>>    		last_cpupid = folio_last_cpupid(folio);
>> >>>>    	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
>> >>>> -	if (target_nid == NUMA_NO_NODE) {
>> >>>> -		folio_put(folio);
>> >>>> +	if (target_nid == NUMA_NO_NODE)
>> >>>> +		goto out_map;
>> >>>> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>> >>>> +		flags |= TNF_MIGRATE_FAIL;
>> >>>>    		goto out_map;
>> >>>>    	}
>> >>>
>> >>> These two locations are repeated code, maybe just merge the ifs into
>> >>> numa_migrate_prep(). Feel free to ignore if you are not going to send
>> >>> another version. :)
>> >>
>> >> I went back and forth a couple of times and
>> >>
>> >> a) Didn't want to move numa_migrate_prep() into
>> >>     migrate_misplaced_folio_prepare(), because having that code in
>> >>     mm/migrate.c felt a bit odd.
>> > 
>> > I agree after checking the actual code, since the code is just
>> > updating NUMA fault stats and checking where the folio should be.
>> > 
>> >>
>> >> b) Didn't want to move migrate_misplaced_folio_prepare() because I enjoy
>> >>     seeing the migrate_misplaced_folio_prepare() and
>> >>     migrate_misplaced_folio() calls in the same caller context.
>> >>
>> >> I also considered renaming numa_migrate_prep(), but wasn't really able to come up with a good name.
>> > 
>> > How about numa_migrate_check()? Since it tells whether a folio should be
>> > migrated or not.
>> > 
>> >>
>> >> But maybe a) is not too bad?
>> >>
>> >> We'd have migrate_misplaced_folio_prepare() consume &flags and &target_nid, and perform the "flags |= TNF_MIGRATE_FAIL;" internally.
>> >>
>> >> What would be your take?
>> > 
>> > I would either rename numa_migrate_prep() or just do nothing. I have to admit
>> > that the "prep" and "prepare" in both function names motivated me to propose
>> > the merge, but now the actual code tells me they should be separate.
>>
>> Let's leave it like that for now. Renaming to numa_migrate_check() makes 
>> sense, and likely moving more numa handling stuff in there.
>>
>> But I have yet to figure out why some of the memory.c vs. huge_memory.c
>> code differences exist, so we can unify them.
>>
>> For example, why did 33024536bafd9 introduce slightly different
>> last_cpupid handling in do_huge_pmd_numa_page()? It looks like there is
>> some subtle difference in how NUMA_BALANCING_MEMORY_TIERING is handled.
>> Maybe I am missing something obvious. :)
>
> It seems to me that a sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING
> check is missing in do_huge_pmd_numa_page(). So the
>
> if (node_is_toptier(nid))
>
> should be
>
> if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
> node_is_toptier(nid))
>
> to be consistent with other checks. Add Ying to confirm.

Yes.  It should be so.  Sorry for my mistake and the confusion.

> I also think a function like
>
> bool folio_has_cpupid(struct folio *folio)
> {
>     return !(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
>            node_is_toptier(folio_nid(folio));
> }
>
> would be better than the existing checks.

Yes.  This looks better.  Even better, we can add some comments to the
function too.

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-07-01  8:32             ` Huang, Ying
@ 2024-07-01 13:50               ` Zi Yan
  2024-07-01 14:03                 ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Zi Yan @ 2024-07-01 13:50 UTC (permalink / raw)
  To: Huang, Ying; +Cc: David Hildenbrand, linux-kernel, linux-mm, Andrew Morton

On 1 Jul 2024, at 4:32, Huang, Ying wrote:

> "Zi Yan" <ziy@nvidia.com> writes:
>
>> On Wed Jun 26, 2024 at 12:49 PM EDT, David Hildenbrand wrote:
>>> On 21.06.24 22:48, Zi Yan wrote:
>>>> On 21 Jun 2024, at 16:18, David Hildenbrand wrote:
>>>>
>>>>> On 21.06.24 15:44, Zi Yan wrote:
>>>>>> On 20 Jun 2024, at 17:29, David Hildenbrand wrote:
>>>>>>
>>>>>>> Currently we always take a folio reference even if migration will not
>>>>>>> even be tried or isolation failed, requiring us to grab+drop an additional
>>>>>>> reference.
>>>>>>>
>>>>>>> Further, we end up calling folio_likely_mapped_shared() while the folio
>>>>>>> might have already been unmapped, because after we dropped the PTL, that
>>>>>>> can easily happen. We want to stop touching mapcounts and friends from
>>>>>>> such context, and only call folio_likely_mapped_shared() while the folio
>>>>>>> is still mapped: mapcount information is pretty much stale and unreliable
>>>>>>> otherwise.
>>>>>>>
>>>>>>> So let's move checks into numamigrate_isolate_folio(), rename that
>>>>>>> function to migrate_misplaced_folio_prepare(), and call that function
>>>>>>> from callsites where we call migrate_misplaced_folio(), but still with
>>>>>>> the PTL held.
>>>>>>>
>>>>>>> We can now stop taking temporary folio references, and really only take
>>>>>>> a reference if folio isolation succeeded. Doing the
>>>>>>> folio_likely_mapped_shared() + folio isolation under PT lock is now similar
>>>>>>> to how we handle MADV_PAGEOUT.
>>>>>>>
>>>>>>> While at it, combine the folio_is_file_lru() checks.
>>>>>>>
>>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>>> ---
>>>>>>>    include/linux/migrate.h |  7 ++++
>>>>>>>    mm/huge_memory.c        |  8 ++--
>>>>>>>    mm/memory.c             |  9 +++--
>>>>>>>    mm/migrate.c            | 81 +++++++++++++++++++----------------------
>>>>>>>    4 files changed, 55 insertions(+), 50 deletions(-)
>>>>>>
>>>>>> LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>
>>>>>>
>>>>>> One nit below:
>>>>>>
>>>>>> <snip>
>>>>>>
>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>> index fc27dabcd8e3..4b2817bb2c7d 100644
>>>>>>> --- a/mm/huge_memory.c
>>>>>>> +++ b/mm/huge_memory.c
>>>>>>> @@ -1688,11 +1688,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>>>    	if (node_is_toptier(nid))
>>>>>>>    		last_cpupid = folio_last_cpupid(folio);
>>>>>>>    	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>>>>>>> -	if (target_nid == NUMA_NO_NODE) {
>>>>>>> -		folio_put(folio);
>>>>>>> +	if (target_nid == NUMA_NO_NODE)
>>>>>>> +		goto out_map;
>>>>>>> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>>>>>>> +		flags |= TNF_MIGRATE_FAIL;
>>>>>>>    		goto out_map;
>>>>>>>    	}
>>>>>>> -
>>>>>>> +	/* The folio is isolated and isolation code holds a folio reference. */
>>>>>>>    	spin_unlock(vmf->ptl);
>>>>>>>    	writable = false;
>>>>>>>
>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>> index 118660de5bcc..4fd1ecfced4d 100644
>>>>>>> --- a/mm/memory.c
>>>>>>> +++ b/mm/memory.c
>>>>>>
>>>>>> <snip>
>>>>>>
>>>>>>> @@ -5345,10 +5343,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>>>>>>    	else
>>>>>>>    		last_cpupid = folio_last_cpupid(folio);
>>>>>>>    	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
>>>>>>> -	if (target_nid == NUMA_NO_NODE) {
>>>>>>> -		folio_put(folio);
>>>>>>> +	if (target_nid == NUMA_NO_NODE)
>>>>>>> +		goto out_map;
>>>>>>> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>>>>>>> +		flags |= TNF_MIGRATE_FAIL;
>>>>>>>    		goto out_map;
>>>>>>>    	}
>>>>>>
>>>>>> These two locations are repeated code, maybe just merge the ifs into
>>>>>> numa_migrate_prep(). Feel free to ignore if you are not going to send
>>>>>> another version. :)
>>>>>
>>>>> I went back and forth a couple of times and
>>>>>
>>>>> a) Didn't want to move numa_migrate_prep() into
>>>>>     migrate_misplaced_folio_prepare(), because having that code in
>>>>>     mm/migrate.c felt a bit odd.
>>>>
>>>> I agree after checking the actual code, since the code is just
>>>> updating NUMA fault stats and checking where the folio should be.
>>>>
>>>>>
>>>>> b) Didn't want to move migrate_misplaced_folio_prepare() because I enjoy
>>>>>     seeing the migrate_misplaced_folio_prepare() and
>>>>>     migrate_misplaced_folio() calls in the same caller context.
>>>>>
>>>>> I also considered renaming numa_migrate_prep(), but wasn't really able to come up with a good name.
>>>>
>>>> How about numa_migrate_check()? Since it tells whether a folio should be
>>>> migrated or not.
>>>>
>>>>>
>>>>> But maybe a) is not too bad?
>>>>>
>>>>> We'd have migrate_misplaced_folio_prepare() consume &flags and &target_nid, and perform the "flags |= TNF_MIGRATE_FAIL;" internally.
>>>>>
>>>>> What would be your take?
>>>>
>>>> I would either rename numa_migrate_prep() or just do nothing. I have to admit
>>>> that the "prep" and "prepare" in both function names motivated me to propose
>>>> the merge, but now the actual code tells me they should be separate.
>>>
>>> Let's leave it like that for now. Renaming to numa_migrate_check() makes
>>> sense, and likely moving more numa handling stuff in there.
>>>
>>> But I have yet to figure out why some of the memory.c vs. huge_memory.c
>>> code differences exist, so we can unify them.
>>>
>>> For example, why did 33024536bafd9 introduce slightly different
>>> last_cpupid handling in do_huge_pmd_numa_page()? It looks like there is
>>> some subtle difference in how NUMA_BALANCING_MEMORY_TIERING is handled.
>>> Maybe I am missing something obvious. :)
>>
>> It seems to me that a sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING
>> check is missing in do_huge_pmd_numa_page(). So the
>>
>> if (node_is_toptier(nid))
>>
>> should be
>>
>> if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
>> node_is_toptier(nid))
>>
>> to be consistent with other checks. Add Ying to confirm.
>
> Yes.  It should be so.  Sorry for my mistake and the confusion.

Thank you for the confirmation.

>
>> I also think a function like
>>
>> bool folio_has_cpupid(struct folio *folio)
>> {
>>     return !(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
>>            node_is_toptier(folio_nid(folio));
>> }
>>
>> would be better than the existing checks.
>
> Yes.  This looks better.  Even better, we can add some comments to the
> function too.

I will prepare a patch about it.

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-07-01 13:50               ` Zi Yan
@ 2024-07-01 14:03                 ` David Hildenbrand
  2024-07-01 14:04                   ` Zi Yan
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2024-07-01 14:03 UTC (permalink / raw)
  To: Zi Yan, Huang, Ying; +Cc: linux-kernel, linux-mm, Andrew Morton

On 01.07.24 15:50, Zi Yan wrote:
> On 1 Jul 2024, at 4:32, Huang, Ying wrote:
> 
>> "Zi Yan" <ziy@nvidia.com> writes:
>>
>>> On Wed Jun 26, 2024 at 12:49 PM EDT, David Hildenbrand wrote:
>>>> On 21.06.24 22:48, Zi Yan wrote:
>>>>> On 21 Jun 2024, at 16:18, David Hildenbrand wrote:
>>>>>
>>>>>> On 21.06.24 15:44, Zi Yan wrote:
>>>>>>> On 20 Jun 2024, at 17:29, David Hildenbrand wrote:
>>>>>>>
>>>>>>>> Currently we always take a folio reference even if migration will not
>>>>>>>> even be tried or isolation failed, requiring us to grab+drop an additional
>>>>>>>> reference.
>>>>>>>>
>>>>>>>> Further, we end up calling folio_likely_mapped_shared() while the folio
>>>>>>>> might have already been unmapped, because after we dropped the PTL, that
>>>>>>>> can easily happen. We want to stop touching mapcounts and friends from
>>>>>>>> such context, and only call folio_likely_mapped_shared() while the folio
>>>>>>>> is still mapped: mapcount information is pretty much stale and unreliable
>>>>>>>> otherwise.
>>>>>>>>
>>>>>>>> So let's move checks into numamigrate_isolate_folio(), rename that
>>>>>>>> function to migrate_misplaced_folio_prepare(), and call that function
>>>>>>>> from callsites where we call migrate_misplaced_folio(), but still with
>>>>>>>> the PTL held.
>>>>>>>>
>>>>>>>> We can now stop taking temporary folio references, and really only take
>>>>>>>> a reference if folio isolation succeeded. Doing the
>>>>>>>> folio_likely_mapped_shared() + folio isolation under PT lock is now similar
>>>>>>>> to how we handle MADV_PAGEOUT.
>>>>>>>>
>>>>>>>> While at it, combine the folio_is_file_lru() checks.
>>>>>>>>
>>>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>>>> ---
>>>>>>>>     include/linux/migrate.h |  7 ++++
>>>>>>>>     mm/huge_memory.c        |  8 ++--
>>>>>>>>     mm/memory.c             |  9 +++--
>>>>>>>>     mm/migrate.c            | 81 +++++++++++++++++++----------------------
>>>>>>>>     4 files changed, 55 insertions(+), 50 deletions(-)
>>>>>>>
>>>>>>> LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>
>>>>>>>
>>>>>>> One nit below:
>>>>>>>
>>>>>>> <snip>
>>>>>>>
>>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>>> index fc27dabcd8e3..4b2817bb2c7d 100644
>>>>>>>> --- a/mm/huge_memory.c
>>>>>>>> +++ b/mm/huge_memory.c
>>>>>>>> @@ -1688,11 +1688,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>>>>     	if (node_is_toptier(nid))
>>>>>>>>     		last_cpupid = folio_last_cpupid(folio);
>>>>>>>>     	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>>>>>>>> -	if (target_nid == NUMA_NO_NODE) {
>>>>>>>> -		folio_put(folio);
>>>>>>>> +	if (target_nid == NUMA_NO_NODE)
>>>>>>>> +		goto out_map;
>>>>>>>> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>>>>>>>> +		flags |= TNF_MIGRATE_FAIL;
>>>>>>>>     		goto out_map;
>>>>>>>>     	}
>>>>>>>> -
>>>>>>>> +	/* The folio is isolated and isolation code holds a folio reference. */
>>>>>>>>     	spin_unlock(vmf->ptl);
>>>>>>>>     	writable = false;
>>>>>>>>
>>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>>> index 118660de5bcc..4fd1ecfced4d 100644
>>>>>>>> --- a/mm/memory.c
>>>>>>>> +++ b/mm/memory.c
>>>>>>>
>>>>>>> <snip>
>>>>>>>
>>>>>>>> @@ -5345,10 +5343,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>>>>>>>     	else
>>>>>>>>     		last_cpupid = folio_last_cpupid(folio);
>>>>>>>>     	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
>>>>>>>> -	if (target_nid == NUMA_NO_NODE) {
>>>>>>>> -		folio_put(folio);
>>>>>>>> +	if (target_nid == NUMA_NO_NODE)
>>>>>>>> +		goto out_map;
>>>>>>>> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>>>>>>>> +		flags |= TNF_MIGRATE_FAIL;
>>>>>>>>     		goto out_map;
>>>>>>>>     	}
>>>>>>>
>>>>>>> These two locations repeat the same code, so maybe just merge the ifs into
>>>>>>> numa_migrate_prep(). Feel free to ignore if you are not going to send
>>>>>>> another version. :)
>>>>>>
>>>>>> I went back and forth a couple of times and
>>>>>>
>>>>>> a) Didn't want to move numa_migrate_prep() into
>>>>>>      migrate_misplaced_folio_prepare(), because having that code in
>>>>>>      mm/migrate.c felt a bit odd.
>>>>>
>>>>> I agree after checking the actual code, since the code is just
>>>>> updating NUMA fault stats and checking where the folio should be.
>>>>>
>>>>>>
>>>>>> b) Didn't want to move migrate_misplaced_folio_prepare() because I enjoy
>>>>>>      seeing the migrate_misplaced_folio_prepare() and
>>>>>>      migrate_misplaced_folio() calls in the same caller context.
>>>>>>
>>>>>> I also considered renaming numa_migrate_prep(), but wasn't really able to come up with a good name.
>>>>>
>>>>> How about numa_migrate_check()? Since it tells whether a folio should be
>>>>> migrated or not.
>>>>>
>>>>>>
>>>>>> But maybe a) is not too bad?
>>>>>>
>>>>>> We'd have migrate_misplaced_folio_prepare() consume &flags and &target_nid, and perform the "flags |= TNF_MIGRATE_FAIL;" internally.
>>>>>>
>>>>>> What would be your take?
>>>>>
>>>>> I would either rename numa_migrate_prep() or just do nothing. I have to admit
>>>>> that the "prep" and "prepare" in both function names motivated me to propose
>>>>> the merge, but now the actual code tells me they should be separate.
>>>>
>>>> Let's leave it like that for now. Renaming to numa_migrate_check() makes
>>>> sense, and likely moving more numa handling stuff in there.
>>>>
>>>> But I have yet to figure out why some of the memory.c vs. huge_memory.c
>>>> code differences exist, so we can unify them.
>>>>
>>>> For example, why did 33024536bafd9 introduce slightly different
>>>> last_cpupid handling in do_huge_pmd_numa_page(), which seems to create
>>>> a subtle difference in handling NUMA_BALANCING_MEMORY_TIERING? Maybe
>>>> I am missing something obvious. :)
>>>
>>> It seems to me that a sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING
>>> check is missing in do_huge_pmd_numa_page(). So the
>>>
>>> if (node_is_toptier(nid))
>>>
>>> should be
>>>
>>> if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
>>>     node_is_toptier(nid))
>>>
>>> to be consistent with other checks. Adding Ying to confirm.
>>
>> Yes.  It should be so.  Sorry for my mistake and the confusion.
> 
> Thank you for the confirmation.
> 
>>
>>> I also think a function like
>>>
>>> bool folio_has_cpupid(struct folio *folio)
>>> {
>>> 	return !(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
>>> 	       node_is_toptier(folio_nid(folio));
>>> }
>>>
>>> would be better than the existing checks.
>>
>> Yes.  This looks better.  Even better, we can add some comments to the
>> function too.
> 
> I will prepare a patch about it.

Do you have the capacity to further consolidate the logic, maybe moving
more stuff into numa_migrate_prep() (and renaming it? :)
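
As a strawman, one possible shape for such a consolidated helper,
following the numa_migrate_check() naming suggested above (the exact
signature, placement, and body are assumptions, not a final API):

static int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
			      unsigned long addr, int *flags)
{
	int target_nid;

	/*
	 * What numa_migrate_prep() does today: update the NUMA fault
	 * stats and pick a target node for the folio.
	 */
	target_nid = numa_migrate_prep(folio, vmf, addr, folio_nid(folio),
				       flags);
	if (target_nid == NUMA_NO_NODE)
		return NUMA_NO_NODE;

	/* Folio isolation still happens under the PTL, as in patch #2. */
	if (migrate_misplaced_folio_prepare(folio, vmf->vma, target_nid)) {
		*flags |= TNF_MIGRATE_FAIL;
		return NUMA_NO_NODE;
	}
	return target_nid;
}

Both fault handlers could then drop their duplicated target_nid /
TNF_MIGRATE_FAIL blocks and simply test the return value.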

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
  2024-07-01 14:03                 ` David Hildenbrand
@ 2024-07-01 14:04                   ` Zi Yan
  0 siblings, 0 replies; 24+ messages in thread
From: Zi Yan @ 2024-07-01 14:04 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: Huang, Ying, linux-kernel, linux-mm, Andrew Morton

On 1 Jul 2024, at 10:03, David Hildenbrand wrote:

> On 01.07.24 15:50, Zi Yan wrote:
>> On 1 Jul 2024, at 4:32, Huang, Ying wrote:
>>
>>> "Zi Yan" <ziy@nvidia.com> writes:
>>>
>>>> <snip>
>>>>
>>>> I also think a function like
>>>>
>>>> bool folio_has_cpupid(struct folio *folio)
>>>> {
>>>> 	return !(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
>>>> 	       node_is_toptier(folio_nid(folio));
>>>> }
>>>>
>>>> would be better than the existing checks.
>>>
>>> Yes.  This looks better.  Even better, we can add some comments to the
>>> function too.
>>
>> I will prepare a patch about it.
>
> Do you have the capacity to further consolidate the logic, maybe moving more stuff into numa_migrate_prep() (and renaming it? :)

Sure, let me give it a shot. :)

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-07-01 14:04 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-20 21:29 [PATCH v1 0/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL David Hildenbrand
2024-06-20 21:29 ` [PATCH v1 1/2] mm/migrate: make migrate_misplaced_folio() return 0 on success David Hildenbrand
2024-06-21  1:40   ` Zi Yan
2024-06-21  3:39   ` Baolin Wang
2024-07-01  7:36   ` Huang, Ying
2024-07-01  7:44     ` David Hildenbrand
2024-06-20 21:29 ` [PATCH v1 2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL David Hildenbrand
2024-06-21  2:05   ` Zi Yan
2024-06-21  7:32     ` David Hildenbrand
2024-06-21  4:07   ` Baolin Wang
2024-06-21  7:31     ` David Hildenbrand
2024-06-21 13:44   ` Zi Yan
2024-06-21 20:18     ` David Hildenbrand
2024-06-21 20:48       ` Zi Yan
2024-06-26 16:49         ` David Hildenbrand
2024-06-26 17:37           ` Zi Yan
2024-07-01  8:32             ` Huang, Ying
2024-07-01 13:50               ` Zi Yan
2024-07-01 14:03                 ` David Hildenbrand
2024-07-01 14:04                   ` Zi Yan
2024-06-21 17:47   ` Donet Tom
2024-06-21 20:14     ` David Hildenbrand
2024-06-26 16:22   ` David Hildenbrand
2024-06-27  6:00     ` Donet Tom

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).