* [PATCH v2 1/2] mm/mempolicy: Use numa_node_id() instead of cpu_to_node()
From: Donet Tom @ 2024-03-08 15:15 UTC
To: Andrew Morton, linux-mm, linux-kernel
Cc: Aneesh Kumar, Huang Ying, Michal Hocko, Dave Hansen, Mel Gorman,
Feng Tang, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar,
Rik van Riel, Johannes Weiner, Matthew Wilcox, Vlastimil Babka,
Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan,
Donet Tom
Instead of cpu_to_node(), use numa_node_id(), which is faster.
smp_processor_id() is guaranteed to be stable in mpol_misplaced()
because the function is called with the ptl held; a
lockdep_assert_held() was added to enforce that.

No functional change in this patch.
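For context, a minimal sketch of why the swap is faster, paraphrasing
include/linux/topology.h (the identifiers are the real kernel ones, but
the bodies are simplified for illustration, not a drop-in
implementation): with CONFIG_USE_PERCPU_NUMA_NODE_ID, numa_node_id() is
a single per-CPU read, while the generic fallback goes through the
cpu-to-node lookup that cpu_to_node() implies.

#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
DECLARE_PER_CPU(int, numa_node);

static inline int numa_node_id(void)
{
	/* one per-CPU load; no cpu-to-node table walk */
	return raw_cpu_read(numa_node);
}
#else
static inline int numa_node_id(void)
{
	/* fallback: same value, via the slower lookup */
	return cpu_to_node(raw_smp_processor_id());
}
#endif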
Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
include/linux/mempolicy.h | 5 +++--
mm/huge_memory.c | 2 +-
mm/internal.h | 2 +-
mm/memory.c | 8 +++++---
mm/mempolicy.c | 12 +++++++++---
5 files changed, 19 insertions(+), 10 deletions(-)
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 931b118336f4..1add16f21612 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -167,7 +167,8 @@ extern void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol);
/* Check if a vma is migratable */
extern bool vma_migratable(struct vm_area_struct *vma);
-int mpol_misplaced(struct folio *, struct vm_area_struct *, unsigned long);
+int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
+ unsigned long addr);
extern void mpol_put_task_policy(struct task_struct *);
static inline bool mpol_is_preferred_many(struct mempolicy *pol)
@@ -282,7 +283,7 @@ static inline int mpol_parse_str(char *str, struct mempolicy **mpol)
#endif
static inline int mpol_misplaced(struct folio *folio,
- struct vm_area_struct *vma,
+ struct vm_fault *vmf,
unsigned long address)
{
return -1; /* no node preference */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 94c958f7ebb5..7f944e0c4571 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1752,7 +1752,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
*/
if (node_is_toptier(nid))
last_cpupid = folio_last_cpupid(folio);
- target_nid = numa_migrate_prep(folio, vma, haddr, nid, &flags);
+ target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
if (target_nid == NUMA_NO_NODE) {
folio_put(folio);
goto out_map;
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..ae175be9165e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -992,7 +992,7 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
void __vunmap_range_noflush(unsigned long start, unsigned long end);
-int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
+int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
unsigned long addr, int page_nid, int *flags);
void free_zone_device_page(struct page *page);
diff --git a/mm/memory.c b/mm/memory.c
index 0bfc8b007c01..4e258a8564ca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4899,9 +4899,11 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
return ret;
}
-int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
+int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
unsigned long addr, int page_nid, int *flags)
{
+ struct vm_area_struct *vma = vmf->vma;
+
folio_get(folio);
/* Record the current PID accessing VMA */
@@ -4913,7 +4915,7 @@ int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
*flags |= TNF_FAULT_LOCAL;
}
- return mpol_misplaced(folio, vma, addr);
+ return mpol_misplaced(folio, vmf, addr);
}
static vm_fault_t do_numa_page(struct vm_fault *vmf)
@@ -4987,7 +4989,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
last_cpupid = (-1 & LAST_CPUPID_MASK);
else
last_cpupid = folio_last_cpupid(folio);
- target_nid = numa_migrate_prep(folio, vma, vmf->address, nid, &flags);
+ target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
if (target_nid == NUMA_NO_NODE) {
folio_put(folio);
goto out_map;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..e635d7ed501b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2477,18 +2477,24 @@ static void sp_free(struct sp_node *n)
* Return: NUMA_NO_NODE if the page is in a node that is valid for this
* policy, or a suitable node ID to allocate a replacement folio from.
*/
-int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
+int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
unsigned long addr)
{
struct mempolicy *pol;
pgoff_t ilx;
struct zoneref *z;
int curnid = folio_nid(folio);
+ struct vm_area_struct *vma = vmf->vma;
int thiscpu = raw_smp_processor_id();
- int thisnid = cpu_to_node(thiscpu);
+ int thisnid = numa_node_id();
int polnid = NUMA_NO_NODE;
int ret = NUMA_NO_NODE;
+ /*
+ * Make sure the ptl is held so that we can't be preempted and
+ * the smp processor id stays stable
+ */
+ lockdep_assert_held(vmf->ptl);
pol = get_vma_policy(vma, addr, folio_order(folio), &ilx);
if (!(pol->flags & MPOL_F_MOF))
goto out;
@@ -2526,7 +2532,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
if (node_isset(curnid, pol->nodes))
goto out;
z = first_zones_zonelist(
- node_zonelist(numa_node_id(), GFP_HIGHUSER),
+ node_zonelist(thisnid, GFP_HIGHUSER),
gfp_zone(GFP_HIGHUSER),
&pol->nodes);
polnid = zone_to_nid(z->zone);
--
2.39.3
* [PATCH v2 2/2] mm/numa_balancing: Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
From: Donet Tom @ 2024-03-08 15:15 UTC
To: Andrew Morton, linux-mm, linux-kernel
Cc: Aneesh Kumar, Huang Ying, Michal Hocko, Dave Hansen, Mel Gorman,
Feng Tang, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar,
Rik van Riel, Johannes Weiner, Matthew Wilcox, Vlastimil Babka,
Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan,
Donet Tom
Commit bda420b98505 ("numa balancing: migrate on fault among multiple bound
nodes") added support for migration on protnone reference with the
MPOL_BIND memory policy. This allowed NUMA fault migration when the
executing node is part of the policy mask for MPOL_BIND. This patch
extends that support to the MPOL_PREFERRED_MANY policy.
Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag
MPOL_F_NUMA_BALANCING. This causes issues when we want to use
NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier,
the kernel should not allocate pages from the slower memory tier through
the zonelist allocation fallback. Instead, we should move cold pages
from the faster memory node via memory demotion. For a page allocation,
kswapd is only woken up after we try to allocate pages from all nodes in
the allocation zone list. This implies that, without using memory
policies, we will end up allocating hot pages in the slower memory tier.
MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add
MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better
allocation control when we have memory tiers in the system. With
MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only
of faster memory nodes. When we fail to allocate pages from the faster
memory node, kswapd would be woken up, allowing demotion of cold pages
to slower memory nodes.
With the current kernel, such usage of memory policies implies we can't
do page promotion from a slower memory tier to a faster memory tier
using numa fault. This patch fixes this issue.
For MPOL_PREFERRED_MANY, if the executing node is in the policy node
mask, we allow NUMA migration to the executing node. If the executing
node is not in the policy node mask, we do not allow NUMA migration.
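The rule reduces to a small check; below is a self-contained userspace
sketch of it (the helper name allow_numa_migration() and the 0x03 mask
are illustrative only, echoing the cover letter's two fast-tier nodes;
this is not kernel code):

#include <stdbool.h>
#include <stdio.h>

/* Policy nodemask as a bitmask: nodes 0 and 1 are the fast tier. */
#define POLICY_NODES 0x03UL

/*
 * The MPOL_F_MORON rule for MPOL_PREFERRED_MANY after this patch:
 * migrate on fault only if the executing node is in the policy mask.
 */
static bool allow_numa_migration(int executing_node)
{
	return POLICY_NODES & (1UL << executing_node);
}

int main(void)
{
	for (int node = 0; node < 7; node++)
		printf("executing on node %d -> numa migration: %s\n",
		       node, allow_numa_migration(node) ? "allowed" : "denied");
	return 0;
}

Run against the cover letter's layout, it reports migration allowed only
when executing on node 0 or 1.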
Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
mm/mempolicy.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e635d7ed501b..ccd9c6c5fcf5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1458,9 +1458,10 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
return -EINVAL;
if (*flags & MPOL_F_NUMA_BALANCING) {
- if (*mode != MPOL_BIND)
+ if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY)
+ *flags |= (MPOL_F_MOF | MPOL_F_MORON);
+ else
return -EINVAL;
- *flags |= (MPOL_F_MOF | MPOL_F_MORON);
}
return 0;
}
@@ -2515,15 +2516,26 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
break;
case MPOL_BIND:
- /* Optimize placement among multiple nodes via NUMA balancing */
+ case MPOL_PREFERRED_MANY:
+ /*
+ * Even though MPOL_PREFERRED_MANY can allocate pages outside
+ * the policy nodemask, we don't allow NUMA migration to nodes
+ * outside that nodemask for now. If a user wants demotion to
+ * slow memory to happen before allocation falls back to some
+ * DRAM node 'x', they will set a MPOL_PREFERRED_MANY mask that
+ * excludes 'x'. In that scenario we should not promote to
+ * node 'x' from a slow memory node.
+ */
if (pol->flags & MPOL_F_MORON) {
+ /*
+ * Optimize placement among multiple nodes
+ * via NUMA balancing
+ */
if (node_isset(thisnid, pol->nodes))
break;
goto out;
}
- fallthrough;
- case MPOL_PREFERRED_MANY:
/*
* use current page if in policy nodemask,
* else select nearest allowed node, if any.
--
2.39.3
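The user-visible effect of the sanitize_mpol_flags() change can be
probed from userspace. A minimal sketch, assuming headers new enough to
define MPOL_PREFERRED_MANY and MPOL_F_NUMA_BALANCING (the node mask
mirrors the cover letter's example):

#include <numaif.h>
#include <stdio.h>

int main(void)
{
	unsigned long nodemask = 0x03; /* nodes 0 and 1 */

	/* Rejected with EINVAL before this patch, accepted after it. */
	if (set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING,
			  &nodemask, 10) < 0) {
		perror("set_mempolicy");
		return 1;
	}
	puts("MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING accepted");
	return 0;
}

Built with -lnuma, this fails with EINVAL on a pre-patch kernel and
prints the success message once the patch is applied.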
* Re: [PATCH v2 0/2] Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
From: Huang, Ying @ 2024-03-11 1:45 UTC
To: Donet Tom
Cc: Andrew Morton, linux-mm, linux-kernel, Aneesh Kumar, Michal Hocko,
Dave Hansen, Mel Gorman, Feng Tang, Andrea Arcangeli,
Peter Zijlstra, Ingo Molnar, Rik van Riel, Johannes Weiner,
Matthew Wilcox, Vlastimil Babka, Dan Williams, Hugh Dickins,
Kefeng Wang, Suren Baghdasaryan
Donet Tom <donettom@linux.ibm.com> writes:
> This patchset optimizes cross-socket memory access with the
> MPOL_PREFERRED_MANY policy.
>
> To test this patch we ran the following test on a 3 node system.
> Node 0 - 2GB - Tier 1
> Node 1 - 11GB - Tier 1
> Node 6 - 10GB - Tier 2
>
> The following changes were made to memcached to set the memory policy;
> it selects Node0 and Node1 as the preferred nodes.
>
> #include <numaif.h>
> #include <numa.h>
>
> unsigned long nodemask;
> int ret;
>
> nodemask = 0x03;
> ret = set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING,
> &nodemask, 10);
> /* If MPOL_F_NUMA_BALANCING isn't supported,
> * fall back to MPOL_PREFERRED_MANY */
> if (ret < 0 && errno == EINVAL){
> printf("set mem policy normal\n");
> ret = set_mempolicy(MPOL_PREFERRED_MANY, &nodemask, 10);
> }
> if (ret < 0) {
> perror("Failed to call set_mempolicy");
> exit(-1);
> }
>
> Test Procedure:
> ===============
> 1. Make sure memory tiring and demotion are enabled.
Nit picking.
s/tiring/tiering/
--
Best Regards,
Huang, Ying
> 2. Start memcached.
>
> # ./memcached -b 100000 -m 204800 -u root -c 1000000 -t 7
> -d -s "/tmp/memcached.sock"
>
> 3. Run memtier_benchmark to store 3200000 keys.
>
> #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary
> --threads=1 --pipeline=1 --ratio=1:0 --key-pattern=S:S --key-minimum=1
> --key-maximum=3200000 -n allkeys -c 1 -R -x 1 -d 1024
>
> 4. Start a memory eater on node 0 and 1. This will demote all memcached
> pages to node 6.
> 5. Make sure all the memcached pages got demoted to the lower tier by
> reading /proc/<memcached PID>/numa_maps.
>
> # cat /proc/2771/numa_maps
> ---
> default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64
> default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64
> ---
>
> 6. Kill memory eater.
> 7. Read the pgpromote_success counter.
> 8. Start reading the keys by running memtier_benchmark.
>
> #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary
> --pipeline=1 --distinct-client-seed --ratio=0:3 --key-pattern=R:R
> --key-minimum=1 --key-maximum=3200000 -n allkeys
> --threads=64 -c 1 -R -x 6
>
> 9. Read the pgpromote_success counter.
>
> Test Results:
> =============
> Without Patch
> ------------------
> 1. pgpromote_success before test
> Node 0: pgpromote_success 11
> Node 1: pgpromote_success 140974
>
> pgpromote_success after test
> Node 0: pgpromote_success 11
> Node 1: pgpromote_success 140974
>
> 2. Memtier-benchmark result.
> AGGREGATED AVERAGE RESULTS (6 runs)
> ==================================================================
> Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency
> ------------------------------------------------------------------
> Sets 0.00 --- --- --- ---
> Gets 305792.03 305791.93 0.10 0.18949 0.16700
> Waits 0.00 --- --- --- ---
> Totals 305792.03 305791.93 0.10 0.18949 0.16700
>
> ======================================
> p99 Latency p99.9 Latency KB/sec
> -------------------------------------
> --- --- 0.00
> 0.44700 1.71100 11542.69
> --- --- ---
> 0.44700 1.71100 11542.69
>
> With Patch
> ---------------
> 1. pgpromote_success before test
> Node 0: pgpromote_success 5
> Node 1: pgpromote_success 89386
>
> pgpromote_success after test
> Node 0: pgpromote_success 57895
> Node 1: pgpromote_success 141463
>
> 2. Memtier-benchmark result.
> AGGREGATED AVERAGE RESULTS (6 runs)
> ====================================================================
> Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency
> --------------------------------------------------------------------
> Sets 0.00 --- --- --- ---
> Gets 521942.24 521942.07 0.17 0.11459 0.10300
> Waits 0.00 --- --- --- ---
> Totals 521942.24 521942.07 0.17 0.11459 0.10300
>
> =======================================
> p99 Latency p99.9 Latency KB/sec
> ---------------------------------------
> --- --- 0.00
> 0.23100 0.31900 19701.68
> --- --- ---
> 0.23100 0.31900 19701.68
>
>
> Test Result Analysis:
> =====================
> 1. With the patch, we can observe that pages are getting promoted.
> 2. Memtier-benchmark results show that, with the patch, performance
> increased by about 70%.
>
> Ops/sec without fix - 305792.03
> Ops/sec with fix - 521942.24
>
> Changes:
> v2:
> - Rebased on latest upstream (v6.8-rc7)
> - Used 'numa_node_id()' to get the current execution node ID and added
> a 'lockdep_assert_held()' to make sure that 'mpol_misplaced()' is
> called with the ptl held.
> - The migration condition has been updated; now, migration will only
> occur if the execution node is present in the policy nodemask.
>
> -v1: https://lore.kernel.org/linux-mm/9c3f7b743477560d1c5b12b8c111a584a2cc92ee.1708097962.git.donettom@linux.ibm.com/#t
>
>
> Donet Tom (2):
> mm/mempolicy: Use numa_node_id() instead of cpu_to_node()
> mm/numa_balancing: Allow migrate on protnone reference with
> MPOL_PREFERRED_MANY policy
>
> include/linux/mempolicy.h | 5 +++--
> mm/huge_memory.c | 2 +-
> mm/internal.h | 2 +-
> mm/memory.c | 8 +++++---
> mm/mempolicy.c | 34 ++++++++++++++++++++++++++--------
> 5 files changed, 36 insertions(+), 15 deletions(-)