* [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes
@ 2025-06-12 18:13 Bijan Tabatabai
  2025-06-12 18:13 ` [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h Bijan Tabatabai
                   ` (6 more replies)
  0 siblings, 7 replies; 30+ messages in thread
From: Bijan Tabatabai @ 2025-06-12 18:13 UTC (permalink / raw)
  To: damon, linux-mm, linux-doc, linux-kernel
  Cc: sj, akpm, corbet, david, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, gourry, ying.huang, apopple, bijantabatab,
	venkataravis, emirakhur, ajayjoshi, vtavarespetr

From: Bijan Tabatabai <bijantabatab@micron.com>

A recent patch set automatically set the interleave weight for each node
according to the node's maximum bandwidth [1]. In another thread, the patch
set's author, Joshua Hahn, wondered if/how these weights should be changed
if the bandwidth utilization of the system changes [2].

This patch set adds the mechanism for dynamically changing how application
data is interleaved across nodes while leaving the policy of what the
interleave weights should be to userspace. It does this by adding a new
DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both
paddr and vaddr operations sets. Using the paddr version is useful for
managing page placement globally. Using the vaddr version limits tracking
to one process per kdamond instance, but the VA-based tracking better
captures spatial locality.

DAMOS_INTERLEAVE interleaves pages within a region across nodes using the
interleave weights at /sys/kernel/mm/mempolicy/weighted_interleave/node<N>
and the page placement algorithm in weighted_interleave_nid via
policy_nodemask. We chose to reuse the mempolicy weighted interleave
infrastructure to avoid reimplementing code. However, this has the awkward
side effect that only pages that are mapped to processes using
MPOL_WEIGHTED_INTERLEAVE will be migrated according to new interleave
weights. This might be fine because workloads that want their data to be
dynamically interleaved will want their newly allocated data to be
interleaved at the same ratio.
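
For reference, the per-folio target node selection boils down to the sketch
below, which mirrors the rmap callback in patch 2 and the shared helper in
patch 4 (pol, vma, addr, and folio are supplied by the caller; the function
name here is illustrative only):

  static int damos_pick_interleave_nid(unsigned long addr,
  		struct vm_area_struct *vma, struct mempolicy *pol,
  		struct folio *folio)
  {
  	pgoff_t ilx;
  	int target_nid;

  	/* Index of the folio within the VMA's interleave sequence */
  	ilx = vma->vm_pgoff >> folio_order(folio);
  	ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + folio_order(folio));

  	/* Fills target_nid following the weighted interleave weights */
  	policy_nodemask(0, pol, ilx, &target_nid);

  	/* NUMA_NO_NODE means no placement preference for this folio */
  	return target_nid;
  }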

If exposing policy_nodemask is undesirable, we have two alternative methods
for having DAMON access the interleave weights it should use. We would
appreciate feedback on which method is preferred.
1. Use mpol_misplaced instead
  pros: mpol_misplaced is already exposed publicly
  cons: Would require refactoring mpol_misplaced to take a struct vm_area_struct
  instead of a struct vm_fault, and require refactoring mpol_misplaced and
  get_vma_policy to take in a struct task_struct rather than just using
  current. Also requires processes to use MPOL_WEIGHTED_INTERLEAVE.
2. Add a new field to struct damos, similar to target_nid for the
MIGRATE_HOT/COLD schemes (a rough sketch follows this list).
  pros: Keeps changes contained inside DAMON. Would not require processes
  to use MPOL_WEIGHTED_INTERLEAVE.
  cons: Duplicates page placement code. Requires discussion on the sysfs
  interface to use for users to pass in the interleave weights.
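
To make option 2 concrete, below is a rough sketch of what the struct damos
change could look like; the field name is purely illustrative and the sysfs
interface for setting the weights is intentionally left out:

  struct damos {
  	/* ... existing fields ... */
  	int target_nid;		/* existing; used by DAMOS_MIGRATE_{HOT,COLD} */
  	/* hypothetical: per-node weights, set through a new sysfs file */
  	unsigned int *interleave_weights;
  	/* ... */
  };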

This patch set was tested on an AMD machine with a NUMA node with CPUs
attached to DDR memory and a CPU-less NUMA node attached to CXL memory.
However, this patch set should generalize to other architectures and other
numbers of NUMA nodes.

Patches Sequence
________________
The first patch exposes policy_nodemask() in include/linux/mempolicy.h to
let DAMON determine where a page should be placed for interleaving.
The second patch implements DAMOS_INTERLEAVE as a paddr action.
The third patch moves the DAMON page migration code to ops-common, allowing
vaddr actions to use it.
Finally, the fourth patch implements a vaddr version of DAMOS_INTERLEAVE.

[1] https://lore.kernel.org/linux-mm/20250520141236.2987309-1-joshua.hahnjy@gmail.com/
[2] https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/

Bijan Tabatabai (4):
  mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h
  mm/damon/paddr: Add DAMOS_INTERLEAVE action
  mm/damon: Move damon_pa_migrate_pages to ops-common
  mm/damon/vaddr: Add vaddr version of DAMOS_INTERLEAVE

 Documentation/mm/damon/design.rst |   2 +
 include/linux/damon.h             |   2 +
 include/linux/mempolicy.h         |   2 +
 mm/damon/ops-common.c             | 136 ++++++++++++++++++++
 mm/damon/ops-common.h             |   4 +
 mm/damon/paddr.c                  | 198 +++++++++++++-----------------
 mm/damon/sysfs-schemes.c          |   1 +
 mm/damon/vaddr.c                  | 124 +++++++++++++++++++
 mm/mempolicy.c                    |   4 +-
 9 files changed, 360 insertions(+), 113 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h
  2025-06-12 18:13 [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes Bijan Tabatabai
@ 2025-06-12 18:13 ` Bijan Tabatabai
  2025-06-13 13:45   ` David Hildenbrand
  2025-06-12 18:13 ` [RFC PATCH 2/4] mm/damon/paddr: Add DAMOS_INTERLEAVE action Bijan Tabatabai
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Bijan Tabatabai @ 2025-06-12 18:13 UTC (permalink / raw)
  To: damon, linux-mm, linux-doc, linux-kernel
  Cc: sj, akpm, corbet, david, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, gourry, ying.huang, apopple, bijantabatab,
	venkataravis, emirakhur, ajayjoshi, vtavarespetr

From: Bijan Tabatabai <bijantabatab@micron.com>

This patch allows DAMON to call policy_nodemask() so it can determine
where a page should be placed for interleaving.

Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
---
 include/linux/mempolicy.h | 9 +++++++++
 mm/mempolicy.c            | 4 +---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 0fe96f3ab3ef..e96bf493ff7a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -133,6 +133,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr, int order, pgoff_t *ilx);
 bool vma_policy_mof(struct vm_area_struct *vma);
+nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
+		pgoff_t ilx, int *nid);
 
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
@@ -232,6 +234,13 @@ static inline struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 	return NULL;
 }
 
+static inline nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
+				pgoff_t ilx, int *nid)
+{
+	*nid = NUMA_NO_NODE;
+	return NULL;
+}
+
 static inline int
 vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
 {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3b1dfd08338b..54f539497e20 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -596,8 +596,6 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 
 static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
 				unsigned long flags);
-static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
-				pgoff_t ilx, int *nid);
 
 static bool strictly_unmovable(unsigned long flags)
 {
@@ -2195,7 +2193,7 @@ static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx)
  * Return a nodemask representing a mempolicy for filtering nodes for
  * page allocation, together with preferred node id (or the input node id).
  */
-static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
+nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
 				   pgoff_t ilx, int *nid)
 {
 	nodemask_t *nodemask = NULL;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 2/4] mm/damon/paddr: Add DAMOS_INTERLEAVE action
  2025-06-12 18:13 [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes Bijan Tabatabai
  2025-06-12 18:13 ` [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h Bijan Tabatabai
@ 2025-06-12 18:13 ` Bijan Tabatabai
  2025-06-13 13:43   ` David Hildenbrand
  2025-06-12 18:13 ` [RFC PATCH 3/4] mm/damon: Move damon_pa_migrate_pages to ops-common Bijan Tabatabai
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Bijan Tabatabai @ 2025-06-12 18:13 UTC (permalink / raw)
  To: damon, linux-mm, linux-doc, linux-kernel
  Cc: sj, akpm, corbet, david, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, gourry, ying.huang, apopple, bijantabatab,
	venkataravis, emirakhur, ajayjoshi, vtavarespetr

From: Bijan Tabatabai <bijantabatab@micron.com>

This patch adds the DAMOS_INTERLEAVE action.
It interleaves pages within a given region according to the weights
in the iw_table. To reuse existing interleaving code, the target nid for
a folio is determined by calling policy_nodemask(); therefore, only folios
belonging to processes using the MPOL_WEIGHTED_INTERLEAVE policy will
have their pages migrated.

Below is an example of its usage where pages are initially interleaved at
a 1:1 ratio and then changed to be interleaved at a 2:1 ratio. The
alloc_data program simply allocates 1GB of data then sleeps.
  $ cd /sys/kernel/mm/damon/admin/kdamonds/0
  $ sudo cat ./contexts/0/schemes/0/action
  interleave
  $ echo 1 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node0
  $ echo 1 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node1
  $ numactl -w 0,1 ~/alloc_data 1G &
  $ numastat -c -p alloc_data

  Per-node process memory usage (in MBs) for PID 18473 (alloc_data)
           Node 0 Node 1 Total
           ------ ------ -----
  Huge          0      0     0
  Heap          0      0     0
  Stack         0      0     0
  Private     514    514  1027
  -------  ------ ------ -----
  Total       514    514  1028
  $ echo 2 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node0
  $ numastat -c -p alloc_data

  Per-node process memory usage (in MBs) for PID 18473 (alloc_data)
           Node 0 Node 1 Total
           ------ ------ -----
  Huge          0      0     0
  Heap          0      0     0
  Stack         0      0     0
  Private     684    343  1027
  -------  ------ ------ -----
  Total       684    343  1027

Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
---
 Documentation/mm/damon/design.rst |   2 +
 include/linux/damon.h             |   2 +
 mm/damon/paddr.c                  | 112 ++++++++++++++++++++++++++++++
 mm/damon/sysfs-schemes.c          |   1 +
 4 files changed, 117 insertions(+)

diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index ddc50db3afa4..c50d2105cea0 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -455,6 +455,8 @@ that supports each action are as below.
    Supported by ``paddr`` operations set.
  - ``migrate_cold``: Migrate the regions prioritizing colder regions.
    Supported by ``paddr`` operations set.
+ - ``interleave``: Interleave the regions according to the weighted interleave weights.
+   Supported by ``paddr`` operations set.
  - ``stat``: Do nothing but count the statistics.
    Supported by all operations sets.
 
diff --git a/include/linux/damon.h b/include/linux/damon.h
index a4011726cb3b..81d26a203337 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -117,6 +117,7 @@ struct damon_target {
  * @DAMOS_LRU_DEPRIO:	Deprioritize the region on its LRU lists.
  * @DAMOS_MIGRATE_HOT:  Migrate the regions prioritizing warmer regions.
  * @DAMOS_MIGRATE_COLD:	Migrate the regions prioritizing colder regions.
+ * @DAMOS_INTERLEAVE: Interleave the regions by the weighted interleave ratio
  * @DAMOS_STAT:		Do nothing but count the stat.
  * @NR_DAMOS_ACTIONS:	Total number of DAMOS actions
  *
@@ -136,6 +137,7 @@ enum damos_action {
 	DAMOS_LRU_DEPRIO,
 	DAMOS_MIGRATE_HOT,
 	DAMOS_MIGRATE_COLD,
+	DAMOS_INTERLEAVE,
 	DAMOS_STAT,		/* Do nothing but only record the stat */
 	NR_DAMOS_ACTIONS,
 };
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 4102a8c5f992..e989464635cd 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -535,6 +535,114 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
 	return applied * PAGE_SIZE;
 }
 
+#if defined(CONFIG_MEMCG) && defined(CONFIG_NUMA)
+struct damos_interleave_private {
+	struct list_head *folio_migration_list;
+	bool putback_lru;
+};
+
+static bool damon_pa_interleave_rmap(struct folio *folio, struct vm_area_struct *vma,
+		unsigned long addr, void *arg)
+{
+	struct mempolicy *pol;
+	struct task_struct *task;
+	pgoff_t ilx;
+	int target_nid;
+	struct damos_interleave_private *priv = arg;
+
+	task = rcu_dereference(vma->vm_mm->owner);
+	if (!task)
+		return true;
+
+	pol = get_task_policy(task);
+	if (!pol)
+		return true;
+
+	/* Getting the interleave weights only makes sense with MPOL_WEIGHTED_INTERLEAVE */
+	if (pol->mode != MPOL_WEIGHTED_INTERLEAVE) {
+		mpol_cond_put(pol);
+		return true;
+	}
+
+	ilx = vma->vm_pgoff >> folio_order(folio);
+	ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + folio_order(folio));
+	policy_nodemask(0, pol, ilx, &target_nid);
+
+	if (target_nid != NUMA_NO_NODE && folio_nid(folio) != target_nid) {
+		list_add(&folio->lru, &priv->folio_migration_list[target_nid]);
+		priv->putback_lru = false;
+	}
+
+	mpol_cond_put(pol);
+	return false;
+}
+
+static unsigned long damon_pa_interleave(struct damon_region *r, struct damos *s,
+		unsigned long *sz_filter_passed)
+{
+	struct damos_interleave_private priv;
+	struct rmap_walk_control rwc;
+	unsigned long addr, applied;
+	struct folio *folio;
+
+	priv.folio_migration_list = kmalloc_array(nr_node_ids, sizeof(struct list_head),
+		GFP_KERNEL);
+	if (!priv.folio_migration_list)
+		return 0;
+
+	for (int i = 0; i < nr_node_ids; i++)
+		INIT_LIST_HEAD(&priv.folio_migration_list[i]);
+
+	memset(&rwc, 0, sizeof(struct rmap_walk_control));
+	rwc.rmap_one = damon_pa_interleave_rmap;
+	rwc.arg = &priv;
+
+	addr = r->ar.start;
+	while (addr < r->ar.end) {
+		folio = damon_get_folio(PHYS_PFN(addr));
+
+		if (damon_pa_invalid_damos_folio(folio, s)) {
+			addr += PAGE_SIZE;
+			continue;
+		}
+
+		if (damos_pa_filter_out(s, folio))
+			goto put_folio;
+		else
+			*sz_filter_passed += folio_size(folio);
+
+		if (!folio_isolate_lru(folio))
+			goto put_folio;
+
+		priv.putback_lru = true;
+		rmap_walk(folio, &rwc);
+
+		if (priv.putback_lru)
+			folio_putback_lru(folio);
+
+put_folio:
+		addr += folio_size(folio);
+		folio_put(folio);
+	}
+
+	applied = 0;
+	for (int i = 0; i < nr_node_ids; i++) {
+		applied += damon_pa_migrate_pages(&priv.folio_migration_list[i], i);
+		cond_resched();
+	}
+
+	kfree(priv.folio_migration_list);
+	s->last_applied = folio;
+	return applied * PAGE_SIZE;
+}
+#else
+static unsigned long damon_pa_interleave(struct damon_region *r, struct damos *s,
+		unsigned long *sz_filter_passed)
+{
+	return 0;
+}
+#endif /* defined(CONFIG_MEMCG) && defined(CONFIG_NUMA) */
+
 static bool damon_pa_scheme_has_filter(struct damos *s)
 {
 	struct damos_filter *f;
@@ -584,6 +692,8 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx,
 	case DAMOS_MIGRATE_HOT:
 	case DAMOS_MIGRATE_COLD:
 		return damon_pa_migrate(r, scheme, sz_filter_passed);
+	case DAMOS_INTERLEAVE:
+		return damon_pa_interleave(r, scheme, sz_filter_passed);
 	case DAMOS_STAT:
 		return damon_pa_stat(r, scheme, sz_filter_passed);
 	default:
@@ -608,6 +718,8 @@ static int damon_pa_scheme_score(struct damon_ctx *context,
 		return damon_hot_score(context, r, scheme);
 	case DAMOS_MIGRATE_COLD:
 		return damon_cold_score(context, r, scheme);
+	case DAMOS_INTERLEAVE:
+		return damon_hot_score(context, r, scheme);
 	default:
 		break;
 	}
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 0f6c9e1fec0b..12a32e066d06 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -1597,6 +1597,7 @@ static const char * const damon_sysfs_damos_action_strs[] = {
 	"lru_deprio",
 	"migrate_hot",
 	"migrate_cold",
+	"interleave",
 	"stat",
 };
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 3/4] mm/damon: Move damon_pa_migrate_pages to ops-common
  2025-06-12 18:13 [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes Bijan Tabatabai
  2025-06-12 18:13 ` [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h Bijan Tabatabai
  2025-06-12 18:13 ` [RFC PATCH 2/4] mm/damon/paddr: Add DAMOS_INTERLEAVE action Bijan Tabatabai
@ 2025-06-12 18:13 ` Bijan Tabatabai
  2025-06-12 18:13 ` [RFC PATCH 4/4] mm/damon/vaddr: Add vaddr version of DAMOS_INTERLEAVE Bijan Tabatabai
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Bijan Tabatabai @ 2025-06-12 18:13 UTC (permalink / raw)
  To: damon, linux-mm, linux-doc, linux-kernel
  Cc: sj, akpm, corbet, david, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, gourry, ying.huang, apopple, bijantabatab,
	venkataravis, emirakhur, ajayjoshi, vtavarespetr

From: Bijan Tabatabai <bijantabatab@micron.com>

This patch moves damon_pa_migrate_pages from mm/damon/paddr.c to
mm/damon/ops-common.c and renames it to damon_migrate_pages. This gives
vaddr-based actions the ability to migrate pages as well.

Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
---
 mm/damon/ops-common.c | 123 +++++++++++++++++++++++++++++++++++++++++
 mm/damon/ops-common.h |   2 +
 mm/damon/paddr.c      | 125 +-----------------------------------------
 3 files changed, 127 insertions(+), 123 deletions(-)

diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
index b43620fee6bb..2c4fb274b7f6 100644
--- a/mm/damon/ops-common.c
+++ b/mm/damon/ops-common.c
@@ -5,6 +5,7 @@
  * Author: SeongJae Park <sj@kernel.org>
  */
 
+#include <linux/migrate.h>
 #include <linux/mmu_notifier.h>
 #include <linux/page_idle.h>
 #include <linux/pagemap.h>
@@ -12,6 +13,7 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 
+#include "../internal.h"
 #include "ops-common.h"
 
 /*
@@ -138,3 +140,124 @@ int damon_cold_score(struct damon_ctx *c, struct damon_region *r,
 	/* Return coldness of the region */
 	return DAMOS_MAX_SCORE - hotness;
 }
+
+static unsigned int __damon_migrate_folio_list(
+		struct list_head *migrate_folios, struct pglist_data *pgdat,
+		int target_nid)
+{
+	unsigned int nr_succeeded = 0;
+	nodemask_t allowed_mask = NODE_MASK_NONE;
+	struct migration_target_control mtc = {
+		/*
+		 * Allocate from 'node', or fail quickly and quietly.
+		 * When this happens, 'page' will likely just be discarded
+		 * instead of migrated.
+		 */
+		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
+			__GFP_NOWARN | __GFP_NOMEMALLOC | GFP_NOWAIT,
+		.nid = target_nid,
+		.nmask = &allowed_mask
+	};
+
+	if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE)
+		return 0;
+
+	if (list_empty(migrate_folios))
+		return 0;
+
+	/* Migration ignores all cpuset and mempolicy settings */
+	migrate_pages(migrate_folios, alloc_migrate_folio, NULL,
+		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON,
+		      &nr_succeeded);
+
+	return nr_succeeded;
+}
+
+static unsigned int damon_migrate_folio_list(struct list_head *folio_list,
+						struct pglist_data *pgdat,
+						int target_nid)
+{
+	unsigned int nr_migrated = 0;
+	struct folio *folio;
+	LIST_HEAD(ret_folios);
+	LIST_HEAD(migrate_folios);
+
+	while (!list_empty(folio_list)) {
+		struct folio *folio;
+
+		cond_resched();
+
+		folio = lru_to_folio(folio_list);
+		list_del(&folio->lru);
+
+		if (!folio_trylock(folio))
+			goto keep;
+
+		/* Relocate its contents to another node. */
+		list_add(&folio->lru, &migrate_folios);
+		folio_unlock(folio);
+		continue;
+keep:
+		list_add(&folio->lru, &ret_folios);
+	}
+	/* 'folio_list' is always empty here */
+
+	/* Migrate folios selected for migration */
+	nr_migrated += __damon_migrate_folio_list(
+			&migrate_folios, pgdat, target_nid);
+	/*
+	 * Folios that could not be migrated are still in @migrate_folios.  Add
+	 * those back on @folio_list
+	 */
+	if (!list_empty(&migrate_folios))
+		list_splice_init(&migrate_folios, folio_list);
+
+	try_to_unmap_flush();
+
+	list_splice(&ret_folios, folio_list);
+
+	while (!list_empty(folio_list)) {
+		folio = lru_to_folio(folio_list);
+		list_del(&folio->lru);
+		folio_putback_lru(folio);
+	}
+
+	return nr_migrated;
+}
+
+unsigned long damon_migrate_pages(struct list_head *folio_list,
+					    int target_nid)
+{
+	int nid;
+	unsigned long nr_migrated = 0;
+	LIST_HEAD(node_folio_list);
+	unsigned int noreclaim_flag;
+
+	if (list_empty(folio_list))
+		return nr_migrated;
+
+	noreclaim_flag = memalloc_noreclaim_save();
+
+	nid = folio_nid(lru_to_folio(folio_list));
+	do {
+		struct folio *folio = lru_to_folio(folio_list);
+
+		if (nid == folio_nid(folio)) {
+			list_move(&folio->lru, &node_folio_list);
+			continue;
+		}
+
+		nr_migrated += damon_migrate_folio_list(&node_folio_list,
+							   NODE_DATA(nid),
+							   target_nid);
+		nid = folio_nid(lru_to_folio(folio_list));
+	} while (!list_empty(folio_list));
+
+	nr_migrated += damon_migrate_folio_list(&node_folio_list,
+						   NODE_DATA(nid),
+						   target_nid);
+
+	memalloc_noreclaim_restore(noreclaim_flag);
+
+	return nr_migrated;
+}
diff --git a/mm/damon/ops-common.h b/mm/damon/ops-common.h
index cc9f5da9c012..54209a7e70e6 100644
--- a/mm/damon/ops-common.h
+++ b/mm/damon/ops-common.h
@@ -16,3 +16,5 @@ int damon_cold_score(struct damon_ctx *c, struct damon_region *r,
 			struct damos *s);
 int damon_hot_score(struct damon_ctx *c, struct damon_region *r,
 			struct damos *s);
+
+unsigned long damon_migrate_pages(struct list_head *folio_list, int target_nid);
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index e989464635cd..722d69f26e37 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -381,127 +381,6 @@ static unsigned long damon_pa_deactivate_pages(struct damon_region *r,
 			sz_filter_passed);
 }
 
-static unsigned int __damon_pa_migrate_folio_list(
-		struct list_head *migrate_folios, struct pglist_data *pgdat,
-		int target_nid)
-{
-	unsigned int nr_succeeded = 0;
-	nodemask_t allowed_mask = NODE_MASK_NONE;
-	struct migration_target_control mtc = {
-		/*
-		 * Allocate from 'node', or fail quickly and quietly.
-		 * When this happens, 'page' will likely just be discarded
-		 * instead of migrated.
-		 */
-		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
-			__GFP_NOWARN | __GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = target_nid,
-		.nmask = &allowed_mask
-	};
-
-	if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE)
-		return 0;
-
-	if (list_empty(migrate_folios))
-		return 0;
-
-	/* Migration ignores all cpuset and mempolicy settings */
-	migrate_pages(migrate_folios, alloc_migrate_folio, NULL,
-		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON,
-		      &nr_succeeded);
-
-	return nr_succeeded;
-}
-
-static unsigned int damon_pa_migrate_folio_list(struct list_head *folio_list,
-						struct pglist_data *pgdat,
-						int target_nid)
-{
-	unsigned int nr_migrated = 0;
-	struct folio *folio;
-	LIST_HEAD(ret_folios);
-	LIST_HEAD(migrate_folios);
-
-	while (!list_empty(folio_list)) {
-		struct folio *folio;
-
-		cond_resched();
-
-		folio = lru_to_folio(folio_list);
-		list_del(&folio->lru);
-
-		if (!folio_trylock(folio))
-			goto keep;
-
-		/* Relocate its contents to another node. */
-		list_add(&folio->lru, &migrate_folios);
-		folio_unlock(folio);
-		continue;
-keep:
-		list_add(&folio->lru, &ret_folios);
-	}
-	/* 'folio_list' is always empty here */
-
-	/* Migrate folios selected for migration */
-	nr_migrated += __damon_pa_migrate_folio_list(
-			&migrate_folios, pgdat, target_nid);
-	/*
-	 * Folios that could not be migrated are still in @migrate_folios.  Add
-	 * those back on @folio_list
-	 */
-	if (!list_empty(&migrate_folios))
-		list_splice_init(&migrate_folios, folio_list);
-
-	try_to_unmap_flush();
-
-	list_splice(&ret_folios, folio_list);
-
-	while (!list_empty(folio_list)) {
-		folio = lru_to_folio(folio_list);
-		list_del(&folio->lru);
-		folio_putback_lru(folio);
-	}
-
-	return nr_migrated;
-}
-
-static unsigned long damon_pa_migrate_pages(struct list_head *folio_list,
-					    int target_nid)
-{
-	int nid;
-	unsigned long nr_migrated = 0;
-	LIST_HEAD(node_folio_list);
-	unsigned int noreclaim_flag;
-
-	if (list_empty(folio_list))
-		return nr_migrated;
-
-	noreclaim_flag = memalloc_noreclaim_save();
-
-	nid = folio_nid(lru_to_folio(folio_list));
-	do {
-		struct folio *folio = lru_to_folio(folio_list);
-
-		if (nid == folio_nid(folio)) {
-			list_move(&folio->lru, &node_folio_list);
-			continue;
-		}
-
-		nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
-							   NODE_DATA(nid),
-							   target_nid);
-		nid = folio_nid(lru_to_folio(folio_list));
-	} while (!list_empty(folio_list));
-
-	nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
-						   NODE_DATA(nid),
-						   target_nid);
-
-	memalloc_noreclaim_restore(noreclaim_flag);
-
-	return nr_migrated;
-}
-
 static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
 		unsigned long *sz_filter_passed)
 {
@@ -529,7 +408,7 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
 		addr += folio_size(folio);
 		folio_put(folio);
 	}
-	applied = damon_pa_migrate_pages(&folio_list, s->target_nid);
+	applied = damon_migrate_pages(&folio_list, s->target_nid);
 	cond_resched();
 	s->last_applied = folio;
 	return applied * PAGE_SIZE;
@@ -627,7 +506,7 @@ static unsigned long damon_pa_interleave(struct damon_region *r, struct damos *s
 
 	applied = 0;
 	for (int i = 0; i < nr_node_ids; i++) {
-		applied += damon_pa_migrate_pages(&priv.folio_migration_list[i], i);
+		applied += damon_migrate_pages(&priv.folio_migration_list[i], i);
 		cond_resched();
 	}
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 4/4] mm/damon/vaddr: Add vaddr version of DAMOS_INTERLEAVE
  2025-06-12 18:13 [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes Bijan Tabatabai
                   ` (2 preceding siblings ...)
  2025-06-12 18:13 ` [RFC PATCH 3/4] mm/damon: Move damon_pa_migrate_pages to ops-common Bijan Tabatabai
@ 2025-06-12 18:13 ` Bijan Tabatabai
  2025-06-12 23:49 ` [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes SeongJae Park
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Bijan Tabatabai @ 2025-06-12 18:13 UTC (permalink / raw)
  To: damon, linux-mm, linux-doc, linux-kernel
  Cc: sj, akpm, corbet, david, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, gourry, ying.huang, apopple, bijantabatab,
	venkataravis, emirakhur, ajayjoshi, vtavarespetr

From: Bijan Tabatabai <bijantabatab@micron.com>

This patch adds a vaddr implementation of the DAMOS_INTERLEAVE action.

Below is an example of its usage where pages are initially interleaved at
a 1:1 ratio and then changed to be interleaved at a 2:1 ratio. The
alloc_data program simply allocates 1GB of data then sleeps.
  $ echo 1 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node0
  $ echo 1 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node1
  $ numactl -w 0,1 ./alloc_data 1G&
  [1] 11447
  $ cat interleave_vaddr.yaml
  kdamonds:
  - state: null
    pid: null
    contexts:
    - ops: vaddr
      targets:
      - pid: 11447
        regions: []
      intervals:
        sample_us: 200 ms
        aggr_us: 5 s
        ops_update_us: 10 s
      nr_regions:
        min: 200
        max: 500
      schemes:
      - action: interleave
        access_pattern:
          sz_bytes:
            min: 0 B
            max: max
          nr_accesses:
            min: 0 %
            max: 100 %
          age:
            min: 0 ns
            max: max

  $ sudo ./damo/damo start interleave_vaddr.yaml
  $ numastat -c -p 11447

  Per-node process memory usage (in MBs) for PID 11447 (alloc_data)
           Node 0 Node 1 Total
           ------ ------ -----
  Huge          0      0     0
  Heap          0      0     0
  Stack         0      0     0
  Private     514    514  1027
  -------  ------ ------ -----
  Total       514    514  1027
  $ echo 2 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node0
  $ numastat -c -p 11447

  Per-node process memory usage (in MBs) for PID 11447 (alloc_data)
           Node 0 Node 1 Total
           ------ ------ -----
  Huge          0      0     0
  Heap          0      0     0
  Stack         0      0     0
  Private     684    343  1027
  -------  ------ ------ -----
  Total       684    343  1027

Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
---
 Documentation/mm/damon/design.rst |   2 +-
 mm/damon/ops-common.c             |  13 +++
 mm/damon/ops-common.h             |   2 +
 mm/damon/paddr.c                  |  11 +--
 mm/damon/vaddr.c                  | 135 ++++++++++++++++++++++++++++++
 5 files changed, 155 insertions(+), 8 deletions(-)

diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index c50d2105cea0..a79ba62f820b 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -456,7 +456,7 @@ that supports each action are as below.
  - ``migrate_cold``: Migrate the regions prioritizing colder regions.
    Supported by ``paddr`` operations set.
  - ``interleave``: Interleave the regions according to the weighted interleave weights.
-   Supported by ``paddr`` operations set.
+   Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set.
  - ``stat``: Do nothing but count the statistics.
    Supported by all operations sets.
 
diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
index 2c4fb274b7f6..59d92404fc8f 100644
--- a/mm/damon/ops-common.c
+++ b/mm/damon/ops-common.c
@@ -261,3 +261,16 @@ unsigned long damon_migrate_pages(struct list_head *folio_list,
 
 	return nr_migrated;
 }
+
+int damon_interleave_target_nid(unsigned long addr, struct vm_area_struct *vma,
+		struct mempolicy *pol, struct folio *folio)
+{
+	pgoff_t ilx;
+	int target_nid;
+
+	ilx = vma->vm_pgoff >> folio_order(folio);
+	ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + folio_order(folio));
+	policy_nodemask(0, pol, ilx, &target_nid);
+
+	return target_nid;
+}
diff --git a/mm/damon/ops-common.h b/mm/damon/ops-common.h
index 54209a7e70e6..bacb4de92dc9 100644
--- a/mm/damon/ops-common.h
+++ b/mm/damon/ops-common.h
@@ -18,3 +18,5 @@ int damon_hot_score(struct damon_ctx *c, struct damon_region *r,
 			struct damos *s);
 
 unsigned long damon_migrate_pages(struct list_head *folio_list, int target_nid);
+int damon_interleave_target_nid(unsigned long addr, struct vm_area_struct *vma,
+			struct mempolicy *pol, struct folio *folio);
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 722d69f26e37..93e3c72b54c7 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -415,7 +415,7 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
 }
 
 #if defined(CONFIG_MEMCG) && defined(CONFIG_NUMA)
-struct damos_interleave_private {
+struct damos_pa_interleave_private {
 	struct list_head *folio_migration_list;
 	bool putback_lru;
 };
@@ -425,9 +425,8 @@ static bool damon_pa_interleave_rmap(struct folio *folio, struct vm_area_struct
 {
 	struct mempolicy *pol;
 	struct task_struct *task;
-	pgoff_t ilx;
 	int target_nid;
-	struct damos_interleave_private *priv = arg;
+	struct damos_pa_interleave_private *priv = arg;
 
 	task = rcu_dereference(vma->vm_mm->owner);
 	if (!task)
@@ -443,9 +442,7 @@ static bool damon_pa_interleave_rmap(struct folio *folio, struct vm_area_struct
 		return true;
 	}
 
-	ilx = vma->vm_pgoff >> folio_order(folio);
-	ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + folio_order(folio));
-	policy_nodemask(0, pol, ilx, &target_nid);
+	target_nid = damon_interleave_target_nid(addr, vma, pol, folio);
 
 	if (target_nid != NUMA_NO_NODE && folio_nid(folio) != target_nid) {
 		list_add(&folio->lru, &priv->folio_migration_list[target_nid]);
@@ -459,7 +456,7 @@ static bool damon_pa_interleave_rmap(struct folio *folio, struct vm_area_struct
 static unsigned long damon_pa_interleave(struct damon_region *r, struct damos *s,
 		unsigned long *sz_filter_passed)
 {
-	struct damos_interleave_private priv;
+	struct damos_pa_interleave_private priv;
 	struct rmap_walk_control rwc;
 	unsigned long addr, applied;
 	struct folio *folio;
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 46554e49a478..1d1170f49317 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -9,12 +9,14 @@
 
 #include <linux/highmem.h>
 #include <linux/hugetlb.h>
+#include <linux/mempolicy.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
 #include <linux/page_idle.h>
 #include <linux/pagewalk.h>
 #include <linux/sched/mm.h>
 
+#include "../internal.h"
 #include "ops-common.h"
 
 #ifdef CONFIG_DAMON_VADDR_KUNIT_TEST
@@ -653,6 +655,137 @@ static unsigned long damos_madvise(struct damon_target *target,
 }
 #endif	/* CONFIG_ADVISE_SYSCALLS */
 
+#ifdef CONFIG_NUMA
+struct damos_va_interleave_private {
+	struct list_head *folio_migration_list;
+	struct mempolicy *pol;
+};
+
+static void damos_va_interleave_folio(unsigned long addr, struct folio *folio,
+	struct vm_area_struct *vma, struct damos_va_interleave_private *priv)
+{
+	int target_nid;
+
+	if (!folio_isolate_lru(folio))
+		return;
+
+	target_nid = damon_interleave_target_nid(addr, vma, priv->pol, folio);
+
+	if (target_nid != NUMA_NO_NODE && folio_nid(folio) != target_nid)
+		list_add(&folio->lru, &priv->folio_migration_list[target_nid]);
+	else
+		folio_putback_lru(folio);
+
+}
+
+static int damos_va_interleave_pmd(pmd_t *pmd, unsigned long addr,
+		unsigned long next, struct mm_walk *walk)
+{
+	struct damos_va_interleave_private *priv = walk->private;
+	struct folio *folio;
+	spinlock_t *ptl;
+	pmd_t pmde;
+
+	ptl = pmd_lock(walk->mm, pmd);
+	pmde = pmdp_get(pmd);
+
+	if (!pmd_present(pmde) || !pmd_trans_huge(pmde))
+		goto unlock;
+
+	folio = damon_get_folio(pmd_pfn(pmde));
+	if (!folio)
+		goto unlock;
+
+	damos_va_interleave_folio(addr, folio, walk->vma, priv);
+
+	folio_put(folio);
+unlock:
+	spin_unlock(ptl);
+	return 0;
+}
+
+static int damos_va_interleave_pte(pte_t *pte, unsigned long addr,
+		unsigned long next, struct mm_walk *walk)
+{
+	struct damos_va_interleave_private *priv = walk->private;
+	struct folio *folio;
+
+	if (pte_none(*pte) || !pte_present(*pte))
+		return 0;
+
+	folio = vm_normal_folio(walk->vma, addr, *pte);
+	if (!folio)
+		return 0;
+	folio_get(folio);
+
+	damos_va_interleave_folio(addr, folio, walk->vma, priv);
+
+	folio_put(folio);
+	return 0;
+}
+
+static unsigned long damos_va_interleave(struct damon_target *target,
+		struct damon_region *r, struct damos *s)
+{
+	struct damos_va_interleave_private priv;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	int ret;
+	unsigned long applied = 0;
+	struct mm_walk_ops walk_ops = {
+		.pmd_entry = damos_va_interleave_pmd,
+		.pte_entry = damos_va_interleave_pte,
+	};
+
+	task = damon_get_task_struct(target);
+	if (!task)
+		return 0;
+
+	priv.pol = get_task_policy(task);
+	if (!priv.pol)
+		goto put_task;
+
+	if (priv.pol->mode != MPOL_WEIGHTED_INTERLEAVE)
+		goto put_pol;
+
+	priv.folio_migration_list = kmalloc_array(nr_node_ids, sizeof(struct list_head),
+		GFP_KERNEL);
+	if (!priv.folio_migration_list)
+		goto put_pol;
+
+	for (int i = 0; i < nr_node_ids; i++)
+		INIT_LIST_HEAD(&priv.folio_migration_list[i]);
+
+	mm = damon_get_mm(target);
+	if (!mm)
+		goto free_folio_list;
+
+	mmap_read_lock(mm);
+	ret = walk_page_range(mm, r->ar.start, r->ar.end, &walk_ops, &priv);
+	mmap_read_unlock(mm);
+	mmput(mm);
+
+	for (int i = 0; i < nr_node_ids; i++) {
+		applied += damon_migrate_pages(&priv.folio_migration_list[i], i);
+		cond_resched();
+	}
+
+free_folio_list:
+	kfree(priv.folio_migration_list);
+put_pol:
+	mpol_cond_put(priv.pol);
+put_task:
+	put_task_struct(task);
+	return applied * PAGE_SIZE;
+}
+#else
+static unsigned long damos_va_interleave(struct damon_target *target,
+		struct damon_region *r, struct damos *s)
+{
+	return 0;
+}
+#endif /* CONFIG_NUMA */
+
 static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
 		struct damon_target *t, struct damon_region *r,
 		struct damos *scheme, unsigned long *sz_filter_passed)
@@ -675,6 +808,8 @@ static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
 	case DAMOS_NOHUGEPAGE:
 		madv_action = MADV_NOHUGEPAGE;
 		break;
+	case DAMOS_INTERLEAVE:
+		return damos_va_interleave(t, r, scheme);
 	case DAMOS_STAT:
 		return 0;
 	default:
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes
  2025-06-12 18:13 [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes Bijan Tabatabai
                   ` (3 preceding siblings ...)
  2025-06-12 18:13 ` [RFC PATCH 4/4] mm/damon/vaddr: Add vaddr version of DAMOS_INTERLEAVE Bijan Tabatabai
@ 2025-06-12 23:49 ` SeongJae Park
  2025-06-13  2:41   ` Huang, Ying
  2025-06-13 15:44   ` Bijan Tabatabai
  2025-06-13  9:55 ` Rakie Kim
  2025-06-13 15:25 ` Joshua Hahn
  6 siblings, 2 replies; 30+ messages in thread
From: SeongJae Park @ 2025-06-12 23:49 UTC (permalink / raw)
  To: Bijan Tabatabai
  Cc: SeongJae Park, damon, linux-mm, linux-doc, linux-kernel, akpm,
	corbet, david, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, bijantabatab,
	venkataravis, emirakhur, ajayjoshi, vtavarespetr

Hi Bijan,

On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

> From: Bijan Tabatabai <bijantabatab@micron.com>
> 
> A recent patch set automatically set the interleave weight for each node
> according to the node's maximum bandwidth [1]. In another thread, the patch
> set's author, Joshua Hahn, wondered if/how these weights should be changed
> if the bandwidth utilization of the system changes [2].

Thank you for sharing the background.  I do agree it is an important question.

> 
> This patch set adds the mechanism for dynamically changing how application
> data is interleaved across nodes while leaving the policy of what the
> interleave weights should be to userspace. It does this by adding a new
> DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both
> paddr and vaddr operations sets. Using the paddr version is useful for
> managing page placement globally. Using the vaddr version limits tracking
> to one process per kdamond instance, but the va based tracking better
> captures spacial locality.
> 
> DAMOS_INTERLEAVE interleaves pages within a region across nodes using the
> interleave weights at /sys/kernel/mm/mempolicy/weighted_interleave/node<N>
> and the page placement algorithm in weighted_interleave_nid via
> policy_nodemask.

So, what DAMOS_INTERLEAVE will do is migrate pages of a given DAMON region
across multiple nodes, following the interleaving weights, right?  We already
have DAMOS actions for migrating pages of a given DAMON region, namely
DAMOS_MIGRATE_{HOT,COLD}.  The actions support only a single migration target
node, though.  Hence, from my perspective, DAMOS_INTERLEAVE looks like an
extended version of DAMOS_MIGRATE_{HOT,COLD} with flexible target node
selection.  In a way, DAMOS_INTERLEAVE is rather a restricted version of
DAMOS_MIGRATE_{HOT,COLD}, since it prioritizes only hotter regions, if I read
the second patch correctly.

What about extending DAMOS_MIGRATE_{HOT,COLD} to support your use case?  For
example, letting users enter a special keyword, say, 'weighted_interleave', to
the 'target_nid' DAMON sysfs file.  In that case, DAMOS_MIGRATE_{HOT,COLD}
would work in the way you are implementing DAMOS_INTERLEAVE.

> We chose to reuse the mempolicy weighted interleave
> infrastructure to avoid reimplementing code. However, this has the awkward
> side effect that only pages that are mapped to processes using
> MPOL_WEIGHTED_INTERLEAVE will be migrated according to new interleave
> weights. This might be fine because workloads that want their data to be
> dynamically interleaved will want their newly allocated data to be
> interleaved at the same ratio.

Makes sense to me.  I'm not very familiar with interleaving and memory policy,
though.

> 
> If exposing policy_nodemask is undesirable,

I see you are exposing it in include/linux/mempolicy.h in the first patch of
this series, and I agree it is not desirable to unnecessarily expose functions.
But you could reduce the exposure by declaring it in mm/internal.h instead.
The mempolicy maintainers and reviewers whom you kindly Cc-ed on this mail
could give us good opinions.

> we have two alternative methods
> for having DAMON access the interleave weights it should use. We would
> appreciate feedback on which method is preferred.
> 1. Use mpol_misplaced instead
>   pros: mpol_misplaced is already exposed publically
>   cons: Would require refactoring mpol_misplaced to take a struct vm_area
>   instead of a struct vm_fault, and require refactoring mpol_misplaced and
>   get_vma_policy to take in a struct task_struct rather than just using
>   current. Also requires processes to use MPOL_WEIGHTED_INTERLEAVE.

I feel the cons outweigh the pros.  The mempolicy people's opinion would
matter more, though.

> 2. Add a new field to struct damos, similar to target_nid for the
> MIGRATE_HOT/COLD schemes.
>   pros: Keeps changes contained inside DAMON. Would not require processes
>   to use MPOL_WEIGHTED_INTERLEAVE.
>   cons: Duplicates page placement code. Requires discussion on the sysfs
>   interface to use for users to pass in the interleave weights.

I agree this is also somewhat doable.  In the future, we might want to
implement this anyway, for non-global and flexible memory interleaving.  But
if the memory policy people are ok with reusing policy_nodemask(), I don't
think we need to do this now.

> 
> This patchset was tested on an AMD machine with a NUMA node with CPUs
> attached to DDR memory and a cpu-less NUMA node attached to CXL memory.
> However, this patch set should generalize to other architectures and number
> of NUMA nodes.

I see the test results in the commit messages of the second and the fourth
patches.  In the next version, letting readers know that here would be nice.
Also adding a short description of what you confirmed with the tests here
(e.g., with the test we confirmed this patch functions as expected [and
achieves X % Y metric wins]) would be nice.

> 
> Patches Sequence
> ________________
> The first patch exposes policy_nodemask() in include/linux/mempolicy.h to
> let DAMON determine where a page should be placed for interleaving.
> The second patch implements DAMOS_INTERLEAVE as a paddr action.
> The third patch moves the DAMON page migration code to ops-common, allowing
> vaddr actions to use it.
> Finally, the fourth patch implements a vaddr version of DAMOS_INTERLEAVE.

I'll try to take a look at the code and add comments if something stands out,
but let's focus on the high-level discussion first, especially whether to
implement this as a new DAMOS action or extend the DAMOS_MIGRATE_{HOT,COLD}
actions.

I think it would also be nice if you could add more explanation about why you
picked DAMON as the way to implement this feature.  I assume that's because you
found opportunities to utilize this feature in some access-aware way or by
utilizing DAMOS features.  I was actually able to imagine some such usages.
For example, we could do the re-interleaving for hot or cold pages of specific
NUMA nodes or specific virtual address ranges first, to make the interleaving
take effect faster.

Also, we could apply a sort of speed limit to the interleaving migration to
ensure it doesn't consume too much memory bandwidth.  The limit could be
arbitrarily user-defined or auto-tuned for a specific system metric value
(e.g., memory bandwidth balance?).
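
The existing DAMOS quota mechanism looks like a natural fit for such a limit.
As a rough sketch (the struct damos_quota field names below reflect my
reading of the current code; treat them as illustrative):

  /* e.g., apply at most 256 MiB of interleave migration per second */
  struct damos_quota quota = {
  	.ms = 0,			/* no time quota */
  	.sz = 256 * 1024 * 1024,	/* size quota, in bytes */
  	.reset_interval = 1000,		/* charging window, in ms */
  };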

If you have such use cases in mind or in your test setups, sharing those here
or in the next versions of this would be very helpful for reviewers.

> 
> [1] https://lore.kernel.org/linux-mm/20250520141236.2987309-1-joshua.hahnjy@gmail.com/
> [2] https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes
  2025-06-12 23:49 ` [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes SeongJae Park
@ 2025-06-13  2:41   ` Huang, Ying
  2025-06-13 16:02     ` Bijan Tabatabai
  2025-06-13 15:44   ` Bijan Tabatabai
  1 sibling, 1 reply; 30+ messages in thread
From: Huang, Ying @ 2025-06-13  2:41 UTC (permalink / raw)
  To: Bijan Tabatabai, SeongJae Park
  Cc: damon, linux-mm, linux-doc, linux-kernel, akpm, corbet, david,
	ziy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
	apopple, bijantabatab, venkataravis, emirakhur, ajayjoshi,
	vtavarespetr

SeongJae Park <sj@kernel.org> writes:

> Hi Bijan,
>
> On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
>
>> From: Bijan Tabatabai <bijantabatab@micron.com>
>> 
>> A recent patch set automatically set the interleave weight for each node
>> according to the node's maximum bandwidth [1]. In another thread, the patch
>> set's author, Joshua Hahn, wondered if/how these weights should be changed
>> if the bandwidth utilization of the system changes [2].
>
> Thank you for sharing the background.  I do agree it is an important question.
>
>> 
>> This patch set adds the mechanism for dynamically changing how application
>> data is interleaved across nodes while leaving the policy of what the
>> interleave weights should be to userspace. It does this by adding a new
>> DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both
>> paddr and vaddr operations sets. Using the paddr version is useful for
>> managing page placement globally. Using the vaddr version limits tracking
>> to one process per kdamond instance, but the va based tracking better
>> captures spacial locality.
>> 
>> DAMOS_INTERLEAVE interleaves pages within a region across nodes using the
>> interleave weights at /sys/kernel/mm/mempolicy/weighted_interleave/node<N>
>> and the page placement algorithm in weighted_interleave_nid via
>> policy_nodemask.
>
> So, what DAMOS_INTERLEAVE will do is, migrating pages of a given DAMON region
> into multiple nodes, following interleaving weights, right?

Hi, Bijan,

It's hard for me to understand what you want to do from your original
patch description.  SeongJae's description is helpful.  So, can you add
more description in the future version?

So, you will migrate already-allocated pages to follow the new weights?  How
does this interact with the weights specified explicitly by users?  Usually we
respect explicit user policy.

> We already have
> DAMOS actions for migrating pages of a given DAMON region, namely
> DAMOS_MIGRATE_{HOT,COLD}.  The actions support only single migration target
> node, though.  To my perspective, hence, DAMOS_INTERLEAVE looks like an
> extended version of DAMOS_MIGRATE_{HOT,COLD} for flexible target node
> selections.  In a way, DAMOS_INTERLEAVE is rather a restricted version of
> DAMOS_MIGRATE_{HOT,COLD}, since it prioritizes only hotter regions, if I read
> the second patch correctly.
>
> What about extending DAMOS_MIGRATE_{HOT,COLD} to support your use case?  For
> example, letting users enter special keyword, say, 'weighted_interleave' to
> 'target_nid' DAMON sysfs file.  In the case, DAMOS_MIGRATE_{HOT,COLD} would
> work in the way you are implementing DAMOS_INTERLEAVE.
>
>> We chose to reuse the mempolicy weighted interleave
>> infrastructure to avoid reimplementing code. However, this has the awkward
>> side effect that only pages that are mapped to processes using
>> MPOL_WEIGHTED_INTERLEAVE will be migrated according to new interleave
>> weights. This might be fine because workloads that want their data to be
>> dynamically interleaved will want their newly allocated data to be
>> interleaved at the same ratio.
>
> Makes sense to me.  I'm not very familiar with interleaving and memory policy,
> though.
>
>> 
>> If exposing policy_nodemask is undesirable,
>
> I see you are exposing it on include/linux/mempolicy.h on the first patch of
> this series, and I agree it is not desirable to unnecessarily expose functions.
> But you could reduce the exposure by exporting it on mm/internal.h instead.
> mempolicy maitnainers and reviewers who you kindly Cc-ed to this mail could
> give us good opinions.
>
>> we have two alternative methods
>> for having DAMON access the interleave weights it should use. We would
>> appreciate feedback on which method is preferred.
>> 1. Use mpol_misplaced instead
>>   pros: mpol_misplaced is already exposed publically
>>   cons: Would require refactoring mpol_misplaced to take a struct vm_area
>>   instead of a struct vm_fault, and require refactoring mpol_misplaced and
>>   get_vma_policy to take in a struct task_struct rather than just using
>>   current. Also requires processes to use MPOL_WEIGHTED_INTERLEAVE.
>
> I feel cons is larger than pros.  mpolicy people's opinion would matter more,
> though.
>
>> 2. Add a new field to struct damos, similar to target_nid for the
>> MIGRATE_HOT/COLD schemes.
>>   pros: Keeps changes contained inside DAMON. Would not require processes
>>   to use MPOL_WEIGHTED_INTERLEAVE.
>>   cons: Duplicates page placement code. Requires discussion on the sysfs
>>   interface to use for users to pass in the interleave weights.
>
> I agree this is also somewhat doable.  In future, we might want to implement
> this anyway, for non-global and flexible memory interleaving.  But if memory
> policy people are ok with reusing policy_nodemask(), I don't think we need to
> do this now.
>
>> 
>> This patchset was tested on an AMD machine with a NUMA node with CPUs
>> attached to DDR memory and a cpu-less NUMA node attached to CXL memory.
>> However, this patch set should generalize to other architectures and number
>> of NUMA nodes.
>
> I show the test results on the commit messages of the second and the fourth
> patches.  In the next version, letting readers know that here would be nice.
> Also adding a short description of what you confirmed with the tests here
> (e.g., with the test we confirmed this patch functions as expected [and
> achieves X % Y metric wins]) would be nice.
>
>> 
>> Patches Sequence
>> ________________
>> The first patch exposes policy_nodemask() in include/linux/mempolicy.h to
>> let DAMON determine where a page should be placed for interleaving.
>> The second patch implements DAMOS_INTERLEAVE as a paddr action.
>> The third patch moves the DAMON page migration code to ops-common, allowing
>> vaddr actions to use it.
>> Finally, the fourth patch implements a vaddr version of DAMOS_INTERLEAVE.
>
> I'll try to take look on code and add comments if something stands out, but
> let's focus on the high level discussion first, especially whether to implement
> this as a new DAMOS action, or extend DAMOS_MIGRATE_{HOT,COLD} actions.
>
> I think it would also be nice if you could add more explanation about why you
> picked DAMON as a way to implement this feature.  I assume that's because you
> found opportunities to utilize this feature in some access-aware way or
> utilizing DAMOS features.  I was actually able to imagine some such usages.
> For example, we could do the re-interleaving for hot or cold pages of specific
> NUMA nodes or specific virtual address ranges first to make interleaving
> effective faster.
>
> Also we could apply a sort of speed limit for the interleaving-migration to
> ensure it doesn't consume memory bandwidth too much.  The limit could be
> arbitrarily user-defined or auto-tuned for specific system metrics value (e.g.,
> memory bandwidth balance?).
>
> If you have such use case in your mind or your test setups, sharing those here
> or on the next versions of this would be very helpful for reviewers.
>
>> 
>> [1] https://lore.kernel.org/linux-mm/20250520141236.2987309-1-joshua.hahnjy@gmail.com/
>> [2] https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes
  2025-06-12 18:13 [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes Bijan Tabatabai
                   ` (4 preceding siblings ...)
  2025-06-12 23:49 ` [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes SeongJae Park
@ 2025-06-13  9:55 ` Rakie Kim
  2025-06-13 16:12   ` Bijan Tabatabai
  2025-06-13 15:25 ` Joshua Hahn
  6 siblings, 1 reply; 30+ messages in thread
From: Rakie Kim @ 2025-06-13  9:55 UTC (permalink / raw)
  To: Bijan Tabatabai
  Cc: sj, akpm, corbet, david, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, gourry, ying.huang, apopple, bijantabatab,
	venkataravis, emirakhur, ajayjoshi, vtavarespetr, damon, linux-mm,
	linux-doc, linux-kernel, kernel_team

On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
> From: Bijan Tabatabai <bijantabatab@micron.com>
> 
> A recent patch set automatically set the interleave weight for each node
> according to the node's maximum bandwidth [1]. In another thread, the patch
> set's author, Joshua Hahn, wondered if/how these weights should be changed
> if the bandwidth utilization of the system changes [2].
> 
> This patch set adds the mechanism for dynamically changing how application
> data is interleaved across nodes while leaving the policy of what the
> interleave weights should be to userspace. It does this by adding a new
> DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both
> paddr and vaddr operations sets. Using the paddr version is useful for
> managing page placement globally. Using the vaddr version limits tracking
> to one process per kdamond instance, but the va based tracking better
> captures spacial locality.

Hi Bijan,

Thank you for explaining the motivation and need behind this patch.
I believe it's important to consider the case where a new memory node
is added and the interleave weight values are recalculated.

If a new memory node (say, node2) is added, there are two possible
approaches to consider.

1. Migrating pages to the newly added node2.
   In this case, there is a potential issue where pages may be migrated
   to node2, even though it is not part of the nodemask set by the user.

2. Ignoring the newly added node2 and continuing to use only the existing
   nodemask for migrations.
   However, if the weight values have been updated considering node2's
   performance, avoiding node2 might reduce the effectiveness of using
   Weighted Interleave.

It would be helpful to consider these two options or explore other
possible solutions to ensure correctness.

Rakie

> 
> DAMOS_INTERLEAVE interleaves pages within a region across nodes using the
> interleave weights at /sys/kernel/mm/mempolicy/weighted_interleave/node<N>
> and the page placement algorithm in weighted_interleave_nid via
> policy_nodemask. We chose to reuse the mempolicy weighted interleave
> infrastructure to avoid reimplementing code. However, this has the awkward
> side effect that only pages that are mapped to processes using
> MPOL_WEIGHTED_INTERLEAVE will be migrated according to new interleave
> weights. This might be fine because workloads that want their data to be
> dynamically interleaved will want their newly allocated data to be
> interleaved at the same ratio.
> 
> If exposing policy_nodemask is undesirable, we have two alternative methods
> for having DAMON access the interleave weights it should use. We would
> appreciate feedback on which method is preferred.
> 1. Use mpol_misplaced instead
>   pros: mpol_misplaced is already exposed publically
>   cons: Would require refactoring mpol_misplaced to take a struct vm_area
>   instead of a struct vm_fault, and require refactoring mpol_misplaced and
>   get_vma_policy to take in a struct task_struct rather than just using
>   current. Also requires processes to use MPOL_WEIGHTED_INTERLEAVE.
> 2. Add a new field to struct damos, similar to target_nid for the
> MIGRATE_HOT/COLD schemes.
>   pros: Keeps changes contained inside DAMON. Would not require processes
>   to use MPOL_WEIGHTED_INTERLEAVE.
>   cons: Duplicates page placement code. Requires discussion on the sysfs
>   interface to use for users to pass in the interleave weights.
> 
> This patchset was tested on an AMD machine with a NUMA node with CPUs
> attached to DDR memory and a cpu-less NUMA node attached to CXL memory.
> However, this patch set should generalize to other architectures and number
> of NUMA nodes.
> 
> Patches Sequence
> ________________
> The first patch exposes policy_nodemask() in include/linux/mempolicy.h to
> let DAMON determine where a page should be placed for interleaving.
> The second patch implements DAMOS_INTERLEAVE as a paddr action.
> The third patch moves the DAMON page migration code to ops-common, allowing
> vaddr actions to use it.
> Finally, the fourth patch implements a vaddr version of DAMOS_INTERLEAVE.
> 
> [1] https://lore.kernel.org/linux-mm/20250520141236.2987309-1-joshua.hahnjy@gmail.com/
> [2] https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/
> 
> Bijan Tabatabai (4):
>   mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h
>   mm/damon/paddr: Add DAMOS_INTERLEAVE action
>   mm/damon: Move damon_pa_migrate_pages to ops-common
>   mm/damon/vaddr: Add vaddr version of DAMOS_INTERLEAVE
> 
>  Documentation/mm/damon/design.rst |   2 +
>  include/linux/damon.h             |   2 +
>  include/linux/mempolicy.h         |   2 +
>  mm/damon/ops-common.c             | 136 ++++++++++++++++++++
>  mm/damon/ops-common.h             |   4 +
>  mm/damon/paddr.c                  | 198 +++++++++++++-----------------
>  mm/damon/sysfs-schemes.c          |   1 +
>  mm/damon/vaddr.c                  | 124 +++++++++++++++++++
>  mm/mempolicy.c                    |   4 +-
>  9 files changed, 360 insertions(+), 113 deletions(-)
> 
> -- 
> 2.43.5
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 2/4] mm/damon/paddr: Add DAMOS_INTERLEAVE action
  2025-06-12 18:13 ` [RFC PATCH 2/4] mm/damon/paddr: Add DAMOS_INTERLEAVE action Bijan Tabatabai
@ 2025-06-13 13:43   ` David Hildenbrand
  0 siblings, 0 replies; 30+ messages in thread
From: David Hildenbrand @ 2025-06-13 13:43 UTC (permalink / raw)
  To: Bijan Tabatabai, damon, linux-mm, linux-doc, linux-kernel
  Cc: sj, akpm, corbet, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, bijantabatab,
	venkataravis, emirakhur, ajayjoshi, vtavarespetr

On 12.06.25 20:13, Bijan Tabatabai wrote:
> From: Bijan Tabatabai <bijantabatab@micron.com>
> 
> This patch adds the DAMOS_INTERLEAVE action.
> It interleaves pages inside of a given region according to the weights
> in the iw_table. To reuse existing interleaving code, the target nid for
> a folio is determined by calling policy_nodemask, therefore only folios
> belonging to processes using the MPOL_WEIGHTED_INTERLEAVE policy will
> have their pages migrated.
> 
> Below is an example of its usage where pages are initially interleaved at
> a 1:1 ratio and then changed to be interleaved at a 2:1 ratio. The
> alloc_data program simply allocates 1GB of data then sleeps.
>    $ cd /sys/kernel/mm/damon/admin/kdamonds/0
>    $ sudo cat ./contexts/0/schemes/0/action
>    interleave
>    $ echo 1 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node0
>    $ echo 1 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node1
>    $ numactl -w 0,1 ~/alloc_data 1G &
>    $ numastat -c -p alloc_data
> 
>    Per-node process memory usage (in MBs) for PID 18473 (alloc_data)
>             Node 0 Node 1 Total
>             ------ ------ -----
>    Huge          0      0     0
>    Heap          0      0     0
>    Stack         0      0     0
>    Private     514    514  1027
>    -------  ------ ------ -----
>    Total       514    514  1028
>    $ echo 2 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node0
>    $ numastat -c -p alloc_data
> 
>    Per-node process memory usage (in MBs) for PID 18473 (alloc_data)
>             Node 0 Node 1 Total
>             ------ ------ -----
>    Huge          0      0     0
>    Heap          0      0     0
>    Stack         0      0     0
>    Private     684    343  1027
>    -------  ------ ------ -----
>    Total       684    343  1027
> 
> Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
> ---
>   Documentation/mm/damon/design.rst |   2 +
>   include/linux/damon.h             |   2 +
>   mm/damon/paddr.c                  | 112 ++++++++++++++++++++++++++++++
>   mm/damon/sysfs-schemes.c          |   1 +
>   4 files changed, 117 insertions(+)
> 
> diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
> index ddc50db3afa4..c50d2105cea0 100644
> --- a/Documentation/mm/damon/design.rst
> +++ b/Documentation/mm/damon/design.rst
> @@ -455,6 +455,8 @@ that supports each action are as below.
>      Supported by ``paddr`` operations set.
>    - ``migrate_cold``: Migrate the regions prioritizing colder regions.
>      Supported by ``paddr`` operations set.
> + - ``interleave``: Interleave the regions according to the weighted interleave weights.
> +   Supported by ``paddr`` operations set.
>    - ``stat``: Do nothing but count the statistics.
>      Supported by all operations sets.
>   
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> index a4011726cb3b..81d26a203337 100644
> --- a/include/linux/damon.h
> +++ b/include/linux/damon.h
> @@ -117,6 +117,7 @@ struct damon_target {
>    * @DAMOS_LRU_DEPRIO:	Deprioritize the region on its LRU lists.
>    * @DAMOS_MIGRATE_HOT:  Migrate the regions prioritizing warmer regions.
>    * @DAMOS_MIGRATE_COLD:	Migrate the regions prioritizing colder regions.
> + * @DAMOS_INTERLEAVE: Interleave the regions by the weighted interleave ratio
>    * @DAMOS_STAT:		Do nothing but count the stat.
>    * @NR_DAMOS_ACTIONS:	Total number of DAMOS actions
>    *
> @@ -136,6 +137,7 @@ enum damos_action {
>   	DAMOS_LRU_DEPRIO,
>   	DAMOS_MIGRATE_HOT,
>   	DAMOS_MIGRATE_COLD,
> +	DAMOS_INTERLEAVE,
>   	DAMOS_STAT,		/* Do nothing but only record the stat */
>   	NR_DAMOS_ACTIONS,
>   };
> diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
> index 4102a8c5f992..e989464635cd 100644
> --- a/mm/damon/paddr.c
> +++ b/mm/damon/paddr.c
> @@ -535,6 +535,114 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
>   	return applied * PAGE_SIZE;
>   }
>   
> +#if defined(CONFIG_MEMCG) && defined(CONFIG_NUMA)
> +struct damos_interleave_private {
> +	struct list_head *folio_migration_list;
> +	bool putback_lru;
> +};
> +
> +static bool damon_pa_interleave_rmap(struct folio *folio, struct vm_area_struct *vma,
> +		unsigned long addr, void *arg)
> +{
> +	struct mempolicy *pol;
> +	struct task_struct *task;
> +	pgoff_t ilx;
> +	int target_nid;
> +	struct damos_interleave_private *priv = arg;
> +
> +	task = rcu_dereference(vma->vm_mm->owner);
> +	if (!task)
> +		return true;
> +
> +	pol = get_task_policy(task);
> +	if (!pol)
> +		return true;

Why is this not using get_vma_policy(), which will fall back to the task
policy in case there is no per-vma policy?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h
  2025-06-12 18:13 ` [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h Bijan Tabatabai
@ 2025-06-13 13:45   ` David Hildenbrand
  2025-06-13 16:33     ` Bijan Tabatabai
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2025-06-13 13:45 UTC (permalink / raw)
  To: Bijan Tabatabai, damon, linux-mm, linux-doc, linux-kernel
  Cc: sj, akpm, corbet, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, bijantabatab,
	venkataravis, emirakhur, ajayjoshi, vtavarespetr

On 12.06.25 20:13, Bijan Tabatabai wrote:
> From: Bijan Tabatabai <bijantabatab@micron.com>
> 
> This patch is to allow DAMON to call policy_nodemask() so it can
> determine where to place a page for interleaving.
> 
> Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
> ---
>   include/linux/mempolicy.h | 9 +++++++++
>   mm/mempolicy.c            | 4 +---
>   2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 0fe96f3ab3ef..e96bf493ff7a 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -133,6 +133,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>   		unsigned long addr, int order, pgoff_t *ilx);
>   bool vma_policy_mof(struct vm_area_struct *vma);
> +nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
> +		pgoff_t ilx, int *nid);
>   
>   extern void numa_default_policy(void);
>   extern void numa_policy_init(void);
> @@ -232,6 +234,13 @@ static inline struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>   	return NULL;
>   }
>   
> +static inline nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
> +				pgoff_t ilx, int *nid)
> +{
> +	*nid = NUMA_NO_NODE;
> +	return NULL;
> +}
> +
>   static inline int
>   vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
>   {
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 3b1dfd08338b..54f539497e20 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -596,8 +596,6 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
>   
>   static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
>   				unsigned long flags);
> -static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
> -				pgoff_t ilx, int *nid);
>   
>   static bool strictly_unmovable(unsigned long flags)
>   {
> @@ -2195,7 +2193,7 @@ static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>    * Return a nodemask representing a mempolicy for filtering nodes for
>    * page allocation, together with preferred node id (or the input node id).
>    */
> -static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
> +nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
>   				   pgoff_t ilx, int *nid)
>   {
>   	nodemask_t *nodemask = NULL;

You actually only care about the nid for your use case.

Maybe we should add

get_vma_policy_node() that internally does a get_vma_policy() to then 
give you only the node back.

If get_vma_policy() is not the right thing (see my reply to patch #2), 
of course a get_task_policy_node() could be added.
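
Something like the below, just as a sketch (the helper name and exact
plumbing are an assumption on my side; it could live next to
policy_nodemask() so the latter would not need to be exported):

	/* Sketch only: resolve the preferred node for (vma, addr). */
	int get_vma_policy_node(struct vm_area_struct *vma, unsigned long addr,
				int order)
	{
		struct mempolicy *pol;
		pgoff_t ilx;
		int nid = NUMA_NO_NODE;

		pol = get_vma_policy(vma, addr, order, &ilx);
		/* gfp choice here is illustrative */
		policy_nodemask(GFP_KERNEL, pol, ilx, &nid);
		mpol_cond_put(pol);
		return nid;
	}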

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes
  2025-06-12 18:13 [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes Bijan Tabatabai
                   ` (5 preceding siblings ...)
  2025-06-13  9:55 ` Rakie Kim
@ 2025-06-13 15:25 ` Joshua Hahn
  2025-06-13 16:46   ` Bijan Tabatabai
  6 siblings, 1 reply; 30+ messages in thread
From: Joshua Hahn @ 2025-06-13 15:25 UTC (permalink / raw)
  To: Bijan Tabatabai
  Cc: damon, linux-mm, linux-doc, linux-kernel, sj, akpm, corbet, david,
	ziy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
	ying.huang, apopple, bijantabatab, venkataravis, emirakhur,
	ajayjoshi, vtavarespetr

On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

> From: Bijan Tabatabai <bijantabatab@micron.com>
> 
> A recent patch set automatically set the interleave weight for each node
> according to the node's maximum bandwidth [1]. In another thread, the patch
> set's author, Joshua Hahn, wondered if/how these weights should be changed
> if the bandwidth utilization of the system changes [2].

Hi Bijan,

Thank you for this patchset, and thank you for finding interest in my
question!

> This patch set adds the mechanism for dynamically changing how application
> data is interleaved across nodes while leaving the policy of what the
> interleave weights should be to userspace. It does this by adding a new
> DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both
> paddr and vaddr operations sets. Using the paddr version is useful for
> managing page placement globally. Using the vaddr version limits tracking
> to one process per kdamond instance, but the va based tracking better
> captures spacial locality.
>
> DAMOS_INTERLEAVE interleaves pages within a region across nodes using the
> interleave weights at /sys/kernel/mm/mempolicy/weighted_interleave/node<N>
> and the page placement algorithm in weighted_interleave_nid via
> policy_nodemask. We chose to reuse the mempolicy weighted interleave
> infrastructure to avoid reimplementing code. However, this has the awkward
> side effect that only pages that are mapped to processes using
> MPOL_WEIGHTED_INTERLEAVE will be migrated according to new interleave
> weights. This might be fine because workloads that want their data to be
> dynamically interleaved will want their newly allocated data to be
> interleaved at the same ratio.

I think this is generally true. Until a user says that they have a use case
where they would like a non-weighted-interleave policy for allocating pages,
but would still like to place them according to a set weight, we can leave
support for other mempolicies out for now.

> If exposing policy_nodemask is undesirable, we have two alternative methods
> for having DAMON access the interleave weights it should use. We would
> appreciate feedback on which method is preferred.
> 1. Use mpol_misplaced instead
>   pros: mpol_misplaced is already exposed publically
>   cons: Would require refactoring mpol_misplaced to take a struct vm_area
>   instead of a struct vm_fault, and require refactoring mpol_misplaced and
>   get_vma_policy to take in a struct task_struct rather than just using
>   current. Also requires processes to use MPOL_WEIGHTED_INTERLEAVE.
> 2. Add a new field to struct damos, similar to target_nid for the
> MIGRATE_HOT/COLD schemes.
>   pros: Keeps changes contained inside DAMON. Would not require processes
>   to use MPOL_WEIGHTED_INTERLEAVE.
>   cons: Duplicates page placement code. Requires discussion on the sysfs
>   interface to use for users to pass in the interleave weights.

Here I agree with SJ's sentiment -- I think mpol_misplaced is written to work
with current / fault contexts, like you pointed out.
Perhaps it is best to keep the scope of the changes as local as possible : -)
As for duplicating page placement code, I think that is something we can
refine over iterations of this patchset, and maybe SJ will have some great
ideas on how this can best be done as well.

> This patchset was tested on an AMD machine with a NUMA node with CPUs
> attached to DDR memory and a cpu-less NUMA node attached to CXL memory.
> However, this patch set should generalize to other architectures and number
> of NUMA nodes.

I think moving the test results to the cover letter will help reviewers
better understand the intent of the work. I also think it will be very
helpful to include some potential use-cases in here as well. That is,
what workloads would benefit from placing pages according to a set ratio,
rather than using existing migration policies that adjust this based on
hotness / coldness?

One such use case that I can think of is using this patchset + weighted
interleave auto-tuning, which would help alleviate bandwidth limitations
by ensuring that, past the allocation stage, pages are placed in a way
that maximizes the bandwidth usage of the system (at a possible cost to
latency, which may or may not even matter depending on how bandwidth-bound
the workload is).

Thank you again for the amazing patchset! Have a great day : -)
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes
  2025-06-12 23:49 ` [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes SeongJae Park
  2025-06-13  2:41   ` Huang, Ying
@ 2025-06-13 15:44   ` Bijan Tabatabai
  2025-06-13 17:12     ` SeongJae Park
  2025-06-16  7:42     ` Byungchul Park
  1 sibling, 2 replies; 30+ messages in thread
From: Bijan Tabatabai @ 2025-06-13 15:44 UTC (permalink / raw)
  To: SeongJae Park
  Cc: linux-mm, linux-doc, linux-kernel, akpm, corbet, david, ziy,
	matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
	ying.huang, apopple, bijantabatab, venkataravis, emirakhur,
	ajayjoshi, vtavarespetr, damon

Hi SeongJae,

Thank you for your comments.

On Thu, Jun 12, 2025 at 6:49 PM SeongJae Park <sj@kernel.org> wrote:
>
> Hi Bijan,
>
> On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
>
> > From: Bijan Tabatabai <bijantabatab@micron.com>
> >
> > A recent patch set automatically set the interleave weight for each node
> > according to the node's maximum bandwidth [1]. In another thread, the patch
> > set's author, Joshua Hahn, wondered if/how these weights should be changed
> > if the bandwidth utilization of the system changes [2].
>
> Thank you for sharing the background.  I do agree it is an important question.
>
> >
> > This patch set adds the mechanism for dynamically changing how application
> > data is interleaved across nodes while leaving the policy of what the
> > interleave weights should be to userspace. It does this by adding a new
> > DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both
> > paddr and vaddr operations sets. Using the paddr version is useful for
> > managing page placement globally. Using the vaddr version limits tracking
> > to one process per kdamond instance, but the va based tracking better
> > captures spacial locality.
> >
> > DAMOS_INTERLEAVE interleaves pages within a region across nodes using the
> > interleave weights at /sys/kernel/mm/mempolicy/weighted_interleave/node<N>
> > and the page placement algorithm in weighted_interleave_nid via
> > policy_nodemask.
>
> So, what DAMOS_INTERLEAVE will do is, migrating pages of a given DAMON region
> into multiple nodes, following interleaving weights, right?

That's correct.

> We already have
> DAMOS actions for migrating pages of a given DAMON region, namely
> DAMOS_MIGRATE_{HOT,COLD}.  The actions support only single migration target
> node, though.  To my perspective, hence, DAMOS_INTERLEAVE looks like an
> extended version of DAMOS_MIGRATE_{HOT,COLD} for flexible target node
> selections.  In a way, DAMOS_INTERLEAVE is rather a restricted version of
> DAMOS_MIGRATE_{HOT,COLD}, since it prioritizes only hotter regions, if I read
> the second patch correctly.
>
> What about extending DAMOS_MIGRATE_{HOT,COLD} to support your use case?  For
> example, letting users enter special keyword, say, 'weighted_interleave' to
> 'target_nid' DAMON sysfs file.  In the case, DAMOS_MIGRATE_{HOT,COLD} would
> work in the way you are implementing DAMOS_INTERLEAVE.

I like this idea. I will do this in the next version of the patch. I have a
couple of questions about how to go about this, if you don't mind.

First, should I drop the vaddr implementation or implement
DAMOS_MIGRATE_{HOT,COLD} in vaddr as well? I am leaning towards the former
because I believe the paddr version is more important, though the vaddr
version is useful if the user only cares about one application.

Second, do you have a preference for how we indicate that we are using the
mempolicy rather than target_nid in struct damos? I was thinking of either
setting target_nid to NUMA_NO_NODE or adding a boolean to struct damos for
this.

Maybe it would also be a good idea to generalize it some more. I implemented
this using just weighted interleave because I was targeting the use case
where the best interleave weights for a workload change as the bandwidth
utilization of the system changes, which I will describe in more detail
further down. However, we could apply the same logic for any mempolicy
instead of just filtering for MPOL_WEIGHTED_INTERLEAVE. This might clean up
the code a little bit because the logic dependent on CONFIG_NUMA would be
contained in the mempolicy code.

> > We chose to reuse the mempolicy weighted interleave
> > infrastructure to avoid reimplementing code. However, this has the awkward
> > side effect that only pages that are mapped to processes using
> > MPOL_WEIGHTED_INTERLEAVE will be migrated according to new interleave
> > weights. This might be fine because workloads that want their data to be
> > dynamically interleaved will want their newly allocated data to be
> > interleaved at the same ratio.
>
> Makes sense to me.  I'm not very familiar with interleaving and memory policy,
> though.
>
> >
> > If exposing policy_nodemask is undesirable,
>
> I see you are exposing it on include/linux/mempolicy.h on the first patch of
> this series, and I agree it is not desirable to unnecessarily expose functions.
> But you could reduce the exposure by exporting it on mm/internal.h instead.
> mempolicy maitnainers and reviewers who you kindly Cc-ed to this mail could
> give us good opinions.
>
> > we have two alternative methods
> > for having DAMON access the interleave weights it should use. We would
> > appreciate feedback on which method is preferred.
> > 1. Use mpol_misplaced instead
> >   pros: mpol_misplaced is already exposed publically
> >   cons: Would require refactoring mpol_misplaced to take a struct vm_area
> >   instead of a struct vm_fault, and require refactoring mpol_misplaced and
> >   get_vma_policy to take in a struct task_struct rather than just using
> >   current. Also requires processes to use MPOL_WEIGHTED_INTERLEAVE.
>
> I feel cons is larger than pros.  mpolicy people's opinion would matter more,
> though.
>
> > 2. Add a new field to struct damos, similar to target_nid for the
> > MIGRATE_HOT/COLD schemes.
> >   pros: Keeps changes contained inside DAMON. Would not require processes
> >   to use MPOL_WEIGHTED_INTERLEAVE.
> >   cons: Duplicates page placement code. Requires discussion on the sysfs
> >   interface to use for users to pass in the interleave weights.
>
> I agree this is also somewhat doable.  In future, we might want to implement
> this anyway, for non-global and flexible memory interleaving.  But if memory
> policy people are ok with reusing policy_nodemask(), I don't think we need to
> do this now.
>
> >
> > This patchset was tested on an AMD machine with a NUMA node with CPUs
> > attached to DDR memory and a cpu-less NUMA node attached to CXL memory.
> > However, this patch set should generalize to other architectures and number
> > of NUMA nodes.
>
> I show the test results on the commit messages of the second and the fourth
> patches.  In the next version, letting readers know that here would be nice.
> Also adding a short description of what you confirmed with the tests here
> (e.g., with the test we confirmed this patch functions as expected [and
> achieves X % Y metric wins]) would be nice.
>

Noted. I'll include this in the cover letter of the next patch set.

> >
> > Patches Sequence
> > ________________
> > The first patch exposes policy_nodemask() in include/linux/mempolicy.h to
> > let DAMON determine where a page should be placed for interleaving.
> > The second patch implements DAMOS_INTERLEAVE as a paddr action.
> > The third patch moves the DAMON page migration code to ops-common, allowing
> > vaddr actions to use it.
> > Finally, the fourth patch implements a vaddr version of DAMOS_INTERLEAVE.
>
> I'll try to take look on code and add comments if something stands out, but
> let's focus on the high level discussion first, especially whether to implement
> this as a new DAMOS action, or extend DAMOS_MIGRATE_{HOT,COLD} actions.

Makes sense. Based on your reply, I will probably change the code significantly.

> I think it would also be nice if you could add more explanation about why you
> picked DAMON as a way to implement this feature.  I assume that's because you
> found opportunities to utilize this feature in some access-aware way or
> utilizing DAMOS features.  I was actually able to imagine some such usages.
> For example, we could do the re-interleaving for hot or cold pages of specific
> NUMA nodes or specific virtual address ranges first to make interleaving
> effective faster.

Yeah, I'll give more detail on the use case I was targeting, which I will
also include in the cover letter of the next patch set.

Basically, we have seen that the best interleave weights for a workload can
change depending on the bandwidth utilization of the system. This was
touched upon in the discussion in [1]. As a toy example, imagine some
application that uses 75% of the local bandwidth. Assuming sufficient
capacity, when running alone, we probably want to keep all of that
application's data in local memory. However, if a second instance of that
application begins, using the same amount of bandwidth, it would be best to
interleave the data of both processes to alleviate the bandwidth pressure
from the local node. Likewise, when one of the processes ends, the data
should be moved back to local memory.

We imagine there would be a userspace application that monitors system
performance characteristics, such as bandwidth utilization or memory access
latency, and uses that information to tune the interleave weights. Others
seem to have come to a similar conclusion in previous discussions [2]. We
are currently working on a userspace program that does this, but it's not
quite ready to be published yet.
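
To make it concrete, once such a tuner decides on new ratios, it would simply
rewrite the existing weighted interleave sysfs knobs, e.g. (the values here
are made up for illustration):

	# shift from a 1:1 ratio toward local memory once bandwidth pressure drops
	echo 3 > /sys/kernel/mm/mempolicy/weighted_interleave/node0
	echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node1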

After the userspace application adjusts the interleave weights, we need some
mechanism to migrate the application pages that have already been allocated.
We think DAMON is the correct venue for this mechanism because we noticed
that we don't have to migrate all of the application's pages to improve
performance; we just need to migrate the frequently accessed pages. DAMON's
existing hotness tracking is very useful for this. Additionally, as Ying
pointed out [3], a complete solution must also handle the case where a
memory node is at capacity. The existing DAMOS_MIGRATE_COLD action can be
used in conjunction with the functionality in this patch set to provide
that complete solution.

[1] https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/
[2] https://lore.kernel.org/linux-mm/20250314151137.892379-1-joshua.hahnjy@gmail.com/
[3] https://lore.kernel.org/linux-mm/87frjfx6u4.fsf@DESKTOP-5N7EMDA/

> Also we could apply a sort of speed limit for the interleaving-migration to
> ensure it doesn't consume memory bandwidth too much.  The limit could be
> arbitrarily user-defined or auto-tuned for specific system metrics value (e.g.,
> memory bandwidth balance?).

I agree this is a concern, but I figured DAMOS's existing quota mechanism
would handle it. If you could elaborate on why quotas aren't enough here,
that would help me come up with a solution.


> If you have such use case in your mind or your test setups, sharing those here
> or on the next versions of this would be very helpful for reviewers.

Answered above. I will include them in the next version.

Thanks,
Bijan

> >
> > [1] https://lore.kernel.org/linux-mm/20250520141236.2987309-1-joshua.hahnjy@gmail.com/
> > [2] https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/
>
>
> Thanks,
> SJ
>
> [...]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes
  2025-06-13  2:41   ` Huang, Ying
@ 2025-06-13 16:02     ` Bijan Tabatabai
  0 siblings, 0 replies; 30+ messages in thread
From: Bijan Tabatabai @ 2025-06-13 16:02 UTC (permalink / raw)
  To: Huang, Ying
  Cc: SeongJae Park, linux-mm, linux-doc, linux-kernel, akpm, corbet,
	david, ziy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, apopple, bijantabatab, venkataravis, emirakhur, ajayjoshi,
	vtavarespetr, damon

On Thu, Jun 12, 2025 at 9:42 PM Huang, Ying
<ying.huang@linux.alibaba.com> wrote:
>
> SeongJae Park <sj@kernel.org> writes:
>
> > Hi Bijan,
> >
> > On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
> >
> >> From: Bijan Tabatabai <bijantabatab@micron.com>
> >>
> >> A recent patch set automatically set the interleave weight for each node
> >> according to the node's maximum bandwidth [1]. In another thread, the patch
> >> set's author, Joshua Hahn, wondered if/how these weights should be changed
> >> if the bandwidth utilization of the system changes [2].
> >
> > Thank you for sharing the background.  I do agree it is an important question.
> >
> >>
> >> This patch set adds the mechanism for dynamically changing how application
> >> data is interleaved across nodes while leaving the policy of what the
> >> interleave weights should be to userspace. It does this by adding a new
> >> DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both
> >> paddr and vaddr operations sets. Using the paddr version is useful for
> >> managing page placement globally. Using the vaddr version limits tracking
> >> to one process per kdamond instance, but the va based tracking better
> >> captures spacial locality.
> >>
> >> DAMOS_INTERLEAVE interleaves pages within a region across nodes using the
> >> interleave weights at /sys/kernel/mm/mempolicy/weighted_interleave/node<N>
> >> and the page placement algorithm in weighted_interleave_nid via
> >> policy_nodemask.
> >
> > So, what DAMOS_INTERLEAVE will do is, migrating pages of a given DAMON region
> > into multiple nodes, following interleaving weights, right?
Hi Ying,

> Hi, Bijan,
>
> It's hard for me to understand what you want to do in your original
> patch description.  SeongJae's description is helpful.  So, can you add
> more description in the future version?

Yes, sorry about that. I added more detail in my reply to SeongJae and
will include more detail in the cover letter of the next revision.

> So, you will migrate allocated pages to follow new weight?

Yes

> How to interact with the weight specified by users explicitly?  Usually we will
> respect explicit user policy.

I am not sure I understand the question completely, but I will try to answer
as best I can.

We interact with the user-provided weights through the policy_nodemask
function, which gives us the node id a page should be on. This patch only
reads the user-provided weights and migrates pages to be consistent with new
weights provided by the user, so I believe these changes do respect the
explicit user policy. Please let me know if you disagree.

Thanks for the review,
Bijan

P.S. Sorry for sending this twice - I accidentally replied instead of
replied all.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes
  2025-06-13  9:55 ` Rakie Kim
@ 2025-06-13 16:12   ` Bijan Tabatabai
  0 siblings, 0 replies; 30+ messages in thread
From: Bijan Tabatabai @ 2025-06-13 16:12 UTC (permalink / raw)
  To: Rakie Kim
  Cc: sj, akpm, corbet, david, ziy, matthew.brost, joshua.hahnjy,
	byungchul, gourry, ying.huang, apopple, bijantabatab,
	venkataravis, emirakhur, ajayjoshi, vtavarespetr, linux-mm,
	linux-doc, linux-kernel, kernel_team, damon

On Fri, Jun 13, 2025 at 4:55 AM Rakie Kim <rakie.kim@sk.com> wrote:
>
> On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
> > From: Bijan Tabatabai <bijantabatab@micron.com>
> >
> > A recent patch set automatically set the interleave weight for each node
> > according to the node's maximum bandwidth [1]. In another thread, the patch
> > set's author, Joshua Hahn, wondered if/how these weights should be changed
> > if the bandwidth utilization of the system changes [2].
> >
> > This patch set adds the mechanism for dynamically changing how application
> > data is interleaved across nodes while leaving the policy of what the
> > interleave weights should be to userspace. It does this by adding a new
> > DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both
> > paddr and vaddr operations sets. Using the paddr version is useful for
> > managing page placement globally. Using the vaddr version limits tracking
> > to one process per kdamond instance, but the va based tracking better
> > captures spacial locality.
>
> Hi Bijan,
>
> Thank you for explaining the motivation and need behind this patch.
> I believe it's important to consider the case where a new memory node
> is added and the interleave weight values are recalculated.
>
> If a new memory node (say, node2) is added, there are two possible
> approaches to consider.
>
> 1. Migrating pages to the newly added node2.
>    In this case, there is a potential issue where pages may be migrated
>    to node2, even though it is not part of the nodemask set by the user.
>
> 2. Ignoring the newly added node2 and continuing to use only the existing
>    nodemask for migrations.
>    However, if the weight values have been updated considering node2
>    performance, avoiding node2 might reduce the effectiveness of using
>    Weighted Interleave.
>
> It would be helpful to consider these two options or explore other
> possible solutions to ensure correctness.
>
> Rakie

Hi Rakie,

Thank you for the reply - this is not a problem I considered, but it is
important.

I think option 2 is the correct choice, and it is what is already done by the
policy_nodemask function we use to determine which node a page should be
placed on. I do not think it makes sense to ignore the nodemask when it is
explicitly set by the user.
However, if we decide to change the way we get the interleave weights to use
a DAMON-specific interface instead, then I think it would make sense to only
migrate to the newly onlined node if the user sets a weight for that node.
This is because, in that case, the user is explicitly telling DAMON to use
that node.

Let me know if you have any other concerns,
Bijan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h
  2025-06-13 13:45   ` David Hildenbrand
@ 2025-06-13 16:33     ` Bijan Tabatabai
  2025-06-16  9:45       ` David Hildenbrand
  2025-06-16 10:55       ` Huang, Ying
  0 siblings, 2 replies; 30+ messages in thread
From: Bijan Tabatabai @ 2025-06-13 16:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-doc, linux-kernel, sj, akpm, corbet, ziy,
	matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
	ying.huang, apopple, bijantabatab, venkataravis, emirakhur,
	ajayjoshi, vtavarespetr, damon

On Fri, Jun 13, 2025 at 8:45 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.06.25 20:13, Bijan Tabatabai wrote:
> > From: Bijan Tabatabai <bijantabatab@micron.com>
> >
> > This patch is to allow DAMON to call policy_nodemask() so it can
> > determine where to place a page for interleaving.
> >
> > Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
> > ---
> >   include/linux/mempolicy.h | 9 +++++++++
> >   mm/mempolicy.c            | 4 +---
> >   2 files changed, 10 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> > index 0fe96f3ab3ef..e96bf493ff7a 100644
> > --- a/include/linux/mempolicy.h
> > +++ b/include/linux/mempolicy.h
> > @@ -133,6 +133,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> >   struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> >               unsigned long addr, int order, pgoff_t *ilx);
> >   bool vma_policy_mof(struct vm_area_struct *vma);
> > +nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
> > +             pgoff_t ilx, int *nid);
> >
> >   extern void numa_default_policy(void);
> >   extern void numa_policy_init(void);
> > @@ -232,6 +234,13 @@ static inline struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> >       return NULL;
> >   }
> >
> > +static inline nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
> > +                             pgoff_t ilx, int *nid)
> > +{
> > +     *nid = NUMA_NO_NODE;
> > +     return NULL;
> > +}
> > +
> >   static inline int
> >   vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
> >   {
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 3b1dfd08338b..54f539497e20 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -596,8 +596,6 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
> >
> >   static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
> >                               unsigned long flags);
> > -static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
> > -                             pgoff_t ilx, int *nid);
> >
> >   static bool strictly_unmovable(unsigned long flags)
> >   {
> > @@ -2195,7 +2193,7 @@ static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> >    * Return a nodemask representing a mempolicy for filtering nodes for
> >    * page allocation, together with preferred node id (or the input node id).
> >    */
> > -static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
> > +nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
> >                                  pgoff_t ilx, int *nid)
> >   {
> >       nodemask_t *nodemask = NULL;
>
> You actually only care about the nid for your use case.
>
> Maybe we should add
>
> get_vma_policy_node() that internally does a get_vma_policy() to then
> give you only the node back.
>
> If get_vma_policy() is not the right thing (see my reply to patch #2),
> of course a get_task_policy_node() could be added.
>
> --
> Cheers,
>
> David / dhildenb

Hi David,

I did not use get_vma_policy or mpol_misplaced (which I believe is the
closest existing function to what I want in this patch) because those
functions seem to assume they are called inside of the task that the
folio/vma is mapped to. More specifically, mpol_misplaced assumes it is
being called within a page fault. This doesn't work for us, because we call
it inside of a kdamond process.

I would be open to adding a new function that takes in a folio, vma, address,
and task_struct and returns the nid the folio should be placed on. It could
possibly be implemented as an internal helper shared with mpol_misplaced,
because the two would be very similar.
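
Roughly, I am imagining something like the below (just a sketch; the name is
made up, and get_vma_policy() would need a task-aware variant since the
existing helper implicitly uses current):

	/* Sketch only: which node should this folio be on, per @task's policy? */
	static int damon_folio_nid(struct folio *folio, struct vm_area_struct *vma,
				   unsigned long addr, struct task_struct *task)
	{
		struct mempolicy *pol;
		pgoff_t ilx;
		int nid = NUMA_NO_NODE;

		/* hypothetical task-aware variant of get_vma_policy() */
		pol = get_vma_policy_for_task(task, vma, addr, folio_order(folio), &ilx);
		if (!pol)
			return NUMA_NO_NODE;

		policy_nodemask(GFP_KERNEL, pol, ilx, &nid);
		mpol_cond_put(pol);
		return nid;
	}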

How would you propose we handle MPOL_BIND and MPOL_PREFERRED_MANY in this
function? mpol_misplaced chooses a nid based on the node and cpu the fault
occurred on, which we wouldn't have in a kdamond context. The two options I
see are either:
1. return the nid of the first node in the policy's nodemask
2. return NUMA_NO_NODE
I think I would lean towards the first.

Thanks,
Bijan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes
  2025-06-13 15:25 ` Joshua Hahn
@ 2025-06-13 16:46   ` Bijan Tabatabai
  0 siblings, 0 replies; 30+ messages in thread
From: Bijan Tabatabai @ 2025-06-13 16:46 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: linux-mm, linux-doc, linux-kernel, sj, akpm, corbet, david, ziy,
	matthew.brost, rakie.kim, byungchul, gourry, ying.huang, apopple,
	bijantabatab, venkataravis, emirakhur, ajayjoshi, vtavarespetr,
	damon

Hi Joshua,

On Fri, Jun 13, 2025 at 10:25 AM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
>
> > From: Bijan Tabatabai <bijantabatab@micron.com>
> >
> > A recent patch set automatically set the interleave weight for each node
> > according to the node's maximum bandwidth [1]. In another thread, the patch
> > set's author, Joshua Hahn, wondered if/how these weights should be changed
> > if the bandwidth utilization of the system changes [2].
>
> Hi Bijan,
>
> Thank you for this patchset, and thank you for finding interest in my
> question!
>
> > This patch set adds the mechanism for dynamically changing how application
> > data is interleaved across nodes while leaving the policy of what the
> > interleave weights should be to userspace. It does this by adding a new
> > DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both
> > paddr and vaddr operations sets. Using the paddr version is useful for
> > managing page placement globally. Using the vaddr version limits tracking
> > to one process per kdamond instance, but the va based tracking better
> > captures spacial locality.
> >
> > DAMOS_INTERLEAVE interleaves pages within a region across nodes using the
> > interleave weights at /sys/kernel/mm/mempolicy/weighted_interleave/node<N>
> > and the page placement algorithm in weighted_interleave_nid via
> > policy_nodemask. We chose to reuse the mempolicy weighted interleave
> > infrastructure to avoid reimplementing code. However, this has the awkward
> > side effect that only pages that are mapped to processes using
> > MPOL_WEIGHTED_INTERLEAVE will be migrated according to new interleave
> > weights. This might be fine because workloads that want their data to be
> > dynamically interleaved will want their newly allocated data to be
> > interleaved at the same ratio.
>
> I think this is generally true. Maybe until a user says that they have a
> usecase where they would like to have a non-weighted-interleave policy
> to allocate pages, but would like to place them according to a set weight,
> we can leave support for other mempolicies out for now.
>
> > If exposing policy_nodemask is undesirable, we have two alternative methods
> > for having DAMON access the interleave weights it should use. We would
> > appreciate feedback on which method is preferred.
> > 1. Use mpol_misplaced instead
> >   pros: mpol_misplaced is already exposed publically
> >   cons: Would require refactoring mpol_misplaced to take a struct vm_area
> >   instead of a struct vm_fault, and require refactoring mpol_misplaced and
> >   get_vma_policy to take in a struct task_struct rather than just using
> >   current. Also requires processes to use MPOL_WEIGHTED_INTERLEAVE.
> > 2. Add a new field to struct damos, similar to target_nid for the
> > MIGRATE_HOT/COLD schemes.
> >   pros: Keeps changes contained inside DAMON. Would not require processes
> >   to use MPOL_WEIGHTED_INTERLEAVE.
> >   cons: Duplicates page placement code. Requires discussion on the sysfs
> >   interface to use for users to pass in the interleave weights.
>
> Here I agree with SJ's sentiment -- I think mpol_misplaced runs with the
> context of working with current / fault contexts, like you pointed out.
> Perhaps it is best to keep the scope of the changes as local as possible : -)
> As for duplicating page placement code, I think that is something we can
> refine over iterations of this patchset, and maybe SJ will have some great
> ideas on how this can best be done as well.

David Hildenbrand responded to this and proposed adding a new function that
just returns the nid a folio should go on based on its mempolicy. I think that's
probably the best way to go for now. I think the common case would want
the weights used by this and mempolicy to be the same. However, if there is
a use case where different weights are desired, I don't mind coming back and
adding that functionality.

> > This patchset was tested on an AMD machine with a NUMA node with CPUs
> > attached to DDR memory and a cpu-less NUMA node attached to CXL memory.
> > However, this patch set should generalize to other architectures and number
> > of NUMA nodes.
>
> I think moving the test results to the cover letter will help reviewers
> better understand the intent of the work. Also, I think it will also be
> very helpful to include some potential use-cases in here as well. That is,
> what workloads would benefit from placing pages according to a set ratio,
> rather than using existing migration policies that adjust this based on
> hotness / coldness?

Noted. I will be sure to include that in the next revision.

> One such use case that I can think of is using this patchset + weighted
> interleave auto-tuning, which would help alleviate bandwidth limitations
> by ensuring that past the allocation stage, pages are being accessed
> in a way that maximizes the bandwidth usage of the system (at the cost of
> latency, which may or may not even be true based on how bandwidth-bound
> the workload is).

This was the exact use case I envisioned for this patch. I talk about it in more
detail in my reply to SeongJae.

> Thank you again for the amazing patchset! Have a great day : -)
> Joshua

I appreciate you taking the time to respond,
Bijan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes
  2025-06-13 15:44   ` Bijan Tabatabai
@ 2025-06-13 17:12     ` SeongJae Park
  2025-06-16  7:42     ` Byungchul Park
  1 sibling, 0 replies; 30+ messages in thread
From: SeongJae Park @ 2025-06-13 17:12 UTC (permalink / raw)
  To: Bijan Tabatabai
  Cc: SeongJae Park, linux-mm, linux-doc, linux-kernel, akpm, corbet,
	david, ziy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, bijantabatab, venkataravis,
	emirakhur, ajayjoshi, vtavarespetr, damon

On Fri, 13 Jun 2025 10:44:17 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

> Hi SeongJae,
> 
> Thank you for your comments.
> 
> On Thu, Jun 12, 2025 at 6:49 PM SeongJae Park <sj@kernel.org> wrote:
> >
> > Hi Bijan,
> >
> > On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
> >
> > > From: Bijan Tabatabai <bijantabatab@micron.com>
> > >
[...]
> > What about extending DAMOS_MIGRATE_{HOT,COLD} to support your use case?  For
> > example, letting users enter special keyword, say, 'weighted_interleave' to
> > 'target_nid' DAMON sysfs file.  In the case, DAMOS_MIGRATE_{HOT,COLD} would
> > work in the way you are implementing DAMOS_INTERLEAVE.
> 
> I like this idea. I will do this in the next version of the patch.

Great, looking forward to that!

> I
> have a couple of questions
> about how to go about this if you don't mind.

Of course I don't :)

> 
> First, should I drop the vaddr implementation or implement
> DAMOS_MIGRATE_{HOT,COLD}
> in vaddr as well? I am leaning towards the former because I believe
> the paddr version is
> more important, though the vaddr version is useful if the user only
> cares about one
> application.

I see no problem with dropping the vaddr implementation.  Please do what you
want and need to do, at your own pace :)

> 
> Second, do you have a preference for how we indicate that we are using
> the mempolicy
> rather than target_nid in struct damos? I was thinking of either
> setting target_nid to
> NUMA_NO_NODE or adding a boolean to struct damos for this.

I'd prefer adding a boolean to 'struct damos'.
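
That is, something like the below (the field name is just an example, of
course):

	struct damos {
		...
		int target_nid;
		/* use the mempolicy's weighted interleave weights instead of target_nid */
		bool migrate_dests_from_mpol;
		...
	};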

> 
> Maybe it would also be a good idea to generalize it some more. I
> implemented this using
> just weighted interleave because I was targeting the use case where
> the best interleave
> weights for a workload changes as the bandwidth utilization of the
> system changes, which
> I will go describe in more detail further down. However, we could
> apply the same logic for
> any mempolicy instead of just filtering for MPOL_WEIGHTED_INTERLEAVE. This might
> clean up the code a little bit because the logic dependent on
> CONFIG_NUMA would be
> contained in the mempolicy code.

Yes, I agree.  Such flexibility sounds useful :)

In future, I think we could further let users set multiple target nodes for
DAMOS_MIGRATE_{HOT,COLD} with arbitrary weights.

[...]
> > I show the test results on the commit messages of the second and the fourth
> > patches.  In the next version, letting readers know that here would be nice.
> > Also adding a short description of what you confirmed with the tests here
> > (e.g., with the test we confirmed this patch functions as expected [and
> > achieves X % Y metric wins]) would be nice.
> >
> 
> Noted. I'll include this in the cover letter of the next patch set.

Thank you! :)

[...]
> > I think it would also be nice if you could add more explanation about why you
> > picked DAMON as a way to implement this feature.  I assume that's because you
> > found opportunities to utilize this feature in some access-aware way or
> > utilizing DAMOS features.  I was actually able to imagine some such usages.
> > For example, we could do the re-interleaving for hot or cold pages of specific
> > NUMA nodes or specific virtual address ranges first to make interleaving
> > effective faster.
> 
> Yeah, I'll give more detail on the use case I was targeting, which I
> will also include
> in the cover letter of the next patch set.
> 
> Basically, we have seen that the best interleave weights for a workload can
> change depending on the bandwidth utilization of the system. This was touched
> upon in the discussion in [1]. As a toy example, imagine some
> application that uses
> 75% of the local bandwidth. Assuming sufficient capacity, when running alone, we
> probably want to keep all of that application's data in local memory.
> However, if a
> second instance of that application begins, using the same amount of bandwidth,
> it would be best to interleave the data of both processes to alleviate
> the bandwidth
> pressure from the local node. Likewise, when one of the processes ends, the data
> should be moved back to local memory.
> 
> We imagine there would be a userspace application that would monitor system
> performance characteristics, such as bandwidth utilization or memory
> access latency,
> and uses that information to tune the interleave weights. Others seemed to have
> come to a similar conclusion in previous discussions [2]. We are
> currently working
> on a userspace program that does this, but it's not quite ready to be
> published yet.

Sounds interesting, looking forward!

Note that DAMOS has an internal feedback loop for auto-tuning the
aggressiveness of a given scheme, and the feedback loop accepts system
metrics or arbitrary user inputs.  I think the userspace program _might_ be
able to provide that arbitrary feedback.  We could also think about
extending the list of system metrics that DAMOS accepts as feedback to
include memory bandwidth.

> 
> After the userspace application adjusts the interleave weights, we need some
> mechanism to migrate the application pages that have already been allocated.
> We think DAMON is the correct venue for this mechanism because we noticed
> that we don't have to migrate all of the application's pages to
> improve performance,
> we just need to migrate the frequently accessed pages. DAMON's existing hotness
> tracking is very useful for this. Additionally, as Ying pointed out
> [3], a complete
> solution must also handle when a memory node is at capacity. The existing
> DAMOS_MIGRATE_COLD action can be used in conjunction with the functionality
> in this patch set to provide that complete solution.
> 
> [1] https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/
> [2] https://lore.kernel.org/linux-mm/20250314151137.892379-1-joshua.hahnjy@gmail.com/
> [3] https://lore.kernel.org/linux-mm/87frjfx6u4.fsf@DESKTOP-5N7EMDA/

Thank you for this nice and informative description of the use case!

> 
> > Also we could apply a sort of speed limit for the interleaving-migration to
> > ensure it doesn't consume memory bandwidth too much.  The limit could be
> > arbitrarily user-defined or auto-tuned for specific system metrics value (e.g.,
> > memory bandwidth balance?).
> 
> I agree this is a concern, but I figured DAMOS's existing quota mechanism would
> handle it. If you could elaborate on why quotas aren't enough here,
> that would help
> me come up with a solution.

What I wanted to say is, we could use DAMOS's existing quota mechanism to
handle it.  The DAMOS quota feature is just another name for an
[auto-tunable] speed limit.  Sorry for confusing you.  Anyway, happy to
confirm this is yet another DAMOS feature that could be useful for your case
and future ones.

> 
> 
> > If you have such use case in your mind or your test setups, sharing those here
> > or on the next versions of this would be very helpful for reviewers.
> 
> Answered above. I will include them in the next version.

That was very helpful.  Keeping that in the next version will be helpful for
new readers such as future SJ :)

[1] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#aim-oriented-feedback-driven-auto-tuning


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes
  2025-06-13 15:44   ` Bijan Tabatabai
  2025-06-13 17:12     ` SeongJae Park
@ 2025-06-16  7:42     ` Byungchul Park
  2025-06-16 15:01       ` Bijan Tabatabai
  1 sibling, 1 reply; 30+ messages in thread
From: Byungchul Park @ 2025-06-16  7:42 UTC (permalink / raw)
  To: Bijan Tabatabai
  Cc: SeongJae Park, linux-mm, linux-doc, linux-kernel, akpm, corbet,
	david, ziy, matthew.brost, joshua.hahnjy, rakie.kim, gourry,
	ying.huang, apopple, bijantabatab, venkataravis, emirakhur,
	ajayjoshi, vtavarespetr, damon, kernel_team

On Fri, Jun 13, 2025 at 10:44:17AM -0500, Bijan Tabatabai wrote:
> Hi SeongJae,
> 
> Thank you for your comments.
> 
> On Thu, Jun 12, 2025 at 6:49 PM SeongJae Park <sj@kernel.org> wrote:
> >
> > Hi Bijan,
> >
> > On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
> >
> > > From: Bijan Tabatabai <bijantabatab@micron.com>
> > >
> > > A recent patch set automatically set the interleave weight for each node
> > > according to the node's maximum bandwidth [1]. In another thread, the patch
> > > set's author, Joshua Hahn, wondered if/how these weights should be changed
> > > if the bandwidth utilization of the system changes [2].
> >
> > Thank you for sharing the background.  I do agree it is an important question.
> >
> > >
> > > This patch set adds the mechanism for dynamically changing how application
> > > data is interleaved across nodes while leaving the policy of what the
> > > interleave weights should be to userspace. It does this by adding a new
> > > DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both
> > > paddr and vaddr operations sets. Using the paddr version is useful for
> > > managing page placement globally. Using the vaddr version limits tracking
> > > to one process per kdamond instance, but the va based tracking better
> > > captures spacial locality.
> > >
> > > DAMOS_INTERLEAVE interleaves pages within a region across nodes using the
> > > interleave weights at /sys/kernel/mm/mempolicy/weighted_interleave/node<N>
> > > and the page placement algorithm in weighted_interleave_nid via
> > > policy_nodemask.
> >
> > So, what DAMOS_INTERLEAVE will do is, migrating pages of a given DAMON region
> > into multiple nodes, following interleaving weights, right?
> 
> That's correct.

Your approach sounds interesting.

IIUC, the approach could be integrated with the existing NUMA hinting
mechanism as well, so as to perform weighted interleaving migration for
promotion, though that may end up suppressing the migration anyway when
MPOL_WEIGHTED_INTERLEAVE is set.

Do you have a plan for that too?

Plus, it'd be best if you shared the performance improvement results rather
than just the placement data.

	Byungchul

> > We already have
> > DAMOS actions for migrating pages of a given DAMON region, namely
> > DAMOS_MIGRATE_{HOT,COLD}.  The actions support only single migration target
> > node, though.  To my perspective, hence, DAMOS_INTERLEAVE looks like an
> > extended version of DAMOS_MIGRATE_{HOT,COLD} for flexible target node
> > selections.  In a way, DAMOS_INTERLEAVE is rather a restricted version of
> > DAMOS_MIGRATE_{HOT,COLD}, since it prioritizes only hotter regions, if I read
> > the second patch correctly.
> >
> > What about extending DAMOS_MIGRATE_{HOT,COLD} to support your use case?  For
> > example, letting users enter special keyword, say, 'weighted_interleave' to
> > 'target_nid' DAMON sysfs file.  In the case, DAMOS_MIGRATE_{HOT,COLD} would
> > work in the way you are implementing DAMOS_INTERLEAVE.
> 
> I like this idea. I will do this in the next version of the patch. I
> have a couple of questions
> about how to go about this if you don't mind.
> 
> First, should I drop the vaddr implementation or implement
> DAMOS_MIGRATE_{HOT,COLD}
> in vaddr as well? I am leaning towards the former because I believe
> the paddr version is
> more important, though the vaddr version is useful if the user only
> cares about one
> application.
> 
> Second, do you have a preference for how we indicate that we are using
> the mempolicy
> rather than target_nid in struct damos? I was thinking of either
> setting target_nid to
> NUMA_NO_NODE or adding a boolean to struct damos for this.
> 
> Maybe it would also be a good idea to generalize it some more. I implemented
> this using just weighted interleave because I was targeting the use case
> where the best interleave weights for a workload change as the bandwidth
> utilization of the system changes, which I will describe in more detail
> further down. However, we could apply the same logic for any mempolicy
> instead of just filtering for MPOL_WEIGHTED_INTERLEAVE. This might clean up
> the code a little bit because the logic dependent on CONFIG_NUMA would be
> contained in the mempolicy code.
> 
> > > We chose to reuse the mempolicy weighted interleave
> > > infrastructure to avoid reimplementing code. However, this has the awkward
> > > side effect that only pages that are mapped to processes using
> > > MPOL_WEIGHTED_INTERLEAVE will be migrated according to new interleave
> > > weights. This might be fine because workloads that want their data to be
> > > dynamically interleaved will want their newly allocated data to be
> > > interleaved at the same ratio.
> >
> > Makes sense to me.  I'm not very familiar with interleaving and memory policy,
> > though.
> >
> > >
> > > If exposing policy_nodemask is undesirable,
> >
> > I see you are exposing it on include/linux/mempolicy.h on the first patch of
> > this series, and I agree it is not desirable to unnecessarily expose functions.
> > But you could reduce the exposure by exporting it on mm/internal.h instead.
> > mempolicy maitnainers and reviewers who you kindly Cc-ed to this mail could
> > give us good opinions.
> >
> > > we have two alternative methods
> > > for having DAMON access the interleave weights it should use. We would
> > > appreciate feedback on which method is preferred.
> > > 1. Use mpol_misplaced instead
> > >   pros: mpol_misplaced is already exposed publically
> > >   cons: Would require refactoring mpol_misplaced to take a struct vm_area
> > >   instead of a struct vm_fault, and require refactoring mpol_misplaced and
> > >   get_vma_policy to take in a struct task_struct rather than just using
> > >   current. Also requires processes to use MPOL_WEIGHTED_INTERLEAVE.
> >
> > I feel the cons outweigh the pros.  mempolicy people's opinions would matter
> > more, though.
> >
> > > 2. Add a new field to struct damos, similar to target_nid for the
> > > MIGRATE_HOT/COLD schemes.
> > >   pros: Keeps changes contained inside DAMON. Would not require processes
> > >   to use MPOL_WEIGHTED_INTERLEAVE.
> > >   cons: Duplicates page placement code. Requires discussion on the sysfs
> > >   interface to use for users to pass in the interleave weights.
> >
> > I agree this is also somewhat doable.  In future, we might want to implement
> > this anyway, for non-global and flexible memory interleaving.  But if memory
> > policy people are ok with reusing policy_nodemask(), I don't think we need to
> > do this now.
> >
> > >
> > > This patchset was tested on an AMD machine with a NUMA node with CPUs
> > > attached to DDR memory and a cpu-less NUMA node attached to CXL memory.
> > > However, this patch set should generalize to other architectures and number
> > > of NUMA nodes.
> >
> > I see you show the test results in the commit messages of the second and the
> > fourth patches.  In the next version, letting readers know that here would be
> > nice.  Also, adding a short description of what you confirmed with the tests
> > here (e.g., with the test we confirmed this patch functions as expected [and
> > achieves an X % win on metric Y]) would be nice.
> >
> 
> Noted. I'll include this in the cover letter of the next patch set.
> 
> > >
> > > Patches Sequence
> > > ________________
> > > The first patch exposes policy_nodemask() in include/linux/mempolicy.h to
> > > let DAMON determine where a page should be placed for interleaving.
> > > The second patch implements DAMOS_INTERLEAVE as a paddr action.
> > > The third patch moves the DAMON page migration code to ops-common, allowing
> > > vaddr actions to use it.
> > > Finally, the fourth patch implements a vaddr version of DAMOS_INTERLEAVE.
> >
> > I'll try to take a look at the code and add comments if something stands out, but
> > let's focus on the high level discussion first, especially whether to implement
> > this as a new DAMOS action, or extend DAMOS_MIGRATE_{HOT,COLD} actions.
> 
> Makes sense. Based on your reply, I will probably change the code significantly.
> 
> > I think it would also be nice if you could add more explanation about why you
> > picked DAMON as a way to implement this feature.  I assume that's because you
> > found opportunities to utilize this feature in some access-aware way or to
> > utilize DAMOS features.  I was actually able to imagine some such usages.
> > For example, we could do the re-interleaving for hot or cold pages of specific
> > NUMA nodes or specific virtual address ranges first, to make the interleaving
> > take effect faster.
> 
> Yeah, I'll give more detail on the use case I was targeting, which I will
> also include in the cover letter of the next patch set.
> 
> Basically, we have seen that the best interleave weights for a workload can
> change depending on the bandwidth utilization of the system. This was touched
> upon in the discussion in [1]. As a toy example, imagine some application
> that uses 75% of the local bandwidth. Assuming sufficient capacity, when
> running alone, we probably want to keep all of that application's data in
> local memory. However, if a second instance of that application begins,
> using the same amount of bandwidth, it would be best to interleave the data
> of both processes to alleviate the bandwidth pressure from the local node.
> Likewise, when one of the processes ends, the data should be moved back to
> local memory.
> 
> We imagine there would be a userspace application that monitors system
> performance characteristics, such as bandwidth utilization or memory access
> latency, and uses that information to tune the interleave weights. Others
> seem to have come to a similar conclusion in previous discussions [2]. We are
> currently working on a userspace program that does this, but it's not quite
> ready to be published yet.
> 
> After the userspace application adjusts the interleave weights, we need some
> mechanism to migrate the application pages that have already been allocated.
> We think DAMON is the correct venue for this mechanism because we noticed
> that we don't have to migrate all of the application's pages to improve
> performance; we just need to migrate the frequently accessed pages. DAMON's
> existing hotness tracking is very useful for this. Additionally, as Ying
> pointed out [3], a complete solution must also handle the case where a memory
> node is at capacity. The existing DAMOS_MIGRATE_COLD action can be used in
> conjunction with the functionality in this patch set to provide that complete
> solution.
> 
> [1] https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/
> [2] https://lore.kernel.org/linux-mm/20250314151137.892379-1-joshua.hahnjy@gmail.com/
> [3] https://lore.kernel.org/linux-mm/87frjfx6u4.fsf@DESKTOP-5N7EMDA/
> 
> > Also, we could apply a sort of speed limit to the interleaving migration to
> > ensure it doesn't consume too much memory bandwidth.  The limit could be
> > arbitrarily user-defined or auto-tuned for a specific system metric value
> > (e.g., memory bandwidth balance?).
> 
> I agree this is a concern, but I figured DAMOS's existing quota mechanism
> would handle it. If you could elaborate on why quotas aren't enough here,
> that would help me come up with a solution.
> 
> 
> > If you have such use case in your mind or your test setups, sharing those here
> > or on the next versions of this would be very helpful for reviewers.
> 
> Answered above. I will include them in the next version.
> 
> Thanks,
> Bijan
> 
> > >
> > > [1] https://lore.kernel.org/linux-mm/20250520141236.2987309-1-joshua.hahnjy@gmail.com/
> > > [2] https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/
> >
> >
> > Thanks,
> > SJ
> >
> > [...]


* Re: [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h
  2025-06-13 16:33     ` Bijan Tabatabai
@ 2025-06-16  9:45       ` David Hildenbrand
  2025-06-16 11:02         ` Huang, Ying
  2025-06-16 14:16         ` Bijan Tabatabai
  2025-06-16 10:55       ` Huang, Ying
  1 sibling, 2 replies; 30+ messages in thread
From: David Hildenbrand @ 2025-06-16  9:45 UTC (permalink / raw)
  To: Bijan Tabatabai
  Cc: linux-mm, linux-doc, linux-kernel, sj, akpm, corbet, ziy,
	matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
	ying.huang, apopple, bijantabatab, venkataravis, emirakhur,
	ajayjoshi, vtavarespetr, damon

On 13.06.25 18:33, Bijan Tabatabai wrote:
> On Fri, Jun 13, 2025 at 8:45 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 12.06.25 20:13, Bijan Tabatabai wrote:
>>> From: Bijan Tabatabai <bijantabatab@micron.com>
>>>
>>> This patch is to allow DAMON to call policy_nodemask() so it can
>>> determine where to place a page for interleaving.
>>>
>>> Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
>>> ---
>>>    include/linux/mempolicy.h | 9 +++++++++
>>>    mm/mempolicy.c            | 4 +---
>>>    2 files changed, 10 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
>>> index 0fe96f3ab3ef..e96bf493ff7a 100644
>>> --- a/include/linux/mempolicy.h
>>> +++ b/include/linux/mempolicy.h
>>> @@ -133,6 +133,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>>>    struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>>                unsigned long addr, int order, pgoff_t *ilx);
>>>    bool vma_policy_mof(struct vm_area_struct *vma);
>>> +nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
>>> +             pgoff_t ilx, int *nid);
>>>
>>>    extern void numa_default_policy(void);
>>>    extern void numa_policy_init(void);
>>> @@ -232,6 +234,13 @@ static inline struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>>        return NULL;
>>>    }
>>>
>>> +static inline nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
>>> +                             pgoff_t ilx, int *nid)
>>> +{
>>> +     *nid = NUMA_NO_NODE;
>>> +     return NULL;
>>> +}
>>> +
>>>    static inline int
>>>    vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
>>>    {
>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>> index 3b1dfd08338b..54f539497e20 100644
>>> --- a/mm/mempolicy.c
>>> +++ b/mm/mempolicy.c
>>> @@ -596,8 +596,6 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
>>>
>>>    static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
>>>                                unsigned long flags);
>>> -static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
>>> -                             pgoff_t ilx, int *nid);
>>>
>>>    static bool strictly_unmovable(unsigned long flags)
>>>    {
>>> @@ -2195,7 +2193,7 @@ static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>>>     * Return a nodemask representing a mempolicy for filtering nodes for
>>>     * page allocation, together with preferred node id (or the input node id).
>>>     */
>>> -static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
>>> +nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
>>>                                   pgoff_t ilx, int *nid)
>>>    {
>>>        nodemask_t *nodemask = NULL;
>>
>> You actually only care about the nid for your use case.
>>
>> Maybe we should add
>>
>> get_vma_policy_node() that internally does a get_vma_policy() to then
>> give you only the node back.
>>
>> If get_vma_policy() is not the right thing (see my reply to patch #2),
>> of course a get_task_policy_node() could be added.
>>
>> --
>> Cheers,
>>
>> David / dhildenb
> 
> Hi David,

Hi,

> 
> I did not use get_vma_policy or mpol_misplaced, which I believe is the
> closest function that exists for what I want in this patch, because
> those functions

I think what you mean is that you are performing an rmap walk. But
there, you do have a VMA + MM available (stable).

> seem to assume they are called inside of the task that the folio/vma
> is mapped to.

But, we do have a VMA at hand, so why would we want to ignore any set 
policy? (I think VMA policies so far only apply to shmem, but still).

I really think you want to use get_vma_policy() instead of the task policy.


> More specifically, mpol_misplaced assumes it is being called within a
> page fault.
> This doesn't work for us, because we call it inside of a kdamond process.

Right.

But it uses the vmf only for ...

1) Obtaining the VMA
2) Sanity-checking that the ptlock is held.

Which, you also have during the rmap walk.


So what about factoring out that handling from mpol_misplaced(), having 
another function where you pass the VMA instead of the vmf?

> 
> I would be open to adding a new function that takes in a folio, vma,
> address, and
> task_struct and returns the nid the folio should be placed on. It could possibly
> be implemented as a function internal to mpol_misplaced because the two would
> be very similar.

Good, you had the same thought :)

> 
> How would you propose we handle MPOL_BIND and MPOL_PREFERRED_MANY
> in this function? mpol_misplaced chooses a nid based on the node and
> cpu the fault
> occurred on, which we wouldn't have in a kdamond context. The two options I see
> are either:
> 1. return the nid of the first node in the policy's nodemask
> 2. return NUMA_NO_NODE
> I think I would lean towards the first.

I guess we'd need a way for your new helper to deal with both cases 
(is_fault vs. !is_fault), and make a decision based on that.


For your use case, you can then decide what would be appropriate. It's a 
good question what the appropriate action would be: 1) sounds better, 
but I do wonder if we would rather want to distribute the folios in a 
different way across applicable nodes, not sure ...

-- 
Cheers,

David / dhildenb



* Re: [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h
  2025-06-13 16:33     ` Bijan Tabatabai
  2025-06-16  9:45       ` David Hildenbrand
@ 2025-06-16 10:55       ` Huang, Ying
  1 sibling, 0 replies; 30+ messages in thread
From: Huang, Ying @ 2025-06-16 10:55 UTC (permalink / raw)
  To: Bijan Tabatabai
  Cc: David Hildenbrand, linux-mm, linux-doc, linux-kernel, sj, akpm,
	corbet, ziy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, apopple, bijantabatab, venkataravis, emirakhur, ajayjoshi,
	vtavarespetr, damon

Bijan Tabatabai <bijan311@gmail.com> writes:

> On Fri, Jun 13, 2025 at 8:45 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 12.06.25 20:13, Bijan Tabatabai wrote:
>> > From: Bijan Tabatabai <bijantabatab@micron.com>
>> >
>> > This patch is to allow DAMON to call policy_nodemask() so it can
>> > determine where to place a page for interleaving.
>> >
>> > Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
>> > ---
>> >   include/linux/mempolicy.h | 9 +++++++++
>> >   mm/mempolicy.c            | 4 +---
>> >   2 files changed, 10 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
>> > index 0fe96f3ab3ef..e96bf493ff7a 100644
>> > --- a/include/linux/mempolicy.h
>> > +++ b/include/linux/mempolicy.h
>> > @@ -133,6 +133,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>> >   struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>> >               unsigned long addr, int order, pgoff_t *ilx);
>> >   bool vma_policy_mof(struct vm_area_struct *vma);
>> > +nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
>> > +             pgoff_t ilx, int *nid);
>> >
>> >   extern void numa_default_policy(void);
>> >   extern void numa_policy_init(void);
>> > @@ -232,6 +234,13 @@ static inline struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>> >       return NULL;
>> >   }
>> >
>> > +static inline nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
>> > +                             pgoff_t ilx, int *nid)
>> > +{
>> > +     *nid = NUMA_NO_NODE;
>> > +     return NULL;
>> > +}
>> > +
>> >   static inline int
>> >   vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
>> >   {
>> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> > index 3b1dfd08338b..54f539497e20 100644
>> > --- a/mm/mempolicy.c
>> > +++ b/mm/mempolicy.c
>> > @@ -596,8 +596,6 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
>> >
>> >   static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
>> >                               unsigned long flags);
>> > -static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
>> > -                             pgoff_t ilx, int *nid);
>> >
>> >   static bool strictly_unmovable(unsigned long flags)
>> >   {
>> > @@ -2195,7 +2193,7 @@ static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>> >    * Return a nodemask representing a mempolicy for filtering nodes for
>> >    * page allocation, together with preferred node id (or the input node id).
>> >    */
>> > -static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
>> > +nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
>> >                                  pgoff_t ilx, int *nid)
>> >   {
>> >       nodemask_t *nodemask = NULL;
>>
>> You actually only care about the nid for your use case.
>>
>> Maybe we should add
>>
>> get_vma_policy_node() that internally does a get_vma_policy() to then
>> give you only the node back.
>>
>> If get_vma_policy() is not the right thing (see my reply to patch #2),
>> of course a get_task_policy_node() could be added.
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>
> Hi David,
>
> I did not use get_vma_policy or mpol_misplaced, which I believe is the
> closest function that exists for what I want in this patch, because
> those functions
> seem to assume they are called inside of the task that the folio/vma
> is mapped to.
> More specifically, mpol_misplaced assumes it is being called within a
> page fault.
> This doesn't work for us, because we call it inside of a kdamond process.
>
> I would be open to adding a new function that takes in a folio, vma,
> address, and
> task_struct and returns the nid the folio should be placed on. It could possibly
> be implemented as a function internal to mpol_misplaced because the two would
> be very similar.
>
> How would you propose we handle MPOL_BIND and MPOL_PREFERRED_MANY
> in this function? mpol_misplaced chooses a nid based on the node and
> cpu the fault
> occurred on, which we wouldn't have in a kdamond context. The two options I see
> are either:
> 1. return the nid of the first node in the policy's nodemask
> 2. return NUMA_NO_NODE
> I think I would lean towards the first.

You can try numa_node_id() first, then fall back to the first nid in
the nodemask.
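
A minimal sketch of that fallback, assuming the nodemask has already been
obtained from the mempolicy (the helper name is made up, not an existing
kernel function):

/*
 * Hypothetical sketch: prefer the node the calling thread is running on
 * if the policy allows it, otherwise fall back to the first allowed node.
 */
static int pick_nid_from_nodemask(const nodemask_t *nodemask)
{
	int nid = numa_node_id();

	if (node_isset(nid, *nodemask))
		return nid;
	return first_node(*nodemask);
}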

---
Best Regards,
Huang, Ying


* Re: [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h
  2025-06-16  9:45       ` David Hildenbrand
@ 2025-06-16 11:02         ` Huang, Ying
  2025-06-16 11:11           ` David Hildenbrand
  2025-06-16 14:16         ` Bijan Tabatabai
  1 sibling, 1 reply; 30+ messages in thread
From: Huang, Ying @ 2025-06-16 11:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Bijan Tabatabai, linux-mm, linux-doc, linux-kernel, sj, akpm,
	corbet, ziy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, apopple, bijantabatab, venkataravis, emirakhur, ajayjoshi,
	vtavarespetr, damon

David Hildenbrand <david@redhat.com> writes:

> On 13.06.25 18:33, Bijan Tabatabai wrote:
>> On Fri, Jun 13, 2025 at 8:45 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 12.06.25 20:13, Bijan Tabatabai wrote:
>>>> From: Bijan Tabatabai <bijantabatab@micron.com>
>>>>

[snip]

>> I did not use get_vma_policy or mpol_misplaced, which I believe is
>> the
>> closest function that exists for what I want in this patch, because
>> those functions
>
> I think what you mean is, that you are performing an rmap walk. But
> there, you do have a VMA + MM available (stable).
>
>> seem to assume they are called inside of the task that the folio/vma
>> is mapped to.
>
> But, we do have a VMA at hand, so why would we want to ignore any set
> policy? (I think VMA policies so far only apply to shmem, but still).
>
> I really think you want to use get_vma_policy() instead of the task policy.
>
>
>> More specifically, mpol_misplaced assumes it is being called within a
>> page fault.
>> This doesn't work for us, because we call it inside of a kdamond process.
>
> Right.
>
> But it uses the vmf only for ...
>
> 1) Obtaining the VMA
> 2) Sanity-checking that the ptlock is held.

3) update NUMA balancing per-folio cpupid state (via should_numa_migrate_memory()).

This needs to be called by the NUMA page fault handler.

> Which, you also have during the rmap walk.
>

[snip]

---
Best Regards,
Huang, Ying


* Re: [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h
  2025-06-16 11:02         ` Huang, Ying
@ 2025-06-16 11:11           ` David Hildenbrand
  0 siblings, 0 replies; 30+ messages in thread
From: David Hildenbrand @ 2025-06-16 11:11 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Bijan Tabatabai, linux-mm, linux-doc, linux-kernel, sj, akpm,
	corbet, ziy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, apopple, bijantabatab, venkataravis, emirakhur, ajayjoshi,
	vtavarespetr, damon

On 16.06.25 13:02, Huang, Ying wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 13.06.25 18:33, Bijan Tabatabai wrote:
>>> On Fri, Jun 13, 2025 at 8:45 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 12.06.25 20:13, Bijan Tabatabai wrote:
>>>>> From: Bijan Tabatabai <bijantabatab@micron.com>
>>>>>
> 
> [snip]
> 
>>> I did not use get_vma_policy or mpol_misplaced, which I believe is
>>> the
>>> closest function that exists for what I want in this patch, because
>>> those functions
>>
>> I think what you mean is, that you are performing an rmap walk. But
>> there, you do have a VMA + MM available (stable).
>>
>>> seem to assume they are called inside of the task that the folio/vma
>>> is mapped to.
>>
>> But, we do have a VMA at hand, so why would we want to ignore any set
>> policy? (I think VMA policies so far only apply to shmem, but still).
>>
>> I really think you want to use get_vma_policy() instead of the task policy.
>>
>>
>>> More specifically, mpol_misplaced assumes it is being called within a
>>> page fault.
>>> This doesn't work for us, because we call it inside of a kdamond process.
>>
>> Right.
>>
>> But it uses the vmf only for ...
>>
>> 1) Obtaining the VMA
>> 2) Sanity-checking that the ptlock is held.
> 
> 3) update NUMA balancing per-folio cpupid state (via should_numa_migrate_memory()).

How is the vmf used for that ("it uses the vmf only for ...")?

But yes, anything that relies on "thiscpu" and "thisnid" might be
fault-specific as well (what I mentioned further below).


-- 
Cheers,

David / dhildenb



* Re: [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h
  2025-06-16  9:45       ` David Hildenbrand
  2025-06-16 11:02         ` Huang, Ying
@ 2025-06-16 14:16         ` Bijan Tabatabai
  2025-06-16 14:26           ` David Hildenbrand
  2025-06-16 17:43           ` Gregory Price
  1 sibling, 2 replies; 30+ messages in thread
From: Bijan Tabatabai @ 2025-06-16 14:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-doc, linux-kernel, sj, akpm, corbet, ziy,
	matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
	ying.huang, apopple, bijantabatab, venkataravis, emirakhur,
	ajayjoshi, vtavarespetr, damon

On Mon, Jun 16, 2025 at 4:46 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 13.06.25 18:33, Bijan Tabatabai wrote:
> > On Fri, Jun 13, 2025 at 8:45 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 12.06.25 20:13, Bijan Tabatabai wrote:
[...]
> Hi,
>
> >
> > I did not use get_vma_policy or mpol_misplaced, which I believe is the
> > closest function that exists for what I want in this patch, because
> > those functions
>
> I think what you mean is, that you are performing an rmap walk. But
> there, you do have a VMA + MM available (stable).
>
> > seem to assume they are called inside of the task that the folio/vma
> > is mapped to.
>
> But, we do have a VMA at hand, so why would we want to ignore any set
> policy? (I think VMA policies so far only apply to shmem, but still).
>
> I really think you want to use get_vma_policy() instead of the task policy.

Sorry, I think I misunderstood you before. You are right, we should
consider the VMA policy before using the task policy. I will do this
in the next revision.

>
> > More specifically, mpol_misplaced assumes it is being called within a
> > page fault.
> > This doesn't work for us, because we call it inside of a kdamond process.
>
> Right.
>
> But it uses the vmf only for ...
>
> 1) Obtaining the VMA
> 2) Sanity-checking that the ptlock is held.
>
> Which, you also have during the rmap walk.

There is another subtle dependency in get_vma_policy.
It first checks if a VMA policy exists, and if it doesn't, it uses the
task policy of the current task, which doesn't make sense when called
by a kdamond thread.

However, I don't think this will change what seems to be our consensus
of adding a new helper function.

>
> So what about factoring out that handling from mpol_misplaced(), having
> another function where you pass the VMA instead of the vmf?
>
> >
> > I would be open to adding a new function that takes in a folio, vma,
> > address, and
> > task_struct and returns the nid the folio should be placed on. It could possibly
> > be implemented as a function internal to mpol_misplaced because the two would
> > be very similar.
>
> Good, you had the same thought :)
>
> >
> > How would you propose we handle MPOL_BIND and MPOL_PREFERRED_MANY
> > in this function? mpol_misplaced chooses a nid based on the node and
> > cpu the fault
> > occurred on, which we wouldn't have in a kdamond context. The two options I see
> > are either:
> > 1. return the nid of the first node in the policy's nodemask
> > 2. return NUMA_NO_NODE
> > I think I would lean towards the first.
>
> I guess we'd need a way for your new helper to deal with both cases
> (is_fault vs. !is_fault), and make a decision based on that.
>
>
> For your use case, you can then decide what would be appropriate. It's a
> good question what the appropriate action would be: 1) sounds better,
> but I do wonder if we would rather want to distribute the folios in a
> different way across applicable nodes, not sure ...

Yes, I was thinking about that too, but I felt that adding state
somewhere or using randomness to distribute the folios was incorrect,
especially since those policies are not the focus of this patchset.

I think I'll move forward with option 1 for now.

Thanks,
Bijan


* Re: [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h
  2025-06-16 14:16         ` Bijan Tabatabai
@ 2025-06-16 14:26           ` David Hildenbrand
  2025-06-16 17:43           ` Gregory Price
  1 sibling, 0 replies; 30+ messages in thread
From: David Hildenbrand @ 2025-06-16 14:26 UTC (permalink / raw)
  To: Bijan Tabatabai
  Cc: linux-mm, linux-doc, linux-kernel, sj, akpm, corbet, ziy,
	matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
	ying.huang, apopple, bijantabatab, venkataravis, emirakhur,
	ajayjoshi, vtavarespetr, damon

On 16.06.25 16:16, Bijan Tabatabai wrote:
> On Mon, Jun 16, 2025 at 4:46 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 13.06.25 18:33, Bijan Tabatabai wrote:
>>> On Fri, Jun 13, 2025 at 8:45 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 12.06.25 20:13, Bijan Tabatabai wrote:
> [...]
>> Hi,
>>
>>>
>>> I did not use get_vma_policy or mpol_misplaced, which I believe is the
>>> closest function that exists for what I want in this patch, because
>>> those functions
>>
>> I think what you mean is, that you are performing an rmap walk. But
>> there, you do have a VMA + MM available (stable).
>>
>>> seem to assume they are called inside of the task that the folio/vma
>>> is mapped to.
>>
>> But, we do have a VMA at hand, so why would we want to ignore any set
>> policy? (I think VMA policies so far only apply to shmem, but still).
>>
>> I really think you want to use get_vma_policy() instead of the task policy.
> 
> Sorry, I think I misunderstood you before. You are right, we should
> consider the VMA policy before using the task policy. I will do this
> in the next revision.
> 
>>
>>> More specifically, mpol_misplaced assumes it is being called within a
>>> page fault.
>>> This doesn't work for us, because we call it inside of a kdamond process.
>>
>> Right.
>>
>> But it uses the vmf only for ...
>>
>> 1) Obtaining the VMA
>> 2) Sanity-checking that the ptlock is held.
>>
>> Which, you also have during the rmap walk.
> 
> There is another subtle dependency in get_vma_policy.
> It first checks if a VMA policy exists, and if it doesn't, it uses the
> task policy of the current task, which doesn't make sense when called
> by a kdamond thread.

Indeed, anything that depends on current task / nid / cpu might require 
care.

-- 
Cheers,

David / dhildenb



* Re: [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes
  2025-06-16  7:42     ` Byungchul Park
@ 2025-06-16 15:01       ` Bijan Tabatabai
  0 siblings, 0 replies; 30+ messages in thread
From: Bijan Tabatabai @ 2025-06-16 15:01 UTC (permalink / raw)
  To: Byungchul Park
  Cc: SeongJae Park, linux-mm, linux-doc, linux-kernel, akpm, corbet,
	david, ziy, matthew.brost, joshua.hahnjy, rakie.kim, gourry,
	ying.huang, apopple, bijantabatab, venkataravis, emirakhur,
	ajayjoshi, vtavarespetr, damon, kernel_team

On Mon, Jun 16, 2025 at 2:42 AM Byungchul Park <byungchul@sk.com> wrote:
[...]

Hi Byungchul,

> Your approach sounds interesting.
>
> IIUC, the approach could be integrated with the existing numa hinting
> mechanism as well, so as to perform weighted interleaving migration for
> promotion, though the migration may end up being suppressed anyway when
> MPOL_WEIGHTED_INTERLEAVE is set.
>
> Do you have a plan for that too?

I do not currently have plans to support that, but this approach could
be used there as well.

> Plus, it'd be best if you shared the performance improvement results rather
> than just the placement data.

Sure, I could add some performance data in the cover letter of the
next revision.

>         Byungchul
>
[...]

Thanks,
Bijan


* Re: [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h
  2025-06-16 14:16         ` Bijan Tabatabai
  2025-06-16 14:26           ` David Hildenbrand
@ 2025-06-16 17:43           ` Gregory Price
  2025-06-16 22:16             ` Bijan Tabatabai
  1 sibling, 1 reply; 30+ messages in thread
From: Gregory Price @ 2025-06-16 17:43 UTC (permalink / raw)
  To: Bijan Tabatabai
  Cc: David Hildenbrand, linux-mm, linux-doc, linux-kernel, sj, akpm,
	corbet, ziy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	ying.huang, apopple, bijantabatab, venkataravis, emirakhur,
	ajayjoshi, vtavarespetr, damon

On Mon, Jun 16, 2025 at 09:16:55AM -0500, Bijan Tabatabai wrote:
> >
> > Which, you also have during the rmap walk.
> 
> There is another subtle dependency in get_vma_policy.
> It first checks if a VMA policy exists, and if it doesn't, it uses the
> task policy of the current task, which doesn't make sense when called
> by a kdamond thread.
> 
> However, I don't think this will change what seems to be our consensus
> of adding a new helper function.
> 

Hate to interject here, but this gets worse the further you dig in.  The
mempolicy interface as a whole has many, many, many hidden references to
current->mempolicy and current->mems_allowed.  External interface
references to a task or vma mempolicy are a problem I explored somewhat
extensively when I prototyped `set_mempolicy2()`. It did not go well.

Generally, mempolicy is not well structured to allow external actors on
a task's mempolicy.  Accessing a task's mempolicy requires operating in
a task's context or at least taking a reference on that task's mempolicy
(which requires taking the task_struct lock).

I will just say that mempolicy is *extremely* current-task centric - and
very much allocation-time centric (i.e. the internal workings don't
really want to consider migration all that much).  You'll probably find
that this project requires rethinking mempolicy's external interfaces in
general (which is sorely needed anyway).

I think this path to modifying mempolicy to support DAMON is a bit
ambitious for where mempolicy is at the moment. You may be better off
duplicating the interleave-weight logic and making some helper functions
to get the weight data, and then coming back around to generalize it
later.

~Gregory


* Re: [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h
  2025-06-16 17:43           ` Gregory Price
@ 2025-06-16 22:16             ` Bijan Tabatabai
  2025-06-17 18:58               ` SeongJae Park
  0 siblings, 1 reply; 30+ messages in thread
From: Bijan Tabatabai @ 2025-06-16 22:16 UTC (permalink / raw)
  To: Gregory Price
  Cc: David Hildenbrand, linux-mm, linux-doc, linux-kernel, sj, akpm,
	corbet, ziy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	ying.huang, apopple, bijantabatab, venkataravis, emirakhur,
	ajayjoshi, vtavarespetr, damon

Hi Gregory,

On Mon, Jun 16, 2025 at 12:43 PM Gregory Price <gourry@gourry.net> wrote:
>
> On Mon, Jun 16, 2025 at 09:16:55AM -0500, Bijan Tabatabai wrote:
> > >
> > > Which, you also have during the rmap walk.
> >
> > There is another subtle dependency in get_vma_policy.
> > It first checks if a VMA policy exists, and if it doesn't, it uses the
> > task policy of the current task, which doesn't make sense when called
> > by a kdamond thread.
> >
> > However, I don't think this will change what seems to be our consensus
> > of adding a new helper function.
> >
>
> Hate to interject here, but this gets worse the further you dig in.  The
> mempolicy interface as a whole has many, many, many hidden references to
> current->mempolicy and current->mems_allowed.  External interface
> references to a task or vma mempolicy is a problem i explored somewhat
> extensively when I prototyped `set_mempolicy2()`. It did not go well.
>
> Generally, mempolicy is not well structured to allow external actors on
> a task's mempolicy.  Accessing a task's mempolicy requires operating in
> a task's context or at least taking a reference on that task's mempolicy
> (which requires taking the task_struct lock).

Good point, I didn't take the lock in the second patch. Also, this
made me realize I need to make sure there isn't a race condition where
a task exits after we get a pointer to its task_struct from
mm->owner.

> I will just say that mempolicy is *extremely* current-task centric - and
> very much allocation-time centric (i.e. the internal workings don't
> really want to consider migration all that much).  You'll probably find
> that this project requires rethinking mempolicy's external interfaces in
> general (which is sorely needed anyway).
>
> I think this path to modifying mempolicy to support DAMON is a bit
> ambitious for where mempolicy is at the moment. You may be better off
> duplicating the interleave-weight logic and making some helper functions
> to get the weight data, and then coming back around to generalize it
> later.

This may be true, but I think I will be able to avoid a lot of this
nastiness with what I need. I am going to try with the mempolicy
approach for the next revision, but if I get too much resistance, I
will probably switch to this approach.

Thanks,
Bijan


* Re: [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h
  2025-06-16 22:16             ` Bijan Tabatabai
@ 2025-06-17 18:58               ` SeongJae Park
  2025-06-17 19:54                 ` Bijan Tabatabai
  0 siblings, 1 reply; 30+ messages in thread
From: SeongJae Park @ 2025-06-17 18:58 UTC (permalink / raw)
  To: Bijan Tabatabai
  Cc: SeongJae Park, Gregory Price, David Hildenbrand, linux-mm,
	linux-doc, linux-kernel, akpm, corbet, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple,
	bijantabatab, venkataravis, emirakhur, ajayjoshi, vtavarespetr,
	damon

On Mon, 16 Jun 2025 17:16:16 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

> Hi Gregory,
> 
> On Mon, Jun 16, 2025 at 12:43 PM Gregory Price <gourry@gourry.net> wrote:
> >
> > On Mon, Jun 16, 2025 at 09:16:55AM -0500, Bijan Tabatabai wrote:
[...]
> > Hate to interject here,

Please don't hesitate :)

[...]
> > I will just say that mempolicy is *extremely* current-task centric - and
> > very much allocation-time centric (i.e. the internal workings don't
> > really want to consider migration all that much).  You'll probably find
> > that this project requires rethinking mempolicy's external interfaces in
> > general (which is sorely needed anyway).
> >
> > I think this path to modifying mempolicy to support DAMON is a bit
> > ambitious for where mempolicy is at the moment. You may be better off
> > duplicating the interleave-weight logic and making some helper functions
> > to get the weight data, and then coming back around to generalize it
> > later.

Thank you for the nice clarification and opinion, Gregory.

> 
> This may be true, but I think I will be able to avoid a lot of this
> nastiness with what I need. I am going to try with the mempolicy
> approach for the next revision, but if I get too much resistance, I
> will probably switch to this approach.

I have no strong opinion about the use of mempolicy for now, as long as the
mempolicy folks are fine with it.

Nonetheless, I just wanted to mention that Gregory's suggestion also sounds
fairly good to me.  It would avoid unnecessary coupling of the concepts of
allocation-time interleaving and after-allocation migration.  Also, it feels
even more aligned with a potential future extension of this project that we
discussed [1]: letting users set multiple target nodes for
DAMOS_MIGRATE_{HOT,COLD} with arbitrary weights.

[1] https://lore.kernel.org/20250613171237.44776-1-sj@kernel.org


Thanks,
SJ

[...]


* Re: [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h
  2025-06-17 18:58               ` SeongJae Park
@ 2025-06-17 19:54                 ` Bijan Tabatabai
  2025-06-17 22:30                   ` SeongJae Park
  0 siblings, 1 reply; 30+ messages in thread
From: Bijan Tabatabai @ 2025-06-17 19:54 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Gregory Price, David Hildenbrand, linux-mm, linux-doc,
	linux-kernel, akpm, corbet, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, ying.huang, apopple, bijantabatab,
	venkataravis, emirakhur, ajayjoshi, vtavarespetr, damon

On Tue, Jun 17, 2025 at 1:58 PM SeongJae Park <sj@kernel.org> wrote:
>
> On Mon, 16 Jun 2025 17:16:16 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
>
> > Hi Gregory,
> >
> > On Mon, Jun 16, 2025 at 12:43 PM Gregory Price <gourry@gourry.net> wrote:
> > >
> > > On Mon, Jun 16, 2025 at 09:16:55AM -0500, Bijan Tabatabai wrote:
> [...]
> > > I will just say that mempolicy is *extremely* current-task centric - and
> > > very much allocation-time centric (i.e. the internal workings don't
> > > really want to consider migration all that much).  You'll probably find
> > > that this project requires rethinking mempolicy's external interfaces in
> > > general (which is sorely needed anyway).
> > >
> > > I think this path to modifying mempolicy to support DAMON is a bit
> > > ambitious for where mempolicy is at the moment. You may be better off
> > > duplicating the interleave-weight logic and making some helper functions
> > > to get the weight data, and then coming back around to generalize it
> > > later.
>
> Thank you for the nice clarification and opinion, Gregory.
>
> >
> > This may be true, but I think I will be able to avoid a lot of this
> > nastiness with what I need. I am going to try with the mempolicy
> > approach for the next revision, but if I get too much resistance, I
> > will probably switch to this approach.
>
> I have no strong opinion about use of mempolicy for now, as long as mempolicy
> folks are fine.
>
> Nonetheless, I just wanted to mention Gregory's suggestion also sounds fairly
> good to me.  It would avoid unnecessary coupling of the concepts of
> allocation-time interleaving and after-allocation migration.  Also it feels
> even more aligned with a potential future extension of this project that we
> discussed[1]: letting users set multiple target nodes for
> DAMOS_MIGRATE_{HOT,COLD} with arbitrary weights.
>
> [1] https://lore.kernel.org/20250613171237.44776-1-sj@kernel.org

Given this discussion, as well as Joshua's comments earlier [1], it
sounds like while people aren't exactly opposed to using mempolicy for
this, the building consensus is that it would be best not to. I will
move the interleave logic to DAMON for the next revision. However, I
still think it makes sense to use the global weights (probably via
get_il_weight) for now to avoid allocating pages a certain way and
then migrating them soon after.
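
For illustration, below is a rough sketch of the DAMON-side node selection
this implies, assuming the per-node weights have already been fetched (for
example via get_il_weight()). The function and parameter names are
illustrative only and are not part of the posted patches:

/*
 * Hypothetical sketch: pick the destination node for the idx-th migrated
 * page, distributing pages across nodes in proportion to the given
 * weights, similar to weighted interleaving.
 */
static int damos_pick_interleave_nid(const u8 *weights, int nr_nodes,
				     unsigned long idx)
{
	unsigned long total = 0, pos;
	int nid;

	for (nid = 0; nid < nr_nodes; nid++)
		total += weights[nid];
	if (!total)
		return NUMA_NO_NODE;

	pos = idx % total;
	for (nid = 0; nid < nr_nodes; nid++) {
		if (pos < weights[nid])
			return nid;
		pos -= weights[nid];
	}
	return NUMA_NO_NODE;
}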

I'll try to send the next version of the patch set by the end of the week.

Thanks everyone for their feedback,
Bijan

[1] https://lore.kernel.org/linux-mm/20250613152517.225529-1-joshua.hahnjy@gmail.com/
[...]


* Re: [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h
  2025-06-17 19:54                 ` Bijan Tabatabai
@ 2025-06-17 22:30                   ` SeongJae Park
  0 siblings, 0 replies; 30+ messages in thread
From: SeongJae Park @ 2025-06-17 22:30 UTC (permalink / raw)
  To: Bijan Tabatabai
  Cc: SeongJae Park, Gregory Price, David Hildenbrand, linux-mm,
	linux-doc, linux-kernel, akpm, corbet, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple,
	bijantabatab, venkataravis, emirakhur, ajayjoshi, vtavarespetr,
	damon

On Tue, 17 Jun 2025 14:54:39 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

[...]
> Given this discussion, as well as Joshua's comments earlier [1], it
> sounds like while people aren't exactly opposed to using mempolicy for
> this, the building consensus is that it would be best not to. I will
> move the interleave logic to DAMON for the next revision. However, I
> still think it makes sense to use the global weights (probably via
> get_il_weight) for now to avoid allocating pages a certain way and
> then migrating them soon after.

Makes sense to me :)

> 
> I'll try to send the next version of the patch set by the end of the week.

Looking forward to it!


Thanks,
SJ

[...]


end of thread

Thread overview: 30+ messages
2025-06-12 18:13 [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes Bijan Tabatabai
2025-06-12 18:13 ` [RFC PATCH 1/4] mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h Bijan Tabatabai
2025-06-13 13:45   ` David Hildenbrand
2025-06-13 16:33     ` Bijan Tabatabai
2025-06-16  9:45       ` David Hildenbrand
2025-06-16 11:02         ` Huang, Ying
2025-06-16 11:11           ` David Hildenbrand
2025-06-16 14:16         ` Bijan Tabatabai
2025-06-16 14:26           ` David Hildenbrand
2025-06-16 17:43           ` Gregory Price
2025-06-16 22:16             ` Bijan Tabatabai
2025-06-17 18:58               ` SeongJae Park
2025-06-17 19:54                 ` Bijan Tabatabai
2025-06-17 22:30                   ` SeongJae Park
2025-06-16 10:55       ` Huang, Ying
2025-06-12 18:13 ` [RFC PATCH 2/4] mm/damon/paddr: Add DAMOS_INTERLEAVE action Bijan Tabatabai
2025-06-13 13:43   ` David Hildenbrand
2025-06-12 18:13 ` [RFC PATCH 3/4] mm/damon: Move damon_pa_migrate_pages to ops-common Bijan Tabatabai
2025-06-12 18:13 ` [RFC PATCH 4/4] mm/damon/vaddr: Add vaddr version of DAMOS_INTERLEAVE Bijan Tabatabai
2025-06-12 23:49 ` [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes SeongJae Park
2025-06-13  2:41   ` Huang, Ying
2025-06-13 16:02     ` Bijan Tabatabai
2025-06-13 15:44   ` Bijan Tabatabai
2025-06-13 17:12     ` SeongJae Park
2025-06-16  7:42     ` Byungchul Park
2025-06-16 15:01       ` Bijan Tabatabai
2025-06-13  9:55 ` Rakie Kim
2025-06-13 16:12   ` Bijan Tabatabai
2025-06-13 15:25 ` Joshua Hahn
2025-06-13 16:46   ` Bijan Tabatabai
