linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning
@ 2025-05-05 18:23 Joshua Hahn
  2025-05-10  5:25 ` Huang, Ying
  2025-05-19  1:56 ` Joshua Hahn
  0 siblings, 2 replies; 17+ messages in thread
From: Joshua Hahn @ 2025-05-05 18:23 UTC (permalink / raw)
  To: gourry, ying.huang
  Cc: honggyu.kim, yunjeong.mun, gregkh, rafael, lenb, dan.j.williams,
	Jonathan.Cameron, dave.jiang, horen.chuang, hannes, osalvador,
	linux-kernel, linux-acpi, linux-mm, kernel-team

On machines with multiple memory nodes, interleaving page allocations
across nodes allows for better utilization of each node's bandwidth.
Previous work by Gregory Price [1] introduced weighted interleave, which
allowed for pages to be allocated across nodes according to user-set ratios.

Ideally, these weights should be proportional to their bandwidth, so
that under bandwidth pressure, each node uses its maximal efficient
bandwidth and prevents latency from increasing exponentially.

Previously, weighted interleave's default weights were just 1s -- which
would be equivalent to the (unweighted) interleave mempolicy, which goes
through the nodes in a round-robin fashion, ignoring bandwidth information.

This patch has two main goals:
First, it makes weighted interleave easier to use for users who wish to
relieve bandwidth pressure when using nodes with varying bandwidth (CXL).
By providing a set of "real" default weights that just work out of the
box, users who might not have the capability (or wish to) perform
experimentation to find the most optimal weights for their system can
still take advantage of bandwidth-informed weighted interleave.

Second, it allows for weighted interleave to dynamically adjust to
hotplugged memory with new bandwidth information. Instead of manually
updating node weights every time new bandwidth information is reported
or taken off, weighted interleave adjusts and provides a new set of
default weights for weighted interleave to use when there is a change
in bandwidth information.

To meet these goals, this patch introduces an auto-configuration mode
for the interleave weights that provides a reasonable set of default
weights, calculated using bandwidth data reported by the system. In auto
mode, weights are dynamically adjusted based on whatever the current
bandwidth information reports (and responds to hotplug events).

This patch still supports users manually writing weights into the nodeN
sysfs interface by entering into manual mode. When a user enters manual
mode, the system stops dynamically updating any of the node weights,
even during hotplug events that shift the optimal weight distribution.

A new sysfs interface "auto" is introduced, which allows users to switch
between the auto (writing 1 or Y) and manual (writing 0 or N) modes. The
system also automatically enters manual mode when a nodeN interface is
manually written to.

There is one functional change that this patch makes to the existing
weighted_interleave ABI: previously, writing 0 directly to a nodeN
interface was said to reset the weight to the system default. Before
this patch, the default for all weights were 1, which meant that writing
0 and 1 were functionally equivalent. With this patch, writing 0 is invalid.

[1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/

Suggested-by: Yunjeong Mun <yunjeong.mun@sk.com>
Suggested-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: Ying Huang <ying.huang@linux.alibaba.com>
Suggested-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Co-developed-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
Changelog
v8:
- Rebased on top of mm-new (93129fd5a3a9c87fe0e53ff8b0bfa30ee43a873a)
  - Specifically, to include Rakie Kim's patchset mm/mempolicy:
    Enhance sysfs handling for memory hotplug in weighted interleave
    (1f553871c3925bd1fc474b5fad53a0e8609cd5b0)
  - Rebasing also included cleaning up this patch's sysfs creation.
- Fixes to prevent racing between rcu_access_pointer and rcu_dereference
- Wordsmithing

v7:
- Wordsmithing
- Rename iw_table_lock to wi_state_lock
- Clean up reduce_interleave_weights, as suggested by Yunjeong Mun.
  - Combine iw_table allocation & initialization to be outside the function.
  - Skip scaling to [1,100] before scaling to [1,weightiness].
- Removed the second part of this patch, which prevented creating weight
  sysfs interfaces for memoryless nodes.
- Added Suggested-by tags; I should have done this much, much earlier.

v6:
- iw_weights and mode_auto are combined into one rcu-protected struct.
- Protection against memoryless nodes, as suggested by Oscar Salvador
- Wordsmithing (documentation, commit message and comments), as suggested
  by Andrew Morton.
- Removed unnecessary #include statement in hmat.c, as pointed out by
  Harry (Hyeonggon) Yoo and Ying Huang.
- Bandwidth values changed from u64_t to unsigned int, as pointed out by
  Ying Huang and Dan Carpenter.
- RCU optimizations, as suggested by Ying Huang.
- A second patch is included to fix unintended behavior that creates a
  weight knob for memoryless nodes as well.
- Sysfs show/store functions use str_true_false & kstrtobool.
- Fix a build error in 32-bit systems, which are unable to perform
  64-bit division by casting 64-bit values to 32-bit, if under the range.

v5:
- I accidentally forgot to add the mm/mempolicy: subject tag since v1 of
  this patch. Added to the subject now!
- Wordsmithing, correcting typos, and re-naming variables for clarity.
- No functional changes.

v4:
- Renamed the mode interface to the "auto" interface, which now only
  emits either 'Y' or 'N'. Users can now interact with it by
  writing 'Y', '1', 'N', or '0' to it.
- Added additional documentation to the nodeN sysfs interface.
- Makes sure iw_table locks are properly held.
- Removed unlikely() call in reduce_interleave_weights.
- Wordsmithing

v3:
- Weightiness (max_node_weight) is now fixed to 32.
- Instead, the sysfs interface now exposes a "mode" parameter, which
  can either be "auto" or "manual".
  - Thank you Hyeonggon and Honggyu for the feedback.
- Documentation updated to reflect new sysfs interface, explicitly
  specifies that 0 is invalid.
  - Thank you Gregory and Ying for the discussion on how best to
    handle the 0 case.
- Re-worked nodeN sysfs store to handle auto --> manual shifts
- mempolicy_set_node_perf internally handles the auto / manual
  case differently now. bw is always updated, iw updates depend on
  what mode the user is in.
- Wordsmithing comments for clarity.
- Removed RFC tag.

v2:
- Name of the interface is changed: "max_node_weight" --> "weightiness"
- Default interleave weight table no longer exists. Rather, the
  interleave weight table is initialized with the defaults, if bandwidth
  information is available.
  - In addition, all sections that handle iw_table have been changed
    to reference iw_table if it exists, otherwise defaulting to 1.
- All instances of unsigned long are converted to uint64_t to guarantee
  support for both 32-bit and 64-bit machines
- sysfs initialization cleanup
- Documentation has been rewritten to explicitly outline expected
  behavior and expand on the interpretation of "weightiness".
- kzalloc replaced with kcalloc for readability
- Thank you Gregory and Hyeonggon for your review & feedback!

 ...fs-kernel-mm-mempolicy-weighted-interleave |  35 +-
 drivers/base/node.c                           |   9 +
 include/linux/mempolicy.h                     |  13 +
 mm/mempolicy.c                                | 311 ++++++++++++++----
 4 files changed, 307 insertions(+), 61 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
index 0b7972de04e9..ec13382c606f 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -20,6 +20,35 @@ Description:	Weight configuration interface for nodeN
 		Minimum weight: 1
 		Maximum weight: 255
 
-		Writing an empty string or `0` will reset the weight to the
-		system default. The system default may be set by the kernel
-		or drivers at boot or during hotplug events.
+		Writing invalid values (i.e. any values not in [1,255],
+		empty string, ...) will return -EINVAL.
+
+		Changing the weight to a valid value will automatically
+		update the system to manual mode as well.
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
+Date:		May 2025
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Auto-weighting configuration interface
+
+		Configuration mode for weighted interleave. 'true' indicates
+		that the system is in auto mode, and a 'false' indicates that
+		the system is in manual mode.
+
+		In auto mode, all node weights are re-calculated and overwritten
+		(visible via the nodeN interfaces) whenever new bandwidth data
+		is made available during either boot or hotplug events.
+
+		In manual mode, node weights can only be updated by the user.
+		Note that nodes that are onlined with previously set weights
+		will reuse those weights. If they were not previously set or
+		are onlined with missing bandwidth data, the weights will use
+		a default weight of 1.
+
+		Writing any true value string (e.g. Y or 1) will enable auto
+		mode, while writing any false value string (e.g. N or 0) will
+		enable manual mode. All other strings are ignored and will
+		return -EINVAL.
+
+		Writing a new weight to a node directly via the nodeN interface
+		will also automatically switch the system to manual mode.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index cd13ef287011..25ab9ec14eb8 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -7,6 +7,7 @@
 #include <linux/init.h>
 #include <linux/mm.h>
 #include <linux/memory.h>
+#include <linux/mempolicy.h>
 #include <linux/vmstat.h>
 #include <linux/notifier.h>
 #include <linux/node.h>
@@ -214,6 +215,14 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
 			break;
 		}
 	}
+
+	/* When setting CPU access coordinates, update mempolicy */
+	if (access == ACCESS_COORDINATE_CPU) {
+		if (mempolicy_set_node_perf(nid, coord)) {
+			pr_info("failed to set mempolicy attrs for node %d\n",
+				nid);
+		}
+	}
 }
 EXPORT_SYMBOL_GPL(node_set_perf_attrs);
 
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index ce9885e0178a..3e8da8ba1146 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -11,6 +11,7 @@
 #include <linux/slab.h>
 #include <linux/rbtree.h>
 #include <linux/spinlock.h>
+#include <linux/node.h>
 #include <linux/nodemask.h>
 #include <linux/pagemap.h>
 #include <uapi/linux/mempolicy.h>
@@ -56,6 +57,15 @@ struct mempolicy {
 	} w;
 };
 
+/*
+ * A null weighted_interleave_state is interpted as having .mode = "auto",
+ * and .iw_table is interpreted as an array of 1s with length nr_node_ids.
+ */
+struct weighted_interleave_state {
+	bool mode_auto;
+	u8 iw_table[];
+};
+
 /*
  * Support for managing mempolicy data objects (clone, copy, destroy)
  * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
@@ -178,6 +188,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
 
 extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
 
+extern int mempolicy_set_node_perf(unsigned int node,
+				   struct access_coordinate *coords);
+
 #else
 
 struct mempolicy {};
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f43951668c41..f542691b7123 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -109,6 +109,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/printk.h>
 #include <linux/swapops.h>
+#include <linux/gcd.h>
 
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
@@ -140,31 +141,130 @@ static struct mempolicy default_policy = {
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 
 /*
- * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
- * system-default value should be used. A NULL iw_table also denotes that
- * system-default values should be used. Until the system-default table
- * is implemented, the system-default is always 1.
- *
- * iw_table is RCU protected
+ * weightiness balances the tradeoff between small weights (cycles through nodes
+ * faster, more fair/even distribution) and large weights (smaller errors
+ * between actual bandwidth ratios and weight ratios). 32 is a number that has
+ * been found to perform at a reasonable compromise between the two goals.
+ */
+static const int weightiness = 32;
+
+static struct weighted_interleave_state __rcu *wi_state;
+static unsigned int *node_bw_table;
+
+/*
+ * wi_state_lock protects both wi_state and node_bw_table.
+ * node_bw_table is only used by writers to update wi_state.
  */
-static u8 __rcu *iw_table;
-static DEFINE_MUTEX(iw_table_lock);
+static DEFINE_MUTEX(wi_state_lock);
 
 static u8 get_il_weight(int node)
 {
-	u8 *table;
-	u8 weight;
+	struct weighted_interleave_state *state;
+	u8 weight = 1;
 
 	rcu_read_lock();
-	table = rcu_dereference(iw_table);
-	/* if no iw_table, use system default */
-	weight = table ? table[node] : 1;
-	/* if value in iw_table is 0, use system default */
-	weight = weight ? weight : 1;
+	state = rcu_dereference(wi_state);
+	if (state)
+		weight = state->iw_table[node];
 	rcu_read_unlock();
 	return weight;
 }
 
+/*
+ * Convert bandwidth values into weighted interleave weights.
+ * Call with wi_state_lock.
+ */
+static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
+{
+	u64 sum_bw = 0;
+	unsigned int cast_sum_bw, scaling_factor = 1, iw_gcd = 0;
+	int nid;
+
+	for_each_node_state(nid, N_MEMORY)
+		sum_bw += bw[nid];
+
+	/* Scale bandwidths to whole numbers in the range [1, weightiness] */
+	for_each_node_state(nid, N_MEMORY) {
+		/*
+		 * Try not to perform 64-bit division.
+		 * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
+		 * If sum_bw > scaling_factor, then round the weight up to 1.
+		 */
+		scaling_factor = weightiness * bw[nid];
+		if (bw[nid] && sum_bw < scaling_factor) {
+			cast_sum_bw = (unsigned int)sum_bw;
+			new_iw[nid] = scaling_factor / cast_sum_bw;
+		} else {
+			new_iw[nid] = 1;
+		}
+		if (!iw_gcd)
+			iw_gcd = new_iw[nid];
+		iw_gcd = gcd(iw_gcd, new_iw[nid]);
+	}
+
+	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
+	for_each_node_state(nid, N_MEMORY)
+		new_iw[nid] /= iw_gcd;
+}
+
+int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
+{
+	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
+	unsigned int *old_bw, *new_bw;
+	unsigned int bw_val;
+	int i;
+
+	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
+	new_bw = kcalloc(nr_node_ids, sizeof(unsigned int), GFP_KERNEL);
+	if (!new_bw)
+		return -ENOMEM;
+
+	new_wi_state = kmalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
+			       GFP_KERNEL);
+	if (!new_wi_state) {
+		kfree(new_bw);
+		return -ENOMEM;
+	}
+	new_wi_state->mode_auto = true;
+	for (i = 0; i < nr_node_ids; i++)
+		new_wi_state->iw_table[i] = 1;
+
+	/*
+	 * Update bandwidth info, even in manual mode. That way, when switching
+	 * to auto mode in the future, iw_table can be overwritten using
+	 * accurate bw data.
+	 */
+	mutex_lock(&wi_state_lock);
+
+	old_bw = node_bw_table;
+	if (old_bw)
+		memcpy(new_bw, old_bw, nr_node_ids * sizeof(*old_bw));
+	new_bw[node] = bw_val;
+	node_bw_table = new_bw;
+
+	old_wi_state = rcu_dereference_protected(wi_state,
+					lockdep_is_held(&wi_state_lock));
+	if (old_wi_state && !old_wi_state->mode_auto) {
+		/* Manual mode; skip reducing weights and updating wi_state */
+		mutex_unlock(&wi_state_lock);
+		kfree(new_wi_state);
+		goto out;
+	}
+
+	/* NULL wi_state assumes auto=true; reduce weights and update wi_state*/
+	reduce_interleave_weights(new_bw, new_wi_state->iw_table);
+	rcu_assign_pointer(wi_state, new_wi_state);
+
+	mutex_unlock(&wi_state_lock);
+	if (old_wi_state) {
+		synchronize_rcu();
+		kfree(old_wi_state);
+	}
+out:
+	kfree(old_bw);
+	return 0;
+}
+
 /**
  * numa_nearest_node - Find nearest node by state
  * @node: Node id to start the search
@@ -2023,26 +2123,28 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
 
 static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 {
+	struct weighted_interleave_state *state;
 	nodemask_t nodemask;
 	unsigned int target, nr_nodes;
-	u8 *table;
+	u8 *table = NULL;
 	unsigned int weight_total = 0;
 	u8 weight;
-	int nid;
+	int nid = 0;
 
 	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
 	if (!nr_nodes)
 		return numa_node_id();
 
 	rcu_read_lock();
-	table = rcu_dereference(iw_table);
+
+	state = rcu_dereference(wi_state);
+	/* Uninitialized wi_state means we should assume all weights are 1 */
+	if (state)
+		table = state->iw_table;
+
 	/* calculate the total weight */
-	for_each_node_mask(nid, nodemask) {
-		/* detect system default usage */
-		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
-		weight_total += weight;
-	}
+	for_each_node_mask(nid, nodemask)
+		weight_total += table ? table[nid] : 1;
 
 	/* Calculate the node offset based on totals */
 	target = ilx % weight_total;
@@ -2050,7 +2152,6 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	while (target) {
 		/* detect system default usage */
 		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
 		if (target < weight)
 			break;
 		target -= weight;
@@ -2451,13 +2552,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
 		struct mempolicy *pol, unsigned long nr_pages,
 		struct page **page_array)
 {
+	struct weighted_interleave_state *state;
 	struct task_struct *me = current;
 	unsigned int cpuset_mems_cookie;
 	unsigned long total_allocated = 0;
 	unsigned long nr_allocated = 0;
 	unsigned long rounds;
 	unsigned long node_pages, delta;
-	u8 *table, *weights, weight;
+	u8 *weights, weight;
 	unsigned int weight_total = 0;
 	unsigned long rem_pages = nr_pages;
 	nodemask_t nodes;
@@ -2507,17 +2609,19 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
 		return total_allocated;
 
 	rcu_read_lock();
-	table = rcu_dereference(iw_table);
-	if (table)
-		memcpy(weights, table, nr_node_ids);
-	rcu_read_unlock();
+	state = rcu_dereference(wi_state);
+	if (state) {
+		memcpy(weights, state->iw_table, nr_node_ids * sizeof(u8));
+		rcu_read_unlock();
+	} else {
+		rcu_read_unlock();
+		for (i = 0; i < nr_node_ids; i++)
+			weights[i] = 1;
+	}
 
 	/* calculate total, detect system default usage */
-	for_each_node_mask(node, nodes) {
-		if (!weights[node])
-			weights[node] = 1;
+	for_each_node_mask(node, nodes)
 		weight_total += weights[node];
-	}
 
 	/*
 	 * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
@@ -3431,6 +3535,7 @@ struct iw_node_attr {
 struct sysfs_wi_group {
 	struct kobject wi_kobj;
 	struct mutex kobj_lock;
+	struct kobj_attribute auto_kobj_attr;
 	struct iw_node_attr *nattrs[];
 };
 
@@ -3450,31 +3555,104 @@ static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
 static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
 			  const char *buf, size_t count)
 {
+	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
 	struct iw_node_attr *node_attr;
-	u8 *new;
-	u8 *old;
 	u8 weight = 0;
+	int i;
 
 	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
 	if (count == 0 || sysfs_streq(buf, ""))
 		weight = 0;
-	else if (kstrtou8(buf, 0, &weight))
+	else if (kstrtou8(buf, 0, &weight) || weight == 0)
 		return -EINVAL;
 
-	new = kzalloc(nr_node_ids, GFP_KERNEL);
-	if (!new)
+	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
+			       GFP_KERNEL);
+	if (!new_wi_state)
 		return -ENOMEM;
 
-	mutex_lock(&iw_table_lock);
-	old = rcu_dereference_protected(iw_table,
-					lockdep_is_held(&iw_table_lock));
-	if (old)
-		memcpy(new, old, nr_node_ids);
-	new[node_attr->nid] = weight;
-	rcu_assign_pointer(iw_table, new);
-	mutex_unlock(&iw_table_lock);
-	synchronize_rcu();
-	kfree(old);
+	mutex_lock(&wi_state_lock);
+	old_wi_state = rcu_dereference_protected(wi_state,
+					lockdep_is_held(&wi_state_lock));
+	if (old_wi_state) {
+		memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
+					nr_node_ids * sizeof(u8));
+	} else {
+		for (i = 0; i < nr_node_ids; i++)
+			new_wi_state->iw_table[i] = 1;
+	}
+	new_wi_state->iw_table[node_attr->nid] = weight;
+	new_wi_state->mode_auto = false;
+
+	rcu_assign_pointer(wi_state, new_wi_state);
+	mutex_unlock(&wi_state_lock);
+	if (old_wi_state) {
+		synchronize_rcu();
+		kfree(old_wi_state);
+	}
+	return count;
+}
+
+static ssize_t weighted_interleave_auto_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct weighted_interleave_state *state;
+	bool wi_auto = true;
+
+	rcu_read_lock();
+	state = rcu_dereference(wi_state);
+	if (state)
+		wi_auto = state->mode_auto;
+	rcu_read_unlock();
+
+	return sysfs_emit(buf, "%s\n", str_true_false(wi_auto));
+}
+
+static ssize_t weighted_interleave_auto_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
+	unsigned int *bw;
+	bool input;
+	int i;
+
+	if (kstrtobool(buf, &input))
+		return -EINVAL;
+
+	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
+			       GFP_KERNEL);
+	if (!new_wi_state)
+		return -ENOMEM;
+	for (i = 0; i < nr_node_ids; i++)
+		new_wi_state->iw_table[i] = 1;
+
+	mutex_lock(&wi_state_lock);
+	if (!input) {
+		old_wi_state = rcu_dereference_protected(wi_state,
+					lockdep_is_held(&wi_state_lock));
+		if (old_wi_state)
+			memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
+					nr_node_ids * sizeof(u8));
+		goto update_wi_state;
+	}
+
+	bw = node_bw_table;
+	if (!bw) {
+		mutex_unlock(&wi_state_lock);
+		kfree(new_wi_state);
+		return -ENODEV;
+	}
+
+	new_wi_state->mode_auto = true;
+	reduce_interleave_weights(bw, new_wi_state->iw_table);
+
+update_wi_state:
+	rcu_assign_pointer(wi_state, new_wi_state);
+	mutex_unlock(&wi_state_lock);
+	if (old_wi_state) {
+		synchronize_rcu();
+		kfree(old_wi_state);
+	}
 	return count;
 }
 
@@ -3508,23 +3686,31 @@ static void sysfs_wi_node_delete_all(void)
 		sysfs_wi_node_delete(nid);
 }
 
-static void iw_table_free(void)
+static void wi_state_free(void)
 {
-	u8 *old;
+	struct weighted_interleave_state *old_wi_state;
 
-	mutex_lock(&iw_table_lock);
-	old = rcu_dereference_protected(iw_table,
-					lockdep_is_held(&iw_table_lock));
-	rcu_assign_pointer(iw_table, NULL);
-	mutex_unlock(&iw_table_lock);
+	mutex_lock(&wi_state_lock);
+
+	old_wi_state = rcu_dereference_protected(wi_state,
+			lockdep_is_held(&wi_state_lock));
+	if (!old_wi_state) {
+		mutex_unlock(&wi_state_lock);
+		goto out;
+	}
 
+	rcu_assign_pointer(wi_state, NULL);
+	mutex_unlock(&wi_state_lock);
 	synchronize_rcu();
-	kfree(old);
+	kfree(old_wi_state);
+out:
+	kfree(&wi_group->wi_kobj);
 }
 
 static void wi_cleanup(void) {
+	sysfs_remove_file(&wi_group->wi_kobj, &wi_group->auto_kobj_attr.attr);
 	sysfs_wi_node_delete_all();
-	iw_table_free();
+	wi_state_free();
 }
 
 static void wi_kobj_release(struct kobject *wi_kobj)
@@ -3612,6 +3798,10 @@ static int wi_node_notifier(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
+static struct kobj_attribute wi_auto_attr =
+	__ATTR(auto, 0664, weighted_interleave_auto_show,
+			   weighted_interleave_auto_store);
+
 static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
 {
 	int nid, err;
@@ -3627,6 +3817,11 @@ static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
 	if (err)
 		goto err_put_kobj;
 
+	err = sysfs_create_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
+	if (err)
+		goto err_put_kobj;
+	wi_group->auto_kobj_attr = wi_auto_attr;
+
 	for_each_online_node(nid) {
 		if (!node_state(nid, N_MEMORY))
 			continue;
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-05-05 18:23 [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning Joshua Hahn
@ 2025-05-10  5:25 ` Huang, Ying
  2025-05-10 18:51   ` Joshua Hahn
  2025-05-19  1:56 ` Joshua Hahn
  1 sibling, 1 reply; 17+ messages in thread
From: Huang, Ying @ 2025-05-10  5:25 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: gourry, honggyu.kim, yunjeong.mun, gregkh, rafael, lenb,
	dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, osalvador, linux-kernel, linux-acpi, linux-mm,
	kernel-team

Hi, Joshua,

Thank you for updated version!  And sorry for late reply.

Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across nodes according to user-set ratios.
>
> Ideally, these weights should be proportional to their bandwidth, so
> that under bandwidth pressure, each node uses its maximal efficient
> bandwidth and prevents latency from increasing exponentially.
>
> Previously, weighted interleave's default weights were just 1s -- which
> would be equivalent to the (unweighted) interleave mempolicy, which goes
> through the nodes in a round-robin fashion, ignoring bandwidth information.
>
> This patch has two main goals:
> First, it makes weighted interleave easier to use for users who wish to
> relieve bandwidth pressure when using nodes with varying bandwidth (CXL).
> By providing a set of "real" default weights that just work out of the
> box, users who might not have the capability (or wish to) perform
> experimentation to find the most optimal weights for their system can
> still take advantage of bandwidth-informed weighted interleave.
>
> Second, it allows for weighted interleave to dynamically adjust to
> hotplugged memory with new bandwidth information. Instead of manually
> updating node weights every time new bandwidth information is reported
> or taken off, weighted interleave adjusts and provides a new set of
> default weights for weighted interleave to use when there is a change
> in bandwidth information.
>
> To meet these goals, this patch introduces an auto-configuration mode
> for the interleave weights that provides a reasonable set of default
> weights, calculated using bandwidth data reported by the system. In auto
> mode, weights are dynamically adjusted based on whatever the current
> bandwidth information reports (and responds to hotplug events).
>
> This patch still supports users manually writing weights into the nodeN
> sysfs interface by entering into manual mode. When a user enters manual
> mode, the system stops dynamically updating any of the node weights,
> even during hotplug events that shift the optimal weight distribution.
>
> A new sysfs interface "auto" is introduced, which allows users to switch
> between the auto (writing 1 or Y) and manual (writing 0 or N) modes. The
> system also automatically enters manual mode when a nodeN interface is
> manually written to.
>
> There is one functional change that this patch makes to the existing
> weighted_interleave ABI: previously, writing 0 directly to a nodeN
> interface was said to reset the weight to the system default. Before
> this patch, the default for all weights were 1, which meant that writing
> 0 and 1 were functionally equivalent. With this patch, writing 0 is invalid.
>
> [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
>
> Suggested-by: Yunjeong Mun <yunjeong.mun@sk.com>
> Suggested-by: Oscar Salvador <osalvador@suse.de>
> Suggested-by: Ying Huang <ying.huang@linux.alibaba.com>
> Suggested-by: Harry Yoo <harry.yoo@oracle.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Co-developed-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> ---
> Changelog
> v8:
> - Rebased on top of mm-new (93129fd5a3a9c87fe0e53ff8b0bfa30ee43a873a)
>   - Specifically, to include Rakie Kim's patchset mm/mempolicy:
>     Enhance sysfs handling for memory hotplug in weighted interleave
>     (1f553871c3925bd1fc474b5fad53a0e8609cd5b0)
>   - Rebasing also included cleaning up this patch's sysfs creation.
> - Fixes to prevent racing between rcu_access_pointer and rcu_dereference
> - Wordsmithing
>
> v7:
> - Wordsmithing
> - Rename iw_table_lock to wi_state_lock
> - Clean up reduce_interleave_weights, as suggested by Yunjeong Mun.
>   - Combine iw_table allocation & initialization to be outside the function.
>   - Skip scaling to [1,100] before scaling to [1,weightiness].
> - Removed the second part of this patch, which prevented creating weight
>   sysfs interfaces for memoryless nodes.
> - Added Suggested-by tags; I should have done this much, much earlier.
>
> v6:
> - iw_weights and mode_auto are combined into one rcu-protected struct.
> - Protection against memoryless nodes, as suggested by Oscar Salvador
> - Wordsmithing (documentation, commit message and comments), as suggested
>   by Andrew Morton.
> - Removed unnecessary #include statement in hmat.c, as pointed out by
>   Harry (Hyeonggon) Yoo and Ying Huang.
> - Bandwidth values changed from u64_t to unsigned int, as pointed out by
>   Ying Huang and Dan Carpenter.
> - RCU optimizations, as suggested by Ying Huang.
> - A second patch is included to fix unintended behavior that creates a
>   weight knob for memoryless nodes as well.
> - Sysfs show/store functions use str_true_false & kstrtobool.
> - Fix a build error in 32-bit systems, which are unable to perform
>   64-bit division by casting 64-bit values to 32-bit, if under the range.
>
> v5:
> - I accidentally forgot to add the mm/mempolicy: subject tag since v1 of
>   this patch. Added to the subject now!
> - Wordsmithing, correcting typos, and re-naming variables for clarity.
> - No functional changes.
>
> v4:
> - Renamed the mode interface to the "auto" interface, which now only
>   emits either 'Y' or 'N'. Users can now interact with it by
>   writing 'Y', '1', 'N', or '0' to it.
> - Added additional documentation to the nodeN sysfs interface.
> - Makes sure iw_table locks are properly held.
> - Removed unlikely() call in reduce_interleave_weights.
> - Wordsmithing
>
> v3:
> - Weightiness (max_node_weight) is now fixed to 32.
> - Instead, the sysfs interface now exposes a "mode" parameter, which
>   can either be "auto" or "manual".
>   - Thank you Hyeonggon and Honggyu for the feedback.
> - Documentation updated to reflect new sysfs interface, explicitly
>   specifies that 0 is invalid.
>   - Thank you Gregory and Ying for the discussion on how best to
>     handle the 0 case.
> - Re-worked nodeN sysfs store to handle auto --> manual shifts
> - mempolicy_set_node_perf internally handles the auto / manual
>   case differently now. bw is always updated, iw updates depend on
>   what mode the user is in.
> - Wordsmithing comments for clarity.
> - Removed RFC tag.
>
> v2:
> - Name of the interface is changed: "max_node_weight" --> "weightiness"
> - Default interleave weight table no longer exists. Rather, the
>   interleave weight table is initialized with the defaults, if bandwidth
>   information is available.
>   - In addition, all sections that handle iw_table have been changed
>     to reference iw_table if it exists, otherwise defaulting to 1.
> - All instances of unsigned long are converted to uint64_t to guarantee
>   support for both 32-bit and 64-bit machines
> - sysfs initialization cleanup
> - Documentation has been rewritten to explicitly outline expected
>   behavior and expand on the interpretation of "weightiness".
> - kzalloc replaced with kcalloc for readability
> - Thank you Gregory and Hyeonggon for your review & feedback!
>
>  ...fs-kernel-mm-mempolicy-weighted-interleave |  35 +-
>  drivers/base/node.c                           |   9 +
>  include/linux/mempolicy.h                     |  13 +
>  mm/mempolicy.c                                | 311 ++++++++++++++----
>  4 files changed, 307 insertions(+), 61 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> index 0b7972de04e9..ec13382c606f 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -20,6 +20,35 @@ Description:	Weight configuration interface for nodeN
>  		Minimum weight: 1
>  		Maximum weight: 255
>  
> -		Writing an empty string or `0` will reset the weight to the
> -		system default. The system default may be set by the kernel
> -		or drivers at boot or during hotplug events.
> +		Writing invalid values (i.e. any values not in [1,255],
> +		empty string, ...) will return -EINVAL.
> +
> +		Changing the weight to a valid value will automatically
> +		update the system to manual mode as well.

s/update/switch/ ?

But my English is poor, please keep your version if you think that it's
better.

> +
> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
> +Date:		May 2025
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Auto-weighting configuration interface
> +
> +		Configuration mode for weighted interleave. 'true' indicates
> +		that the system is in auto mode, and a 'false' indicates that
> +		the system is in manual mode.
> +
> +		In auto mode, all node weights are re-calculated and overwritten
> +		(visible via the nodeN interfaces) whenever new bandwidth data
> +		is made available during either boot or hotplug events.
> +
> +		In manual mode, node weights can only be updated by the user.
> +		Note that nodes that are onlined with previously set weights
> +		will reuse those weights. If they were not previously set or
> +		are onlined with missing bandwidth data, the weights will use
> +		a default weight of 1.
> +
> +		Writing any true value string (e.g. Y or 1) will enable auto
> +		mode, while writing any false value string (e.g. N or 0) will
> +		enable manual mode. All other strings are ignored and will
> +		return -EINVAL.
> +
> +		Writing a new weight to a node directly via the nodeN interface
> +		will also automatically switch the system to manual mode.
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index cd13ef287011..25ab9ec14eb8 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -7,6 +7,7 @@
>  #include <linux/init.h>
>  #include <linux/mm.h>
>  #include <linux/memory.h>
> +#include <linux/mempolicy.h>
>  #include <linux/vmstat.h>
>  #include <linux/notifier.h>
>  #include <linux/node.h>
> @@ -214,6 +215,14 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
>  			break;
>  		}
>  	}
> +
> +	/* When setting CPU access coordinates, update mempolicy */
> +	if (access == ACCESS_COORDINATE_CPU) {
> +		if (mempolicy_set_node_perf(nid, coord)) {
> +			pr_info("failed to set mempolicy attrs for node %d\n",
> +				nid);
> +		}
> +	}
>  }
>  EXPORT_SYMBOL_GPL(node_set_perf_attrs);
>  
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index ce9885e0178a..3e8da8ba1146 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -11,6 +11,7 @@
>  #include <linux/slab.h>
>  #include <linux/rbtree.h>
>  #include <linux/spinlock.h>
> +#include <linux/node.h>
>  #include <linux/nodemask.h>
>  #include <linux/pagemap.h>
>  #include <uapi/linux/mempolicy.h>
> @@ -56,6 +57,15 @@ struct mempolicy {
>  	} w;
>  };
>  
> +/*
> + * A null weighted_interleave_state is interpted as having .mode = "auto",
> + * and .iw_table is interpreted as an array of 1s with length nr_node_ids.
> + */

Better to move the comments to above "wi_state" definition?

> +struct weighted_interleave_state {
> +	bool mode_auto;
> +	u8 iw_table[];
> +};
> +

Why do you put the type definition in mempolicy.h instead of
mempolicy.c?  I don't find other users except mempolicy.c.

>  /*
>   * Support for managing mempolicy data objects (clone, copy, destroy)
>   * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
> @@ -178,6 +188,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
>  
>  extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
>  
> +extern int mempolicy_set_node_perf(unsigned int node,
> +				   struct access_coordinate *coords);
> +
>  #else
>  
>  struct mempolicy {};
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f43951668c41..f542691b7123 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -109,6 +109,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/printk.h>
>  #include <linux/swapops.h>
> +#include <linux/gcd.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
> @@ -140,31 +141,130 @@ static struct mempolicy default_policy = {
>  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>  
>  /*
> - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
> - * system-default value should be used. A NULL iw_table also denotes that
> - * system-default values should be used. Until the system-default table
> - * is implemented, the system-default is always 1.
> - *
> - * iw_table is RCU protected
> + * weightiness balances the tradeoff between small weights (cycles through nodes
> + * faster, more fair/even distribution) and large weights (smaller errors
> + * between actual bandwidth ratios and weight ratios). 32 is a number that has
> + * been found to perform at a reasonable compromise between the two goals.
> + */
> +static const int weightiness = 32;
> +
> +static struct weighted_interleave_state __rcu *wi_state;
> +static unsigned int *node_bw_table;
> +
> +/*
> + * wi_state_lock protects both wi_state and node_bw_table.
> + * node_bw_table is only used by writers to update wi_state.
>   */
> -static u8 __rcu *iw_table;
> -static DEFINE_MUTEX(iw_table_lock);
> +static DEFINE_MUTEX(wi_state_lock);
>  
>  static u8 get_il_weight(int node)
>  {
> -	u8 *table;
> -	u8 weight;
> +	struct weighted_interleave_state *state;
> +	u8 weight = 1;
>  
>  	rcu_read_lock();
> -	table = rcu_dereference(iw_table);
> -	/* if no iw_table, use system default */
> -	weight = table ? table[node] : 1;
> -	/* if value in iw_table is 0, use system default */
> -	weight = weight ? weight : 1;
> +	state = rcu_dereference(wi_state);
> +	if (state)
> +		weight = state->iw_table[node];
>  	rcu_read_unlock();
>  	return weight;
>  }
>  
> +/*
> + * Convert bandwidth values into weighted interleave weights.
> + * Call with wi_state_lock.
> + */
> +static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
> +{
> +	u64 sum_bw = 0;
> +	unsigned int cast_sum_bw, scaling_factor = 1, iw_gcd = 0;
> +	int nid;
> +
> +	for_each_node_state(nid, N_MEMORY)
> +		sum_bw += bw[nid];
> +
> +	/* Scale bandwidths to whole numbers in the range [1, weightiness] */
> +	for_each_node_state(nid, N_MEMORY) {
> +		/*
> +		 * Try not to perform 64-bit division.
> +		 * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
> +		 * If sum_bw > scaling_factor, then round the weight up to 1.
> +		 */
> +		scaling_factor = weightiness * bw[nid];
> +		if (bw[nid] && sum_bw < scaling_factor) {
> +			cast_sum_bw = (unsigned int)sum_bw;
> +			new_iw[nid] = scaling_factor / cast_sum_bw;
> +		} else {
> +			new_iw[nid] = 1;
> +		}
> +		if (!iw_gcd)
> +			iw_gcd = new_iw[nid];
> +		iw_gcd = gcd(iw_gcd, new_iw[nid]);
> +	}
> +
> +	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
> +	for_each_node_state(nid, N_MEMORY)
> +		new_iw[nid] /= iw_gcd;
> +}
> +
> +int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
> +{
> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
> +	unsigned int *old_bw, *new_bw;
> +	unsigned int bw_val;
> +	int i;
> +
> +	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
> +	new_bw = kcalloc(nr_node_ids, sizeof(unsigned int), GFP_KERNEL);
> +	if (!new_bw)
> +		return -ENOMEM;
> +
> +	new_wi_state = kmalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> +			       GFP_KERNEL);
> +	if (!new_wi_state) {
> +		kfree(new_bw);
> +		return -ENOMEM;
> +	}
> +	new_wi_state->mode_auto = true;
> +	for (i = 0; i < nr_node_ids; i++)
> +		new_wi_state->iw_table[i] = 1;
> +
> +	/*
> +	 * Update bandwidth info, even in manual mode. That way, when switching
> +	 * to auto mode in the future, iw_table can be overwritten using
> +	 * accurate bw data.
> +	 */
> +	mutex_lock(&wi_state_lock);
> +
> +	old_bw = node_bw_table;
> +	if (old_bw)
> +		memcpy(new_bw, old_bw, nr_node_ids * sizeof(*old_bw));
> +	new_bw[node] = bw_val;
> +	node_bw_table = new_bw;
> +
> +	old_wi_state = rcu_dereference_protected(wi_state,
> +					lockdep_is_held(&wi_state_lock));
> +	if (old_wi_state && !old_wi_state->mode_auto) {
> +		/* Manual mode; skip reducing weights and updating wi_state */
> +		mutex_unlock(&wi_state_lock);
> +		kfree(new_wi_state);
> +		goto out;
> +	}
> +
> +	/* NULL wi_state assumes auto=true; reduce weights and update wi_state*/
> +	reduce_interleave_weights(new_bw, new_wi_state->iw_table);
> +	rcu_assign_pointer(wi_state, new_wi_state);
> +
> +	mutex_unlock(&wi_state_lock);
> +	if (old_wi_state) {
> +		synchronize_rcu();
> +		kfree(old_wi_state);
> +	}
> +out:
> +	kfree(old_bw);
> +	return 0;
> +}
> +
>  /**
>   * numa_nearest_node - Find nearest node by state
>   * @node: Node id to start the search
> @@ -2023,26 +2123,28 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
>  
>  static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>  {
> +	struct weighted_interleave_state *state;
>  	nodemask_t nodemask;
>  	unsigned int target, nr_nodes;
> -	u8 *table;
> +	u8 *table = NULL;
>  	unsigned int weight_total = 0;
>  	u8 weight;
> -	int nid;
> +	int nid = 0;
>  
>  	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
>  	if (!nr_nodes)
>  		return numa_node_id();
>  
>  	rcu_read_lock();
> -	table = rcu_dereference(iw_table);
> +
> +	state = rcu_dereference(wi_state);
> +	/* Uninitialized wi_state means we should assume all weights are 1 */
> +	if (state)
> +		table = state->iw_table;
> +
>  	/* calculate the total weight */
> -	for_each_node_mask(nid, nodemask) {
> -		/* detect system default usage */
> -		weight = table ? table[nid] : 1;
> -		weight = weight ? weight : 1;
> -		weight_total += weight;
> -	}
> +	for_each_node_mask(nid, nodemask)
> +		weight_total += table ? table[nid] : 1;
>  
>  	/* Calculate the node offset based on totals */
>  	target = ilx % weight_total;
> @@ -2050,7 +2152,6 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>  	while (target) {
>  		/* detect system default usage */
>  		weight = table ? table[nid] : 1;
> -		weight = weight ? weight : 1;
>  		if (target < weight)
>  			break;
>  		target -= weight;
> @@ -2451,13 +2552,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>  		struct mempolicy *pol, unsigned long nr_pages,
>  		struct page **page_array)
>  {
> +	struct weighted_interleave_state *state;
>  	struct task_struct *me = current;
>  	unsigned int cpuset_mems_cookie;
>  	unsigned long total_allocated = 0;
>  	unsigned long nr_allocated = 0;
>  	unsigned long rounds;
>  	unsigned long node_pages, delta;
> -	u8 *table, *weights, weight;
> +	u8 *weights, weight;
>  	unsigned int weight_total = 0;
>  	unsigned long rem_pages = nr_pages;
>  	nodemask_t nodes;
> @@ -2507,17 +2609,19 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>  		return total_allocated;
>  
>  	rcu_read_lock();
> -	table = rcu_dereference(iw_table);
> -	if (table)
> -		memcpy(weights, table, nr_node_ids);
> -	rcu_read_unlock();
> +	state = rcu_dereference(wi_state);
> +	if (state) {
> +		memcpy(weights, state->iw_table, nr_node_ids * sizeof(u8));
> +		rcu_read_unlock();
> +	} else {
> +		rcu_read_unlock();
> +		for (i = 0; i < nr_node_ids; i++)
> +			weights[i] = 1;
> +	}
>  
>  	/* calculate total, detect system default usage */
> -	for_each_node_mask(node, nodes) {
> -		if (!weights[node])
> -			weights[node] = 1;
> +	for_each_node_mask(node, nodes)
>  		weight_total += weights[node];
> -	}
>  
>  	/*
>  	 * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
> @@ -3431,6 +3535,7 @@ struct iw_node_attr {
>  struct sysfs_wi_group {
>  	struct kobject wi_kobj;
>  	struct mutex kobj_lock;
> +	struct kobj_attribute auto_kobj_attr;
>  	struct iw_node_attr *nattrs[];
>  };
>  
> @@ -3450,31 +3555,104 @@ static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
>  static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
>  			  const char *buf, size_t count)
>  {
> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
>  	struct iw_node_attr *node_attr;
> -	u8 *new;
> -	u8 *old;
>  	u8 weight = 0;
> +	int i;
>  
>  	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
>  	if (count == 0 || sysfs_streq(buf, ""))
>  		weight = 0;

According to revised ABI, we should return -EINVAL here?

> -	else if (kstrtou8(buf, 0, &weight))
> +	else if (kstrtou8(buf, 0, &weight) || weight == 0)
>  		return -EINVAL;
>  
> -	new = kzalloc(nr_node_ids, GFP_KERNEL);
> -	if (!new)
> +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> +			       GFP_KERNEL);
> +	if (!new_wi_state)
>  		return -ENOMEM;
>  
> -	mutex_lock(&iw_table_lock);
> -	old = rcu_dereference_protected(iw_table,
> -					lockdep_is_held(&iw_table_lock));
> -	if (old)
> -		memcpy(new, old, nr_node_ids);
> -	new[node_attr->nid] = weight;
> -	rcu_assign_pointer(iw_table, new);
> -	mutex_unlock(&iw_table_lock);
> -	synchronize_rcu();
> -	kfree(old);
> +	mutex_lock(&wi_state_lock);
> +	old_wi_state = rcu_dereference_protected(wi_state,
> +					lockdep_is_held(&wi_state_lock));
> +	if (old_wi_state) {
> +		memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
> +					nr_node_ids * sizeof(u8));
> +	} else {
> +		for (i = 0; i < nr_node_ids; i++)
> +			new_wi_state->iw_table[i] = 1;
> +	}
> +	new_wi_state->iw_table[node_attr->nid] = weight;
> +	new_wi_state->mode_auto = false;
> +
> +	rcu_assign_pointer(wi_state, new_wi_state);
> +	mutex_unlock(&wi_state_lock);
> +	if (old_wi_state) {
> +		synchronize_rcu();
> +		kfree(old_wi_state);
> +	}
> +	return count;
> +}
> +
> +static ssize_t weighted_interleave_auto_show(struct kobject *kobj,
> +		struct kobj_attribute *attr, char *buf)
> +{
> +	struct weighted_interleave_state *state;
> +	bool wi_auto = true;
> +
> +	rcu_read_lock();
> +	state = rcu_dereference(wi_state);
> +	if (state)
> +		wi_auto = state->mode_auto;
> +	rcu_read_unlock();
> +
> +	return sysfs_emit(buf, "%s\n", str_true_false(wi_auto));
> +}
> +
> +static ssize_t weighted_interleave_auto_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
> +	unsigned int *bw;
> +	bool input;
> +	int i;
> +
> +	if (kstrtobool(buf, &input))
> +		return -EINVAL;
> +
> +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> +			       GFP_KERNEL);
> +	if (!new_wi_state)
> +		return -ENOMEM;
> +	for (i = 0; i < nr_node_ids; i++)
> +		new_wi_state->iw_table[i] = 1;
> +
> +	mutex_lock(&wi_state_lock);
> +	if (!input) {

If input == old_wi_state->mode_auto, we can return directly?

> +		old_wi_state = rcu_dereference_protected(wi_state,
> +					lockdep_is_held(&wi_state_lock));
> +		if (old_wi_state)
> +			memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
> +					nr_node_ids * sizeof(u8));
> +		goto update_wi_state;
> +	}
> +
> +	bw = node_bw_table;
> +	if (!bw) {
> +		mutex_unlock(&wi_state_lock);
> +		kfree(new_wi_state);
> +		return -ENODEV;
> +	}
> +
> +	new_wi_state->mode_auto = true;
> +	reduce_interleave_weights(bw, new_wi_state->iw_table);
> +
> +update_wi_state:
> +	rcu_assign_pointer(wi_state, new_wi_state);
> +	mutex_unlock(&wi_state_lock);
> +	if (old_wi_state) {
> +		synchronize_rcu();
> +		kfree(old_wi_state);
> +	}
>  	return count;
>  }
>  
> @@ -3508,23 +3686,31 @@ static void sysfs_wi_node_delete_all(void)
>  		sysfs_wi_node_delete(nid);
>  }
>  
> -static void iw_table_free(void)
> +static void wi_state_free(void)
>  {
> -	u8 *old;
> +	struct weighted_interleave_state *old_wi_state;
>  
> -	mutex_lock(&iw_table_lock);
> -	old = rcu_dereference_protected(iw_table,
> -					lockdep_is_held(&iw_table_lock));
> -	rcu_assign_pointer(iw_table, NULL);
> -	mutex_unlock(&iw_table_lock);
> +	mutex_lock(&wi_state_lock);
> +
> +	old_wi_state = rcu_dereference_protected(wi_state,
> +			lockdep_is_held(&wi_state_lock));
> +	if (!old_wi_state) {
> +		mutex_unlock(&wi_state_lock);
> +		goto out;
> +	}
>  
> +	rcu_assign_pointer(wi_state, NULL);
> +	mutex_unlock(&wi_state_lock);
>  	synchronize_rcu();
> -	kfree(old);
> +	kfree(old_wi_state);
> +out:
> +	kfree(&wi_group->wi_kobj);
>  }
>  
>  static void wi_cleanup(void) {
> +	sysfs_remove_file(&wi_group->wi_kobj, &wi_group->auto_kobj_attr.attr);

Why not just

	sysfs_remove_file(&wi_group->wi_kobj, &wi_auto_attr.attr);

?

>  	sysfs_wi_node_delete_all();
> -	iw_table_free();
> +	wi_state_free();
>  }
>  
>  static void wi_kobj_release(struct kobject *wi_kobj)
> @@ -3612,6 +3798,10 @@ static int wi_node_notifier(struct notifier_block *nb,
>  	return NOTIFY_OK;
>  }
>  
> +static struct kobj_attribute wi_auto_attr =
> +	__ATTR(auto, 0664, weighted_interleave_auto_show,
> +			   weighted_interleave_auto_store);
> +
>  static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
>  {
>  	int nid, err;
> @@ -3627,6 +3817,11 @@ static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
>  	if (err)
>  		goto err_put_kobj;
>  
> +	err = sysfs_create_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
> +	if (err)
> +		goto err_put_kobj;
> +	wi_group->auto_kobj_attr = wi_auto_attr;
> +
>  	for_each_online_node(nid) {
>  		if (!node_state(nid, N_MEMORY))
>  			continue;

---
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-05-10  5:25 ` Huang, Ying
@ 2025-05-10 18:51   ` Joshua Hahn
  2025-05-11  2:58     ` Joshua Hahn
  0 siblings, 1 reply; 17+ messages in thread
From: Joshua Hahn @ 2025-05-10 18:51 UTC (permalink / raw)
  To: Huang, Ying
  Cc: gourry, honggyu.kim, yunjeong.mun, gregkh, rafael, lenb,
	dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, osalvador, linux-kernel, linux-acpi, linux-mm,
	kernel-team

On Sat, 10 May 2025 13:25:32 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:

Hi Ying,
Thank you for reviewing my patch, as always!

> Hi, Joshua,
> 
> Thank you for updated version!  And sorry for late reply.
> 
> Joshua Hahn <joshua.hahnjy@gmail.com> writes:

[...snip...]

> > diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > index 0b7972de04e9..ec13382c606f 100644
> > --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > @@ -20,6 +20,35 @@ Description:	Weight configuration interface for nodeN
> >  		Minimum weight: 1
> >  		Maximum weight: 255
> >  
> > -		Writing an empty string or `0` will reset the weight to the
> > -		system default. The system default may be set by the kernel
> > -		or drivers at boot or during hotplug events.
> > +		Writing invalid values (i.e. any values not in [1,255],
> > +		empty string, ...) will return -EINVAL.
> > +
> > +		Changing the weight to a valid value will automatically
> > +		update the system to manual mode as well.
> 
> s/update/switch/ ?
> 
> But my English is poor, please keep your version if you think that it's
> better.

I have no particular preference here, whatever will make it easiest for the
users to understand what is happening. I'll take your suggestion!

[...snip...]

> > +/*
> > + * A null weighted_interleave_state is interpted as having .mode = "auto",
> > + * and .iw_table is interpreted as an array of 1s with length nr_node_ids.
> > + */
> 
> Better to move the comments to above "wi_state" definition?
> 
> > +struct weighted_interleave_state {
> > +	bool mode_auto;
> > +	u8 iw_table[];
> > +};
> > +
> 
> Why do you put the type definition in mempolicy.h instead of
> mempolicy.c?  I don't find other users except mempolicy.c.

Good point, I'll move the definition to mempolicy.c and move the comment
to the wi_state definition as well.

[...snip...]

> > @@ -3450,31 +3555,104 @@ static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
> >  static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
> >  			  const char *buf, size_t count)
> >  {
> > +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
> >  	struct iw_node_attr *node_attr;
> > -	u8 *new;
> > -	u8 *old;
> >  	u8 weight = 0;
> > +	int i;
> >  
> >  	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
> >  	if (count == 0 || sysfs_streq(buf, ""))
> >  		weight = 0;
> 
> According to revised ABI, we should return -EINVAL here?

Great catch, I completely ignored the ABI description that I wrote...
I'll go ahead and just return -EINVAL here!

[...snip...]

> > +static ssize_t weighted_interleave_auto_store(struct kobject *kobj,
> > +		struct kobj_attribute *attr, const char *buf, size_t count)
> > +{
> > +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
> > +	unsigned int *bw;
> > +	bool input;
> > +	int i;
> > +
> > +	if (kstrtobool(buf, &input))
> > +		return -EINVAL;
> > +
> > +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> > +			       GFP_KERNEL);
> > +	if (!new_wi_state)
> > +		return -ENOMEM;
> > +	for (i = 0; i < nr_node_ids; i++)
> > +		new_wi_state->iw_table[i] = 1;
> > +
> > +	mutex_lock(&wi_state_lock);
> > +	if (!input) {
> 
> If input == old_wi_state->mode_auto, we can return directly?

Yes, that makes sense to me.

> >  static void wi_cleanup(void) {
> > +	sysfs_remove_file(&wi_group->wi_kobj, &wi_group->auto_kobj_attr.attr);
> 
> Why not just
> 
> 	sysfs_remove_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
> 
> ?

Also makes sense!!

> ---
> Best Regards,
> Huang, Ying

Thank you for your great feedback Ying, I'll make changes based on
your suggestions and shortly send up a v9. I hope you have a great day!
Joshua


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-05-10 18:51   ` Joshua Hahn
@ 2025-05-11  2:58     ` Joshua Hahn
  2025-05-11 12:56       ` Honggyu Kim
  2025-05-12  1:35       ` Huang, Ying
  0 siblings, 2 replies; 17+ messages in thread
From: Joshua Hahn @ 2025-05-11  2:58 UTC (permalink / raw)
  To: Joshua Hahn, Andrew Morton
  Cc: Huang, Ying, gourry, honggyu.kim, yunjeong.mun, gregkh, rafael,
	lenb, dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, osalvador, linux-kernel, linux-acpi, linux-mm,
	kernel-team

Hello Andrew,

I was hoping to fold this fixlet in with the patch this belongs to. It includes
some wordsmithing changes, some code simplification/cleanups, and makes sure
that the code behavior matches that of the ABI I described. I've kept the
original message below as well, where Ying suggested the changes present in
this fixlet.

Please let me know if this fixlet is too big, and you would rather prefer a
new version instead. Thank you as always for your patience and support!
Joshua

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
index ec13382c606f..649c0e9b895c 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -24,7 +24,7 @@ Description:	Weight configuration interface for nodeN
 		empty string, ...) will return -EINVAL.

 		Changing the weight to a valid value will automatically
-		update the system to manual mode as well.
+		switch the system to manual mode as well.

 What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
 Date:		May 2025
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 3e8da8ba1146..0fe96f3ab3ef 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -57,15 +57,6 @@ struct mempolicy {
 	} w;
 };

-/*
- * A null weighted_interleave_state is interpted as having .mode = "auto",
- * and .iw_table is interpreted as an array of 1s with length nr_node_ids.
- */
-struct weighted_interleave_state {
-	bool mode_auto;
-	u8 iw_table[];
-};
-
 /*
  * Support for managing mempolicy data objects (clone, copy, destroy)
  * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f542691b7123..0624d735a2e7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -148,6 +148,14 @@ static struct mempolicy preferred_node_policy[MAX_NUMNODES];
  */
 static const int weightiness = 32;

+/*
+ * A null weighted_interleave_state is interpreted as having .mode="auto",
+ * and .iw_table is interpreted as an array of 1s with length nr_node_ids.
+ */
+struct weighted_interleave_state {
+	bool mode_auto;
+	u8 iw_table[];
+};
 static struct weighted_interleave_state __rcu *wi_state;
 static unsigned int *node_bw_table;

@@ -3561,9 +3569,8 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
 	int i;

 	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
-	if (count == 0 || sysfs_streq(buf, ""))
-		weight = 0;
-	else if (kstrtou8(buf, 0, &weight) || weight == 0)
+	if (count == 0 || sysfs_streq(buf, "") ||
+	    kstrtou8(buf, 0, &weight) || weight == 0)
 		return -EINVAL;

 	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
@@ -3630,9 +3637,15 @@ static ssize_t weighted_interleave_auto_store(struct kobject *kobj,
 	if (!input) {
 		old_wi_state = rcu_dereference_protected(wi_state,
 					lockdep_is_held(&wi_state_lock));
-		if (old_wi_state)
-			memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
-					nr_node_ids * sizeof(u8));
+		if (!old_wi_state)
+			goto update_wi_state;
+		if (input == old_wi_state->mode_auto) {
+			mutex_unlock(&wi_state_lock);
+			return count;
+		}
+
+		memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
+					       nr_node_ids * sizeof(u8));
 		goto update_wi_state;
 	}

@@ -3707,8 +3720,12 @@ static void wi_state_free(void)
 	kfree(&wi_group->wi_kobj);
 }

+static struct kobj_attribute wi_auto_attr =
+	__ATTR(auto, 0664, weighted_interleave_auto_show,
+			   weighted_interleave_auto_store);
+
 static void wi_cleanup(void) {
-	sysfs_remove_file(&wi_group->wi_kobj, &wi_group->auto_kobj_attr.attr);
+	sysfs_remove_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
 	sysfs_wi_node_delete_all();
 	wi_state_free();
 }
@@ -3798,10 +3815,6 @@ static int wi_node_notifier(struct notifier_block *nb,
 	return NOTIFY_OK;
 }

-static struct kobj_attribute wi_auto_attr =
-	__ATTR(auto, 0664, weighted_interleave_auto_show,
-			   weighted_interleave_auto_store);
-
 static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
 {
 	int nid, err;


On Sat, 10 May 2025 11:51:50 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> On Sat, 10 May 2025 13:25:32 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
> 
> Hi Ying,
> Thank you for reviewing my patch, as always!
> 
> > Hi, Joshua,
> > 
> > Thank you for updated version!  And sorry for late reply.
> > 
> > Joshua Hahn <joshua.hahnjy@gmail.com> writes:
> 
> [...snip...]
> 
> > > diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > > index 0b7972de04e9..ec13382c606f 100644
> > > --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > > +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > > @@ -20,6 +20,35 @@ Description:	Weight configuration interface for nodeN
> > >  		Minimum weight: 1
> > >  		Maximum weight: 255
> > >  
> > > -		Writing an empty string or `0` will reset the weight to the
> > > -		system default. The system default may be set by the kernel
> > > -		or drivers at boot or during hotplug events.
> > > +		Writing invalid values (i.e. any values not in [1,255],
> > > +		empty string, ...) will return -EINVAL.
> > > +
> > > +		Changing the weight to a valid value will automatically
> > > +		update the system to manual mode as well.
> > 
> > s/update/switch/ ?
> > 
> > But my English is poor, please keep your version if you think that it's
> > better.
> 
> I have no particular preference here, whatever will make it easiest for the
> users to understand what is happening. I'll take your suggestion!
> 
> [...snip...]
> 
> > > +/*
> > > + * A null weighted_interleave_state is interpted as having .mode = "auto",
> > > + * and .iw_table is interpreted as an array of 1s with length nr_node_ids.
> > > + */
> > 
> > Better to move the comments to above "wi_state" definition?
> > 
> > > +struct weighted_interleave_state {
> > > +	bool mode_auto;
> > > +	u8 iw_table[];
> > > +};
> > > +
> > 
> > Why do you put the type definition in mempolicy.h instead of
> > mempolicy.c?  I don't find other users except mempolicy.c.
> 
> Good point, I'll move the definition to mempolicy.c and move the comment
> to the wi_state definition as well.
> 
> [...snip...]
> 
> > > @@ -3450,31 +3555,104 @@ static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
> > >  static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
> > >  			  const char *buf, size_t count)
> > >  {
> > > +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
> > >  	struct iw_node_attr *node_attr;
> > > -	u8 *new;
> > > -	u8 *old;
> > >  	u8 weight = 0;
> > > +	int i;
> > >  
> > >  	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
> > >  	if (count == 0 || sysfs_streq(buf, ""))
> > >  		weight = 0;
> > 
> > According to revised ABI, we should return -EINVAL here?
> 
> Great catch, I completely ignored the ABI description that I wrote...
> I'll go ahead and just return -EINVAL here!
> 
> [...snip...]
> 
> > > +static ssize_t weighted_interleave_auto_store(struct kobject *kobj,
> > > +		struct kobj_attribute *attr, const char *buf, size_t count)
> > > +{
> > > +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
> > > +	unsigned int *bw;
> > > +	bool input;
> > > +	int i;
> > > +
> > > +	if (kstrtobool(buf, &input))
> > > +		return -EINVAL;
> > > +
> > > +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> > > +			       GFP_KERNEL);
> > > +	if (!new_wi_state)
> > > +		return -ENOMEM;
> > > +	for (i = 0; i < nr_node_ids; i++)
> > > +		new_wi_state->iw_table[i] = 1;
> > > +
> > > +	mutex_lock(&wi_state_lock);
> > > +	if (!input) {
> > 
> > If input == old_wi_state->mode_auto, we can return directly?
> 
> Yes, that makes sense to me.
> 
> > >  static void wi_cleanup(void) {
> > > +	sysfs_remove_file(&wi_group->wi_kobj, &wi_group->auto_kobj_attr.attr);
> > 
> > Why not just
> > 
> > 	sysfs_remove_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
> > 
> > ?
> 
> Also makes sense!!
> 
> > ---
> > Best Regards,
> > Huang, Ying
> 
> Thank you for your great feedback Ying, I'll make changes based on
> your suggestions and shortly send up a v9. I hope you have a great day!
> Joshua
> 


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-05-11  2:58     ` Joshua Hahn
@ 2025-05-11 12:56       ` Honggyu Kim
  2025-05-12 14:14         ` Joshua Hahn
  2025-05-12  1:35       ` Huang, Ying
  1 sibling, 1 reply; 17+ messages in thread
From: Honggyu Kim @ 2025-05-11 12:56 UTC (permalink / raw)
  To: Joshua Hahn, Andrew Morton
  Cc: kernel_team, Huang, Ying, gourry, yunjeong.mun, gregkh, rafael,
	lenb, dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, osalvador, linux-kernel, linux-acpi, linux-mm,
	kernel-team

Hi Joshua,

Thanks for the update this patch and it looks good to me.

I've applied your v8 patch with your fixup here together, then tested it in my
server, which has 8ch of DRAM with 4ch of CXL memory in each socket.

I can confirm that it shows decent ratio with this auto weight configuration as
follows.

   $ ls /sys/kernel/mm/mempolicy/weighted_interleave/
   auto  node0  node1  node2  node3

   $ cat /sys/kernel/mm/mempolicy/weighted_interleave/*
   true
   3
   3
   2
   2

Hi Andrew,

I'm not sure if Joshua is better to post v9, but if you want to fold and update,
then could you please add my tags as follows when you fold this change?

   Reviewed-by: Honggyu Kim <honggyu.kim@sk.com>
   Tested-by: Honggyu Kim <honggyu.kim@sk.com>

I added the same tags in v7 but not included in v8 somehow.
https://lore.kernel.org/linux-mm/5fdd7db9-96fb-49ea-9803-977158cb0132@sk.com

Thanks,
Honggyu

On 5/11/2025 11:58 AM, Joshua Hahn wrote:
> Hello Andrew,
> 
> I was hoping to fold this fixlet in with the patch this belongs to. It includes
> some wordsmithing changes, some code simplification/cleanups, and makes sure
> that the code behavior matches that of the ABI I described. I've kept the
> original message below as well, where Ying suggested the changes present in
> this fixlet.
> 
> Please let me know if this fixlet is too big, and you would rather prefer a
> new version instead. Thank you as always for your patience and support!
> Joshua
> 
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> index ec13382c606f..649c0e9b895c 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -24,7 +24,7 @@ Description:	Weight configuration interface for nodeN
>   		empty string, ...) will return -EINVAL.
> 
>   		Changing the weight to a valid value will automatically
> -		update the system to manual mode as well.
> +		switch the system to manual mode as well.
> 
>   What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
>   Date:		May 2025
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 3e8da8ba1146..0fe96f3ab3ef 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -57,15 +57,6 @@ struct mempolicy {
>   	} w;
>   };
> 
> -/*
> - * A null weighted_interleave_state is interpted as having .mode = "auto",
> - * and .iw_table is interpreted as an array of 1s with length nr_node_ids.
> - */
> -struct weighted_interleave_state {
> -	bool mode_auto;
> -	u8 iw_table[];
> -};
> -
>   /*
>    * Support for managing mempolicy data objects (clone, copy, destroy)
>    * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f542691b7123..0624d735a2e7 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -148,6 +148,14 @@ static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>    */
>   static const int weightiness = 32;
> 
> +/*
> + * A null weighted_interleave_state is interpreted as having .mode="auto",
> + * and .iw_table is interpreted as an array of 1s with length nr_node_ids.
> + */
> +struct weighted_interleave_state {
> +	bool mode_auto;
> +	u8 iw_table[];
> +};
>   static struct weighted_interleave_state __rcu *wi_state;
>   static unsigned int *node_bw_table;
> 
> @@ -3561,9 +3569,8 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
>   	int i;
> 
>   	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
> -	if (count == 0 || sysfs_streq(buf, ""))
> -		weight = 0;
> -	else if (kstrtou8(buf, 0, &weight) || weight == 0)
> +	if (count == 0 || sysfs_streq(buf, "") ||
> +	    kstrtou8(buf, 0, &weight) || weight == 0)
>   		return -EINVAL;
> 
>   	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> @@ -3630,9 +3637,15 @@ static ssize_t weighted_interleave_auto_store(struct kobject *kobj,
>   	if (!input) {
>   		old_wi_state = rcu_dereference_protected(wi_state,
>   					lockdep_is_held(&wi_state_lock));
> -		if (old_wi_state)
> -			memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
> -					nr_node_ids * sizeof(u8));
> +		if (!old_wi_state)
> +			goto update_wi_state;
> +		if (input == old_wi_state->mode_auto) {
> +			mutex_unlock(&wi_state_lock);
> +			return count;
> +		}
> +
> +		memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
> +					       nr_node_ids * sizeof(u8));
>   		goto update_wi_state;
>   	}
> 
> @@ -3707,8 +3720,12 @@ static void wi_state_free(void)
>   	kfree(&wi_group->wi_kobj);
>   }
> 
> +static struct kobj_attribute wi_auto_attr =
> +	__ATTR(auto, 0664, weighted_interleave_auto_show,
> +			   weighted_interleave_auto_store);
> +
>   static void wi_cleanup(void) {
> -	sysfs_remove_file(&wi_group->wi_kobj, &wi_group->auto_kobj_attr.attr);
> +	sysfs_remove_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
>   	sysfs_wi_node_delete_all();
>   	wi_state_free();
>   }
> @@ -3798,10 +3815,6 @@ static int wi_node_notifier(struct notifier_block *nb,
>   	return NOTIFY_OK;
>   }
> 
> -static struct kobj_attribute wi_auto_attr =
> -	__ATTR(auto, 0664, weighted_interleave_auto_show,
> -			   weighted_interleave_auto_store);
> -
>   static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
>   {
>   	int nid, err;
> 
> 
> On Sat, 10 May 2025 11:51:50 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> 
>> On Sat, 10 May 2025 13:25:32 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
>>
>> Hi Ying,
>> Thank you for reviewing my patch, as always!
>>
>>> Hi, Joshua,
>>>
>>> Thank you for updated version!  And sorry for late reply.
>>>
>>> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
>>
>> [...snip...]
>>
>>>> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>>>> index 0b7972de04e9..ec13382c606f 100644
>>>> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>>>> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>>>> @@ -20,6 +20,35 @@ Description:	Weight configuration interface for nodeN
>>>>   		Minimum weight: 1
>>>>   		Maximum weight: 255
>>>>   
>>>> -		Writing an empty string or `0` will reset the weight to the
>>>> -		system default. The system default may be set by the kernel
>>>> -		or drivers at boot or during hotplug events.
>>>> +		Writing invalid values (i.e. any values not in [1,255],
>>>> +		empty string, ...) will return -EINVAL.
>>>> +
>>>> +		Changing the weight to a valid value will automatically
>>>> +		update the system to manual mode as well.
>>>
>>> s/update/switch/ ?
>>>
>>> But my English is poor, please keep your version if you think that it's
>>> better.
>>
>> I have no particular preference here, whatever will make it easiest for the
>> users to understand what is happening. I'll take your suggestion!
>>
>> [...snip...]
>>
>>>> +/*
>>>> + * A null weighted_interleave_state is interpted as having .mode = "auto",
>>>> + * and .iw_table is interpreted as an array of 1s with length nr_node_ids.
>>>> + */
>>>
>>> Better to move the comments to above "wi_state" definition?
>>>
>>>> +struct weighted_interleave_state {
>>>> +	bool mode_auto;
>>>> +	u8 iw_table[];
>>>> +};
>>>> +
>>>
>>> Why do you put the type definition in mempolicy.h instead of
>>> mempolicy.c?  I don't find other users except mempolicy.c.
>>
>> Good point, I'll move the definition to mempolicy.c and move the comment
>> to the wi_state definition as well.
>>
>> [...snip...]
>>
>>>> @@ -3450,31 +3555,104 @@ static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
>>>>   static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
>>>>   			  const char *buf, size_t count)
>>>>   {
>>>> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
>>>>   	struct iw_node_attr *node_attr;
>>>> -	u8 *new;
>>>> -	u8 *old;
>>>>   	u8 weight = 0;
>>>> +	int i;
>>>>   
>>>>   	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
>>>>   	if (count == 0 || sysfs_streq(buf, ""))
>>>>   		weight = 0;
>>>
>>> According to revised ABI, we should return -EINVAL here?
>>
>> Great catch, I completely ignored the ABI description that I wrote...
>> I'll go ahead and just return -EINVAL here!
>>
>> [...snip...]
>>
>>>> +static ssize_t weighted_interleave_auto_store(struct kobject *kobj,
>>>> +		struct kobj_attribute *attr, const char *buf, size_t count)
>>>> +{
>>>> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
>>>> +	unsigned int *bw;
>>>> +	bool input;
>>>> +	int i;
>>>> +
>>>> +	if (kstrtobool(buf, &input))
>>>> +		return -EINVAL;
>>>> +
>>>> +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
>>>> +			       GFP_KERNEL);
>>>> +	if (!new_wi_state)
>>>> +		return -ENOMEM;
>>>> +	for (i = 0; i < nr_node_ids; i++)
>>>> +		new_wi_state->iw_table[i] = 1;
>>>> +
>>>> +	mutex_lock(&wi_state_lock);
>>>> +	if (!input) {
>>>
>>> If input == old_wi_state->mode_auto, we can return directly?
>>
>> Yes, that makes sense to me.
>>
>>>>   static void wi_cleanup(void) {
>>>> +	sysfs_remove_file(&wi_group->wi_kobj, &wi_group->auto_kobj_attr.attr);
>>>
>>> Why not just
>>>
>>> 	sysfs_remove_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
>>>
>>> ?
>>
>> Also makes sense!!
>>
>>> ---
>>> Best Regards,
>>> Huang, Ying
>>
>> Thank you for your great feedback Ying, I'll make changes based on
>> your suggestions and shortly send up a v9. I hope you have a great day!
>> Joshua
>>



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-05-11  2:58     ` Joshua Hahn
  2025-05-11 12:56       ` Honggyu Kim
@ 2025-05-12  1:35       ` Huang, Ying
  2025-05-12 14:25         ` Joshua Hahn
  1 sibling, 1 reply; 17+ messages in thread
From: Huang, Ying @ 2025-05-12  1:35 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, gourry, honggyu.kim, yunjeong.mun, gregkh, rafael,
	lenb, dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, osalvador, linux-kernel, linux-acpi, linux-mm,
	kernel-team

Hi, Joshua,

Joshua Hahn <joshua.hahnjy@gmail.com> writes:

[snip]

> @@ -3707,8 +3720,12 @@ static void wi_state_free(void)
>  	kfree(&wi_group->wi_kobj);
>  }
>
> +static struct kobj_attribute wi_auto_attr =
> +	__ATTR(auto, 0664, weighted_interleave_auto_show,
> +			   weighted_interleave_auto_store);
> +
>  static void wi_cleanup(void) {
> -	sysfs_remove_file(&wi_group->wi_kobj, &wi_group->auto_kobj_attr.attr);
> +	sysfs_remove_file(&wi_group->wi_kobj, &wi_auto_attr.attr);

If we use wi_auto_attr directly here, we can remove auto_kobj_attr field
from struct sysfs_wi_group?

>  	sysfs_wi_node_delete_all();
>  	wi_state_free();
>  }

[snip]

---
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-05-11 12:56       ` Honggyu Kim
@ 2025-05-12 14:14         ` Joshua Hahn
  2025-05-16  4:37           ` Honggyu Kim
  0 siblings, 1 reply; 17+ messages in thread
From: Joshua Hahn @ 2025-05-12 14:14 UTC (permalink / raw)
  To: Honggyu Kim
  Cc: Andrew Morton, kernel_team, Huang, Ying, gourry, yunjeong.mun,
	gregkh, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, osalvador, linux-kernel,
	linux-acpi, linux-mm, kernel-team

Hi Honggyu, thank you for reviewing & testing my patch (again)!

On Sun, 11 May 2025 21:56:20 +0900 Honggyu Kim <honggyu.kim@sk.com> wrote:

> Hi Joshua,
> 
> Thanks for the update this patch and it looks good to me.
> 
> I've applied your v8 patch with your fixup here together, then tested it in my
> server, which has 8ch of DRAM with 4ch of CXL memory in each socket.
> 
> I can confirm that it shows decent ratio with this auto weight configuration as
> follows.
> 
>    $ ls /sys/kernel/mm/mempolicy/weighted_interleave/
>    auto  node0  node1  node2  node3
> 
>    $ cat /sys/kernel/mm/mempolicy/weighted_interleave/*
>    true
>    3
>    3
>    2
>    2
> 
> Hi Andrew,
> 
> I'm not sure if Joshua is better to post v9, but if you want to fold and update,
> then could you please add my tags as follows when you fold this change?
> 
>    Reviewed-by: Honggyu Kim <honggyu.kim@sk.com>
>    Tested-by: Honggyu Kim <honggyu.kim@sk.com>
> 
> I added the same tags in v7 but not included in v8 somehow.
> https://lore.kernel.org/linux-mm/5fdd7db9-96fb-49ea-9803-977158cb0132@sk.com

I must have missed including these tags. Sorry about the confusion --
hopefully we can incorporate them into v8!

> Thanks,
> Honggyu
> 
> On 5/11/2025 11:58 AM, Joshua Hahn wrote:
> > Hello Andrew,
> > 
> > I was hoping to fold this fixlet in with the patch this belongs to. It includes
> > some wordsmithing changes, some code simplification/cleanups, and makes sure
> > that the code behavior matches that of the ABI I described. I've kept the
> > original message below as well, where Ying suggested the changes present in
> > this fixlet.
> > 
> > Please let me know if this fixlet is too big, and you would rather prefer a
> > new version instead. Thank you as always for your patience and support!
> > Joshua
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > index ec13382c606f..649c0e9b895c 100644
> > --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > @@ -24,7 +24,7 @@ Description:	Weight configuration interface for nodeN
> >   		empty string, ...) will return -EINVAL.
> > 
> >   		Changing the weight to a valid value will automatically
> > -		update the system to manual mode as well.
> > +		switch the system to manual mode as well.
> > 
> >   What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
> >   Date:		May 2025
> > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> > index 3e8da8ba1146..0fe96f3ab3ef 100644
> > --- a/include/linux/mempolicy.h
> > +++ b/include/linux/mempolicy.h
> > @@ -57,15 +57,6 @@ struct mempolicy {
> >   	} w;
> >   };
> > 
> > -/*
> > - * A null weighted_interleave_state is interpted as having .mode = "auto",
> > - * and .iw_table is interpreted as an array of 1s with length nr_node_ids.
> > - */
> > -struct weighted_interleave_state {
> > -	bool mode_auto;
> > -	u8 iw_table[];
> > -};
> > -
> >   /*
> >    * Support for managing mempolicy data objects (clone, copy, destroy)
> >    * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index f542691b7123..0624d735a2e7 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -148,6 +148,14 @@ static struct mempolicy preferred_node_policy[MAX_NUMNODES];
> >    */
> >   static const int weightiness = 32;
> > 
> > +/*
> > + * A null weighted_interleave_state is interpreted as having .mode="auto",
> > + * and .iw_table is interpreted as an array of 1s with length nr_node_ids.
> > + */
> > +struct weighted_interleave_state {
> > +	bool mode_auto;
> > +	u8 iw_table[];
> > +};
> >   static struct weighted_interleave_state __rcu *wi_state;
> >   static unsigned int *node_bw_table;
> > 
> > @@ -3561,9 +3569,8 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
> >   	int i;
> > 
> >   	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
> > -	if (count == 0 || sysfs_streq(buf, ""))
> > -		weight = 0;
> > -	else if (kstrtou8(buf, 0, &weight) || weight == 0)
> > +	if (count == 0 || sysfs_streq(buf, "") ||
> > +	    kstrtou8(buf, 0, &weight) || weight == 0)
> >   		return -EINVAL;
> > 
> >   	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> > @@ -3630,9 +3637,15 @@ static ssize_t weighted_interleave_auto_store(struct kobject *kobj,
> >   	if (!input) {
> >   		old_wi_state = rcu_dereference_protected(wi_state,
> >   					lockdep_is_held(&wi_state_lock));
> > -		if (old_wi_state)
> > -			memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
> > -					nr_node_ids * sizeof(u8));
> > +		if (!old_wi_state)
> > +			goto update_wi_state;
> > +		if (input == old_wi_state->mode_auto) {
> > +			mutex_unlock(&wi_state_lock);
> > +			return count;
> > +		}
> > +
> > +		memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
> > +					       nr_node_ids * sizeof(u8));
> >   		goto update_wi_state;
> >   	}
> > 
> > @@ -3707,8 +3720,12 @@ static void wi_state_free(void)
> >   	kfree(&wi_group->wi_kobj);
> >   }
> > 
> > +static struct kobj_attribute wi_auto_attr =
> > +	__ATTR(auto, 0664, weighted_interleave_auto_show,
> > +			   weighted_interleave_auto_store);
> > +
> >   static void wi_cleanup(void) {
> > -	sysfs_remove_file(&wi_group->wi_kobj, &wi_group->auto_kobj_attr.attr);
> > +	sysfs_remove_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
> >   	sysfs_wi_node_delete_all();
> >   	wi_state_free();
> >   }
> > @@ -3798,10 +3815,6 @@ static int wi_node_notifier(struct notifier_block *nb,
> >   	return NOTIFY_OK;
> >   }
> > 
> > -static struct kobj_attribute wi_auto_attr =
> > -	__ATTR(auto, 0664, weighted_interleave_auto_show,
> > -			   weighted_interleave_auto_store);
> > -
> >   static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
> >   {
> >   	int nid, err;
> > 
> > 
> > On Sat, 10 May 2025 11:51:50 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> > 
> >> On Sat, 10 May 2025 13:25:32 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
> >>
> >> Hi Ying,
> >> Thank you for reviewing my patch, as always!
> >>
> >>> Hi, Joshua,
> >>>
> >>> Thank you for updated version!  And sorry for late reply.
> >>>
> >>> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
> >>
> >> [...snip...]
> >>
> >>>> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> >>>> index 0b7972de04e9..ec13382c606f 100644
> >>>> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> >>>> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> >>>> @@ -20,6 +20,35 @@ Description:	Weight configuration interface for nodeN
> >>>>   		Minimum weight: 1
> >>>>   		Maximum weight: 255
> >>>>   
> >>>> -		Writing an empty string or `0` will reset the weight to the
> >>>> -		system default. The system default may be set by the kernel
> >>>> -		or drivers at boot or during hotplug events.
> >>>> +		Writing invalid values (i.e. any values not in [1,255],
> >>>> +		empty string, ...) will return -EINVAL.
> >>>> +
> >>>> +		Changing the weight to a valid value will automatically
> >>>> +		update the system to manual mode as well.
> >>>
> >>> s/update/switch/ ?
> >>>
> >>> But my English is poor, please keep your version if you think that it's
> >>> better.
> >>
> >> I have no particular preference here, whatever will make it easiest for the
> >> users to understand what is happening. I'll take your suggestion!
> >>
> >> [...snip...]
> >>
> >>>> +/*
> >>>> + * A null weighted_interleave_state is interpted as having .mode = "auto",
> >>>> + * and .iw_table is interpreted as an array of 1s with length nr_node_ids.
> >>>> + */
> >>>
> >>> Better to move the comments to above "wi_state" definition?
> >>>
> >>>> +struct weighted_interleave_state {
> >>>> +	bool mode_auto;
> >>>> +	u8 iw_table[];
> >>>> +};
> >>>> +
> >>>
> >>> Why do you put the type definition in mempolicy.h instead of
> >>> mempolicy.c?  I don't find other users except mempolicy.c.
> >>
> >> Good point, I'll move the definition to mempolicy.c and move the comment
> >> to the wi_state definition as well.
> >>
> >> [...snip...]
> >>
> >>>> @@ -3450,31 +3555,104 @@ static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
> >>>>   static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
> >>>>   			  const char *buf, size_t count)
> >>>>   {
> >>>> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
> >>>>   	struct iw_node_attr *node_attr;
> >>>> -	u8 *new;
> >>>> -	u8 *old;
> >>>>   	u8 weight = 0;
> >>>> +	int i;
> >>>>   
> >>>>   	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
> >>>>   	if (count == 0 || sysfs_streq(buf, ""))
> >>>>   		weight = 0;
> >>>
> >>> According to revised ABI, we should return -EINVAL here?
> >>
> >> Great catch, I completely ignored the ABI description that I wrote...
> >> I'll go ahead and just return -EINVAL here!
> >>
> >> [...snip...]
> >>
> >>>> +static ssize_t weighted_interleave_auto_store(struct kobject *kobj,
> >>>> +		struct kobj_attribute *attr, const char *buf, size_t count)
> >>>> +{
> >>>> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
> >>>> +	unsigned int *bw;
> >>>> +	bool input;
> >>>> +	int i;
> >>>> +
> >>>> +	if (kstrtobool(buf, &input))
> >>>> +		return -EINVAL;
> >>>> +
> >>>> +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> >>>> +			       GFP_KERNEL);
> >>>> +	if (!new_wi_state)
> >>>> +		return -ENOMEM;
> >>>> +	for (i = 0; i < nr_node_ids; i++)
> >>>> +		new_wi_state->iw_table[i] = 1;
> >>>> +
> >>>> +	mutex_lock(&wi_state_lock);
> >>>> +	if (!input) {
> >>>
> >>> If input == old_wi_state->mode_auto, we can return directly?
> >>
> >> Yes, that makes sense to me.
> >>
> >>>>   static void wi_cleanup(void) {
> >>>> +	sysfs_remove_file(&wi_group->wi_kobj, &wi_group->auto_kobj_attr.attr);
> >>>
> >>> Why not just
> >>>
> >>> 	sysfs_remove_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
> >>>
> >>> ?
> >>
> >> Also makes sense!!
> >>
> >>> ---
> >>> Best Regards,
> >>> Huang, Ying
> >>
> >> Thank you for your great feedback Ying, I'll make changes based on
> >> your suggestions and shortly send up a v9. I hope you have a great day!
> >> Joshua
> >>
> 
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-05-12  1:35       ` Huang, Ying
@ 2025-05-12 14:25         ` Joshua Hahn
  2025-05-12 22:29           ` Andrew Morton
  2025-05-13  1:41           ` Huang, Ying
  0 siblings, 2 replies; 17+ messages in thread
From: Joshua Hahn @ 2025-05-12 14:25 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, gourry, honggyu.kim, yunjeong.mun, gregkh, rafael,
	lenb, dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, osalvador, linux-kernel, linux-acpi, linux-mm,
	kernel-team

On Mon, 12 May 2025 09:35:16 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:

> Hi, Joshua,
> 
> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
> 
> [snip]
> 
> > @@ -3707,8 +3720,12 @@ static void wi_state_free(void)
> >  	kfree(&wi_group->wi_kobj);
> >  }
> >
> > +static struct kobj_attribute wi_auto_attr =
> > +	__ATTR(auto, 0664, weighted_interleave_auto_show,
> > +			   weighted_interleave_auto_store);
> > +
> >  static void wi_cleanup(void) {
> > -	sysfs_remove_file(&wi_group->wi_kobj, &wi_group->auto_kobj_attr.attr);
> > +	sysfs_remove_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
> 
> If we use wi_auto_attr directly here, we can remove auto_kobj_attr field
> from struct sysfs_wi_group?

Hi Ying, thank you for this comment. I should have caught it as well.
Removing the last users / setters doesn't seem complicated.

Andrew, I'm very sorry -- do you think you can fold this fixlet in as well?
This is a minor change that removes the only users of this variable.

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d5ae36d2eda8..8581cc861945 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -3543,7 +3543,6 @@ struct iw_node_attr {
 struct sysfs_wi_group {
        struct kobject wi_kobj;
        struct mutex kobj_lock;
-       struct kobj_attribute auto_kobj_attr;
        struct iw_node_attr *nattrs[];
 };

@@ -3833,7 +3832,6 @@ static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
        err = sysfs_create_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
        if (err)
                goto err_put_kobj;
-       wi_group->auto_kobj_attr = wi_auto_attr

        for_each_online_node(nid) {
                if (!node_state(nid, N_MEMORY))

Thank you both, I hope you have a great day!
Joshua


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-05-12 14:25         ` Joshua Hahn
@ 2025-05-12 22:29           ` Andrew Morton
  2025-05-13 14:01             ` Joshua Hahn
  2025-05-13  1:41           ` Huang, Ying
  1 sibling, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2025-05-12 22:29 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Huang, Ying, gourry, honggyu.kim, yunjeong.mun, gregkh, rafael,
	lenb, dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, osalvador, linux-kernel, linux-acpi, linux-mm,
	kernel-team

On Mon, 12 May 2025 07:25:10 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> Andrew, I'm very sorry -- do you think you can fold this fixlet in as well?

No probs.

In future, something a little more formal would be preferred, so I
don't have to write your changelog and forge your Signed-off-by:!



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-05-12 14:25         ` Joshua Hahn
  2025-05-12 22:29           ` Andrew Morton
@ 2025-05-13  1:41           ` Huang, Ying
  2025-05-13 13:59             ` Joshua Hahn
  1 sibling, 1 reply; 17+ messages in thread
From: Huang, Ying @ 2025-05-13  1:41 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, gourry, honggyu.kim, yunjeong.mun, gregkh, rafael,
	lenb, dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, osalvador, linux-kernel, linux-acpi, linux-mm,
	kernel-team

Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On Mon, 12 May 2025 09:35:16 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
>
>> Hi, Joshua,
>> 
>> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
>> 
>> [snip]
>> 
>> > @@ -3707,8 +3720,12 @@ static void wi_state_free(void)
>> >  	kfree(&wi_group->wi_kobj);
>> >  }
>> >
>> > +static struct kobj_attribute wi_auto_attr =
>> > +	__ATTR(auto, 0664, weighted_interleave_auto_show,
>> > +			   weighted_interleave_auto_store);
>> > +
>> >  static void wi_cleanup(void) {
>> > -	sysfs_remove_file(&wi_group->wi_kobj, &wi_group->auto_kobj_attr.attr);
>> > +	sysfs_remove_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
>> 
>> If we use wi_auto_attr directly here, we can remove auto_kobj_attr field
>> from struct sysfs_wi_group?
>
> Hi Ying, thank you for this comment. I should have caught it as well.
> Removing the last users / setters doesn't seem complicated.
>
> Andrew, I'm very sorry -- do you think you can fold this fixlet in as well?
> This is a minor change that removes the only users of this variable.
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index d5ae36d2eda8..8581cc861945 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -3543,7 +3543,6 @@ struct iw_node_attr {
>  struct sysfs_wi_group {
>         struct kobject wi_kobj;
>         struct mutex kobj_lock;
> -       struct kobj_attribute auto_kobj_attr;
>         struct iw_node_attr *nattrs[];
>  };
>
> @@ -3833,7 +3832,6 @@ static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
>         err = sysfs_create_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
>         if (err)
>                 goto err_put_kobj;
> -       wi_group->auto_kobj_attr = wi_auto_attr
>
>         for_each_online_node(nid) {
>                 if (!node_state(nid, N_MEMORY))
>
> Thank you both, I hope you have a great day!

Thanks!  LGTM, please feel free to add my

Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>

---
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-05-13  1:41           ` Huang, Ying
@ 2025-05-13 13:59             ` Joshua Hahn
  0 siblings, 0 replies; 17+ messages in thread
From: Joshua Hahn @ 2025-05-13 13:59 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, gourry, honggyu.kim, yunjeong.mun, gregkh, rafael,
	lenb, dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, osalvador, linux-kernel, linux-acpi, linux-mm,
	kernel-team

On Tue, 13 May 2025 09:41:10 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:

> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
> 
> > On Mon, 12 May 2025 09:35:16 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
> >
> >> Hi, Joshua,
> >> 
> >> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
> >> 
> >> [snip]
> >> 
> >> > @@ -3707,8 +3720,12 @@ static void wi_state_free(void)
> >> >  	kfree(&wi_group->wi_kobj);
> >> >  }
> >> >
> >> > +static struct kobj_attribute wi_auto_attr =
> >> > +	__ATTR(auto, 0664, weighted_interleave_auto_show,
> >> > +			   weighted_interleave_auto_store);
> >> > +
> >> >  static void wi_cleanup(void) {
> >> > -	sysfs_remove_file(&wi_group->wi_kobj, &wi_group->auto_kobj_attr.attr);
> >> > +	sysfs_remove_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
> >> 
> >> If we use wi_auto_attr directly here, we can remove auto_kobj_attr field
> >> from struct sysfs_wi_group?
> >
> > Hi Ying, thank you for this comment. I should have caught it as well.
> > Removing the last users / setters doesn't seem complicated.
> >
> > Andrew, I'm very sorry -- do you think you can fold this fixlet in as well?
> > This is a minor change that removes the only users of this variable.
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index d5ae36d2eda8..8581cc861945 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -3543,7 +3543,6 @@ struct iw_node_attr {
> >  struct sysfs_wi_group {
> >         struct kobject wi_kobj;
> >         struct mutex kobj_lock;
> > -       struct kobj_attribute auto_kobj_attr;
> >         struct iw_node_attr *nattrs[];
> >  };
> >
> > @@ -3833,7 +3832,6 @@ static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
> >         err = sysfs_create_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
> >         if (err)
> >                 goto err_put_kobj;
> > -       wi_group->auto_kobj_attr = wi_auto_attr
> >
> >         for_each_online_node(nid) {
> >                 if (!node_state(nid, N_MEMORY))
> >
> > Thank you both, I hope you have a great day!
> 
> Thanks!  LGTM, please feel free to add my
> 
> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>

Thank you Ying, really appreciate all of your feedback throughout this
series. I hope you have a great day!
Joshua

> 
> ---
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-05-12 22:29           ` Andrew Morton
@ 2025-05-13 14:01             ` Joshua Hahn
  0 siblings, 0 replies; 17+ messages in thread
From: Joshua Hahn @ 2025-05-13 14:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Huang, Ying, gourry, honggyu.kim, yunjeong.mun, gregkh, rafael,
	lenb, dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, osalvador, linux-kernel, linux-acpi, linux-mm,
	kernel-team

On Mon, 12 May 2025 15:29:44 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 12 May 2025 07:25:10 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> 
> > Andrew, I'm very sorry -- do you think you can fold this fixlet in as well?
> 
> No probs.
> 
> In future, something a little more formal would be preferred, so I
> don't have to write your changelog and forge your Signed-off-by:!

Hi Andrew,

Sorry about that! I'll keep that in mind in the future : -)
Thank you for being patient and helping us with this patch.
I hope you have a great day!

Joshua


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-05-12 14:14         ` Joshua Hahn
@ 2025-05-16  4:37           ` Honggyu Kim
  2025-05-16 14:43             ` Joshua Hahn
  0 siblings, 1 reply; 17+ messages in thread
From: Honggyu Kim @ 2025-05-16  4:37 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: kernel_team, Andrew Morton, Huang, Ying, gourry, yunjeong.mun,
	gregkh, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, osalvador, linux-kernel,
	linux-acpi, linux-mm, kernel-team

Hi Joshua,

On 5/12/2025 11:14 PM, Joshua Hahn wrote:
> Hi Honggyu, thank you for reviewing & testing my patch (again)!

[...snip...]

>> Hi Andrew,
>>
>> I'm not sure if Joshua is better to post v9, but if you want to fold and update,
>> then could you please add my tags as follows when you fold this change?
>>
>>     Reviewed-by: Honggyu Kim <honggyu.kim@sk.com>
>>     Tested-by: Honggyu Kim <honggyu.kim@sk.com>
>>
>> I added the same tags in v7 but not included in v8 somehow.
>> https://lore.kernel.org/linux-mm/5fdd7db9-96fb-49ea-9803-977158cb0132@sk.com
> 
> I must have missed including these tags. Sorry about the confusion --
> hopefully we can incorporate them into v8!

I found it wasn't again :)
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-new&id=c021f31a3b73d8e5ae2abdd658d837c44929cad7

Thanks,
Honggyu


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-05-16  4:37           ` Honggyu Kim
@ 2025-05-16 14:43             ` Joshua Hahn
  2025-05-17  8:26               ` Huang, Ying
  0 siblings, 1 reply; 17+ messages in thread
From: Joshua Hahn @ 2025-05-16 14:43 UTC (permalink / raw)
  To: Honggyu Kim
  Cc: kernel_team, Andrew Morton, Huang, Ying, gourry, yunjeong.mun,
	gregkh, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, osalvador, linux-kernel,
	linux-acpi, linux-mm, kernel-team

On Fri, 16 May 2025 13:37:39 +0900 Honggyu Kim <honggyu.kim@sk.com> wrote:

Hi Andrew,

Would it be too late at this point to add Honggyu's Reviewed-by and
Tested-by tags? He has helped test and review this series from the start,
but I must have missed adding his tags after rebasing from v7.

Sorry again for the noise, and thank you for your understanding! I'm not
sure what the procedure looks like for commit messages... but here is
a fixlet if that makes sense with you:

 Suggested-by: Yunjeong Mun <yunjeong.mun@sk.com>
 Suggested-by: Oscar Salvador <osalvador@suse.de>
 Suggested-by: Ying Huang <ying.huang@linux.alibaba.com>
 Suggested-by: Harry Yoo <harry.yoo@oracle.com>
+Tested-by: Honggyu Kim <honggyu.kim@sk.com>
+Reviewed-by: Honggyu Kim <honggyu.kim@sk.com>
 Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
 Co-developed-by: Gregory Price <gourry@gourry.net>
 Signed-off-by: Gregory Price <gourry@gourry.net>
 Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

And I will sign this off so that you do not have to forge my signature : -)
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

Thank you again for your help, I hope you have a great day!
Joshua

> Hi Joshua,
> 
> On 5/12/2025 11:14 PM, Joshua Hahn wrote:
> > Hi Honggyu, thank you for reviewing & testing my patch (again)!
> 
> [...snip...]
> 
> >> Hi Andrew,
> >>
> >> I'm not sure if Joshua is better to post v9, but if you want to fold and update,
> >> then could you please add my tags as follows when you fold this change?
> >>
> >>     Reviewed-by: Honggyu Kim <honggyu.kim@sk.com>
> >>     Tested-by: Honggyu Kim <honggyu.kim@sk.com>
> >>
> >> I added the same tags in v7 but not included in v8 somehow.
> >> https://lore.kernel.org/linux-mm/5fdd7db9-96fb-49ea-9803-977158cb0132@sk.com
> > 
> > I must have missed including these tags. Sorry about the confusion --
> > hopefully we can incorporate them into v8!
> 
> I found it wasn't again :)
> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-new&id=c021f31a3b73d8e5ae2abdd658d837c44929cad7
> 
> Thanks,
> Honggyu


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-05-16 14:43             ` Joshua Hahn
@ 2025-05-17  8:26               ` Huang, Ying
  0 siblings, 0 replies; 17+ messages in thread
From: Huang, Ying @ 2025-05-17  8:26 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Honggyu Kim, kernel_team, Andrew Morton, gourry, yunjeong.mun,
	gregkh, rafael, lenb, dan.j.williams, Jonathan.Cameron,
	dave.jiang, horen.chuang, hannes, osalvador, linux-kernel,
	linux-acpi, linux-mm, kernel-team

Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On Fri, 16 May 2025 13:37:39 +0900 Honggyu Kim <honggyu.kim@sk.com> wrote:
>
> Hi Andrew,
>
> Would it be too late at this point to add Honggyu's Reviewed-by and
> Tested-by tags? He has helped test and review this series from the start,
> but I must have missed adding his tags after rebasing from v7.
>
> Sorry again for the noise, and thank you for your understanding! I'm not
> sure what the procedure looks like for commit messages... but here is
> a fixlet if that makes sense with you:
>
>  Suggested-by: Yunjeong Mun <yunjeong.mun@sk.com>
>  Suggested-by: Oscar Salvador <osalvador@suse.de>
>  Suggested-by: Ying Huang <ying.huang@linux.alibaba.com>
>  Suggested-by: Harry Yoo <harry.yoo@oracle.com>
> +Tested-by: Honggyu Kim <honggyu.kim@sk.com>
> +Reviewed-by: Honggyu Kim <honggyu.kim@sk.com>
>  Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>  Co-developed-by: Gregory Price <gourry@gourry.net>
>  Signed-off-by: Gregory Price <gourry@gourry.net>
>  Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
>
> And I will sign this off so that you do not have to forge my signature : -)
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
>
> Thank you again for your help, I hope you have a great day!

Feel free to add my

Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>

too :-)

[snip]

---
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-05-05 18:23 [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning Joshua Hahn
  2025-05-10  5:25 ` Huang, Ying
@ 2025-05-19  1:56 ` Joshua Hahn
  2025-05-19 22:31   ` Andrew Morton
  1 sibling, 1 reply; 17+ messages in thread
From: Joshua Hahn @ 2025-05-19  1:56 UTC (permalink / raw)
  To: Joshua Hahn, Andrew Morton
  Cc: gourry, ying.huang, honggyu.kim, yunjeong.mun, gregkh, rafael,
	lenb, dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, osalvador, linux-kernel, linux-acpi, linux-mm,
	kernel-team

Hello Andrew,

Perhaps at this point it is best to just send a v9? : -)
The v9 will include: 2 code fixlets, as suggested by Ying Huang
And adding tags from both Honggyu and Ying.

I just thought it would be difficult to track 4 fixlets, but please let me
know whatever works best for you! Thank you always for your help. I hope you
have a great day!
Joshua

On Mon,  5 May 2025 11:23:28 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across nodes according to user-set ratios.
> 
> Ideally, these weights should be proportional to their bandwidth, so
> that under bandwidth pressure, each node uses its maximal efficient
> bandwidth and prevents latency from increasing exponentially.
> 
> Previously, weighted interleave's default weights were just 1s -- which
> would be equivalent to the (unweighted) interleave mempolicy, which goes
> through the nodes in a round-robin fashion, ignoring bandwidth information.
> 
> This patch has two main goals:
> First, it makes weighted interleave easier to use for users who wish to
> relieve bandwidth pressure when using nodes with varying bandwidth (CXL).
> By providing a set of "real" default weights that just work out of the
> box, users who might not have the capability (or wish to) perform
> experimentation to find the most optimal weights for their system can
> still take advantage of bandwidth-informed weighted interleave.
> 
> Second, it allows for weighted interleave to dynamically adjust to
> hotplugged memory with new bandwidth information. Instead of manually
> updating node weights every time new bandwidth information is reported
> or taken off, weighted interleave adjusts and provides a new set of
> default weights for weighted interleave to use when there is a change
> in bandwidth information.
> 
> To meet these goals, this patch introduces an auto-configuration mode
> for the interleave weights that provides a reasonable set of default
> weights, calculated using bandwidth data reported by the system. In auto
> mode, weights are dynamically adjusted based on whatever the current
> bandwidth information reports (and responds to hotplug events).
> 
> This patch still supports users manually writing weights into the nodeN
> sysfs interface by entering into manual mode. When a user enters manual
> mode, the system stops dynamically updating any of the node weights,
> even during hotplug events that shift the optimal weight distribution.
> 
> A new sysfs interface "auto" is introduced, which allows users to switch
> between the auto (writing 1 or Y) and manual (writing 0 or N) modes. The
> system also automatically enters manual mode when a nodeN interface is
> manually written to.
> 
> There is one functional change that this patch makes to the existing
> weighted_interleave ABI: previously, writing 0 directly to a nodeN
> interface was said to reset the weight to the system default. Before
> this patch, the default for all weights were 1, which meant that writing
> 0 and 1 were functionally equivalent. With this patch, writing 0 is invalid.
> 
> [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
> 
> Suggested-by: Yunjeong Mun <yunjeong.mun@sk.com>
> Suggested-by: Oscar Salvador <osalvador@suse.de>
> Suggested-by: Ying Huang <ying.huang@linux.alibaba.com>
> Suggested-by: Harry Yoo <harry.yoo@oracle.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Co-developed-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> ---


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning
  2025-05-19  1:56 ` Joshua Hahn
@ 2025-05-19 22:31   ` Andrew Morton
  0 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2025-05-19 22:31 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: gourry, ying.huang, honggyu.kim, yunjeong.mun, gregkh, rafael,
	lenb, dan.j.williams, Jonathan.Cameron, dave.jiang, horen.chuang,
	hannes, osalvador, linux-kernel, linux-acpi, linux-mm,
	kernel-team

On Sun, 18 May 2025 18:56:06 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> Perhaps at this point it is best to just send a v9? : -)

Sure.


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-05-19 22:31 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-05 18:23 [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning Joshua Hahn
2025-05-10  5:25 ` Huang, Ying
2025-05-10 18:51   ` Joshua Hahn
2025-05-11  2:58     ` Joshua Hahn
2025-05-11 12:56       ` Honggyu Kim
2025-05-12 14:14         ` Joshua Hahn
2025-05-16  4:37           ` Honggyu Kim
2025-05-16 14:43             ` Joshua Hahn
2025-05-17  8:26               ` Huang, Ying
2025-05-12  1:35       ` Huang, Ying
2025-05-12 14:25         ` Joshua Hahn
2025-05-12 22:29           ` Andrew Morton
2025-05-13 14:01             ` Joshua Hahn
2025-05-13  1:41           ` Huang, Ying
2025-05-13 13:59             ` Joshua Hahn
2025-05-19  1:56 ` Joshua Hahn
2025-05-19 22:31   ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).