public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed
From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: YeeLi <seven.yi.lee@gmail.com>
Cc: <akpm@linux-foundation.org>, <david@kernel.org>,
	<dan.j.williams@intel.com>, <ying.huang@linux.alibaba.com>,
	<linux-mm@kvack.org>, <joshua.hahnjy@gmail.com>,
	<linux-kernel@vger.kernel.org>, <linux-cxl@vger.kernel.org>,
	<dave.jiang@intel.com>
Subject: Re: [PATCH] mm/mempolicy: add sysfs interface to override NUMA node bandwidth
Date: Thu, 12 Mar 2026 11:58:25 +0000	[thread overview]
Message-ID: <20260312115825.000045af@huawei.com> (raw)
In-Reply-To: <20260312091207.2016518-1-seven.yi.lee@gmail.com>

On Thu, 12 Mar 2026 17:12:07 +0800
YeeLi <seven.yi.lee@gmail.com> wrote:

> From: yeeli <seven.yi.lee@gmail.com>
> 
> Automatic tuning for weighted interleaving [1] provides real benefits on
> systems with CXL support. However, platforms that lack HMAT or CDAT
> information cannot make use of this feature.
> 
> If the bandwidth reported by firmware or the device deviates from the
> actual measured bandwidth, administrators also lack a clear way to adjust
> the per-node weight values.
> 
> This patch introduces an optional Kconfig option,
> CONFIG_NUMA_BW_MANUAL_OVERRIDE (default n), which exposes node bandwidth
> R/W sysfs attributes under:
> 
>   /sys/kernel/mm/mempolicy/weighted_interleave/bw_nodeN
> 
> The sysfs files are created and removed dynamically on node hotplug
> events, in sync with the existing weighted_interleave/nodeN attributes.
> 
> Userspace can write a single bandwidth value (in MB/s) to override both
> read_bandwidth and write_bandwidth for the corresponding NUMA node. The
> value is then propagated to the internal node_bw_table via
> mempolicy_set_node_perf().
> 
> This interface is intended for debugging and experimentation only.
> 
> [1] Link:
> https://lkml.kernel.org/r/20250505182328.4148265-1-joshua.hahnjy@gmail.com
> 
> Signed-off-by: yeeli <seven.yi.lee@gmail.com>

As I note below, I'm not convinced this is the best layer to be injecting
data at if we want a debug interface that allows us to reflect real
hardware.  Might be the sort of patch that is useful outside the tree though
so thanks for sharing it!

> ---
>  mm/Kconfig     |  20 +++++++
>  mm/mempolicy.c | 148 +++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 168 insertions(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index bd0ea5454af8..40554df18edc 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1441,6 +1441,26 @@ config NUMA_EMU
>  	  into virtual nodes when booted with "numa=fake=N", where N is the
>  	  number of nodes. This is only useful for debugging.
>  
> +config NUMA_BW_MANUAL_OVERRIDE
> +	bool "Allow manual override of per-NUMA-node bandwidth for weighted interleave"
> +	depends on NUMA && SYSFS
> +	default n

Drop this. Everything is default n (by default!)

> +	help
> +	  This option exposes writable sysfs attributes under
> +	  /sys/kernel/mm/mempolicy/weighted_interleave/bw_nodeN, allowing
> +	  userspace to manually set read/write bandwidth values for each NUMA node.
> +
> +	  These values update the internal node_bw_table and can influence
> +	  weighted interleave auto-tuning (if enabled).
> +
> +	  WARNING: This is intended for debugging, development, or platforms
> +	  with incorrect HMAT/CDAT firmware data. Overriding hardware-reported
> +	  bandwidth can lead to suboptimal performance, instability, or
> +	  incorrect resource allocation decisions.

If it was an SRAT (for GPs) / HMAT issue I'd suggest table injection is
an adequate way to do testing and debug.

CDAT is trickier, but maybe we should be thinking about a debug path to inject
that as well - similar to what we do for ACPI tables.

From my point of view I'd rather see debug / testing interfaces that test the
whole flow rather than just this top layer control.

However, I can see this is a useful patch for developers. But if you
are developing debugging bandwidth-aware memory policies, I'm going to assume
you don't mind building a kernel and adding this patch to make your life
easier!

So to me, useful tool but I'm not 'yet' convinced it should be upstream.

> +
> +	  Say N unless you are actively developing or debugging bandwidth-aware
> +	  memory policies.
> +
>  config ARCH_HAS_USER_SHADOW_STACK
>  	bool
>  	help
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 68a98ba57882..0b7f42491748 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -226,6 +226,7 @@ int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
>  
>  	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
>  	new_bw = kcalloc(nr_node_ids, sizeof(unsigned int), GFP_KERNEL);
> +

Stray change.  Always check for these bnefore posting as they slow down
patch sets for no good reason.

>  	if (!new_bw)
>  		return -ENOMEM;
>  
> @@ -3614,6 +3615,9 @@ struct iw_node_attr {
>  struct sysfs_wi_group {
>  	struct kobject wi_kobj;
>  	struct mutex kobj_lock;
> +#ifdef CONFIG_NUMA_BW_MANUAL_OVERRIDE
> +	struct iw_node_attr *bw_attrs[MAX_NUMNODES];

Ok. It's debug so I guess this is ok rather than what is
done for nattrs with nr_node_ids.
Alternative would be a pointer and a separate allocation.


> +#endif
>  	struct iw_node_attr *nattrs[];
>  };
>  
> @@ -3855,6 +3859,128 @@ static int sysfs_wi_node_add(int nid)
>  	return ret;
>  }
>  
> +#ifdef CONFIG_NUMA_BW_MANUAL_OVERRIDE
> +static ssize_t bw_node_show(struct kobject *kobj,
> +			    struct kobj_attribute *attr,
> +			    char *buf)
Can fit on fewer lines without going past 80 chars.

> +{
> +	struct iw_node_attr *node_attr;
> +
> +	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
> +
> +	/*A Node without CDAT or HMAT*/
	/* A node without either CDAT or HMAT */


> +	if (!node_bw_table)
> +		return sprintf(buf, "N/A\n");
> +
> +	if (!node_bw_table[node_attr->nid])
> +		return sprintf(buf, "0\n");
> +
> +	return sprintf(buf, "%u(MB/s)\n", node_bw_table[node_attr->nid]);
> +}
> +
> +static ssize_t bw_node_store(struct kobject *kobj,
> +			     struct kobj_attribute *attr,
> +			     const char *buf, size_t count)
> +{
> +	struct iw_node_attr *node_attr;
> +	unsigned long val = 0;
> +	int ret;
> +	struct access_coordinate coords = {
> +		.read_bandwidth = 0,
> +		.write_bandwidth = 0,
> +	};
> +
> +	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);

Can do that at declaration of node_Attr above.

> +
> +	ret = kstrtoul(buf, 0, &val);

Check ret before filling in values.

> +
> +	coords.read_bandwidth = val;
> +	coords.write_bandwidth = val;
> +
> +	if (ret)
> +		return ret;
> +
> +	if (val > UINT_MAX)
> +		return -EINVAL;
> +
> +	ret = mempolicy_set_node_perf(node_attr->nid, &coords);
> +	if (ret)
> +		return ret;
> +
> +	return count;
> +}
> +
> +static int sysfs_bw_node_add(int nid)
> +{
> +	int ret;
> +	char *name;
> +	struct iw_node_attr *new_attr;
> +
> +	if (nid < 0 || nid >= nr_node_ids) {
> +		pr_err("invalid node id: %d\n", nid);
> +		return -EINVAL;
> +	}
> +
> +	new_attr = kzalloc(sizeof(*new_attr), GFP_KERNEL);
> +	if (!new_attr)
> +		return -ENOMEM;
> +
> +	name = kasprintf(GFP_KERNEL, "bw_node%d", nid);
> +	if (!name) {
> +		kfree(new_attr);
> +		return -ENOMEM;
> +	}
> +
> +	sysfs_attr_init(&new_attr->kobj_attr.attr);
> +	new_attr->kobj_attr.attr.name = name;
> +	new_attr->kobj_attr.attr.mode = 0644;
> +	new_attr->kobj_attr.show = bw_node_show;
> +	new_attr->kobj_attr.store = bw_node_store;
> +	new_attr->nid = nid;
> +
> +	mutex_lock(&wi_group->kobj_lock);
> +	if (wi_group->bw_attrs[nid]) {

Add another label and put the mutex unlock in the error path.

> +		mutex_unlock(&wi_group->kobj_lock);
> +		ret = -EEXIST;
> +		goto out;
> +	}
> +
> +	ret = sysfs_create_file(&wi_group->wi_kobj, &new_attr->kobj_attr.attr);
> +
No blank line here.

> +	if (ret) {
> +		mutex_unlock(&wi_group->kobj_lock);
> +		goto out;
> +	}
> +	wi_group->bw_attrs[nid] = new_attr;
> +	mutex_unlock(&wi_group->kobj_lock);
> +	return 0;
> +
> +out:
> +	kfree(new_attr->kobj_attr.attr.name);
> +	kfree(new_attr);
> +	return ret;
> +}
> +
> +static void sysfs_bw_node_delete(int nid)
> +{
> +	struct iw_node_attr *attr;
> +
> +	if (nid < 0 || nid >= nr_node_ids)
> +		return;
> +
> +	mutex_lock(&wi_group->kobj_lock);
> +	attr = wi_group->bw_attrs[nid];
> +
> +	if (attr) {
> +		sysfs_remove_file(&wi_group->wi_kobj, &attr->kobj_attr.attr);
> +		kfree(attr->kobj_attr.attr.name);
> +		kfree(attr);
> +		wi_group->nattrs[nid] = NULL;
> +	}
> +	mutex_unlock(&wi_group->kobj_lock);
> +}
> +#endif
> +
>  static int wi_node_notifier(struct notifier_block *nb,
>  			       unsigned long action, void *data)
>  {
> @@ -3868,9 +3994,22 @@ static int wi_node_notifier(struct notifier_block *nb,
>  		if (err)
>  			pr_err("failed to add sysfs for node%d during hotplug: %d\n",
>  			       nid, err);
> +
> +#ifdef CONFIG_NUMA_BW_MANUAL_OVERRIDE
> +		err = sysfs_bw_node_add(nid);
> +		if (err)
> +			pr_err("failed to add sysfs bw_node%d: %d\n",
> +			       nid, err);
> +#endif
>  		break;
> +
>  	case NODE_REMOVED_LAST_MEMORY:
>  		sysfs_wi_node_delete(nid);
> +
> +#ifdef CONFIG_NUMA_BW_MANUAL_OVERRIDE

Probably better to stub the function than to scatter ifdefs throughout.

> +		sysfs_bw_node_delete(nid);
> +#endif
> +
>  		break;
>  	}
>  
> @@ -3906,6 +4045,15 @@ static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
>  			       nid, err);
>  			goto err_cleanup_kobj;
>  		}
> +
> +#ifdef CONFIG_NUMA_BW_MANUAL_OVERRIDE
> +		err = sysfs_bw_node_add(nid);
> +		if (err) {
> +			pr_err("failed to add sysfs bw_node%d during init: %d\n", nid, err);
> +			goto err_cleanup_kobj;
> +		}
> +#endif
> +
>  	}
>  
>  	hotplug_node_notifier(wi_node_notifier, DEFAULT_CALLBACK_PRI);



  parent reply	other threads:[~2026-03-12 11:58 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-12  9:12 [PATCH] mm/mempolicy: add sysfs interface to override NUMA node bandwidth YeeLi
2026-03-12  9:42 ` Huang, Ying
2026-03-12 10:26   ` Yee Li
2026-03-12 11:58 ` Jonathan Cameron [this message]
2026-03-13  3:05   ` Yee Li
2026-03-12 15:00 ` Joshua Hahn
2026-03-12 15:05   ` Joshua Hahn
2026-03-13  3:39   ` Yee Li
2026-03-12 16:12 ` Gregory Price
2026-03-13  3:51   ` Yee Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260312115825.000045af@huawei.com \
    --to=jonathan.cameron@huawei.com \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=david@kernel.org \
    --cc=joshua.hahnjy@gmail.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=seven.yi.lee@gmail.com \
    --cc=ying.huang@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox