From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: YeeLi <seven.yi.lee@gmail.com>
Cc: <akpm@linux-foundation.org>, <david@kernel.org>,
<dan.j.williams@intel.com>, <ying.huang@linux.alibaba.com>,
<linux-mm@kvack.org>, <joshua.hahnjy@gmail.com>,
<linux-kernel@vger.kernel.org>, <linux-cxl@vger.kernel.org>,
<dave.jiang@intel.com>
Subject: Re: [PATCH] mm/mempolicy: add sysfs interface to override NUMA node bandwidth
Date: Thu, 12 Mar 2026 11:58:25 +0000 [thread overview]
Message-ID: <20260312115825.000045af@huawei.com> (raw)
In-Reply-To: <20260312091207.2016518-1-seven.yi.lee@gmail.com>
On Thu, 12 Mar 2026 17:12:07 +0800
YeeLi <seven.yi.lee@gmail.com> wrote:
> From: yeeli <seven.yi.lee@gmail.com>
>
> Automatic tuning for weighted interleaving [1] provides real benefits on
> systems with CXL support. However, platforms that lack HMAT or CDAT
> information cannot make use of this feature.
>
> If the bandwidth reported by firmware or the device deviates from the
> actual measured bandwidth, administrators also lack a clear way to adjust
> the per-node weight values.
>
> This patch introduces an optional Kconfig option,
> CONFIG_NUMA_BW_MANUAL_OVERRIDE (default n), which exposes node bandwidth
> R/W sysfs attributes under:
>
> /sys/kernel/mm/mempolicy/weighted_interleave/bw_nodeN
>
> The sysfs files are created and removed dynamically on node hotplug
> events, in sync with the existing weighted_interleave/nodeN attributes.
>
> Userspace can write a single bandwidth value (in MB/s) to override both
> read_bandwidth and write_bandwidth for the corresponding NUMA node. The
> value is then propagated to the internal node_bw_table via
> mempolicy_set_node_perf().
>
> This interface is intended for debugging and experimentation only.
>
> [1] Link:
> https://lkml.kernel.org/r/20250505182328.4148265-1-joshua.hahnjy@gmail.com
>
> Signed-off-by: yeeli <seven.yi.lee@gmail.com>
As I note below, I'm not convinced this is the best layer to be injecting
data at if we want a debug interface that allows us to reflect real
hardware. Might be the sort of patch that is useful outside the tree though
so thanks for sharing it!
> ---
> mm/Kconfig | 20 +++++++
> mm/mempolicy.c | 148 +++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 168 insertions(+)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index bd0ea5454af8..40554df18edc 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1441,6 +1441,26 @@ config NUMA_EMU
> into virtual nodes when booted with "numa=fake=N", where N is the
> number of nodes. This is only useful for debugging.
>
> +config NUMA_BW_MANUAL_OVERRIDE
> + bool "Allow manual override of per-NUMA-node bandwidth for weighted interleave"
> + depends on NUMA && SYSFS
> + default n
Drop this. Everything is default n (by default!)
> + help
> + This option exposes writable sysfs attributes under
> + /sys/kernel/mm/mempolicy/weighted_interleave/bw_nodeN, allowing
> + userspace to manually set read/write bandwidth values for each NUMA node.
> +
> + These values update the internal node_bw_table and can influence
> + weighted interleave auto-tuning (if enabled).
> +
> + WARNING: This is intended for debugging, development, or platforms
> + with incorrect HMAT/CDAT firmware data. Overriding hardware-reported
> + bandwidth can lead to suboptimal performance, instability, or
> + incorrect resource allocation decisions.
If it was an SRAT (for GPs) / HMAT issue I'd suggest table injection is
an adequate way to do testing and debug.
CDAT is trickier, but maybe we should be thinking about a debug path to inject
that as well - similar to what we do for ACPI tables.
From my point of view I'd rather see debug / testing interfaces that test the
whole flow rather than just this top layer control.
However, I can see this is a useful patch for developers. But if you
are developing debugging bandwidth-aware memory policies, I'm going to assume
you don't mind building a kernel and adding this patch to make your life
easier!
So to me, useful tool but I'm not 'yet' convinced it should be upstream.
> +
> + Say N unless you are actively developing or debugging bandwidth-aware
> + memory policies.
> +
> config ARCH_HAS_USER_SHADOW_STACK
> bool
> help
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 68a98ba57882..0b7f42491748 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -226,6 +226,7 @@ int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
>
> bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
> new_bw = kcalloc(nr_node_ids, sizeof(unsigned int), GFP_KERNEL);
> +
Stray change. Always check for these bnefore posting as they slow down
patch sets for no good reason.
> if (!new_bw)
> return -ENOMEM;
>
> @@ -3614,6 +3615,9 @@ struct iw_node_attr {
> struct sysfs_wi_group {
> struct kobject wi_kobj;
> struct mutex kobj_lock;
> +#ifdef CONFIG_NUMA_BW_MANUAL_OVERRIDE
> + struct iw_node_attr *bw_attrs[MAX_NUMNODES];
Ok. It's debug so I guess this is ok rather than what is
done for nattrs with nr_node_ids.
Alternative would be a pointer and a separate allocation.
> +#endif
> struct iw_node_attr *nattrs[];
> };
>
> @@ -3855,6 +3859,128 @@ static int sysfs_wi_node_add(int nid)
> return ret;
> }
>
> +#ifdef CONFIG_NUMA_BW_MANUAL_OVERRIDE
> +static ssize_t bw_node_show(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + char *buf)
Can fit on fewer lines without going past 80 chars.
> +{
> + struct iw_node_attr *node_attr;
> +
> + node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
> +
> + /*A Node without CDAT or HMAT*/
/* A node without either CDAT or HMAT */
> + if (!node_bw_table)
> + return sprintf(buf, "N/A\n");
> +
> + if (!node_bw_table[node_attr->nid])
> + return sprintf(buf, "0\n");
> +
> + return sprintf(buf, "%u(MB/s)\n", node_bw_table[node_attr->nid]);
> +}
> +
> +static ssize_t bw_node_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + struct iw_node_attr *node_attr;
> + unsigned long val = 0;
> + int ret;
> + struct access_coordinate coords = {
> + .read_bandwidth = 0,
> + .write_bandwidth = 0,
> + };
> +
> + node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
Can do that at declaration of node_Attr above.
> +
> + ret = kstrtoul(buf, 0, &val);
Check ret before filling in values.
> +
> + coords.read_bandwidth = val;
> + coords.write_bandwidth = val;
> +
> + if (ret)
> + return ret;
> +
> + if (val > UINT_MAX)
> + return -EINVAL;
> +
> + ret = mempolicy_set_node_perf(node_attr->nid, &coords);
> + if (ret)
> + return ret;
> +
> + return count;
> +}
> +
> +static int sysfs_bw_node_add(int nid)
> +{
> + int ret;
> + char *name;
> + struct iw_node_attr *new_attr;
> +
> + if (nid < 0 || nid >= nr_node_ids) {
> + pr_err("invalid node id: %d\n", nid);
> + return -EINVAL;
> + }
> +
> + new_attr = kzalloc(sizeof(*new_attr), GFP_KERNEL);
> + if (!new_attr)
> + return -ENOMEM;
> +
> + name = kasprintf(GFP_KERNEL, "bw_node%d", nid);
> + if (!name) {
> + kfree(new_attr);
> + return -ENOMEM;
> + }
> +
> + sysfs_attr_init(&new_attr->kobj_attr.attr);
> + new_attr->kobj_attr.attr.name = name;
> + new_attr->kobj_attr.attr.mode = 0644;
> + new_attr->kobj_attr.show = bw_node_show;
> + new_attr->kobj_attr.store = bw_node_store;
> + new_attr->nid = nid;
> +
> + mutex_lock(&wi_group->kobj_lock);
> + if (wi_group->bw_attrs[nid]) {
Add another label and put the mutex unlock in the error path.
> + mutex_unlock(&wi_group->kobj_lock);
> + ret = -EEXIST;
> + goto out;
> + }
> +
> + ret = sysfs_create_file(&wi_group->wi_kobj, &new_attr->kobj_attr.attr);
> +
No blank line here.
> + if (ret) {
> + mutex_unlock(&wi_group->kobj_lock);
> + goto out;
> + }
> + wi_group->bw_attrs[nid] = new_attr;
> + mutex_unlock(&wi_group->kobj_lock);
> + return 0;
> +
> +out:
> + kfree(new_attr->kobj_attr.attr.name);
> + kfree(new_attr);
> + return ret;
> +}
> +
> +static void sysfs_bw_node_delete(int nid)
> +{
> + struct iw_node_attr *attr;
> +
> + if (nid < 0 || nid >= nr_node_ids)
> + return;
> +
> + mutex_lock(&wi_group->kobj_lock);
> + attr = wi_group->bw_attrs[nid];
> +
> + if (attr) {
> + sysfs_remove_file(&wi_group->wi_kobj, &attr->kobj_attr.attr);
> + kfree(attr->kobj_attr.attr.name);
> + kfree(attr);
> + wi_group->nattrs[nid] = NULL;
> + }
> + mutex_unlock(&wi_group->kobj_lock);
> +}
> +#endif
> +
> static int wi_node_notifier(struct notifier_block *nb,
> unsigned long action, void *data)
> {
> @@ -3868,9 +3994,22 @@ static int wi_node_notifier(struct notifier_block *nb,
> if (err)
> pr_err("failed to add sysfs for node%d during hotplug: %d\n",
> nid, err);
> +
> +#ifdef CONFIG_NUMA_BW_MANUAL_OVERRIDE
> + err = sysfs_bw_node_add(nid);
> + if (err)
> + pr_err("failed to add sysfs bw_node%d: %d\n",
> + nid, err);
> +#endif
> break;
> +
> case NODE_REMOVED_LAST_MEMORY:
> sysfs_wi_node_delete(nid);
> +
> +#ifdef CONFIG_NUMA_BW_MANUAL_OVERRIDE
Probably better to stub the function than to scatter ifdefs throughout.
> + sysfs_bw_node_delete(nid);
> +#endif
> +
> break;
> }
>
> @@ -3906,6 +4045,15 @@ static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
> nid, err);
> goto err_cleanup_kobj;
> }
> +
> +#ifdef CONFIG_NUMA_BW_MANUAL_OVERRIDE
> + err = sysfs_bw_node_add(nid);
> + if (err) {
> + pr_err("failed to add sysfs bw_node%d during init: %d\n", nid, err);
> + goto err_cleanup_kobj;
> + }
> +#endif
> +
> }
>
> hotplug_node_notifier(wi_node_notifier, DEFAULT_CALLBACK_PRI);
next prev parent reply other threads:[~2026-03-12 11:58 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-12 9:12 [PATCH] mm/mempolicy: add sysfs interface to override NUMA node bandwidth YeeLi
2026-03-12 9:42 ` Huang, Ying
2026-03-12 10:26 ` Yee Li
2026-03-12 11:58 ` Jonathan Cameron [this message]
2026-03-13 3:05 ` Yee Li
2026-03-12 15:00 ` Joshua Hahn
2026-03-12 15:05 ` Joshua Hahn
2026-03-13 3:39 ` Yee Li
2026-03-12 16:12 ` Gregory Price
2026-03-13 3:51 ` Yee Li
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260312115825.000045af@huawei.com \
--to=jonathan.cameron@huawei.com \
--cc=akpm@linux-foundation.org \
--cc=dan.j.williams@intel.com \
--cc=dave.jiang@intel.com \
--cc=david@kernel.org \
--cc=joshua.hahnjy@gmail.com \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=seven.yi.lee@gmail.com \
--cc=ying.huang@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox