* [PATCH net-next V5 05/12] devlink: Add dump support for device-level resources
From: Tariq Toukan @ 2026-04-07 19:41 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Matthieu Baerts (NGI0), Chuck Lever,
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Daniel Zahka, Shahar Shitrit, Cosmin Ratiu, Jacob Keller,
Parav Pandit, Adithya Jayachandran, Shay Drori, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman, Jiri Pirko
In-Reply-To: <20260407194107.148063-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
Add a dumpit handler for the resource-dump command that iterates over
all devlink devices and shows their resources.
$ devlink resource show
pci/0000:08:00.0:
name local_max_SFs size 508 unit entry
name external_max_SFs size 508 unit entry
pci/0000:08:00.1:
name local_max_SFs size 508 unit entry
name external_max_SFs size 508 unit entry
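For readers following the resume logic, here is a minimal user-space sketch of how devlink_resource_list_fill() appears to carry its position across dump callbacks. It is an illustration only and not part of the patch: struct msg_buf and put_entry() are hypothetical stand-ins for the skb and devlink_resource_put(), and -1 stands in for -EMSGSIZE.

```c
#include <stddef.h>

/* Hypothetical stand-in for one netlink message with limited room. */
struct msg_buf {
	size_t used;
	size_t capacity;
};

/* Returns 0 on success, -1 when the entry does not fit (like -EMSGSIZE). */
static int put_entry(struct msg_buf *msg)
{
	if (msg->used >= msg->capacity)
		return -1;
	msg->used++;
	return 0;
}

/*
 * Emit entries [*idx, count) into msg. On overflow, store the index of the
 * first entry that did not fit so the next call resumes there. On success,
 * reset *idx to 0 so the next object starts from its first entry.
 */
static int list_fill(struct msg_buf *msg, size_t count, int *idx)
{
	int i;

	for (i = 0; (size_t)i < count; i++) {
		if (i < *idx)
			continue;	/* already sent in an earlier message */
		if (put_entry(msg)) {
			*idx = i;
			return -1;
		}
	}
	*idx = 0;
	return 0;
}
```

A caller would invoke list_fill() once per message with the same idx variable, retrying with a fresh buffer each time it returns -1.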
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
Documentation/netlink/specs/devlink.yaml | 6 +-
net/devlink/netlink_gen.c | 20 +++++-
net/devlink/netlink_gen.h | 4 +-
net/devlink/resource.c | 77 ++++++++++++++++++++++++
4 files changed, 102 insertions(+), 5 deletions(-)
diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml
index b495d56b9137..c423e049c7bd 100644
--- a/Documentation/netlink/specs/devlink.yaml
+++ b/Documentation/netlink/specs/devlink.yaml
@@ -1764,13 +1764,17 @@ operations:
- bus-name
- dev-name
- index
- reply:
+ reply: &resource-dump-reply
value: 36
attributes:
- bus-name
- dev-name
- index
- resource-list
+ dump:
+ request:
+ attributes: *dev-id-attrs
+ reply: *resource-dump-reply
-
name: reload
diff --git a/net/devlink/netlink_gen.c b/net/devlink/netlink_gen.c
index eb35e80e01d1..a5a47a4c6de8 100644
--- a/net/devlink/netlink_gen.c
+++ b/net/devlink/netlink_gen.c
@@ -305,7 +305,14 @@ static const struct nla_policy devlink_resource_set_nl_policy[DEVLINK_ATTR_INDEX
};
/* DEVLINK_CMD_RESOURCE_DUMP - do */
-static const struct nla_policy devlink_resource_dump_nl_policy[DEVLINK_ATTR_INDEX + 1] = {
+static const struct nla_policy devlink_resource_dump_do_nl_policy[DEVLINK_ATTR_INDEX + 1] = {
+ [DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
+ [DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
+ [DEVLINK_ATTR_INDEX] = NLA_POLICY_FULL_RANGE(NLA_UINT, &devlink_attr_index_range),
+};
+
+/* DEVLINK_CMD_RESOURCE_DUMP - dump */
+static const struct nla_policy devlink_resource_dump_dump_nl_policy[DEVLINK_ATTR_INDEX + 1] = {
[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_INDEX] = NLA_POLICY_FULL_RANGE(NLA_UINT, &devlink_attr_index_range),
@@ -680,7 +687,7 @@ static const struct nla_policy devlink_notify_filter_set_nl_policy[DEVLINK_ATTR_
};
/* Ops table for devlink */
-const struct genl_split_ops devlink_nl_ops[74] = {
+const struct genl_split_ops devlink_nl_ops[75] = {
{
.cmd = DEVLINK_CMD_GET,
.validate = GENL_DONT_VALIDATE_STRICT,
@@ -958,10 +965,17 @@ const struct genl_split_ops devlink_nl_ops[74] = {
.pre_doit = devlink_nl_pre_doit,
.doit = devlink_nl_resource_dump_doit,
.post_doit = devlink_nl_post_doit,
- .policy = devlink_resource_dump_nl_policy,
+ .policy = devlink_resource_dump_do_nl_policy,
.maxattr = DEVLINK_ATTR_INDEX,
.flags = GENL_CMD_CAP_DO,
},
+ {
+ .cmd = DEVLINK_CMD_RESOURCE_DUMP,
+ .dumpit = devlink_nl_resource_dump_dumpit,
+ .policy = devlink_resource_dump_dump_nl_policy,
+ .maxattr = DEVLINK_ATTR_INDEX,
+ .flags = GENL_CMD_CAP_DUMP,
+ },
{
.cmd = DEVLINK_CMD_RELOAD,
.validate = GENL_DONT_VALIDATE_STRICT,
diff --git a/net/devlink/netlink_gen.h b/net/devlink/netlink_gen.h
index 2817d53a0eba..d79f6a0888f6 100644
--- a/net/devlink/netlink_gen.h
+++ b/net/devlink/netlink_gen.h
@@ -18,7 +18,7 @@ extern const struct nla_policy devlink_dl_rate_tc_bws_nl_policy[DEVLINK_RATE_TC_
extern const struct nla_policy devlink_dl_selftest_id_nl_policy[DEVLINK_ATTR_SELFTEST_ID_FLASH + 1];
/* Ops table for devlink */
-extern const struct genl_split_ops devlink_nl_ops[74];
+extern const struct genl_split_ops devlink_nl_ops[75];
int devlink_nl_pre_doit(const struct genl_split_ops *ops, struct sk_buff *skb,
struct genl_info *info);
@@ -80,6 +80,8 @@ int devlink_nl_dpipe_table_counters_set_doit(struct sk_buff *skb,
struct genl_info *info);
int devlink_nl_resource_set_doit(struct sk_buff *skb, struct genl_info *info);
int devlink_nl_resource_dump_doit(struct sk_buff *skb, struct genl_info *info);
+int devlink_nl_resource_dump_dumpit(struct sk_buff *skb,
+ struct netlink_callback *cb);
int devlink_nl_reload_doit(struct sk_buff *skb, struct genl_info *info);
int devlink_nl_param_get_doit(struct sk_buff *skb, struct genl_info *info);
int devlink_nl_param_get_dumpit(struct sk_buff *skb,
diff --git a/net/devlink/resource.c b/net/devlink/resource.c
index f3014ec425c4..02fb36e25c52 100644
--- a/net/devlink/resource.c
+++ b/net/devlink/resource.c
@@ -223,6 +223,31 @@ static int devlink_resource_put(struct devlink *devlink, struct sk_buff *skb,
return -EMSGSIZE;
}
+static int devlink_resource_list_fill(struct sk_buff *skb,
+ struct devlink *devlink,
+ struct list_head *resource_list_head,
+ int *idx)
+{
+ struct devlink_resource *resource;
+ int i = 0;
+ int err;
+
+ list_for_each_entry(resource, resource_list_head, list) {
+ if (i < *idx) {
+ i++;
+ continue;
+ }
+ err = devlink_resource_put(devlink, skb, resource);
+ if (err) {
+ *idx = i;
+ return err;
+ }
+ i++;
+ }
+ *idx = 0;
+ return 0;
+}
+
static int devlink_resource_fill(struct genl_info *info,
enum devlink_command cmd, int flags)
{
@@ -302,6 +327,58 @@ int devlink_nl_resource_dump_doit(struct sk_buff *skb, struct genl_info *info)
return devlink_resource_fill(info, DEVLINK_CMD_RESOURCE_DUMP, 0);
}
+static int
+devlink_nl_resource_dump_one(struct sk_buff *skb, struct devlink *devlink,
+ struct netlink_callback *cb, int flags)
+{
+ struct devlink_nl_dump_state *state = devlink_dump_state(cb);
+ struct nlattr *resources_attr;
+ int start_idx = state->idx;
+ void *hdr;
+ int err;
+
+ if (list_empty(&devlink->resource_list))
+ return 0;
+
+ err = -EMSGSIZE;
+ hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
+ &devlink_nl_family, flags, DEVLINK_CMD_RESOURCE_DUMP);
+ if (!hdr)
+ return err;
+
+ if (devlink_nl_put_handle(skb, devlink))
+ goto nla_put_failure;
+
+ resources_attr = nla_nest_start_noflag(skb, DEVLINK_ATTR_RESOURCE_LIST);
+ if (!resources_attr)
+ goto nla_put_failure;
+
+ err = devlink_resource_list_fill(skb, devlink,
+ &devlink->resource_list, &state->idx);
+ if (err) {
+ if (state->idx == start_idx)
+ goto resource_list_cancel;
+ nla_nest_end(skb, resources_attr);
+ genlmsg_end(skb, hdr);
+ return err;
+ }
+ nla_nest_end(skb, resources_attr);
+ genlmsg_end(skb, hdr);
+ return 0;
+
+resource_list_cancel:
+ nla_nest_cancel(skb, resources_attr);
+nla_put_failure:
+ genlmsg_cancel(skb, hdr);
+ return err;
+}
+
+int devlink_nl_resource_dump_dumpit(struct sk_buff *skb,
+ struct netlink_callback *cb)
+{
+ return devlink_nl_dumpit(skb, cb, devlink_nl_resource_dump_one);
+}
+
int devlink_resources_validate(struct devlink *devlink,
struct devlink_resource *resource,
struct genl_info *info)
--
2.44.0
* [PATCH net-next V5 06/12] devlink: Include port resources in resource dump dumpit
From: Tariq Toukan @ 2026-04-07 19:41 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Matthieu Baerts (NGI0), Chuck Lever,
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Daniel Zahka, Shahar Shitrit, Cosmin Ratiu, Jacob Keller,
Parav Pandit, Adithya Jayachandran, Shay Drori, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman
In-Reply-To: <20260407194107.148063-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
Include port-level resources in the resource-dump dumpit handler. The
reply now contains both the device-level resources and the resources of
every port.
For example:
$ devlink resource show
pci/0000:03:00.0:
name local_max_SFs size 508 unit entry
name external_max_SFs size 508 unit entry
pci/0000:03:00.0/196608:
name max_SFs size 20 unit entry
pci/0000:03:00.1:
name local_max_SFs size 508 unit entry
name external_max_SFs size 508 unit entry
pci/0000:03:00.1/262144:
name max_SFs size 20 unit entry
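The two-level walk (device-level list first, then each port) can be sketched in user space as below. This is a simplified model under stated assumptions, not the patch itself: ports live in a plain array instead of an xarray, the check for a port removed between callbacks is omitted, message headers are not modelled, and struct device, fill() and dump_one() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PORTS 4

struct dump_state {
	int idx;		/* next resource within the current object */
	size_t port_index;	/* port to resume from */
	bool port_index_valid;	/* device-level part already emitted */
};

struct device {
	size_t dev_resources;			/* device-level entry count */
	size_t nports;
	size_t port_resources[MAX_PORTS];	/* per-port entry count */
};

/* Emit entries [*idx, count) while room remains; returns -1 when full. */
static int fill(size_t *room, size_t count, int *idx)
{
	size_t i;

	for (i = (size_t)*idx; i < count; i++) {
		if (*room == 0) {
			*idx = (int)i;	/* resume here on the next call */
			return -1;
		}
		(*room)--;
	}
	*idx = 0;
	return 0;
}

/* One dump callback for one device: device-level list, then every port. */
static int dump_one(const struct device *dev, struct dump_state *st,
		    size_t *room)
{
	size_t p;

	if (!st->port_index_valid) {
		if (fill(room, dev->dev_resources, &st->idx))
			return -1;
	}
	st->port_index_valid = true;
	for (p = st->port_index; p < dev->nports; p++) {
		if (fill(room, dev->port_resources[p], &st->idx)) {
			st->port_index = p;
			return -1;
		}
	}
	/* Device fully dumped: reset so the next device starts clean. */
	st->port_index_valid = false;
	st->port_index = 0;
	return 0;
}
```

The point of port_index_valid in this model is that a callback interrupted inside a port does not emit the device-level list a second time when it resumes.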
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
net/devlink/devl_internal.h | 5 ++++
net/devlink/netlink.c | 2 ++
net/devlink/resource.c | 59 ++++++++++++++++++++++++++++++++-----
3 files changed, 58 insertions(+), 8 deletions(-)
diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
index 7dfb7cdd2d23..e4e48ee2da5a 100644
--- a/net/devlink/devl_internal.h
+++ b/net/devlink/devl_internal.h
@@ -164,6 +164,11 @@ struct devlink_nl_dump_state {
struct {
u64 dump_ts;
};
+ /* DEVLINK_CMD_RESOURCE_DUMP */
+ struct {
+ u32 index;
+ bool index_valid;
+ } port_ctx;
};
};
diff --git a/net/devlink/netlink.c b/net/devlink/netlink.c
index 32ddbe244cb7..ae4afc739678 100644
--- a/net/devlink/netlink.c
+++ b/net/devlink/netlink.c
@@ -370,6 +370,8 @@ static int devlink_nl_inst_iter_dumpit(struct sk_buff *msg,
/* restart sub-object walk for the next instance */
state->idx = 0;
+ state->port_ctx.index = 0;
+ state->port_ctx.index_valid = false;
}
if (err != -EMSGSIZE)
diff --git a/net/devlink/resource.c b/net/devlink/resource.c
index 02fb36e25c52..7984eda63eb6 100644
--- a/net/devlink/resource.c
+++ b/net/devlink/resource.c
@@ -328,16 +328,20 @@ int devlink_nl_resource_dump_doit(struct sk_buff *skb, struct genl_info *info)
}
static int
-devlink_nl_resource_dump_one(struct sk_buff *skb, struct devlink *devlink,
- struct netlink_callback *cb, int flags)
+devlink_resource_dump_fill_one(struct sk_buff *skb, struct devlink *devlink,
+ struct devlink_port *devlink_port,
+ struct netlink_callback *cb, int flags, int *idx)
{
- struct devlink_nl_dump_state *state = devlink_dump_state(cb);
+ struct list_head *resource_list;
struct nlattr *resources_attr;
- int start_idx = state->idx;
+ int start_idx = *idx;
void *hdr;
int err;
- if (list_empty(&devlink->resource_list))
+ resource_list = devlink_port ?
+ &devlink_port->resource_list : &devlink->resource_list;
+
+ if (list_empty(resource_list))
return 0;
err = -EMSGSIZE;
@@ -348,15 +352,17 @@ devlink_nl_resource_dump_one(struct sk_buff *skb, struct devlink *devlink,
if (devlink_nl_put_handle(skb, devlink))
goto nla_put_failure;
+ if (devlink_port &&
+ nla_put_u32(skb, DEVLINK_ATTR_PORT_INDEX, devlink_port->index))
+ goto nla_put_failure;
resources_attr = nla_nest_start_noflag(skb, DEVLINK_ATTR_RESOURCE_LIST);
if (!resources_attr)
goto nla_put_failure;
- err = devlink_resource_list_fill(skb, devlink,
- &devlink->resource_list, &state->idx);
+ err = devlink_resource_list_fill(skb, devlink, resource_list, idx);
if (err) {
- if (state->idx == start_idx)
+ if (*idx == start_idx)
goto resource_list_cancel;
nla_nest_end(skb, resources_attr);
genlmsg_end(skb, hdr);
@@ -373,6 +379,43 @@ devlink_nl_resource_dump_one(struct sk_buff *skb, struct devlink *devlink,
return err;
}
+static int
+devlink_nl_resource_dump_one(struct sk_buff *skb, struct devlink *devlink,
+ struct netlink_callback *cb, int flags)
+{
+ struct devlink_nl_dump_state *state = devlink_dump_state(cb);
+ struct devlink_port *devlink_port;
+ unsigned long port_idx;
+ int err;
+
+ if (!state->port_ctx.index_valid) {
+ err = devlink_resource_dump_fill_one(skb, devlink, NULL,
+ cb, flags, &state->idx);
+ if (err)
+ return err;
+ state->idx = 0;
+ }
+
+ /* Check in case port was removed between dump callbacks. */
+ if (state->port_ctx.index_valid &&
+ !xa_load(&devlink->ports, state->port_ctx.index))
+ state->idx = 0;
+ state->port_ctx.index_valid = true;
+ xa_for_each_start(&devlink->ports, port_idx, devlink_port,
+ state->port_ctx.index) {
+ err = devlink_resource_dump_fill_one(skb, devlink, devlink_port,
+ cb, flags, &state->idx);
+ if (err) {
+ state->port_ctx.index = port_idx;
+ return err;
+ }
+ state->idx = 0;
+ }
+ state->port_ctx.index_valid = false;
+ state->port_ctx.index = 0;
+ return 0;
+}
+
int devlink_nl_resource_dump_dumpit(struct sk_buff *skb,
struct netlink_callback *cb)
{
--
2.44.0
* [PATCH net-next V5 07/12] devlink: Add port-specific option to resource dump doit
From: Tariq Toukan @ 2026-04-07 19:41 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Matthieu Baerts (NGI0), Chuck Lever,
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Daniel Zahka, Shahar Shitrit, Cosmin Ratiu, Jacob Keller,
Parav Pandit, Adithya Jayachandran, Shay Drori, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman
In-Reply-To: <20260407194107.148063-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
Allow querying devlink resources per-port via the resource-dump doit
handler. When a port-index attribute is provided, only that port's
resources are returned. When no port-index is given, only device-level
resources are returned, preserving backward compatibility.
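The selection rule described above can be sketched in user space as follows. The types and select_list() are hypothetical; only the error codes and the order of the checks are taken from the devlink_nl_resource_dump_doit() hunk in this patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects. */
struct res_list { size_t count; };
struct port { struct res_list resources; };
struct device { struct res_list resources; };

/*
 * Pick the list a "do" request should report. port is NULL when the
 * request resolved no port; port_attr_present says whether the request
 * carried a port-index attribute at all.
 */
static int select_list(const struct device *dev, const struct port *port,
		       bool port_attr_present, const struct res_list **out)
{
	const struct res_list *list;

	if (port_attr_present && !port)
		return -ENODEV;		/* port-index given but not found */
	list = port ? &port->resources : &dev->resources;
	if (list->count == 0)
		return -EOPNOTSUPP;	/* nothing registered at this level */
	*out = list;
	return 0;
}
```

A request without port-index never reaches a port list in this model, which is what keeps existing device-level users unaffected.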
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
Documentation/netlink/specs/devlink.yaml | 4 +++-
net/devlink/netlink_gen.c | 3 ++-
net/devlink/netlink_gen.h | 4 ++--
net/devlink/resource.c | 20 +++++++++++++++++---
4 files changed, 24 insertions(+), 7 deletions(-)
diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml
index c423e049c7bd..34aa81ba689e 100644
--- a/Documentation/netlink/specs/devlink.yaml
+++ b/Documentation/netlink/specs/devlink.yaml
@@ -1757,19 +1757,21 @@ operations:
attribute-set: devlink
dont-validate: [strict]
do:
- pre: devlink-nl-pre-doit
+ pre: devlink-nl-pre-doit-port-optional
post: devlink-nl-post-doit
request:
attributes:
- bus-name
- dev-name
- index
+ - port-index
reply: &resource-dump-reply
value: 36
attributes:
- bus-name
- dev-name
- index
+ - port-index
- resource-list
dump:
request:
diff --git a/net/devlink/netlink_gen.c b/net/devlink/netlink_gen.c
index a5a47a4c6de8..9cc372d9ee41 100644
--- a/net/devlink/netlink_gen.c
+++ b/net/devlink/netlink_gen.c
@@ -309,6 +309,7 @@ static const struct nla_policy devlink_resource_dump_do_nl_policy[DEVLINK_ATTR_I
[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_INDEX] = NLA_POLICY_FULL_RANGE(NLA_UINT, &devlink_attr_index_range),
+ [DEVLINK_ATTR_PORT_INDEX] = { .type = NLA_U32, },
};
/* DEVLINK_CMD_RESOURCE_DUMP - dump */
@@ -962,7 +963,7 @@ const struct genl_split_ops devlink_nl_ops[75] = {
{
.cmd = DEVLINK_CMD_RESOURCE_DUMP,
.validate = GENL_DONT_VALIDATE_STRICT,
- .pre_doit = devlink_nl_pre_doit,
+ .pre_doit = devlink_nl_pre_doit_port_optional,
.doit = devlink_nl_resource_dump_doit,
.post_doit = devlink_nl_post_doit,
.policy = devlink_resource_dump_do_nl_policy,
diff --git a/net/devlink/netlink_gen.h b/net/devlink/netlink_gen.h
index d79f6a0888f6..20034b0929a8 100644
--- a/net/devlink/netlink_gen.h
+++ b/net/devlink/netlink_gen.h
@@ -24,11 +24,11 @@ int devlink_nl_pre_doit(const struct genl_split_ops *ops, struct sk_buff *skb,
struct genl_info *info);
int devlink_nl_pre_doit_port(const struct genl_split_ops *ops,
struct sk_buff *skb, struct genl_info *info);
-int devlink_nl_pre_doit_dev_lock(const struct genl_split_ops *ops,
- struct sk_buff *skb, struct genl_info *info);
int devlink_nl_pre_doit_port_optional(const struct genl_split_ops *ops,
struct sk_buff *skb,
struct genl_info *info);
+int devlink_nl_pre_doit_dev_lock(const struct genl_split_ops *ops,
+ struct sk_buff *skb, struct genl_info *info);
void
devlink_nl_post_doit(const struct genl_split_ops *ops, struct sk_buff *skb,
struct genl_info *info);
diff --git a/net/devlink/resource.c b/net/devlink/resource.c
index 7984eda63eb6..bf5221fb3e64 100644
--- a/net/devlink/resource.c
+++ b/net/devlink/resource.c
@@ -251,8 +251,10 @@ static int devlink_resource_list_fill(struct sk_buff *skb,
static int devlink_resource_fill(struct genl_info *info,
enum devlink_command cmd, int flags)
{
+ struct devlink_port *devlink_port = info->user_ptr[1];
struct devlink *devlink = info->user_ptr[0];
struct devlink_resource *resource;
+ struct list_head *resource_list;
struct nlattr *resources_attr;
struct sk_buff *skb = NULL;
struct nlmsghdr *nlh;
@@ -261,7 +263,9 @@ static int devlink_resource_fill(struct genl_info *info,
int i;
int err;
- resource = list_first_entry(&devlink->resource_list,
+ resource_list = devlink_port ?
+ &devlink_port->resource_list : &devlink->resource_list;
+ resource = list_first_entry(resource_list,
struct devlink_resource, list);
start_again:
err = devlink_nl_msg_reply_and_new(&skb, info);
@@ -277,6 +281,9 @@ static int devlink_resource_fill(struct genl_info *info,
if (devlink_nl_put_handle(skb, devlink))
goto nla_put_failure;
+ if (devlink_port &&
+ nla_put_u32(skb, DEVLINK_ATTR_PORT_INDEX, devlink_port->index))
+ goto nla_put_failure;
resources_attr = nla_nest_start_noflag(skb,
DEVLINK_ATTR_RESOURCE_LIST);
@@ -285,7 +292,7 @@ static int devlink_resource_fill(struct genl_info *info,
incomplete = false;
i = 0;
- list_for_each_entry_from(resource, &devlink->resource_list, list) {
+ list_for_each_entry_from(resource, resource_list, list) {
err = devlink_resource_put(devlink, skb, resource);
if (err) {
if (!i)
@@ -319,9 +326,16 @@ static int devlink_resource_fill(struct genl_info *info,
int devlink_nl_resource_dump_doit(struct sk_buff *skb, struct genl_info *info)
{
+ struct devlink_port *devlink_port = info->user_ptr[1];
struct devlink *devlink = info->user_ptr[0];
+ struct list_head *resource_list;
+
+ if (info->attrs[DEVLINK_ATTR_PORT_INDEX] && !devlink_port)
+ return -ENODEV;
- if (list_empty(&devlink->resource_list))
+ resource_list = devlink_port ?
+ &devlink_port->resource_list : &devlink->resource_list;
+ if (list_empty(resource_list))
return -EOPNOTSUPP;
return devlink_resource_fill(info, DEVLINK_CMD_RESOURCE_DUMP, 0);
--
2.44.0
* [PATCH net-next V5 08/12] selftest: netdevsim: Add devlink port resource doit test
From: Tariq Toukan @ 2026-04-07 19:41 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Matthieu Baerts (NGI0), Chuck Lever,
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Daniel Zahka, Shahar Shitrit, Cosmin Ratiu, Jacob Keller,
Parav Pandit, Adithya Jayachandran, Shay Drori, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman
In-Reply-To: <20260407194107.148063-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
Add a test that verifies that querying a specific port handle returns
the expected resource name and size.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../drivers/net/netdevsim/devlink.sh | 29 ++++++++++++++++++-
1 file changed, 28 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
index 1b529ccaf050..31d1cef54898 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
@@ -5,7 +5,8 @@ lib_dir=$(dirname $0)/../../../net/forwarding
ALL_TESTS="fw_flash_test params_test \
params_default_test regions_test reload_test \
- netns_reload_test resource_test dev_info_test \
+ netns_reload_test resource_test \
+ port_resource_doit_test dev_info_test \
empty_reporter_test dummy_reporter_test rate_test"
NUM_NETIFS=0
source $lib_dir/lib.sh
@@ -768,6 +769,32 @@ rate_node_del()
devlink port function rate del $handle
}
+port_resource_doit_test()
+{
+ RET=0
+
+ local port_handle="${DL_HANDLE}/0"
+ local name
+ local size
+
+ if ! devlink resource help 2>&1 | grep -q "PORT_INDEX"; then
+ echo "SKIP: devlink resource show with port not supported"
+ return
+ fi
+
+ name=$(cmd_jq "devlink resource show $port_handle -j" \
+ '.[][][].name')
+ [ "$name" == "test_resource" ]
+ check_err $? "wrong port resource name (got $name)"
+
+ size=$(cmd_jq "devlink resource show $port_handle -j" \
+ '.[][][].size')
+ [ "$size" == "20" ]
+ check_err $? "wrong port resource size (got $size)"
+
+ log_test "port resource doit test"
+}
+
rate_test()
{
RET=0
--
2.44.0
* [PATCH net-next V5 09/12] devlink: Document port-level resources and full dump
From: Tariq Toukan @ 2026-04-07 19:41 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Matthieu Baerts (NGI0), Chuck Lever,
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Daniel Zahka, Shahar Shitrit, Cosmin Ratiu, Jacob Keller,
Parav Pandit, Adithya Jayachandran, Shay Drori, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman, Jiri Pirko
In-Reply-To: <20260407194107.148063-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
Document the port-level resource support and the option to dump all
resources, including both device-level and port-level entries.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../networking/devlink/devlink-resource.rst | 35 +++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/Documentation/networking/devlink/devlink-resource.rst b/Documentation/networking/devlink/devlink-resource.rst
index 3d5ae51e65a2..9839c1661315 100644
--- a/Documentation/networking/devlink/devlink-resource.rst
+++ b/Documentation/networking/devlink/devlink-resource.rst
@@ -74,3 +74,38 @@ attribute, which represents the pending change in size. For example:
Note that changes in resource size may require a device reload to properly
take effect.
+
+Port-level Resources and Full Dump
+==================================
+
+In addition to device-level resources, ``devlink`` also supports port-level
+resources. These resources are associated with a specific devlink port rather
+than the device as a whole.
+
+To list resources for all devlink devices and ports:
+
+.. code:: shell
+
+ $ devlink resource show
+ pci/0000:03:00.0:
+ name max_local_SFs size 128 unit entry dpipe_tables none
+ name max_external_SFs size 128 unit entry dpipe_tables none
+ pci/0000:03:00.0/196608:
+ name max_SFs size 128 unit entry dpipe_tables none
+ pci/0000:03:00.0/196609:
+ name max_SFs size 128 unit entry dpipe_tables none
+ pci/0000:03:00.1:
+ name max_local_SFs size 128 unit entry dpipe_tables none
+ name max_external_SFs size 128 unit entry dpipe_tables none
+ pci/0000:03:00.1/196708:
+ name max_SFs size 128 unit entry dpipe_tables none
+ pci/0000:03:00.1/196709:
+ name max_SFs size 128 unit entry dpipe_tables none
+
+To show resources for a specific port:
+
+.. code:: shell
+
+ $ devlink resource show pci/0000:03:00.0/196608
+ pci/0000:03:00.0/196608:
+ name max_SFs size 128 unit entry dpipe_tables none
--
2.44.0
* [PATCH net-next V5 10/12] devlink: Add resource scope filtering to resource dump
From: Tariq Toukan @ 2026-04-07 19:41 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Matthieu Baerts (NGI0), Chuck Lever,
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Daniel Zahka, Shahar Shitrit, Cosmin Ratiu, Jacob Keller,
Parav Pandit, Adithya Jayachandran, Shay Drori, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman
In-Reply-To: <20260407194107.148063-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
Allow filtering the resource dump to device-level or port-level
resources using the 'scope' option.
Example - dump only device-level resources:
$ devlink resource show scope dev
pci/0000:03:00.0:
name max_local_SFs size 128 unit entry dpipe_tables none
name max_external_SFs size 128 unit entry dpipe_tables none
pci/0000:03:00.1:
name max_local_SFs size 128 unit entry dpipe_tables none
name max_external_SFs size 128 unit entry dpipe_tables none
Example - dump only port-level resources:
$ devlink resource show scope port
pci/0000:03:00.0/196608:
name max_SFs size 128 unit entry dpipe_tables none
pci/0000:03:00.0/196609:
name max_SFs size 128 unit entry dpipe_tables none
pci/0000:03:00.1/196708:
name max_SFs size 128 unit entry dpipe_tables none
pci/0000:03:00.1/196709:
name max_SFs size 128 unit entry dpipe_tables none
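How the scope mask maps to what gets dumped can be sketched in user space as below. SCOPE_DEV and SCOPE_PORT mirror the bit positions of DEVLINK_RESOURCE_SCOPE_DEV and DEVLINK_RESOURCE_SCOPE_PORT from this patch; struct scope_sel and scope_select() are hypothetical names. Rejection of bits outside 0x3 is left to the netlink policy in the patch and is not modelled here.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SCOPE_DEV  (1u << 0)	/* mirrors DEVLINK_RESOURCE_SCOPE_DEV */
#define SCOPE_PORT (1u << 1)	/* mirrors DEVLINK_RESOURCE_SCOPE_PORT */

struct scope_sel {
	bool dev;	/* dump device-level resources */
	bool port;	/* dump port-level resources */
};

/*
 * Translate the optional scope attribute into a selection. An absent
 * attribute selects everything; a present but empty mask is an error.
 */
static int scope_select(bool attr_present, uint32_t mask,
			struct scope_sel *sel)
{
	if (!attr_present) {
		sel->dev = true;
		sel->port = true;
		return 0;
	}
	if (mask == 0)
		return -EINVAL;	/* "empty resource scope selection" */
	sel->dev = (mask & SCOPE_DEV) != 0;
	sel->port = (mask & SCOPE_PORT) != 0;
	return 0;
}
```

Setting both bits gives the same selection as omitting the attribute, so in this model "scope dev" and "scope port" are strict subsets of the default dump.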
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
Documentation/netlink/specs/devlink.yaml | 24 +++++++++++++++++++++++-
include/uapi/linux/devlink.h | 11 +++++++++++
net/devlink/netlink_gen.c | 5 +++--
net/devlink/resource.c | 19 ++++++++++++++++++-
4 files changed, 55 insertions(+), 4 deletions(-)
diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml
index 34aa81ba689e..247b147d689f 100644
--- a/Documentation/netlink/specs/devlink.yaml
+++ b/Documentation/netlink/specs/devlink.yaml
@@ -157,6 +157,14 @@ definitions:
entries:
-
name: entry
+ -
+ type: enum
+ name: resource-scope
+ entries:
+ -
+ name: dev
+ -
+ name: port
-
type: enum
name: reload-action
@@ -873,6 +881,16 @@ attribute-sets:
doc: Unique devlink instance index.
checks:
max: u32-max
+ -
+ name: resource-scope-mask
+ type: u32
+ enum: resource-scope
+ enum-as-flags: true
+ doc: |
+ Bitmask selecting which resource classes to include in a
+ resource-dump response. Bit 0 (dev) selects device-level
+ resources; bit 1 (port) selects port-level resources.
+ When absent all classes are returned.
-
name: dl-dev-stats
subset-of: devlink
@@ -1775,7 +1793,11 @@ operations:
- resource-list
dump:
request:
- attributes: *dev-id-attrs
+ attributes:
+ - bus-name
+ - dev-name
+ - index
+ - resource-scope-mask
reply: *resource-dump-reply
-
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 7de2d8cc862f..0b165eac7619 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -645,6 +645,7 @@ enum devlink_attr {
DEVLINK_ATTR_PARAM_RESET_DEFAULT, /* flag */
DEVLINK_ATTR_INDEX, /* uint */
+ DEVLINK_ATTR_RESOURCE_SCOPE_MASK, /* u32 */
/* Add new attributes above here, update the spec in
* Documentation/netlink/specs/devlink.yaml and re-generate
@@ -704,6 +705,16 @@ enum devlink_resource_unit {
DEVLINK_RESOURCE_UNIT_ENTRY,
};
+enum devlink_resource_scope {
+ DEVLINK_RESOURCE_SCOPE_DEV_BIT,
+ DEVLINK_RESOURCE_SCOPE_PORT_BIT,
+};
+
+#define DEVLINK_RESOURCE_SCOPE_DEV \
+ _BITUL(DEVLINK_RESOURCE_SCOPE_DEV_BIT)
+#define DEVLINK_RESOURCE_SCOPE_PORT \
+ _BITUL(DEVLINK_RESOURCE_SCOPE_PORT_BIT)
+
enum devlink_port_fn_attr_cap {
DEVLINK_PORT_FN_ATTR_CAP_ROCE_BIT,
DEVLINK_PORT_FN_ATTR_CAP_MIGRATABLE_BIT,
diff --git a/net/devlink/netlink_gen.c b/net/devlink/netlink_gen.c
index 9cc372d9ee41..81899786fd98 100644
--- a/net/devlink/netlink_gen.c
+++ b/net/devlink/netlink_gen.c
@@ -313,10 +313,11 @@ static const struct nla_policy devlink_resource_dump_do_nl_policy[DEVLINK_ATTR_I
};
/* DEVLINK_CMD_RESOURCE_DUMP - dump */
-static const struct nla_policy devlink_resource_dump_dump_nl_policy[DEVLINK_ATTR_INDEX + 1] = {
+static const struct nla_policy devlink_resource_dump_dump_nl_policy[DEVLINK_ATTR_RESOURCE_SCOPE_MASK + 1] = {
[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_INDEX] = NLA_POLICY_FULL_RANGE(NLA_UINT, &devlink_attr_index_range),
+ [DEVLINK_ATTR_RESOURCE_SCOPE_MASK] = NLA_POLICY_MASK(NLA_U32, 0x3),
};
/* DEVLINK_CMD_RELOAD - do */
@@ -974,7 +975,7 @@ const struct genl_split_ops devlink_nl_ops[75] = {
.cmd = DEVLINK_CMD_RESOURCE_DUMP,
.dumpit = devlink_nl_resource_dump_dumpit,
.policy = devlink_resource_dump_dump_nl_policy,
- .maxattr = DEVLINK_ATTR_INDEX,
+ .maxattr = DEVLINK_ATTR_RESOURCE_SCOPE_MASK,
.flags = GENL_CMD_CAP_DUMP,
},
{
diff --git a/net/devlink/resource.c b/net/devlink/resource.c
index bf5221fb3e64..3d2f42bc2fb5 100644
--- a/net/devlink/resource.c
+++ b/net/devlink/resource.c
@@ -398,11 +398,25 @@ devlink_nl_resource_dump_one(struct sk_buff *skb, struct devlink *devlink,
struct netlink_callback *cb, int flags)
{
struct devlink_nl_dump_state *state = devlink_dump_state(cb);
+ const struct genl_info *info = genl_info_dump(cb);
struct devlink_port *devlink_port;
+ struct nlattr *scope_attr = NULL;
unsigned long port_idx;
+ u32 scope = 0;
int err;
- if (!state->port_ctx.index_valid) {
+ if (info->attrs && info->attrs[DEVLINK_ATTR_RESOURCE_SCOPE_MASK]) {
+ scope_attr = info->attrs[DEVLINK_ATTR_RESOURCE_SCOPE_MASK];
+ scope = nla_get_u32(scope_attr);
+ if (!scope) {
+ NL_SET_ERR_MSG_ATTR(info->extack, scope_attr,
+ "empty resource scope selection");
+ return -EINVAL;
+ }
+ }
+
+ if (!state->port_ctx.index_valid &&
+ (!scope || (scope & DEVLINK_RESOURCE_SCOPE_DEV))) {
err = devlink_resource_dump_fill_one(skb, devlink, NULL,
cb, flags, &state->idx);
if (err)
@@ -410,6 +424,8 @@ devlink_nl_resource_dump_one(struct sk_buff *skb, struct devlink *devlink,
state->idx = 0;
}
+ if (scope && !(scope & DEVLINK_RESOURCE_SCOPE_PORT))
+ goto out;
/* Check in case port was removed between dump callbacks. */
if (state->port_ctx.index_valid &&
!xa_load(&devlink->ports, state->port_ctx.index))
@@ -425,6 +441,7 @@ devlink_nl_resource_dump_one(struct sk_buff *skb, struct devlink *devlink,
}
state->idx = 0;
}
+out:
state->port_ctx.index_valid = false;
state->port_ctx.index = 0;
return 0;
--
2.44.0
* [PATCH net-next V5 11/12] selftest: netdevsim: Add resource dump and scope filter test
From: Tariq Toukan @ 2026-04-07 19:41 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Matthieu Baerts (NGI0), Chuck Lever,
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Daniel Zahka, Shahar Shitrit, Cosmin Ratiu, Jacob Keller,
Parav Pandit, Adithya Jayachandran, Shay Drori, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman
In-Reply-To: <20260407194107.148063-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
Add resource_dump_test() which verifies dumping resources for all
devices and ports, and tests that scope=dev returns only device-level
resources and scope=port returns only port resources.
Skip if userspace does not support the scope parameter.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../drivers/net/netdevsim/devlink.sh | 52 ++++++++++++++++++-
1 file changed, 51 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
index 31d1cef54898..22a626c6cde3 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
@@ -5,7 +5,7 @@ lib_dir=$(dirname $0)/../../../net/forwarding
ALL_TESTS="fw_flash_test params_test \
params_default_test regions_test reload_test \
- netns_reload_test resource_test \
+ netns_reload_test resource_test resource_dump_test \
port_resource_doit_test dev_info_test \
empty_reporter_test dummy_reporter_test rate_test"
NUM_NETIFS=0
@@ -483,6 +483,56 @@ resource_test()
log_test "resource test"
}
+resource_dump_test()
+{
+ RET=0
+
+ local port_jq
+ local dev_jq
+ local dl_jq
+ local count
+
+ dl_jq="with_entries(select(.key | startswith(\"$DL_HANDLE\")))"
+ port_jq="[.[] | $dl_jq | keys |"
+ port_jq+=" map(select(test(\"/.+/\"))) | length] | add"
+ dev_jq="[.[] | $dl_jq | keys |"
+ dev_jq+=" map(select(test(\"/.+/\")|not)) | length] | add"
+
+ if ! devlink resource help 2>&1 | grep -q "scope"; then
+ echo "SKIP: devlink resource show scope not supported"
+ return
+ fi
+
+ devlink resource show > /dev/null 2>&1
+ check_err $? "Failed to dump all resources"
+
+ count=$(cmd_jq "devlink resource show -j" "$port_jq")
+ [ "$count" -gt "0" ]
+ check_err $? "missing port resources in resource dump"
+
+ count=$(cmd_jq "devlink resource show -j" "$dev_jq")
+ [ "$count" -gt "0" ]
+ check_err $? "missing device resources in resource dump"
+
+ count=$(cmd_jq "devlink resource show scope dev -j" "$dev_jq")
+ [ "$count" -gt "0" ]
+ check_err $? "dev scope missing device resources"
+
+ count=$(cmd_jq "devlink resource show scope dev -j" "$port_jq")
+ [ "$count" -eq "0" ]
+ check_err $? "dev scope returned port resources"
+
+ count=$(cmd_jq "devlink resource show scope port -j" "$port_jq")
+ [ "$count" -gt "0" ]
+ check_err $? "port scope missing port resources"
+
+ count=$(cmd_jq "devlink resource show scope port -j" "$dev_jq")
+ [ "$count" -eq "0" ]
+ check_err $? "port scope returned device resources"
+
+ log_test "resource dump test"
+}
+
info_get()
{
local name=$1
--
2.44.0
^ permalink raw reply related
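The jq filters in the test above separate port handles from device handles with the unanchored regex `/.+/`, which matches a key only if it contains two slashes with at least one character between them. A minimal C sketch of the same classification follows, assuming handles of the form `bus/dev` and `bus/dev/port` as in the documentation examples; `is_port_handle` is a hypothetical helper, not part of the selftest.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Equivalent of the unanchored regex "/.+/": true when two '/' characters
 * are separated by at least one character. Checking the first and last
 * slash is enough, since no other pair can be further apart.
 */
bool is_port_handle(const char *handle)
{
	const char *first = strchr(handle, '/');
	const char *last = strrchr(handle, '/');

	return first && last - first >= 2;
}
```

With this rule, `pci/0000:03:00.0/196608` counts as a port handle and `pci/0000:03:00.0` as a device handle, which is what `port_jq` and `dev_jq` rely on.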
* [PATCH net-next V5 12/12] devlink: Document resource scope filtering
From: Tariq Toukan @ 2026-04-07 19:41 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Matthieu Baerts (NGI0), Chuck Lever,
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Daniel Zahka, Shahar Shitrit, Cosmin Ratiu, Jacob Keller,
Parav Pandit, Adithya Jayachandran, Shay Drori, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman
In-Reply-To: <20260407194107.148063-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
Document the scope parameter for devlink resource show, which allows
filtering the dump to device-level or port-level resources only.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../networking/devlink/devlink-resource.rst | 35 +++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/Documentation/networking/devlink/devlink-resource.rst b/Documentation/networking/devlink/devlink-resource.rst
index 9839c1661315..47eec8f875b4 100644
--- a/Documentation/networking/devlink/devlink-resource.rst
+++ b/Documentation/networking/devlink/devlink-resource.rst
@@ -109,3 +109,38 @@ To show resources for a specific port:
$ devlink resource show pci/0000:03:00.0/196608
pci/0000:03:00.0/196608:
name max_SFs size 128 unit entry dpipe_tables none
+
+Resource Scope Filtering
+========================
+
+When dumping resources for all devices, ``devlink resource show`` accepts
+an optional ``scope`` parameter to restrict the response to device-level
+resources, port-level resources, or both (the default).
+
+To dump only device-level resources across all devices:
+
+.. code:: shell
+
+ $ devlink resource show scope dev
+ pci/0000:03:00.0:
+ name max_local_SFs size 128 unit entry dpipe_tables none
+ name max_external_SFs size 128 unit entry dpipe_tables none
+ pci/0000:03:00.1:
+ name max_local_SFs size 128 unit entry dpipe_tables none
+ name max_external_SFs size 128 unit entry dpipe_tables none
+
+To dump only port-level resources across all devices:
+
+.. code:: shell
+
+ $ devlink resource show scope port
+ pci/0000:03:00.0/196608:
+ name max_SFs size 128 unit entry dpipe_tables none
+ pci/0000:03:00.0/196609:
+ name max_SFs size 128 unit entry dpipe_tables none
+ pci/0000:03:00.1/196708:
+ name max_SFs size 128 unit entry dpipe_tables none
+ pci/0000:03:00.1/196709:
+ name max_SFs size 128 unit entry dpipe_tables none
+
+Note that port-level resources are read-only.
--
2.44.0
^ permalink raw reply related
* Re: [PATCH] KVM: x86: nSVM: Redirect IA32_PAT accesses to either hPAT or gPAT
From: Jim Mattson @ 2026-04-07 20:51 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
kvm, linux-doc, linux-kernel, Yosry Ahmed
In-Reply-To: <adVZ7-EiekghvDMD@google.com>
On Tue, Apr 7, 2026 at 12:24 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Apr 07, 2026, Jim Mattson wrote:
> > When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled and the vCPU is in
> > guest mode with nested NPT enabled, guest accesses to IA32_PAT are
> > redirected to the gPAT register, which is stored in VMCB02's g_pat field.
> >
> > Non-guest accesses (e.g. from userspace) to IA32_PAT are always redirected
> > to hPAT, which is stored in vcpu->arch.pat.
> >
> > Directing host-initiated accesses to hPAT ensures that KVM_GET/SET_MSRS and
> > KVM_GET/SET_NESTED_STATE are independent of each other and can be ordered
> > arbitrarily during save and restore. gPAT is saved and restored separately
> > via KVM_GET/SET_NESTED_STATE.
> >
> > Use WARN_ON_ONCE to flag any host-initiated accesses originating from KVM
> > itself rather than userspace.
> >
> > Use pr_warn_once to flag any use of the common MSR-handling code (now
> > shared by VMX and TDX) for IA32_PAT by a vCPU that is SVM-capable.
>
> Changelog is stale, but otherwise this LGTM. I'll fixup the changelog when
> applying (in a few weeks).
Oh, crud. This was supposed to be 5/8, but I made some changes after
checkpatch.pl complained and then tried to just regenerate this one,
but I totally flubbed it.
^ permalink raw reply
* Re: [PATCH v9 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: Jim Mattson @ 2026-04-07 20:53 UTC (permalink / raw)
To: Pawan Gupta
Cc: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
Paolo Bonzini, Jonathan Corbet, linux-kernel, kvm, Asit Mallick,
Tao Zhang, bpf, netdev, linux-doc, chao.gao
In-Reply-To: <20260407191128.b2hr2ttkdpyunhrr@desk>
On Tue, Apr 7, 2026 at 12:11 PM Pawan Gupta
<pawan.kumar.gupta@linux.intel.com> wrote:
>
> On Tue, Apr 07, 2026 at 11:40:57AM -0700, Jim Mattson wrote:
> > My proposal is as follows:
> >
> > 1. The (advanced) hypervisor can advertise to the guest (via CPUID bit
> > or MSR bit) that the short BHB clearing sequence is adequate. This may
> > mean either that the VM will only be hosted on pre-Alder Lake hardware
> > or that the hypervisor will set BHI_DIS_S behind the back of the
> > guest. Presumably, this bit would not be reported if BHI_CTRL is
> > advertised to the guest.
> > 2. If the guest sees this bit, then it can use the short sequence. If
> > it doesn't see this bit, it must use the long sequence.
>
> That's a good middle ground. Let me check with folks internally what they
> think about defining a new software-only bit.
>
> Third case, for a guest that doesn't want BHI_DIS_S, userspace should be
> allowed to override setting BHI_DIS_S. Then this proposed bit can indicate
> that long sequence is required.
That case can be handled by the paravirtual mitigation MSRs, right?
^ permalink raw reply
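The two-point proposal quoted above can be sketched as a pair of predicates. This is only an illustration of the described rule: no such CPUID or MSR bit is defined in the thread, and every name below is hypothetical.

```c
#include <stdbool.h>

enum bhb_seq { BHB_SEQ_SHORT, BHB_SEQ_LONG };

/*
 * Hypervisor side (point 1): advertise the bit only when the short
 * sequence is adequate, i.e. the VM stays on pre-Alder Lake hosts or
 * the hypervisor sets BHI_DIS_S on the guest's behalf. Per the
 * "presumably" in the proposal, the bit is withheld when BHI_CTRL is
 * exposed to the guest.
 */
bool hv_advertise_short_ok(bool pre_alder_lake_only, bool hv_sets_bhi_dis_s,
			   bool bhi_ctrl_exposed)
{
	if (bhi_ctrl_exposed)
		return false;
	return pre_alder_lake_only || hv_sets_bhi_dis_s;
}

/* Guest side (point 2): without the bit, fall back to the long sequence. */
enum bhb_seq guest_pick_seq(bool short_ok_bit)
{
	return short_ok_bit ? BHB_SEQ_SHORT : BHB_SEQ_LONG;
}
```

Defaulting to the long sequence when the bit is absent keeps guests safe on hypervisors that do not know about the bit at all.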
* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-04-07 21:09 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <CAEvNRgEtigp7+PVDkyu_DH947CUqDt312d+P+hWjjd2fHONiag@mail.gmail.com>
On Fri, Apr 03, 2026 at 07:50:16AM -0700, Ackerley Tng wrote:
> Ackerley Tng <ackerleytng@google.com> writes:
>
> >
> > [...snip...]
> >
> > guest_memfd's populate will first check that the memory is shared, then
> > also set the memory to private after the populate.
> >
> > [...snip...]
> >
> Looking at this again, the above basically means that the entire
> conversion process needs to be performed within populate.
>
> In addition to setting the attributes in guest_memfd as private, for
> consistency, populate will also have to do all the associated
> operations, especially unmapping from the host, checking refcounts,
> and the list of work in conversion will only increase in future with
> direct map removal/restoration and huge page merging.
>
> The complexity of conversion also means possible errors (EAGAIN for
> elevated refcounts and ENOMEM when we're out of memory), and error
> information like the offset where the elevated refcount was.
>
> It doesn't look like there's room for this information to be plumbed out
> through the platform-specific ioctls, and even if we make space, it
> seems odd to have conversion-related error information returned through
> the platform-specific call.
>
>
> I agree with the goal of not having KVM touch private memory contents as
> tracked by guest_memfd, so I'd like to propose that we distinguish:
>
> 1. private as tracked by KVM (guest_memfd/vm_memory_attributes)
> 2. private as tracked by trusted entity
I think this is a good distinction to keep in mind, because if we adopt
the proposal from the call of having userspace set the destination memory
to shared prior to calling kvm_gmem_populate(), then they don't really
stay shared until gmem converts them to private: instead, they get set to
"private as tracked by trusted entity", but at the same time still have
'shared' memory attributes as far as KVM is concerned. Normally (SNP,
at least), the 'private (as tracked by KVM)' state is an intermediate state
on the way to 'private (as tracked by KVM + trusted entity)'.
So we introduce some inconsistencies on that side, in order to address
the inconsistency of kvm_gmem_populate() writing to 'private (as tracked
by KVM)' memory. But as you point out...
> + destination address: private (as tracked by guest_memfd)
> + source address: shared (as tracked by guest_memfd) or NULL
>
> KVM doesn't touch private memory contents, even in this case, because
> it's really a platform-specific ioctl that handles loading, and the
> platform does expect the destination is private for both TDX and SNP
> at the firmware boundary.
...yah, it's not really gmem that's writing to that memory, it's the
platform-specific hooks that 'prepare' the memory as part of population
and put that in a 'private (as tracked by trusted entity)' state, just as
it's the platform-specific hooks that 'prepare' the memory as part of the vCPU
page fault path at run-time and put it into a 'private (as tracked by
trusted entity)' state. You could even imagine a naive CoCo implementation that
encrypts memory in-place at initial fault time via kvm_gmem_prepare()
hooks... we likely wouldn't insist on some new flow because this results
in gmem calling something that writes to 'private (as tracked by KVM)'
pages and would consider that to be more of a platform-specific
implementation detail that should be handled the same as other
architectures. And that seems like it would be roughly analogous to what
is being discussed here WRT the kvm_gmem_populate() path, so I think it
makes sense to continue expecting the pages to be marked private in
advance of platform-specific preparation, whether that be via the
populate path or the runtime/fault-time path.
And for recent KVM,
; all the things we exp
about how the callbacks
As far as the copying goes,
By expecting 'private' (as tracked by KVM) as the initial state for
kvm_gmem_populate(), a lot of invariants about private memory (safe
refcount, directmap removal expectations, etc.) remain consistent even
in the populate path, where any special handling for private memory can be
accounted for in the same way rather than "shared, but..." or "private,
but...".
>
> Since SNP (platform-specific) only allows in-place launch update, and
> KVM had to provide an interface that allows a different source address
> for support before in-place conversion, then SNP has to continue
> supporting the to-be-deprecated path by doing the copying within
> platform-specific code.
>
> For consistency, guest_memfd can continue to check that it tracks the
> destination address as private, and sev_gmem_populate will then hide
> the copying code away just to support the legacy case.
>
>
> The flow before in-place conversion is
>
> 1. Load memory (shared or non-guest_memfd memory)
> 2. KVM_SEV_SNP_LAUNCH_UPDATE or KVM_TDX_INIT_MEM_REGION (destination:
> gfn for separate private memory, source: shared memory)
>
> The proposed flow for in-place conversion is
>
> 1. INIT_SHARED or convert to shared
> 2. Load memory while it is shared
> 3. Convert to private (PRESERVE, or new flag?)
> 4. KVM_SEV_SNP_LAUNCH_UPDATE or KVM_TDX_INIT_MEM_REGION (destination:
> gfn for converted private memory, source: NULL)
>
>
> TLDR:
>
> + Think of populate ioctls not as KVM touching memory, but platform
> handling population.
> + KVM code (kvm_gmem_populate) still doesn't touch memory contents
> + post_populate is platform-specific code that handles loading into
> private destination memory just to support legacy non-in-place
> conversion.
> + Don't complicate populate ioctls by doing conversion just to support
> legacy use-cases where platform-specific code has to do copying on
> the host.
That's a good point: these are only considerations in the context of
actually copying from src->dst, but with in-place conversion the
primary/more-performant approach will be for userspace to initialize
directly. I.e. if we enforced that, then gmem could rightly ascertain that
it isn't even writing to private pages via these hooks and any
manipulation of that memory is purely on the part of the trusted entity
handling initial encryption/etc.
I understand that we decided to keep the option of allowing separate
src/dst even with in-place conversion, but it doesn't seem worthwhile if
that necessarily means we need to glue population+conversion together in
1 clumsy interface that needs to handle partial return/error responses to
userspace (or potentially get stuck forever in the conversion path).
So I agree with Ackerley's proposal (which I guess is the same as what's
in this series).
However, 1 other alternative would be to do what was suggested on the
call, but require userspace to subsequently handle the shared->private
conversion. I think that would be workable too.
One other benefit to Ackerley's/current approach however is that it allows
us to potentially keep hugepages intact in the populate path, since
prep'ing/encrypting everything while it's in a shared state means gmem will
split the hugepage and all the firmware/RMP/etc. data structures will only
be able to handle individual 4K pages. I still suspect doing things like
encoding the initial 2MB OVMF image as a single hugepage might yield
enough benefit to explore this (at some point). So there's some niceness
in knowing that Ackerley's approach would allow for that eventually and
not require a complete rethink on these same topics.
Thanks,
Mike
>
> >>>
> >>> [...snip...]
> >>>
^ permalink raw reply
* [PATCH] docs: fix typo in zoran driver documentation
From: Gleb Golovko @ 2026-04-07 21:28 UTC (permalink / raw)
To: corbet; +Cc: linux-doc, Gleb Golovko
Replace "an a few" with "and a few" in
Documentation/driver-api/media/drivers/zoran.rst.
Signed-off-by: Gleb Golovko <gaben123001@gmail.com>
---
Documentation/driver-api/media/drivers/zoran.rst | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Documentation/driver-api/media/drivers/zoran.rst b/Documentation/driver-api/media/drivers/zoran.rst
index 3e05b7f0442a..2538473c3233 100644
--- a/Documentation/driver-api/media/drivers/zoran.rst
+++ b/Documentation/driver-api/media/drivers/zoran.rst
@@ -222,7 +222,7 @@ The CCIR - I uses the PAL colorsystem, and is used in Great Britain, Hong Kong,
Ireland, Nigeria, South Africa.
The CCIR - N uses the PAL colorsystem and PAL frame size but the NTSC framerate,
-and is used in Argentina, Uruguay, an a few others
+and is used in Argentina, Uruguay, and a few others
We do not talk about how the audio is broadcast !
--
2.42.0.windows.2
^ permalink raw reply related
* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Vishal Annapurve @ 2026-04-07 21:50 UTC (permalink / raw)
To: Michael Roth
Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <yvfwexsub7nrogh67hzcsupbrkzer6a7kbeao5tlq4elrzc2iz@xrwdjd7p32pp>
On Tue, Apr 7, 2026 at 2:09 PM Michael Roth <michael.roth@amd.com> wrote:
>
> > TLDR:
> >
> > + Think of populate ioctls not as KVM touching memory, but platform
> > handling population.
> > + KVM code (kvm_gmem_populate) still doesn't touch memory contents
> > + post_populate is platform-specific code that handles loading into
> > private destination memory just to support legacy non-in-place
> > conversion.
> > + Don't complicate populate ioctls by doing conversion just to support
> > legacy use-cases where platform-specific code has to do copying on
> > the host.
>
> That's a good point: these are only considerations in the context of
> actually copying from src->dst, but with in-place conversion the
> primary/more-performant approach will be for userspace to initialize
> directly. I.e. if we enforced that, then gmem could rightly ascertain that
> it isn't even writing to private pages via these hooks and any
> manipulation of that memory is purely on the part of the trusted entity
> handling initial encryption/etc.
>
> I understand that we decided to keep the option of allowing separate
> src/dst even with in-place conversion, but it doesn't seem worthwhile if
> that necessarily means we need to glue population+conversion together in
> 1 clumsy interface that needs to handle partial return/error responses to
> userspace (or potentially get stuck forever in the conversion path).
I think ARM needs userspace to specify separate source and destination
memory ranges for initial population as ARM doesn't support in-place
memory encryption. [1]
[1] https://lore.kernel.org/kvm/20260318155413.793430-25-steven.price@arm.com/
>
> So I agree with Ackerley's proposal (which I guess is the same as what's
> in this series).
>
> However, 1 other alternative would be to do what was suggested on the
> call, but require userspace to subsequently handle the shared->private
> conversion. I think that would be workable too.
IIUC, converting memory ranges to private after they are essentially
treated as private by the KVM CC backend will expose the
implementation to the same risk of userspace being able to access
private memory and compromise host safety which guest_memfd was
invented to address.
>
> One other benefit to Ackerley's/current approach however is that it allows
> us to potentially keep hugepages intact in the populate path, since
> prep'ing/encrypting everything while it's in a shared state means gmem will
> split the hugepage and all the firmware/RMP/etc. data structures will only
> be able to handle individual 4K pages. I still suspect doing things like
> encoding the initial 2MB OVMF image as a single hugepage might yield
> enough benefit to explore this (at some point). So there's some niceness
> in knowing that Ackerley's approach would allow for that eventually and
> not require a complete rethink on these same topics.
>
> Thanks,
>
> Mike
>
> >
> > >>>
> > >>> [...snip...]
> > >>>
^ permalink raw reply
* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-04-07 22:09 UTC (permalink / raw)
To: Vishal Annapurve
Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <CAGtprH-kgRByFvvCYeWMXtsvpb6qpaWAo8k-3PEnioyPg-LEvA@mail.gmail.com>
On Tue, Apr 07, 2026 at 02:50:58PM -0700, Vishal Annapurve wrote:
> On Tue, Apr 7, 2026 at 2:09 PM Michael Roth <michael.roth@amd.com> wrote:
> >
> > > TLDR:
> > >
> > > + Think of populate ioctls not as KVM touching memory, but platform
> > > handling population.
> > > + KVM code (kvm_gmem_populate) still doesn't touch memory contents
> > > + post_populate is platform-specific code that handles loading into
> > > private destination memory just to support legacy non-in-place
> > > conversion.
> > > + Don't complicate populate ioctls by doing conversion just to support
> > > legacy use-cases where platform-specific code has to do copying on
> > > the host.
> >
> > That's a good point: these are only considerations in the context of
> > actually copying from src->dst, but with in-place conversion the
> > primary/more-performant approach will be for userspace to initialize
> > directly. I.e. if we enforced that, then gmem could rightly ascertain that
> > it isn't even writing to private pages via these hooks and any
> > manipulation of that memory is purely on the part of the trusted entity
> > handling initial encryption/etc.
> >
> > I understand that we decided to keep the option of allowing separate
> > src/dst even with in-place conversion, but it doesn't seem worthwhile if
> > that necessarily means we need to glue population+conversion together in
> > 1 clumsy interface that needs to handle partial return/error responses to
> > userspace (or potentially get stuck forever in the conversion path).
>
> I think ARM needs userspace to specify separate source and destination
> memory ranges for initial population as ARM doesn't support in-place
> memory encryption. [1]
>
> [1] https://lore.kernel.org/kvm/20260318155413.793430-25-steven.price@arm.com/
>
> >
> > So I agree with Ackerley's proposal (which I guess is the same as what's
> > in this series).
> >
> > However, 1 other alternative would be to do what was suggested on the
> > call, but require userspace to subsequently handle the shared->private
> > conversion. I think that would be workable too.
>
> IIUC, converting memory ranges to private after they are essentially
> treated as private by the KVM CC backend will expose the
> implementation to the same risk of userspace being able to access
> private memory and compromise host safety which guest_memfd was
> invented to address.
Doh, fair point. Doing conversion as part of the populate call would allow
us to use the filemap write-lock to avoid userspace being able to fault
in private (as tracked by trusted entity) pages before they are
transitioned to private (as tracked by KVM), so it's safer than having
userspace drive it.
But obviously I still think Ackerley's original proposal has more
upsides than the alternatives mentioned so far.
-Mike
>
> >
> > One other benefit to Ackerley's/current approach however is that it allows
> > us to potentially keep hugepages intact in the populate path, since
> > prep'ing/encrypting everything while it's in a shared state means gmem will
> > split the hugepage and all the firmware/RMP/etc. data structures will only
> > be able to handle individual 4K pages. I still suspect doing things like
> > encoding the initial 2MB OVMF image as a single hugepage might yield
> > enough benefit to explore this (at some point). So there's some niceness
> > in knowing that Ackerley's approach would allow for that eventually and
> > not require a complete rethink on these same topics.
> >
> > Thanks,
> >
> > Mike
> >
> > >
> > > >>>
> > > >>> [...snip...]
> > > >>>
^ permalink raw reply
* [PATCH v2 0/5] Disregard patch series - docs: pt_BR: Complete PGP maintainer guide
From: Daniel Pereira @ 2026-04-07 22:26 UTC (permalink / raw)
To: Jonathan Corbet, linux-doc
Hi Jonathan,
To keep my recent contribution more organized, please disregard the
patch series I sent a few days ago: [PATCH v2 0/5] docs: pt_BR:
Complete PGP maintainer guide translation. I will be sending a single,
consolidated patch in a few days instead.
Thanks!
^ permalink raw reply
* Re: [PATCH v9 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: Pawan Gupta @ 2026-04-07 22:27 UTC (permalink / raw)
To: Jim Mattson
Cc: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
Paolo Bonzini, Jonathan Corbet, linux-kernel, kvm, Asit Mallick,
Tao Zhang, bpf, netdev, linux-doc, chao.gao
In-Reply-To: <CALMp9eTK0o7Z7-oTB8ohvmoh-vy-Y2qjdUvbqD6HaEhOEPZmhw@mail.gmail.com>
On Tue, Apr 07, 2026 at 01:53:24PM -0700, Jim Mattson wrote:
> On Tue, Apr 7, 2026 at 12:11 PM Pawan Gupta
> <pawan.kumar.gupta@linux.intel.com> wrote:
> >
> > On Tue, Apr 07, 2026 at 11:40:57AM -0700, Jim Mattson wrote:
> > > My proposal is as follows:
> > >
> > > 1. The (advanced) hypervisor can advertise to the guest (via CPUID bit
> > > or MSR bit) that the short BHB clearing sequence is adequate. This may
> > > mean either that the VM will only be hosted on pre-Alder Lake hardware
> > > or that the hypervisor will set BHI_DIS_S behind the back of the
> > > guest. Presumably, this bit would not be reported if BHI_CTRL is
> > > advertised to the guest.
> > > 2. If the guest sees this bit, then it can use the short sequence. If
> > > it doesn't see this bit, it must use the long sequence.
> >
> > That's a good middle ground. Let me check with folks internally what they
> > think about defining a new software-only bit.
> >
> > Third case, for a guest that doesn't want BHI_DIS_S, userspace should be
> > allowed to override setting BHI_DIS_S. Then this proposed bit can indicate
> > that long sequence is required.
>
> That case can be handled by the paravirtual mitigation MSRs, right?
Yes. But, that was the part that received the most pushback.
^ permalink raw reply
* Re: [PATCH v2 0/5] Disregard patch series - docs: pt_BR: Complete PGP maintainer guide
From: Jonathan Corbet @ 2026-04-07 22:42 UTC (permalink / raw)
To: Daniel Pereira, linux-doc
In-Reply-To: <CAMAsx6dF=FH7+DHK5+s6z8dc197ATn-seG7ZYYeUbcRb8xsuqg@mail.gmail.com>
Daniel Pereira <danielmaraboo@gmail.com> writes:
> Hi Jonathan,
>
> To keep my recent contribution more organized, please disregard the
> patch series I sent a few days ago: [PATCH v2 0/5] docs: pt_BR:
> Complete PGP maintainer guide translation. I will be sending a single,
> consolidated patch in a few days instead.
OK ... bear in mind that we're heading into the merge window, so this
series will be 7.2 material regardless.
Thanks,
jon
^ permalink raw reply
* Re: [PATCH v9 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: Jim Mattson @ 2026-04-07 23:27 UTC (permalink / raw)
To: Pawan Gupta
Cc: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
Paolo Bonzini, Jonathan Corbet, linux-kernel, kvm, Asit Mallick,
Tao Zhang, bpf, netdev, linux-doc, chao.gao
In-Reply-To: <20260407222738.lrartp6evfp7yhti@desk>
On Tue, Apr 7, 2026 at 3:27 PM Pawan Gupta
<pawan.kumar.gupta@linux.intel.com> wrote:
>
> On Tue, Apr 07, 2026 at 01:53:24PM -0700, Jim Mattson wrote:
> > On Tue, Apr 7, 2026 at 12:11 PM Pawan Gupta
> > <pawan.kumar.gupta@linux.intel.com> wrote:
> > >
> > > On Tue, Apr 07, 2026 at 11:40:57AM -0700, Jim Mattson wrote:
> > > > My proposal is as follows:
> > > >
> > > > 1. The (advanced) hypervisor can advertise to the guest (via CPUID bit
> > > > or MSR bit) that the short BHB clearing sequence is adequate. This may
> > > > mean either that the VM will only be hosted on pre-Alder Lake hardware
> > > > or that the hypervisor will set BHI_DIS_S behind the back of the
> > > > guest. Presumably, this bit would not be reported if BHI_CTRL is
> > > > advertised to the guest.
> > > > 2. If the guest sees this bit, then it can use the short sequence. If
> > > > it doesn't see this bit, it must use the long sequence.
> > >
> > > Thats a good middle ground. Let me check with folks internally what they
> > > think about defining a new software-only bit.
> > >
> > > Third case, for a guest that doesn't want BHI_DIS_S, userspace should be
> > > allowed to override setting BHI_DIS_S. Then this proposed bit can indicate
> > > that long sequence is required.
> >
> > That case can be handled by the paravirtual mitigation MSRs, right?
>
> Yes. But, that was the part that received the most pushback.
What is your proposed BHI_DIS_S override mechanism, then?
^ permalink raw reply
* Re: [PATCH v9 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: Dave Hansen @ 2026-04-07 23:41 UTC (permalink / raw)
To: Jim Mattson, Pawan Gupta
Cc: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
Paolo Bonzini, Jonathan Corbet, linux-kernel, kvm, Asit Mallick,
Tao Zhang, bpf, netdev, linux-doc, chao.gao
In-Reply-To: <CALMp9eQjSqwnvJz4JVzYpMkkTiucSJtW48zC4Hj9GBiUhOH-Eg@mail.gmail.com>
On 4/7/26 16:27, Jim Mattson wrote:
> What is your proposed BHI_DIS_S override mechanism, then?
Let me make sure I get this right. The desire is to:
1. Have hypervisors lie to guests about the CPU they are running on (for
the benefit of large/diverse migration pools)
2. Have guests be allowed to boot with BHI_DIS_S for performance
3. Have apps in those guests that care about security to opt back in to
BHI_DIS_S for themselves?
^ permalink raw reply
* Re: [PATCH] KVM: x86: nSVM: Redirect IA32_PAT accesses to either hPAT or gPAT
From: Sean Christopherson @ 2026-04-07 23:52 UTC (permalink / raw)
To: Jim Mattson
Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
kvm, linux-doc, linux-kernel, Yosry Ahmed
In-Reply-To: <CALMp9eS+XiJ=u2618Hke9ePyzeTuChU=dLt+e=x2nXLTMVH5mw@mail.gmail.com>
On Tue, Apr 07, 2026, Jim Mattson wrote:
> On Tue, Apr 7, 2026 at 12:24 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Apr 07, 2026, Jim Mattson wrote:
> > > When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled and the vCPU is in
> > > guest mode with nested NPT enabled, guest accesses to IA32_PAT are
> > > redirected to the gPAT register, which is stored in VMCB02's g_pat field.
> > >
> > > Non-guest accesses (e.g. from userspace) to IA32_PAT are always redirected
> > > to hPAT, which is stored in vcpu->arch.pat.
> > >
> > > Directing host-initiated accesses to hPAT ensures that KVM_GET/SET_MSRS and
> > > KVM_GET/SET_NESTED_STATE are independent of each other and can be ordered
> > > arbitrarily during save and restore. gPAT is saved and restored separately
> > > via KVM_GET/SET_NESTED_STATE.
> > >
> > > Use WARN_ON_ONCE to flag any host-initiated accesses originating from KVM
> > > itself rather than userspace.
> > >
> > > Use pr_warn_once to flag any use of the common MSR-handling code (now
> > > shared by VMX and TDX) for IA32_PAT by a vCPU that is SVM-capable.
> >
> > Changelog is stale, but otherwise this LGTM. I'll fixup the changelog when
> > applying (in a few weeks).
>
> Oh, crud. This was supposed to be 5/8, but I made some changes after
> checkpatch.pl complained and then tried to just regenerate this one,
> but I totally flubbed it.
Huh. The patch shows up when I grab the thread via b4 mbox and open it with mutt,
but b4 am skips it. I'm guessing there's version-based filtering somewhere in b4.
No need for a v9 on my account, I can splice in 5/8 when applying.
^ permalink raw reply
* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Sean Christopherson @ 2026-04-08 0:30 UTC (permalink / raw)
To: Ackerley Tng
Cc: Michael Roth, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, Paolo Bonzini, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <CAEvNRgEtigp7+PVDkyu_DH947CUqDt312d+P+hWjjd2fHONiag@mail.gmail.com>
On Fri, Apr 03, 2026, Ackerley Tng wrote:
> Currently, in TDX's populate flow, KVM doesn't do any copying, it only
> instructs TDX to do the copying.
I disagree with this statement. For all intents and purposes, the TDX-Module is
firmware. If Intel had elected to implement TDX via XuCode, and presented it to
software as ISA (see SGX), then under the hood "firmware" would still be doing the
actual copy, but KVM would be executing some form of "copy" instruction.
Saying "KVM doesn't do any copying" is (very loosely) analogous to saying that
KVM doesn't copy anything when it does REP MOVSQ. It wasn't me your honor, Intel's
string engine did it!
I don't think it changes anything in practice, but I don't want to treat TDX
SEAMCALLs (or SNP PSP commands) as something completely different than what we
usually think of as "hardware".
^ permalink raw reply
* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Sean Christopherson @ 2026-04-08 0:33 UTC (permalink / raw)
To: Michael Roth
Cc: Vishal Annapurve, Ackerley Tng, aik, andrew.jones, binbin.wu,
brauner, chao.p.peng, david, ira.weiny, jmattson, jthoughton,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
forkloop, pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <nmy5polcxfnn3hoircsiqarxmzlulwwq7w34okanccntp32v56@h2eac44agovv>
On Tue, Apr 07, 2026, Michael Roth wrote:
> On Tue, Apr 07, 2026 at 02:50:58PM -0700, Vishal Annapurve wrote:
> > On Tue, Apr 7, 2026 at 2:09 PM Michael Roth <michael.roth@amd.com> wrote:
> > >
> > > > TLDR:
> > > >
> > > > + Think of populate ioctls not as KVM touching memory, but platform
> > > > handling population.
> > > > + KVM code (kvm_gmem_populate) still doesn't touch memory contents
> > > > + post_populate is platform-specific code that handles loading into
> > > > private destination memory just to support legacy non-in-place
> > > > conversion.
> > > > + Don't complicate populate ioctls by doing conversion just to support
> > > > legacy use-cases where platform-specific code has to do copying on
> > > > the host.
> > >
> > > That's a good point: these are only considerations in the context of
> > > actually copying from src->dst, but with in-place conversion the
> > > primary/more-performant approach will be for userspace to initial
> > > directly. I.e. if we enforced that, then gmem could right ascertain that
> > > it isn't even writing to private pages via these hooks and any
> > > manipulation of that memory is purely on the part of the trusted entity
> > > handling initial encryption/etc.
> > >
> > > I understand that we decided to keep the option of allowing separate
> > > src/dst even with in-place conversion, but it doesn't seem worthwhile if
> > > that necessarily means we need to glue population+conversion together in
> > > 1 clumsy interface that needs to handle partial return/error responses to
> > > userspace (or potentially get stuck forever in the conversion path).
> >
> > I think ARM needs userspace to specify separate source and destination
> > memory ranges for initial population as ARM doesn't support in-place
> > memory encryption. [1]
> >
> > [1] https://lore.kernel.org/kvm/20260318155413.793430-25-steven.price@arm.com/
> >
> > >
> > > So I agree with Ackerley's proposal (which I guess is the same as what's
> > > in this series).
> > >
> > > However, 1 other alternative would be to do what was suggested on the
> > > call, but require userspace to subsequently handle the shared->private
> > > conversion. I think that would be workable too.
> >
> > IIUC, Converting memory ranges to private after it essentially is
> > treated as private by the KVM CC backend will expose the
> > implementation to the same risk of userspace being able to access
> > private memory and compromise host safety which guest_memfd was
> > invented to address.
>
> Doh, fair point. Doing conversion as part of the populate call would allow
> us to use the filemap write-lock to avoid userspace being able to fault
> in private (as tracked by trusted entity) pages before they are
> transitioned to private (as tracked by KVM), so it's safer than having
> userspace drive it.
>
> But obviously I still think Ackerley's original proposal has more
> upsides than the alternatives mentioned so far.
I'm a bit lost. What exactly is/was Ackerley's original proposal? If the answer
is "convert pages from shared=>private when populating via in-place conversion",
then I agree, because AFAICT, that's the only sane option.
^ permalink raw reply
* Re: [PATCH 14/24] nfsd: add tracepoint to dir_event handler
From: Steven Rostedt @ 2026-04-08 0:41 UTC (permalink / raw)
To: Jeff Layton
Cc: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, NeilBrown, Olga Kornievskaia,
Dai Ngo, Tom Talpey, Trond Myklebust, Anna Schumaker,
Amir Goldstein, Calum Mackay, linux-fsdevel, linux-kernel,
linux-trace-kernel, linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-14-aaf68c478abd@kernel.org>
On Tue, 07 Apr 2026 09:21:27 -0400
Jeff Layton <jlayton@kernel.org> wrote:
> Add some extra visibility around the fsnotify handlers.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
> fs/nfsd/nfs4state.c | 2 ++
> fs/nfsd/trace.h | 20 ++++++++++++++++++++
> 2 files changed, 22 insertions(+)
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-- Steve
^ permalink raw reply
* Re: [PATCH v9 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: Jim Mattson @ 2026-04-08 0:47 UTC (permalink / raw)
To: Dave Hansen
Cc: Pawan Gupta, x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, David Kaplan, Sean Christopherson,
Borislav Petkov, Dave Hansen, Peter Zijlstra, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, KP Singh, Jiri Olsa,
David S. Miller, David Laight, Andy Lutomirski, Thomas Gleixner,
Ingo Molnar, David Ahern, Martin KaFai Lau, Eduard Zingerman,
Song Liu, Yonghong Song, John Fastabend, Stanislav Fomichev,
Hao Luo, Paolo Bonzini, Jonathan Corbet, linux-kernel, kvm,
Asit Mallick, Tao Zhang, bpf, netdev, linux-doc, chao.gao
In-Reply-To: <a605fb45-f8e3-45ad-8924-1da43b9a9e05@intel.com>
On Tue, Apr 7, 2026 at 4:41 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/7/26 16:27, Jim Mattson wrote:
> > What is your proposed BHI_DIS_S override mechanism, then?
>
> Let me make sure I get this right. The desire is to:
>
> 1. Have hypervisors lie to guests about the CPU they are running on (for
> the benefit of large/diverse migration pools)
> 2. Have guests be allowed to boot with BHI_DIS_S for performance
> 3. Have apps in those guests that care about security to opt back in to
> BHI_DIS_S for themselves?
I just want guests on heterogeneous migration pools to properly
protect themselves from native BHI when running on host kernels at
least as far back as Linux v6.6.
To that end, I would be satisfied with using the longer BHB clearing
sequence when HYPERVISOR is true and BHI_CTRL is false.
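A minimal sketch of that rule, for illustration only and not the actual
kernel code: the struct, the enum, and the function name are hypothetical,
and native_choice merely stands in for whatever the bare-metal selection
logic would otherwise pick.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical view of the CPU features a guest kernel can enumerate. */
struct cpu_view {
	bool hypervisor;	/* HYPERVISOR bit is set */
	bool bhi_ctrl;		/* BHI_CTRL is enumerated */
};

enum bhb_seq { BHB_SEQ_SHORT, BHB_SEQ_LONG };

/*
 * Pick the BHB clearing sequence. The native choice is overridden only
 * when running under a hypervisor that does not advertise BHI_CTRL.
 */
static enum bhb_seq pick_bhb_seq(const struct cpu_view *cpu,
				 enum bhb_seq native_choice)
{
	assert(cpu);
	if (cpu->hypervisor && !cpu->bhi_ctrl)
		return BHB_SEQ_LONG;
	return native_choice;
}
```

Under this rule a guest never relies on the short sequence unless it is
either on bare metal or can see BHI_CTRL.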
^ permalink raw reply
* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Babu Moger @ 2026-04-08 1:01 UTC (permalink / raw)
To: Reinette Chatre, corbet@lwn.net, tony.luck@intel.com,
Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
Cc: skhan@linuxfoundation.org, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, kas@kernel.org, rick.p.edgecombe@intel.com,
akpm@linux-foundation.org, pmladek@suse.com,
rdunlap@infradead.org, dapeng1.mi@linux.intel.com,
kees@kernel.org, elver@google.com, paulmck@kernel.org,
lirongqing@baidu.com, safinaskar@gmail.com, fvdl@google.com,
seanjc@google.com, pawan.kumar.gupta@linux.intel.com,
xin@zytor.com, tiala@microsoft.com, Neeraj.Upadhyay@amd.com,
chang.seok.bae@intel.com, Lendacky, Thomas,
elena.reshetova@intel.com, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
kvm@vger.kernel.org, eranian@google.com, peternewman@google.com
In-Reply-To: <3305c18e-9e50-4df0-b9f1-c61028628967@intel.com>
Hi Reinette,
On 4/7/26 12:48, Reinette Chatre wrote:
> Hi Babu,
>
> On 4/6/26 3:45 PM, Babu Moger wrote:
>> Hi Reinette,
>>
>> Sorry for the late response. I was trying to get confirmation about the use case.
>
> No problem. I appreciate that you did this so that we can make sure resctrl supports
> needed use cases.
>
>>
>> On 3/31/26 17:24, Reinette Chatre wrote:
>>> On 3/30/26 11:46 AM, Babu Moger wrote:
>>>> On 3/27/26 17:11, Reinette Chatre wrote:
>>>>> On 3/26/26 10:12 AM, Babu Moger wrote:
>>>>>> On 3/24/26 17:51, Reinette Chatre wrote:
>>>>>>> On 3/12/26 1:36 PM, Babu Moger wrote:
>
>>> can have domains that span different CPUs. There thus seem to be a built in assumption of what a "domain"
>>> means for PQR_PLZA_ASSOC so it sounds to me as though, instead of saying that "PQR_PLZA_ASSOC needs
>>> to be the same in QoS domain" it may be more accurate to, for example, say that "PQR_PLZA_ASSOC has L3 scope"?
>>
>> Yes.
>
> Above is about L3 scope ...
Yes. The scope for PQR_PLZA_ASSOC is L3.
Is that what you are asking here?
>
>>>
>>> This seems to be what this implementation does since it hardcodes PQR_PLZA_ASSOC scope to the L3
>>> resource but that creates dependency to the L3 resource that would make PLZA unusable if, for example,
>>> the user boots with "rdt=!l3cat" while wanting to use PLZA to manage MBA allocations when in kernel?
>>
>> Yes. that is correct. It should not be attached to one resource. We need to change it to global scope.
>
> Can I interpret "global scope" as "all online CPUs"? Doing so will simplify
Yes. That is correct.
> supporting this feature. It does not sound practical for a user wanting to assign
> different resource groups to kernel work done in different domains ... the guidance should
> instead be to just set the allocations of one resource group to what is needed in the different
> domains? There may be more flexibility when supporting per-domain RMIDs though but so far
> it sounds as though the focus is global. We can consider what needs to be done to support
> some type of "per-domain" assignment as exercise whether current interface could support it
> in the future.
Yes. Makes sense.
>
> ...
>
>>>> There are multiple ways this feature can be applied. For simplicity, the discussion below focuses only on CLOSID.
>>>>
>>>>
>>>> 1. Global PLZA enablement
>>>>
>>>> PLZA can be configured as a global feature by setting |PQR_PLZA_ASSOC.closid = CLOSID| and |PQR_PLZA_ASSOC.plza_en = 1| on all threads in the system. A dedicated CLOSID is reserved for this purpose,
>>>
>>> Also discussed during v1 is that there is no need to dedicate a CLOSID for this purpose.
>>> There could be an "unthrottled" CLOSID to which all high priority user space tasks as
>>> well as all kernel work of all tasks are assigned.
>>> If user space chooses to dedicate a CLOSID for kernel work then that should supported and
>>> interface can allow that, but there is no need for resctrl to enforce this.
>
> (above is comment about dedicated group - please see below)
>
>
>> Yes. I agree. The changes in context switch code is a concern.
>>
>> You covered some of the cases I was thinking(xx_set_individual).
>>
>> How about this idea?
>>
>> I suggest splitting the PLZA into two distinct aspects:
>>
>> 1. How PLZA is applied within a resource group
>>
>> 2. How PLZA is monitored
>
> I think I see where you are going here. While the "How PLZA is monitored" naming
> refers to "monitoring" I *think* what you are separating here is (a) how PLZA is configured
> (CLOSID and RMID settings) and (b) how that PLZA configuration is assigned to tasks/CPUs,
> not just within a resource group but across the system. Please see below.
>
>
>> Introduce a new file, "info/kmode_type", to describe how kmode applies in the system.
>
> ack. "in the system" as you have above, not "within a resource group" as mentioned
> before that.
>
>>
>> # cat info/kmode_type
>> [global] <- Kernel mode applies to the entire system (all CPUs/tasks)
>> cpus <- Kernel mode applies only to the CPUs in the group
>> tasks <- Kernel mode applies only to the tasks in the group
>>
>> The "global" option is the default right now and it is current common use-case.
>>
>> The "info/kmode_type -> cpus" option introduces new files
>> "kmode_cpus" and "kmode_cpus_list" for users to apply kmode to
>> specific set of CPUs. This lets users change the CPU set for PLZA.
> Where were you thinking about placing these files in the hierarchy?
It needs to be inside the resctrl group (in struct rdtgroup).
>
>> The PLZA MSR is updated when user changes the association to the
>> file. No context switch code changes are needed. This will be
>> dedicated group. The current resctrl group files, "cpus, cpus_list
>
> Why does this have to be a dedicated group? One of the conclusions from v1
> discussion was that the "PLZA group" need *not* be a dedicated group. I repeated that
> in my earlier response that I left quoted above. You did not respond to these
> conclusions and statements in this regard while you keep coming back to this
> needing to be a dedicated group without providing a motivation to do so.
> Could you please elaborate why a dedicated group is required?
If the same group applies identical limits to both user and kernel
space, it essentially behaves like a current resctrl group. In that
sense, it’s not really a PLZA group. PLZA’s key value is the ability to
separate allocations between user space and kernel space. A single CPU
can belong to two groups: one group manages the user-space allocation
for that CPU, while another manages the kernel-mode allocation.
This approach also simplifies file handling, which is another reason I
prefer it.
That said, I’m open to not having a dedicated group if we can still
support all the features that PLZA provides without it.
>
>
>> and tasks" will not be accessible in this mode. This option give
>
> These files can continue to be accessible.
ok.
>
>> some flexibility for the user without the context switch overhead.
>
> Dedicating a resource group to PLZA removes flexibility though, no?
Yes, but it makes it easy to handle the files, as I mentioned above.
>
>>
>> The "info/kmode_type -> tasks" option introduces a new file,
>> "kmode_tasks", for users to apply kmode to specific set of tasks.
>> This requires context switch changes. This will be dedicated group.
>> The current resctrl group files, "cpus, cpus_list and tasks" will
>> not be accessible in this mode. We currently have no use case for
>> this, so it will not be supported now.
>
> Thank you for confirming. This is a relief.
>
>>
>>
>> Add a file, "info/kmode_monitor", to describe how kmode is monitored.
>>
>> # cat info/kmode_monitor
>> [inherit_ctrl_and_mon] <- Kernel uses the same CLOSID/RMID as user. Default option for the "global"
>> assign_ctrl_inherit_mon <- One CLOSID for all kernel work; RMID inherited from user.
>> assign_ctrl_assign_mon <- One resource group (CLOSID+RMID) for all kernel work. Default option for "cpu" type.
>
> My first thought is that the naming is confusing. resctrl has a very strong relationship between
> "RMID" and "monitoring" so naming a file "monitor" that deals with allocation/ctrl/CLOSID is
> potentially confusion.
>
> Apart from that, while I think I understand where you are going by separating the mode into
> two files I am concerned about future complications needing to accommodate all different
> combinations of the (now) essentially two modes. My preference is thus to keep this simple by
> keeping the mode within one file.
>
> Even so, when stepping back, it does not really look like we need to separate the "global"
> and "per CPU" modes. We could just have a single "per CPU" mode and the "global" is just
> its default of "all CPUs", no?
Yes. That's correct.
>
> Consider, for example, the implementation just consisting of:
>
> # cat info/kernel_mode
> [inherit_ctrl_and_mon]
> global_assign_ctrl_inherit_mon_per_cpu
> global_assign_ctrl_assign_mon_per_cpu
>
>>
>> Rename “kernel_mode_assignment” to “kmode_group” to assign the specific group to kmode. This file usage is same as before.
>>
>> #cat info/kmode_groups (Renamed "kernel_mode_assignment")
>> //
>
> Please consider the intent of this file when thinking about names. The idea is that "info/kernel_mode"
> specifies the "mode" of how kernel work is handled and it determines the configuration files used in that
> mode as well as the syntax when interacting with those files. By renaming "kernel_mode_assignment" to
> "kmode_groups" it implicitly requires all future kernel mode enhancements to need some data related to "groups".
>
> In summary, I think this can be simplified by introducing just two new files in info/ that enables the
> user to (a) select and (b) configure the "kernel mode". To start there can be just two modes,
> global_assign_ctrl_inherit_mon_per_cpu and global_assign_ctrl_assign_mon_per_cpu.
> global_assign_ctrl_inherit_mon_per_cpu mode requires a control group in kernel_mode_assignment while
> global_assign_ctrl_assign_mon_per_cpu requires a control and monitoring group.
>
> The resource group in info/kernel_mode_assignment gets two additional files "kernel_mode_cpus" and
> "kernel_mode_cpus_list" that contains the CPUs enabled with the kernel mode configuration, by default
> it will be all online CPUs. The resource group can continue to be used to manage allocations of and
> monitor user space tasks. Specifically, the "cpus", "cpus_list", and "tasks" files remain.
>
> A user wanting just "global" settings will get just that when writing the group to
> info/kernel_mode_assignment. A user wanting "per CPU" settings can follow the
> info/kernel_mode_assignment setting with changes to that resource group's kernel_mode_cpus/kernel_mode_cpus_list
> files. Any task running on a CPU that is *not* in kernel_mode_cpus/kernel_mode_cpus_list can be
> expected to inherit both CLOSID and RMID from user space for all kernel work.
After further consideration, I don’t think the info/kernel_mode file is
necessary. There’s no need to enforce a specific mode for all the PLZA
groups. Avoiding this constraint makes the design more flexible,
particularly as we move toward supporting multiple PLZA groups in the
future. MPAM already appears capable of handling more than one group—for
example, one group could use inherit_ctrl_and_mon, while another could
use global_assign_ctrl_inherit_mon_per_cpu.
The mode can simply be determined on a per-group basis. We can introduce
two new files—kernel_mode_cpus and kernel_mode_cpus_list—within each
resctrl group when kmode (or PLZA) is supported.
The info/kernel_mode_assignment file would indicate which resctrl
group (or groups) is used for PLZA. The kernel_mode_cpus and
kernel_mode_cpus_list files would indicate how PLZA is applied within
each group.
Files and behavior:
- cpus / cpus_list:
CPUs listed here use the same allocation for both user and kernel space.
There is no change to the current semantics of these files.
If these files are empty, the group effectively becomes a PLZA-dedicated
group.
- kernel_mode_cpus / kernel_mode_cpus_list:
These files determine whether a separate kernel allocation is applied.
If empty, user and kernel share the same allocation.
If non-empty, the kernel uses a separate allocation.
The group can be a CTL_MON or a MON group. Based on the type of the
group, the CLOSID and RMID will be used to enable PLZA. If it is a MON
group, then rmid_en = 1 when writing the PLZA MSR.
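The file semantics above can be sketched as a small decision function.
This is only an illustration under assumptions: the struct layouts, the
64-bit CPU masks, and every name except rmid_en are hypothetical, and
leaving rmid_en clear for a CTL_MON group is my reading of the rule
rather than something stated explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum grp_type { GRP_CTL_MON, GRP_MON };	/* hypothetical names */

struct grp {
	enum grp_type type;
	uint64_t cpus;			/* "cpus": same allocation for user and kernel */
	uint64_t kernel_mode_cpus;	/* "kernel_mode_cpus": separate kernel allocation */
	uint32_t closid;
	uint32_t rmid;
};

/* Hypothetical summary of what would be programmed for one CPU. */
struct plza_cfg {
	bool plza_en;
	bool rmid_en;
	uint32_t closid;
	uint32_t rmid;
};

/* A group whose "cpus" mask is empty is effectively PLZA-dedicated. */
static bool grp_is_plza_dedicated(const struct grp *g)
{
	return g->cpus == 0;
}

static struct plza_cfg plza_cfg_for_cpu(const struct grp *g, unsigned int cpu)
{
	struct plza_cfg cfg = { 0 };

	assert(cpu < 64);
	/* CPU not listed: kernel work shares the user allocation. */
	if (!(g->kernel_mode_cpus & (UINT64_C(1) << cpu)))
		return cfg;

	cfg.plza_en = true;
	cfg.closid = g->closid;
	/* Only a MON group also supplies its own RMID. */
	if (g->type == GRP_MON) {
		cfg.rmid_en = true;
		cfg.rmid = g->rmid;
	}
	return cfg;
}
```

An empty kernel_mode_cpus mask yields a disabled configuration for every
CPU, which corresponds to user and kernel sharing the same allocation.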
Here’s the proposed flow:
# mount -t resctrl resctrl /sys/fs/resctrl/
# cd /sys/fs/resctrl/
# cat info/kernel_mode_assignment
//
By default, the root (default) group is PLZA-enabled when resctrl is
mounted. All CPUs use CLOSID 0 for both user and kernel-mode allocation.
# cat cpus_list
1-64
# cat kmode_cpus_list
1-64
Next, create a new group for PLZA:
# mkdir plza_group
# echo "plza_group//" > info/kernel_mode_assignment
At this point, plza_group becomes the new PLZA-enabled group, and the
PLZA-related MSRs are updated accordingly.
# cat plza_group/cpus_list
<empty>
# cat plza_group/kmode_cpus_list
1-64
The user can then update kmode_cpus_list to apply PLZA only to a
specific subset of CPUs, if desired.
What do you think of this approach?
Thanks
Babu
^ permalink raw reply