* [PATCH net-next V4 11/12] selftest: netdevsim: Add resource dump and scope filter test
From: Tariq Toukan @ 2026-04-01 18:49 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Chuck Lever, Matthieu Baerts (NGI0),
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Shahar Shitrit, Daniel Zahka, Jacob Keller, Cosmin Ratiu,
Parav Pandit, Shay Drori, Adithya Jayachandran, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman
In-Reply-To: <20260401184947.135205-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
Add resource_dump_test(), which verifies that resources can be dumped
for all devices and ports, that scope=dev returns only device-level
resources, and that scope=port returns only port-level resources.
Skip if userspace does not support the scope parameter.
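The test tells port entries from device entries purely by the shape of
the handle: the jq filters match keys against the pattern "/.+/", so a
key with two slashes and at least one character between them counts as a
port handle. A minimal C sketch of that same classification is shown
below; is_port_handle() is a hypothetical helper for illustration and is
not part of the selftest or of devlink.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: true when the handle looks like "bus/dev/port",
 * i.e. it contains a '/', at least one character, and another '/'.
 * This is the same condition as the regex "/.+/" matching anywhere.
 */
static bool is_port_handle(const char *handle)
{
	const char *first;

	assert(handle != NULL);

	first = strchr(handle, '/');
	if (!first)
		return false;

	/* Need one character after the first slash before the next one. */
	return first[1] != '\0' && strchr(first + 2, '/') != NULL;
}
```

With this rule, "netdevsim/netdevsim10/0" is classified as a port handle
and "netdevsim/netdevsim10" as a device handle.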
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../drivers/net/netdevsim/devlink.sh | 52 ++++++++++++++++++-
1 file changed, 51 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
index 2e63d02fae4b..8118cc211590 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
@@ -5,7 +5,7 @@ lib_dir=$(dirname $0)/../../../net/forwarding
ALL_TESTS="fw_flash_test params_test \
params_default_test regions_test reload_test \
- netns_reload_test resource_test \
+ netns_reload_test resource_test resource_dump_test \
port_resource_doit_test dev_info_test \
empty_reporter_test dummy_reporter_test rate_test"
NUM_NETIFS=0
@@ -483,6 +483,56 @@ resource_test()
log_test "resource test"
}
+resource_dump_test()
+{
+ RET=0
+
+ local port_jq
+ local dev_jq
+ local dl_jq
+ local count
+
+ dl_jq="with_entries(select(.key | startswith(\"$DL_HANDLE\")))"
+ port_jq="[.[] | $dl_jq | keys |"
+ port_jq+=" map(select(test(\"/.+/\"))) | length] | add"
+ dev_jq="[.[] | $dl_jq | keys |"
+ dev_jq+=" map(select(test(\"/.+/\")|not)) | length] | add"
+
+ if ! devlink resource help 2>&1 | grep -q "scope"; then
+ echo "SKIP: devlink resource show scope not supported"
+ return
+ fi
+
+ devlink resource show > /dev/null 2>&1
+ check_err $? "Failed to dump all resources"
+
+ count=$(cmd_jq "devlink resource show -j" "$port_jq")
+ [ "$count" -gt "0" ]
+ check_err $? "missing port resources in resource dump"
+
+ count=$(cmd_jq "devlink resource show -j" "$dev_jq")
+ [ "$count" -gt "0" ]
+ check_err $? "missing device resources in resource dump"
+
+ count=$(cmd_jq "devlink resource show scope dev -j" "$dev_jq")
+ [ "$count" -gt "0" ]
+ check_err $? "dev scope missing device resources"
+
+ count=$(cmd_jq "devlink resource show scope dev -j" "$port_jq")
+ [ "$count" -eq "0" ]
+ check_err $? "dev scope returned port resources"
+
+ count=$(cmd_jq "devlink resource show scope port -j" "$port_jq")
+ [ "$count" -gt "0" ]
+ check_err $? "port scope missing port resources"
+
+ count=$(cmd_jq "devlink resource show scope port -j" "$dev_jq")
+ [ "$count" -eq "0" ]
+ check_err $? "port scope returned device resources"
+
+ log_test "resource dump test"
+}
+
info_get()
{
local name=$1
--
2.44.0
* [PATCH net-next V4 10/12] devlink: Add resource scope filtering to resource dump
From: Tariq Toukan @ 2026-04-01 18:49 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Chuck Lever, Matthieu Baerts (NGI0),
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Shahar Shitrit, Daniel Zahka, Jacob Keller, Cosmin Ratiu,
Parav Pandit, Shay Drori, Adithya Jayachandran, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman
In-Reply-To: <20260401184947.135205-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
Allow filtering the resource dump to device-level or port-level
resources using the 'scope' option.
Example - dump only device-level resources:
$ devlink resource show scope dev
pci/0000:03:00.0:
name max_local_SFs size 128 unit entry dpipe_tables none
name max_external_SFs size 128 unit entry dpipe_tables none
pci/0000:03:00.1:
name max_local_SFs size 128 unit entry dpipe_tables none
name max_external_SFs size 128 unit entry dpipe_tables none
Example - dump only port-level resources:
$ devlink resource show scope port
pci/0000:03:00.0/196608:
name max_SFs size 128 unit entry dpipe_tables none
pci/0000:03:00.0/196609:
name max_SFs size 128 unit entry dpipe_tables none
pci/0000:03:00.1/196708:
name max_SFs size 128 unit entry dpipe_tables none
pci/0000:03:00.1/196709:
name max_SFs size 128 unit entry dpipe_tables none
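The scope attribute is a bitfield32, so the effective scope is the set
of bits present in both its value and its selector, and a missing
attribute means "all classes". The following is a user-space sketch of
that resolution step under simplified assumptions: the SCOPE_* names and
struct bitfield32 are hypothetical stand-ins for the UAPI definitions
and struct nla_bitfield32, and the filtered flag stands in for setting
NLM_F_DUMP_FILTERED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the UAPI scope bits added by the patch. */
enum { SCOPE_DEV_BIT, SCOPE_PORT_BIT, SCOPE_BIT_COUNT };
#define SCOPE_DEV		(1u << SCOPE_DEV_BIT)
#define SCOPE_PORT		(1u << SCOPE_PORT_BIT)
#define SCOPE_VALID_MASK	((1u << SCOPE_BIT_COUNT) - 1)

/* Same layout idea as struct nla_bitfield32: a value and a selector. */
struct bitfield32 {
	uint32_t value;
	uint32_t selector;
};

/*
 * Resolve the effective scope. A NULL attribute selects every class.
 * Otherwise only bits set in both value and selector are honoured.
 * *filtered reports whether the result is narrower than "everything".
 */
static uint32_t scope_resolve(const struct bitfield32 *attr, bool *filtered)
{
	uint32_t scope;

	assert(filtered != NULL);
	*filtered = false;

	if (!attr)
		return SCOPE_VALID_MASK;

	scope = attr->value & attr->selector;
	*filtered = scope != SCOPE_VALID_MASK;
	return scope;
}
```

A result of zero means nothing was selected; the patch rejects that case
with -EINVAL rather than returning an empty dump.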
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
Documentation/netlink/specs/devlink.yaml | 24 +++++++++++++++++-
include/uapi/linux/devlink.h | 17 +++++++++++++
net/devlink/netlink_gen.c | 5 ++--
net/devlink/resource.c | 32 ++++++++++++++++++++++--
4 files changed, 73 insertions(+), 5 deletions(-)
diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml
index 34aa81ba689e..b7d0490fc49d 100644
--- a/Documentation/netlink/specs/devlink.yaml
+++ b/Documentation/netlink/specs/devlink.yaml
@@ -157,6 +157,14 @@ definitions:
entries:
-
name: entry
+ -
+ type: enum
+ name: resource-scope
+ entries:
+ -
+ name: dev
+ -
+ name: port
-
type: enum
name: reload-action
@@ -873,6 +881,16 @@ attribute-sets:
doc: Unique devlink instance index.
checks:
max: u32-max
+ -
+ name: resource-scope-mask
+ type: bitfield32
+ enum: resource-scope
+ enum-as-flags: true
+ doc: |
+ Bitmask selecting which resource classes to include in a
+ resource-dump response. Bit 0 (dev) selects device-level
+ resources; bit 1 (port) selects port-level resources.
+ When absent all classes are returned.
-
name: dl-dev-stats
subset-of: devlink
@@ -1775,7 +1793,11 @@ operations:
- resource-list
dump:
request:
- attributes: *dev-id-attrs
+ attributes:
+ - bus-name
+ - dev-name
+ - index
+ - resource-scope-mask
reply: *resource-dump-reply
-
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 7de2d8cc862f..e0a0b523ce5c 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -645,6 +645,7 @@ enum devlink_attr {
DEVLINK_ATTR_PARAM_RESET_DEFAULT, /* flag */
DEVLINK_ATTR_INDEX, /* uint */
+ DEVLINK_ATTR_RESOURCE_SCOPE_MASK, /* bitfield32 */
/* Add new attributes above here, update the spec in
* Documentation/netlink/specs/devlink.yaml and re-generate
@@ -704,6 +705,22 @@ enum devlink_resource_unit {
DEVLINK_RESOURCE_UNIT_ENTRY,
};
+enum devlink_resource_scope {
+ DEVLINK_RESOURCE_SCOPE_DEV_BIT,
+ DEVLINK_RESOURCE_SCOPE_PORT_BIT,
+
+ __DEVLINK_RESOURCE_SCOPE_MAX_BIT,
+ DEVLINK_RESOURCE_SCOPE_MAX_BIT =
+ __DEVLINK_RESOURCE_SCOPE_MAX_BIT - 1
+};
+
+#define DEVLINK_RESOURCE_SCOPE_DEV \
+ _BITUL(DEVLINK_RESOURCE_SCOPE_DEV_BIT)
+#define DEVLINK_RESOURCE_SCOPE_PORT \
+ _BITUL(DEVLINK_RESOURCE_SCOPE_PORT_BIT)
+#define DEVLINK_RESOURCE_SCOPE_VALID_MASK \
+ (_BITUL(__DEVLINK_RESOURCE_SCOPE_MAX_BIT) - 1)
+
enum devlink_port_fn_attr_cap {
DEVLINK_PORT_FN_ATTR_CAP_ROCE_BIT,
DEVLINK_PORT_FN_ATTR_CAP_MIGRATABLE_BIT,
diff --git a/net/devlink/netlink_gen.c b/net/devlink/netlink_gen.c
index 9cc372d9ee41..6d4abd8b828d 100644
--- a/net/devlink/netlink_gen.c
+++ b/net/devlink/netlink_gen.c
@@ -313,10 +313,11 @@ static const struct nla_policy devlink_resource_dump_do_nl_policy[DEVLINK_ATTR_I
};
/* DEVLINK_CMD_RESOURCE_DUMP - dump */
-static const struct nla_policy devlink_resource_dump_dump_nl_policy[DEVLINK_ATTR_INDEX + 1] = {
+static const struct nla_policy devlink_resource_dump_dump_nl_policy[DEVLINK_ATTR_RESOURCE_SCOPE_MASK + 1] = {
[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_INDEX] = NLA_POLICY_FULL_RANGE(NLA_UINT, &devlink_attr_index_range),
+ [DEVLINK_ATTR_RESOURCE_SCOPE_MASK] = NLA_POLICY_BITFIELD32(3),
};
/* DEVLINK_CMD_RELOAD - do */
@@ -974,7 +975,7 @@ const struct genl_split_ops devlink_nl_ops[75] = {
.cmd = DEVLINK_CMD_RESOURCE_DUMP,
.dumpit = devlink_nl_resource_dump_dumpit,
.policy = devlink_resource_dump_dump_nl_policy,
- .maxattr = DEVLINK_ATTR_INDEX,
+ .maxattr = DEVLINK_ATTR_RESOURCE_SCOPE_MASK,
.flags = GENL_CMD_CAP_DUMP,
},
{
diff --git a/net/devlink/resource.c b/net/devlink/resource.c
index 0f1d90bc4b09..c22338b2571d 100644
--- a/net/devlink/resource.c
+++ b/net/devlink/resource.c
@@ -341,6 +341,22 @@ int devlink_nl_resource_dump_doit(struct sk_buff *skb, struct genl_info *info)
return devlink_resource_fill(info, DEVLINK_CMD_RESOURCE_DUMP, 0);
}
+static u32 devlink_resource_scope_get(struct nlattr **attrs, int *flags)
+{
+ struct nla_bitfield32 scope;
+ u32 value;
+
+ if (!attrs || !attrs[DEVLINK_ATTR_RESOURCE_SCOPE_MASK])
+ return DEVLINK_RESOURCE_SCOPE_VALID_MASK;
+
+ scope = nla_get_bitfield32(attrs[DEVLINK_ATTR_RESOURCE_SCOPE_MASK]);
+ value = scope.value & scope.selector;
+ if (value != DEVLINK_RESOURCE_SCOPE_VALID_MASK)
+ *flags |= NLM_F_DUMP_FILTERED;
+
+ return value;
+}
+
static int
devlink_resource_dump_fill_one(struct sk_buff *skb, struct devlink *devlink,
struct devlink_port *devlink_port,
@@ -400,16 +416,27 @@ devlink_nl_resource_dump_one(struct sk_buff *skb, struct devlink *devlink,
struct devlink_nl_dump_state *state = devlink_dump_state(cb);
struct devlink_port *devlink_port;
unsigned long port_idx;
+ u32 scope;
int err;
- if (!state->port_number) {
+ scope = devlink_resource_scope_get(genl_info_dump(cb)->attrs, &flags);
+ if (!scope) {
+ NL_SET_ERR_MSG_ATTR(genl_info_dump(cb)->extack,
+ genl_info_dump(cb)->attrs[DEVLINK_ATTR_RESOURCE_SCOPE_MASK],
+ "empty resource scope selection");
+ return -EINVAL;
+ }
+ if (!state->port_number && (scope & DEVLINK_RESOURCE_SCOPE_DEV)) {
err = devlink_resource_dump_fill_one(skb, devlink, NULL,
- cb, flags, &state->idx);
+ cb, flags,
+ &state->idx);
if (err)
return err;
state->idx = 0;
}
+ if (!(scope & DEVLINK_RESOURCE_SCOPE_PORT))
+ goto out;
xa_for_each_start(&devlink->ports, port_idx, devlink_port,
state->port_number ? state->port_number - 1 : 0) {
err = devlink_resource_dump_fill_one(skb, devlink, devlink_port,
@@ -420,6 +447,7 @@ devlink_nl_resource_dump_one(struct sk_buff *skb, struct devlink *devlink,
}
state->idx = 0;
}
+out:
state->port_number = 0;
return 0;
}
--
2.44.0
* [PATCH net-next V4 09/12] devlink: Document port-level resources and full dump
From: Tariq Toukan @ 2026-04-01 18:49 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Chuck Lever, Matthieu Baerts (NGI0),
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Shahar Shitrit, Daniel Zahka, Jacob Keller, Cosmin Ratiu,
Parav Pandit, Shay Drori, Adithya Jayachandran, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman, Jiri Pirko
In-Reply-To: <20260401184947.135205-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
Document the port-level resource support and the option to dump all
resources, including both device-level and port-level entries.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../networking/devlink/devlink-resource.rst | 35 +++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/Documentation/networking/devlink/devlink-resource.rst b/Documentation/networking/devlink/devlink-resource.rst
index 3d5ae51e65a2..9839c1661315 100644
--- a/Documentation/networking/devlink/devlink-resource.rst
+++ b/Documentation/networking/devlink/devlink-resource.rst
@@ -74,3 +74,38 @@ attribute, which represents the pending change in size. For example:
Note that changes in resource size may require a device reload to properly
take effect.
+
+Port-level Resources and Full Dump
+==================================
+
+In addition to device-level resources, ``devlink`` also supports port-level
+resources. These resources are associated with a specific devlink port rather
+than the device as a whole.
+
+To list resources for all devlink devices and ports:
+
+.. code:: shell
+
+ $ devlink resource show
+ pci/0000:03:00.0:
+ name max_local_SFs size 128 unit entry dpipe_tables none
+ name max_external_SFs size 128 unit entry dpipe_tables none
+ pci/0000:03:00.0/196608:
+ name max_SFs size 128 unit entry dpipe_tables none
+ pci/0000:03:00.0/196609:
+ name max_SFs size 128 unit entry dpipe_tables none
+ pci/0000:03:00.1:
+ name max_local_SFs size 128 unit entry dpipe_tables none
+ name max_external_SFs size 128 unit entry dpipe_tables none
+ pci/0000:03:00.1/196708:
+ name max_SFs size 128 unit entry dpipe_tables none
+ pci/0000:03:00.1/196709:
+ name max_SFs size 128 unit entry dpipe_tables none
+
+To show resources for a specific port:
+
+.. code:: shell
+
+ $ devlink resource show pci/0000:03:00.0/196608
+ pci/0000:03:00.0/196608:
+ name max_SFs size 128 unit entry dpipe_tables none
--
2.44.0
* [PATCH net-next V4 08/12] selftest: netdevsim: Add devlink port resource doit test
From: Tariq Toukan @ 2026-04-01 18:49 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Chuck Lever, Matthieu Baerts (NGI0),
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Shahar Shitrit, Daniel Zahka, Jacob Keller, Cosmin Ratiu,
Parav Pandit, Shay Drori, Adithya Jayachandran, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman
In-Reply-To: <20260401184947.135205-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
Add port_resource_doit_test(), which checks that querying a specific
port handle returns the expected resource name and size.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../drivers/net/netdevsim/devlink.sh | 29 ++++++++++++++++++-
1 file changed, 28 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
index 1b529ccaf050..2e63d02fae4b 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
@@ -5,7 +5,8 @@ lib_dir=$(dirname $0)/../../../net/forwarding
ALL_TESTS="fw_flash_test params_test \
params_default_test regions_test reload_test \
- netns_reload_test resource_test dev_info_test \
+ netns_reload_test resource_test \
+ port_resource_doit_test dev_info_test \
empty_reporter_test dummy_reporter_test rate_test"
NUM_NETIFS=0
source $lib_dir/lib.sh
@@ -768,6 +769,32 @@ rate_node_del()
devlink port function rate del $handle
}
+port_resource_doit_test()
+{
+ RET=0
+
+ local port_handle="${DL_HANDLE}/0"
+ local name
+ local size
+
+ if ! devlink resource help 2>&1 | grep -q "PORT_INDEX"; then
+ echo "SKIP: devlink resource show not supported for port selector"
+ return
+ fi
+
+ name=$(cmd_jq "devlink resource show $port_handle -j" \
+ '.[][][].name')
+ [ "$name" == "test_resource" ]
+ check_err $? "wrong port resource name (got $name)"
+
+ size=$(cmd_jq "devlink resource show $port_handle -j" \
+ '.[][][].size')
+ [ "$size" == "20" ]
+ check_err $? "wrong port resource size (got $size)"
+
+ log_test "port resource doit test"
+}
+
rate_test()
{
RET=0
--
2.44.0
* [PATCH net-next V4 07/12] devlink: Add port-specific option to resource dump doit
From: Tariq Toukan @ 2026-04-01 18:49 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Chuck Lever, Matthieu Baerts (NGI0),
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Shahar Shitrit, Daniel Zahka, Jacob Keller, Cosmin Ratiu,
Parav Pandit, Shay Drori, Adithya Jayachandran, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman
In-Reply-To: <20260401184947.135205-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
Allow querying devlink resources per-port via the resource-dump doit
handler. When a port-index attribute is provided, only that port's
resources are returned. When no port-index is given, only device-level
resources are returned, preserving backward compatibility.
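The selection rule described above can be sketched in plain C as
follows. This is only an illustration: struct res_list, struct device,
struct port and doit_select() are hypothetical stand-ins (a counter
replaces the kernel's list_head), while the two error codes follow the
ones used in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; a counter replaces the kernel list_head. */
struct res_list { size_t count; };
struct device { struct res_list resources; };
struct port { struct res_list resources; };

/*
 * Pick the list a doit request should report.
 * - port index given but no such port: -ENODEV
 * - selected list is empty:            -EOPNOTSUPP
 * - otherwise store the list in *out and return 0
 */
static int doit_select(const struct device *dev, const struct port *port,
		       bool port_index_given, const struct res_list **out)
{
	const struct res_list *list;

	assert(dev != NULL && out != NULL);

	if (port_index_given && !port)
		return -ENODEV;

	/* No port means the legacy, device-level behaviour. */
	list = port ? &port->resources : &dev->resources;
	if (list->count == 0)
		return -EOPNOTSUPP;

	*out = list;
	return 0;
}
```

Because a request without a port index still resolves to the device
list, existing userspace sees the same reply as before.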
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
Documentation/netlink/specs/devlink.yaml | 4 +++-
net/devlink/netlink_gen.c | 3 ++-
net/devlink/netlink_gen.h | 4 ++--
net/devlink/resource.c | 20 +++++++++++++++++---
4 files changed, 24 insertions(+), 7 deletions(-)
diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml
index c423e049c7bd..34aa81ba689e 100644
--- a/Documentation/netlink/specs/devlink.yaml
+++ b/Documentation/netlink/specs/devlink.yaml
@@ -1757,19 +1757,21 @@ operations:
attribute-set: devlink
dont-validate: [strict]
do:
- pre: devlink-nl-pre-doit
+ pre: devlink-nl-pre-doit-port-optional
post: devlink-nl-post-doit
request:
attributes:
- bus-name
- dev-name
- index
+ - port-index
reply: &resource-dump-reply
value: 36
attributes:
- bus-name
- dev-name
- index
+ - port-index
- resource-list
dump:
request:
diff --git a/net/devlink/netlink_gen.c b/net/devlink/netlink_gen.c
index a5a47a4c6de8..9cc372d9ee41 100644
--- a/net/devlink/netlink_gen.c
+++ b/net/devlink/netlink_gen.c
@@ -309,6 +309,7 @@ static const struct nla_policy devlink_resource_dump_do_nl_policy[DEVLINK_ATTR_I
[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_INDEX] = NLA_POLICY_FULL_RANGE(NLA_UINT, &devlink_attr_index_range),
+ [DEVLINK_ATTR_PORT_INDEX] = { .type = NLA_U32, },
};
/* DEVLINK_CMD_RESOURCE_DUMP - dump */
@@ -962,7 +963,7 @@ const struct genl_split_ops devlink_nl_ops[75] = {
{
.cmd = DEVLINK_CMD_RESOURCE_DUMP,
.validate = GENL_DONT_VALIDATE_STRICT,
- .pre_doit = devlink_nl_pre_doit,
+ .pre_doit = devlink_nl_pre_doit_port_optional,
.doit = devlink_nl_resource_dump_doit,
.post_doit = devlink_nl_post_doit,
.policy = devlink_resource_dump_do_nl_policy,
diff --git a/net/devlink/netlink_gen.h b/net/devlink/netlink_gen.h
index d79f6a0888f6..20034b0929a8 100644
--- a/net/devlink/netlink_gen.h
+++ b/net/devlink/netlink_gen.h
@@ -24,11 +24,11 @@ int devlink_nl_pre_doit(const struct genl_split_ops *ops, struct sk_buff *skb,
struct genl_info *info);
int devlink_nl_pre_doit_port(const struct genl_split_ops *ops,
struct sk_buff *skb, struct genl_info *info);
-int devlink_nl_pre_doit_dev_lock(const struct genl_split_ops *ops,
- struct sk_buff *skb, struct genl_info *info);
int devlink_nl_pre_doit_port_optional(const struct genl_split_ops *ops,
struct sk_buff *skb,
struct genl_info *info);
+int devlink_nl_pre_doit_dev_lock(const struct genl_split_ops *ops,
+ struct sk_buff *skb, struct genl_info *info);
void
devlink_nl_post_doit(const struct genl_split_ops *ops, struct sk_buff *skb,
struct genl_info *info);
diff --git a/net/devlink/resource.c b/net/devlink/resource.c
index 48f195063551..0f1d90bc4b09 100644
--- a/net/devlink/resource.c
+++ b/net/devlink/resource.c
@@ -251,8 +251,10 @@ static int devlink_resource_list_fill(struct sk_buff *skb,
static int devlink_resource_fill(struct genl_info *info,
enum devlink_command cmd, int flags)
{
+ struct devlink_port *devlink_port = info->user_ptr[1];
struct devlink *devlink = info->user_ptr[0];
struct devlink_resource *resource;
+ struct list_head *resource_list;
struct nlattr *resources_attr;
struct sk_buff *skb = NULL;
struct nlmsghdr *nlh;
@@ -261,7 +263,9 @@ static int devlink_resource_fill(struct genl_info *info,
int i;
int err;
- resource = list_first_entry(&devlink->resource_list,
+ resource_list = devlink_port ?
+ &devlink_port->resource_list : &devlink->resource_list;
+ resource = list_first_entry(resource_list,
struct devlink_resource, list);
start_again:
err = devlink_nl_msg_reply_and_new(&skb, info);
@@ -277,6 +281,9 @@ static int devlink_resource_fill(struct genl_info *info,
if (devlink_nl_put_handle(skb, devlink))
goto nla_put_failure;
+ if (devlink_port &&
+ nla_put_u32(skb, DEVLINK_ATTR_PORT_INDEX, devlink_port->index))
+ goto nla_put_failure;
resources_attr = nla_nest_start_noflag(skb,
DEVLINK_ATTR_RESOURCE_LIST);
@@ -285,7 +292,7 @@ static int devlink_resource_fill(struct genl_info *info,
incomplete = false;
i = 0;
- list_for_each_entry_from(resource, &devlink->resource_list, list) {
+ list_for_each_entry_from(resource, resource_list, list) {
err = devlink_resource_put(devlink, skb, resource);
if (err) {
if (!i)
@@ -319,9 +326,16 @@ static int devlink_resource_fill(struct genl_info *info,
int devlink_nl_resource_dump_doit(struct sk_buff *skb, struct genl_info *info)
{
+ struct devlink_port *devlink_port = info->user_ptr[1];
struct devlink *devlink = info->user_ptr[0];
+ struct list_head *resource_list;
+
+ if (info->attrs[DEVLINK_ATTR_PORT_INDEX] && !devlink_port)
+ return -ENODEV;
- if (list_empty(&devlink->resource_list))
+ resource_list = devlink_port ?
+ &devlink_port->resource_list : &devlink->resource_list;
+ if (list_empty(resource_list))
return -EOPNOTSUPP;
return devlink_resource_fill(info, DEVLINK_CMD_RESOURCE_DUMP, 0);
--
2.44.0
* [PATCH net-next V4 06/12] devlink: Include port resources in resource dump dumpit
From: Tariq Toukan @ 2026-04-01 18:49 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Chuck Lever, Matthieu Baerts (NGI0),
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Shahar Shitrit, Daniel Zahka, Jacob Keller, Cosmin Ratiu,
Parav Pandit, Shay Drori, Adithya Jayachandran, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman
In-Reply-To: <20260401184947.135205-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
Extend the resource-dump dumpit handler to include port-level resources.
The reply now contains each device's own resources followed by the
resources of each of its ports.
For example:
$ devlink resource show
pci/0000:03:00.0:
name local_max_SFs size 508 unit entry
name external_max_SFs size 508 unit entry
pci/0000:03:00.0/196608:
name max_SFs size 20 unit entry
pci/0000:03:00.1:
name local_max_SFs size 508 unit entry
name external_max_SFs size 508 unit entry
pci/0000:03:00.1/262144:
name max_SFs size 20 unit entry
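A dump may span several passes when the reply buffer fills up, so the
handler keeps a port_number cursor: zero means the device-level entry is
still to be emitted, and n means resume at port index n - 1. The sketch
below models that cursor in user space. It assumes densely numbered
ports and a fixed number of output slots per pass, whereas the kernel
walks an xarray and is bounded by skb space; struct dump_cursor and
dump_pass() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cursor mirroring the port_number dump-state field. */
struct dump_cursor {
	unsigned long port_number; /* 0: device first; n: resume at port n - 1 */
};

/*
 * Run one dump pass. Writes -1 for the device entry and the port index
 * for each port into out[], at most room entries. Returns the number of
 * entries written and sets *done once every entry has been emitted.
 */
static size_t dump_pass(struct dump_cursor *cur, unsigned long nports,
			long *out, size_t room, int *done)
{
	unsigned long port;
	size_t n = 0;

	assert(cur != NULL && out != NULL && done != NULL);
	*done = 0;

	if (!cur->port_number) {
		if (n == room)
			return n; /* cursor stays 0: retry the device entry */
		out[n++] = -1;
	}

	for (port = cur->port_number ? cur->port_number - 1 : 0;
	     port < nports; port++) {
		if (n == room) {
			/* Store index + 1 so that 0 keeps meaning "device". */
			cur->port_number = port + 1;
			return n;
		}
		out[n++] = (long)port;
	}

	cur->port_number = 0;
	*done = 1;
	return n;
}
```

Storing the port index plus one is what lets a single field distinguish
"nothing emitted yet" from "resume at port 0".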
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
net/devlink/devl_internal.h | 4 +++
net/devlink/resource.c | 53 +++++++++++++++++++++++++++++++------
2 files changed, 49 insertions(+), 8 deletions(-)
diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
index 7dfb7cdd2d23..000b8d271b90 100644
--- a/net/devlink/devl_internal.h
+++ b/net/devlink/devl_internal.h
@@ -164,6 +164,10 @@ struct devlink_nl_dump_state {
struct {
u64 dump_ts;
};
+ /* DEVLINK_CMD_RESOURCE_DUMP */
+ struct {
+ unsigned long port_number;
+ };
};
};
diff --git a/net/devlink/resource.c b/net/devlink/resource.c
index 02fb36e25c52..48f195063551 100644
--- a/net/devlink/resource.c
+++ b/net/devlink/resource.c
@@ -328,16 +328,20 @@ int devlink_nl_resource_dump_doit(struct sk_buff *skb, struct genl_info *info)
}
static int
-devlink_nl_resource_dump_one(struct sk_buff *skb, struct devlink *devlink,
- struct netlink_callback *cb, int flags)
+devlink_resource_dump_fill_one(struct sk_buff *skb, struct devlink *devlink,
+ struct devlink_port *devlink_port,
+ struct netlink_callback *cb, int flags, int *idx)
{
- struct devlink_nl_dump_state *state = devlink_dump_state(cb);
+ struct list_head *resource_list;
struct nlattr *resources_attr;
- int start_idx = state->idx;
+ int start_idx = *idx;
void *hdr;
int err;
- if (list_empty(&devlink->resource_list))
+ resource_list = devlink_port ?
+ &devlink_port->resource_list : &devlink->resource_list;
+
+ if (list_empty(resource_list))
return 0;
err = -EMSGSIZE;
@@ -348,15 +352,17 @@ devlink_nl_resource_dump_one(struct sk_buff *skb, struct devlink *devlink,
if (devlink_nl_put_handle(skb, devlink))
goto nla_put_failure;
+ if (devlink_port &&
+ nla_put_u32(skb, DEVLINK_ATTR_PORT_INDEX, devlink_port->index))
+ goto nla_put_failure;
resources_attr = nla_nest_start_noflag(skb, DEVLINK_ATTR_RESOURCE_LIST);
if (!resources_attr)
goto nla_put_failure;
- err = devlink_resource_list_fill(skb, devlink,
- &devlink->resource_list, &state->idx);
+ err = devlink_resource_list_fill(skb, devlink, resource_list, idx);
if (err) {
- if (state->idx == start_idx)
+ if (*idx == start_idx)
goto resource_list_cancel;
nla_nest_end(skb, resources_attr);
genlmsg_end(skb, hdr);
@@ -373,6 +379,37 @@ devlink_nl_resource_dump_one(struct sk_buff *skb, struct devlink *devlink,
return err;
}
+static int
+devlink_nl_resource_dump_one(struct sk_buff *skb, struct devlink *devlink,
+ struct netlink_callback *cb, int flags)
+{
+ struct devlink_nl_dump_state *state = devlink_dump_state(cb);
+ struct devlink_port *devlink_port;
+ unsigned long port_idx;
+ int err;
+
+ if (!state->port_number) {
+ err = devlink_resource_dump_fill_one(skb, devlink, NULL,
+ cb, flags, &state->idx);
+ if (err)
+ return err;
+ state->idx = 0;
+ }
+
+ xa_for_each_start(&devlink->ports, port_idx, devlink_port,
+ state->port_number ? state->port_number - 1 : 0) {
+ err = devlink_resource_dump_fill_one(skb, devlink, devlink_port,
+ cb, flags, &state->idx);
+ if (err) {
+ state->port_number = port_idx + 1;
+ return err;
+ }
+ state->idx = 0;
+ }
+ state->port_number = 0;
+ return 0;
+}
+
int devlink_nl_resource_dump_dumpit(struct sk_buff *skb,
struct netlink_callback *cb)
{
--
2.44.0
* [PATCH net-next V4 05/12] devlink: Add dump support for device-level resources
From: Tariq Toukan @ 2026-04-01 18:49 UTC
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Chuck Lever, Matthieu Baerts (NGI0),
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Shahar Shitrit, Daniel Zahka, Jacob Keller, Cosmin Ratiu,
Parav Pandit, Shay Drori, Adithya Jayachandran, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman, Jiri Pirko
In-Reply-To: <20260401184947.135205-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
Add a dumpit handler for the resource-dump command that iterates over
all devlink devices and shows their resources.
$ devlink resource show
pci/0000:08:00.0:
name local_max_SFs size 508 unit entry
name external_max_SFs size 508 unit entry
pci/0000:08:00.1:
name local_max_SFs size 508 unit entry
name external_max_SFs size 508 unit entry
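When one device's resources do not fit in a single reply buffer, the
list-fill helper added here records how far it got in an index and picks
up from there on the next pass. A simplified sketch of that resume logic
follows. It uses an array and a slot count instead of the kernel's
linked list and skb space, returns -1 where the kernel returns the
error from devlink_resource_put(), and list_fill() is a hypothetical
name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy items[*idx .. nitems) into out[], at most room entries.
 * On a short write, remember the first unwritten position in *idx and
 * return -1 so the caller retries with a fresh buffer.
 * On completion, reset *idx to 0 and return 0.
 */
static int list_fill(const int *items, size_t nitems, int *out, size_t room,
		     size_t *written, size_t *idx)
{
	size_t i;

	assert(items != NULL && out != NULL);
	assert(written != NULL && idx != NULL);

	*written = 0;
	for (i = *idx; i < nitems; i++) {
		if (*written == room) {
			*idx = i;
			return -1;
		}
		out[(*written)++] = items[i];
	}

	*idx = 0;
	return 0;
}
```

Resetting the index on completion means the next device (or port) list
always starts from its first entry.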
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
Documentation/netlink/specs/devlink.yaml | 6 +-
net/devlink/netlink_gen.c | 20 +++++-
net/devlink/netlink_gen.h | 4 +-
net/devlink/resource.c | 77 ++++++++++++++++++++++++
4 files changed, 102 insertions(+), 5 deletions(-)
diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml
index b495d56b9137..c423e049c7bd 100644
--- a/Documentation/netlink/specs/devlink.yaml
+++ b/Documentation/netlink/specs/devlink.yaml
@@ -1764,13 +1764,17 @@ operations:
- bus-name
- dev-name
- index
- reply:
+ reply: &resource-dump-reply
value: 36
attributes:
- bus-name
- dev-name
- index
- resource-list
+ dump:
+ request:
+ attributes: *dev-id-attrs
+ reply: *resource-dump-reply
-
name: reload
diff --git a/net/devlink/netlink_gen.c b/net/devlink/netlink_gen.c
index eb35e80e01d1..a5a47a4c6de8 100644
--- a/net/devlink/netlink_gen.c
+++ b/net/devlink/netlink_gen.c
@@ -305,7 +305,14 @@ static const struct nla_policy devlink_resource_set_nl_policy[DEVLINK_ATTR_INDEX
};
/* DEVLINK_CMD_RESOURCE_DUMP - do */
-static const struct nla_policy devlink_resource_dump_nl_policy[DEVLINK_ATTR_INDEX + 1] = {
+static const struct nla_policy devlink_resource_dump_do_nl_policy[DEVLINK_ATTR_INDEX + 1] = {
+ [DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
+ [DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
+ [DEVLINK_ATTR_INDEX] = NLA_POLICY_FULL_RANGE(NLA_UINT, &devlink_attr_index_range),
+};
+
+/* DEVLINK_CMD_RESOURCE_DUMP - dump */
+static const struct nla_policy devlink_resource_dump_dump_nl_policy[DEVLINK_ATTR_INDEX + 1] = {
[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_INDEX] = NLA_POLICY_FULL_RANGE(NLA_UINT, &devlink_attr_index_range),
@@ -680,7 +687,7 @@ static const struct nla_policy devlink_notify_filter_set_nl_policy[DEVLINK_ATTR_
};
/* Ops table for devlink */
-const struct genl_split_ops devlink_nl_ops[74] = {
+const struct genl_split_ops devlink_nl_ops[75] = {
{
.cmd = DEVLINK_CMD_GET,
.validate = GENL_DONT_VALIDATE_STRICT,
@@ -958,10 +965,17 @@ const struct genl_split_ops devlink_nl_ops[74] = {
.pre_doit = devlink_nl_pre_doit,
.doit = devlink_nl_resource_dump_doit,
.post_doit = devlink_nl_post_doit,
- .policy = devlink_resource_dump_nl_policy,
+ .policy = devlink_resource_dump_do_nl_policy,
.maxattr = DEVLINK_ATTR_INDEX,
.flags = GENL_CMD_CAP_DO,
},
+ {
+ .cmd = DEVLINK_CMD_RESOURCE_DUMP,
+ .dumpit = devlink_nl_resource_dump_dumpit,
+ .policy = devlink_resource_dump_dump_nl_policy,
+ .maxattr = DEVLINK_ATTR_INDEX,
+ .flags = GENL_CMD_CAP_DUMP,
+ },
{
.cmd = DEVLINK_CMD_RELOAD,
.validate = GENL_DONT_VALIDATE_STRICT,
diff --git a/net/devlink/netlink_gen.h b/net/devlink/netlink_gen.h
index 2817d53a0eba..d79f6a0888f6 100644
--- a/net/devlink/netlink_gen.h
+++ b/net/devlink/netlink_gen.h
@@ -18,7 +18,7 @@ extern const struct nla_policy devlink_dl_rate_tc_bws_nl_policy[DEVLINK_RATE_TC_
extern const struct nla_policy devlink_dl_selftest_id_nl_policy[DEVLINK_ATTR_SELFTEST_ID_FLASH + 1];
/* Ops table for devlink */
-extern const struct genl_split_ops devlink_nl_ops[74];
+extern const struct genl_split_ops devlink_nl_ops[75];
int devlink_nl_pre_doit(const struct genl_split_ops *ops, struct sk_buff *skb,
struct genl_info *info);
@@ -80,6 +80,8 @@ int devlink_nl_dpipe_table_counters_set_doit(struct sk_buff *skb,
struct genl_info *info);
int devlink_nl_resource_set_doit(struct sk_buff *skb, struct genl_info *info);
int devlink_nl_resource_dump_doit(struct sk_buff *skb, struct genl_info *info);
+int devlink_nl_resource_dump_dumpit(struct sk_buff *skb,
+ struct netlink_callback *cb);
int devlink_nl_reload_doit(struct sk_buff *skb, struct genl_info *info);
int devlink_nl_param_get_doit(struct sk_buff *skb, struct genl_info *info);
int devlink_nl_param_get_dumpit(struct sk_buff *skb,
diff --git a/net/devlink/resource.c b/net/devlink/resource.c
index f3014ec425c4..02fb36e25c52 100644
--- a/net/devlink/resource.c
+++ b/net/devlink/resource.c
@@ -223,6 +223,31 @@ static int devlink_resource_put(struct devlink *devlink, struct sk_buff *skb,
return -EMSGSIZE;
}
+static int devlink_resource_list_fill(struct sk_buff *skb,
+ struct devlink *devlink,
+ struct list_head *resource_list_head,
+ int *idx)
+{
+ struct devlink_resource *resource;
+ int i = 0;
+ int err;
+
+ list_for_each_entry(resource, resource_list_head, list) {
+ if (i < *idx) {
+ i++;
+ continue;
+ }
+ err = devlink_resource_put(devlink, skb, resource);
+ if (err) {
+ *idx = i;
+ return err;
+ }
+ i++;
+ }
+ *idx = 0;
+ return 0;
+}
+
static int devlink_resource_fill(struct genl_info *info,
enum devlink_command cmd, int flags)
{
@@ -302,6 +327,58 @@ int devlink_nl_resource_dump_doit(struct sk_buff *skb, struct genl_info *info)
return devlink_resource_fill(info, DEVLINK_CMD_RESOURCE_DUMP, 0);
}
+static int
+devlink_nl_resource_dump_one(struct sk_buff *skb, struct devlink *devlink,
+ struct netlink_callback *cb, int flags)
+{
+ struct devlink_nl_dump_state *state = devlink_dump_state(cb);
+ struct nlattr *resources_attr;
+ int start_idx = state->idx;
+ void *hdr;
+ int err;
+
+ if (list_empty(&devlink->resource_list))
+ return 0;
+
+ err = -EMSGSIZE;
+ hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
+ &devlink_nl_family, flags, DEVLINK_CMD_RESOURCE_DUMP);
+ if (!hdr)
+ return err;
+
+ if (devlink_nl_put_handle(skb, devlink))
+ goto nla_put_failure;
+
+ resources_attr = nla_nest_start_noflag(skb, DEVLINK_ATTR_RESOURCE_LIST);
+ if (!resources_attr)
+ goto nla_put_failure;
+
+ err = devlink_resource_list_fill(skb, devlink,
+ &devlink->resource_list, &state->idx);
+ if (err) {
+ if (state->idx == start_idx)
+ goto resource_list_cancel;
+ nla_nest_end(skb, resources_attr);
+ genlmsg_end(skb, hdr);
+ return err;
+ }
+ nla_nest_end(skb, resources_attr);
+ genlmsg_end(skb, hdr);
+ return 0;
+
+resource_list_cancel:
+ nla_nest_cancel(skb, resources_attr);
+nla_put_failure:
+ genlmsg_cancel(skb, hdr);
+ return err;
+}
+
+int devlink_nl_resource_dump_dumpit(struct sk_buff *skb,
+ struct netlink_callback *cb)
+{
+ return devlink_nl_dumpit(skb, cb, devlink_nl_resource_dump_one);
+}
+
int devlink_resources_validate(struct devlink *devlink,
struct devlink_resource *resource,
struct genl_info *info)
--
2.44.0
* [PATCH net-next V4 04/12] netdevsim: Add devlink port resource registration
From: Tariq Toukan @ 2026-04-01 18:49 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Chuck Lever, Matthieu Baerts (NGI0),
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Shahar Shitrit, Daniel Zahka, Jacob Keller, Cosmin Ratiu,
Parav Pandit, Shay Drori, Adithya Jayachandran, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman, Jiri Pirko
In-Reply-To: <20260401184947.135205-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
Register port-level resources for netdevsim ports to enable testing
of the port resource infrastructure.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
drivers/net/netdevsim/dev.c | 23 ++++++++++++++++++++++-
drivers/net/netdevsim/netdevsim.h | 4 ++++
2 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
index e82de0fd3157..1e06e781c835 100644
--- a/drivers/net/netdevsim/dev.c
+++ b/drivers/net/netdevsim/dev.c
@@ -1486,9 +1486,25 @@ static int __nsim_dev_port_add(struct nsim_dev *nsim_dev, enum nsim_dev_port_typ
if (err)
goto err_port_free;
+ if (nsim_dev_port_is_pf(nsim_dev_port)) {
+ u64 parent_id = DEVLINK_RESOURCE_ID_PARENT_TOP;
+ struct devlink_resource_size_params params = {
+ .size_max = 100,
+ .size_granularity = 1,
+ .unit = DEVLINK_RESOURCE_UNIT_ENTRY
+ };
+
+ err = devl_port_resource_register(devlink_port,
+ "test_resource", 20,
+ NSIM_PORT_RESOURCE_TEST,
+ parent_id, &params);
+ if (err)
+ goto err_dl_port_unregister;
+ }
+
err = nsim_dev_port_debugfs_init(nsim_dev, nsim_dev_port);
if (err)
- goto err_dl_port_unregister;
+ goto err_port_resource_unregister;
nsim_dev_port->ns = nsim_create(nsim_dev, nsim_dev_port, perm_addr);
if (IS_ERR(nsim_dev_port->ns)) {
@@ -1511,6 +1527,9 @@ static int __nsim_dev_port_add(struct nsim_dev *nsim_dev, enum nsim_dev_port_typ
nsim_destroy(nsim_dev_port->ns);
err_port_debugfs_exit:
nsim_dev_port_debugfs_exit(nsim_dev_port);
+err_port_resource_unregister:
+ if (nsim_dev_port_is_pf(nsim_dev_port))
+ devl_port_resources_unregister(devlink_port);
err_dl_port_unregister:
devl_port_unregister(devlink_port);
err_port_free:
@@ -1527,6 +1546,8 @@ static void __nsim_dev_port_del(struct nsim_dev_port *nsim_dev_port)
devl_rate_leaf_destroy(&nsim_dev_port->devlink_port);
nsim_destroy(nsim_dev_port->ns);
nsim_dev_port_debugfs_exit(nsim_dev_port);
+ if (nsim_dev_port_is_pf(nsim_dev_port))
+ devl_port_resources_unregister(devlink_port);
devl_port_unregister(devlink_port);
kfree(nsim_dev_port);
}
diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h
index c904e14f6b3f..c7de53706ec4 100644
--- a/drivers/net/netdevsim/netdevsim.h
+++ b/drivers/net/netdevsim/netdevsim.h
@@ -224,6 +224,10 @@ enum nsim_resource_id {
NSIM_RESOURCE_NEXTHOPS,
};
+enum nsim_port_resource_id {
+ NSIM_PORT_RESOURCE_TEST = 1,
+};
+
struct nsim_dev_health {
struct devlink_health_reporter *empty_reporter;
struct devlink_health_reporter *dummy_reporter;
--
2.44.0
* [PATCH net-next V4 03/12] net/mlx5: Register SF resource on PF port representor
From: Tariq Toukan @ 2026-04-01 18:49 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Chuck Lever, Matthieu Baerts (NGI0),
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Shahar Shitrit, Daniel Zahka, Jacob Keller, Cosmin Ratiu,
Parav Pandit, Shay Drori, Adithya Jayachandran, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman, Jiri Pirko
In-Reply-To: <20260401184947.135205-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
The device-level "resource show" displays max_local_SFs and
max_external_SFs without indicating which port each resource belongs
to. Users cannot determine the controller number and pfnum associated
with each SF pool.
Register max_SFs resource on the host PF representor port to expose
per-port SF limits. Users can correlate the port resource with the
controller number and pfnum shown in 'devlink port show'.
Future patches will introduce an ECPF that manages multiple PFs,
where each PF has its own SF pool.
Example usage:
$ devlink resource show pci/0000:03:00.0/196608
pci/0000:03:00.0/196608:
name max_SFs size 20 unit entry
$ devlink port show pci/0000:03:00.0/196608
pci/0000:03:00.0/196608: type eth netdev pf0hpf flavour pcipf
controller 1 pfnum 0 external true splittable false
function:
hw_addr b8:3f:d2:e1:8f:dc roce enable max_io_eqs 120
We can create up to 20 SFs over devlink port pci/0000:03:00.0/196608,
with pfnum 0 and controller 1.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/devlink.h | 4 ++
.../mellanox/mlx5/core/esw/devlink_port.c | 37 +++++++++++++++++++
2 files changed, 41 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.h b/drivers/net/ethernet/mellanox/mlx5/core/devlink.h
index 43b9bf8829cf..4fbb3926a3e5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.h
@@ -14,6 +14,10 @@ enum mlx5_devlink_resource_id {
MLX5_ID_RES_MAX = __MLX5_ID_RES_MAX - 1,
};
+enum mlx5_devlink_port_resource_id {
+ MLX5_DL_PORT_RES_MAX_SFS = 1,
+};
+
enum mlx5_devlink_param_id {
MLX5_DEVLINK_PARAM_ID_BASE = DEVLINK_PARAM_GENERIC_ID_MAX,
MLX5_DEVLINK_PARAM_ID_FLOW_STEERING_MODE,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
index 8bffba85f21f..e1d11326af1b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
@@ -3,6 +3,7 @@
#include <linux/mlx5/driver.h>
#include "eswitch.h"
+#include "devlink.h"
static void
mlx5_esw_get_port_parent_id(struct mlx5_core_dev *dev, struct netdev_phys_item_id *ppid)
@@ -158,6 +159,32 @@ static const struct devlink_port_ops mlx5_esw_dl_sf_port_ops = {
.port_fn_max_io_eqs_set = mlx5_devlink_port_fn_max_io_eqs_set,
};
+static int mlx5_esw_devlink_port_res_register(struct mlx5_eswitch *esw,
+ struct devlink_port *dl_port)
+{
+ struct devlink_resource_size_params size_params;
+ struct mlx5_core_dev *dev = esw->dev;
+ u16 max_sfs, sf_base_id;
+ int err;
+
+ err = mlx5_esw_sf_max_hpf_functions(dev, &max_sfs, &sf_base_id);
+ if (err)
+ return err;
+
+ devlink_resource_size_params_init(&size_params, max_sfs, max_sfs, 1,
+ DEVLINK_RESOURCE_UNIT_ENTRY);
+
+ return devl_port_resource_register(dl_port, "max_SFs", max_sfs,
+ MLX5_DL_PORT_RES_MAX_SFS,
+ DEVLINK_RESOURCE_ID_PARENT_TOP,
+ &size_params);
+}
+
+static void mlx5_esw_devlink_port_res_unregister(struct devlink_port *dl_port)
+{
+ devl_port_resources_unregister(dl_port);
+}
+
int mlx5_esw_offloads_devlink_port_register(struct mlx5_eswitch *esw, struct mlx5_vport *vport)
{
struct mlx5_core_dev *dev = esw->dev;
@@ -189,6 +216,15 @@ int mlx5_esw_offloads_devlink_port_register(struct mlx5_eswitch *esw, struct mlx
if (err)
goto rate_err;
+ if (vport_num == MLX5_VPORT_PF) {
+ err = mlx5_esw_devlink_port_res_register(esw,
+ &dl_port->dl_port);
+ if (err)
+ mlx5_core_dbg(dev,
+ "Failed to register port resources: %d\n",
+ err);
+ }
+
return 0;
rate_err:
@@ -203,6 +239,7 @@ void mlx5_esw_offloads_devlink_port_unregister(struct mlx5_vport *vport)
if (!vport->dl_port)
return;
dl_port = vport->dl_port;
+ mlx5_esw_devlink_port_res_unregister(&dl_port->dl_port);
mlx5_esw_qos_vport_update_parent(vport, NULL, NULL);
devl_rate_leaf_destroy(&dl_port->dl_port);
--
2.44.0
* [PATCH net-next V4 02/12] devlink: Add port-level resource registration infrastructure
From: Tariq Toukan @ 2026-04-01 18:49 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Chuck Lever, Matthieu Baerts (NGI0),
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Shahar Shitrit, Daniel Zahka, Jacob Keller, Cosmin Ratiu,
Parav Pandit, Shay Drori, Adithya Jayachandran, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman, Jiri Pirko
In-Reply-To: <20260401184947.135205-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
The current devlink resource infrastructure supports only device-level
resources. Some hardware resources are associated with specific ports
rather than the entire device, and there is currently no way to show
resources per port.
Add support for registering resources at the port level.
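A minimal userspace sketch of the consequence, assuming simplified
stand-in types: because each port carries its own resource list, the
duplicate-ID check is scoped to that port, and the list must be emptied
before the port is finalized. `struct res`, `struct port` and the
`port_*` helpers below are hypothetical names for illustration only,
not the devlink API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are not the kernel's structures. */
struct res {
	uint64_t id;
	uint64_t size;
	struct res *next;
};

struct port {
	struct res *resources;	/* per-port list, empty when NULL */
};

/* Register a resource on one port; IDs only need to be unique per port. */
static int port_res_register(struct port *p, uint64_t id, uint64_t size)
{
	struct res *r;

	assert(p);
	for (r = p->resources; r; r = r->next)
		if (r->id == id)
			return -EEXIST;

	r = malloc(sizeof(*r));
	if (!r)
		return -ENOMEM;
	r->id = id;
	r->size = size;
	r->next = p->resources;
	p->resources = r;
	return 0;
}

/* Free every resource on the port, leaving its list empty. */
static void port_res_unregister_all(struct port *p)
{
	while (p->resources) {
		struct res *r = p->resources;

		p->resources = r->next;
		free(r);
	}
}

/* Analogue of the list_empty() check added to devlink_port_fini(). */
static bool port_fini_is_clean(const struct port *p)
{
	return p->resources == NULL;
}
```

In this sketch, two ports can both register ID 1, a second registration
of ID 1 on the same port fails with -EEXIST, and
`port_fini_is_clean()` only holds after `port_res_unregister_all()`.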
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
include/net/devlink.h | 8 ++++++++
net/devlink/port.c | 2 ++
net/devlink/resource.c | 43 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 53 insertions(+)
diff --git a/include/net/devlink.h b/include/net/devlink.h
index f5439d050eb0..bcd31de1f890 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -129,6 +129,7 @@ struct devlink_rate {
struct devlink_port {
struct list_head list;
struct list_head region_list;
+ struct list_head resource_list;
struct devlink *devlink;
const struct devlink_port_ops *ops;
unsigned int index;
@@ -1891,6 +1892,13 @@ void devlink_resources_unregister(struct devlink *devlink);
int devl_resource_size_get(struct devlink *devlink,
u64 resource_id,
u64 *p_resource_size);
+int
+devl_port_resource_register(struct devlink_port *devlink_port,
+ const char *resource_name,
+ u64 resource_size, u64 resource_id,
+ u64 parent_resource_id,
+ const struct devlink_resource_size_params *params);
+void devl_port_resources_unregister(struct devlink_port *devlink_port);
int devl_dpipe_table_resource_set(struct devlink *devlink,
const char *table_name, u64 resource_id,
u64 resource_units);
diff --git a/net/devlink/port.c b/net/devlink/port.c
index 7fcd1d3ed44c..485029d43428 100644
--- a/net/devlink/port.c
+++ b/net/devlink/port.c
@@ -1025,6 +1025,7 @@ void devlink_port_init(struct devlink *devlink,
return;
devlink_port->devlink = devlink;
INIT_LIST_HEAD(&devlink_port->region_list);
+ INIT_LIST_HEAD(&devlink_port->resource_list);
devlink_port->initialized = true;
}
EXPORT_SYMBOL_GPL(devlink_port_init);
@@ -1042,6 +1043,7 @@ EXPORT_SYMBOL_GPL(devlink_port_init);
void devlink_port_fini(struct devlink_port *devlink_port)
{
WARN_ON(!list_empty(&devlink_port->region_list));
+ WARN_ON(!list_empty(&devlink_port->resource_list));
}
EXPORT_SYMBOL_GPL(devlink_port_fini);
diff --git a/net/devlink/resource.c b/net/devlink/resource.c
index ee169a467d48..f3014ec425c4 100644
--- a/net/devlink/resource.c
+++ b/net/devlink/resource.c
@@ -532,3 +532,46 @@ void devl_resource_occ_get_unregister(struct devlink *devlink,
resource->occ_get_priv = NULL;
}
EXPORT_SYMBOL_GPL(devl_resource_occ_get_unregister);
+
+/**
+ * devl_port_resource_register - devlink port resource register
+ *
+ * @devlink_port: devlink port
+ * @resource_name: resource's name
+ * @resource_size: resource's size
+ * @resource_id: resource's id
+ * @parent_resource_id: resource's parent id
+ * @params: size parameters
+ *
+ * Generic resources should reuse the same names across drivers.
+ * Please see the generic resources list at:
+ * Documentation/networking/devlink/devlink-resource.rst
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int
+devl_port_resource_register(struct devlink_port *devlink_port,
+ const char *resource_name,
+ u64 resource_size, u64 resource_id,
+ u64 parent_resource_id,
+ const struct devlink_resource_size_params *params)
+{
+ return __devl_resource_register(devlink_port->devlink,
+ &devlink_port->resource_list,
+ resource_name, resource_size,
+ resource_id, parent_resource_id,
+ params);
+}
+EXPORT_SYMBOL_GPL(devl_port_resource_register);
+
+/**
+ * devl_port_resources_unregister - unregister all devlink port resources
+ *
+ * @devlink_port: devlink port
+ */
+void devl_port_resources_unregister(struct devlink_port *devlink_port)
+{
+ __devl_resources_unregister(devlink_port->devlink,
+ &devlink_port->resource_list);
+}
+EXPORT_SYMBOL_GPL(devl_port_resources_unregister);
--
2.44.0
* [PATCH net-next V4 01/12] devlink: Refactor resource functions to be generic
From: Tariq Toukan @ 2026-04-01 18:49 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Chuck Lever, Matthieu Baerts (NGI0),
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Shahar Shitrit, Daniel Zahka, Jacob Keller, Cosmin Ratiu,
Parav Pandit, Shay Drori, Adithya Jayachandran, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman, Jiri Pirko
In-Reply-To: <20260401184947.135205-1-tariqt@nvidia.com>
From: Or Har-Toov <ohartoov@nvidia.com>
Currently the resource functions take a devlink pointer as a parameter
and take the resource list from it.
Allow the resource functions to work with other resource lists, which
will be added in the next patches, and not only with the devlink's
resource list.
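The shape of the refactor can be sketched in userspace as follows. This
is an illustration under simplified assumptions, not the kernel code:
`struct res`, `struct res_list`, `struct dev` and the `res_*` helpers
are hypothetical names, and the list is a plain singly linked list
rather than a `struct list_head`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are not the kernel's structures. */
struct res {
	uint64_t id;
	struct res *next;
};

struct res_list {
	struct res *first;
};

struct dev {
	struct res_list resources;
};

/* Generic lookup: searches whichever list the caller hands in. */
static struct res *res_find_in(struct res_list *head, uint64_t id)
{
	struct res *r;

	for (r = head->first; r; r = r->next)
		if (r->id == id)
			return r;
	return NULL;
}

/* Generic registration: rejects an ID already present in that list. */
static int res_register_in(struct res_list *head, struct res *r)
{
	assert(head && r);
	if (res_find_in(head, r->id))
		return -EEXIST;
	r->next = head->first;
	head->first = r;
	return 0;
}

/* Device-level wrapper: keeps the old signature, supplies the device list. */
static int dev_res_register(struct dev *d, struct res *r)
{
	return res_register_in(&d->resources, r);
}
```

Existing callers keep using the device-level wrapper unchanged, while a
later owner of a different list can call the generic helper with its
own list head.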
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
include/net/devlink.h | 2 +-
net/devlink/resource.c | 114 ++++++++++++++++++++++++++---------------
2 files changed, 73 insertions(+), 43 deletions(-)
diff --git a/include/net/devlink.h b/include/net/devlink.h
index 3038af6ec017..f5439d050eb0 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -1885,7 +1885,7 @@ int devl_resource_register(struct devlink *devlink,
u64 resource_size,
u64 resource_id,
u64 parent_resource_id,
- const struct devlink_resource_size_params *size_params);
+ const struct devlink_resource_size_params *params);
void devl_resources_unregister(struct devlink *devlink);
void devlink_resources_unregister(struct devlink *devlink);
int devl_resource_size_get(struct devlink *devlink,
diff --git a/net/devlink/resource.c b/net/devlink/resource.c
index 351835a710b1..ee169a467d48 100644
--- a/net/devlink/resource.c
+++ b/net/devlink/resource.c
@@ -36,15 +36,16 @@ struct devlink_resource {
};
static struct devlink_resource *
-devlink_resource_find(struct devlink *devlink,
- struct devlink_resource *resource, u64 resource_id)
+__devlink_resource_find(struct list_head *resource_list_head,
+ struct devlink_resource *resource,
+ u64 resource_id)
{
struct list_head *resource_list;
if (resource)
resource_list = &resource->resource_list;
else
- resource_list = &devlink->resource_list;
+ resource_list = resource_list_head;
list_for_each_entry(resource, resource_list, list) {
struct devlink_resource *child_resource;
@@ -52,14 +53,23 @@ devlink_resource_find(struct devlink *devlink,
if (resource->id == resource_id)
return resource;
- child_resource = devlink_resource_find(devlink, resource,
- resource_id);
+ child_resource = __devlink_resource_find(resource_list_head,
+ resource,
+ resource_id);
if (child_resource)
return child_resource;
}
return NULL;
}
+static struct devlink_resource *
+devlink_resource_find(struct devlink *devlink,
+ struct devlink_resource *resource, u64 resource_id)
+{
+ return __devlink_resource_find(&devlink->resource_list,
+ resource, resource_id);
+}
+
static void
devlink_resource_validate_children(struct devlink_resource *resource)
{
@@ -314,26 +324,12 @@ int devlink_resources_validate(struct devlink *devlink,
return err;
}
-/**
- * devl_resource_register - devlink resource register
- *
- * @devlink: devlink
- * @resource_name: resource's name
- * @resource_size: resource's size
- * @resource_id: resource's id
- * @parent_resource_id: resource's parent id
- * @size_params: size parameters
- *
- * Generic resources should reuse the same names across drivers.
- * Please see the generic resources list at:
- * Documentation/networking/devlink/devlink-resource.rst
- */
-int devl_resource_register(struct devlink *devlink,
- const char *resource_name,
- u64 resource_size,
- u64 resource_id,
- u64 parent_resource_id,
- const struct devlink_resource_size_params *size_params)
+static int
+__devl_resource_register(struct devlink *devlink,
+ struct list_head *resource_list_head,
+ const char *resource_name, u64 resource_size,
+ u64 resource_id, u64 parent_resource_id,
+ const struct devlink_resource_size_params *params)
{
struct devlink_resource *resource;
struct list_head *resource_list;
@@ -343,7 +339,8 @@ int devl_resource_register(struct devlink *devlink,
top_hierarchy = parent_resource_id == DEVLINK_RESOURCE_ID_PARENT_TOP;
- resource = devlink_resource_find(devlink, NULL, resource_id);
+ resource = __devlink_resource_find(resource_list_head, NULL,
+ resource_id);
if (resource)
return -EEXIST;
@@ -352,12 +349,13 @@ int devl_resource_register(struct devlink *devlink,
return -ENOMEM;
if (top_hierarchy) {
- resource_list = &devlink->resource_list;
+ resource_list = resource_list_head;
} else {
struct devlink_resource *parent_resource;
- parent_resource = devlink_resource_find(devlink, NULL,
- parent_resource_id);
+ parent_resource = __devlink_resource_find(resource_list_head,
+ NULL,
+ parent_resource_id);
if (parent_resource) {
resource_list = &parent_resource->resource_list;
resource->parent = parent_resource;
@@ -372,46 +370,78 @@ int devl_resource_register(struct devlink *devlink,
resource->size_new = resource_size;
resource->id = resource_id;
resource->size_valid = true;
- memcpy(&resource->size_params, size_params,
- sizeof(resource->size_params));
+ memcpy(&resource->size_params, params, sizeof(resource->size_params));
INIT_LIST_HEAD(&resource->resource_list);
list_add_tail(&resource->list, resource_list);
return 0;
}
+
+/**
+ * devl_resource_register - devlink resource register
+ *
+ * @devlink: devlink
+ * @resource_name: resource's name
+ * @resource_size: resource's size
+ * @resource_id: resource's id
+ * @parent_resource_id: resource's parent id
+ * @params: size parameters
+ *
+ * Generic resources should reuse the same names across drivers.
+ * Please see the generic resources list at:
+ * Documentation/networking/devlink/devlink-resource.rst
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int devl_resource_register(struct devlink *devlink, const char *resource_name,
+ u64 resource_size, u64 resource_id,
+ u64 parent_resource_id,
+ const struct devlink_resource_size_params *params)
+{
+ return __devl_resource_register(devlink, &devlink->resource_list,
+ resource_name, resource_size,
+ resource_id, parent_resource_id,
+ params);
+}
EXPORT_SYMBOL_GPL(devl_resource_register);
-static void devlink_resource_unregister(struct devlink *devlink,
- struct devlink_resource *resource)
+static void devlink_resource_unregister(struct devlink_resource *resource)
{
struct devlink_resource *tmp, *child_resource;
list_for_each_entry_safe(child_resource, tmp, &resource->resource_list,
list) {
- devlink_resource_unregister(devlink, child_resource);
+ devlink_resource_unregister(child_resource);
list_del(&child_resource->list);
kfree(child_resource);
}
}
-/**
- * devl_resources_unregister - free all resources
- *
- * @devlink: devlink
- */
-void devl_resources_unregister(struct devlink *devlink)
+static void
+__devl_resources_unregister(struct devlink *devlink,
+ struct list_head *resource_list_head)
{
struct devlink_resource *tmp, *child_resource;
lockdep_assert_held(&devlink->lock);
- list_for_each_entry_safe(child_resource, tmp, &devlink->resource_list,
+ list_for_each_entry_safe(child_resource, tmp, resource_list_head,
list) {
- devlink_resource_unregister(devlink, child_resource);
+ devlink_resource_unregister(child_resource);
list_del(&child_resource->list);
kfree(child_resource);
}
}
+
+/**
+ * devl_resources_unregister - free all resources
+ *
+ * @devlink: devlink
+ */
+void devl_resources_unregister(struct devlink *devlink)
+{
+ __devl_resources_unregister(devlink, &devlink->resource_list);
+}
EXPORT_SYMBOL_GPL(devl_resources_unregister);
/**
--
2.44.0
* [PATCH net-next V4 00/12] devlink: add per-port resource support
From: Tariq Toukan @ 2026-04-01 18:49 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Simon Horman, Donald Hunter, Jiri Pirko, Jonathan Corbet,
Shuah Khan, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Shuah Khan, Chuck Lever, Matthieu Baerts (NGI0),
Carolina Jubran, Or Har-Toov, Moshe Shemesh, Dragos Tatulea,
Shahar Shitrit, Daniel Zahka, Jacob Keller, Cosmin Ratiu,
Parav Pandit, Shay Drori, Adithya Jayachandran, Kees Cook,
Daniel Jurgens, netdev, linux-kernel, linux-doc, linux-rdma,
linux-kselftest, Gal Pressman
Hi,
This series by Or adds devlink per-port resource support.
See detailed description by Or below [1].
Regards,
Tariq
[1]
Currently, devlink resources are only available at the device level.
However, some resources are inherently per-port, such as the maximum
number of subfunctions (SFs) that can be created on a specific PF port.
This limitation prevents user space from obtaining accurate per-port
capacity information.
This series adds infrastructure for per-port resources in devlink core
and implements it in the mlx5 driver to expose the max_SFs resource
on PF devlink ports.
Patch #1 refactors resource functions to be generic
Patch #2 adds port-level resource registration infrastructure
Patch #3 registers SF resource on PF port representor in mlx5
Patch #4 adds devlink port resource registration to netdevsim for testing
Patch #5 adds dump support for device-level resources
Patch #6 includes port resources in the resource dump dumpit path
Patch #7 adds port-specific option to resource dump doit path
Patch #8 adds selftest for devlink port resource doit
Patch #9 documents port-level resources and full dump
Patch #10 adds resource scope filtering to resource dump
Patch #11 adds selftest for resource dump and scope filter
Patch #12 documents resource scope filtering
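The dump path from patch #5 fills a message until it runs out of room
and resumes from a saved index on the next call. A minimal userspace
sketch of that fill loop, assuming a fixed-capacity buffer in place of
the skb and an array in place of the resource list (`struct buf`,
`buf_put()` and `list_fill()` are hypothetical names):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define BUF_MAX 8

/* Hypothetical stand-in for a message buffer with limited room. */
struct buf {
	int items[BUF_MAX];
	size_t used;
	size_t cap;	/* must not exceed BUF_MAX */
};

/* Append one item, or report that the buffer is full. */
static int buf_put(struct buf *b, int item)
{
	assert(b->cap <= BUF_MAX);
	if (b->used >= b->cap)
		return -EMSGSIZE;
	b->items[b->used++] = item;
	return 0;
}

/*
 * Emit src[*idx .. n) into b. If the buffer fills up, store the index
 * of the first entry that was not emitted in *idx and return the error,
 * so the next call resumes there. On completion, reset *idx to 0.
 */
static int list_fill(struct buf *b, const int *src, size_t n, size_t *idx)
{
	size_t i;
	int err;

	for (i = *idx; i < n; i++) {
		err = buf_put(b, src[i]);
		if (err) {
			*idx = i;
			return err;
		}
	}
	*idx = 0;
	return 0;
}
```

A caller would hand in a fresh buffer and the same `idx` on each round
until `list_fill()` returns 0. If a call returns an error with `idx`
unchanged, nothing was emitted in that round, which is the case the
kernel code handles by cancelling the nest.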
Userspace patches for iproute2:
https://github.com/ohartoov/iproute2/tree/port_resources
V4:
- Link to V3:
https://lore.kernel.org/all/20260226221916.1800227-1-tariqt@nvidia.com/
- Remove new devlink port resource command
- Add devlink resource show for all resources (devices + ports) via
dumpit
- Add scope parameter to dump (scope dev / scope port)
- Extend the doit command with port handle option while preserving the
legacy dev doit path
Or Har-Toov (12):
devlink: Refactor resource functions to be generic
devlink: Add port-level resource registration infrastructure
net/mlx5: Register SF resource on PF port representor
netdevsim: Add devlink port resource registration
devlink: Add dump support for device-level resources
devlink: Include port resources in resource dump dumpit
devlink: Add port-specific option to resource dump doit
selftest: netdevsim: Add devlink port resource doit test
devlink: Document port-level resources and full dump
devlink: Add resource scope filtering to resource dump
selftest: netdevsim: Add resource dump and scope filter test
devlink: Document resource scope filtering
Documentation/netlink/specs/devlink.yaml | 32 +-
.../networking/devlink/devlink-resource.rst | 70 ++++
.../net/ethernet/mellanox/mlx5/core/devlink.h | 4 +
.../mellanox/mlx5/core/esw/devlink_port.c | 37 ++
drivers/net/netdevsim/dev.c | 23 +-
drivers/net/netdevsim/netdevsim.h | 4 +
include/net/devlink.h | 10 +-
include/uapi/linux/devlink.h | 17 +
net/devlink/devl_internal.h | 4 +
net/devlink/netlink_gen.c | 24 +-
net/devlink/netlink_gen.h | 8 +-
net/devlink/port.c | 2 +
net/devlink/resource.c | 319 +++++++++++++++---
.../drivers/net/netdevsim/devlink.sh | 79 ++++-
14 files changed, 576 insertions(+), 57 deletions(-)
base-commit: f1359c240191e686614847905fc861cbda480b47
--
2.44.0
* Re: [PATCH v2] doc: Add CPU Isolation documentation
From: Randy Dunlap @ 2026-04-01 18:25 UTC (permalink / raw)
To: Steven Rostedt, Frederic Weisbecker
Cc: LKML, Anna-Maria Behnsen, Gabriele Monaco, Ingo Molnar,
Jonathan Corbet, Marcelo Tosatti, Marco Crivellari, Michal Hocko,
Paul E . McKenney, Peter Zijlstra, Phil Auld, Thomas Gleixner,
Valentin Schneider, Vlastimil Babka, Waiman Long, linux-doc,
Sebastian Andrzej Siewior, Bagas Sanjaya
In-Reply-To: <20260401130855.02c161d8@gandalf.local.home>
On 4/1/26 10:08 AM, Steven Rostedt wrote:
> On Wed, 1 Apr 2026 18:27:03 +0200
> Frederic Weisbecker <frederic@kernel.org> wrote:
>
>>>> +"CPU Isolation" means leaving a CPU exclusive to a given workload
>>>> +without any undesired code interference from the kernel.
>>>> +
>>>> +Those interferences, commonly pointed out as "noise", can be triggered
>>>
>>> nit: "noise,"
>>
>> Thanks! I have applied all your suggestions, except this one for now because I don't
>> really understand the typographic rule behind it. Any hint?
>
> So this looks to be an American English thing (placing commas within the
> quote), but from what I read, British English places the comma outside the
> quote.
>
> Here's one case where I'd much rather go the British English way. This also means
> it's only incorrect to Americans ;-)
Yes, just leave it as is.
--
~Randy
* Re: [PATCH v2] doc: Add CPU Isolation documentation
From: Waiman Long @ 2026-04-01 17:58 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Anna-Maria Behnsen, Gabriele Monaco, Ingo Molnar,
Jonathan Corbet, Marcelo Tosatti, Marco Crivellari, Michal Hocko,
Paul E . McKenney, Peter Zijlstra, Phil Auld, Steven Rostedt,
Thomas Gleixner, Valentin Schneider, Vlastimil Babka, linux-doc,
Sebastian Andrzej Siewior, Bagas Sanjaya
In-Reply-To: <ac0-LzTrOJIECLoH@localhost.localdomain>
On 4/1/26 11:47 AM, Frederic Weisbecker wrote:
> Le Thu, Mar 26, 2026 at 03:17:48PM -0400, Waiman Long a écrit :
>> On 3/26/26 10:00 AM, Frederic Weisbecker wrote:
>>> nohz_full was introduced in v3.10 in 2013, which means this
>>> documentation is overdue for 13 years.
>>>
>>> Fortunately Paul wrote a part of the needed documentation a while ago,
>>> especially concerning nohz_full in Documentation/timers/no_hz.rst and
>>> also about per-CPU kthreads in
>>> Documentation/admin-guide/kernel-per-CPU-kthreads.rst
>>>
>>> Introduce a new page that gives an overview of CPU isolation in general.
>>>
>>> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>>> ---
>>> v2:
>>> - Fix links and code blocks (Bagas and Sebastian)
>>> - Isolation is not only about userspace, rephrase accordingly (Valentin)
>>> - Paste BIOS issues suggestion from Valentin
>>> - Include the whole rtla suite (Valentin)
>>> - Rephrase a few details (Waiman)
>>> - Talk about RCU induced overhead rather than slower RCU (Sebastian)
>>>
>>> Documentation/admin-guide/cpu-isolation.rst | 357 ++++++++++++++++++++
>>> Documentation/admin-guide/index.rst | 1 +
>>> 2 files changed, 358 insertions(+)
>>> create mode 100644 Documentation/admin-guide/cpu-isolation.rst
>>>
>>> diff --git a/Documentation/admin-guide/cpu-isolation.rst b/Documentation/admin-guide/cpu-isolation.rst
>>> new file mode 100644
>>> index 000000000000..886dec79b056
>>> --- /dev/null
>>> +++ b/Documentation/admin-guide/cpu-isolation.rst
>>> @@ -0,0 +1,357 @@
>>> +.. SPDX-License-Identifier: GPL-2.0
>>> +
>>> +=============
>>> +CPU Isolation
>>> +=============
>>> +
>>> +Introduction
>>> +============
>>> +
>>> +"CPU Isolation" means leaving a CPU exclusive to a given workload
>>> +without any undesired code interference from the kernel.
>>> +
>>> +Those interferences, commonly pointed out as "noise", can be triggered
>>> +by asynchronous events (interrupts, timers, scheduler preemption by
>>> +workqueues and kthreads, ...) or synchronous events (syscalls and page
>>> +faults).
>>> +
>>> +Such noise usually goes unnoticed. After all synchronous events are a
>>> +component of the requested kernel service. And asynchronous events are
>>> +either sufficiently well distributed by the scheduler when executed
>>> +as tasks or reasonably fast when executed as interrupt. The timer
>>> +interrupt can even execute 1024 times per seconds without a significant
>>> +and measurable impact most of the time.
>>> +
>>> +However some rare and extreme workloads can be quite sensitive to
>>> +those kinds of noise. This is the case, for example, with high
>>> +bandwidth network processing that can't afford losing a single packet
>>> +or very low latency network processing. Typically those usecases
>>> +involve DPDK, bypassing the kernel networking stack and performing
>>> +direct access to the networking device from userscace.
>> As also pointed out by Sashiko, there is a typo "userscace" -> "userspace".
>> There are also typos reported in
>>
>> https://sashiko.dev/#/patchset/20260326140055.41555-1-frederic%40kernel.org
> Thanks!
>
> What do you think about these lines of Sashiko's review:
>
> """
> Does this script violate the cgroup v2 "no internal process" constraint?
> By enabling the cpuset controller on the test directory's
> cgroup.subtree_control file, the cgroup cannot also contain processes.
> """
>
> That is confusing me...
I would say that the "no internal process" rule is a suggestion for the
cgroup setup. The real world is actually more complicated, so I will
just ignore that.
Cheers,
Longman
^ permalink raw reply
* Re: [PATCH v4 09/13] ima: Add support for staging measurements with prompt
From: steven chen @ 2026-04-01 17:52 UTC (permalink / raw)
To: Roberto Sassu, corbet, skhan, zohar, dmitry.kasatkin,
eric.snowberg, paul, jmorris, serge
Cc: linux-doc, linux-kernel, linux-integrity, linux-security-module,
gregorylumen, nramas, Roberto Sassu, steven chen
In-Reply-To: <19a1815a1222bd78f6bfde30f60b60ebfacb65aa.camel@huaweicloud.com>
On 3/27/2026 9:45 AM, Roberto Sassu wrote:
> On Thu, 2026-03-26 at 15:44 -0700, steven chen wrote:
>> On 3/26/2026 10:30 AM, Roberto Sassu wrote:
>>> From: Roberto Sassu <roberto.sassu@huawei.com>
>>>
>>> Introduce the ability of staging the IMA measurement list and deleting them
>>> with a prompt.
>>>
>>> Staging means moving the current content of the measurement list to a
>>> separate location, and allowing users to read and delete it. This causes
>>> the measurement list to be atomically truncated before new measurements can
>>> be added. Staging can be done only once at a time. In the event of kexec(),
>>> staging is reverted and staged entries will be carried over to the new
>>> kernel.
>>>
>>> Introduce ascii_runtime_measurements_<algo>_staged and
>>> binary_runtime_measurements_<algo>_staged interfaces to stage and delete
>>> the measurements. Use 'echo A > <IMA interface>' and
>>> 'echo D > <IMA interface>' to respectively stage and delete the entire
>>> measurements list. Locking of these interfaces is also mediated with a call
>>> to _ima_measurements_open() and with ima_measurements_release().
>>>
>>> Implement the staging functionality by introducing the new global
>>> measurements list ima_measurements_staged, and ima_queue_stage() and
>>> ima_queue_staged_delete_all() to respectively move measurements from the
>>> current measurements list to the staged one, and to move staged
>>> measurements to the ima_measurements_trim list for deletion. Introduce
>>> ima_queue_delete() to delete the measurements.
>>>
>>> Finally, introduce the BINARY_STAGED and BINARY_FULL binary measurements
>>> list types, to maintain the counters and the binary size of staged
>>> measurements and the full measurements list (including entries that were
>>> staged). BINARY still represents the current binary measurements list.
>>>
>>> Use the binary size for the BINARY + BINARY_STAGED types in
>>> ima_add_kexec_buffer(), since both measurements list types are copied to
>>> the secondary kernel during kexec. Use BINARY_FULL in
>>> ima_measure_kexec_event(), to generate a critical data record.
>>>
>>> It should be noted that the BINARY_FULL counter is not passed through
>>> kexec. Thus, the number of entries included in the kexec critical data
>>> records refers to the entries since the previous kexec records.
>>>
>>> Note: This code derives from the Alt-IMA Huawei project, whose license is
>>> GPL-2.0 OR MIT.
>>>
>>> Link: https://github.com/linux-integrity/linux/issues/1
>>> Suggested-by: Gregory Lumen <gregorylumen@linux.microsoft.com> (staging revert)
>>> Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
>>> ---
>>> security/integrity/ima/Kconfig | 13 +++
>>> security/integrity/ima/ima.h | 8 +-
>>> security/integrity/ima/ima_fs.c | 167 ++++++++++++++++++++++++++---
>>> security/integrity/ima/ima_kexec.c | 22 +++-
>>> security/integrity/ima/ima_queue.c | 97 ++++++++++++++++-
>>> 5 files changed, 286 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/security/integrity/ima/Kconfig b/security/integrity/ima/Kconfig
>>> index 976e75f9b9ba..e714726f3384 100644
>>> --- a/security/integrity/ima/Kconfig
>>> +++ b/security/integrity/ima/Kconfig
>>> @@ -332,4 +332,17 @@ config IMA_KEXEC_EXTRA_MEMORY_KB
>>> If set to the default value of 0, an extra half page of memory for those
>>> additional measurements will be allocated.
>>>
>>> +config IMA_STAGING
>>> + bool "Support for staging the measurements list"
>>> + default y
>>> + help
>>> + Add support for staging the measurements list.
>>> +
>>> + It allows user space to stage the measurements list for deletion and
>>> + to delete the staged measurements after confirmation.
>>> +
>>> + On kexec, staging is reverted and staged measurements are prepended
>>> + to the current measurements list when measurements are copied to the
>>> + secondary kernel.
>>> +
>>> endif
>>> diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
>>> index 97b7d6024b5d..65db152a0a24 100644
>>> --- a/security/integrity/ima/ima.h
>>> +++ b/security/integrity/ima/ima.h
>>> @@ -30,9 +30,11 @@ enum tpm_pcrs { TPM_PCR0 = 0, TPM_PCR8 = 8, TPM_PCR10 = 10 };
>>>
>>> /*
>>> * BINARY: current binary measurements list
>>> + * BINARY_STAGED: staged binary measurements list
>>> + * BINARY_FULL: binary measurements list since IMA init (lost after kexec)
>>> */
>>> enum binary_lists {
>>> - BINARY, BINARY__LAST
>>> + BINARY, BINARY_STAGED, BINARY_FULL, BINARY__LAST
>>> };
>>>
>>> /* digest size for IMA, fits SHA1 or MD5 */
>>> @@ -125,6 +127,7 @@ struct ima_queue_entry {
>>> struct ima_template_entry *entry;
>>> };
>>> extern struct list_head ima_measurements; /* list of all measurements */
>>> +extern struct list_head ima_measurements_staged; /* list of staged meas. */
>>>
>>> /* Some details preceding the binary serialized measurement list */
>>> struct ima_kexec_hdr {
>>> @@ -314,6 +317,8 @@ struct ima_template_desc *ima_template_desc_current(void);
>>> struct ima_template_desc *ima_template_desc_buf(void);
>>> struct ima_template_desc *lookup_template_desc(const char *name);
>>> bool ima_template_has_modsig(const struct ima_template_desc *ima_template);
>>> +int ima_queue_stage(void);
>>> +int ima_queue_staged_delete_all(void);
>>> int ima_restore_measurement_entry(struct ima_template_entry *entry);
>>> int ima_restore_measurement_list(loff_t bufsize, void *buf);
>>> int ima_measurements_show(struct seq_file *m, void *v);
>>> @@ -334,6 +339,7 @@ extern spinlock_t ima_queue_lock;
>>> extern atomic_long_t ima_num_entries[BINARY__LAST];
>>> extern atomic_long_t ima_num_violations;
>>> extern struct hlist_head __rcu *ima_htable;
>>> +extern struct mutex ima_extend_list_mutex;
>>>
>>> static inline unsigned int ima_hash_key(u8 *digest)
>>> {
>>> diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
>>> index 7709a4576322..39d9128e9f22 100644
>>> --- a/security/integrity/ima/ima_fs.c
>>> +++ b/security/integrity/ima/ima_fs.c
>>> @@ -24,6 +24,13 @@
>>>
>>> #include "ima.h"
>>>
>>> +/*
>>> + * Requests:
>>> + * 'A\n': stage the entire measurements list
>>> + * 'D\n': delete all staged measurements
>>> + */
>>> +#define STAGED_REQ_LENGTH 21
>>> +
>>> static DEFINE_MUTEX(ima_write_mutex);
>>> static DEFINE_MUTEX(ima_measure_mutex);
>>> static long ima_measure_users;
>>> @@ -97,6 +104,11 @@ static void *ima_measurements_start(struct seq_file *m, loff_t *pos)
>>> return _ima_measurements_start(m, pos, &ima_measurements);
>>> }
>>>
>>> +static void *ima_measurements_staged_start(struct seq_file *m, loff_t *pos)
>>> +{
>>> + return _ima_measurements_start(m, pos, &ima_measurements_staged);
>>> +}
>>> +
>>> static void *_ima_measurements_next(struct seq_file *m, void *v, loff_t *pos,
>>> struct list_head *head)
>>> {
>>> @@ -118,6 +130,12 @@ static void *ima_measurements_next(struct seq_file *m, void *v, loff_t *pos)
>>> return _ima_measurements_next(m, v, pos, &ima_measurements);
>>> }
>>>
>>> +static void *ima_measurements_staged_next(struct seq_file *m, void *v,
>>> + loff_t *pos)
>>> +{
>>> + return _ima_measurements_next(m, v, pos, &ima_measurements_staged);
>>> +}
>>> +
>>> static void ima_measurements_stop(struct seq_file *m, void *v)
>>> {
>>> }
>>> @@ -283,6 +301,68 @@ static const struct file_operations ima_measurements_ops = {
>>> .release = ima_measurements_release,
>>> };
>>>
>>> +static const struct seq_operations ima_measurments_staged_seqops = {
>>> + .start = ima_measurements_staged_start,
>>> + .next = ima_measurements_staged_next,
>>> + .stop = ima_measurements_stop,
>>> + .show = ima_measurements_show
>>> +};
>>> +
>>> +static int ima_measurements_staged_open(struct inode *inode, struct file *file)
>>> +{
>>> + return _ima_measurements_open(inode, file,
>>> + &ima_measurments_staged_seqops);
>>> +}
>>> +
>>> +static ssize_t ima_measurements_staged_write(struct file *file,
>>> + const char __user *buf,
>>> + size_t datalen, loff_t *ppos)
>>> +{
>>> + char req[STAGED_REQ_LENGTH];
>>> + int ret;
>>> +
>>> + if (*ppos > 0 || datalen < 2 || datalen > STAGED_REQ_LENGTH)
>>> + return -EINVAL;
>>> +
>>> + if (copy_from_user(req, buf, datalen) != 0)
>>> + return -EFAULT;
>>> +
>>> + if (req[datalen - 1] != '\n')
>>> + return -EINVAL;
>>> +
>>> + req[datalen - 1] = '\0';
>>> +
>>> + switch (req[0]) {
>>> + case 'A':
>>> + if (datalen != 2)
>>> + return -EINVAL;
>>> +
>>> + ret = ima_queue_stage();
>>> + break;
>>> + case 'D':
>>> + if (datalen != 2)
>>> + return -EINVAL;
>>> +
>>> + ret = ima_queue_staged_delete_all();
>>> + break;
>> I think the following two steps may not work because of a race condition:
>>
>> step 1: ret = ima_queue_stage(); // moves all logs in the active list to the staged list
>> step 2: ret = ima_queue_staged_delete_all(); // deletes all logs in the staged list
>>
>> The race condition unfolds as follows:
>> 1. The current active log list is LA1.
>> 2. The user agent reads the TPM quote QA1, which matches list LA1.
>> 3. A new event NewLog is added to the active log list, which becomes LA1+NewLog.
>> 4. The user agent calls ima_queue_stage(), which generates a staged list
>> containing LA1+NewLog.
>> 5. The user agent calls ima_queue_staged_delete_all().
>> The new log NewLog from step 3 is also deleted.
> Please refer to the documentation patch which explains the intended
> workflow of this approach (Remote Attestation Agent Workflow).
>
> Roberto
So the user agent needs to be deeply involved in measurement list
management: in user space, the user agent needs to maintain two lists,
which makes the user agent more complicated in this case.
Steven
>> The next attestation will fail if it uses the active log list in the
>> kernel.
>>
>> Thanks,
>>
>> Steven
>>
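The two orderings discussed above can be compared with a toy model. This is
only a sketch under stated assumptions: the lists are reduced to entry
counters, struct model and every function name below are hypothetical, and
the "stage first" ordering is an assumption of this sketch rather than a
quote from the referenced documentation patch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (hypothetical): each list is reduced to an entry counter. */
struct model {
	size_t active;   /* entries in the active list */
	size_t staged;   /* entries in the staged list */
	size_t verified; /* entries the agent has matched against a quote */
	size_t deleted;  /* entries deleted from the kernel */
};

/* A new measurement always lands in the active list. */
static void add_event(struct model *m) { m->active++; }

/* Models 'A': move every active entry to the staged list. */
static int stage_all(struct model *m)
{
	if (m->staged)
		return -1; /* staging already in progress */
	if (!m->active)
		return -2; /* nothing to stage */
	m->staged = m->active;
	m->active = 0;
	return 0;
}

/* Models 'D': delete every staged entry. */
static int delete_staged(struct model *m)
{
	if (!m->staged)
		return -2;
	m->deleted += m->staged;
	m->staged = 0;
	return 0;
}

/* Ordering from the report: quote first, then stage and delete. */
static size_t lost_when_quote_first(size_t initial)
{
	struct model m = { .active = initial };

	assert(initial > 0);
	m.verified = m.active; /* quote QA1 matches LA1 */
	add_event(&m);         /* NewLog arrives before staging */
	stage_all(&m);
	delete_staged(&m);
	return m.deleted - m.verified; /* deleted but never verified */
}

/* Assumed alternative: stage first, verify the staged list, then delete. */
static size_t lost_when_stage_first(size_t initial)
{
	struct model m = { .active = initial };

	assert(initial > 0);
	stage_all(&m);
	add_event(&m);         /* NewLog stays in the active list */
	m.verified = m.staged; /* agent verifies what it read from staging */
	delete_staged(&m);
	return m.deleted - m.verified;
}
```

In this model the quote-first ordering deletes one entry that was never
verified, while the stage-first ordering deletes none, at the cost of the
agent tracking both the staged and the active list.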
>>> + default:
>>> + ret = -EINVAL;
>>> + }
>>> +
>>> + if (ret < 0)
>>> + return ret;
>>> +
>>> + return datalen;
>>> +}
>>> +
>>> +static const struct file_operations ima_measurements_staged_ops = {
>>> + .open = ima_measurements_staged_open,
>>> + .read = seq_read,
>>> + .write = ima_measurements_staged_write,
>>> + .llseek = seq_lseek,
>>> + .release = ima_measurements_release,
>>> +};
>>> +
>>> void ima_print_digest(struct seq_file *m, u8 *digest, u32 size)
>>> {
>>> u32 i;
>>> @@ -356,6 +436,28 @@ static const struct file_operations ima_ascii_measurements_ops = {
>>> .release = ima_measurements_release,
>>> };
>>>
>>> +static const struct seq_operations ima_ascii_measurements_staged_seqops = {
>>> + .start = ima_measurements_staged_start,
>>> + .next = ima_measurements_staged_next,
>>> + .stop = ima_measurements_stop,
>>> + .show = ima_ascii_measurements_show
>>> +};
>>> +
>>> +static int ima_ascii_measurements_staged_open(struct inode *inode,
>>> + struct file *file)
>>> +{
>>> + return _ima_measurements_open(inode, file,
>>> + &ima_ascii_measurements_staged_seqops);
>>> +}
>>> +
>>> +static const struct file_operations ima_ascii_measurements_staged_ops = {
>>> + .open = ima_ascii_measurements_staged_open,
>>> + .read = seq_read,
>>> + .write = ima_measurements_staged_write,
>>> + .llseek = seq_lseek,
>>> + .release = ima_measurements_release,
>>> +};
>>> +
>>> static ssize_t ima_read_policy(char *path)
>>> {
>>> void *data = NULL;
>>> @@ -459,10 +561,21 @@ static const struct seq_operations ima_policy_seqops = {
>>> };
>>> #endif
>>>
>>> -static int __init create_securityfs_measurement_lists(void)
>>> +static int __init create_securityfs_measurement_lists(bool staging)
>>> {
>>> + const struct file_operations *ascii_ops = &ima_ascii_measurements_ops;
>>> + const struct file_operations *binary_ops = &ima_measurements_ops;
>>> + mode_t permissions = S_IRUSR | S_IRGRP;
>>> + const char *file_suffix = "";
>>> int count = NR_BANKS(ima_tpm_chip);
>>>
>>> + if (staging) {
>>> + ascii_ops = &ima_ascii_measurements_staged_ops;
>>> + binary_ops = &ima_measurements_staged_ops;
>>> + file_suffix = "_staged";
>>> + permissions |= (S_IWUSR | S_IWGRP);
>>> + }
>>> +
>>> if (ima_sha1_idx >= NR_BANKS(ima_tpm_chip))
>>> count++;
>>>
>>> @@ -473,29 +586,32 @@ static int __init create_securityfs_measurement_lists(void)
>>>
>>> if (algo == HASH_ALGO__LAST)
>>> snprintf(file_name, sizeof(file_name),
>>> - "ascii_runtime_measurements_tpm_alg_%x",
>>> - ima_tpm_chip->allocated_banks[i].alg_id);
>>> + "ascii_runtime_measurements_tpm_alg_%x%s",
>>> + ima_tpm_chip->allocated_banks[i].alg_id,
>>> + file_suffix);
>>> else
>>> snprintf(file_name, sizeof(file_name),
>>> - "ascii_runtime_measurements_%s",
>>> - hash_algo_name[algo]);
>>> - dentry = securityfs_create_file(file_name, S_IRUSR | S_IRGRP,
>>> + "ascii_runtime_measurements_%s%s",
>>> + hash_algo_name[algo], file_suffix);
>>> + dentry = securityfs_create_file(file_name, permissions,
>>> ima_dir, (void *)(uintptr_t)i,
>>> - &ima_ascii_measurements_ops);
>>> + ascii_ops);
>>> if (IS_ERR(dentry))
>>> return PTR_ERR(dentry);
>>>
>>> if (algo == HASH_ALGO__LAST)
>>> snprintf(file_name, sizeof(file_name),
>>> - "binary_runtime_measurements_tpm_alg_%x",
>>> - ima_tpm_chip->allocated_banks[i].alg_id);
>>> + "binary_runtime_measurements_tpm_alg_%x%s",
>>> + ima_tpm_chip->allocated_banks[i].alg_id,
>>> + file_suffix);
>>> else
>>> snprintf(file_name, sizeof(file_name),
>>> - "binary_runtime_measurements_%s",
>>> - hash_algo_name[algo]);
>>> - dentry = securityfs_create_file(file_name, S_IRUSR | S_IRGRP,
>>> + "binary_runtime_measurements_%s%s",
>>> + hash_algo_name[algo], file_suffix);
>>> +
>>> + dentry = securityfs_create_file(file_name, permissions,
>>> ima_dir, (void *)(uintptr_t)i,
>>> - &ima_measurements_ops);
>>> + binary_ops);
>>> if (IS_ERR(dentry))
>>> return PTR_ERR(dentry);
>>> }
>>> @@ -503,6 +619,23 @@ static int __init create_securityfs_measurement_lists(void)
>>> return 0;
>>> }
>>>
>>> +static int __init create_securityfs_staging_links(void)
>>> +{
>>> + struct dentry *dentry;
>>> +
>>> + dentry = securityfs_create_symlink("binary_runtime_measurements_staged",
>>> + ima_dir, "binary_runtime_measurements_sha1_staged", NULL);
>>> + if (IS_ERR(dentry))
>>> + return PTR_ERR(dentry);
>>> +
>>> + dentry = securityfs_create_symlink("ascii_runtime_measurements_staged",
>>> + ima_dir, "ascii_runtime_measurements_sha1_staged", NULL);
>>> + if (IS_ERR(dentry))
>>> + return PTR_ERR(dentry);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> /*
>>> * ima_open_policy: sequentialize access to the policy file
>>> */
>>> @@ -595,7 +728,13 @@ int __init ima_fs_init(void)
>>> goto out;
>>> }
>>>
>>> - ret = create_securityfs_measurement_lists();
>>> + ret = create_securityfs_measurement_lists(false);
>>> + if (ret == 0 && IS_ENABLED(CONFIG_IMA_STAGING)) {
>>> + ret = create_securityfs_measurement_lists(true);
>>> + if (ret == 0)
>>> + ret = create_securityfs_staging_links();
>>> + }
>>> +
>>> if (ret != 0)
>>> goto out;
>>>
>>> diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c
>>> index d7d0fb639d99..d5503dd5cc9b 100644
>>> --- a/security/integrity/ima/ima_kexec.c
>>> +++ b/security/integrity/ima/ima_kexec.c
>>> @@ -42,8 +42,8 @@ void ima_measure_kexec_event(const char *event_name)
>>> long len;
>>> int n;
>>>
>>> - buf_size = ima_get_binary_runtime_size(BINARY);
>>> - len = atomic_long_read(&ima_num_entries[BINARY]);
>>> + buf_size = ima_get_binary_runtime_size(BINARY_FULL);
>>> + len = atomic_long_read(&ima_num_entries[BINARY_FULL]);
>>>
>>> n = scnprintf(ima_kexec_event, IMA_KEXEC_EVENT_LEN,
>>> "kexec_segment_size=%lu;ima_binary_runtime_size=%lu;"
>>> @@ -106,13 +106,26 @@ static int ima_dump_measurement_list(unsigned long *buffer_size, void **buffer,
>>>
>>> memset(&khdr, 0, sizeof(khdr));
>>> khdr.version = 1;
>>> - /* This is an append-only list, no need to hold the RCU read lock */
>>> - list_for_each_entry_rcu(qe, &ima_measurements, later, true) {
>>> + /* It can race with ima_queue_stage() and ima_queue_delete_staged(). */
>>> + mutex_lock(&ima_extend_list_mutex);
>>> +
>>> + list_for_each_entry_rcu(qe, &ima_measurements_staged, later,
>>> + lockdep_is_held(&ima_extend_list_mutex)) {
>>> ret = ima_dump_measurement(&khdr, qe);
>>> if (ret < 0)
>>> break;
>>> }
>>>
>>> + list_for_each_entry_rcu(qe, &ima_measurements, later,
>>> + lockdep_is_held(&ima_extend_list_mutex)) {
>>> + if (!ret)
>>> + ret = ima_dump_measurement(&khdr, qe);
>>> + if (ret < 0)
>>> + break;
>>> + }
>>> +
>>> + mutex_unlock(&ima_extend_list_mutex);
>>> +
>>> /*
>>> * fill in reserved space with some buffer details
>>> * (eg. version, buffer size, number of measurements)
>>> @@ -167,6 +180,7 @@ void ima_add_kexec_buffer(struct kimage *image)
>>> extra_memory = CONFIG_IMA_KEXEC_EXTRA_MEMORY_KB * 1024;
>>>
>>> binary_runtime_size = ima_get_binary_runtime_size(BINARY) +
>>> + ima_get_binary_runtime_size(BINARY_STAGED) +
>>> extra_memory;
>>>
>>> if (binary_runtime_size >= ULONG_MAX - PAGE_SIZE)
>>> diff --git a/security/integrity/ima/ima_queue.c b/security/integrity/ima/ima_queue.c
>>> index b6d10dceb669..50519ed837d4 100644
>>> --- a/security/integrity/ima/ima_queue.c
>>> +++ b/security/integrity/ima/ima_queue.c
>>> @@ -26,6 +26,7 @@
>>> static struct tpm_digest *digests;
>>>
>>> LIST_HEAD(ima_measurements); /* list of all measurements */
>>> +LIST_HEAD(ima_measurements_staged); /* list of staged measurements */
>>> #ifdef CONFIG_IMA_KEXEC
>>> static unsigned long binary_runtime_size[BINARY__LAST];
>>> #else
>>> @@ -45,11 +46,11 @@ atomic_long_t ima_num_violations = ATOMIC_LONG_INIT(0);
>>> /* key: inode (before secure-hashing a file) */
>>> struct hlist_head __rcu *ima_htable;
>>>
>>> -/* mutex protects atomicity of extending measurement list
>>> +/* mutex protects atomicity of extending and staging measurement list
>>> * and extending the TPM PCR aggregate. Since tpm_extend can take
>>> * long (and the tpm driver uses a mutex), we can't use the spinlock.
>>> */
>>> -static DEFINE_MUTEX(ima_extend_list_mutex);
>>> +DEFINE_MUTEX(ima_extend_list_mutex);
>>>
>>> /*
>>> * Used internally by the kernel to suspend measurements.
>>> @@ -174,12 +175,16 @@ static int ima_add_digest_entry(struct ima_template_entry *entry,
>>> lockdep_is_held(&ima_extend_list_mutex));
>>>
>>> atomic_long_inc(&ima_num_entries[BINARY]);
>>> + atomic_long_inc(&ima_num_entries[BINARY_FULL]);
>>> +
>>> if (update_htable) {
>>> key = ima_hash_key(entry->digests[ima_hash_algo_idx].digest);
>>> hlist_add_head_rcu(&qe->hnext, &htable[key]);
>>> }
>>>
>>> ima_update_binary_runtime_size(entry, BINARY);
>>> + ima_update_binary_runtime_size(entry, BINARY_FULL);
>>> +
>>> return 0;
>>> }
>>>
>>> @@ -280,6 +285,94 @@ int ima_add_template_entry(struct ima_template_entry *entry, int violation,
>>> return result;
>>> }
>>>
>>> +int ima_queue_stage(void)
>>> +{
>>> + int ret = 0;
>>> +
>>> + mutex_lock(&ima_extend_list_mutex);
>>> + if (!list_empty(&ima_measurements_staged)) {
>>> + ret = -EEXIST;
>>> + goto out_unlock;
>>> + }
>>> +
>>> + if (list_empty(&ima_measurements)) {
>>> + ret = -ENOENT;
>>> + goto out_unlock;
>>> + }
>>> +
>>> + list_replace(&ima_measurements, &ima_measurements_staged);
>>> + INIT_LIST_HEAD(&ima_measurements);
>>> +
>>> + atomic_long_set(&ima_num_entries[BINARY_STAGED],
>>> + atomic_long_read(&ima_num_entries[BINARY]));
>>> + atomic_long_set(&ima_num_entries[BINARY], 0);
>>> +
>>> + if (IS_ENABLED(CONFIG_IMA_KEXEC)) {
>>> + binary_runtime_size[BINARY_STAGED] =
>>> + binary_runtime_size[BINARY];
>>> + binary_runtime_size[BINARY] = 0;
>>> + }
>>> +out_unlock:
>>> + mutex_unlock(&ima_extend_list_mutex);
>>> + return ret;
>>> +}
>>> +
>>> +static void ima_queue_delete(struct list_head *head);
>>> +
>>> +int ima_queue_staged_delete_all(void)
>>> +{
>>> + LIST_HEAD(ima_measurements_trim);
>>> +
>>> + mutex_lock(&ima_extend_list_mutex);
>>> + if (list_empty(&ima_measurements_staged)) {
>>> + mutex_unlock(&ima_extend_list_mutex);
>>> + return -ENOENT;
>>> + }
>>> +
>>> + list_replace(&ima_measurements_staged, &ima_measurements_trim);
>>> + INIT_LIST_HEAD(&ima_measurements_staged);
>>> +
>>> + atomic_long_set(&ima_num_entries[BINARY_STAGED], 0);
>>> +
>>> + if (IS_ENABLED(CONFIG_IMA_KEXEC))
>>> + binary_runtime_size[BINARY_STAGED] = 0;
>>> +
>>> + mutex_unlock(&ima_extend_list_mutex);
>>> +
>>> + ima_queue_delete(&ima_measurements_trim);
>>> + return 0;
>>> +}
>>> +
>>> +static void ima_queue_delete(struct list_head *head)
>>> +{
>>> + struct ima_queue_entry *qe, *qe_tmp;
>>> + unsigned int i;
>>> +
>>> + list_for_each_entry_safe(qe, qe_tmp, head, later) {
>>> + /*
>>> + * Safe to free template_data here without synchronize_rcu()
>>> + * because the only htable reader, ima_lookup_digest_entry(),
>>> + * accesses only entry->digests, not template_data. If new
>>> + * htable readers are added that access template_data, a
>>> + * synchronize_rcu() is required here.
>>> + */
>>> + for (i = 0; i < qe->entry->template_desc->num_fields; i++) {
>>> + kfree(qe->entry->template_data[i].data);
>>> + qe->entry->template_data[i].data = NULL;
>>> + qe->entry->template_data[i].len = 0;
>>> + }
>>> +
>>> + list_del(&qe->later);
>>> +
>>> + /* No leak if condition is false, referenced by ima_htable. */
>>> + if (IS_ENABLED(CONFIG_IMA_DISABLE_HTABLE)) {
>>> + kfree(qe->entry->digests);
>>> + kfree(qe->entry);
>>> + kfree(qe);
>>> + }
>>> + }
>>> +}
>>> +
>>> int ima_restore_measurement_entry(struct ima_template_entry *entry)
>>> {
>>> int result = 0;
^ permalink raw reply
* Re: [PATCH v4 11/13] ima: Support staging and deleting N measurements entries
From: steven chen @ 2026-04-01 17:48 UTC (permalink / raw)
To: Roberto Sassu, corbet, skhan, zohar, dmitry.kasatkin,
eric.snowberg, paul, jmorris, serge
Cc: linux-doc, linux-kernel, linux-integrity, linux-security-module,
gregorylumen, nramas, Roberto Sassu, steven chen
In-Reply-To: <af6aa732b85af36e07e4a82b29170e80b13dc7c4.camel@huaweicloud.com>
On 3/27/2026 10:02 AM, Roberto Sassu wrote:
> On Thu, 2026-03-26 at 16:19 -0700, steven chen wrote:
>> On 3/26/2026 10:30 AM, Roberto Sassu wrote:
>>> From: Roberto Sassu <roberto.sassu@huawei.com>
>>>
>>> Add support for sending a value N between 1 and ULONG_MAX to the staging
>>> interface. This value represents the number of measurements that should be
>>> deleted from the current measurements list.
>>>
>>> This staging method allows the remote attestation agents to easily separate
>>> the measurements that were verified (staged and deleted) from those that
>>> weren't due to the race between taking a TPM quote and reading the
>>> measurements list.
>>>
>>> In order to minimize the locking time of ima_extend_list_mutex, deleting
>>> N entries is realized by staging the entire current measurements list
>>> (with the lock), by determining the N-th staged entry (without the lock),
>>> and by splicing the entries in excess back to the current measurements list
>>> (with the lock). Finally, the N entries are deleted (without the lock).
>>>
>>> Flushing the hash table is not supported for N entries, since it would
>>> require removing the N entries one by one from the hash table under the
>>> ima_extend_list_mutex lock, which would increase the locking time.
>>>
>>> The ima_extend_list_mutex lock is necessary in ima_dump_measurement_list()
>>> because ima_queue_staged_delete_partial() uses __list_cut_position() to
>>> modify ima_measurements_staged, for which no RCU-safe variant exists. For
>>> the staging with prompt flavor alone, list_replace_rcu() could have been
>>> used instead, but since both flavors share the same kexec serialization
>>> path, the mutex is required regardless.
>>>
>>> Link: https://github.com/linux-integrity/linux/issues/1
>>> Suggested-by: Steven Chen <chenste@linux.microsoft.com>
>>> Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
>>> ---
>>> security/integrity/ima/Kconfig | 3 ++
>>> security/integrity/ima/ima.h | 1 +
>>> security/integrity/ima/ima_fs.c | 22 +++++++++-
>>> security/integrity/ima/ima_queue.c | 70 ++++++++++++++++++++++++++++++
>>> 4 files changed, 95 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/security/integrity/ima/Kconfig b/security/integrity/ima/Kconfig
>>> index e714726f3384..6ddb4e77bff5 100644
>>> --- a/security/integrity/ima/Kconfig
>>> +++ b/security/integrity/ima/Kconfig
>>> @@ -341,6 +341,9 @@ config IMA_STAGING
>>> It allows user space to stage the measurements list for deletion and
>>> to delete the staged measurements after confirmation.
>>>
>>> + Or, alternatively, it allows user space to specify N measurements
>>> + entries to be deleted.
>>> +
>>> On kexec, staging is reverted and staged measurements are prepended
>>> to the current measurements list when measurements are copied to the
>>> secondary kernel.
>>> diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
>>> index 699b735dec7d..de0693fce53c 100644
>>> --- a/security/integrity/ima/ima.h
>>> +++ b/security/integrity/ima/ima.h
>>> @@ -319,6 +319,7 @@ struct ima_template_desc *lookup_template_desc(const char *name);
>>> bool ima_template_has_modsig(const struct ima_template_desc *ima_template);
>>> int ima_queue_stage(void);
>>> int ima_queue_staged_delete_all(void);
>>> +int ima_queue_staged_delete_partial(unsigned long req_value);
>>> int ima_restore_measurement_entry(struct ima_template_entry *entry);
>>> int ima_restore_measurement_list(loff_t bufsize, void *buf);
>>> int ima_measurements_show(struct seq_file *m, void *v);
>>> diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
>>> index 39d9128e9f22..eb3f343c1138 100644
>>> --- a/security/integrity/ima/ima_fs.c
>>> +++ b/security/integrity/ima/ima_fs.c
>>> @@ -28,6 +28,7 @@
>>> * Requests:
>>> * 'A\n': stage the entire measurements list
>>> * 'D\n': delete all staged measurements
>>> + * '[1, ULONG_MAX]\n' delete N measurements entries
>>> */
>>> #define STAGED_REQ_LENGTH 21
>>>
>>> @@ -319,6 +320,7 @@ static ssize_t ima_measurements_staged_write(struct file *file,
>>> size_t datalen, loff_t *ppos)
>>> {
>>> char req[STAGED_REQ_LENGTH];
>>> + unsigned long req_value;
>>> int ret;
>>>
>>> if (*ppos > 0 || datalen < 2 || datalen > STAGED_REQ_LENGTH)
>>> @@ -346,7 +348,25 @@ static ssize_t ima_measurements_staged_write(struct file *file,
>>> ret = ima_queue_staged_delete_all();
>>> break;
>>> default:
>>> - ret = -EINVAL;
>>> + if (ima_flush_htable) {
>>> + pr_debug("Deleting staged N measurements not supported when flushing the hash table is requested\n");
>>> + return -EINVAL;
>>> + }
>>> +
>>> + ret = kstrtoul(req, 10, &req_value);
>>> + if (ret < 0)
>>> + return ret;
>>> +
>>> + if (req_value == 0) {
>>> + pr_debug("Must delete at least one entry\n");
>>> + return -EINVAL;
>>> + }
>>> +
>>> + ret = ima_queue_stage();
>>> + if (ret < 0)
>>> + return ret;
>>> +
>>> + ret = ima_queue_staged_delete_partial(req_value);
>> The default case implements the "Trim N" idea plus the performance improvement.
>>
>> It does everything in one go, which is what I suggested for v3.
>>
>> [PATCH v3 1/3] ima: Remove ima_h_table structure
>> <https://lore.kernel.org/linux-integrity/c61aeaa79929a98cb3a6d30835972891fac3570f.camel@linux.ibm.com/T/#t>
> In your approach you do:
>
> lock ima_extend_measure_list_mutex
> scan entries until N
> cut list staged -> trim
> unlock ima_extend_measure_list_mutex
>
>
> In my approach I do:
> lock ima_extend_measure_list_mutex
> list replace active -> staged
> unlock ima_extend_measure_list_mutex
>
> scan entries until N
>
> lock ima_extend_measure_list_mutex
> cut list staged -> trim
> splice staged ->active
> unlock ima_extend_measure_list_mutex
>
> So, I guess if you refer to less user space locking time, you mean one
> lock/unlock and one list replace + list splice less in your solution.
>
> In exchange, you propose to hold the lock in the kernel while scanning
> N. I think it is a significant increase of kernel locking time vs a
> negligible increase of user space locking time (in the kernel, all
> processes need to wait for the ima_extend_measure_list_mutex to be
> released, in user space it is just the agent waiting).
>
> Roberto
Please see version 5:
[PATCH v5 0/3] Trim N entries of IMA event logs
<https://lore.kernel.org/linux-integrity/20260401172956.4581-1-chenste@linux.microsoft.com/T/#t>
Scanning N is moved out of the lock period.
The "Trim N" proposal has less lock time than the "Staging and deleting" proposal.
The "Trim N" proposal has much less code than the "Staging and deleting" proposal.
The "Trim N" proposal's user space operation is simpler than that of "Staging and
deleting".
Steven
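The "stage, scan, cut, splice back" sequence compared above can be sketched
in user space as follows. This is a single-threaded illustration only: the
list helpers imitate the kernel's list_head but are hypothetical, the lock
is represented by comments, and the `late` parameter is an assumption added
here to model a measurement that arrives while the lock is dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modeled on the kernel's list_head. */
struct node { struct node *prev, *next; };

static void list_init(struct node *h) { h->prev = h->next = h; }
static int list_is_empty(const struct node *h) { return h->next == h; }

static size_t list_len(const struct node *h)
{
	size_t len = 0;

	for (const struct node *p = h->next; p != h; p = p->next)
		len++;
	return len;
}

static void list_add_tail(struct node *n, struct node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Move every entry of 'from' to the empty list 'to'. */
static void list_move_all(struct node *from, struct node *to)
{
	if (list_is_empty(from))
		return;
	to->next = from->next;
	to->prev = from->prev;
	to->next->prev = to;
	to->prev->next = to;
	list_init(from);
}

/* Move the entries from the first one up to 'pos' (inclusive) to 'to'. */
static void list_cut(struct node *to, struct node *from, struct node *pos)
{
	struct node *rest = pos->next;

	to->next = from->next;
	to->next->prev = to;
	to->prev = pos;
	pos->next = to;
	from->next = rest;
	rest->prev = from;
}

/* Prepend every entry of 'from' to 'to', preserving order. */
static void list_prepend_all(struct node *from, struct node *to)
{
	if (list_is_empty(from))
		return;
	from->prev->next = to->next;
	to->next->prev = from->prev;
	to->next = from->next;
	from->next->prev = to;
	list_init(from);
}

/* Trim the first n entries of 'active' into 'trim'; -1 if fewer exist. */
static int trim_n(struct node *active, struct node *trim, size_t n,
		  struct node *late)
{
	struct node staged, *pos, *cut_pos = NULL;

	assert(active && trim);
	list_init(&staged);
	list_init(trim);
	if (n == 0)
		return -1;

	/* lock held: stage the whole active list */
	list_move_all(active, &staged);
	/* lock dropped: new measurements may be appended to 'active' */
	if (late)
		list_add_tail(late, active);

	/* scan for the n-th staged entry without the lock */
	for (pos = staged.next; pos != &staged; pos = pos->next) {
		if (--n == 0) {
			cut_pos = pos;
			break;
		}
	}

	/* lock held: cut the first n entries, put the rest back in front */
	if (cut_pos)
		list_cut(trim, &staged, cut_pos);
	list_prepend_all(&staged, active);
	/* lock dropped: 'trim' can now be freed without holding the lock */
	return cut_pos ? 0 : -1;
}
```

Prepending the leftover staged entries, rather than appending them, keeps the
active list in measurement order even when an entry is added during the
unlocked scan.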
>> The two important parts of trimming are "trim N" and the performance improvement.
>>
>> The performance improvement includes two parts:
>> hash table staging
>> active log list staging
>>
>> And I think "Trim N" plus the performance improvement is the right direction
>> to go.
>> Much of the code for the "stage" part of the two-step "stage and trim"
>> approach can then be removed.
>>
>> Also, a race condition may happen if the list is not held the whole time in
>> user space during the attestation period: staging, reading the list,
>> attestation, and trimming.
>>
>> So, to reduce the user space lock time above, "Trim T:N" can be used so
>> that the list need not be held for long in user space during attestation.
>>
>> For Trim T:N, T represents the total number of log entries trimmed since
>> system boot. Please refer to
>> https://lore.kernel.org/linux-integrity/20260205235849.7086-1-chenste@linux.microsoft.com/T/#t
>>
>> Thanks,
>>
>> Steven
>>> }
>>>
>>> if (ret < 0)
>>> diff --git a/security/integrity/ima/ima_queue.c b/security/integrity/ima/ima_queue.c
>>> index f5c18acfbc43..4fb557d61a88 100644
>>> --- a/security/integrity/ima/ima_queue.c
>>> +++ b/security/integrity/ima/ima_queue.c
>>> @@ -371,6 +371,76 @@ int ima_queue_staged_delete_all(void)
>>> return 0;
>>> }
>>>
>>> +int ima_queue_staged_delete_partial(unsigned long req_value)
>>> +{
>>> + unsigned long req_value_copy = req_value;
>>> + unsigned long size_to_remove = 0, num_to_remove = 0;
>>> + struct list_head *cut_pos = NULL;
>>> + LIST_HEAD(ima_measurements_trim);
>>> + struct ima_queue_entry *qe;
>>> + int ret = 0;
>>> +
>>> + /*
>>> + * Safe walk (no concurrent write), not under ima_extend_list_mutex
>>> + * for performance reasons.
>>> + */
>>> + list_for_each_entry(qe, &ima_measurements_staged, later) {
>>> + size_to_remove += get_binary_runtime_size(qe->entry);
>>> + num_to_remove++;
>>> +
>>> + if (--req_value_copy == 0) {
>>> + /* qe->later always points to a valid list entry. */
>>> + cut_pos = &qe->later;
>>> + break;
>>> + }
>>> + }
>>> +
>>> + /* Nothing to remove, undoing staging. */
>>> + if (req_value_copy > 0) {
>>> + size_to_remove = 0;
>>> + num_to_remove = 0;
>>> + ret = -ENOENT;
>>> + }
>>> +
>>> + mutex_lock(&ima_extend_list_mutex);
>>> + if (list_empty(&ima_measurements_staged)) {
>>> + mutex_unlock(&ima_extend_list_mutex);
>>> + return -ENOENT;
>>> + }
>>> +
>>> + if (cut_pos != NULL)
>>> + /*
>>> + * ima_dump_measurement_list() does not modify the list,
>>> + * cut_pos remains the same even if it was computed before
>>> + * the lock.
>>> + */
>>> + __list_cut_position(&ima_measurements_trim,
>>> + &ima_measurements_staged, cut_pos);
>>> +
>>> + atomic_long_sub(num_to_remove, &ima_num_entries[BINARY_STAGED]);
>>> + atomic_long_add(atomic_long_read(&ima_num_entries[BINARY_STAGED]),
>>> + &ima_num_entries[BINARY]);
>>> + atomic_long_set(&ima_num_entries[BINARY_STAGED], 0);
>>> +
>>> + if (IS_ENABLED(CONFIG_IMA_KEXEC)) {
>>> + binary_runtime_size[BINARY_STAGED] -= size_to_remove;
>>> + binary_runtime_size[BINARY] +=
>>> + binary_runtime_size[BINARY_STAGED];
>>> + binary_runtime_size[BINARY_STAGED] = 0;
>>> + }
>>> +
>>> + /*
>>> + * Splice (prepend) any remaining non-deleted staged entries to the
>>> + * active list (RCU not needed, there cannot be concurrent readers).
>>> + */
>>> + list_splice(&ima_measurements_staged, &ima_measurements);
>>> + INIT_LIST_HEAD(&ima_measurements_staged);
>>> + mutex_unlock(&ima_extend_list_mutex);
>>> +
>>> + ima_queue_delete(&ima_measurements_trim, false);
>>> + return ret;
>>> +}
>>> +
>>> static void ima_queue_delete(struct list_head *head, bool flush_htable)
>>> {
>>> struct ima_queue_entry *qe, *qe_tmp;
* Re: [PATCH 12/33] rust: macros: update `extract_if` MSRV TODO comment
From: Miguel Ojeda @ 2026-04-01 17:45 UTC (permalink / raw)
To: Gary Guo
Cc: Miguel Ojeda, Nathan Chancellor, Nicolas Schier, Danilo Krummrich,
Andreas Hindborg, Catalin Marinas, Will Deacon, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Courbot, David Airlie,
Simona Vetter, Brendan Higgins, David Gow, Greg Kroah-Hartman,
Arve Hjønnevåg, Todd Kjos, Christian Brauner,
Carlos Llamas, Alice Ryhl, Jonathan Corbet, Boqun Feng,
Björn Roy Baron, Benno Lossin, Trevor Gross, rust-for-linux,
linux-kbuild, Lorenzo Stoakes, Vlastimil Babka, Liam R . Howlett,
Uladzislau Rezki, linux-block, moderated for non-subscribers,
Alexandre Ghiti, linux-riscv, nouveau, dri-devel, Rae Moar,
linux-kselftest, kunit-dev, Nick Desaulniers, Bill Wendling,
Justin Stitt, llvm, linux-kernel, Shuah Khan, linux-doc
In-Reply-To: <DHHVSX66206Y.3E7I9QUNTCJ8I@garyguo.net>
On Wed, Apr 1, 2026 at 4:18 PM Gary Guo <gary@garyguo.net> wrote:
>
> When I write the comment the intention is to enable the unstable feature and
> switch.
Yeah, that is what I meant as the alternative in the commit message.
I am OK with either.
(By the way, I wondered why you mentioned 1.85 in the comment, I guess
it was supposed to be 1.86 instead originally, i.e. "above" as in >
1.86)
Cheers,
Miguel
* Re: [PATCH 09/33] rust: kbuild: make `--remap-path-prefix` workaround conditional
From: Miguel Ojeda @ 2026-04-01 17:39 UTC (permalink / raw)
To: Gary Guo
Cc: Miguel Ojeda, Nathan Chancellor, Nicolas Schier, Danilo Krummrich,
Andreas Hindborg, Catalin Marinas, Will Deacon, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Courbot, David Airlie,
Simona Vetter, Brendan Higgins, David Gow, Greg Kroah-Hartman,
Arve Hjønnevåg, Todd Kjos, Christian Brauner,
Carlos Llamas, Alice Ryhl, Jonathan Corbet, Boqun Feng,
Björn Roy Baron, Benno Lossin, Trevor Gross, rust-for-linux,
linux-kbuild, Lorenzo Stoakes, Vlastimil Babka, Liam R . Howlett,
Uladzislau Rezki, linux-block, moderated for non-subscribers,
Alexandre Ghiti, linux-riscv, nouveau, dri-devel, Rae Moar,
linux-kselftest, kunit-dev, Nick Desaulniers, Bill Wendling,
Justin Stitt, llvm, linux-kernel, Shuah Khan, linux-doc
In-Reply-To: <DHHVLQAOMLWB.3FHHSYKNM5TNP@garyguo.net>
On Wed, Apr 1, 2026 at 4:08 PM Gary Guo <gary@garyguo.net> wrote:
>
> Okay, I see what the comments mean now. Perhaps squash this to the previous
> commit?
This one was mostly to ensure the workaround was not needed anymore,
i.e. it is more "optional" than the other.
In fact, we may want to just not have either of the patches, i.e. we
could just remove the workaround given the timelines of the branches
-- please see my reply on the previous one on this.
Cheers,
Miguel
* Re: [PATCH 08/33] rust: kbuild: simplify `--remap-path-prefix` workaround
From: Miguel Ojeda @ 2026-04-01 17:36 UTC (permalink / raw)
To: Gary Guo
Cc: Miguel Ojeda, Nathan Chancellor, Nicolas Schier, Danilo Krummrich,
Andreas Hindborg, Catalin Marinas, Will Deacon, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Courbot, David Airlie,
Simona Vetter, Brendan Higgins, David Gow, Greg Kroah-Hartman,
Arve Hjønnevåg, Todd Kjos, Christian Brauner,
Carlos Llamas, Alice Ryhl, Jonathan Corbet, Boqun Feng,
Björn Roy Baron, Benno Lossin, Trevor Gross, rust-for-linux,
linux-kbuild, Lorenzo Stoakes, Vlastimil Babka, Liam R . Howlett,
Uladzislau Rezki, linux-block, moderated for non-subscribers,
Alexandre Ghiti, linux-riscv, nouveau, dri-devel, Rae Moar,
linux-kselftest, kunit-dev, Nick Desaulniers, Bill Wendling,
Justin Stitt, llvm, linux-kernel, Shuah Khan, linux-doc
In-Reply-To: <DHHVEPJHLGDW.1E6KDP9BUFG5U@garyguo.net>
On Wed, Apr 1, 2026 at 3:59 PM Gary Guo <gary@garyguo.net> wrote:
>
> I'm not sure that I parse this. You do remove the filter-out completely below?
(I see you saw the other commit)
> Looks like this is going to conflict with rust-fixes (which adds the
> --remap-path-scope). Perhaps worth doing a back merge?
It would be only a couple lines conflicting, so it should be fine.
Having said that, when I was doing this, I wondered if we should even
consider keeping the workaround. In other words, locally for
`rust-next`, the "normal" commit would be to remove the workaround
entirely because there the flag doesn't exist to begin with (i.e. the
workaround should have been removed back when the revert landed).
Then, when conflict happens in linux-next, we can just keep the
addition of the flag from your commit -- the rest can stay as-is, i.e.
no workaround needed, because you only enable both flags in a version
(1.95.0) where there is no need for the workaround (which was for <
1.87.0).
It is also why I added the second commit here, i.e. the "make it
conditional", because I was testing that indeed we didn't need the
workaround anymore.
So it may just be simpler to do that. What I thought was that perhaps the
workaround is good even if we ourselves don't pass the flag, e.g.
someone else may be passing it. But the chances are very low,
restricted to a couple versions, and the error is obvious and at build
time anyway.
Cheers,
Miguel
* [PATCH v5 3/3] ima: add new critical data record to measure log trim
From: steven chen @ 2026-04-01 17:29 UTC (permalink / raw)
To: linux-integrity
Cc: zohar, roberto.sassu, dmitry.kasatkin, eric.snowberg, corbet,
serge, paul, jmorris, linux-security-module, anirudhve, chenste,
gregorylumen, nramas, sushring, linux-doc
In-Reply-To: <20260401172956.4581-1-chenste@linux.microsoft.com>
Add a new critical data record that measures the trim event each time
IMA event records are deleted.
If userspace has saved all IMA event logs, it can use this record to get
the total number of records deleted since system boot up to that point.
Signed-off-by: steven chen <chenste@linux.microsoft.com>
---
security/integrity/ima/ima_fs.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
index 8e26e0f34311..38d0a49b587f 100644
--- a/security/integrity/ima/ima_fs.c
+++ b/security/integrity/ima/ima_fs.c
@@ -43,6 +43,7 @@ static int valid_policy = 1;
#define IMA_LOG_TRIM_REQ_NUM_LENGTH 15
#define IMA_LOG_TRIM_REQ_TOTAL_LENGTH 32
+#define IMA_LOG_TRIM_EVENT_LEN 256
atomic_long_t ima_number_entries = ATOMIC_LONG_INIT(0);
static long trimcount;
/* mutex protects atomicity of trimming measurement list
@@ -52,6 +53,22 @@ static long trimcount;
static DEFINE_MUTEX(ima_measure_lock);
static long ima_measure_users;
+static void ima_measure_trim_event(void)
+{
+ char ima_log_trim_event[IMA_LOG_TRIM_EVENT_LEN];
+ struct timespec64 ts;
+ u64 time_ns;
+ int n;
+
+ ktime_get_real_ts64(&ts);
+ time_ns = (u64)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
+ n = scnprintf(ima_log_trim_event, IMA_LOG_TRIM_EVENT_LEN,
+ "time= %llu; number= %lu;", time_ns, trimcount);
+
+ ima_measure_critical_data("ima_log_trim", "trim ima event logs",
+ ima_log_trim_event, n, false, NULL, 0);
+}
+
static ssize_t ima_show_htable_value(char __user *buf, size_t count,
loff_t *ppos, atomic_long_t *val)
{
@@ -436,6 +453,9 @@ static ssize_t ima_log_trim_write(struct file *file,
if (ret < 0)
goto out;
+ if (ret > 0)
+ ima_measure_trim_event();
+
trimcount += ret;
ret = datalen;
--
2.43.0
* [PATCH v5 2/3] ima: trim N IMA event log records
From: steven chen @ 2026-04-01 17:29 UTC (permalink / raw)
To: linux-integrity
Cc: zohar, roberto.sassu, dmitry.kasatkin, eric.snowberg, corbet,
serge, paul, jmorris, linux-security-module, anirudhve, chenste,
gregorylumen, nramas, sushring, linux-doc
In-Reply-To: <20260401172956.4581-1-chenste@linux.microsoft.com>
Trim N entries of the IMA event logs. Do not clean the hash table:
the values saved in the hash table were already used.
Provide a userspace interface, ima_trim_log:
When this interface is read, it returns the total number T of entries
trimmed since system boot up.
To trim, userspace writes two numbers, T:N, to this interface, asking
the kernel to trim N entries of the IMA event logs.
Not cleaning the hash table shortens the time the kernel holds the
measurement list lock.
When the kernel gets a log trim request T:N:
- Get T and compare it with the total trimmed number
- if equal, trim N entries and change T to T+N
- else return an error
Signed-off-by: steven chen <chenste@linux.microsoft.com>
---
.../admin-guide/kernel-parameters.txt | 4 +
security/integrity/ima/ima.h | 4 +-
security/integrity/ima/ima_fs.c | 198 +++++++++++++++++-
security/integrity/ima/ima_kexec.c | 2 +-
security/integrity/ima/ima_queue.c | 96 +++++++++
5 files changed, 296 insertions(+), 8 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index e92c0056e4e0..cd1a1d0bf0e2 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2197,6 +2197,10 @@
Use the canonical format for the binary runtime
measurements, instead of host native format.
+ ima_flush_htable [IMA]
+ Flush the measurement list hash table when trim all
+ or a part of it for deletion.
+
ima_hash= [IMA]
Format: { md5 | sha1 | rmd160 | sha256 | sha384
| sha512 | ... }
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index e3d71d8d56e3..5cbee3a295a0 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -243,11 +243,13 @@ void ima_post_key_create_or_update(struct key *keyring, struct key *key,
const void *payload, size_t plen,
unsigned long flags, bool create);
#endif
-
+extern atomic_long_t ima_number_entries;
#ifdef CONFIG_IMA_KEXEC
void ima_measure_kexec_event(const char *event_name);
+long ima_delete_event_log(long req_val);
#else
static inline void ima_measure_kexec_event(const char *event_name) {}
+static inline long ima_delete_event_log(long req_val) { return 0; }
#endif
/*
diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
index 87045b09f120..8e26e0f34311 100644
--- a/security/integrity/ima/ima_fs.c
+++ b/security/integrity/ima/ima_fs.c
@@ -21,6 +21,9 @@
#include <linux/rcupdate.h>
#include <linux/parser.h>
#include <linux/vmalloc.h>
+#include <linux/ktime.h>
+#include <linux/timekeeping.h>
+#include <linux/ima.h>
#include "ima.h"
@@ -38,6 +41,17 @@ __setup("ima_canonical_fmt", default_canonical_fmt_setup);
static int valid_policy = 1;
+#define IMA_LOG_TRIM_REQ_NUM_LENGTH 15
+#define IMA_LOG_TRIM_REQ_TOTAL_LENGTH 32
+atomic_long_t ima_number_entries = ATOMIC_LONG_INIT(0);
+static long trimcount;
+/* mutex protects atomicity of trimming measurement list
+ * and also protects atomicity the measurement list read
+ * write operation.
+ */
+static DEFINE_MUTEX(ima_measure_lock);
+static long ima_measure_users;
+
static ssize_t ima_show_htable_value(char __user *buf, size_t count,
loff_t *ppos, atomic_long_t *val)
{
@@ -64,8 +78,7 @@ static ssize_t ima_show_measurements_count(struct file *filp,
char __user *buf,
size_t count, loff_t *ppos)
{
- return ima_show_htable_value(buf, count, ppos, &ima_htable.len);
-
+ return ima_show_htable_value(buf, count, ppos, &ima_number_entries);
}
static const struct file_operations ima_measurements_count_ops = {
@@ -202,16 +215,77 @@ static const struct seq_operations ima_measurments_seqops = {
.show = ima_measurements_show
};
+/*
+ * _ima_measurements_open - open the IMA measurements file
+ * @inode: inode of the file being opened
+ * @file: file being opened
+ * @seq_ops: sequence operations for the file
+ *
+ * Returns 0 on success, or negative error code.
+ * Implements mutual exclusion between readers and writer
+ * of the measurements file. Multiple readers are allowed,
+ * but writer get exclusive access only no other readers/writers.
+ * Readers is not allowed when there is a writer.
+ */
+static int _ima_measurements_open(struct inode *inode, struct file *file,
+ const struct seq_operations *seq_ops)
+{
+ bool write = !!(file->f_mode & FMODE_WRITE);
+ int ret;
+
+ if (write && !capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ mutex_lock(&ima_measure_lock);
+ if ((write && ima_measure_users != 0) ||
+ (!write && ima_measure_users < 0)) {
+ mutex_unlock(&ima_measure_lock);
+ return -EBUSY;
+ }
+
+ ret = seq_open(file, seq_ops);
+ if (ret < 0) {
+ mutex_unlock(&ima_measure_lock);
+ return ret;
+ }
+
+ if (write)
+ ima_measure_users--;
+ else
+ ima_measure_users++;
+
+ mutex_unlock(&ima_measure_lock);
+ return ret;
+}
+
static int ima_measurements_open(struct inode *inode, struct file *file)
{
- return seq_open(file, &ima_measurments_seqops);
+ return _ima_measurements_open(inode, file, &ima_measurments_seqops);
+}
+
+static int ima_measurements_release(struct inode *inode, struct file *file)
+{
+ bool write = !!(file->f_mode & FMODE_WRITE);
+ int ret;
+
+ mutex_lock(&ima_measure_lock);
+ ret = seq_release(inode, file);
+ if (!ret) {
+ if (!write)
+ ima_measure_users--;
+ else
+ ima_measure_users++;
+ }
+
+ mutex_unlock(&ima_measure_lock);
+ return ret;
}
static const struct file_operations ima_measurements_ops = {
.open = ima_measurements_open,
.read = seq_read,
.llseek = seq_lseek,
- .release = seq_release,
+ .release = ima_measurements_release,
};
void ima_print_digest(struct seq_file *m, u8 *digest, u32 size)
@@ -279,14 +353,114 @@ static const struct seq_operations ima_ascii_measurements_seqops = {
static int ima_ascii_measurements_open(struct inode *inode, struct file *file)
{
- return seq_open(file, &ima_ascii_measurements_seqops);
+ return _ima_measurements_open(inode, file, &ima_ascii_measurements_seqops);
}
static const struct file_operations ima_ascii_measurements_ops = {
.open = ima_ascii_measurements_open,
.read = seq_read,
.llseek = seq_lseek,
- .release = seq_release,
+ .release = ima_measurements_release,
+};
+
+static int ima_log_trim_open(struct inode *inode, struct file *file)
+{
+ bool write = !!(file->f_mode & FMODE_WRITE);
+
+ if (!write && capable(CAP_SYS_ADMIN))
+ return 0;
+ else if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ return _ima_measurements_open(inode, file, &ima_measurments_seqops);
+}
+
+static ssize_t ima_log_trim_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
+{
+ char tmpbuf[IMA_LOG_TRIM_REQ_NUM_LENGTH];
+ ssize_t len;
+
+ len = scnprintf(tmpbuf, sizeof(tmpbuf), "%li\n", trimcount);
+ return simple_read_from_buffer(buf, size, ppos, tmpbuf, len);
+}
+
+static ssize_t ima_log_trim_write(struct file *file,
+ const char __user *buf, size_t datalen, loff_t *ppos)
+{
+ char tmpbuf[IMA_LOG_TRIM_REQ_TOTAL_LENGTH];
+ char *p = tmpbuf;
+ long count, ret, val = 0, max = LONG_MAX;
+
+ if (*ppos > 0 || datalen > IMA_LOG_TRIM_REQ_TOTAL_LENGTH || datalen < 2) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (copy_from_user(tmpbuf, buf, datalen) != 0) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ p = tmpbuf;
+
+ while (*p && *p != ':') {
+ if (!isdigit((unsigned char)*p))
+ return -EINVAL;
+
+ /* digit value */
+ int d = *p - '0';
+
+ /* overflow check: val * 10 + d > max -> (val > (max - d) / 10) */
+ if (val > (max - d) / 10)
+ return -ERANGE;
+
+ val = val * 10 + d;
+ p++;
+ }
+
+ if (*p != ':')
+ return -EINVAL;
+
+ /* verify trim count matches */
+ if (val != trimcount)
+ return -EINVAL;
+
+ p++; /* skip ':' */
+ ret = kstrtoul(p, 0, &count);
+
+ if (ret < 0)
+ goto out;
+
+ ret = ima_delete_event_log(count);
+
+ if (ret < 0)
+ goto out;
+
+ trimcount += ret;
+
+ ret = datalen;
+out:
+ return ret;
+}
+
+static int ima_log_trim_release(struct inode *inode, struct file *file)
+{
+ bool write = !!(file->f_mode & FMODE_WRITE);
+
+ if (!write && capable(CAP_SYS_ADMIN))
+ return 0;
+ else if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ return ima_measurements_release(inode, file);
+}
+
+static const struct file_operations ima_log_trim_ops = {
+ .open = ima_log_trim_open,
+ .read = ima_log_trim_read,
+ .write = ima_log_trim_write,
+ .llseek = generic_file_llseek,
+ .release = ima_log_trim_release
};
static ssize_t ima_read_policy(char *path)
@@ -528,6 +702,18 @@ int __init ima_fs_init(void)
goto out;
}
+ if (IS_ENABLED(CONFIG_IMA_LOG_TRIMMING)) {
+ dentry = securityfs_create_file("ima_trim_log",
+ S_IRUSR | S_IRGRP | S_IWUSR | S_IWGRP,
+ ima_dir, NULL, &ima_log_trim_ops);
+ if (IS_ERR(dentry)) {
+ ret = PTR_ERR(dentry);
+ goto out;
+ }
+ }
+
+ trimcount = 0;
+
dentry = securityfs_create_file("runtime_measurements_count",
S_IRUSR | S_IRGRP, ima_dir, NULL,
&ima_measurements_count_ops);
diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c
index 7362f68f2d8b..bee997683e03 100644
--- a/security/integrity/ima/ima_kexec.c
+++ b/security/integrity/ima/ima_kexec.c
@@ -41,7 +41,7 @@ void ima_measure_kexec_event(const char *event_name)
int n;
buf_size = ima_get_binary_runtime_size();
- len = atomic_long_read(&ima_htable.len);
+ len = atomic_long_read(&ima_number_entries);
n = scnprintf(ima_kexec_event, IMA_KEXEC_EVENT_LEN,
"kexec_segment_size=%lu;ima_binary_runtime_size=%lu;"
diff --git a/security/integrity/ima/ima_queue.c b/security/integrity/ima/ima_queue.c
index 590637e81ad1..07225e19b9b5 100644
--- a/security/integrity/ima/ima_queue.c
+++ b/security/integrity/ima/ima_queue.c
@@ -22,6 +22,14 @@
#define AUDIT_CAUSE_LEN_MAX 32
+bool ima_flush_htable;
+static int __init ima_flush_htable_setup(char *str)
+{
+ ima_flush_htable = true;
+ return 1;
+}
+__setup("ima_flush_htable", ima_flush_htable_setup);
+
/* pre-allocated array of tpm_digest structures to extend a PCR */
static struct tpm_digest *digests;
@@ -114,6 +122,7 @@ static int ima_add_digest_entry(struct ima_template_entry *entry,
list_add_tail_rcu(&qe->later, &ima_measurements);
atomic_long_inc(&ima_htable.len);
+ atomic_long_inc(&ima_number_entries);
if (update_htable) {
key = ima_hash_key(entry->digests[ima_hash_algo_idx].digest);
hlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
@@ -220,6 +229,93 @@ int ima_add_template_entry(struct ima_template_entry *entry, int violation,
return result;
}
+/**
+ * ima_delete_event_log - delete IMA event entry
+ * @num_records: number of records to delete
+ *
+ * delete num_records entries off the measurement list.
+ * Returns num_records, or negative error code.
+ */
+long ima_delete_event_log(long num_records)
+{
+ long len, cur = num_records, tmp_len = 0;
+ struct ima_queue_entry *qe, *qe_tmp;
+ LIST_HEAD(ima_measurements_to_delete);
+ struct list_head *list_ptr;
+
+ if (!IS_ENABLED(CONFIG_IMA_LOG_TRIMMING))
+ return -EOPNOTSUPP;
+
+ if (num_records <= 0)
+ return num_records;
+
+ list_ptr = &ima_measurements;
+
+ len = atomic_long_read(&ima_number_entries);
+
+ if (num_records <= len) {
+ list_for_each_entry(qe, list_ptr, later) {
+ if (cur > 0) {
+ tmp_len += get_binary_runtime_size(qe->entry);
+ --cur;
+ }
+ if (cur == 0) {
+ qe_tmp = qe;
+ break;
+ }
+ }
+ }
+ else {
+ return -ENOENT;
+ }
+
+
+ mutex_lock(&ima_extend_list_mutex);
+ len = atomic_long_read(&ima_number_entries);
+
+ if (num_records == len) {
+ list_replace(&ima_measurements, &ima_measurements_to_delete);
+ INIT_LIST_HEAD(&ima_measurements);
+ atomic_long_set(&ima_number_entries, 0);
+ list_ptr = &ima_measurements_to_delete;
+ }
+ else {
+ __list_cut_position(&ima_measurements_to_delete, &ima_measurements,
+ &qe_tmp->later);
+ atomic_long_sub(num_records, &ima_number_entries);
+ if (IS_ENABLED(CONFIG_IMA_KEXEC))
+ binary_runtime_size -= tmp_len;
+ }
+
+ mutex_unlock(&ima_extend_list_mutex);
+
+ if (ima_flush_htable)
+ synchronize_rcu();
+
+ list_for_each_entry_safe(qe, qe_tmp, &ima_measurements_to_delete, later) {
+ /*
+ * Ok because after list delete qe is only accessed by
+ * ima_lookup_digest_entry().
+ */
+ for (int i = 0; i < qe->entry->template_desc->num_fields; i++) {
+ kfree(qe->entry->template_data[i].data);
+ qe->entry->template_data[i].data = NULL;
+ qe->entry->template_data[i].len = 0;
+ }
+
+ list_del(&qe->later);
+
+ /* No leak if !ima_flush_htable, referenced by ima_htable. */
+ if (ima_flush_htable) {
+ kfree(qe->entry->digests);
+ kfree(qe->entry);
+ kfree(qe);
+ }
+ }
+
+ return num_records;
+}
+
int ima_restore_measurement_entry(struct ima_template_entry *entry)
{
int result = 0;
--
2.43.0
* [PATCH v5 1/3] ima: make ima event log trimming configurable
From: steven chen @ 2026-04-01 17:29 UTC (permalink / raw)
To: linux-integrity
Cc: zohar, roberto.sassu, dmitry.kasatkin, eric.snowberg, corbet,
serge, paul, jmorris, linux-security-module, anirudhve, chenste,
gregorylumen, nramas, sushring, linux-doc
In-Reply-To: <20260401172956.4581-1-chenste@linux.microsoft.com>
Make ima event log trimming function configurable.
Suggested-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: steven chen <chenste@linux.microsoft.com>
---
security/integrity/ima/Kconfig | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/security/integrity/ima/Kconfig b/security/integrity/ima/Kconfig
index 976e75f9b9ba..322964ae4772 100644
--- a/security/integrity/ima/Kconfig
+++ b/security/integrity/ima/Kconfig
@@ -332,4 +332,16 @@ config IMA_KEXEC_EXTRA_MEMORY_KB
If set to the default value of 0, an extra half page of memory for those
additional measurements will be allocated.
+config IMA_LOG_TRIMMING
+ bool "IMA Event Log Trimming"
+ default n
+ help
+ Say Y here if you want support for IMA Event Log Trimming.
+ This creates the file /sys/kernel/security/integrity/ima/ima_trim_log.
+ Userspace
+ - writes to this file to trigger IMA event log trimming
+ - reads this file to get number of entried trimming last time
+
+ If unsure, say N.
+
endif
--
2.43.0
* [PATCH v5 0/3] Trim N entries of IMA event logs
From: steven chen @ 2026-04-01 17:29 UTC (permalink / raw)
To: linux-integrity
Cc: zohar, roberto.sassu, dmitry.kasatkin, eric.snowberg, corbet,
serge, paul, jmorris, linux-security-module, anirudhve, chenste,
gregorylumen, nramas, sushring, linux-doc
The Integrity Measurement Architecture (IMA) maintains a measurement list
—a record of system events used for integrity verification. The IMA event
logs are the entries within this measurement list, each representing a
specific event or measurement that contributes to the system's integrity
assessment.
This update introduces the ability to trim, or remove, N entries from the
current measurement list. Trimming involves deleting N entries from the
list. This action atomically truncates the measurement list, ensuring that
no new measurements can be added until the operation is complete.
Importantly, only one writer can initiate this trimming process at a time,
maintaining consistency and preventing race conditions.
A userspace interface, ima_trim_log, has been provided for this purpose.
When this interface is read, it returns the total number T of entries
trimmed since system boot up. This value T needs to be preserved across
kexec soft reboots. By writing two numbers, T:N, to this interface, userspace
can request the kernel to trim N entries from the IMA event logs.
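As a rough illustration of the "T:N" request format, the following
userspace C sketch splits such a string into its two numbers with an
overflow-safe decimal parser. The helper names parse_decimal() and
parse_trim_request() are hypothetical and not part of this series; the
kernel-side parsing in the patch may differ in detail.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>

/*
 * Hypothetical helper (not from the patch): parse one non-negative
 * decimal number at *s and advance *s past its digits.
 * Returns 0 on success, -EINVAL if no digit is found,
 * -ERANGE if the value would exceed LONG_MAX.
 */
int parse_decimal(const char **s, long *out)
{
	const char *p = *s;
	long val = 0;

	if (!isdigit((unsigned char)*p))
		return -EINVAL;

	while (isdigit((unsigned char)*p)) {
		int d = *p - '0';

		/* val * 10 + d > LONG_MAX  <=>  val > (LONG_MAX - d) / 10 */
		if (val > (LONG_MAX - d) / 10)
			return -ERANGE;
		val = val * 10 + d;
		p++;
	}

	*s = p;
	*out = val;
	return 0;
}

/*
 * Hypothetical helper: split "T:N" (optionally newline-terminated)
 * into t and n. Anything after N other than one '\n' is rejected.
 */
int parse_trim_request(const char *s, long *t, long *n)
{
	int ret;

	ret = parse_decimal(&s, t);
	if (ret)
		return ret;
	if (*s != ':')
		return -EINVAL;
	s++;	/* skip ':' */
	ret = parse_decimal(&s, n);
	if (ret)
		return ret;
	if (*s == '\n')
		s++;
	return *s == '\0' ? 0 : -EINVAL;
}
```

The overflow test is done before the multiplication, so the
intermediate value never exceeds LONG_MAX.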
To maintain a complete record, userspace is responsible for concatenating
and storing the logs before initiating trimming. Userspace can then send
the collected data to remote verifiers for validation. After receiving
confirmation from the remote verifiers, userspace may instruct the kernel
to proceed with trimming the IMA event logs accordingly.
The primary benefit of this solution is the ability to free valuable
kernel memory by delegating the task of reconstructing the full
measurement list from log chunks to userspace. Trust is not required in
userspace for the integrity of the measurement list, as its integrity is
cryptographically protected by the Trusted Platform Module (TPM).
Multiple readers are allowed to access the ima_trim_log interface
concurrently, while only one writer can trigger log trimming at any time.
During trimming, readers do not see the list and cannot access it while
deletion is in progress, ensuring atomicity.
Introduce the new kernel option ima_flush_htable to decide whether or not
the digests of measurement entries are flushed from the hash table (from
reference [2]).
The ima_measure_users counter (protected by the ima_measure_lock mutex) has
been introduced to protect access to the measurement list. The open
method of all the measurement interfaces has been extended to allow either
one writer at a time or, alternatively, multiple readers. The write
permission is used to stage/delete the measurements, the read permission
to read them. Write also requires the CAP_SYS_ADMIN capability (from
reference [2]). The ima_measure_users counter needs to be preserved across
kexec soft reboots.
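The counter semantics can be modeled in a few lines of userspace C.
This is only a sketch under simplifying assumptions: struct
measure_users, measure_open() and measure_release() are hypothetical
names, and the locking that ima_measure_lock provides in the kernel is
omitted because the model is single-threaded.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Model of the open/release accounting:
 *   users > 0   that many readers hold the list
 *   users == -1 one writer holds the list
 *   users == 0  idle
 */
struct measure_users {
	long users;
};

/* Returns 0 if the open is admitted, -EBUSY otherwise. */
int measure_open(struct measure_users *m, bool write)
{
	/* A writer needs the list idle; a reader only needs no writer. */
	if ((write && m->users != 0) || (!write && m->users < 0))
		return -EBUSY;

	if (write)
		m->users--;
	else
		m->users++;
	return 0;
}

/* Undo the accounting done by a successful measure_open(). */
void measure_release(struct measure_users *m, bool write)
{
	if (write)
		m->users++;
	else
		m->users--;
}
```

With this scheme a second reader is admitted while the first still
holds the file open, but a writer gets -EBUSY until every reader has
released it.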
The total trimmed number T and ima_measure_users both need to be
preserved across kexec soft reboots; a new patch will be added for this
purpose in the next version.
A new IMA log trim event is added when trimming finishes.
The time required for trimming is minimal, and IMA event logs are briefly
on hold during this process, preventing read or add operations. This short
interruption has no impact on the overall functionality of IMA.
A new critical data record "ima_log_trim" is added in this solution. This
record logs the trim event with the total number T of entries deleted since
system start and the time when the trim happened. User space can get the
total number T of entries trimmed by checking the "ima_log_trim" event in
the measurement list.
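For illustration, patch 3/3 formats the record payload as
"time= %llu; number= %lu;". Assuming user space has already extracted
that payload from the measurement list entry as a NUL-terminated text
string (how it does so is outside this sketch), a hypothetical helper
could recover both fields like this:

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of this series): parse the payload of
 * an "ima_log_trim" record. Returns 0 on success, -1 if the string
 * does not match the expected "time= <ns>; number= <count>;" layout.
 */
int parse_trim_event(const char *payload, unsigned long long *time_ns,
		     unsigned long *number)
{
	int matched;

	matched = sscanf(payload, "time= %llu; number= %lu;",
			 time_ns, number);
	return matched == 2 ? 0 : -1;
}
```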
The following is how user space uses the measurement list and the
ima_trim_log interface:
1. get the total number trimmed T through the ima_trim_log interface
2. get the PCR quote
3. read the measurement list file, close the file, send for verification
4. wait for a response from the verifier, until a good response arrives
with a number N that matches the PCR quote obtained in step 2
5. get the number N from the above message
6. write T:N to the ima_trim_log interface when there is no conflict
When the kernel gets a log trim request T:N:
Get T and compare it with the total trimmed number
if equal, trim N entries and change T to T+N
else return an error
With this way of trimming the log, user space holds the list only for
the T:N trim operation itself in step 6. The race condition between
user space agents is solved this way too.
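The T:N check can be modeled as a small state machine. The sketch
below is a userspace model only: struct trim_state and trim_request()
are hypothetical names, the error codes mirror the ones used in patch
2/3 (-EINVAL for a stale T, -ENOENT when fewer than N records exist),
and rejecting a negative N with -EINVAL is an assumption of the model.

```c
#include <errno.h>

/* Hypothetical model of the kernel-side state. */
struct trim_state {
	long entries;	/* records currently in the measurement list */
	long trimmed;	/* T: total records trimmed since boot */
};

/*
 * Model of one "T:N" request. Returns the number of records trimmed,
 * or a negative error code; the state is left untouched on error.
 */
long trim_request(struct trim_state *s, long t, long n)
{
	if (t != s->trimmed)	/* someone else trimmed since T was read */
		return -EINVAL;
	if (n < 0)		/* assumption of this model */
		return -EINVAL;
	if (n > s->entries)
		return -ENOENT;

	s->entries -= n;
	s->trimmed += n;
	return n;
}
```

If two agents both read T = 0 and each asks to trim 40 records, the
first request succeeds and moves T to 40; the second then carries a
stale T and is rejected, so the same records are never trimmed twice.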
References:
-----------
[1] [PATCH 0/1] Trim N entries of IMA event logs
https://lore.kernel.org/linux-integrity/20251202232857.8211-1-chenste@linux.microsoft.com/T/#t
[2] [RFC][PATCH] ima: Add support for staging measurements for deletion
https://lore.kernel.org/linux-integrity/207fd6d7-53c-57bb-36d8-13a0902052d1@linux.microsoft.com/T/#t
[3] [PATCH v2 0/1] Trim N entries of IMA event logs
https://lore.kernel.org/linux-integrity/20251210235314.3341-1-chenste@linux.microsoft.com/T/#t
[4] [PATCH v3 0/3] Trim N entries of IMA event logs
https://lore.kernel.org/linux-integrity/20260106020713.3994-1-chenste@linux.microsoft.com/T/#t
[5] [PATCH v4 0/3] Trim N entries of IMA event logs
https://lore.kernel.org/linux-integrity/20260205235849.7086-1-chenste@linux.microsoft.com/T/#t
Change Log v5:
- lock time performance improvement
- Keep the hash table unchanged because the log already uses the hash values
- Updated patch descriptions as necessary.
Change Log v4:
- Incorporated feedback from Roberto on v3 series.
- Update "ima_log_trim" interface definition
Reading this interface returns the total number T of records trimmed
Writing T:N to this interface trims N records
- Update user space use case on how to trim IMA event logs
- Updated patch descriptions as necessary.
Change Log v3:
- Incorporated feedback from Mimi on v2 series.
- split patch into multiple patches
- lock time performance improvement
- Updated patch descriptions as necessary.
Change Log v2:
- Incorporated feedback from Roberto on v1 series.
- Adapted code from Roberto's RFC [Reference 2]
- Add IMA log trim event log to record trim event
- Updated patch descriptions as necessary
steven chen (3):
ima: make ima event log trimming configurable
ima: trim N IMA event log records
ima: add new critical data record to measure log trim
.../admin-guide/kernel-parameters.txt | 4 +
security/integrity/ima/Kconfig | 12 +
security/integrity/ima/ima.h | 4 +-
security/integrity/ima/ima_fs.c | 218 +++++++++++++++++-
security/integrity/ima/ima_kexec.c | 2 +-
security/integrity/ima/ima_queue.c | 96 ++++++++
6 files changed, 328 insertions(+), 8 deletions(-)
--
2.43.0
* Re: [PATCH v6] hwmon: add driver for ARCTIC Fan Controller
From: Guenter Roeck @ 2026-04-01 17:34 UTC (permalink / raw)
To: Aureo Serrano de Souza, linux-hwmon
Cc: linux, corbet, skhan, linux-doc, linux-kernel
In-Reply-To: <20260401153949.77488-1-aureo.serrano@arctic.de>
On 4/1/26 08:39, Aureo Serrano de Souza wrote:
> Add hwmon driver for the ARCTIC Fan Controller, a USB HID device
> (VID 0x3904, PID 0xF001) with 10 fan channels. Exposes fan speed in
> RPM (read-only) and PWM duty cycle (0-255, read/write) via sysfs.
>
> The device pushes IN reports at ~1 Hz containing RPM readings. PWM is
> set via OUT reports; the device applies the new duty cycle and sends
> back a 2-byte ACK (Report ID 0x02). The driver waits up to 1 s for
> the ACK using a completion. Measured device latency: max ~563 ms over
> 500 iterations. PWM control is manual-only: the device never changes
> duty cycle autonomously.
>
> raw_event() may run in hardirq context, so fan_rpm[] is protected by
> a spinlock with irq-save. pwm_duty[] is also protected by this spinlock
> because reset_resume() clears it outside the hwmon core lock. The OUT
> report buffer is built and write_pending is armed under the same lock so
> that no reset_resume() can race with the pwm_duty[] snapshot. priv->buf
> is exclusively accessed by write(), which the hwmon core serializes.
>
> Signed-off-by: Aureo Serrano de Souza <aureo.serrano@arctic.de>
This must be a resend because it does not appear to be corrupted.
Please mention that in the future.
Sashiko still has a couple of comments:
https://sashiko.dev/#/patchset/20260401153949.77488-1-aureo.serrano%40arctic.de
Ignore the cache line comment. The concern about the redundant hid_device_io_stop()
seems real, though. Please check.
Thanks,
Guenter
> ---
> Thanks to Guenter Roeck and Thomas Weißschuh for the reviews.
>
> Changes since v5:
> - arctic_fan_probe(): switch from devm_hwmon_device_register_with_info()
> to hwmon_device_register_with_info(); store the returned pointer in
> priv->hwmon_dev for explicit teardown in remove()
> - arctic_fan_remove(): call hwmon_device_unregister(priv->hwmon_dev)
> before hid_device_io_stop/hid_hw_close/hid_hw_stop; this closes the
> use-after-free window where a concurrent sysfs write could call
> hid_hw_output_report() on an already-stopped device; matches the
> removal pattern used by nzxt-smart2 and aquacomputer_d5next
> - arctic_fan_write(): expand write_pending comment to document the
> residual theoretical late-ACK race (unfixable without a correlation
> ID in the device ACK report) and its practical impossibility (observed
> max ACK latency ~563 ms, timeout 1 s; a delay > 1 s indicates a
> non-functional device)
> - arctic_fan_reset_resume(), arctic_fan_read(), arctic_fan_write():
> extend in_report_lock coverage to pwm_duty[]; reset_resume() clears
> pwm_duty[] outside the hwmon core lock, so all paths that read or
> write pwm_duty[] now hold in_report_lock to prevent a data race
> during resume
> - arctic_fan_write(): build the OUT report buffer inside in_report_lock
> so reset_resume() cannot clear pwm_duty[] between the pwm_duty[]
> snapshot and the buffer write; this makes the lock coverage complete
>
> Changes since v4:
> - arctic_fan_write(): switch to wait_for_completion_timeout() (non-
> interruptible); eliminates the signal-interrupted write case of the
> late-ACK race that write_pending could not fully prevent
> - arctic_fan_write(): guard pwm_duty[channel] commit with
> ack_status == 0 check; a device error ACK (status 0x01) no longer
> silently poisons the cached duty used in future OUT reports
> - arctic_fan_probe()/remove(): replace devm_add_action_or_reset() +
> no-op remove() with explicit hid_device_io_stop/hid_hw_close/
> hid_hw_stop in remove(); devm_add_action_or_reset() was called after
> hdev->driver = NULL, causing a NULL deref in hid_hw_close() on unbind
> - add reset_resume callback: device resets PWM to hardware defaults on
> power loss during suspend; driver now clears cached pwm_duty[] on
> reset-resume so stale pre-suspend values are not re-sent as if valid
> - Documentation/hwmon/arctic_fan_controller.rst: document suspend/
> resume behaviour and the updated pwm[1-10] read semantics
>
> Changes since v3:
> - buf[]: upgrade from __aligned(8) to ____cacheline_aligned so the
> DMA buffer occupies its own cache line, preventing false sharing with
> adjacent fan_rpm[]/pwm_duty[] fields on non-coherent architectures
> - arctic_fan_write(): add write_pending flag (protected by
> in_report_lock) so raw_event() delivers ACKs only while a write is
> in flight
> - arctic_fan_write(): commit pwm_duty[channel] only after the device
> ACKs the command; a failed or timed-out write no longer leaves a
> stale value in the cached duty state
> - arctic_fan_probe(): start IO (hid_device_io_start) before registering
> with hwmon; previously a sysfs write arriving between hwmon
> registration and io_start could send an OUT report whose ACK would be
> discarded by the HID core, causing a spurious timeout
> - Documentation/hwmon/arctic_fan_controller.rst: document that cached
> PWM values start at 0 (hardware state unknown at probe) and that each
> OUT report carries all 10 channel values
>
> Changes since v2:
> - buf[]: add __aligned(8) for DMA safety
> - ARCTIC_ACK_TIMEOUT_MS: restore 1000 ms; note observed max ~563 ms
> - arctic_fan_parse_report(): replace hwmon_lock/hwmon_unlock with
> spin_lock_irqsave; hwmon_lock() may sleep and is unsafe when
> raw_event() runs in hardirq/softirq context
> - arctic_fan_raw_event(): use spin_lock_irqsave for ACK path
> - arctic_fan_write(): use spin_lock_irqsave for completion reinit
> - arctic_fan_write(): clamp val to [0, 255] before u8 cast
> - remove priv->hwmon_dev (no longer needed)
>
> Changes since v1:
> - Use hid_dbg() instead of module_param debug flag
> - Move hid_device_id table adjacent to hid_driver struct
> - Use get_unaligned_le16() for RPM parsing
> - Remove impossible bounds/NULL checks; remove retry loop
> - Add hid_is_usb() guard
> - Do not update pwm_duty from IN reports (device is manual-only)
> - Add completion/ACK mechanism for OUT report acknowledgement
> - Add Documentation/hwmon/arctic_fan_controller.rst and MAINTAINERS
>
> diff --git a/Documentation/hwmon/arctic_fan_controller.rst b/Documentation/hwmon/arctic_fan_controller.rst
> new file mode 100644
> index 0000000000..b5be88ae46
> --- /dev/null
> +++ b/Documentation/hwmon/arctic_fan_controller.rst
> @@ -0,0 +1,56 @@
> +.. SPDX-License-Identifier: GPL-2.0-or-later
> +
> +Kernel driver arctic_fan_controller
> +=====================================
> +
> +Supported devices:
> +
> +* ARCTIC Fan Controller (USB HID, VID 0x3904, PID 0xF001)
> +
> +Author: Aureo Serrano de Souza <aureo.serrano@arctic.de>
> +
> +Description
> +-----------
> +
> +This driver provides hwmon support for the ARCTIC Fan Controller, a USB
> +Custom HID device with 10 fan channels. The device sends IN reports about
> +once per second containing current RPM values (bytes 11-30, 10 x uint16 LE).
> +Fan speed control is manual-only: the device does not change PWM
> +autonomously; it only applies a new duty cycle when it receives an OUT
> +report from the host.
> +
> +After the device applies an OUT report, it sends back a 2-byte ACK IN
> +report (Report ID 0x02, byte 1 = 0x00 on success) confirming the command
> +was applied.
> +
> +Usage notes
> +-----------
> +
> +Since it is a USB device, hotplug is supported. The device is autodetected.
> +
> +The device does not support GET_REPORT, so the driver cannot read back the
> +current hardware PWM state at probe time. The cached PWM values (readable
> +via pwm[1-10]) start at 0 and reflect only values that have been
> +successfully written. Because each OUT report carries all 10 channel values,
> +writing a single channel also sends the cached values for all other channels.
> +Users should set all channels to the desired values before relying on the
> +cached state.
> +
> +On system suspend, the device may lose power and reset its PWM channels to
> +hardware defaults. The driver clears its cached duty values on resume so
> +that reads reflect the unknown hardware state rather than stale pre-suspend
> +values. Userspace is responsible for re-applying the desired duty cycles
> +after resume.
> +
> +Sysfs entries
> +-------------
> +
> +================ ==============================================================
> +fan[1-10]_input Fan speed in RPM (read-only). Updated from IN reports at ~1 Hz.
> +pwm[1-10] PWM duty cycle (0-255). Write: sends an OUT report setting the
> + duty cycle (scaled from 0-255 to 0-100% for the device);
> + the cached value is updated only after the device ACKs the
> + command with a success status. Read: returns the last
> + successfully written value; initialized to 0 at driver load
> + and after resume (hardware state unknown).
> +================ ==============================================================
> diff --git a/Documentation/hwmon/index.rst b/Documentation/hwmon/index.rst
> index b2ca8513cf..c34713040e 100644
> --- a/Documentation/hwmon/index.rst
> +++ b/Documentation/hwmon/index.rst
> @@ -42,6 +42,7 @@ Hardware Monitoring Kernel Drivers
> aht10
> amc6821
> aquacomputer_d5next
> + arctic_fan_controller
> asb100
> asc7621
> aspeed-g6-pwm-tach
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 96ea84948d..ec3112bd41 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2053,6 +2053,13 @@ S: Maintained
> F: drivers/net/arcnet/
> F: include/uapi/linux/if_arcnet.h
>
> +ARCTIC FAN CONTROLLER DRIVER
> +M: Aureo Serrano de Souza <aureo.serrano@arctic.de>
> +L: linux-hwmon@vger.kernel.org
> +S: Maintained
> +F: Documentation/hwmon/arctic_fan_controller.rst
> +F: drivers/hwmon/arctic_fan_controller.c
> +
> ARM AND ARM64 SoC SUB-ARCHITECTURES (COMMON PARTS)
> M: Arnd Bergmann <arnd@arndb.de>
> M: Krzysztof Kozlowski <krzk@kernel.org>
> diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig
> index 328867242c..6c90a8dd40 100644
> --- a/drivers/hwmon/Kconfig
> +++ b/drivers/hwmon/Kconfig
> @@ -388,6 +388,18 @@ config SENSORS_APPLESMC
> Say Y here if you have an applicable laptop and want to experience
> the awesome power of applesmc.
>
> +config SENSORS_ARCTIC_FAN_CONTROLLER
> + tristate "ARCTIC Fan Controller"
> + depends on USB_HID
> + help
> + If you say yes here you get support for the ARCTIC Fan Controller,
> + a USB HID device (VID 0x3904, PID 0xF001) with 10 fan channels.
> + The driver exposes fan speed (RPM) and PWM control via the hwmon
> + sysfs interface.
> +
> + This driver can also be built as a module. If so, the module
> + will be called arctic_fan_controller.
> +
> config SENSORS_ARM_SCMI
> tristate "ARM SCMI Sensors"
> depends on ARM_SCMI_PROTOCOL
> diff --git a/drivers/hwmon/Makefile b/drivers/hwmon/Makefile
> index 5833c807c6..ef831c3375 100644
> --- a/drivers/hwmon/Makefile
> +++ b/drivers/hwmon/Makefile
> @@ -49,6 +49,7 @@ obj-$(CONFIG_SENSORS_ADT7475) += adt7475.o
> obj-$(CONFIG_SENSORS_AHT10) += aht10.o
> obj-$(CONFIG_SENSORS_APPLESMC) += applesmc.o
> obj-$(CONFIG_SENSORS_AQUACOMPUTER_D5NEXT) += aquacomputer_d5next.o
> +obj-$(CONFIG_SENSORS_ARCTIC_FAN_CONTROLLER) += arctic_fan_controller.o
> obj-$(CONFIG_SENSORS_ARM_SCMI) += scmi-hwmon.o
> obj-$(CONFIG_SENSORS_ARM_SCPI) += scpi-hwmon.o
> obj-$(CONFIG_SENSORS_AS370) += as370-hwmon.o
> diff --git a/drivers/hwmon/arctic_fan_controller.c b/drivers/hwmon/arctic_fan_controller.c
> new file mode 100644
> index 0000000000..2bfb003f01
> --- /dev/null
> +++ b/drivers/hwmon/arctic_fan_controller.c
> @@ -0,0 +1,370 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Linux hwmon driver for ARCTIC Fan Controller
> + *
> + * USB Custom HID device with 10 fan channels.
> + * Exposes fan RPM (input) and PWM (0-255) via hwmon. Device pushes IN reports
> + * at ~1 Hz; no GET_REPORT. OUT reports set PWM duty (bytes 1-10, 0-100%).
> + * PWM is manual-only: the device does not change duty autonomously, only
> + * when it receives an OUT report from the host.
> + */
> +
> +#include <linux/cache.h>
> +#include <linux/completion.h>
> +#include <linux/err.h>
> +#include <linux/hid.h>
> +#include <linux/hwmon.h>
> +#include <linux/jiffies.h>
> +#include <linux/minmax.h>
> +#include <linux/module.h>
> +#include <linux/spinlock.h>
> +#include <linux/string.h>
> +#include <linux/unaligned.h>
> +
> +#define ARCTIC_VID 0x3904
> +#define ARCTIC_PID 0xF001
> +#define ARCTIC_NUM_FANS 10
> +#define ARCTIC_OUTPUT_REPORT_ID 0x01
> +#define ARCTIC_REPORT_LEN 32
> +#define ARCTIC_RPM_OFFSET 11 /* bytes 11-30: 10 x uint16 LE */
> +/* ACK report: device sends Report ID 0x02, 2 bytes (ID + status) after applying OUT report */
> +#define ARCTIC_ACK_REPORT_ID 0x02
> +#define ARCTIC_ACK_REPORT_LEN 2
> +/*
> + * Time to wait for ACK report after send.
> + * Measured over 500 iterations: max ~563 ms. Keep 1 s as margin.
> + */
> +#define ARCTIC_ACK_TIMEOUT_MS 1000
> +
> +struct arctic_fan_data {
> + struct hid_device *hdev;
> + struct device *hwmon_dev; /* stored for explicit unregister in remove() */
> + spinlock_t in_report_lock; /* protects fan_rpm, ack_status, write_pending, pwm_duty */
> + struct completion in_report_received; /* ACK (ID 0x02) received in raw_event */
> + int ack_status; /* 0 = OK, negative errno on device error */
> + bool write_pending; /* true while an OUT report ACK is in flight */
> + u32 fan_rpm[ARCTIC_NUM_FANS];
> + u8 pwm_duty[ARCTIC_NUM_FANS]; /* 0-255 matching sysfs range; converted to 0-100 on send */
> + /*
> + * OUT report buffer. Cache-line aligned so it occupies its own cache
> + * line, preventing DMA cache-coherency issues with adjacent fields
> + * (fan_rpm[], pwm_duty[]) on non-coherent architectures.
> + * Embedded in the devm_kzalloc'd struct so it is heap-allocated and
> + * passes usb_hcd_map_urb_for_dma(). Serialized by the hwmon core.
> + */
> + u8 buf[ARCTIC_REPORT_LEN] ____cacheline_aligned;
> +};
> +
> +/*
> + * Parse RPM values from the periodic status report (10 x uint16 LE at rpm_off).
> + * pwm_duty is not updated from the report: the device is manual-only, so the
> + * host cache is the authoritative source for PWM.
> + * Called from raw_event which may run in IRQ context; must not sleep.
> + */
> +static void arctic_fan_parse_report(struct arctic_fan_data *priv, u8 *buf,
> + int len, int rpm_off)
> +{
> + unsigned long flags;
> + int i;
> +
> + if (len < rpm_off + 20)
> + return;
> +
> + spin_lock_irqsave(&priv->in_report_lock, flags);
> + for (i = 0; i < ARCTIC_NUM_FANS; i++)
> + priv->fan_rpm[i] = get_unaligned_le16(&buf[rpm_off + i * 2]);
> + spin_unlock_irqrestore(&priv->in_report_lock, flags);
> +}
> +
> +/*
> + * raw_event: IN reports.
> + *
> + * Status report: Report ID 0x01, 32 bytes:
> + * byte 0 = report ID, bytes 1-10 = PWM 0-100%, bytes 11-30 = 10 x RPM uint16 LE.
> + * Device pushes these at ~1 Hz; no GET_REPORT.
> + *
> + * ACK report: Report ID 0x02, 2 bytes:
> + * byte 0 = 0x02, byte 1 = status (0x00 = OK, 0x01 = ERROR).
> + * Sent once after accepting and applying an OUT report (ID 0x01).
> + */
> +static int arctic_fan_raw_event(struct hid_device *hdev,
> + struct hid_report *report, u8 *data, int size)
> +{
> + struct arctic_fan_data *priv = hid_get_drvdata(hdev);
> + unsigned long flags;
> +
> + hid_dbg(hdev, "arctic_fan: raw_event id=%u size=%d\n", report->id, size);
> +
> + if (report->id == ARCTIC_ACK_REPORT_ID && size == ARCTIC_ACK_REPORT_LEN) {
> + spin_lock_irqsave(&priv->in_report_lock, flags);
> + /*
> + * Only deliver if a write is in flight. This prevents a
> + * late-arriving ACK from a timed-out write from erroneously
> + * satisfying a subsequent write's completion wait.
> + */
> + if (priv->write_pending) {
> + priv->ack_status = data[1] == 0x00 ? 0 : -EIO;
> + complete(&priv->in_report_received);
> + }
> + spin_unlock_irqrestore(&priv->in_report_lock, flags);
> + return 0;
> + }
> +
> + if (report->id != ARCTIC_OUTPUT_REPORT_ID || size != ARCTIC_REPORT_LEN) {
> + hid_dbg(hdev, "arctic_fan: raw_event id=%u size=%d ignored\n",
> + report->id, size);
> + return 0;
> + }
> +
> + arctic_fan_parse_report(priv, data, size, ARCTIC_RPM_OFFSET);
> + return 0;
> +}
> +
> +static umode_t arctic_fan_is_visible(const void *data,
> + enum hwmon_sensor_types type,
> + u32 attr, int channel)
> +{
> + if (type == hwmon_fan && attr == hwmon_fan_input)
> + return 0444;
> + if (type == hwmon_pwm && attr == hwmon_pwm_input)
> + return 0644;
> + return 0;
> +}
> +
> +static int arctic_fan_read(struct device *dev, enum hwmon_sensor_types type,
> + u32 attr, int channel, long *val)
> +{
> + struct arctic_fan_data *priv = dev_get_drvdata(dev);
> + unsigned long flags;
> +
> + if (type == hwmon_fan && attr == hwmon_fan_input) {
> + spin_lock_irqsave(&priv->in_report_lock, flags);
> + *val = priv->fan_rpm[channel];
> + spin_unlock_irqrestore(&priv->in_report_lock, flags);
> + return 0;
> + }
> + if (type == hwmon_pwm && attr == hwmon_pwm_input) {
> + spin_lock_irqsave(&priv->in_report_lock, flags);
> + *val = priv->pwm_duty[channel];
> + spin_unlock_irqrestore(&priv->in_report_lock, flags);
> + return 0;
> + }
> + return -EINVAL;
> +}
> +
> +static int arctic_fan_write(struct device *dev, enum hwmon_sensor_types type,
> + u32 attr, int channel, long val)
> +{
> + struct arctic_fan_data *priv = dev_get_drvdata(dev);
> + u8 new_duty = (u8)clamp_val(val, 0, 255);
> + unsigned long flags;
> + unsigned long t;
> + int i, ret;
> +
> + /*
> + * Build the buffer and arm write_pending under in_report_lock so that
> + * reset_resume() cannot clear pwm_duty[] between the pwm_duty[] read
> + * and the buffer write, and raw_event() cannot deliver a stale ACK
> + * from a previous write into this write's completion.
> + *
> + * priv->buf is heap-allocated (embedded in the devm_kzalloc'd struct),
> + * satisfying usb_hcd_map_urb_for_dma(). Exclusively accessed by
> + * write() which the hwmon core serializes.
> + *
> + * pwm_duty[channel] is committed only after a positive device ACK so a
> + * failed or timed-out write does not corrupt the cached state.
> + *
> + * Residual theoretical race: if write A times out (write_pending
> + * cleared), write B sets write_pending = true, and a late ACK from
> + * write A—delayed beyond ARCTIC_ACK_TIMEOUT_MS—arrives during write
> + * B's pending window, it would falsely satisfy write B's completion.
> + * This cannot be prevented in driver code without protocol support
> + * (for example, a correlation ID echoed in the device ACK report).
> + * In testing, observed ACK latency stayed below the 1 s timeout
> + * (maximum ~563 ms over 500 iterations).
> + *
> + * The wait is non-interruptible so that a signal cannot cause write()
> + * to return early while the OUT report is already in flight; an
> + * interruptible early return would create the same late-ACK window
> + * without even the timeout guard.
> + * Serialized by the hwmon core: only one arctic_fan_write() at a time.
> + * Use irqsave to match the IRQ context in which raw_event may run.
> + */
> + spin_lock_irqsave(&priv->in_report_lock, flags);
> + priv->buf[0] = ARCTIC_OUTPUT_REPORT_ID;
> + for (i = 0; i < ARCTIC_NUM_FANS; i++) {
> + u8 d = i == channel ? new_duty : priv->pwm_duty[i];
> +
> + priv->buf[1 + i] = DIV_ROUND_CLOSEST((unsigned int)d * 100, 255);
> + }
> + priv->ack_status = -ETIMEDOUT;
> + priv->write_pending = true;
> + reinit_completion(&priv->in_report_received);
> + spin_unlock_irqrestore(&priv->in_report_lock, flags);
> +
> + ret = hid_hw_output_report(priv->hdev, priv->buf, ARCTIC_REPORT_LEN);
> + if (ret < 0) {
> + spin_lock_irqsave(&priv->in_report_lock, flags);
> + priv->write_pending = false;
> + spin_unlock_irqrestore(&priv->in_report_lock, flags);
> + return ret;
> + }
> +
> + t = wait_for_completion_timeout(&priv->in_report_received,
> + msecs_to_jiffies(ARCTIC_ACK_TIMEOUT_MS));
> + spin_lock_irqsave(&priv->in_report_lock, flags);
> + priv->write_pending = false;
> + /* Commit inside the lock so reset_resume() cannot race with this write */
> + if (t && priv->ack_status == 0)
> + priv->pwm_duty[channel] = new_duty;
> + spin_unlock_irqrestore(&priv->in_report_lock, flags);
> +
> + if (!t)
> + return -ETIMEDOUT;
> + return priv->ack_status; /* 0=OK, -EIO=device error */
> +}
> +
> +static const struct hwmon_ops arctic_fan_ops = {
> + .is_visible = arctic_fan_is_visible,
> + .read = arctic_fan_read,
> + .write = arctic_fan_write,
> +};
> +
> +static const struct hwmon_channel_info *arctic_fan_info[] = {
> + HWMON_CHANNEL_INFO(fan,
> + HWMON_F_INPUT, HWMON_F_INPUT, HWMON_F_INPUT,
> + HWMON_F_INPUT, HWMON_F_INPUT, HWMON_F_INPUT,
> + HWMON_F_INPUT, HWMON_F_INPUT, HWMON_F_INPUT,
> + HWMON_F_INPUT),
> + HWMON_CHANNEL_INFO(pwm,
> + HWMON_PWM_INPUT, HWMON_PWM_INPUT, HWMON_PWM_INPUT,
> + HWMON_PWM_INPUT, HWMON_PWM_INPUT, HWMON_PWM_INPUT,
> + HWMON_PWM_INPUT, HWMON_PWM_INPUT, HWMON_PWM_INPUT,
> + HWMON_PWM_INPUT),
> + NULL
> +};
> +
> +static const struct hwmon_chip_info arctic_fan_chip_info = {
> + .ops = &arctic_fan_ops,
> + .info = arctic_fan_info,
> +};
> +
> +static int arctic_fan_reset_resume(struct hid_device *hdev)
> +{
> + struct arctic_fan_data *priv = hid_get_drvdata(hdev);
> + unsigned long flags;
> +
> + /*
> + * The device resets its PWM channels to hardware defaults on power
> + * loss during suspend. Clear the cached duty values so they reflect
> + * the unknown hardware state, consistent with probe-time behaviour
> + * (the device has no GET_REPORT support). Hold in_report_lock so
> + * this does not race with a concurrent pwm read or write callback.
> + */
> + spin_lock_irqsave(&priv->in_report_lock, flags);
> + memset(priv->pwm_duty, 0, sizeof(priv->pwm_duty));
> + spin_unlock_irqrestore(&priv->in_report_lock, flags);
> + return 0;
> +}
> +
> +static int arctic_fan_probe(struct hid_device *hdev,
> + const struct hid_device_id *id)
> +{
> + struct arctic_fan_data *priv;
> + int ret;
> +
> + if (!hid_is_usb(hdev))
> + return -ENODEV;
> +
> + ret = hid_parse(hdev);
> + if (ret)
> + return ret;
> +
> + priv = devm_kzalloc(&hdev->dev, sizeof(*priv), GFP_KERNEL);
> + if (!priv)
> + return -ENOMEM;
> +
> + priv->hdev = hdev;
> + spin_lock_init(&priv->in_report_lock);
> + init_completion(&priv->in_report_received);
> + hid_set_drvdata(hdev, priv);
> +
> + ret = hid_hw_start(hdev, HID_CONNECT_DRIVER);
> + if (ret)
> + return ret;
> +
> + ret = hid_hw_open(hdev);
> + if (ret)
> + goto out_stop;
> +
> + /*
> + * Start IO before registering with hwmon. If IO were started after
> + * hwmon registration, a sysfs write arriving in that narrow window
> + * would send an OUT report but the ACK could not be delivered (the HID
> + * core discards events until io_started), causing a spurious timeout.
> + */
> + hid_device_io_start(hdev);
> +
> + /*
> + * Use the non-devm variant and store the pointer so remove() can
> + * call hwmon_device_unregister() before tearing down the HID
> + * transport. devm_hwmon_device_register_with_info() would defer
> + * unregistration until after remove() returns, leaving a window
> + * where a concurrent sysfs write could call hid_hw_output_report()
> + * on an already-stopped device (use-after-free).
> + */
> + priv->hwmon_dev = hwmon_device_register_with_info(&hdev->dev, "arctic_fan",
> + priv, &arctic_fan_chip_info,
> + NULL);
> + if (IS_ERR(priv->hwmon_dev)) {
> + ret = PTR_ERR(priv->hwmon_dev);
> + goto out_close;
> + }
> +
> + return 0;
> +
> +out_close:
> + hid_device_io_stop(hdev);
> + hid_hw_close(hdev);
> +out_stop:
> + hid_hw_stop(hdev);
> + return ret;
> +}
> +
> +static void arctic_fan_remove(struct hid_device *hdev)
> +{
> + struct arctic_fan_data *priv = hid_get_drvdata(hdev);
> +
> + /*
> + * Unregister hwmon before stopping the HID transport. This removes
> + * the sysfs files and waits for any in-progress write() callback to
> + * return, so no hwmon op can call hid_hw_output_report() after
> + * hid_hw_stop() frees the underlying USB resources.
> + * Matches the pattern used by nzxt-smart2 and aquacomputer_d5next.
> + */
> + hwmon_device_unregister(priv->hwmon_dev);
> + hid_device_io_stop(hdev);
> + hid_hw_close(hdev);
> + hid_hw_stop(hdev);
> +}
> +
> +static const struct hid_device_id arctic_fan_id_table[] = {
> + { HID_USB_DEVICE(ARCTIC_VID, ARCTIC_PID) },
> + { }
> +};
> +MODULE_DEVICE_TABLE(hid, arctic_fan_id_table);
> +
> +static struct hid_driver arctic_fan_driver = {
> + .name = "arctic_fan",
> + .id_table = arctic_fan_id_table,
> + .probe = arctic_fan_probe,
> + .remove = arctic_fan_remove,
> + .raw_event = arctic_fan_raw_event,
> + .reset_resume = arctic_fan_reset_resume,
> +};
> +
> +module_hid_driver(arctic_fan_driver);
> +
> +MODULE_AUTHOR("Aureo Serrano de Souza <aureo.serrano@arctic.de>");
> +MODULE_DESCRIPTION("HID hwmon driver for ARCTIC Fan Controller");
> +MODULE_LICENSE("GPL");
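
For readers following the report layout described in the patch, here is a
minimal userspace sketch of the two byte-level conversions: scaling a 0-255
duty to the 0-100 range carried in OUT report bytes 1-10, and decoding the
ten little-endian RPM words at bytes 11-30 of an IN report. This is not the
driver code. The helper names (duty_to_percent, build_out_report, parse_rpm)
and the short macro names are hypothetical and chosen for illustration; only
the offsets, sizes and the round-to-nearest scaling are taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_FANS    10    /* channels, per the patch */
#define REPORT_LEN  32    /* report size in bytes, including the ID */
#define OUT_ID      0x01  /* OUT report ID */
#define RPM_OFFSET  11    /* bytes 11-30: 10 x uint16 LE */

/* Scale a 0-255 duty to 0-100, rounding to nearest (as DIV_ROUND_CLOSEST does). */
static uint8_t duty_to_percent(uint8_t duty)
{
	return (uint8_t)((duty * 100u + 127u) / 255u);
}

/* Hypothetical helper: fill an OUT report carrying all ten channel values. */
static void build_out_report(uint8_t buf[REPORT_LEN],
			     const uint8_t duty[NUM_FANS])
{
	size_t i;

	memset(buf, 0, REPORT_LEN);
	buf[0] = OUT_ID;
	for (i = 0; i < NUM_FANS; i++)
		buf[1 + i] = duty_to_percent(duty[i]);
}

/* Hypothetical helper: decode the RPM words; returns -1 on a short report. */
static int parse_rpm(const uint8_t *buf, size_t len, uint16_t rpm[NUM_FANS])
{
	size_t i;

	if (len < RPM_OFFSET + 2 * NUM_FANS)
		return -1;

	for (i = 0; i < NUM_FANS; i++) {
		const uint8_t *p = &buf[RPM_OFFSET + 2 * i];

		/* Little-endian: low byte first. */
		rpm[i] = (uint16_t)(p[0] | (p[1] << 8));
	}
	return 0;
}
```

Because every OUT report carries all ten channels, a caller of
build_out_report() has to pass the complete duty array even when only one
channel changes, which is why the driver keeps a cached pwm_duty[] in the
first place.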
* Re: [PATCH v2] doc: Add CPU Isolation documentation
From: Steven Rostedt @ 2026-04-01 17:08 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Randy Dunlap, LKML, Anna-Maria Behnsen, Gabriele Monaco,
Ingo Molnar, Jonathan Corbet, Marcelo Tosatti, Marco Crivellari,
Michal Hocko, Paul E . McKenney, Peter Zijlstra, Phil Auld,
Thomas Gleixner, Valentin Schneider, Vlastimil Babka, Waiman Long,
linux-doc, Sebastian Andrzej Siewior, Bagas Sanjaya
In-Reply-To: <ac1HV1HLErp8GkZ6@localhost.localdomain>
On Wed, 1 Apr 2026 18:27:03 +0200
Frederic Weisbecker <frederic@kernel.org> wrote:
> > > +"CPU Isolation" means leaving a CPU exclusive to a given workload
> > > +without any undesired code interference from the kernel.
> > > +
> > > +Those interferences, commonly pointed out as "noise", can be triggered
> >
> > nit: "noise,"
>
> Thanks! I have applied all your suggestions except this one for now, because I don't
> really understand the typographic rule behind it. Any hint?
So this looks to be an American English thing (placing commas within the
quote), but from what I read, British English places the comma outside the
quote.
Here's one case where I'd much rather go the British English way. This also
means it's only incorrect to Americans ;-)
-- Steve