From: Tariq Toukan <tariqt@nvidia.com>
To: Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, <netdev@vger.kernel.org>,
Paolo Abeni <pabeni@redhat.com>
Cc: Adithya Jayachandran <ajayachandra@nvidia.com>,
Bobby Eshleman <bobbyeshleman@meta.com>,
Carolina Jubran <cjubran@nvidia.com>,
Cosmin Ratiu <cratiu@nvidia.com>,
Daniel Borkmann <daniel@iogearbox.net>,
Daniel Jurgens <danielj@nvidia.com>,
Daniel Zahka <daniel.zahka@gmail.com>, David Wei <dw@davidwei.uk>,
Donald Hunter <donald.hunter@gmail.com>,
Dragos Tatulea <dtatulea@nvidia.com>,
Jiri Pirko <jiri@nvidia.com>, Jiri Pirko <jiri@resnulli.us>,
Joe Damato <joe@dama.to>, Jonathan Corbet <corbet@lwn.net>,
Kees Cook <kees@kernel.org>, Leon Romanovsky <leon@kernel.org>,
<linux-doc@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
<linux-kselftest@vger.kernel.org>, <linux-rdma@vger.kernel.org>,
Mark Bloch <mbloch@nvidia.com>, Moshe Shemesh <moshe@nvidia.com>,
Or Har-Toov <ohartoov@nvidia.com>,
Parav Pandit <parav@nvidia.com>, Petr Machata <petrm@nvidia.com>,
Ratheesh Kannoth <rkannoth@marvell.com>,
Saeed Mahameed <saeedm@nvidia.com>,
Shahar Shitrit <shshitrit@nvidia.com>,
Shay Drori <shayd@nvidia.com>, Shuah Khan <shuah@kernel.org>,
Shuah Khan <skhan@linuxfoundation.org>,
Simon Horman <horms@kernel.org>,
Stanislav Fomichev <sdf@fomichev.me>,
Tariq Toukan <tariqt@nvidia.com>,
Willem de Bruijn <willemb@google.com>,
Gal Pressman <gal@nvidia.com>
Subject: [PATCH net-next V10 07/14] devlink: Allow rate node parents from other devlinks
Date: Wed, 1 Jul 2026 10:32:47 +0300
Message-ID: <20260701073254.754518-8-tariqt@nvidia.com>
In-Reply-To: <20260701073254.754518-1-tariqt@nvidia.com>
From: Cosmin Ratiu <cratiu@nvidia.com>
This commit makes use of the building blocks previously added to
implement cross-device rate nodes.
A new 'supported_cross_device_rate_nodes' bool is added to devlink_ops
which lets drivers advertise support for cross-device rate objects.
If enabled and if there is a common shared devlink instance, then:
- all rate objects will be stored in the top-most common nested instance
and
- rate objects can have parents from other devices sharing the same
common instance.
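
As a rough illustration of the storage rule above, here is a minimal userspace sketch in C. It is not kernel code: struct model_devlink, model_rate_storage() and model_can_share_parent() are hypothetical names invented for this example, and all locking and reference counting is left out. It only models how the instance that stores the rates is chosen by climbing the nesting chain while the current instance opts in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a devlink instance; not the kernel struct. */
struct model_devlink {
	/* Instance this one is nested in, or NULL for a top-level instance. */
	struct model_devlink *nested_in;
	/* Models ops->supported_cross_device_rate_nodes. */
	bool cross_device_rate_nodes;
};

/* Return the instance in which rate objects for 'dev' would be stored.
 * Climb while the current instance opts in and has an instance above it.
 * Locking and refcounting are intentionally omitted from this model.
 */
static struct model_devlink *model_rate_storage(struct model_devlink *dev)
{
	struct model_devlink *cur = dev;

	assert(dev != NULL);
	while (cur->cross_device_rate_nodes && cur->nested_in)
		cur = cur->nested_in;
	return cur;
}

/* In this model, two devices can share rate parents only when their rates
 * end up stored in the same instance.
 */
static bool model_can_share_parent(struct model_devlink *a,
				   struct model_devlink *b)
{
	return model_rate_storage(a) == model_rate_storage(b);
}
```

In this model, two opted-in instances nested in the same shared instance resolve to that shared instance, while an instance that does not set the flag keeps its rates to itself.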
Storing rates in the common shared ancestor is safe, because it is
reference counted by its nested devlink instances, so it's guaranteed to
outlive them. Furthermore, the shared devlink infra guarantees a given
nested devlink hierarchy is managed by the same driver.
The parent devlink from info->ctx is not locked, so none of its mutable
fields can be used. But parent setting only requires devlink pointer
comparisons. Additionally, since the shared devlink is locked, other
rate operations cannot happen concurrently.
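
The pointer-comparison point can be sketched in userspace C as follows. This is an assumption-laden model, not the patch itself: struct model_dev and model_check_rate_parent() are hypothetical names. It mirrors the shape of the check added to the rate-set and rate-new handlers, where the parent instance is only compared by address and never dereferenced.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a devlink instance and its ops flag. */
struct model_dev {
	/* Models ops->supported_cross_device_rate_nodes. */
	bool cross_device_rate_nodes;
};

/* Reject a parent from another instance unless 'dev' opted in.
 * 'parent_dev' may be NULL (no explicit parent device given). It is only
 * compared by address here, which is why it does not need to be locked.
 */
static int model_check_rate_parent(const struct model_dev *dev,
				   const struct model_dev *parent_dev)
{
	assert(dev != NULL);
	if (parent_dev && parent_dev != dev && !dev->cross_device_rate_nodes)
		return -EOPNOTSUPP;
	return 0;
}
```

A parent on the same instance is always accepted in this sketch; only a parent on a different instance depends on the flag.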
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../networking/devlink/devlink-port.rst | 2 +
include/net/devlink.h | 9 ++
net/devlink/core.c | 4 +-
net/devlink/rate.c | 86 +++++++++++++++++--
4 files changed, 92 insertions(+), 9 deletions(-)
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
index 9374ebe70f48..18aca77006d5 100644
--- a/Documentation/networking/devlink/devlink-port.rst
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -420,6 +420,8 @@ API allows to configure following rate object's parameters:
Parent node name. Parent node rate limits are considered as additional limits
to all node children limits. ``tx_max`` is an upper limit for children.
``tx_share`` is a total bandwidth distributed among children.
+ If the device supports cross-function scheduling, the parent can be from a
+ different function of the same underlying device.
``tc_bw``
Allow users to set the bandwidth allocation per traffic class on rate
diff --git a/include/net/devlink.h b/include/net/devlink.h
index dd546dbd57cf..ffe1ad5fb70b 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -1594,6 +1594,15 @@ struct devlink_ops {
struct devlink_rate *parent,
void *priv_child, void *priv_parent,
struct netlink_ext_ack *extack);
+ /* Indicates if cross-device rate nodes are supported.
+ * This also requires a shared common ancestor object all devices that
+ * could share rate nodes are nested in.
+ * If enabled, rate operations may be called on an instance with only
+ * the common ancestor lock held and *without that instance lock held*.
+ * It is the driver's responsibility to ensure proper serialization
+ * with other operations.
+ */
+ bool supported_cross_device_rate_nodes;
/**
* selftests_check() - queries if selftest is supported
* @devlink: devlink instance
diff --git a/net/devlink/core.c b/net/devlink/core.c
index ee26c50b4118..c53a42e17a58 100644
--- a/net/devlink/core.c
+++ b/net/devlink/core.c
@@ -534,6 +534,9 @@ void devlink_free(struct devlink *devlink)
{
ASSERT_DEVLINK_NOT_REGISTERED(devlink);
+ devl_lock(devlink);
+ WARN_ON(devlink_rates_check(devlink, NULL, NULL));
+ devl_unlock(devlink);
devlink_rel_put(devlink);
WARN_ON(!list_empty(&devlink->trap_policer_list));
@@ -544,7 +547,6 @@ void devlink_free(struct devlink *devlink)
WARN_ON(!list_empty(&devlink->resource_list));
WARN_ON(!list_empty(&devlink->dpipe_table_list));
WARN_ON(!list_empty(&devlink->sb_list));
- WARN_ON(devlink_rates_check(devlink, NULL, NULL));
WARN_ON(!list_empty(&devlink->linecard_list));
WARN_ON(!xa_empty(&devlink->ports));
diff --git a/net/devlink/rate.c b/net/devlink/rate.c
index 78a59d79c2ea..e727c8b8b33e 100644
--- a/net/devlink/rate.c
+++ b/net/devlink/rate.c
@@ -30,14 +30,42 @@ devlink_rate_leaf_get_from_info(struct devlink *devlink, struct genl_info *info)
return devlink_rate ?: ERR_PTR(-ENODEV);
}
+/* Repeatedly walks the nested devlink chain while cross device rate nodes are
+ * supported and finds the topmost instance where rates should be stored.
+ * That instance is locked, referenced and returned.
+ * When cross device rate nodes aren't supported the original devlink instance
+ * is returned.
+ */
static struct devlink *devl_rate_lock(struct devlink *devlink)
{
- return devlink;
+ struct devlink *rate_devlink = devlink, *parent;
+
+ devl_assert_locked(devlink);
+
+ while (rate_devlink->ops &&
+ rate_devlink->ops->supported_cross_device_rate_nodes) {
+ parent = devlink_nested_in_get_lock(rate_devlink);
+ if (!parent)
+ break;
+ if (rate_devlink != devlink) {
+ /* Unlock intermediate instances. */
+ devl_unlock(rate_devlink);
+ devlink_put(rate_devlink);
+ }
+ rate_devlink = parent;
+ }
+ return rate_devlink;
}
+/* Unlocks and puts 'rate devlink' if different than 'devlink'. */
static void devl_rate_unlock(struct devlink *devlink,
struct devlink *rate_devlink)
{
+ if (devlink == rate_devlink)
+ return;
+
+ devl_unlock(rate_devlink);
+ devlink_put(rate_devlink);
}
static struct devlink_rate *
@@ -121,6 +149,25 @@ static int devlink_rate_put_tc_bws(struct sk_buff *msg, u32 *tc_bw)
return -EMSGSIZE;
}
+static int devlink_nl_rate_parent_fill(struct sk_buff *msg,
+ struct devlink_rate *devlink_rate)
+{
+ struct devlink_rate *parent = devlink_rate->parent;
+ struct devlink *devlink = parent->devlink;
+
+ if (nla_put_string(msg, DEVLINK_ATTR_RATE_PARENT_NODE_NAME,
+ parent->name))
+ return -EMSGSIZE;
+
+ if (devlink != devlink_rate->devlink &&
+ devlink_nl_put_nested_handle(msg,
+ devlink_net(devlink_rate->devlink),
+ devlink, DEVLINK_ATTR_PARENT_DEV))
+ return -EMSGSIZE;
+
+ return 0;
+}
+
static int devlink_nl_rate_fill(struct sk_buff *msg,
struct devlink_rate *devlink_rate,
enum devlink_command cmd, u32 portid, u32 seq,
@@ -165,10 +212,9 @@ static int devlink_nl_rate_fill(struct sk_buff *msg,
devlink_rate->tx_weight))
goto nla_put_failure;
- if (devlink_rate->parent)
- if (nla_put_string(msg, DEVLINK_ATTR_RATE_PARENT_NODE_NAME,
- devlink_rate->parent->name))
- goto nla_put_failure;
+ if (devlink_rate->parent &&
+ devlink_nl_rate_parent_fill(msg, devlink_rate))
+ goto nla_put_failure;
if (devlink_rate_put_tc_bws(msg, devlink_rate->tc_bw))
goto nla_put_failure;
@@ -322,13 +368,14 @@ devlink_nl_rate_parent_node_set(struct devlink_rate *devlink_rate,
struct genl_info *info,
struct nlattr *nla_parent)
{
- struct devlink *devlink = devlink_rate->devlink;
+ struct devlink *devlink = devlink_rate->devlink, *parent_devlink;
const char *parent_name = nla_data(nla_parent);
const struct devlink_ops *ops = devlink->ops;
size_t len = strlen(parent_name);
struct devlink_rate *parent;
int err = -EOPNOTSUPP;
+ parent_devlink = devlink_nl_ctx(info)->parent_devlink ? : devlink;
parent = devlink_rate->parent;
if (parent && !len) {
@@ -346,7 +393,13 @@ devlink_nl_rate_parent_node_set(struct devlink_rate *devlink_rate,
refcount_dec(&parent->refcnt);
devlink_rate->parent = NULL;
} else if (len) {
- parent = devlink_rate_node_get_by_name(rate_devlink, devlink,
+ /* parent_devlink (when different than devlink) isn't locked,
+ * but the rate node devlink instance is, so nobody from the
+ * same group of devices sharing rates could change the used
+ * fields or unregister the parent.
+ */
+ parent = devlink_rate_node_get_by_name(rate_devlink,
+ parent_devlink,
parent_name);
if (IS_ERR(parent))
return -ENODEV;
@@ -633,9 +686,11 @@ static bool devlink_rate_set_ops_supported(const struct devlink_ops *ops,
int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *rate_devlink, *devlink = devlink_nl_ctx(info)->devlink;
+ struct devlink_nl_ctx *ctx = devlink_nl_ctx(info);
+ struct devlink *devlink = ctx->devlink;
struct devlink_rate *devlink_rate;
const struct devlink_ops *ops;
+ struct devlink *rate_devlink;
int err;
rate_devlink = devl_rate_lock(devlink);
@@ -652,6 +707,14 @@ int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)
goto unlock;
}
+ if (ctx->parent_devlink && ctx->parent_devlink != devlink &&
+ !ops->supported_cross_device_rate_nodes) {
+ NL_SET_ERR_MSG(info->extack,
+ "Cross-device rate parents aren't supported");
+ err = -EOPNOTSUPP;
+ goto unlock;
+ }
+
err = devlink_nl_rate_set(devlink_rate, rate_devlink, ops, info);
if (!err)
@@ -679,6 +742,13 @@ int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
if (!devlink_rate_set_ops_supported(ops, info, DEVLINK_RATE_TYPE_NODE))
return -EOPNOTSUPP;
+ if (ctx->parent_devlink && ctx->parent_devlink != devlink &&
+ !ops->supported_cross_device_rate_nodes) {
+ NL_SET_ERR_MSG(info->extack,
+ "Cross-device rate parents aren't supported");
+ return -EOPNOTSUPP;
+ }
+
rate_devlink = devl_rate_lock(devlink);
rate_node = devlink_rate_node_get_from_attrs(rate_devlink, devlink,
info->attrs);
--
2.44.0
Thread overview:
2026-07-01 7:32 [PATCH net-next V10 00/14] devlink and mlx5: Support cross-function rate scheduling Tariq Toukan
2026-07-01 7:32 ` [PATCH net-next V10 01/14] devlink: Update nested instance locking comment Tariq Toukan
2026-07-01 7:32 ` [PATCH net-next V10 02/14] devlink: Add a helper for getting a nested-in instance Tariq Toukan
2026-07-01 7:32 ` [PATCH net-next V10 03/14] devlink: Migrate from info->user_ptr to info->ctx Tariq Toukan
2026-07-01 7:32 ` [PATCH net-next V10 04/14] devlink: Decouple rate storage from associated devlink object Tariq Toukan
2026-07-01 7:32 ` [PATCH net-next V10 05/14] devlink: Add parent dev to devlink API Tariq Toukan
2026-07-01 7:32 ` [PATCH net-next V10 06/14] devlink: Allow parent dev for rate-set and rate-new Tariq Toukan
2026-07-01 7:32 ` Tariq Toukan [this message]
2026-07-01 7:32 ` [PATCH net-next V10 08/14] net/mlx5: qos: Use mlx5_lag_query_bond_speed to query LAG speed Tariq Toukan
2026-07-01 7:32 ` [PATCH net-next V10 09/14] net/mlx5: qos: Refactor vport QoS cleanup Tariq Toukan
2026-07-01 7:32 ` [PATCH net-next V10 10/14] net/mlx5: qos: Model the root node in the scheduling hierarchy Tariq Toukan
2026-07-01 7:32 ` [PATCH net-next V10 11/14] net/mlx5: qos: Remove qos domains and use shd Tariq Toukan
2026-07-01 7:32 ` [PATCH net-next V10 12/14] net/mlx5: qos: Support cross-device tx scheduling Tariq Toukan
2026-07-01 7:32 ` [PATCH net-next V10 13/14] selftests: drv-net: Add test for cross-esw rate scheduling Tariq Toukan
2026-07-01 7:32 ` [PATCH net-next V10 14/14] net/mlx5: Document devlink rates Tariq Toukan