* [PATCH net-next 01/10] devlink: Remove unused param of devlink_rate_nodes_check
2025-02-13 18:01 [PATCH net-next 00/10] devlink and mlx5: Introduce rate domains Tariq Toukan
@ 2025-02-13 18:01 ` Tariq Toukan
2025-02-18 2:54 ` Kalesh Anakkur Purayil
2025-02-13 18:01 ` [PATCH net-next 02/10] devlink: Store devlink rates in a rate domain Tariq Toukan
` (8 subsequent siblings)
9 siblings, 1 reply; 23+ messages in thread
From: Tariq Toukan @ 2025-02-13 18:01 UTC (permalink / raw)
To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko
Cc: Cosmin Ratiu, Carolina Jubran, Gal Pressman, Mark Bloch,
Donald Hunter, Jiri Pirko, Jonathan Corbet, Saeed Mahameed,
Leon Romanovsky, Tariq Toukan, netdev, linux-kernel, linux-doc,
linux-rdma
From: Cosmin Ratiu <cratiu@nvidia.com>
The 'mode' param is unused so remove it.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
net/devlink/dev.c | 4 ++--
net/devlink/devl_internal.h | 3 +--
net/devlink/rate.c | 3 +--
3 files changed, 4 insertions(+), 6 deletions(-)
diff --git a/net/devlink/dev.c b/net/devlink/dev.c
index d6e3db300acb..5ff5d9055a8d 100644
--- a/net/devlink/dev.c
+++ b/net/devlink/dev.c
@@ -713,10 +713,10 @@ int devlink_nl_eswitch_set_doit(struct sk_buff *skb, struct genl_info *info)
if (info->attrs[DEVLINK_ATTR_ESWITCH_MODE]) {
if (!ops->eswitch_mode_set)
return -EOPNOTSUPP;
- mode = nla_get_u16(info->attrs[DEVLINK_ATTR_ESWITCH_MODE]);
- err = devlink_rate_nodes_check(devlink, mode, info->extack);
+ err = devlink_rate_nodes_check(devlink, info->extack);
if (err)
return err;
+ mode = nla_get_u16(info->attrs[DEVLINK_ATTR_ESWITCH_MODE]);
err = ops->eswitch_mode_set(devlink, mode, info->extack);
if (err)
return err;
diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
index 14eaad9cfe35..9cc7a5f4a19f 100644
--- a/net/devlink/devl_internal.h
+++ b/net/devlink/devl_internal.h
@@ -297,8 +297,7 @@ int devlink_resources_validate(struct devlink *devlink,
struct genl_info *info);
/* Rates */
-int devlink_rate_nodes_check(struct devlink *devlink, u16 mode,
- struct netlink_ext_ack *extack);
+int devlink_rate_nodes_check(struct devlink *devlink, struct netlink_ext_ack *extack);
/* Linecards */
unsigned int devlink_linecard_index(struct devlink_linecard *linecard);
diff --git a/net/devlink/rate.c b/net/devlink/rate.c
index 8828ffaf6cbc..d163c61cdd98 100644
--- a/net/devlink/rate.c
+++ b/net/devlink/rate.c
@@ -561,8 +561,7 @@ int devlink_nl_rate_del_doit(struct sk_buff *skb, struct genl_info *info)
return err;
}
-int devlink_rate_nodes_check(struct devlink *devlink, u16 mode,
- struct netlink_ext_ack *extack)
+int devlink_rate_nodes_check(struct devlink *devlink, struct netlink_ext_ack *extack)
{
struct devlink_rate *devlink_rate;
--
2.45.0
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH net-next 01/10] devlink: Remove unused param of devlink_rate_nodes_check
2025-02-13 18:01 ` [PATCH net-next 01/10] devlink: Remove unused param of devlink_rate_nodes_check Tariq Toukan
@ 2025-02-18 2:54 ` Kalesh Anakkur Purayil
0 siblings, 0 replies; 23+ messages in thread
From: Kalesh Anakkur Purayil @ 2025-02-18 2:54 UTC (permalink / raw)
To: Tariq Toukan
Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko, Cosmin Ratiu, Carolina Jubran,
Gal Pressman, Mark Bloch, Donald Hunter, Jiri Pirko,
Jonathan Corbet, Saeed Mahameed, Leon Romanovsky, netdev,
linux-kernel, linux-doc, linux-rdma
On Thu, Feb 13, 2025 at 11:34 PM Tariq Toukan <tariqt@nvidia.com> wrote:
>
> From: Cosmin Ratiu <cratiu@nvidia.com>
>
> The 'mode' param is unused so remove it.
>
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
LGTM,
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
--
Regards,
Kalesh AP
* [PATCH net-next 02/10] devlink: Store devlink rates in a rate domain
2025-02-13 18:01 [PATCH net-next 00/10] devlink and mlx5: Introduce rate domains Tariq Toukan
2025-02-13 18:01 ` [PATCH net-next 01/10] devlink: Remove unused param of devlink_rate_nodes_check Tariq Toukan
@ 2025-02-13 18:01 ` Tariq Toukan
2025-02-13 18:01 ` [PATCH net-next 03/10] devlink: Serialize access to rate domains Tariq Toukan
` (7 subsequent siblings)
9 siblings, 0 replies; 23+ messages in thread
From: Tariq Toukan @ 2025-02-13 18:01 UTC (permalink / raw)
To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko
Cc: Cosmin Ratiu, Carolina Jubran, Gal Pressman, Mark Bloch,
Donald Hunter, Jiri Pirko, Jonathan Corbet, Saeed Mahameed,
Leon Romanovsky, Tariq Toukan, netdev, linux-kernel, linux-doc,
linux-rdma
From: Cosmin Ratiu <cratiu@nvidia.com>
Introduce the concept of a rate domain, a data structure aiming to store
devlink rates of one or more devlink objects such that rate nodes within
a rate domain can be parents of other rate nodes from the same domain.
This commit takes the first steps in that direction by:
- introducing a new 'devlink_rate_domain' data structure, allocated on
devlink init.
- storing devlink rates from the devlink object into the rate domain.
- converting all accesses to the new data structure.
- adding checks to everything that walks rates in a rate domain so that
only the correct devlink object is looked at. These checks are currently
redundant, since all rate domains are private to their devlink object,
but that is about to change, and this is a good moment to add them.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
net/devlink/core.c | 11 ++++++++--
net/devlink/dev.c | 2 +-
net/devlink/devl_internal.h | 7 ++++++-
net/devlink/rate.c | 41 ++++++++++++++++++++++---------------
4 files changed, 40 insertions(+), 21 deletions(-)
diff --git a/net/devlink/core.c b/net/devlink/core.c
index f49cd83f1955..06a2e2dce558 100644
--- a/net/devlink/core.c
+++ b/net/devlink/core.c
@@ -314,6 +314,7 @@ static void devlink_release(struct work_struct *work)
mutex_destroy(&devlink->lock);
lockdep_unregister_key(&devlink->lock_key);
put_device(devlink->dev);
+ kvfree(devlink->rate_domain);
kvfree(devlink);
}
@@ -424,6 +425,11 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
if (!devlink)
return NULL;
+ devlink->rate_domain = kvzalloc(sizeof(*devlink->rate_domain), GFP_KERNEL);
+ if (!devlink->rate_domain)
+ goto err_rate_domain;
+ INIT_LIST_HEAD(&devlink->rate_domain->rate_list);
+
ret = xa_alloc_cyclic(&devlinks, &devlink->index, devlink, xa_limit_31b,
&last_id, GFP_KERNEL);
if (ret < 0)
@@ -436,7 +442,6 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
xa_init_flags(&devlink->snapshot_ids, XA_FLAGS_ALLOC);
xa_init_flags(&devlink->nested_rels, XA_FLAGS_ALLOC);
write_pnet(&devlink->_net, net);
- INIT_LIST_HEAD(&devlink->rate_list);
INIT_LIST_HEAD(&devlink->linecard_list);
INIT_LIST_HEAD(&devlink->sb_list);
INIT_LIST_HEAD_RCU(&devlink->dpipe_table_list);
@@ -455,6 +460,8 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
return devlink;
err_xa_alloc:
+ kvfree(devlink->rate_domain);
+err_rate_domain:
kvfree(devlink);
return NULL;
}
@@ -477,7 +484,7 @@ void devlink_free(struct devlink *devlink)
WARN_ON(!list_empty(&devlink->resource_list));
WARN_ON(!list_empty(&devlink->dpipe_table_list));
WARN_ON(!list_empty(&devlink->sb_list));
- WARN_ON(!list_empty(&devlink->rate_list));
+ WARN_ON(!list_empty(&devlink->rate_domain->rate_list));
WARN_ON(!list_empty(&devlink->linecard_list));
WARN_ON(!xa_empty(&devlink->ports));
diff --git a/net/devlink/dev.c b/net/devlink/dev.c
index 5ff5d9055a8d..c926c75cc10d 100644
--- a/net/devlink/dev.c
+++ b/net/devlink/dev.c
@@ -434,7 +434,7 @@ static void devlink_reload_reinit_sanity_check(struct devlink *devlink)
WARN_ON(!list_empty(&devlink->trap_list));
WARN_ON(!list_empty(&devlink->dpipe_table_list));
WARN_ON(!list_empty(&devlink->sb_list));
- WARN_ON(!list_empty(&devlink->rate_list));
+ WARN_ON(!list_empty(&devlink->rate_domain->rate_list));
WARN_ON(!list_empty(&devlink->linecard_list));
WARN_ON(!xa_empty(&devlink->ports));
}
diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
index 9cc7a5f4a19f..209b4a4c7070 100644
--- a/net/devlink/devl_internal.h
+++ b/net/devlink/devl_internal.h
@@ -30,10 +30,14 @@ struct devlink_dev_stats {
u32 remote_reload_stats[DEVLINK_RELOAD_STATS_ARRAY_SIZE];
};
+/* Stores devlink rates associated with a rate domain. */
+struct devlink_rate_domain {
+ struct list_head rate_list;
+};
+
struct devlink {
u32 index;
struct xarray ports;
- struct list_head rate_list;
struct list_head sb_list;
struct list_head dpipe_table_list;
struct list_head resource_list;
@@ -55,6 +59,7 @@ struct devlink {
*/
struct mutex lock;
struct lock_class_key lock_key;
+ struct devlink_rate_domain *rate_domain;
u8 reload_failed:1;
refcount_t refcount;
struct rcu_work rwork;
diff --git a/net/devlink/rate.c b/net/devlink/rate.c
index d163c61cdd98..535863bb0c17 100644
--- a/net/devlink/rate.c
+++ b/net/devlink/rate.c
@@ -36,8 +36,9 @@ devlink_rate_node_get_by_name(struct devlink *devlink, const char *node_name)
{
static struct devlink_rate *devlink_rate;
- list_for_each_entry(devlink_rate, &devlink->rate_list, list) {
- if (devlink_rate_is_node(devlink_rate) &&
+ list_for_each_entry(devlink_rate, &devlink->rate_domain->rate_list, list) {
+ if (devlink_rate->devlink == devlink &&
+ devlink_rate_is_node(devlink_rate) &&
!strcmp(node_name, devlink_rate->name))
return devlink_rate;
}
@@ -166,16 +167,18 @@ void devlink_rates_notify_register(struct devlink *devlink)
{
struct devlink_rate *rate_node;
- list_for_each_entry(rate_node, &devlink->rate_list, list)
- devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
+ list_for_each_entry(rate_node, &devlink->rate_domain->rate_list, list)
+ if (rate_node->devlink == devlink)
+ devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
}
void devlink_rates_notify_unregister(struct devlink *devlink)
{
struct devlink_rate *rate_node;
- list_for_each_entry_reverse(rate_node, &devlink->rate_list, list)
- devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_DEL);
+ list_for_each_entry_reverse(rate_node, &devlink->rate_domain->rate_list, list)
+ if (rate_node->devlink == devlink)
+ devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_DEL);
}
static int
@@ -187,7 +190,7 @@ devlink_nl_rate_get_dump_one(struct sk_buff *msg, struct devlink *devlink,
int idx = 0;
int err = 0;
- list_for_each_entry(devlink_rate, &devlink->rate_list, list) {
+ list_for_each_entry(devlink_rate, &devlink->rate_domain->rate_list, list) {
enum devlink_command cmd = DEVLINK_CMD_RATE_NEW;
u32 id = NETLINK_CB(cb->skb).portid;
@@ -195,6 +198,9 @@ devlink_nl_rate_get_dump_one(struct sk_buff *msg, struct devlink *devlink,
idx++;
continue;
}
+ if (devlink_rate->devlink != devlink)
+ continue;
+
err = devlink_nl_rate_fill(msg, devlink_rate, cmd, id,
cb->nlh->nlmsg_seq, flags, NULL);
if (err) {
@@ -522,7 +528,7 @@ int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
goto err_rate_set;
refcount_set(&rate_node->refcnt, 1);
- list_add(&rate_node->list, &devlink->rate_list);
+ list_add(&rate_node->list, &devlink->rate_domain->rate_list);
devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
return 0;
@@ -565,8 +571,9 @@ int devlink_rate_nodes_check(struct devlink *devlink, struct netlink_ext_ack *ex
{
struct devlink_rate *devlink_rate;
- list_for_each_entry(devlink_rate, &devlink->rate_list, list)
- if (devlink_rate_is_node(devlink_rate)) {
+ list_for_each_entry(devlink_rate, &devlink->rate_domain->rate_list, list)
+ if (devlink_rate->devlink == devlink &&
+ devlink_rate_is_node(devlink_rate)) {
NL_SET_ERR_MSG(extack, "Rate node(s) exists.");
return -EBUSY;
}
@@ -612,7 +619,7 @@ devl_rate_node_create(struct devlink *devlink, void *priv, char *node_name,
}
refcount_set(&rate_node->refcnt, 1);
- list_add(&rate_node->list, &devlink->rate_list);
+ list_add(&rate_node->list, &devlink->rate_domain->rate_list);
devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
return rate_node;
}
@@ -650,7 +657,7 @@ int devl_rate_leaf_create(struct devlink_port *devlink_port, void *priv,
devlink_rate->devlink = devlink;
devlink_rate->devlink_port = devlink_port;
devlink_rate->priv = priv;
- list_add_tail(&devlink_rate->list, &devlink->rate_list);
+ list_add_tail(&devlink_rate->list, &devlink->rate_domain->rate_list);
devlink_port->devlink_rate = devlink_rate;
devlink_rate_notify(devlink_rate, DEVLINK_CMD_RATE_NEW);
@@ -691,13 +698,13 @@ EXPORT_SYMBOL_GPL(devl_rate_leaf_destroy);
*/
void devl_rate_nodes_destroy(struct devlink *devlink)
{
- static struct devlink_rate *devlink_rate, *tmp;
+ struct devlink_rate *devlink_rate, *tmp;
const struct devlink_ops *ops = devlink->ops;
devl_assert_locked(devlink);
- list_for_each_entry(devlink_rate, &devlink->rate_list, list) {
- if (!devlink_rate->parent)
+ list_for_each_entry(devlink_rate, &devlink->rate_domain->rate_list, list) {
+ if (!devlink_rate->parent || devlink_rate->devlink != devlink)
continue;
refcount_dec(&devlink_rate->parent->refcnt);
@@ -708,8 +715,8 @@ void devl_rate_nodes_destroy(struct devlink *devlink)
ops->rate_node_parent_set(devlink_rate, NULL, devlink_rate->priv,
NULL, NULL);
}
- list_for_each_entry_safe(devlink_rate, tmp, &devlink->rate_list, list) {
- if (devlink_rate_is_node(devlink_rate)) {
+ list_for_each_entry_safe(devlink_rate, tmp, &devlink->rate_domain->rate_list, list) {
+ if (devlink_rate->devlink == devlink && devlink_rate_is_node(devlink_rate)) {
ops->rate_node_del(devlink_rate, devlink_rate->priv, NULL);
list_del(&devlink_rate->list);
kfree(devlink_rate->name);
--
2.45.0
* [PATCH net-next 03/10] devlink: Serialize access to rate domains
2025-02-13 18:01 [PATCH net-next 00/10] devlink and mlx5: Introduce rate domains Tariq Toukan
2025-02-13 18:01 ` [PATCH net-next 01/10] devlink: Remove unused param of devlink_rate_nodes_check Tariq Toukan
2025-02-13 18:01 ` [PATCH net-next 02/10] devlink: Store devlink rates in a rate domain Tariq Toukan
@ 2025-02-13 18:01 ` Tariq Toukan
2025-02-14 12:54 ` Jiri Pirko
2025-02-13 18:01 ` [PATCH net-next 04/10] devlink: Introduce shared " Tariq Toukan
` (6 subsequent siblings)
9 siblings, 1 reply; 23+ messages in thread
From: Tariq Toukan @ 2025-02-13 18:01 UTC (permalink / raw)
To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko
Cc: Cosmin Ratiu, Carolina Jubran, Gal Pressman, Mark Bloch,
Donald Hunter, Jiri Pirko, Jonathan Corbet, Saeed Mahameed,
Leon Romanovsky, Tariq Toukan, netdev, linux-kernel, linux-doc,
linux-rdma
From: Cosmin Ratiu <cratiu@nvidia.com>
Access to rates in a rate domain should be serialized.
This commit introduces two new functions, devl_rate_domain_lock and
devl_rate_domain_unlock, and uses them whenever serial access to a rate
domain is needed. For now, they are no-ops since access to a rate domain
is protected by the devlink lock.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
net/devlink/devl_internal.h | 4 ++
net/devlink/rate.c | 114 +++++++++++++++++++++++++++---------
2 files changed, 89 insertions(+), 29 deletions(-)
diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
index 209b4a4c7070..fae81dd6953f 100644
--- a/net/devlink/devl_internal.h
+++ b/net/devlink/devl_internal.h
@@ -121,6 +121,10 @@ static inline void devl_dev_unlock(struct devlink *devlink, bool dev_lock)
device_unlock(devlink->dev);
}
+static inline void devl_rate_domain_lock(struct devlink *devlink) { }
+
+static inline void devl_rate_domain_unlock(struct devlink *devlink) { }
+
typedef void devlink_rel_notify_cb_t(struct devlink *devlink, u32 obj_index);
typedef void devlink_rel_cleanup_cb_t(struct devlink *devlink, u32 obj_index,
u32 rel_index);
diff --git a/net/devlink/rate.c b/net/devlink/rate.c
index 535863bb0c17..54e6a9893e3d 100644
--- a/net/devlink/rate.c
+++ b/net/devlink/rate.c
@@ -167,18 +167,22 @@ void devlink_rates_notify_register(struct devlink *devlink)
{
struct devlink_rate *rate_node;
+ devl_rate_domain_lock(devlink);
list_for_each_entry(rate_node, &devlink->rate_domain->rate_list, list)
if (rate_node->devlink == devlink)
devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
+ devl_rate_domain_unlock(devlink);
}
void devlink_rates_notify_unregister(struct devlink *devlink)
{
struct devlink_rate *rate_node;
+ devl_rate_domain_lock(devlink);
list_for_each_entry_reverse(rate_node, &devlink->rate_domain->rate_list, list)
if (rate_node->devlink == devlink)
devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_DEL);
+ devl_rate_domain_unlock(devlink);
}
static int
@@ -190,6 +194,7 @@ devlink_nl_rate_get_dump_one(struct sk_buff *msg, struct devlink *devlink,
int idx = 0;
int err = 0;
+ devl_rate_domain_lock(devlink);
list_for_each_entry(devlink_rate, &devlink->rate_domain->rate_list, list) {
enum devlink_command cmd = DEVLINK_CMD_RATE_NEW;
u32 id = NETLINK_CB(cb->skb).portid;
@@ -209,6 +214,7 @@ devlink_nl_rate_get_dump_one(struct sk_buff *msg, struct devlink *devlink,
}
idx++;
}
+ devl_rate_domain_unlock(devlink);
return err;
}
@@ -225,23 +231,33 @@ int devlink_nl_rate_get_doit(struct sk_buff *skb, struct genl_info *info)
struct sk_buff *msg;
int err;
+ devl_rate_domain_lock(devlink);
devlink_rate = devlink_rate_get_from_info(devlink, info);
- if (IS_ERR(devlink_rate))
- return PTR_ERR(devlink_rate);
+ if (IS_ERR(devlink_rate)) {
+ err = PTR_ERR(devlink_rate);
+ goto unlock;
+ }
msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
- if (!msg)
- return -ENOMEM;
+ if (!msg) {
+ err = -ENOMEM;
+ goto unlock;
+ }
err = devlink_nl_rate_fill(msg, devlink_rate, DEVLINK_CMD_RATE_NEW,
info->snd_portid, info->snd_seq, 0,
info->extack);
- if (err) {
- nlmsg_free(msg);
- return err;
- }
+ if (err)
+ goto err_fill;
+ devl_rate_domain_unlock(devlink);
return genlmsg_reply(msg, info);
+
+err_fill:
+ nlmsg_free(msg);
+unlock:
+ devl_rate_domain_unlock(devlink);
+ return err;
}
static bool
@@ -470,18 +486,24 @@ int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)
const struct devlink_ops *ops;
int err;
+ devl_rate_domain_lock(devlink);
devlink_rate = devlink_rate_get_from_info(devlink, info);
- if (IS_ERR(devlink_rate))
- return PTR_ERR(devlink_rate);
+ if (IS_ERR(devlink_rate)) {
+ err = PTR_ERR(devlink_rate);
+ goto unlock;
+ }
ops = devlink->ops;
- if (!ops || !devlink_rate_set_ops_supported(ops, info, devlink_rate->type))
- return -EOPNOTSUPP;
+ if (!ops || !devlink_rate_set_ops_supported(ops, info, devlink_rate->type)) {
+ err = -EOPNOTSUPP;
+ goto unlock;
+ }
err = devlink_nl_rate_set(devlink_rate, ops, info);
-
if (!err)
devlink_rate_notify(devlink_rate, DEVLINK_CMD_RATE_NEW);
+unlock:
+ devl_rate_domain_unlock(devlink);
return err;
}
@@ -501,15 +523,21 @@ int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
if (!devlink_rate_set_ops_supported(ops, info, DEVLINK_RATE_TYPE_NODE))
return -EOPNOTSUPP;
+ devl_rate_domain_lock(devlink);
rate_node = devlink_rate_node_get_from_attrs(devlink, info->attrs);
- if (!IS_ERR(rate_node))
- return -EEXIST;
- else if (rate_node == ERR_PTR(-EINVAL))
- return -EINVAL;
+ if (!IS_ERR(rate_node)) {
+ err = -EEXIST;
+ goto unlock;
+ } else if (rate_node == ERR_PTR(-EINVAL)) {
+ err = -EINVAL;
+ goto unlock;
+ }
rate_node = kzalloc(sizeof(*rate_node), GFP_KERNEL);
- if (!rate_node)
- return -ENOMEM;
+ if (!rate_node) {
+ err = -ENOMEM;
+ goto unlock;
+ }
rate_node->devlink = devlink;
rate_node->type = DEVLINK_RATE_TYPE_NODE;
@@ -530,6 +558,7 @@ int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
refcount_set(&rate_node->refcnt, 1);
list_add(&rate_node->list, &devlink->rate_domain->rate_list);
devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
+ devl_rate_domain_unlock(devlink);
return 0;
err_rate_set:
@@ -538,6 +567,8 @@ int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
kfree(rate_node->name);
err_strdup:
kfree(rate_node);
+unlock:
+ devl_rate_domain_unlock(devlink);
return err;
}
@@ -547,13 +578,17 @@ int devlink_nl_rate_del_doit(struct sk_buff *skb, struct genl_info *info)
struct devlink_rate *rate_node;
int err;
+ devl_rate_domain_lock(devlink);
rate_node = devlink_rate_node_get_from_info(devlink, info);
- if (IS_ERR(rate_node))
- return PTR_ERR(rate_node);
+ if (IS_ERR(rate_node)) {
+ err = PTR_ERR(rate_node);
+ goto unlock;
+ }
if (refcount_read(&rate_node->refcnt) > 1) {
NL_SET_ERR_MSG(info->extack, "Node has children. Cannot delete node.");
- return -EBUSY;
+ err = -EBUSY;
+ goto unlock;
}
devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_DEL);
@@ -564,20 +599,26 @@ int devlink_nl_rate_del_doit(struct sk_buff *skb, struct genl_info *info)
list_del(&rate_node->list);
kfree(rate_node->name);
kfree(rate_node);
+unlock:
+ devl_rate_domain_unlock(devlink);
return err;
}
int devlink_rate_nodes_check(struct devlink *devlink, struct netlink_ext_ack *extack)
{
struct devlink_rate *devlink_rate;
+ int err = 0;
+ devl_rate_domain_lock(devlink);
list_for_each_entry(devlink_rate, &devlink->rate_domain->rate_list, list)
if (devlink_rate->devlink == devlink &&
devlink_rate_is_node(devlink_rate)) {
NL_SET_ERR_MSG(extack, "Rate node(s) exists.");
- return -EBUSY;
+ err = -EBUSY;
+ break;
}
- return 0;
+ devl_rate_domain_unlock(devlink);
+ return err;
}
/**
@@ -595,13 +636,19 @@ devl_rate_node_create(struct devlink *devlink, void *priv, char *node_name,
{
struct devlink_rate *rate_node;
+ devl_rate_domain_lock(devlink);
+
rate_node = devlink_rate_node_get_by_name(devlink, node_name);
- if (!IS_ERR(rate_node))
- return ERR_PTR(-EEXIST);
+ if (!IS_ERR(rate_node)) {
+ rate_node = ERR_PTR(-EEXIST);
+ goto unlock;
+ }
rate_node = kzalloc(sizeof(*rate_node), GFP_KERNEL);
- if (!rate_node)
- return ERR_PTR(-ENOMEM);
+ if (!rate_node) {
+ rate_node = ERR_PTR(-ENOMEM);
+ goto unlock;
+ }
if (parent) {
rate_node->parent = parent;
@@ -615,12 +662,15 @@ devl_rate_node_create(struct devlink *devlink, void *priv, char *node_name,
rate_node->name = kstrdup(node_name, GFP_KERNEL);
if (!rate_node->name) {
kfree(rate_node);
- return ERR_PTR(-ENOMEM);
+ rate_node = ERR_PTR(-ENOMEM);
+ goto unlock;
}
refcount_set(&rate_node->refcnt, 1);
list_add(&rate_node->list, &devlink->rate_domain->rate_list);
devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
+unlock:
+ devl_rate_domain_unlock(devlink);
return rate_node;
}
EXPORT_SYMBOL_GPL(devl_rate_node_create);
@@ -648,6 +698,7 @@ int devl_rate_leaf_create(struct devlink_port *devlink_port, void *priv,
if (!devlink_rate)
return -ENOMEM;
+ devl_rate_domain_lock(devlink);
if (parent) {
devlink_rate->parent = parent;
refcount_inc(&devlink_rate->parent->refcnt);
@@ -660,6 +711,7 @@ int devl_rate_leaf_create(struct devlink_port *devlink_port, void *priv,
list_add_tail(&devlink_rate->list, &devlink->rate_domain->rate_list);
devlink_port->devlink_rate = devlink_rate;
devlink_rate_notify(devlink_rate, DEVLINK_CMD_RATE_NEW);
+ devl_rate_domain_unlock(devlink);
return 0;
}
@@ -680,11 +732,13 @@ void devl_rate_leaf_destroy(struct devlink_port *devlink_port)
if (!devlink_rate)
return;
+ devl_rate_domain_lock(devlink_port->devlink);
devlink_rate_notify(devlink_rate, DEVLINK_CMD_RATE_DEL);
if (devlink_rate->parent)
refcount_dec(&devlink_rate->parent->refcnt);
list_del(&devlink_rate->list);
devlink_port->devlink_rate = NULL;
+ devl_rate_domain_unlock(devlink_port->devlink);
kfree(devlink_rate);
}
EXPORT_SYMBOL_GPL(devl_rate_leaf_destroy);
@@ -702,6 +756,7 @@ void devl_rate_nodes_destroy(struct devlink *devlink)
const struct devlink_ops *ops = devlink->ops;
devl_assert_locked(devlink);
+ devl_rate_domain_lock(devlink);
list_for_each_entry(devlink_rate, &devlink->rate_domain->rate_list, list) {
if (!devlink_rate->parent || devlink_rate->devlink != devlink)
@@ -723,5 +778,6 @@ void devl_rate_nodes_destroy(struct devlink *devlink)
kfree(devlink_rate);
}
}
+ devl_rate_domain_unlock(devlink);
}
EXPORT_SYMBOL_GPL(devl_rate_nodes_destroy);
--
2.45.0
* Re: [PATCH net-next 03/10] devlink: Serialize access to rate domains
2025-02-13 18:01 ` [PATCH net-next 03/10] devlink: Serialize access to rate domains Tariq Toukan
@ 2025-02-14 12:54 ` Jiri Pirko
2025-02-19 2:21 ` Jakub Kicinski
0 siblings, 1 reply; 23+ messages in thread
From: Jiri Pirko @ 2025-02-14 12:54 UTC (permalink / raw)
To: Tariq Toukan
Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko, Cosmin Ratiu, Carolina Jubran,
Gal Pressman, Mark Bloch, Donald Hunter, Jonathan Corbet,
Saeed Mahameed, Leon Romanovsky, netdev, linux-kernel, linux-doc,
linux-rdma
Thu, Feb 13, 2025 at 07:01:27PM +0100, tariqt@nvidia.com wrote:
>From: Cosmin Ratiu <cratiu@nvidia.com>
>
>Access to rates in a rate domain should be serialized.
>
>This commit introduces two new functions, devl_rate_domain_lock and
>devl_rate_domain_unlock, and uses them whenever serial access to a rate
>domain is needed. For now, they are no-ops since access to a rate domain
>is protected by the devlink lock.
>
>Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
>Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
>Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>---
> net/devlink/devl_internal.h | 4 ++
> net/devlink/rate.c | 114 +++++++++++++++++++++++++++---------
> 2 files changed, 89 insertions(+), 29 deletions(-)
>
>diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
>index 209b4a4c7070..fae81dd6953f 100644
>--- a/net/devlink/devl_internal.h
>+++ b/net/devlink/devl_internal.h
>@@ -121,6 +121,10 @@ static inline void devl_dev_unlock(struct devlink *devlink, bool dev_lock)
> device_unlock(devlink->dev);
> }
>
>+static inline void devl_rate_domain_lock(struct devlink *devlink) { }
>+
>+static inline void devl_rate_domain_unlock(struct devlink *devlink) { }
For the record, I'm still not convinced that introducing this kind of
shared inter-devlink lock is a good idea. We spent quite a bit of painful
time getting rid of the global devlink_mutex and making the devlink
locking scheme as nice and simple as it currently is.
But at the same time I admit I can't think of any nicer solution
to the problem this patchset is trying to solve.
Jakub, any thoughts?
2025-02-14 12:54 ` Jiri Pirko
@ 2025-02-19 2:21 ` Jakub Kicinski
2025-02-25 13:36 ` Jiri Pirko
0 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2025-02-19 2:21 UTC (permalink / raw)
To: Jiri Pirko
Cc: Tariq Toukan, David S. Miller, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko, Cosmin Ratiu, Carolina Jubran,
Gal Pressman, Mark Bloch, Donald Hunter, Jonathan Corbet,
Saeed Mahameed, Leon Romanovsky, netdev, linux-kernel, linux-doc,
linux-rdma
On Fri, 14 Feb 2025 13:54:43 +0100 Jiri Pirko wrote:
> For the record, I'm still not convinced that introducing this kind of
> shared inter-devlink lock is a good idea. We spent quite a bit of painful
> time getting rid of the global devlink_mutex and making the devlink
> locking scheme as nice and simple as it currently is.
>
> But at the same time I admit I can't think of any nicer solution
> to the problem this patchset is trying to solve.
>
> Jakub, any thoughts?
The problem comes from having a devlink instance per function /
port rather than for the ASIC. Spawn a single instance and the
problem will go away 🤷️
I think we talked about this multiple times, I think at least
once with Jake, too. Not that I remember all the details now..
--
pw-bot: cr
* Re: [PATCH net-next 03/10] devlink: Serialize access to rate domains
2025-02-19 2:21 ` Jakub Kicinski
@ 2025-02-25 13:36 ` Jiri Pirko
2025-02-26 1:40 ` Jakub Kicinski
0 siblings, 1 reply; 23+ messages in thread
From: Jiri Pirko @ 2025-02-25 13:36 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Tariq Toukan, David S. Miller, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko, Cosmin Ratiu, Carolina Jubran,
Gal Pressman, Mark Bloch, Donald Hunter, Jonathan Corbet,
Saeed Mahameed, Leon Romanovsky, netdev, linux-kernel, linux-doc,
linux-rdma
Wed, Feb 19, 2025 at 03:21:30AM +0100, kuba@kernel.org wrote:
>On Fri, 14 Feb 2025 13:54:43 +0100 Jiri Pirko wrote:
>> For the record, I'm still not convinced that introducing this kind of
>> shared inter-devlink lock is a good idea. We spent quite a bit of painful
>> time getting rid of the global devlink_mutex and making the devlink
>> locking scheme as nice and simple as it currently is.
>>
>> But at the same time I admit I can't think of any nicer solution
>> to the problem this patchset is trying to solve.
>>
>> Jakub, any thoughts?
>
>The problem comes from having a devlink instance per function /
>port rather than for the ASIC. Spawn a single instance and the
>problem will go away 🤷️
Yeah, we currently have VF devlink ports created under the PF devlink
instance. That is aligned with PCI geometry. If we have a single per-ASIC
parent devlink, this does not change and we still need to configure across
PF devlink instances.
The only benefit I see is that we don't need the rate domain, but
we can use the parent devlink instance's lock instead. The locking ordering
might be a bit tricky to get right, though.
>
>I think we talked about this multiple times, I think at least
>once with Jake, too. Not that I remember all the details now..
>--
>pw-bot: cr
* Re: [PATCH net-next 03/10] devlink: Serialize access to rate domains
2025-02-25 13:36 ` Jiri Pirko
@ 2025-02-26 1:40 ` Jakub Kicinski
2025-02-26 14:44 ` Jiri Pirko
0 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2025-02-26 1:40 UTC (permalink / raw)
To: Jiri Pirko
Cc: Tariq Toukan, David S. Miller, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko, Cosmin Ratiu, Carolina Jubran,
Gal Pressman, Mark Bloch, Donald Hunter, Jonathan Corbet,
Saeed Mahameed, Leon Romanovsky, netdev, linux-kernel, linux-doc,
linux-rdma
On Tue, 25 Feb 2025 14:36:07 +0100 Jiri Pirko wrote:
> >The problem comes from having a devlink instance per function /
> >port rather than for the ASIC. Spawn a single instance and the
> >problem will go away 🤷️
>
> Yeah, we currently have VF devlink ports created under the PF devlink
> instance. That is aligned with PCI geometry. If we have a single per-ASIC
> parent devlink, this does not change and we still need to configure across
> PF devlink instances.
Why would there still be PF instances? I'm not suggesting that you
create a hierarchy of instances.
> The only benefit I see is that we don't need the rate domain, but
> we can use the parent devlink instance's lock instead. The locking ordering
> might be a bit tricky to get right, though.
* Re: [PATCH net-next 03/10] devlink: Serialize access to rate domains
2025-02-26 1:40 ` Jakub Kicinski
@ 2025-02-26 14:44 ` Jiri Pirko
2025-02-27 2:53 ` Jakub Kicinski
0 siblings, 1 reply; 23+ messages in thread
From: Jiri Pirko @ 2025-02-26 14:44 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Tariq Toukan, David S. Miller, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko, Cosmin Ratiu, Carolina Jubran,
Gal Pressman, Mark Bloch, Donald Hunter, Jonathan Corbet,
Saeed Mahameed, Leon Romanovsky, netdev, linux-kernel, linux-doc,
linux-rdma
Wed, Feb 26, 2025 at 02:40:05AM +0100, kuba@kernel.org wrote:
>On Tue, 25 Feb 2025 14:36:07 +0100 Jiri Pirko wrote:
>> >The problem comes from having a devlink instance per function /
>> >port rather than for the ASIC. Spawn a single instance and the
>> >problem will go away 🤷️
>>
>> Yeah, we currently have VF devlink ports created under PF devlink instance.
>> That is aligned with PCI geometry. If we have a single per-ASIC parent
>> devlink, this does not change and we still need to configure cross
>> PF devlink instances.
>
>Why would there still be PF instances? I'm not suggesting that you
>create a hierarchy of instances.
I'm not sure how you imagine getting rid of them. One PCI PF
instantiates one devlink now. There is a lot of configuration (e.g. params)
that is per-PF. You need this instance for that; how else would you do
per-PF things on a shared ASIC instance?
Creating SFs is a per-PF operation, for example. I didn't do a thorough
analysis, but I'm sure there are a couple of per-PF things like these.
Also, not breaking existing users may be an argument to keep per-PF
instances.
>
>> The only benefit I see is that we don't need rate domain, but
>> we can use parent devlink instance lock instead. The locking ordering
>> might be a bit tricky to fix though.
* Re: [PATCH net-next 03/10] devlink: Serialize access to rate domains
2025-02-26 14:44 ` Jiri Pirko
@ 2025-02-27 2:53 ` Jakub Kicinski
2025-02-27 12:22 ` Jiri Pirko
0 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2025-02-27 2:53 UTC (permalink / raw)
To: Jiri Pirko
Cc: Tariq Toukan, David S. Miller, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko, Cosmin Ratiu, Carolina Jubran,
Gal Pressman, Mark Bloch, Donald Hunter, Jonathan Corbet,
Saeed Mahameed, Leon Romanovsky, netdev, linux-kernel, linux-doc,
linux-rdma
On Wed, 26 Feb 2025 15:44:35 +0100 Jiri Pirko wrote:
> > Why would there still be PF instances? I'm not suggesting that you
> > create a hierarchy of instances.
>
> I'm not sure how you imagine getting rid of them. One PCI PF
> instantiates one devlink now. There are lots of configuration (e.g. params)
> that is per-PF. You need this instance for that, how else would you do
> per-PF things on shared ASIC instance?
There are per-PF ports, right?
> Creating SFs is per-PF operation for example. I didn't to thorough
> analysis, but I'm sure there are couple of per-PF things like these.
Seems like adding a port attribute to SF creation would be a much
smaller extension than adding a layer of objects.
> Also not breaking the existing users may be an argument to keep per-PF
> instances.
We're talking about multi-PF devices only. Besides, I'm pretty sure we
moved multiple params and health reporters to be per-port, so IDK
what changed now.
* Re: [PATCH net-next 03/10] devlink: Serialize access to rate domains
2025-02-27 2:53 ` Jakub Kicinski
@ 2025-02-27 12:22 ` Jiri Pirko
2025-03-03 22:06 ` Jakub Kicinski
0 siblings, 1 reply; 23+ messages in thread
From: Jiri Pirko @ 2025-02-27 12:22 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Tariq Toukan, David S. Miller, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko, Cosmin Ratiu, Carolina Jubran,
Gal Pressman, Mark Bloch, Donald Hunter, Jonathan Corbet,
Saeed Mahameed, Leon Romanovsky, netdev, linux-kernel, linux-doc,
linux-rdma
Thu, Feb 27, 2025 at 03:53:10AM +0100, kuba@kernel.org wrote:
>On Wed, 26 Feb 2025 15:44:35 +0100 Jiri Pirko wrote:
>> > Why would there still be PF instances? I'm not suggesting that you
>> > create a hierarchy of instances.
>>
>> I'm not sure how you imagine getting rid of them. One PCI PF
>> instantiates one devlink now. There are lots of configuration (e.g. params)
>> that is per-PF. You need this instance for that, how else would you do
>> per-PF things on shared ASIC instance?
>
>There are per-PF ports, right?
Depends. On a normal host with SR-IOV, no. On a SmartNIC where you have a
PF in the host, yes.
>
>> Creating SFs is per-PF operation for example. I didn't to thorough
>> analysis, but I'm sure there are couple of per-PF things like these.
>
>Seems like adding a port attribute to SF creation would be a much
>smaller extension than adding a layer of objects.
>
>> Also not breaking the existing users may be an argument to keep per-PF
>> instances.
>
>We're talking about multi-PF devices only. Besides pretty sure we
>moved multiple params and health reporters to be per port, so IDK
>what changed now.
Looks like pretty much all current NICs are multi-PFs, aren't they?
* Re: [PATCH net-next 03/10] devlink: Serialize access to rate domains
2025-02-27 12:22 ` Jiri Pirko
@ 2025-03-03 22:06 ` Jakub Kicinski
2025-03-04 13:11 ` Jiri Pirko
0 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2025-03-03 22:06 UTC (permalink / raw)
To: Jiri Pirko
Cc: Tariq Toukan, David S. Miller, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko, Cosmin Ratiu, Carolina Jubran,
Gal Pressman, Mark Bloch, Donald Hunter, Jonathan Corbet,
Saeed Mahameed, Leon Romanovsky, netdev, linux-kernel, linux-doc,
linux-rdma
On Thu, 27 Feb 2025 13:22:25 +0100 Jiri Pirko wrote:
> >> I'm not sure how you imagine getting rid of them. One PCI PF
> >> instantiates one devlink now. There are lots of configuration (e.g. params)
> >> that is per-PF. You need this instance for that, how else would you do
> >> per-PF things on shared ASIC instance?
> >
> >There are per-PF ports, right?
>
> Depends. On normal host sr-iov, no. On smartnic where you have PF in
> host, yes.
Yet another "great choice" in mlx5 that other drivers have foreseen
problems with and avoided.
> >> Creating SFs is per-PF operation for example. I didn't to thorough
> >> analysis, but I'm sure there are couple of per-PF things like these.
> >
> >Seems like adding a port attribute to SF creation would be a much
> >smaller extension than adding a layer of objects.
> >
> >> Also not breaking the existing users may be an argument to keep per-PF
> >> instances.
> >
> >We're talking about multi-PF devices only. Besides pretty sure we
> >moved multiple params and health reporters to be per port, so IDK
> >what changed now.
>
> Looks like pretty much all current NICs are multi-PFs, aren't they?
Not in a way which requires cross-port state sharing, no.
You should know this.
* Re: [PATCH net-next 03/10] devlink: Serialize access to rate domains
2025-03-03 22:06 ` Jakub Kicinski
@ 2025-03-04 13:11 ` Jiri Pirko
2025-03-05 0:04 ` Jakub Kicinski
0 siblings, 1 reply; 23+ messages in thread
From: Jiri Pirko @ 2025-03-04 13:11 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Tariq Toukan, David S. Miller, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko, Cosmin Ratiu, Carolina Jubran,
Gal Pressman, Mark Bloch, Donald Hunter, Jonathan Corbet,
Saeed Mahameed, Leon Romanovsky, netdev, linux-kernel, linux-doc,
linux-rdma
Mon, Mar 03, 2025 at 11:06:23PM +0100, kuba@kernel.org wrote:
>On Thu, 27 Feb 2025 13:22:25 +0100 Jiri Pirko wrote:
>> >> I'm not sure how you imagine getting rid of them. One PCI PF
>> >> instantiates one devlink now. There are lots of configuration (e.g. params)
>> >> that is per-PF. You need this instance for that, how else would you do
>> >> per-PF things on shared ASIC instance?
>> >
>> >There are per-PF ports, right?
>>
>> Depends. On normal host sr-iov, no. On smartnic where you have PF in
>> host, yes.
>
>Yet another "great choice" in mlx5 other drivers have foreseen
>problems with and avoided.
What do you mean? How else would you model it? Do you suggest having a PF
devlink port for the PF that instantiates it? That would sound like an
Ouroboros to me.
>
>> >> Creating SFs is per-PF operation for example. I didn't to thorough
>> >> analysis, but I'm sure there are couple of per-PF things like these.
>> >
>> >Seems like adding a port attribute to SF creation would be a much
>> >smaller extension than adding a layer of objects.
>> >
>> >> Also not breaking the existing users may be an argument to keep per-PF
>> >> instances.
>> >
>> >We're talking about multi-PF devices only. Besides pretty sure we
>> >moved multiple params and health reporters to be per port, so IDK
>> >what changed now.
>>
>> Looks like pretty much all current NICs are multi-PFs, aren't they?
>
>Not in a way which requires cross-port state sharing, no.
>You should know this.
This is not about cross-port state sharing. This is about per-PF
configuration. What am I missing?
* Re: [PATCH net-next 03/10] devlink: Serialize access to rate domains
2025-03-04 13:11 ` Jiri Pirko
@ 2025-03-05 0:04 ` Jakub Kicinski
2025-03-05 11:48 ` Jiri Pirko
0 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2025-03-05 0:04 UTC (permalink / raw)
To: Jiri Pirko
Cc: Tariq Toukan, David S. Miller, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko, Cosmin Ratiu, Carolina Jubran,
Gal Pressman, Mark Bloch, Donald Hunter, Jonathan Corbet,
Saeed Mahameed, Leon Romanovsky, netdev, linux-kernel, linux-doc,
linux-rdma
On Tue, 4 Mar 2025 14:11:40 +0100 Jiri Pirko wrote:
> Mon, Mar 03, 2025 at 11:06:23PM +0100, kuba@kernel.org wrote:
> >On Thu, 27 Feb 2025 13:22:25 +0100 Jiri Pirko wrote:
> >> Depends. On normal host sr-iov, no. On smartnic where you have PF in
> >> host, yes.
> >
> >Yet another "great choice" in mlx5 other drivers have foreseen
> >problems with and avoided.
>
> What do you mean? How else to model it? Do you suggest having PF devlink
> port for the PF that instantiates? That would sound like Uroboros to me.
I reckon it was always more obvious to those of us working on
NPU-derived devices, to which a PCIe port is just a PCIe port,
with no PCIe<>MAC "pipeline" to speak of.
The reason why having the "PF port" is a good idea is exactly
why we're having this conversation. If you don't, you'll assign
to the global scope attributes which are really just port attributes.
> >> Looks like pretty much all current NICs are multi-PFs, aren't they?
> >
> >Not in a way which requires cross-port state sharing, no.
> >You should know this.
>
> This is not about cross-port state sharing. This is about per-PF
> configuration. What am I missing?
Maybe we lost the thread of the conversation.. :)
I'm looking at the next patch in this series and it says:
devlink: Introduce shared rate domains
The underlying idea is modeling a piece of hardware which:
1. Exposes multiple functions as separate devlink objects.
2. Is capable of instantiating a transmit scheduling tree spanning
multiple functions.
Modeling this requires devlink rate nodes with parents across other
devlink objects.
Are these domains not cross-port?
* Re: [PATCH net-next 03/10] devlink: Serialize access to rate domains
2025-03-05 0:04 ` Jakub Kicinski
@ 2025-03-05 11:48 ` Jiri Pirko
0 siblings, 0 replies; 23+ messages in thread
From: Jiri Pirko @ 2025-03-05 11:48 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Tariq Toukan, David S. Miller, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko, Cosmin Ratiu, Carolina Jubran,
Gal Pressman, Mark Bloch, Donald Hunter, Jonathan Corbet,
Saeed Mahameed, Leon Romanovsky, netdev, linux-kernel, linux-doc,
linux-rdma
Wed, Mar 05, 2025 at 01:04:12AM +0100, kuba@kernel.org wrote:
>On Tue, 4 Mar 2025 14:11:40 +0100 Jiri Pirko wrote:
>> Mon, Mar 03, 2025 at 11:06:23PM +0100, kuba@kernel.org wrote:
>> >On Thu, 27 Feb 2025 13:22:25 +0100 Jiri Pirko wrote:
>> >> Depends. On normal host sr-iov, no. On smartnic where you have PF in
>> >> host, yes.
>> >
>> >Yet another "great choice" in mlx5 other drivers have foreseen
>> >problems with and avoided.
>>
>> What do you mean? How else to model it? Do you suggest having PF devlink
>> port for the PF that instantiates? That would sound like Uroboros to me.
>
>I reckon it was always more obvious to those of us working on
>NPU-derived devices, to which a PCIe port is just a PCIe port,
>with no PCIe<>MAC "pipeline" to speak of.
>
>The reason why having the "PF port" is a good idea is exactly
>why we're having this conversation. If you don't you'll assign
>to the global scope attributes which are really just port attributes.
Well, we have a devlink port for the uplink for this purpose. Why isn't that
enough?
>
>> >> Looks like pretty much all current NICs are multi-PFs, aren't they?
>> >
>> >Not in a way which requires cross-port state sharing, no.
>> >You should know this.
>>
>> This is not about cross-port state sharing. This is about per-PF
>> configuration. What am I missing?
>
>Maybe we lost the thread of the conversation.. :)
>I'm looking at the next patch in this series and it says:
>
> devlink: Introduce shared rate domains
>
> The underlying idea is modeling a piece of hardware which:
> 1. Exposes multiple functions as separate devlink objects.
> 2. Is capable of instantiating a transmit scheduling tree spanning
> multiple functions.
>
> Modeling this requires devlink rate nodes with parents across other
> devlink objects.
>
>Are these domains not cross-port?
Sure. Cross-PF, even. What I suggest is: if we have a devlink instance
under which these 2 PFs are nested, we have this "domain" explicitly defined
and we don't need any other construct.
* [PATCH net-next 04/10] devlink: Introduce shared rate domains
2025-02-13 18:01 [PATCH net-next 00/10] devlink and mlx5: Introduce rate domains Tariq Toukan
` (2 preceding siblings ...)
2025-02-13 18:01 ` [PATCH net-next 03/10] devlink: Serialize access to rate domains Tariq Toukan
@ 2025-02-13 18:01 ` Tariq Toukan
2025-02-13 18:01 ` [PATCH net-next 05/10] devlink: Allow specifying parent device for rate commands Tariq Toukan
` (5 subsequent siblings)
9 siblings, 0 replies; 23+ messages in thread
From: Tariq Toukan @ 2025-02-13 18:01 UTC (permalink / raw)
To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko
Cc: Cosmin Ratiu, Carolina Jubran, Gal Pressman, Mark Bloch,
Donald Hunter, Jiri Pirko, Jonathan Corbet, Saeed Mahameed,
Leon Romanovsky, Tariq Toukan, netdev, linux-kernel, linux-doc,
linux-rdma
From: Cosmin Ratiu <cratiu@nvidia.com>
The underlying idea is modeling a piece of hardware which:
1. Exposes multiple functions as separate devlink objects.
2. Is capable of instantiating a transmit scheduling tree spanning
multiple functions.
Modeling this requires devlink rate nodes with parents across other
devlink objects. A naive approach that relies on the current
one-lock-per-devlink model is impossible, as it would require in some
cases acquiring multiple devlink locks in the correct order.
Based on the preliminary patches in this series, this commit introduces
the concept of a shared rate domain.
1. A shared rate domain stores rate nodes for a piece of hardware that
has the properties described at the beginning.
2. A shared rate domain is identified by the devlink operations pointer
(a proxy for the device type) and a unique u64 hardware identifier
provided by the driver.
3. There is a global registry of reference counted shared rate domains.
4. A devlink object starts out with a private rate domain, and can be
switched once to use a shared rate domain with
devlink_shared_rate_domain_init. Further calls with the same id do nothing.
5. Shared rate domains have an additional mutex serializing access to
rate nodes, acquired by the previously introduced functions
devl_rate_domain_lock and devl_rate_domain_unlock.
These new code paths are unused for now. A caller to
devlink_shared_rate_domain_init will be added in a subsequent patch.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
include/net/devlink.h | 8 ++++
net/devlink/core.c | 79 ++++++++++++++++++++++++++++++++++++-
net/devlink/dev.c | 2 +-
net/devlink/devl_internal.h | 26 ++++++++++--
net/devlink/rate.c | 15 +++++++
5 files changed, 124 insertions(+), 6 deletions(-)
diff --git a/include/net/devlink.h b/include/net/devlink.h
index b8783126c1ed..a9675c1810e6 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -1561,6 +1561,14 @@ void devlink_register(struct devlink *devlink);
void devlink_unregister(struct devlink *devlink);
void devlink_free(struct devlink *devlink);
+/* Can be used to tell devlink that shared rate domains are supported.
+ * The same id needs to be provided for devlink objects that can share
+ * rate nodes in hw (e.g. contain nodes with parents in other devlink objects).
+ * This requires holding the devlink lock and can only be called once per object.
+ * Rate node relationships across different rate domains are not supported.
+ */
+int devlink_shared_rate_domain_init(struct devlink *devlink, u64 id);
+
/**
* struct devlink_port_ops - Port operations
* @port_split: Callback used to split the port into multiple ones.
diff --git a/net/devlink/core.c b/net/devlink/core.c
index 06a2e2dce558..9d374d84225a 100644
--- a/net/devlink/core.c
+++ b/net/devlink/core.c
@@ -289,6 +289,80 @@ void devl_unlock(struct devlink *devlink)
}
EXPORT_SYMBOL_GPL(devl_unlock);
+/* A global data struct with all shared rate domains. */
+static struct {
+ struct mutex lock; /* Acquired after the devlink lock. */
+ struct list_head rate_domains;
+} devlink_rate_domains = {
+ .lock = __MUTEX_INITIALIZER(devlink_rate_domains.lock),
+ .rate_domains = LIST_HEAD_INIT(devlink_rate_domains.rate_domains),
+};
+
+static bool devlink_rate_domain_eq(struct devlink_rate_domain *rate_domain,
+ const struct devlink_ops *ops, u64 id)
+{
+ return rate_domain->ops == ops && rate_domain->id == id;
+}
+
+int devlink_shared_rate_domain_init(struct devlink *devlink, u64 id)
+{
+ struct devlink_rate_domain *rate_domain;
+ int err = 0;
+
+ devl_assert_locked(devlink);
+
+ if (devlink->rate_domain->shared) {
+ if (devlink_rate_domain_eq(devlink->rate_domain, devlink->ops, id))
+ return 0;
+ return -EEXIST;
+ }
+ if (!list_empty(&devlink->rate_domain->rate_list))
+ return -EINVAL;
+
+ mutex_lock(&devlink_rate_domains.lock);
+ list_for_each_entry(rate_domain, &devlink_rate_domains.rate_domains, list) {
+ if (devlink_rate_domain_eq(rate_domain, devlink->ops, id)) {
+ refcount_inc(&rate_domain->refcount);
+ goto replace_domain;
+ }
+ }
+
+ /* Shared domain not found, create one. */
+ rate_domain = kvzalloc(sizeof(*rate_domain), GFP_KERNEL);
+ if (!rate_domain) {
+ err = -ENOMEM;
+ goto unlock;
+ }
+ rate_domain->shared = true;
+ rate_domain->ops = devlink->ops;
+ rate_domain->id = id;
+ mutex_init(&rate_domain->lock);
+ INIT_LIST_HEAD(&rate_domain->rate_list);
+ refcount_set(&rate_domain->refcount, 1);
+ list_add_tail(&rate_domain->list, &devlink_rate_domains.rate_domains);
+replace_domain:
+ kvfree(devlink->rate_domain);
+ devlink->rate_domain = rate_domain;
+unlock:
+ mutex_unlock(&devlink_rate_domains.lock);
+ return err;
+}
+EXPORT_SYMBOL_GPL(devlink_shared_rate_domain_init);
+
+static void devlink_rate_domain_put(struct devlink_rate_domain *rate_domain)
+{
+ if (rate_domain->shared) {
+ if (!refcount_dec_and_test(&rate_domain->refcount))
+ return;
+
+ WARN_ON(!list_empty(&rate_domain->rate_list));
+ mutex_lock(&devlink_rate_domains.lock);
+ list_del(&rate_domain->list);
+ mutex_unlock(&devlink_rate_domains.lock);
+ }
+ kvfree(rate_domain);
+}
+
/**
* devlink_try_get() - try to obtain a reference on a devlink instance
* @devlink: instance to reference
@@ -314,7 +388,7 @@ static void devlink_release(struct work_struct *work)
mutex_destroy(&devlink->lock);
lockdep_unregister_key(&devlink->lock_key);
put_device(devlink->dev);
- kvfree(devlink->rate_domain);
+ devlink_rate_domain_put(devlink->rate_domain);
kvfree(devlink);
}
@@ -428,6 +502,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
devlink->rate_domain = kvzalloc(sizeof(*devlink->rate_domain), GFP_KERNEL);
if (!devlink->rate_domain)
goto err_rate_domain;
+ devlink->rate_domain->shared = false;
INIT_LIST_HEAD(&devlink->rate_domain->rate_list);
ret = xa_alloc_cyclic(&devlinks, &devlink->index, devlink, xa_limit_31b,
@@ -484,7 +559,7 @@ void devlink_free(struct devlink *devlink)
WARN_ON(!list_empty(&devlink->resource_list));
WARN_ON(!list_empty(&devlink->dpipe_table_list));
WARN_ON(!list_empty(&devlink->sb_list));
- WARN_ON(!list_empty(&devlink->rate_domain->rate_list));
+ WARN_ON(devlink_rates_check(devlink));
WARN_ON(!list_empty(&devlink->linecard_list));
WARN_ON(!xa_empty(&devlink->ports));
diff --git a/net/devlink/dev.c b/net/devlink/dev.c
index c926c75cc10d..84353a85e8fe 100644
--- a/net/devlink/dev.c
+++ b/net/devlink/dev.c
@@ -434,7 +434,7 @@ static void devlink_reload_reinit_sanity_check(struct devlink *devlink)
WARN_ON(!list_empty(&devlink->trap_list));
WARN_ON(!list_empty(&devlink->dpipe_table_list));
WARN_ON(!list_empty(&devlink->sb_list));
- WARN_ON(!list_empty(&devlink->rate_domain->rate_list));
+ WARN_ON(devlink_rates_check(devlink));
WARN_ON(!list_empty(&devlink->linecard_list));
WARN_ON(!xa_empty(&devlink->ports));
}
diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
index fae81dd6953f..7401aab274e5 100644
--- a/net/devlink/devl_internal.h
+++ b/net/devlink/devl_internal.h
@@ -30,9 +30,20 @@ struct devlink_dev_stats {
u32 remote_reload_stats[DEVLINK_RELOAD_STATS_ARRAY_SIZE];
};
-/* Stores devlink rates associated with a rate domain. */
+/* Stores devlink rates associated with a rate domain.
+ * Multiple devlink objects may share the same domain (when 'shared' is true)
+ * and rate nodes can have members from multiple devices.
+ */
struct devlink_rate_domain {
+ bool shared;
+ struct list_head list;
struct list_head rate_list;
+ /* Fields below are only used for shared rate domains. */
+ const struct devlink_ops *ops;
+ u64 id;
+ refcount_t refcount;
+ /* Serializes access to rates. */
+ struct mutex lock;
};
struct devlink {
@@ -121,9 +132,17 @@ static inline void devl_dev_unlock(struct devlink *devlink, bool dev_lock)
device_unlock(devlink->dev);
}
-static inline void devl_rate_domain_lock(struct devlink *devlink) { }
+static inline void devl_rate_domain_lock(struct devlink *devlink)
+{
+ if (devlink->rate_domain->shared)
+ mutex_lock(&devlink->rate_domain->lock);
+}
-static inline void devl_rate_domain_unlock(struct devlink *devlink) { }
+static inline void devl_rate_domain_unlock(struct devlink *devlink)
+{
+ if (devlink->rate_domain->shared)
+ mutex_unlock(&devlink->rate_domain->lock);
+}
typedef void devlink_rel_notify_cb_t(struct devlink *devlink, u32 obj_index);
typedef void devlink_rel_cleanup_cb_t(struct devlink *devlink, u32 obj_index,
@@ -307,6 +326,7 @@ int devlink_resources_validate(struct devlink *devlink,
/* Rates */
int devlink_rate_nodes_check(struct devlink *devlink, struct netlink_ext_ack *extack);
+int devlink_rates_check(struct devlink *devlink);
/* Linecards */
unsigned int devlink_linecard_index(struct devlink_linecard *linecard);
diff --git a/net/devlink/rate.c b/net/devlink/rate.c
index 54e6a9893e3d..38f18216eb80 100644
--- a/net/devlink/rate.c
+++ b/net/devlink/rate.c
@@ -621,6 +621,21 @@ int devlink_rate_nodes_check(struct devlink *devlink, struct netlink_ext_ack *ex
return err;
}
+int devlink_rates_check(struct devlink *devlink)
+{
+ struct devlink_rate *devlink_rate;
+ int err = 0;
+
+ devl_rate_domain_lock(devlink);
+ list_for_each_entry(devlink_rate, &devlink->rate_domain->rate_list, list)
+ if (devlink_rate->devlink == devlink) {
+ err = -EBUSY;
+ break;
+ }
+ devl_rate_domain_unlock(devlink);
+ return err;
+}
+
/**
* devl_rate_node_create - create devlink rate node
* @devlink: devlink instance
--
2.45.0
* [PATCH net-next 05/10] devlink: Allow specifying parent device for rate commands
2025-02-13 18:01 [PATCH net-next 00/10] devlink and mlx5: Introduce rate domains Tariq Toukan
` (3 preceding siblings ...)
2025-02-13 18:01 ` [PATCH net-next 04/10] devlink: Introduce shared " Tariq Toukan
@ 2025-02-13 18:01 ` Tariq Toukan
2025-02-13 18:01 ` [PATCH net-next 06/10] devlink: Allow rate node parents from other devlinks Tariq Toukan
` (4 subsequent siblings)
9 siblings, 0 replies; 23+ messages in thread
From: Tariq Toukan @ 2025-02-13 18:01 UTC (permalink / raw)
To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko
Cc: Cosmin Ratiu, Carolina Jubran, Gal Pressman, Mark Bloch,
Donald Hunter, Jiri Pirko, Jonathan Corbet, Saeed Mahameed,
Leon Romanovsky, Tariq Toukan, netdev, linux-kernel, linux-doc,
linux-rdma
From: Cosmin Ratiu <cratiu@nvidia.com>
Previously, the parent device was assumed to be the same as the one
owning the rate node being manipulated.
This patch expands that by allowing rate commands to accept two
additional arguments, the parent bus name and the parent dev name.
Besides introducing these two attributes in the uapi, this patch:
- adds plumbing in the 'rate-new' and 'rate-set' commands to access
these attributes.
- extends devlink-nl-pre-doit with getting and saving a reference to the
parent dev in info->user_ptr[1].
The parent dev needs to exist and be registered and is then unlocked
but not put. This ensures that it will stay alive until the end of the
request.
- changes devlink-nl-post-doit to put the parent reference, if
obtained.
Example usage with ynl:
Setting parent to a rate node on another device
./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \
--do rate-set --json '{ \
"bus-name": "pci", \
"dev-name": "0000:08:00.1", \
"port-index": 65537, \
"parent-dev-bus-name": "pci", \
"parent-dev-name": "0000:08:00.0", \
"rate-parent-node-name": "group1" \
}'
Getting the rate details:
./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \
--do rate-get --json '{ \
"bus-name": "pci", \
"dev-name": "0000:08:00.1", \
"port-index": 65537 \
}'
Output:
{'bus-name': 'pci',
'dev-name': '0000:08:00.1',
'parent-dev-bus-name': 'pci',
'parent-dev-name': '0000:08:00.0',
'port-index': 65537,
'rate-parent-node-name': 'group1',
'rate-tc-bws': <snipped for brevity>
'rate-tx-max': 0,
'rate-tx-priority': 0,
'rate-tx-share': 0,
'rate-tx-weight': 0,
'rate-type': 'leaf'}
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
Documentation/netlink/specs/devlink.yaml | 18 ++++--
include/uapi/linux/devlink.h | 3 +
net/devlink/netlink.c | 74 +++++++++++++++++++-----
net/devlink/netlink_gen.c | 20 ++++---
net/devlink/netlink_gen.h | 7 +++
5 files changed, 95 insertions(+), 27 deletions(-)
diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml
index 09fbb4c03fc8..252db0f080ba 100644
--- a/Documentation/netlink/specs/devlink.yaml
+++ b/Documentation/netlink/specs/devlink.yaml
@@ -820,6 +820,12 @@ attribute-sets:
-
name: region-direct
type: flag
+ -
+ name: parent-dev-bus-name
+ type: string
+ -
+ name: parent-dev-name
+ type: string
-
name: dl-dev-stats
@@ -2137,8 +2143,8 @@ operations:
dont-validate: [ strict ]
flags: [ admin-perm ]
do:
- pre: devlink-nl-pre-doit
- post: devlink-nl-post-doit
+ pre: devlink-nl-pre-doit-parent-dev-optional
+ post: devlink-nl-post-doit-parent-dev-optional
request:
attributes:
- bus-name
@@ -2149,6 +2155,8 @@ operations:
- rate-tx-priority
- rate-tx-weight
- rate-parent-node-name
+ - parent-dev-bus-name
+ - parent-dev-name
-
name: rate-new
@@ -2157,8 +2165,8 @@ operations:
dont-validate: [ strict ]
flags: [ admin-perm ]
do:
- pre: devlink-nl-pre-doit
- post: devlink-nl-post-doit
+ pre: devlink-nl-pre-doit-parent-dev-optional
+ post: devlink-nl-post-doit-parent-dev-optional
request:
attributes:
- bus-name
@@ -2169,6 +2177,8 @@ operations:
- rate-tx-priority
- rate-tx-weight
- rate-parent-node-name
+ - parent-dev-bus-name
+ - parent-dev-name
-
name: rate-del
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 9401aa343673..a7b62e53b7eb 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -614,6 +614,9 @@ enum devlink_attr {
DEVLINK_ATTR_REGION_DIRECT, /* flag */
+ DEVLINK_ATTR_PARENT_DEV_BUS_NAME, /* string */
+ DEVLINK_ATTR_PARENT_DEV_NAME, /* string */
+
/* Add new attributes above here, update the spec in
* Documentation/netlink/specs/devlink.yaml and re-generate
* net/devlink/netlink_gen.c.
diff --git a/net/devlink/netlink.c b/net/devlink/netlink.c
index 593605c1b1ef..d8e7e23afce2 100644
--- a/net/devlink/netlink.c
+++ b/net/devlink/netlink.c
@@ -12,6 +12,7 @@
#define DEVLINK_NL_FLAG_NEED_PORT BIT(0)
#define DEVLINK_NL_FLAG_NEED_DEVLINK_OR_PORT BIT(1)
#define DEVLINK_NL_FLAG_NEED_DEV_LOCK BIT(2)
+#define DEVLINK_NL_FLAG_NEED_PARENT_DEV BIT(3)
static const struct genl_multicast_group devlink_nl_mcgrps[] = {
[DEVLINK_MCGRP_CONFIG] = { .name = DEVLINK_GENL_MCGRP_CONFIG_NAME },
@@ -177,20 +178,15 @@ int devlink_nl_msg_reply_and_new(struct sk_buff **msg, struct genl_info *info)
return 0;
}
-struct devlink *
-devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
- bool dev_lock)
+static struct devlink *
+devlink_get_lock(struct net *net, struct nlattr *bus_attr, struct nlattr *dev_attr, bool dev_lock)
{
+ char *busname, *devname;
struct devlink *devlink;
unsigned long index;
- char *busname;
- char *devname;
-
- if (!attrs[DEVLINK_ATTR_BUS_NAME] || !attrs[DEVLINK_ATTR_DEV_NAME])
- return ERR_PTR(-EINVAL);
- busname = nla_data(attrs[DEVLINK_ATTR_BUS_NAME]);
- devname = nla_data(attrs[DEVLINK_ATTR_DEV_NAME]);
+ busname = nla_data(bus_attr);
+ devname = nla_data(dev_attr);
devlinks_xa_for_each_registered_get(net, index, devlink) {
if (strcmp(devlink->dev->bus->name, busname) == 0 &&
@@ -206,19 +202,45 @@ devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
return ERR_PTR(-ENODEV);
}
+struct devlink *
+devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs, bool dev_lock)
+{
+ if (!attrs[DEVLINK_ATTR_BUS_NAME] || !attrs[DEVLINK_ATTR_DEV_NAME])
+ return ERR_PTR(-EINVAL);
+
+ return devlink_get_lock(net, attrs[DEVLINK_ATTR_BUS_NAME],
+ attrs[DEVLINK_ATTR_DEV_NAME], dev_lock);
+}
+
static int __devlink_nl_pre_doit(struct sk_buff *skb, struct genl_info *info,
u8 flags)
{
+ bool parent_dev = flags & DEVLINK_NL_FLAG_NEED_PARENT_DEV;
bool dev_lock = flags & DEVLINK_NL_FLAG_NEED_DEV_LOCK;
+ struct devlink *devlink, *parent_devlink = NULL;
+ struct nlattr **attrs = info->attrs;
struct devlink_port *devlink_port;
- struct devlink *devlink;
int err;
- devlink = devlink_get_from_attrs_lock(genl_info_net(info), info->attrs,
- dev_lock);
- if (IS_ERR(devlink))
- return PTR_ERR(devlink);
+ if (parent_dev && attrs[DEVLINK_ATTR_PARENT_DEV_BUS_NAME] &&
+ attrs[DEVLINK_ATTR_PARENT_DEV_NAME]) {
+ parent_devlink = devlink_get_lock(genl_info_net(info),
+ attrs[DEVLINK_ATTR_PARENT_DEV_BUS_NAME],
+ attrs[DEVLINK_ATTR_PARENT_DEV_NAME], dev_lock);
+ if (IS_ERR(parent_devlink))
+ return PTR_ERR(parent_devlink);
+ info->user_ptr[1] = parent_devlink;
+ /* Drop the parent devlink lock, but don't put it just yet.
+ * This will keep it alive until the end of the request.
+ */
+ devl_dev_unlock(parent_devlink, dev_lock);
+ }
+ devlink = devlink_get_from_attrs_lock(genl_info_net(info), attrs, dev_lock);
+ if (IS_ERR(devlink)) {
+ err = PTR_ERR(devlink);
+ goto parent_put;
+ }
info->user_ptr[0] = devlink;
if (flags & DEVLINK_NL_FLAG_NEED_PORT) {
devlink_port = devlink_port_get_from_info(devlink, info);
@@ -237,6 +259,9 @@ static int __devlink_nl_pre_doit(struct sk_buff *skb, struct genl_info *info,
unlock:
devl_dev_unlock(devlink, dev_lock);
devlink_put(devlink);
+parent_put:
+ if (parent_dev && parent_devlink)
+ devlink_put(parent_devlink);
return err;
}
@@ -265,6 +290,13 @@ int devlink_nl_pre_doit_port_optional(const struct genl_split_ops *ops,
return __devlink_nl_pre_doit(skb, info, DEVLINK_NL_FLAG_NEED_DEVLINK_OR_PORT);
}
+int devlink_nl_pre_doit_parent_dev_optional(const struct genl_split_ops *ops,
+ struct sk_buff *skb,
+ struct genl_info *info)
+{
+ return __devlink_nl_pre_doit(skb, info, DEVLINK_NL_FLAG_NEED_PARENT_DEV);
+}
+
static void __devlink_nl_post_doit(struct sk_buff *skb, struct genl_info *info,
u8 flags)
{
@@ -274,6 +306,10 @@ static void __devlink_nl_post_doit(struct sk_buff *skb, struct genl_info *info,
devlink = info->user_ptr[0];
devl_dev_unlock(devlink, dev_lock);
devlink_put(devlink);
+ if ((flags & DEVLINK_NL_FLAG_NEED_PARENT_DEV) && info->user_ptr[1]) {
+ devlink = info->user_ptr[1];
+ devlink_put(devlink);
+ }
}
void devlink_nl_post_doit(const struct genl_split_ops *ops,
@@ -289,6 +325,14 @@ devlink_nl_post_doit_dev_lock(const struct genl_split_ops *ops,
__devlink_nl_post_doit(skb, info, DEVLINK_NL_FLAG_NEED_DEV_LOCK);
}
+void
+devlink_nl_post_doit_parent_dev_optional(const struct genl_split_ops *ops,
+ struct sk_buff *skb,
+ struct genl_info *info)
+{
+ __devlink_nl_post_doit(skb, info, DEVLINK_NL_FLAG_NEED_PARENT_DEV);
+}
+
static int devlink_nl_inst_single_dumpit(struct sk_buff *msg,
struct netlink_callback *cb, int flags,
devlink_nl_dump_one_func_t *dump_one,
diff --git a/net/devlink/netlink_gen.c b/net/devlink/netlink_gen.c
index f9786d51f68f..2dc82cc9a0b1 100644
--- a/net/devlink/netlink_gen.c
+++ b/net/devlink/netlink_gen.c
@@ -496,7 +496,7 @@ static const struct nla_policy devlink_rate_get_dump_nl_policy[DEVLINK_ATTR_DEV_
};
/* DEVLINK_CMD_RATE_SET - do */
-static const struct nla_policy devlink_rate_set_nl_policy[DEVLINK_ATTR_RATE_TX_WEIGHT + 1] = {
+static const struct nla_policy devlink_rate_set_nl_policy[DEVLINK_ATTR_PARENT_DEV_NAME + 1] = {
[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_RATE_NODE_NAME] = { .type = NLA_NUL_STRING, },
@@ -505,10 +505,12 @@ static const struct nla_policy devlink_rate_set_nl_policy[DEVLINK_ATTR_RATE_TX_W
[DEVLINK_ATTR_RATE_TX_PRIORITY] = { .type = NLA_U32, },
[DEVLINK_ATTR_RATE_TX_WEIGHT] = { .type = NLA_U32, },
[DEVLINK_ATTR_RATE_PARENT_NODE_NAME] = { .type = NLA_NUL_STRING, },
+ [DEVLINK_ATTR_PARENT_DEV_BUS_NAME] = { .type = NLA_NUL_STRING, },
+ [DEVLINK_ATTR_PARENT_DEV_NAME] = { .type = NLA_NUL_STRING, },
};
/* DEVLINK_CMD_RATE_NEW - do */
-static const struct nla_policy devlink_rate_new_nl_policy[DEVLINK_ATTR_RATE_TX_WEIGHT + 1] = {
+static const struct nla_policy devlink_rate_new_nl_policy[DEVLINK_ATTR_PARENT_DEV_NAME + 1] = {
[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_RATE_NODE_NAME] = { .type = NLA_NUL_STRING, },
@@ -517,6 +519,8 @@ static const struct nla_policy devlink_rate_new_nl_policy[DEVLINK_ATTR_RATE_TX_W
[DEVLINK_ATTR_RATE_TX_PRIORITY] = { .type = NLA_U32, },
[DEVLINK_ATTR_RATE_TX_WEIGHT] = { .type = NLA_U32, },
[DEVLINK_ATTR_RATE_PARENT_NODE_NAME] = { .type = NLA_NUL_STRING, },
+ [DEVLINK_ATTR_PARENT_DEV_BUS_NAME] = { .type = NLA_NUL_STRING, },
+ [DEVLINK_ATTR_PARENT_DEV_NAME] = { .type = NLA_NUL_STRING, },
};
/* DEVLINK_CMD_RATE_DEL - do */
@@ -1160,21 +1164,21 @@ const struct genl_split_ops devlink_nl_ops[74] = {
{
.cmd = DEVLINK_CMD_RATE_SET,
.validate = GENL_DONT_VALIDATE_STRICT,
- .pre_doit = devlink_nl_pre_doit,
+ .pre_doit = devlink_nl_pre_doit_parent_dev_optional,
.doit = devlink_nl_rate_set_doit,
- .post_doit = devlink_nl_post_doit,
+ .post_doit = devlink_nl_post_doit_parent_dev_optional,
.policy = devlink_rate_set_nl_policy,
- .maxattr = DEVLINK_ATTR_RATE_TX_WEIGHT,
+ .maxattr = DEVLINK_ATTR_PARENT_DEV_NAME,
.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
},
{
.cmd = DEVLINK_CMD_RATE_NEW,
.validate = GENL_DONT_VALIDATE_STRICT,
- .pre_doit = devlink_nl_pre_doit,
+ .pre_doit = devlink_nl_pre_doit_parent_dev_optional,
.doit = devlink_nl_rate_new_doit,
- .post_doit = devlink_nl_post_doit,
+ .post_doit = devlink_nl_post_doit_parent_dev_optional,
.policy = devlink_rate_new_nl_policy,
- .maxattr = DEVLINK_ATTR_RATE_TX_WEIGHT,
+ .maxattr = DEVLINK_ATTR_PARENT_DEV_NAME,
.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
},
{
diff --git a/net/devlink/netlink_gen.h b/net/devlink/netlink_gen.h
index 8f2bd50ddf5e..dd402264a47f 100644
--- a/net/devlink/netlink_gen.h
+++ b/net/devlink/netlink_gen.h
@@ -27,12 +27,19 @@ int devlink_nl_pre_doit_dev_lock(const struct genl_split_ops *ops,
int devlink_nl_pre_doit_port_optional(const struct genl_split_ops *ops,
struct sk_buff *skb,
struct genl_info *info);
+int devlink_nl_pre_doit_parent_dev_optional(const struct genl_split_ops *ops,
+ struct sk_buff *skb,
+ struct genl_info *info);
void
devlink_nl_post_doit(const struct genl_split_ops *ops, struct sk_buff *skb,
struct genl_info *info);
void
devlink_nl_post_doit_dev_lock(const struct genl_split_ops *ops,
struct sk_buff *skb, struct genl_info *info);
+void
+devlink_nl_post_doit_parent_dev_optional(const struct genl_split_ops *ops,
+ struct sk_buff *skb,
+ struct genl_info *info);
int devlink_nl_get_doit(struct sk_buff *skb, struct genl_info *info);
int devlink_nl_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
--
2.45.0
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH net-next 06/10] devlink: Allow rate node parents from other devlinks
2025-02-13 18:01 [PATCH net-next 00/10] devlink and mlx5: Introduce rate domains Tariq Toukan
` (4 preceding siblings ...)
2025-02-13 18:01 ` [PATCH net-next 05/10] devlink: Allow specifying parent device for rate commands Tariq Toukan
@ 2025-02-13 18:01 ` Tariq Toukan
2025-02-13 18:01 ` [PATCH net-next 07/10] net/mlx5: qos: Introduce shared esw qos domains Tariq Toukan
` (3 subsequent siblings)
9 siblings, 0 replies; 23+ messages in thread
From: Tariq Toukan @ 2025-02-13 18:01 UTC (permalink / raw)
To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko
Cc: Cosmin Ratiu, Carolina Jubran, Gal Pressman, Mark Bloch,
Donald Hunter, Jiri Pirko, Jonathan Corbet, Saeed Mahameed,
Leon Romanovsky, Tariq Toukan, netdev, linux-kernel, linux-doc,
linux-rdma
From: Cosmin Ratiu <cratiu@nvidia.com>
This commit makes use of the unlocked parent devlink from
info->user_ptr[1] to assign a devlink rate node to the requested parent
node. Because it is not locked, none of its mutable fields can be used.
But parent setting only requires:
1. Verifying that the same rate domain is used. The rate domain is
immutable once set, so this is safe.
2. Comparing devlink_rate->devlink with the requested parent devlink.
As the shared devlink rate domain is locked, other entities cannot
concurrently make changes to any of its rates.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
net/devlink/rate.c | 46 +++++++++++++++++++++++++++++++++++++++-------
1 file changed, 39 insertions(+), 7 deletions(-)
diff --git a/net/devlink/rate.c b/net/devlink/rate.c
index 38f18216eb80..e6b4f4cb8a01 100644
--- a/net/devlink/rate.c
+++ b/net/devlink/rate.c
@@ -125,10 +125,26 @@ static int devlink_nl_rate_fill(struct sk_buff *msg,
devlink_rate->tx_weight))
goto nla_put_failure;
- if (devlink_rate->parent)
- if (nla_put_string(msg, DEVLINK_ATTR_RATE_PARENT_NODE_NAME,
- devlink_rate->parent->name))
+ if (devlink_rate->parent) {
+ struct devlink_rate *parent = devlink_rate->parent;
+
+ if (nla_put_string(msg, DEVLINK_ATTR_RATE_PARENT_NODE_NAME, parent->name))
goto nla_put_failure;
+ if (parent->devlink != devlink) {
+ /* The parent devlink isn't locked, but a reference to
+ * it is held so it cannot suddenly disappear.
+ * And since there are rate nodes pointing to it,
+ * the parent devlink is fully initialized and
+ * the fields accessed here are valid and immutable.
+ */
+ if (nla_put_string(msg, DEVLINK_ATTR_PARENT_DEV_BUS_NAME,
+ parent->devlink->dev->bus->name))
+ goto nla_put_failure;
+ if (nla_put_string(msg, DEVLINK_ATTR_PARENT_DEV_NAME,
+ dev_name(parent->devlink->dev)))
+ goto nla_put_failure;
+ }
+ }
genlmsg_end(msg, hdr);
return 0;
@@ -281,9 +297,18 @@ devlink_nl_rate_parent_node_set(struct devlink_rate *devlink_rate,
const char *parent_name = nla_data(nla_parent);
const struct devlink_ops *ops = devlink->ops;
size_t len = strlen(parent_name);
+ struct devlink *parent_devlink;
struct devlink_rate *parent;
int err = -EOPNOTSUPP;
+ parent_devlink = info->user_ptr[1] ? : devlink;
+ if (parent_devlink != devlink) {
+ if (parent_devlink->rate_domain != devlink->rate_domain) {
+ NL_SET_ERR_MSG(info->extack,
+ "Cannot set parent to a rate from a different rate domain");
+ return -EINVAL;
+ }
+ }
parent = devlink_rate->parent;
if (parent && !len) {
@@ -301,7 +326,11 @@ devlink_nl_rate_parent_node_set(struct devlink_rate *devlink_rate,
refcount_dec(&parent->refcnt);
devlink_rate->parent = NULL;
} else if (len) {
- parent = devlink_rate_node_get_by_name(devlink, parent_name);
+ /* parent_devlink (if != devlink) isn't locked, but the rate
+ * domain, immutable once set, is already locked and the parent
+ * is only used to determine node owner via pointer comparison.
+ */
+ parent = devlink_rate_node_get_by_name(parent_devlink, parent_name);
if (IS_ERR(parent))
return -ENODEV;
@@ -762,8 +791,8 @@ EXPORT_SYMBOL_GPL(devl_rate_leaf_destroy);
* devl_rate_nodes_destroy - destroy all devlink rate nodes on device
* @devlink: devlink instance
*
- * Unset parent for all rate objects and destroy all rate nodes
- * on specified device.
+ * Unset parent for all rate objects that involve this device and destroy all
+ * rate nodes on it.
*/
void devl_rate_nodes_destroy(struct devlink *devlink)
{
@@ -774,7 +803,9 @@ void devl_rate_nodes_destroy(struct devlink *devlink)
devl_rate_domain_lock(devlink);
list_for_each_entry(devlink_rate, &devlink->rate_domain->rate_list, list) {
- if (!devlink_rate->parent || devlink_rate->devlink != devlink)
+ if (!devlink_rate->parent ||
+ (devlink_rate->devlink != devlink &&
+ devlink_rate->parent->devlink != devlink))
continue;
refcount_dec(&devlink_rate->parent->refcnt);
@@ -784,6 +815,7 @@ void devl_rate_nodes_destroy(struct devlink *devlink)
else if (devlink_rate_is_node(devlink_rate))
ops->rate_node_parent_set(devlink_rate, NULL, devlink_rate->priv,
NULL, NULL);
+ devlink_rate->parent = NULL;
}
list_for_each_entry_safe(devlink_rate, tmp, &devlink->rate_domain->rate_list, list) {
if (devlink_rate->devlink == devlink && devlink_rate_is_node(devlink_rate)) {
--
2.45.0
* [PATCH net-next 07/10] net/mlx5: qos: Introduce shared esw qos domains
2025-02-13 18:01 [PATCH net-next 00/10] devlink and mlx5: Introduce rate domains Tariq Toukan
` (5 preceding siblings ...)
2025-02-13 18:01 ` [PATCH net-next 06/10] devlink: Allow rate node parents from other devlinks Tariq Toukan
@ 2025-02-13 18:01 ` Tariq Toukan
2025-02-13 18:01 ` [PATCH net-next 08/10] net/mlx5: qos: Support cross-esw tx scheduling Tariq Toukan
` (2 subsequent siblings)
9 siblings, 0 replies; 23+ messages in thread
From: Tariq Toukan @ 2025-02-13 18:01 UTC (permalink / raw)
To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko
Cc: Cosmin Ratiu, Carolina Jubran, Gal Pressman, Mark Bloch,
Donald Hunter, Jiri Pirko, Jonathan Corbet, Saeed Mahameed,
Leon Romanovsky, Tariq Toukan, netdev, linux-kernel, linux-doc,
linux-rdma
From: Cosmin Ratiu <cratiu@nvidia.com>
Introduce shared esw qos domains, capable of holding rate groups for
multiple E-Switches of the same NIC. Shared qos domains are reference
counted and can be discovered via devcom. The devcom comp lock is used
in write-mode to prevent init/cleanup races.
When initializing a shared qos domain fails due to devcom errors,
the code falls back to using a private qos domain while logging a
message that cross-esw scheduling cannot be supported.
Shared esw qos domains will be used in a future patch to support
cross-eswitch scheduling.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/esw/qos.c | 73 +++++++++++++++++--
1 file changed, 67 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
index 8b7c843446e1..6a469f214187 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
@@ -11,10 +11,17 @@
/* Minimum supported BW share value by the HW is 1 Mbit/sec */
#define MLX5_MIN_BW_SHARE 1
-/* Holds rate nodes associated with an E-Switch. */
+/* Holds rate nodes associated with one or more E-Switches.
+ * If cross-esw scheduling is supported, this is shared between all
+ * E-Switches of a NIC.
+ */
struct mlx5_qos_domain {
/* Serializes access to all qos changes in the qos domain. */
struct mutex lock;
+ /* Whether this domain is shared with other E-Switches. */
+ bool shared;
+ /* The reference count is only used for shared qos domains. */
+ refcount_t refcnt;
/* List of all mlx5_esw_sched_nodes. */
struct list_head nodes;
};
@@ -34,7 +41,7 @@ static void esw_assert_qos_lock_held(struct mlx5_eswitch *esw)
lockdep_assert_held(&esw->qos.domain->lock);
}
-static struct mlx5_qos_domain *esw_qos_domain_alloc(void)
+static struct mlx5_qos_domain *esw_qos_domain_alloc(bool shared)
{
struct mlx5_qos_domain *qos_domain;
@@ -44,21 +51,75 @@ static struct mlx5_qos_domain *esw_qos_domain_alloc(void)
mutex_init(&qos_domain->lock);
INIT_LIST_HEAD(&qos_domain->nodes);
+ qos_domain->shared = shared;
+ if (shared)
+ refcount_set(&qos_domain->refcnt, 1);
return qos_domain;
}
-static int esw_qos_domain_init(struct mlx5_eswitch *esw)
+static void esw_qos_devcom_lock(struct mlx5_devcom_comp_dev *devcom, bool shared)
{
- esw->qos.domain = esw_qos_domain_alloc();
+ if (shared)
+ mlx5_devcom_comp_lock(devcom);
+}
+
+static void esw_qos_devcom_unlock(struct mlx5_devcom_comp_dev *devcom, bool shared)
+{
+ if (shared)
+ mlx5_devcom_comp_unlock(devcom);
+}
+
+static int esw_qos_domain_init(struct mlx5_eswitch *esw, bool shared)
+{
+ struct mlx5_devcom_comp_dev *devcom = esw->dev->priv.hca_devcom_comp;
+
+ if (shared && IS_ERR_OR_NULL(devcom)) {
+ esw_info(esw->dev, "Cross-esw QoS cannot be initialized because devcom is unavailable.");
+ shared = false;
+ }
+
+ esw_qos_devcom_lock(devcom, shared);
+ if (shared) {
+ struct mlx5_devcom_comp_dev *pos;
+ struct mlx5_core_dev *peer_dev;
+
+ mlx5_devcom_for_each_peer_entry(devcom, peer_dev, pos) {
+ struct mlx5_eswitch *peer_esw = peer_dev->priv.eswitch;
+
+ if (peer_esw->qos.domain && peer_esw->qos.domain->shared) {
+ esw->qos.domain = peer_esw->qos.domain;
+ refcount_inc(&esw->qos.domain->refcnt);
+ goto unlock;
+ }
+ }
+ }
+
+ /* If no shared domain found, this esw will create one.
+ * Doing it with the devcom comp lock held prevents races with other
+ * eswitches doing concurrent init.
+ */
+ esw->qos.domain = esw_qos_domain_alloc(shared);
+unlock:
+ esw_qos_devcom_unlock(devcom, shared);
return esw->qos.domain ? 0 : -ENOMEM;
}
static void esw_qos_domain_release(struct mlx5_eswitch *esw)
{
- kfree(esw->qos.domain);
+ struct mlx5_devcom_comp_dev *devcom = esw->dev->priv.hca_devcom_comp;
+ struct mlx5_qos_domain *domain = esw->qos.domain;
+ bool shared = domain->shared;
+
+ /* Shared domains are released with the devcom comp lock held to
+ * prevent races with other eswitches doing concurrent init.
+ */
+ esw_qos_devcom_lock(devcom, shared);
+ if (!shared || refcount_dec_and_test(&domain->refcnt))
+ kfree(domain);
esw->qos.domain = NULL;
+ esw_qos_devcom_unlock(devcom, shared);
}
enum sched_node_type {
@@ -829,7 +890,7 @@ int mlx5_esw_qos_init(struct mlx5_eswitch *esw)
if (esw->qos.domain)
return 0; /* Nothing to change. */
- return esw_qos_domain_init(esw);
+ return esw_qos_domain_init(esw, false);
}
void mlx5_esw_qos_cleanup(struct mlx5_eswitch *esw)
--
2.45.0
* [PATCH net-next 08/10] net/mlx5: qos: Support cross-esw tx scheduling
2025-02-13 18:01 [PATCH net-next 00/10] devlink and mlx5: Introduce rate domains Tariq Toukan
` (6 preceding siblings ...)
2025-02-13 18:01 ` [PATCH net-next 07/10] net/mlx5: qos: Introduce shared esw qos domains Tariq Toukan
@ 2025-02-13 18:01 ` Tariq Toukan
2025-02-13 18:01 ` [PATCH net-next 09/10] net/mlx5: qos: Init shared devlink rate domain Tariq Toukan
2025-02-13 18:01 ` [PATCH net-next 10/10] net/mlx5: Document devlink rates and cross-esw scheduling Tariq Toukan
9 siblings, 0 replies; 23+ messages in thread
From: Tariq Toukan @ 2025-02-13 18:01 UTC (permalink / raw)
To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko
Cc: Cosmin Ratiu, Carolina Jubran, Gal Pressman, Mark Bloch,
Donald Hunter, Jiri Pirko, Jonathan Corbet, Saeed Mahameed,
Leon Romanovsky, Tariq Toukan, netdev, linux-kernel, linux-doc,
linux-rdma
From: Cosmin Ratiu <cratiu@nvidia.com>
Up to now, rate groups could only contain vports from the same E-Switch.
This patch relaxes that restriction if the device supports it
(HCA_CAP.esw_cross_esw_sched == true) and the right conditions are met:
- Link Aggregation (LAG) is enabled.
- The E-Switches use the same qos domain.
This also enables the use of the previously added shared esw qos
domains.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/esw/qos.c | 59 +++++++++++++++----
1 file changed, 49 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
index 6a469f214187..e6dcfe348a7e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
@@ -147,7 +147,9 @@ struct mlx5_esw_sched_node {
enum sched_node_type type;
/* The eswitch this node belongs to. */
struct mlx5_eswitch *esw;
- /* The children nodes of this node, empty list for leaf nodes. */
+ /* The children nodes of this node, empty list for leaf nodes.
+ * Can be from multiple E-Switches.
+ */
struct list_head children;
/* Valid only if this node is associated with a vport. */
struct mlx5_vport *vport;
@@ -398,6 +400,7 @@ static int esw_qos_vport_create_sched_element(struct mlx5_esw_sched_node *vport_
{
u32 sched_ctx[MLX5_ST_SZ_DW(scheduling_context)] = {};
struct mlx5_core_dev *dev = vport_node->esw->dev;
+ struct mlx5_vport *vport = vport_node->vport;
void *attr;
if (!mlx5_qos_element_type_supported(dev,
@@ -408,7 +411,13 @@ static int esw_qos_vport_create_sched_element(struct mlx5_esw_sched_node *vport_
MLX5_SET(scheduling_context, sched_ctx, element_type,
SCHEDULING_CONTEXT_ELEMENT_TYPE_VPORT);
attr = MLX5_ADDR_OF(scheduling_context, sched_ctx, element_attributes);
- MLX5_SET(vport_element, attr, vport_number, vport_node->vport->vport);
+ MLX5_SET(vport_element, attr, vport_number, vport->vport);
+ if (vport->dev != dev) {
+ /* The port is assigned to a node on another eswitch. */
+ MLX5_SET(vport_element, attr, eswitch_owner_vhca_id_valid, true);
+ MLX5_SET(vport_element, attr, eswitch_owner_vhca_id,
+ MLX5_CAP_GEN(vport->dev, vhca_id));
+ }
MLX5_SET(scheduling_context, sched_ctx, parent_element_id, vport_node->parent->ix);
MLX5_SET(scheduling_context, sched_ctx, max_average_bw, vport_node->max_rate);
@@ -887,10 +896,16 @@ static int esw_qos_devlink_rate_to_mbps(struct mlx5_core_dev *mdev, const char *
int mlx5_esw_qos_init(struct mlx5_eswitch *esw)
{
- if (esw->qos.domain)
- return 0; /* Nothing to change. */
+ bool use_shared_domain = esw->mode == MLX5_ESWITCH_OFFLOADS &&
+ MLX5_CAP_QOS(esw->dev, esw_cross_esw_sched);
+
+ if (esw->qos.domain) {
+ if (esw->qos.domain->shared == use_shared_domain)
+ return 0; /* Nothing to change. */
+ esw_qos_domain_release(esw);
+ }
- return esw_qos_domain_init(esw, false);
+ return esw_qos_domain_init(esw, use_shared_domain);
}
void mlx5_esw_qos_cleanup(struct mlx5_eswitch *esw)
@@ -1021,16 +1036,40 @@ int mlx5_esw_devlink_rate_node_del(struct devlink_rate *rate_node, void *priv,
return 0;
}
-int mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport, struct mlx5_esw_sched_node *parent,
- struct netlink_ext_ack *extack)
+static int mlx5_esw_validate_cross_esw_scheduling(struct mlx5_eswitch *esw,
+ struct mlx5_esw_sched_node *parent,
+ struct netlink_ext_ack *extack)
{
- struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
- int err = 0;
+ if (!parent || esw == parent->esw)
+ return 0;
- if (parent && parent->esw != esw) {
+ if (!MLX5_CAP_QOS(esw->dev, esw_cross_esw_sched)) {
NL_SET_ERR_MSG_MOD(extack, "Cross E-Switch scheduling is not supported");
return -EOPNOTSUPP;
}
+ if (esw->qos.domain != parent->esw->qos.domain) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Cannot add vport to a parent belonging to a different qos domain");
+ return -EOPNOTSUPP;
+ }
+ if (!mlx5_lag_is_active(esw->dev)) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Cross E-Switch scheduling requires LAG to be activated");
+ return -EOPNOTSUPP;
+ }
+
+ return 0;
+}
+
+int mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport, struct mlx5_esw_sched_node *parent,
+ struct netlink_ext_ack *extack)
+{
+ struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
+ int err;
+
+ err = mlx5_esw_validate_cross_esw_scheduling(esw, parent, extack);
+ if (err)
+ return err;
esw_qos_lock(esw);
if (!vport->qos.sched_node && parent)
--
2.45.0
* [PATCH net-next 09/10] net/mlx5: qos: Init shared devlink rate domain
2025-02-13 18:01 [PATCH net-next 00/10] devlink and mlx5: Introduce rate domains Tariq Toukan
` (7 preceding siblings ...)
2025-02-13 18:01 ` [PATCH net-next 08/10] net/mlx5: qos: Support cross-esw tx scheduling Tariq Toukan
@ 2025-02-13 18:01 ` Tariq Toukan
2025-02-13 18:01 ` [PATCH net-next 10/10] net/mlx5: Document devlink rates and cross-esw scheduling Tariq Toukan
9 siblings, 0 replies; 23+ messages in thread
From: Tariq Toukan @ 2025-02-13 18:01 UTC (permalink / raw)
To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko
Cc: Cosmin Ratiu, Carolina Jubran, Gal Pressman, Mark Bloch,
Donald Hunter, Jiri Pirko, Jonathan Corbet, Saeed Mahameed,
Leon Romanovsky, Tariq Toukan, netdev, linux-kernel, linux-doc,
linux-rdma
From: Cosmin Ratiu <cratiu@nvidia.com>
If the device can do cross-esw scheduling, switch to using a shared rate
domain so that devlink rate nodes can have parents from other functions
of the same device.
As a fallback mechanism, if switching to a shared rate domain fails, a
warning is logged and the code falls back to a private qos domain, since
cross-esw scheduling cannot be supported in that case.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
index e6dcfe348a7e..9af12ae98691 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
@@ -899,6 +899,20 @@ int mlx5_esw_qos_init(struct mlx5_eswitch *esw)
bool use_shared_domain = esw->mode == MLX5_ESWITCH_OFFLOADS &&
MLX5_CAP_QOS(esw->dev, esw_cross_esw_sched);
+ if (use_shared_domain) {
+ u64 guid = mlx5_query_nic_system_image_guid(esw->dev);
+ int err;
+
+ err = devlink_shared_rate_domain_init(priv_to_devlink(esw->dev), guid);
+ if (err) {
+ /* On failure, issue a warning and switch to using a private domain. */
+ esw_warn(esw->dev,
+ "Shared devlink rate domain init failed (err %d), cross-esw QoS not available",
+ err);
+ use_shared_domain = false;
+ }
+ }
+
if (esw->qos.domain) {
if (esw->qos.domain->shared == use_shared_domain)
return 0; /* Nothing to change. */
--
2.45.0
* [PATCH net-next 10/10] net/mlx5: Document devlink rates and cross-esw scheduling
2025-02-13 18:01 [PATCH net-next 00/10] devlink and mlx5: Introduce rate domains Tariq Toukan
` (8 preceding siblings ...)
2025-02-13 18:01 ` [PATCH net-next 09/10] net/mlx5: qos: Init shared devlink rate domain Tariq Toukan
@ 2025-02-13 18:01 ` Tariq Toukan
9 siblings, 0 replies; 23+ messages in thread
From: Tariq Toukan @ 2025-02-13 18:01 UTC (permalink / raw)
To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jiri Pirko
Cc: Cosmin Ratiu, Carolina Jubran, Gal Pressman, Mark Bloch,
Donald Hunter, Jiri Pirko, Jonathan Corbet, Saeed Mahameed,
Leon Romanovsky, Tariq Toukan, netdev, linux-kernel, linux-doc,
linux-rdma
From: Cosmin Ratiu <cratiu@nvidia.com>
Extend the devlink-port documentation with a mention that parents can be
from different devices.
Rates were not documented in the mlx5-specific file, so add examples of
how to limit VFs and groups, and provide an example of the intended way
to achieve cross-esw scheduling.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../networking/devlink/devlink-port.rst | 2 ++
Documentation/networking/devlink/mlx5.rst | 33 +++++++++++++++++++
2 files changed, 35 insertions(+)
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
index 9d22d41a7cd1..1d9e5839eef4 100644
--- a/Documentation/networking/devlink/devlink-port.rst
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -417,6 +417,8 @@ API allows to configure following rate object's parameters:
Parent node name. Parent node rate limits are considered as additional limits
to all node children limits. ``tx_max`` is an upper limit for children.
``tx_share`` is a total bandwidth distributed among children.
+ If the device supports cross-function scheduling, the parent can be from a
+ different function of the same underlying device.
``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
index 7febe0aecd53..61e76da36faf 100644
--- a/Documentation/networking/devlink/mlx5.rst
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -298,3 +298,36 @@ User commands examples:
.. note::
This command can run over all interfaces such as PF/VF and representor ports.
+
+Rates
+=====
+
+mlx5 devices can limit transmission of individual VFs or a group of them via
+the devlink-rate API in switchdev mode.
+
+User commands examples:
+
+- Print the existing rates::
+
+ $ devlink port function rate show
+
+- Set a max tx limit on traffic from VF0::
+
+ $ devlink port function rate set pci/0000:82:00.0/1 tx_max 10Gbit
+
+- Create a rate group with a max tx limit and adding two VFs to it::
+
+ $ devlink port function rate add pci/0000:82:00.0/group1 tx_max 10Gbit
+ $ devlink port function rate set pci/0000:82:00.0/1 parent group1
+ $ devlink port function rate set pci/0000:82:00.0/2 parent group1
+
+- Same scenario, with a min guarantee of 20% of the bandwidth for the first VF::
+
+ $ devlink port function rate add pci/0000:82:00.0/group1 tx_max 10Gbit
+ $ devlink port function rate set pci/0000:82:00.0/1 parent group1 tx_share 2Gbit
+ $ devlink port function rate set pci/0000:82:00.0/2 parent group1
+
+- Cross-device scheduling::
+
+ $ devlink port function rate add pci/0000:82:00.0/group1 tx_max 10Gbit
+ $ devlink port function rate set pci/0000:82:00.1/32769 parent group1
--
2.45.0