* [PATCH net-next V10 00/14] devlink and mlx5: Support cross-function rate scheduling
@ 2026-07-01 7:32 Tariq Toukan
From: Tariq Toukan @ 2026-07-01 7:32 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
netdev, Paolo Abeni
Cc: Adithya Jayachandran, Bobby Eshleman, Carolina Jubran,
Cosmin Ratiu, Daniel Borkmann, Daniel Jurgens, Daniel Zahka,
David Wei, Donald Hunter, Dragos Tatulea, Jiri Pirko, Jiri Pirko,
Joe Damato, Jonathan Corbet, Kees Cook, Leon Romanovsky,
linux-doc, linux-kernel, linux-kselftest, linux-rdma, Mark Bloch,
Moshe Shemesh, Or Har-Toov, Parav Pandit, Petr Machata,
Ratheesh Kannoth, Saeed Mahameed, Shahar Shitrit, Shay Drori,
Shuah Khan, Shuah Khan, Simon Horman, Stanislav Fomichev,
Tariq Toukan, Willem de Bruijn, Gal Pressman
Hi,
This series by Cosmin adds support for cross-function rate scheduling in
devlink and mlx5.
See detailed explanation by Cosmin below [0].
Regards,
Tariq
[0]
devlink objects support rate management for TX scheduling, which
involves maintaining a tree of rate nodes that corresponds to TX
schedulers in hardware. 'man devlink-rate' has the full details.
The tree of rate nodes is maintained per devlink object, protected by
the devlink lock.
There exists hardware capable of instantiating TX scheduling trees
spanning multiple functions of the same physical device (and thus
devlink objects) and therefore the current API and locking scheme is
insufficient.
This patch series changes the devlink rate implementation and API to
allow supporting such hardware and managing TX scheduling trees across
multiple functions of a physical device.
Modeling this requires having devlink rate nodes with parents in other
devlink objects. A naive approach that relies on the current
one-lock-per-devlink model is unworkable, as it would in some cases
require acquiring multiple devlink locks in the correct order.
The solution proposed in this patch series makes use of the recently
introduced shared devlink instance [1] to manage rate hierarchy changes
across multiple functions.
V1 of this patch series was sent a long time ago [2], using a different
approach of storing rates in a shared rate domain with special locking
rules. This new approach uses standard devlink instances and nesting.
The first part of the series adds support to devlink rates for
maintaining the rate tree across multiple functions.
The second part changes the mlx5 implementation to make use of this (and
cleans up remnants of the previous approach, involving rate domains).
The neat part about using the shared devlink object is that it works for
SFs as well, which are already nested in their parent PF instances. So
with this series, complex scheduling trees spanning multiple SFs across
multiple PFs of the same NIC can now be supported.
---
[1] https://lore.kernel.org/all/20260312100407.551173-1-jiri@resnulli.us/T/#u
[2] https://lore.kernel.org/netdev/20250213180134.323929-1-tariqt@nvidia.com/
[3] https://lore.kernel.org/netdev/20260325152801.236343-1-jiri@resnulli.us/
Comments from internal Sashiko review:
> When a rate node is linked to a parent on a different devlink device, should
> the rate-get reply and dump operations also include the parent's device
> information?
> Currently, parent-dev is only added to the request attributes for rate-set
> and rate-new, but it appears to be omitted from the serialization logic and
> the rate-get reply definition. Does this render the cross-device
> relationship invisible to user space monitoring tools?
[CR] Preexisting problem with the devlink rate API: the rate-get command
specifies only a subset of output attributes. May be fixed separately.
> > diff --git a/net/devlink/rate.c b/net/devlink/rate.c
> > index 5436811cfbbd..2b30329a1394 100644
> > --- a/net/devlink/rate.c
> > +++ b/net/devlink/rate.c
> > @@ -660,9 +660,11 @@ int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)
> >
> > int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
> > {
> > - struct devlink *rate_devlink, *devlink = devlink_nl_ctx(info)->devlink;
> > + struct devlink_nl_ctx *ctx = devlink_nl_ctx(info);
> > + struct devlink *devlink = ctx->devlink;
> While ctx is now retrieved, does the rest of the function ignore
> ctx->parent_devlink?
> It looks like the command handlers never retrieve or use the parent devlink,
> and unconditionally use the child's devlink for lookups and assignments. Will
> this cause cross-device parent assignments to silently fail or return -ENODEV
> since they search for the parent node exclusively on the child's device?
[CR] False positive, it's used in devlink_nl_rate_parent_node_set().
> Additionally, looking at how the parent devlink is prepared in
> __devlink_nl_pre_doit() (which is called via
> DEVLINK_NL_FLAG_OPTIONAL_PARENT_DEV):
> __devlink_nl_pre_doit()
> if (parent_dev && attrs[DEVLINK_ATTR_PARENT_DEV]) {
> parent_devlink = devlink_get_parent_from_attrs_lock(net, attrs);
> ...
> devlink_nl_ctx(info)->parent_devlink = parent_devlink;
> /* Drop the parent devlink lock but don't release the reference.
> * This will keep it alive until the end of the request.
> */
> devl_unlock(parent_devlink);
> }
> To actually implement the cross-device feature in the command handlers, the
> parent's rate list would need to be accessed. Since the parent devlink lock
> is explicitly dropped here, wouldn't accessing the parent's rate list
> without its lock violate devlink locking rules and lead to data races?
[CR] Locking changes are intentional. All rate changes for anything in
the rate group now acquire the root devlink lock, where rates are
stored.
> > * devl_rate_nodes_destroy - destroy all devlink rate nodes on device
> > * @devlink: devlink instance
> > *
> > - * Unset parent for all rate objects and destroy all rate nodes
> > - * on specified device.
> > + * Unset parent for all rate objects involving this device and destroy all rate
> > + * nodes on it.
> > */
> > void devl_rate_nodes_destroy(struct devlink *devlink)
> > {
> > - const struct devlink_ops *ops = devlink->ops;
> > struct devlink_rate *devlink_rate, *tmp;
> > + const struct devlink_ops *ops;
> > + struct devlink *rate_devlink;
> >
> > devl_assert_locked(devlink);
> > + rate_devlink = devl_rate_lock(devlink);
> >
> > - list_for_each_entry(devlink_rate, &devlink->rate_list, list) {
> > - if (!devlink_rate->parent)
> > + list_for_each_entry(devlink_rate, &rate_devlink->rate_list, list) {
> > + if (!devlink_rate->parent ||
> > + (devlink_rate->devlink != devlink &&
> > + devlink_rate->parent->devlink != devlink))
> > continue;
> >
> > + /* This could destroy rate objects on other devlinks in the
> > + * same hierarchy under 'rate_devlink'. This is safe because
> > + * the shared common ancestor is locked so there can be no
> > + * other concurrent rate operations on devlink_rate->devlink.
> > + */
> > + ops = devlink_rate->devlink->ops;
> Does this code invoke devlink_ops callbacks on other devlink instances
> without holding their instance lock?
> While the comment indicates the shared common ancestor lock prevents concurrent
> rate operations, the child device's instance lock (devlink_rate->devlink) is
> not held.
> Could this violate the devlink locking contract if the driver is executing
> concurrent non-rate operations on the child instance? Because devlink
> drivers rely on the instance lock to serialize access to their internal
> structures, could invoking these callbacks unlocked cause data corruption or
> trigger lockdep splats if the driver asserts devl_assert_locked()?
[CR] As mentioned in the comment, concurrent ops which touch rates
_will_ eventually acquire the shared devlink lock, so they're safe. We
don't care about other ops, since we're only touching rates here.
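To make the loop filter in the hunk above easier to follow, here is a minimal user-space sketch of the same predicate. The toy_* names and structs are hypothetical stand-ins, not kernel types; only the boolean logic mirrors the quoted diff (a rate object is skipped when it has no parent, or when neither it nor its parent belongs to the devlink being torn down).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct devlink and struct devlink_rate. */
struct toy_devlink {
	int id;
};

struct toy_rate {
	struct toy_devlink *devlink;	/* instance owning this rate object */
	struct toy_rate *parent;	/* NULL when the object has no parent */
};

/* Returns true when the parent link of 'rate' involves 'devlink', i.e. when
 * the quoted loop would NOT hit its 'continue' and would unset the parent.
 */
static bool toy_rate_involves(const struct toy_rate *rate,
			      const struct toy_devlink *devlink)
{
	assert(rate && devlink);

	if (!rate->parent)
		return false;

	return rate->devlink == devlink || rate->parent->devlink == devlink;
}
```

With a group on one instance and a leaf on another, the predicate is true for the leaf from the point of view of either instance, which is why tearing down the parent's device also unparents cross-device children.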
> > @@ -321,22 +307,19 @@ static int esw_qos_create_rate_limit_element(struct mlx5_esw_sched_node *node,
> > return esw_qos_node_create_sched_element(node, sched_ctx, extack);
> > }
> >
> > -static u32 esw_qos_calculate_min_rate_divider(struct mlx5_eswitch *esw,
> > - struct mlx5_esw_sched_node *parent)
> > +static u32
> > +esw_qos_calculate_min_rate_divider(struct mlx5_esw_sched_node *parent)
> > {
> > - struct list_head *nodes = parent ? &parent->children : &esw->qos.domain->nodes;
> > - u32 fw_max_bw_share = MLX5_CAP_QOS(esw->dev, max_tsar_bw_share);
> > + u32 fw_max_bw_share = MLX5_CAP_QOS(parent->esw->dev, max_tsar_bw_share);
> This is a pre-existing issue, but does this division risk a divide-by-zero
> panic regression?
> If the hardware does not support bandwidth sharing, max_tsar_bw_share
> could be 0. If a user subsequently sets a non-zero tx_share (minimum rate
> guarantee), max_guarantee becomes greater than 0, resulting in
> max_guarantee / fw_max_bw_share.
> Should we validate whether fw_max_bw_share is non-zero before performing
> the division?
[CR] Preexisting, also not really possible for fw to give a value of 0
for the divisor.
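For reference, the guard the review comment asks about could look roughly like the sketch below. This is a user-space illustration only, not the driver code: toy_min_rate_divider() is a hypothetical name, the error code is an arbitrary choice, and the clamp of the result to at least 1 is an assumption about the intended behavior.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Computes max_guarantee / fw_max_bw_share, refusing a zero divisor.
 * Returns 0 on success and stores the divider, or -EINVAL when a non-zero
 * guarantee meets a zero capability value.
 */
static int toy_min_rate_divider(uint32_t max_guarantee,
				uint32_t fw_max_bw_share,
				uint32_t *divider)
{
	assert(divider != NULL);

	if (!max_guarantee) {
		/* Nothing to normalize against. */
		*divider = 0;
		return 0;
	}

	if (!fw_max_bw_share)
		return -EINVAL;	/* hypothetical guard, not in the series */

	*divider = max_guarantee / fw_max_bw_share;
	if (!*divider)
		*divider = 1;	/* assumed clamp, see lead-in */

	return 0;
}
```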
> > @@ -665,7 +631,6 @@ static int esw_qos_create_tc_arbiter_sched_elem(
> > struct netlink_ext_ack *extack)
> > {
> > u32 tsar_ctx[MLX5_ST_SZ_DW(scheduling_context)] = {};
> > - u32 tsar_parent_ix;
> > void *attr;
> This is a pre-existing issue, but is there a hardware scheduling element
> leak when transitioning a node to a TC arbiter?
> When esw_qos_tc_arbiter_scheduling_setup() calls
> esw_qos_create_tc_arbiter_sched_elem(node, extack), it provisions a new
> hardware element and immediately overwrites node->ix with its ID.
> If the subsequent step esw_qos_create_vports_tc_nodes(node, extack) fails,
> it jumps to err_tc_node_create and restores node->ix = curr_ix, but it
> doesn't appear to tear down the newly created scheduling element. Does this
> permanently leak the newly provisioned hardware element in the firmware?
[CR] False alarm, the call to esw_qos_node_destroy_sched_element()
destroys the new temporary scheduling element on failure.
> > @@ -1319,11 +1288,9 @@ static int esw_qos_switch_tc_arbiter_node_to_vports(
> > struct mlx5_esw_sched_node *node,
> > struct netlink_ext_ack *extack)
> > {
> This is a pre-existing issue, but what happens if the rollback path fails
> during vport migration in esw_qos_switch_vports_node_to_tc_arbiter()?
> If an error occurs, it attempts to rollback by calling
> esw_qos_switch_vport_tcs_to_vport(). If this rollback fails internally
> (e.g. due to ENOMEM in esw_qos_vport_update()), the vport is
> successfully restored to the doomed TC arbiter node, and the error
> is silently ignored.
> Then, the caller esw_qos_node_enable_tc_arbitration() executes
> esw_qos_tc_arbiter_scheduling_teardown(), which frees the TC arbiter's
> VPORTS_TC_TSAR nodes without freeing the attached vport VPORT_TC nodes.
> Does this leak the vport TC nodes and leave them holding use-after-free
> pointers to their freed parents, leading to memory corruption when the
> vport is later disabled?
[CR] Preexisting, also hard to hit. Requires a failure mid-way when
switching nodes to TC arbitration AND a failure during rollback. Fixing
it is outside the scope of this series.
> > @@ -1888,7 +1855,9 @@ mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport,
> > err = mlx5_esw_qos_vport_enable(vport, type, parent, 0, 0,
> > extack);
> > } else if (vport->qos.sched_node) {
> > - err = esw_qos_vport_update_parent(vport, parent, extack);
> > + err = esw_qos_vport_update_parent(vport,
> > + parent ? : esw->qos.root,
> > + extack);
> Can this sequence cause a use-after-free regression on esw->qos.root
> when updating a vport's QoS configuration?
> If a vport is the only QoS user in the E-Switch, esw->qos.refcnt is 1.
> If its QoS configuration is updated, esw_qos_vport_update() is called
> with the parent argument pointing to esw->qos.root (since this patch
> changes the default parent from NULL to esw->qos.root).
> Inside esw_qos_vport_update(), mlx5_esw_qos_vport_disable_locked() drops
> the vport's QoS reference, decrementing esw->qos.refcnt to 0. This
> triggers esw_qos_destroy(), freeing esw->qos.root.
> The function then calls mlx5_esw_qos_vport_enable(..., parent, ...) passing
> the stale parent pointer. Because parent is no longer NULL, it doesn't
> fetch the newly re-allocated root node, and instead passes the freed
> pointer to __esw_qos_alloc_node(), resulting in a use-after-free.
[CR] False alarm. The finding confuses esw_qos_vport_disable() and
mlx5_esw_qos_vport_disable_locked(). esw_qos_vport_disable() doesn't
touch the refcnt.
> > @@ -1859,13 +1840,15 @@ mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport,
> > extack);
> > }
> > - esw_qos_unlock(esw);
> > +
> > return err;
> > }
> >
> > void mlx5_esw_qos_vport_clear_parent(struct mlx5_vport *vport)
> > {
> > + esw_qos_lock(vport->dev);
> > mlx5_esw_qos_vport_update_parent(vport, NULL, NULL);
> > + esw_qos_unlock(vport->dev);
> > }
> Could this lead to a recursive mutex deadlock during VF teardown on older
> hardware where shd is NULL?
> In the teardown path, mlx5_eswitch_unload_pf_vf_vport() already acquires
> esw->state_lock. It then proceeds to call:
> mlx5_eswitch_unload_pf_vf_vport()
> mlx5_eswitch_unload_vport()
> mlx5_esw_offloads_unload_rep()
> mlx5_esw_offloads_devlink_port_unregister()
> mlx5_esw_qos_vport_clear_parent()
> Since this patch changes mlx5_esw_qos_vport_clear_parent() to
> unconditionally call esw_qos_lock(), which falls back to acquiring
> esw->state_lock when dev->shd is NULL, won't this result in an attempt to
> re-acquire the non-recursive state_lock that is already held by the
> teardown process?
[CR] False alarm. state_lock isn't held during VF teardown.
> > @@ -839,13 +940,17 @@ void devl_rate_nodes_destroy(struct devlink *devlink)
> > refcount_dec(&devlink_rate->parent->refcnt);
> > devlink_rate->parent = NULL;
> When unsetting the parent of a rate object in devl_rate_nodes_destroy(),
> this patch allows it to happen to cross-device child nodes (where
> devlink_rate->devlink != devlink). Since the child's devlink instance is
> still active, shouldn't its state change (losing its parent) be
> communicated to user space?
> Omitting devlink_rate_notify(devlink_rate, DEVLINK_CMD_RATE_NEW) here might
> leave user space monitors out of sync, incorrectly believing the active child
> is still attached to the deleted parent node.
[CR] Preexisting, devl_rate_nodes_destroy() didn't send notifications.
> > +static struct devlink *devl_rate_lock(struct devlink *devlink)
> > {
> > - return devlink;
> > + struct devlink *rate_devlink = devlink, *parent;
> > +
> > + devl_assert_locked(devlink);
> > +
> > + while (rate_devlink->ops &&
> > + rate_devlink->ops->supported_cross_device_rate_nodes) {
> > + parent = devlink_nested_in_get_lock(rate_devlink);
> > + if (!parent)
> > + break;
> Can this lead to list corruption if the parent devlink is unregistered?
> If the parent is unregistered concurrently (for example, a PF is unregistered
> before VF teardown), devlink_nested_in_get_lock() will return NULL because of
> this internal check:
> devlink_nested_in_get_lock() {
> ...
> if (devl_is_registered(devlink))
> return devlink;
> ...
> return NULL;
> }
> Since parent is NULL, this loop breaks and returns the child devlink lock
> instead of the parent's lock. Later, in functions like devl_rate_leaf_destroy(),
> list_del(&devlink_rate->list) is called on the rate node.
> Because the rate node is chained into the parent's rate_list, does this
> modify the parent's list concurrently without holding its lock?
[CR] This misunderstands the nesting hierarchy. VFs are not nested in
PFs, PFs are nested in shd, which is reference counted (=> outlives all
PFs) and keeps all rate nodes for all PFs.
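The hierarchy described above (SF nested in PF, PF nested in shd, rates stored at the top) can be pictured with a small user-space model of the walk done by devl_rate_lock(). Everything here is a simplified assumption: the toy_* types are hypothetical, locks are plain flags, and reference counting is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct devlink. */
struct toy_devlink {
	struct toy_devlink *nested_in;	/* instance this one is nested in */
	bool cross_device_rates;	/* models supported_cross_device_rate_nodes */
	bool registered;
	bool locked;
};

/* Models "lock and return the nested-in instance": NULL when there is no
 * parent or the parent is not registered.
 */
static struct toy_devlink *toy_nested_in_get_lock(struct toy_devlink *dl)
{
	struct toy_devlink *parent = dl->nested_in;

	if (!parent || !parent->registered)
		return NULL;

	parent->locked = true;
	return parent;
}

/* Walks up while cross-device rates are supported and returns the topmost
 * instance, locked. Intermediate instances are unlocked on the way; the
 * caller's own instance stays locked.
 */
static struct toy_devlink *toy_rate_lock(struct toy_devlink *devlink)
{
	struct toy_devlink *rate_devlink = devlink, *parent;

	assert(devlink->locked);

	while (rate_devlink->cross_device_rates) {
		parent = toy_nested_in_get_lock(rate_devlink);
		if (!parent)
			break;
		if (rate_devlink != devlink)
			rate_devlink->locked = false;
		rate_devlink = parent;
	}

	return rate_devlink;
}
```

In this model, starting from a locked SF yields the shd instance with only the SF and shd held, while an instance without the cross-device flag simply gets itself back.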
> This is a pre-existing issue, but does devlink_nl_rate_del_doit()
> unconditionally free rate nodes even if the driver rejects the deletion?
> Looking at devlink_nl_rate_del_doit() in net/devlink/rate.c:
> devlink_nl_rate_del_doit() {
> ...
> err = devlink->ops->rate_node_del(rate_node, rate_node->priv,
> info->extack);
> if (rate_node->parent)
> refcount_dec(&rate_node->parent->refcnt);
> list_del(&rate_node->list);
> kfree(rate_node->name);
> kfree(rate_node);
> ...
> }
> If the hardware or driver cannot fulfill the request and ops->rate_node_del()
> returns an error like -EBUSY, the function ignores the error regarding memory
> management and proceeds to kfree() the rate node.
> Could this leave the driver with a dangling pointer to the freed rate node,
> triggering a use-after-free later?
[CR] Preexisting, it is intentional that the node is removed no matter
what the rate_node_del op manages to do.
> > @@ -243,7 +243,29 @@ devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
> > + err = nla_parse_nested(tb, maxtype, attrs[DEVLINK_ATTR_PARENT_DEV],
> > + devlink_dl_parent_dev_nl_policy, NULL);
> Could this nla_parse_nested() call pass down an extended ACK pointer instead
> of NULL?
> When passing NULL for the extack parameter, if the nested attribute is
> malformed (such as a missing NLA_F_NESTED flag), userspace will receive a
> silent -EINVAL without a descriptive error message.
> Because devlink_get_parent_from_attrs_lock() is called by
> __devlink_nl_pre_doit() which has access to the genl_info structure and
> its extack pointer, passing it down to this function would allow for
> reporting accurate validation errors to userspace.
[CR] Valid, but quite a lot of plumbing is necessary for extack to be
used here. Out of scope of the series.
> This is a pre-existing issue, but does the error path in
> devlink_nl_rate_new_doit() leak the parent's reference count?
> If devlink_nl_rate_set() successfully sets the parent (which increments
> parent->refcnt) but fails on a subsequent attribute validation like tc_bw,
> it returns an error.
> The cleanup path in devlink_nl_rate_new_doit() handles the error:
> err_rate_set:
> ops->rate_node_del(rate_node, rate_node->priv, info->extack);
> err_node_new:
> kfree(rate_node->name);
> err_strdup:
> kfree(rate_node);
> Because this path frees the node but never decrements the parent's refcnt,
> does this permanently elevate the parent's refcount, falsely indicating it
> has children and preventing it from ever being deleted?
[CR] Preexisting, will be fixed separately with a net patch.
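As an illustration of the imbalance described in the comment above, the following user-space sketch shows an error path that drops the parent reference before freeing the node. It is not the planned net fix: the toy_* names are hypothetical, the counter is a plain integer that counts children only, and the late failure is simulated with a flag.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct devlink_rate. */
struct toy_rate {
	unsigned int refcnt;		/* number of children pointing here */
	struct toy_rate *parent;
};

/* Sets the parent first, then optionally fails, mimicking a later
 * attribute being rejected after the parent was already taken.
 */
static int toy_rate_set(struct toy_rate *node, struct toy_rate *parent,
			int fail_late)
{
	if (parent) {
		node->parent = parent;
		parent->refcnt++;
	}

	return fail_late ? -1 : 0;
}

static struct toy_rate *toy_rate_new(struct toy_rate *parent, int fail_late)
{
	struct toy_rate *node = calloc(1, sizeof(*node));

	if (!node)
		return NULL;

	if (toy_rate_set(node, parent, fail_late)) {
		/* Balance the reference taken in toy_rate_set(). */
		if (node->parent)
			node->parent->refcnt--;
		free(node);
		return NULL;
	}

	return node;
}
```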
> > @@ -30,14 +30,42 @@ devlink_rate_leaf_get_from_info(struct devlink *devlink, struct genl_info *info)
> > return devlink_rate ?: ERR_PTR(-ENODEV);
> > }
> >
> > +/* Repeatedly walks the nested devlink chain while cross device rate nodes are
> > + * supported and finds the topmost instance where rates should be stored.
> > + * That instance is locked, referenced and returned.
> > + * When cross device rate nodes aren't supported the original devlink instance
> > + * is returned.
> > + */
> > static struct devlink *devl_rate_lock(struct devlink *devlink)
> > {
> > - return devlink;
> > + struct devlink *rate_devlink = devlink, *parent;
> > +
> > + devl_assert_locked(devlink);
> > +
> > + while (rate_devlink->ops &&
> > + rate_devlink->ops->supported_cross_device_rate_nodes) {
> > + parent = devlink_nested_in_get_lock(rate_devlink);
> > + if (!parent)
> > + break;
> > + if (rate_devlink != devlink) {
> > + /* Unlock intermediate instances. */
> > + devl_unlock(rate_devlink);
> > + devlink_put(rate_devlink);
> > + }
> > + rate_devlink = parent;
> > + }
> > + return rate_devlink;
> > }
> Now that devl_rate_lock() returns the shared ancestor, does this cause
> devlink_rates_check() to skip child rate nodes?
> In net/devlink/rate.c:devlink_rates_check(), the loop filters nodes by the
> devlink pointer:
> rate_devlink = devl_rate_lock(devlink);
> list_for_each_entry(devlink_rate, &rate_devlink->rate_list, list)
> if (devlink_rate->devlink == devlink &&
> (!rate_filter || rate_filter(devlink_rate))) {
> ...
> err = -EBUSY;
> break;
> }
> Because of this check, if a user creates an SF rate node and then triggers an
> eswitch mode change to legacy on the PF devlink, devlink_rates_check() skips
> the SF node and returns 0 instead of -EBUSY.
> Could this allow mlx5_devlink_eswitch_mode_set() to proceed and unconditionally
> free the QoS domain in mlx5_esw_qos_cleanup(), while the child SF devlink and
> its software rate node remain active?
> If so, a subsequent operation on the active child SF rate node (like
> DEVLINK_CMD_RATE_DEL) would invoke mlx5_esw_devlink_rate_node_del() and call
> esw_qos_lock(), leading to a NULL pointer dereference when accessing
> esw->qos.domain->lock.
[CR] False alarm, no driver uses supported_cross_device_rate_nodes yet
so this code path isn't taken. QoS domains are removed in a later patch
in the series before the cross device flag is enabled, so this finding
is moot.
> > static int mlx5_esw_qos_max_link_speed_get(struct mlx5_core_dev *mdev, u32 *link_speed_max,
> > - bool take_rtnl,
> > struct netlink_ext_ack *extack)
> > {
> > int err;
> >
> > - if (!mlx5_lag_is_active(mdev))
> > + if (!mlx5_lag_is_active(mdev) ||
> > + mlx5_lag_query_bond_speed(mdev, link_speed_max) < 0 ||
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> This is a pre-existing issue, but does mlx5_lag_query_bond_speed() have a race
> condition when accessing the ldev->tracker structure?
> The mlx5_lag_query_bond_speed() function reads ldev->tracker.bond_speed_mbps
> while holding lag_lock. However, mlx5_lag_netdev_event() modifies
> ldev->tracker holding only rtnl_lock, without holding lag_lock:
> drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c:mlx5_lag_netdev_event() {
> ...
> ldev->tracker = tracker;
> ...
> }
> Concurrently, mlx5_ldev_add_netdev() modifies fields inside it under lag_lock,
> meaning its updates can be overwritten and lost by the read-modify-write in
> mlx5_lag_netdev_event().
> Since this patch now heavily relies on mlx5_lag_query_bond_speed() for QoS
> max link speed, could this expose readers to observing torn or inconsistent
> tracker states?
[CR] Preexisting, no impact. Either a value of 0 is read, or the correct one.
> > @@ -458,10 +461,17 @@ esw_qos_vport_create_sched_element(struct mlx5_esw_sched_node *vport_node,
> > MLX5_SET(scheduling_context, sched_ctx, max_average_bw,
> > vport_node->max_rate);
> > + if (vport->dev != dev) {
> > + /* The port is assigned to a node on another eswitch. */
> > + MLX5_SET(vport_element, attr, eswitch_owner_vhca_id_valid,
> > + true);
> > + MLX5_SET(vport_element, attr, eswitch_owner_vhca_id,
> > + MLX5_CAP_GEN(vport->dev, vhca_id));
> Does this code silently truncate the vhca_id if it is >= 2048?
> The hardware structure mlx5_ifc_vport_element_bits defines
> eswitch_owner_vhca_id as an 11-bit field, but MLX5_CAP_GEN(..., vhca_id)
> returns a 16-bit identifier. The MLX5_SET macro will silently apply a
> mask of 0x7ff. Should there be a bounds check returning an error like
> -EOVERFLOW here, or does the hardware actually support a 16-bit field?
[CR] vhca_id >= 2048 would imply a NIC with that many physical ports,
which is only possible in theory. Not worth addressing.
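The arithmetic behind the truncation question can be checked with a short user-space sketch. The 11-bit width is taken from the review comment above; the toy_* helpers are hypothetical and only model what a masked field write versus a bounds-checked write would produce.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_VHCA_ID_BITS 11
#define TOY_VHCA_ID_MASK ((1u << TOY_VHCA_ID_BITS) - 1)	/* 0x7ff */

/* Models a field write that silently keeps only the low 11 bits. */
static uint32_t toy_set_masked(uint16_t vhca_id)
{
	return vhca_id & TOY_VHCA_ID_MASK;
}

/* Models the alternative raised in the review: reject values that do not
 * fit instead of truncating them.
 */
static int toy_set_checked(uint16_t vhca_id, uint32_t *field)
{
	assert(field != NULL);

	if (vhca_id > TOY_VHCA_ID_MASK)
		return -EOVERFLOW;

	*field = vhca_id;
	return 0;
}
```

A value of 2048 masks to 0 and 2049 masks to 1, so an out-of-range ID would alias a low-numbered one rather than fail.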
> > @@ -1221,8 +1260,15 @@ static int esw_qos_vport_update(struct mlx5_vport *vport,
> > if (err)
> > return err;
> >
> > - if (curr_type == SCHED_NODE_TYPE_TC_ARBITER_TSAR && curr_type == type)
> > + if (curr_type == SCHED_NODE_TYPE_TC_ARBITER_TSAR && curr_type == type) {
> > esw_qos_tc_arbiter_get_bw_shares(vport_node, curr_tc_bw);
> Does this condition skip the preservation and restoration of TC bandwidth
> shares when transitioning between different node types?
> When a vport transitions away from SCHED_NODE_TYPE_TC_ARBITER_TSAR
> (e.g., to a regular SCHED_NODE_TYPE_VPORT), curr_type != type evaluates to
> true, causing the function to skip saving the current tc_bw shares.
> esw_qos_vport_disable() then unconditionally clears vport_node->tc_bw to
> zero.
> If the subsequent esw_qos_vport_enable() fails, the rollback path attempts
> to restore the old TC_ARBITER_TSAR state.
> However, because curr_type != type, the code skips restoring the bandwidth
> shares via esw_qos_set_tc_arbiter_bw_shares().
> Are the vport's bandwidth guarantees permanently erased in this failure
> scenario?
[CR] Pre-existing, unlikely and low-impact: If tc-bw settings fail,
there will be an error and an opportunity for the user to retry.
> > @@ -1990,6 +1990,13 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw, bool clear_vf)
> > esw->esw_funcs.num_vfs, esw->esw_funcs.num_ec_vfs, esw->enabled_vports);
> >
> > mlx5_eswitch_invalidate_wq(esw);
> > +
> > + if (esw->mode == MLX5_ESWITCH_OFFLOADS) {
> > + struct devlink *devlink = priv_to_devlink(esw->dev);
> > +
> > + devl_rate_nodes_destroy(devlink);
> > + }
> > +
> Can this reordering cause a use-after-free if leaf unparenting fails?
> Devlink's devl_rate_nodes_destroy() calls ops->rate_leaf_parent_set(..., NULL)
> which maps to mlx5_esw_qos_vport_update_parent(). Devlink ignores the return
> value of this callback.
> If the hardware operation fails (e.g., due to a firmware timeout), the driver
> leaves vport->qos.sched_node->parent pointing to the original parent node.
> Devlink then unconditionally proceeds to destroy and free the rate node.
> Later, when mlx5_eswitch_unload_vf_vports() executes, it calls
> mlx5_esw_qos_vport_disable(), which reaches esw_qos_vport_disable():
> drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c:esw_qos_vport_disable() {
> ...
> list_del_init(&vport_node->entry);
> esw_qos_normalize_min_rate(vport_node->esw, vport_node->parent, extack);
> ...
> }
> Will this dereference the freed vport_node->parent, resulting in list
> corruption or a use-after-free?
[CR] This is a preexisting problem, brought to light by the reordering
of group destruction before leaf destruction. It's extremely unlikely,
requiring the firmware command to reparent a vport to its root to fail.
Fixing this properly requires multiple patches and will be pursued after
this series.
> > @@ -2039,6 +2040,9 @@ void mlx5_eswitch_disable_locked(struct mlx5_eswitch *esw)
> > esw->mode == MLX5_ESWITCH_LEGACY ? "LEGACY" : "OFFLOADS",
> > esw->esw_funcs.num_vfs, esw->esw_funcs.num_ec_vfs, esw->enabled_vports);
> >
> > + if (esw->mode == MLX5_ESWITCH_OFFLOADS)
> > + devl_rate_nodes_destroy(devlink);
> > +
> Does this identical reordering in the locked disable path suffer from the
> same unparenting failure use-after-free described above?
[CR] Same comment as above. QoS improvements for the error paths will
follow.
V10:
- Added a comment in devl_rate_nodes_destroy clarifying locking.
- Expanded 'supported_cross_device_rate_nodes' comment with locking
expectations.
- Simplified rate locking by only keeping the common ancestor locked.
- Removed devlink_nested_in_get_locked and devlink_nested_in_put_unlock.
- devlink_nl_rate_parent_node_set iterates over the proper rate list.
- Refactored mlx5 locking given dev->shd is now optional (after [3]).
- Fixed a bug in pruning introduced by the root node patch.
- Fixed a bug on failure when detaching a node from parent.
- Clarified expectations for shared devlink rate storage.
- Fixed incorrect net namespace when listing shared instances.
V9:
https://lore.kernel.org/netdev/20260326065949.44058-1-tariqt@nvidia.com/
Cosmin Ratiu (14):
devlink: Update nested instance locking comment
devlink: Add a helper for getting a nested-in instance
devlink: Migrate from info->user_ptr to info->ctx
devlink: Decouple rate storage from associated devlink object
devlink: Add parent dev to devlink API
devlink: Allow parent dev for rate-set and rate-new
devlink: Allow rate node parents from other devlinks
net/mlx5: qos: Use mlx5_lag_query_bond_speed to query LAG speed
net/mlx5: qos: Refactor vport QoS cleanup
net/mlx5: qos: Model the root node in the scheduling hierarchy
net/mlx5: qos: Remove qos domains and use shd
net/mlx5: qos: Support cross-device tx scheduling
selftests: drv-net: Add test for cross-esw rate scheduling
net/mlx5: Document devlink rates
Documentation/netlink/specs/devlink.yaml | 30 +-
.../networking/devlink/devlink-port.rst | 2 +
Documentation/networking/devlink/index.rst | 8 +-
Documentation/networking/devlink/mlx5.rst | 33 +
.../net/ethernet/mellanox/mlx5/core/devlink.c | 1 +
.../mellanox/mlx5/core/esw/devlink_port.c | 1 -
.../net/ethernet/mellanox/mlx5/core/esw/qos.c | 605 +++++++++---------
.../net/ethernet/mellanox/mlx5/core/esw/qos.h | 3 -
.../net/ethernet/mellanox/mlx5/core/eswitch.c | 27 +-
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 18 +-
include/net/devlink.h | 9 +
include/uapi/linux/devlink.h | 2 +
net/devlink/core.c | 20 +-
net/devlink/dev.c | 16 +-
net/devlink/devl_internal.h | 20 +
net/devlink/dpipe.c | 14 +-
net/devlink/health.c | 12 +-
net/devlink/linecard.c | 4 +-
net/devlink/netlink.c | 82 ++-
net/devlink/netlink_gen.c | 24 +-
net/devlink/netlink_gen.h | 8 +
net/devlink/param.c | 4 +-
net/devlink/port.c | 18 +-
net/devlink/rate.c | 331 +++++++---
net/devlink/region.c | 6 +-
net/devlink/resource.c | 14 +-
net/devlink/sb.c | 22 +-
net/devlink/trap.c | 12 +-
.../testing/selftests/drivers/net/hw/Makefile | 1 +
.../drivers/net/hw/devlink_rate_cross_esw.py | 296 +++++++++
30 files changed, 1132 insertions(+), 511 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/hw/devlink_rate_cross_esw.py
base-commit: 1c664ec4b9ea827b609d296921ed5bad8a40a158
--
2.44.0
* [PATCH net-next V10 01/14] devlink: Update nested instance locking comment
From: Tariq Toukan @ 2026-07-01 7:32 UTC
From: Cosmin Ratiu <cratiu@nvidia.com>
In commit [1] a comment about nested instance locking was updated. But
there's another place where this is mentioned, so update that as well.
[1] commit 0061b5199d7c ("devlink: Reverse locking order for nested
instances")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
Documentation/networking/devlink/index.rst | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index 32f70879ddd0..4745148fecf4 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -31,10 +31,10 @@ sure to respect following rules:
- Lock ordering should be maintained. If driver needs to take instance
lock of both nested and parent instances at the same time, devlink
- instance lock of the parent instance should be taken first, only then
- instance lock of the nested instance could be taken.
- - Driver should use object-specific helpers to setup the nested relationship
- before registering the nested devlink instance:
+ instance lock of the nested instance should be taken first, only then
+ instance lock of the parent instance could be taken.
+ - Driver should use object-specific helpers to setup the
+ nested relationship:
- ``devl_nested_devlink_set()`` - called to setup devlink -> nested
devlink relationship (could be used for multiple nested instances).
--
2.44.0
* [PATCH net-next V10 02/14] devlink: Add a helper for getting a nested-in instance
From: Tariq Toukan @ 2026-07-01 7:32 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
netdev, Paolo Abeni
From: Cosmin Ratiu <cratiu@nvidia.com>
Upcoming code will need to obtain references to locked nested-in
devlink instances. Add a helper to lock, reference and return the
nested-in instance.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
net/devlink/core.c | 16 ++++++++++++++++
net/devlink/devl_internal.h | 4 ++++
2 files changed, 20 insertions(+)
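The contract of the new helper can be sketched in userspace as follows. This is only an illustrative model, not kernel code: struct mock_devlink and every mock_* function are hypothetical stand-ins, the nested_in pointer replaces the devlink->rel lookup through devlinks_xa_get(), and plain counters replace the real instance lock and refcount.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace mock of the few devlink bits the helper touches. All mock_*
 * names are hypothetical and exist only for this sketch.
 */
struct mock_devlink {
	struct mock_devlink *nested_in;	/* stands in for rel + xarray lookup */
	bool registered;
	int lock_depth;			/* 1 while the instance lock is held */
	int refcount;
};

/* Mirrors the helper's contract: the caller already holds @dl's lock; the
 * result is either NULL or a parent instance that is locked and carries
 * one extra reference.
 */
static struct mock_devlink *mock_nested_in_get_lock(struct mock_devlink *dl)
{
	struct mock_devlink *parent;

	assert(dl->lock_depth == 1);	/* models devl_assert_locked() */
	parent = dl->nested_in;
	if (!parent)
		return NULL;
	parent->refcount++;		/* reference taken by the lookup */
	parent->lock_depth++;		/* models devl_lock() */
	if (parent->registered)
		return parent;
	parent->lock_depth--;		/* models devl_unlock() */
	parent->refcount--;		/* models devlink_put() */
	return NULL;
}

/* What a caller is expected to do once it is done with the parent. */
static void mock_nested_in_put_unlock(struct mock_devlink *parent)
{
	parent->lock_depth--;
	parent->refcount--;
}
```

Note that the order matches the updated documentation from patch 01: the nested instance lock is already held when the parent lock is taken. A caller that gets a non-NULL result owns both the lock and the reference and has to release both.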
diff --git a/net/devlink/core.c b/net/devlink/core.c
index fe9f6a0a67d5..ee26c50b4118 100644
--- a/net/devlink/core.c
+++ b/net/devlink/core.c
@@ -67,6 +67,22 @@ static void __devlink_rel_put(struct devlink_rel *rel)
devlink_rel_free(rel);
}
+struct devlink *__must_check devlink_nested_in_get_lock(struct devlink *devlink)
+{
+ devl_assert_locked(devlink);
+ if (!devlink->rel)
+ return NULL;
+ devlink = devlinks_xa_get(devlink->rel->nested_in.devlink_index);
+ if (!devlink)
+ return NULL;
+ devl_lock(devlink);
+ if (devl_is_registered(devlink))
+ return devlink;
+ devl_unlock(devlink);
+ devlink_put(devlink);
+ return NULL;
+}
+
static void devlink_rel_nested_in_notify_work(struct work_struct *work)
{
struct devlink_rel *rel = container_of(work, struct devlink_rel,
diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
index e4e48ee2da5a..36dff282f9b0 100644
--- a/net/devlink/devl_internal.h
+++ b/net/devlink/devl_internal.h
@@ -136,6 +136,10 @@ typedef void devlink_rel_notify_cb_t(struct devlink *devlink, u32 obj_index);
typedef void devlink_rel_cleanup_cb_t(struct devlink *devlink, u32 obj_index,
u32 rel_index);
+/* Returns the locked+referenced nested-in instance or NULL. */
+struct devlink *__must_check
+devlink_nested_in_get_lock(struct devlink *devlink);
+
void devlink_rel_nested_in_clear(u32 rel_index);
int devlink_rel_nested_in_add(u32 *rel_index, u32 devlink_index,
u32 obj_index, devlink_rel_notify_cb_t *notify_cb,
--
2.44.0
* [PATCH net-next V10 03/14] devlink: Migrate from info->user_ptr to info->ctx
From: Tariq Toukan @ 2026-07-01 7:32 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
netdev, Paolo Abeni
From: Cosmin Ratiu <cratiu@nvidia.com>
Replace the deprecated info->user_ptr[0] and info->user_ptr[1] with a
typed devlink_nl_ctx struct stored in info->ctx. The struct aliases the
same union memory as user_ptr, so the migration is safe.
No functional changes.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
net/devlink/dev.c | 16 ++++++++--------
net/devlink/devl_internal.h | 13 +++++++++++++
net/devlink/dpipe.c | 14 +++++++-------
net/devlink/health.c | 12 ++++++------
net/devlink/linecard.c | 4 ++--
net/devlink/netlink.c | 8 ++++----
net/devlink/param.c | 4 ++--
net/devlink/port.c | 18 +++++++++---------
net/devlink/rate.c | 8 ++++----
net/devlink/region.c | 6 +++---
net/devlink/resource.c | 14 +++++++++-----
net/devlink/sb.c | 22 +++++++++++-----------
net/devlink/trap.c | 12 ++++++------
13 files changed, 84 insertions(+), 67 deletions(-)
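The aliasing argument can be illustrated with a small userspace model. Everything below is a hypothetical mock: struct mock_genl_info only reproduces the idea of a byte buffer sharing a union with two untyped pointers, and MOCK_CTX_SIZE is an assumed value, not the kernel's constant. The sketch shows that the typed fields land in the same slots user_ptr[0] and user_ptr[1] used to occupy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MOCK_CTX_SIZE 48	/* assumed size, for illustration only */

/* Hypothetical stand-ins for the kernel types; only the layout matters. */
struct mock_devlink { int index; };
struct mock_devlink_port { int index; };

struct mock_genl_info {
	union {
		unsigned char ctx[MOCK_CTX_SIZE];
		void *user_ptr[2];
	};
};

struct mock_nl_ctx {
	struct mock_devlink *devlink;		/* old user_ptr[0] slot */
	struct mock_devlink_port *devlink_port;	/* old user_ptr[1] slot */
};

/* Compile-time checks, playing the role of BUILD_BUG_ON() in the patch. */
_Static_assert(sizeof(struct mock_nl_ctx) <= MOCK_CTX_SIZE,
	       "typed context must fit in the ctx buffer");
_Static_assert(offsetof(struct mock_nl_ctx, devlink) == 0,
	       "devlink occupies the first pointer slot");
_Static_assert(offsetof(struct mock_nl_ctx, devlink_port) == sizeof(void *),
	       "devlink_port occupies the second pointer slot");

/* Typed accessor. The cast assumes the build disables strict aliasing, as
 * the kernel's does; portable code could place the struct in the union.
 */
static struct mock_nl_ctx *mock_nl_ctx(struct mock_genl_info *info)
{
	return (struct mock_nl_ctx *)info->ctx;
}
```

The benefit over the raw pointers is that each handler names the field it wants and gets a typed pointer back, instead of relying on the convention that index 0 is the devlink and index 1 is the port.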
diff --git a/net/devlink/dev.c b/net/devlink/dev.c
index 57b2b8f03543..bcf001554e84 100644
--- a/net/devlink/dev.c
+++ b/net/devlink/dev.c
@@ -222,7 +222,7 @@ static void devlink_notify(struct devlink *devlink, enum devlink_command cmd)
int devlink_nl_get_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct sk_buff *msg;
int err;
@@ -519,7 +519,7 @@ devlink_nl_reload_actions_performed_snd(struct devlink *devlink, u32 actions_per
int devlink_nl_reload_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
enum devlink_reload_action action;
enum devlink_reload_limit limit;
struct net *dest_net = NULL;
@@ -683,7 +683,7 @@ static int devlink_nl_eswitch_fill(struct sk_buff *msg, struct devlink *devlink,
int devlink_nl_eswitch_get_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct sk_buff *msg;
int err;
@@ -704,7 +704,7 @@ int devlink_nl_eswitch_get_doit(struct sk_buff *skb, struct genl_info *info)
int devlink_nl_eswitch_set_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
const struct devlink_ops *ops = devlink->ops;
enum devlink_eswitch_encap_mode encap_mode;
u8 inline_mode;
@@ -906,7 +906,7 @@ devlink_nl_info_fill(struct sk_buff *msg, struct devlink *devlink,
int devlink_nl_info_get_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct sk_buff *msg;
int err;
@@ -1134,7 +1134,7 @@ int devlink_nl_flash_update_doit(struct sk_buff *skb, struct genl_info *info)
{
struct nlattr *nla_overwrite_mask, *nla_file_name;
struct devlink_flash_update_params params = {};
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
const char *file_name;
u32 supported_params;
int ret;
@@ -1302,7 +1302,7 @@ devlink_nl_selftests_fill(struct sk_buff *msg, struct devlink *devlink,
int devlink_nl_selftests_get_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct sk_buff *msg;
int err;
@@ -1372,7 +1372,7 @@ static const struct nla_policy devlink_selftest_nl_policy[DEVLINK_ATTR_SELFTEST_
int devlink_nl_selftests_run_doit(struct sk_buff *skb, struct genl_info *info)
{
struct nlattr *tb[DEVLINK_ATTR_SELFTEST_ID_MAX + 1];
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct nlattr *attrs, *selftests;
struct sk_buff *msg;
void *hdr;
diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
index 36dff282f9b0..52c8bf359dd4 100644
--- a/net/devlink/devl_internal.h
+++ b/net/devlink/devl_internal.h
@@ -151,6 +151,19 @@ int devlink_rel_devlink_handle_put(struct sk_buff *msg, struct devlink *devlink,
bool *msg_updated);
/* Netlink */
+struct devlink_nl_ctx {
+ struct devlink *devlink;
+ struct devlink_port *devlink_port;
+};
+
+static inline struct devlink_nl_ctx *
+devlink_nl_ctx(struct genl_info *info)
+{
+ BUILD_BUG_ON(sizeof(struct devlink_nl_ctx) >
+ sizeof_field(struct genl_info, ctx));
+ return (struct devlink_nl_ctx *)info->ctx;
+}
+
enum devlink_multicast_groups {
DEVLINK_MCGRP_CONFIG,
};
diff --git a/net/devlink/dpipe.c b/net/devlink/dpipe.c
index c8d4a4374ae1..08c7b66fc3e8 100644
--- a/net/devlink/dpipe.c
+++ b/net/devlink/dpipe.c
@@ -213,7 +213,7 @@ static int devlink_dpipe_tables_fill(struct genl_info *info,
struct list_head *dpipe_tables,
const char *table_name)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_dpipe_table *table;
struct nlattr *tables_attr;
struct sk_buff *skb = NULL;
@@ -290,7 +290,7 @@ static int devlink_dpipe_tables_fill(struct genl_info *info,
int devlink_nl_dpipe_table_get_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
const char *table_name = NULL;
if (info->attrs[DEVLINK_ATTR_DPIPE_TABLE_NAME])
@@ -478,7 +478,7 @@ int devlink_dpipe_entry_ctx_prepare(struct devlink_dpipe_dump_ctx *dump_ctx)
if (!dump_ctx->hdr)
goto nla_put_failure;
- devlink = dump_ctx->info->user_ptr[0];
+ devlink = devlink_nl_ctx(dump_ctx->info)->devlink;
if (devlink_nl_put_handle(dump_ctx->skb, devlink))
goto nla_put_failure;
dump_ctx->nest = nla_nest_start_noflag(dump_ctx->skb,
@@ -563,7 +563,7 @@ static int devlink_dpipe_entries_fill(struct genl_info *info,
int devlink_nl_dpipe_entries_get_doit(struct sk_buff *skb,
struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_dpipe_table *table;
const char *table_name;
@@ -650,7 +650,7 @@ static int devlink_dpipe_headers_fill(struct genl_info *info,
struct devlink_dpipe_headers *
dpipe_headers)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct nlattr *headers_attr;
struct sk_buff *skb = NULL;
struct nlmsghdr *nlh;
@@ -713,7 +713,7 @@ static int devlink_dpipe_headers_fill(struct genl_info *info,
int devlink_nl_dpipe_headers_get_doit(struct sk_buff *skb,
struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
if (!devlink->dpipe_headers)
return -EOPNOTSUPP;
@@ -747,7 +747,7 @@ static int devlink_dpipe_table_counters_set(struct devlink *devlink,
int devlink_nl_dpipe_table_counters_set_doit(struct sk_buff *skb,
struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
const char *table_name;
bool counters_enable;
diff --git a/net/devlink/health.c b/net/devlink/health.c
index ea7a334e939b..8ce6cd399cb7 100644
--- a/net/devlink/health.c
+++ b/net/devlink/health.c
@@ -358,7 +358,7 @@ devlink_health_reporter_get_from_info(struct devlink *devlink,
int devlink_nl_health_reporter_get_doit(struct sk_buff *skb,
struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_health_reporter *reporter;
struct sk_buff *msg;
int err;
@@ -456,7 +456,7 @@ int devlink_nl_health_reporter_get_dumpit(struct sk_buff *skb,
int devlink_nl_health_reporter_set_doit(struct sk_buff *skb,
struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_health_reporter *reporter;
reporter = devlink_health_reporter_get_from_info(devlink, info);
@@ -715,7 +715,7 @@ EXPORT_SYMBOL_GPL(devlink_health_reporter_state_update);
int devlink_nl_health_reporter_recover_doit(struct sk_buff *skb,
struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_health_reporter *reporter;
reporter = devlink_health_reporter_get_from_info(devlink, info);
@@ -1157,7 +1157,7 @@ static int devlink_fmsg_dumpit(struct devlink_fmsg *fmsg, struct sk_buff *skb,
int devlink_nl_health_reporter_diagnose_doit(struct sk_buff *skb,
struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_health_reporter *reporter;
struct devlink_fmsg *fmsg;
int err;
@@ -1252,7 +1252,7 @@ int devlink_nl_health_reporter_dump_get_dumpit(struct sk_buff *skb,
int devlink_nl_health_reporter_dump_clear_doit(struct sk_buff *skb,
struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_health_reporter *reporter;
reporter = devlink_health_reporter_get_from_info(devlink, info);
@@ -1269,7 +1269,7 @@ int devlink_nl_health_reporter_dump_clear_doit(struct sk_buff *skb,
int devlink_nl_health_reporter_test_doit(struct sk_buff *skb,
struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_health_reporter *reporter;
reporter = devlink_health_reporter_get_from_info(devlink, info);
diff --git a/net/devlink/linecard.c b/net/devlink/linecard.c
index 8315d35cb91d..fd18f2759770 100644
--- a/net/devlink/linecard.c
+++ b/net/devlink/linecard.c
@@ -171,7 +171,7 @@ void devlink_linecards_notify_unregister(struct devlink *devlink)
int devlink_nl_linecard_get_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_linecard *linecard;
struct sk_buff *msg;
int err;
@@ -371,7 +371,7 @@ static int devlink_linecard_type_unset(struct devlink_linecard *linecard,
int devlink_nl_linecard_set_doit(struct sk_buff *skb, struct genl_info *info)
{
struct netlink_ext_ack *extack = info->extack;
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_linecard *linecard;
int err;
diff --git a/net/devlink/netlink.c b/net/devlink/netlink.c
index ae4afc739678..f0a857e286bc 100644
--- a/net/devlink/netlink.c
+++ b/net/devlink/netlink.c
@@ -252,18 +252,18 @@ static int __devlink_nl_pre_doit(struct sk_buff *skb, struct genl_info *info,
if (IS_ERR(devlink))
return PTR_ERR(devlink);
- info->user_ptr[0] = devlink;
+ devlink_nl_ctx(info)->devlink = devlink;
if (flags & DEVLINK_NL_FLAG_NEED_PORT) {
devlink_port = devlink_port_get_from_info(devlink, info);
if (IS_ERR(devlink_port)) {
err = PTR_ERR(devlink_port);
goto unlock;
}
- info->user_ptr[1] = devlink_port;
+ devlink_nl_ctx(info)->devlink_port = devlink_port;
} else if (flags & DEVLINK_NL_FLAG_NEED_DEVLINK_OR_PORT) {
devlink_port = devlink_port_get_from_info(devlink, info);
if (!IS_ERR(devlink_port))
- info->user_ptr[1] = devlink_port;
+ devlink_nl_ctx(info)->devlink_port = devlink_port;
}
return 0;
@@ -304,7 +304,7 @@ static void __devlink_nl_post_doit(struct sk_buff *skb, struct genl_info *info,
bool dev_lock = flags & DEVLINK_NL_FLAG_NEED_DEV_LOCK;
struct devlink *devlink;
- devlink = info->user_ptr[0];
+ devlink = devlink_nl_ctx(info)->devlink;
devl_dev_unlock(devlink, dev_lock);
devlink_put(devlink);
}
diff --git a/net/devlink/param.c b/net/devlink/param.c
index 3e9d2e5750c2..1cc562a6ebfd 100644
--- a/net/devlink/param.c
+++ b/net/devlink/param.c
@@ -627,7 +627,7 @@ devlink_param_get_from_info(struct xarray *params, struct genl_info *info)
int devlink_nl_param_get_doit(struct sk_buff *skb,
struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_param_item *param_item;
struct sk_buff *msg;
int err;
@@ -728,7 +728,7 @@ static int __devlink_nl_cmd_param_set_doit(struct devlink *devlink,
int devlink_nl_param_set_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
return __devlink_nl_cmd_param_set_doit(devlink, 0, &devlink->params,
info, DEVLINK_CMD_PARAM_NEW);
diff --git a/net/devlink/port.c b/net/devlink/port.c
index 485029d43428..c268afefaed7 100644
--- a/net/devlink/port.c
+++ b/net/devlink/port.c
@@ -594,7 +594,7 @@ void devlink_ports_notify_unregister(struct devlink *devlink)
int devlink_nl_port_get_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink_port *devlink_port = info->user_ptr[1];
+ struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
struct sk_buff *msg;
int err;
@@ -830,7 +830,7 @@ static int devlink_port_function_set(struct devlink_port *port,
int devlink_nl_port_set_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink_port *devlink_port = info->user_ptr[1];
+ struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
int err;
if (info->attrs[DEVLINK_ATTR_PORT_TYPE]) {
@@ -856,8 +856,8 @@ int devlink_nl_port_set_doit(struct sk_buff *skb, struct genl_info *info)
int devlink_nl_port_split_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink_port *devlink_port = info->user_ptr[1];
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
u32 count;
if (GENL_REQ_ATTR_CHECK(info, DEVLINK_ATTR_PORT_SPLIT_COUNT))
@@ -887,8 +887,8 @@ int devlink_nl_port_split_doit(struct sk_buff *skb, struct genl_info *info)
int devlink_nl_port_unsplit_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink_port *devlink_port = info->user_ptr[1];
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
if (!devlink_port->ops->port_unsplit)
return -EOPNOTSUPP;
@@ -899,7 +899,7 @@ int devlink_nl_port_new_doit(struct sk_buff *skb, struct genl_info *info)
{
struct netlink_ext_ack *extack = info->extack;
struct devlink_port_new_attrs new_attrs = {};
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_port *devlink_port;
struct sk_buff *msg;
int err;
@@ -961,9 +961,9 @@ int devlink_nl_port_new_doit(struct sk_buff *skb, struct genl_info *info)
int devlink_nl_port_del_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink_port *devlink_port = info->user_ptr[1];
+ struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
struct netlink_ext_ack *extack = info->extack;
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
if (!devlink_port->ops->port_del)
return -EOPNOTSUPP;
diff --git a/net/devlink/rate.c b/net/devlink/rate.c
index 533d21b028a7..630441e429b3 100644
--- a/net/devlink/rate.c
+++ b/net/devlink/rate.c
@@ -239,7 +239,7 @@ int devlink_nl_rate_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
int devlink_nl_rate_get_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_rate *devlink_rate;
struct sk_buff *msg;
int err;
@@ -588,7 +588,7 @@ static bool devlink_rate_set_ops_supported(const struct devlink_ops *ops,
int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_rate *devlink_rate;
const struct devlink_ops *ops;
int err;
@@ -610,7 +610,7 @@ int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)
int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_rate *rate_node;
const struct devlink_ops *ops;
int err;
@@ -666,7 +666,7 @@ int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
int devlink_nl_rate_del_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_rate *rate_node;
int err;
diff --git a/net/devlink/region.c b/net/devlink/region.c
index 5588e3d560b9..537779bbff07 100644
--- a/net/devlink/region.c
+++ b/net/devlink/region.c
@@ -469,7 +469,7 @@ static void devlink_region_snapshot_del(struct devlink_region *region,
int devlink_nl_region_get_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_port *port = NULL;
struct devlink_region *region;
const char *region_name;
@@ -588,7 +588,7 @@ int devlink_nl_region_get_dumpit(struct sk_buff *skb,
int devlink_nl_region_del_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_snapshot *snapshot;
struct devlink_port *port = NULL;
struct devlink_region *region;
@@ -633,7 +633,7 @@ int devlink_nl_region_del_doit(struct sk_buff *skb, struct genl_info *info)
int devlink_nl_region_new_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_snapshot *snapshot;
struct devlink_port *port = NULL;
struct nlattr *snapshot_id_attr;
diff --git a/net/devlink/resource.c b/net/devlink/resource.c
index 574108ccfe5d..c3cfda7ea070 100644
--- a/net/devlink/resource.c
+++ b/net/devlink/resource.c
@@ -117,7 +117,7 @@ devlink_resource_validate_size(struct devlink_resource *resource, u64 size,
int devlink_nl_resource_set_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_resource *resource;
u64 resource_id;
u64 size;
@@ -251,8 +251,9 @@ static int devlink_resource_list_fill(struct sk_buff *skb,
static int devlink_resource_fill(struct genl_info *info,
enum devlink_command cmd, int flags)
{
- struct devlink_port *devlink_port = info->user_ptr[1];
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink_nl_ctx *ctx = devlink_nl_ctx(info);
+ struct devlink *devlink = ctx->devlink;
+ struct devlink_port *devlink_port;
struct devlink_resource *resource;
struct list_head *resource_list;
struct nlattr *resources_attr;
@@ -263,6 +264,7 @@ static int devlink_resource_fill(struct genl_info *info,
int i;
int err;
+ devlink_port = ctx->devlink_port;
resource_list = devlink_port ?
&devlink_port->resource_list : &devlink->resource_list;
resource = list_first_entry(resource_list,
@@ -326,10 +328,12 @@ static int devlink_resource_fill(struct genl_info *info,
int devlink_nl_resource_dump_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink_port *devlink_port = info->user_ptr[1];
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink_nl_ctx *ctx = devlink_nl_ctx(info);
+ struct devlink *devlink = ctx->devlink;
+ struct devlink_port *devlink_port;
struct list_head *resource_list;
+ devlink_port = ctx->devlink_port;
if (info->attrs[DEVLINK_ATTR_PORT_INDEX] && !devlink_port)
return -ENODEV;
diff --git a/net/devlink/sb.c b/net/devlink/sb.c
index 49fcbfe08f15..129bd016e302 100644
--- a/net/devlink/sb.c
+++ b/net/devlink/sb.c
@@ -204,7 +204,7 @@ static int devlink_nl_sb_fill(struct sk_buff *msg, struct devlink *devlink,
int devlink_nl_sb_get_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_sb *devlink_sb;
struct sk_buff *msg;
int err;
@@ -306,7 +306,7 @@ static int devlink_nl_sb_pool_fill(struct sk_buff *msg, struct devlink *devlink,
int devlink_nl_sb_pool_get_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_sb *devlink_sb;
struct sk_buff *msg;
u16 pool_index;
@@ -415,7 +415,7 @@ static int devlink_sb_pool_set(struct devlink *devlink, unsigned int sb_index,
int devlink_nl_sb_pool_set_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
enum devlink_sb_threshold_type threshold_type;
struct devlink_sb *devlink_sb;
u16 pool_index;
@@ -506,7 +506,7 @@ static int devlink_nl_sb_port_pool_fill(struct sk_buff *msg,
int devlink_nl_sb_port_pool_get_doit(struct sk_buff *skb,
struct genl_info *info)
{
- struct devlink_port *devlink_port = info->user_ptr[1];
+ struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
struct devlink *devlink = devlink_port->devlink;
struct devlink_sb *devlink_sb;
struct sk_buff *msg;
@@ -624,8 +624,8 @@ static int devlink_sb_port_pool_set(struct devlink_port *devlink_port,
int devlink_nl_sb_port_pool_set_doit(struct sk_buff *skb,
struct genl_info *info)
{
- struct devlink_port *devlink_port = info->user_ptr[1];
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_sb *devlink_sb;
u16 pool_index;
u32 threshold;
@@ -716,7 +716,7 @@ devlink_nl_sb_tc_pool_bind_fill(struct sk_buff *msg, struct devlink *devlink,
int devlink_nl_sb_tc_pool_bind_get_doit(struct sk_buff *skb,
struct genl_info *info)
{
- struct devlink_port *devlink_port = info->user_ptr[1];
+ struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
struct devlink *devlink = devlink_port->devlink;
struct devlink_sb *devlink_sb;
struct sk_buff *msg;
@@ -864,8 +864,8 @@ static int devlink_sb_tc_pool_bind_set(struct devlink_port *devlink_port,
int devlink_nl_sb_tc_pool_bind_set_doit(struct sk_buff *skb,
struct genl_info *info)
{
- struct devlink_port *devlink_port = info->user_ptr[1];
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
enum devlink_sb_pool_type pool_type;
struct devlink_sb *devlink_sb;
u16 tc_index;
@@ -902,7 +902,7 @@ int devlink_nl_sb_tc_pool_bind_set_doit(struct sk_buff *skb,
int devlink_nl_sb_occ_snapshot_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
const struct devlink_ops *ops = devlink->ops;
struct devlink_sb *devlink_sb;
@@ -918,7 +918,7 @@ int devlink_nl_sb_occ_snapshot_doit(struct sk_buff *skb, struct genl_info *info)
int devlink_nl_sb_occ_max_clear_doit(struct sk_buff *skb,
struct genl_info *info)
{
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
const struct devlink_ops *ops = devlink->ops;
struct devlink_sb *devlink_sb;
diff --git a/net/devlink/trap.c b/net/devlink/trap.c
index 8edb31654a68..793ffc66dc11 100644
--- a/net/devlink/trap.c
+++ b/net/devlink/trap.c
@@ -302,7 +302,7 @@ static int devlink_nl_trap_fill(struct sk_buff *msg, struct devlink *devlink,
int devlink_nl_trap_get_doit(struct sk_buff *skb, struct genl_info *info)
{
struct netlink_ext_ack *extack = info->extack;
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_trap_item *trap_item;
struct sk_buff *msg;
int err;
@@ -412,7 +412,7 @@ static int devlink_trap_action_set(struct devlink *devlink,
int devlink_nl_trap_set_doit(struct sk_buff *skb, struct genl_info *info)
{
struct netlink_ext_ack *extack = info->extack;
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_trap_item *trap_item;
if (list_empty(&devlink->trap_list))
@@ -511,7 +511,7 @@ devlink_nl_trap_group_fill(struct sk_buff *msg, struct devlink *devlink,
int devlink_nl_trap_group_get_doit(struct sk_buff *skb, struct genl_info *info)
{
struct netlink_ext_ack *extack = info->extack;
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_trap_group_item *group_item;
struct sk_buff *msg;
int err;
@@ -682,7 +682,7 @@ static int devlink_trap_group_set(struct devlink *devlink,
int devlink_nl_trap_group_set_doit(struct sk_buff *skb, struct genl_info *info)
{
struct netlink_ext_ack *extack = info->extack;
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_trap_group_item *group_item;
bool modified = false;
int err;
@@ -804,7 +804,7 @@ int devlink_nl_trap_policer_get_doit(struct sk_buff *skb,
{
struct devlink_trap_policer_item *policer_item;
struct netlink_ext_ack *extack = info->extack;
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
struct sk_buff *msg;
int err;
@@ -924,7 +924,7 @@ int devlink_nl_trap_policer_set_doit(struct sk_buff *skb,
{
struct devlink_trap_policer_item *policer_item;
struct netlink_ext_ack *extack = info->extack;
- struct devlink *devlink = info->user_ptr[0];
+ struct devlink *devlink = devlink_nl_ctx(info)->devlink;
if (list_empty(&devlink->trap_policer_list))
return -EOPNOTSUPP;
--
2.44.0
* [PATCH net-next V10 04/14] devlink: Decouple rate storage from associated devlink object
From: Tariq Toukan @ 2026-07-01 7:32 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
netdev, Paolo Abeni
From: Cosmin Ratiu <cratiu@nvidia.com>
Devlink rate leafs and nodes were stored in their respective devlink
objects pointed to by devlink_rate->devlink.
This patch removes that association by introducing the concept of
'rate node devlink', which is where all rates that could link to each
other are stored. For now this is the same as devlink_rate->devlink.
After this patch, the devlink rates stored in this devlink instance
could potentially be from multiple other devlink instances. So all rate
node manipulation code was updated to:
- compare against the actual owning devlink object during iteration.
- acquire additional locks where needed (a no-op for now).
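
The ownership filter described above can be sketched in plain userspace C.
This is only an illustration under stated assumptions: 'struct dl',
'struct rate', 'struct rate_store' and both helpers are hypothetical
stand-ins, not the kernel structures or functions touched by this patch.
It shows that once several instances share one list, a lookup by name
must also match the owner, or two instances using the same node name
would collide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a devlink instance. */
struct dl {
	const char *name;
};

/* Hypothetical stand-in for a rate object kept in shared storage. */
struct rate {
	struct dl *owner;	/* plays the role of devlink_rate->devlink */
	int is_node;		/* 1 for a node, 0 for a leaf */
	const char *name;
	struct rate *next;	/* link in the shared list */
};

/* Shared storage, playing the role of rate_devlink->rate_list. */
struct rate_store {
	struct rate *head;
};

/* Add a rate object to the shared list; every object must have an owner. */
static void rate_store_add(struct rate_store *s, struct rate *r)
{
	assert(r->owner);
	r->next = s->head;
	s->head = r;
}

/* Find a node by name, considering only entries owned by 'owner'. */
static struct rate *rate_node_find(struct rate_store *s, struct dl *owner,
				   const char *name)
{
	struct rate *r;

	for (r = s->head; r; r = r->next)
		if (r->owner == owner && r->is_node && !strcmp(r->name, name))
			return r;
	return NULL;
}
```

With two owners that each register a node called "group1", the lookup
returns the entry belonging to the owner passed in, and a leaf is never
returned even when its name matches.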
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
net/devlink/rate.c | 249 ++++++++++++++++++++++++++++++++-------------
1 file changed, 177 insertions(+), 72 deletions(-)
diff --git a/net/devlink/rate.c b/net/devlink/rate.c
index 630441e429b3..295f4185fdfd 100644
--- a/net/devlink/rate.c
+++ b/net/devlink/rate.c
@@ -30,13 +30,25 @@ devlink_rate_leaf_get_from_info(struct devlink *devlink, struct genl_info *info)
return devlink_rate ?: ERR_PTR(-ENODEV);
}
+static struct devlink *devl_rate_lock(struct devlink *devlink)
+{
+ return devlink;
+}
+
+static void devl_rate_unlock(struct devlink *devlink,
+ struct devlink *rate_devlink)
+{
+}
+
static struct devlink_rate *
-devlink_rate_node_get_by_name(struct devlink *devlink, const char *node_name)
+devlink_rate_node_get_by_name(struct devlink *rate_devlink,
+ struct devlink *devlink, const char *node_name)
{
struct devlink_rate *devlink_rate;
- list_for_each_entry(devlink_rate, &devlink->rate_list, list) {
- if (devlink_rate_is_node(devlink_rate) &&
+ list_for_each_entry(devlink_rate, &rate_devlink->rate_list, list) {
+ if (devlink_rate->devlink == devlink &&
+ devlink_rate_is_node(devlink_rate) &&
!strcmp(node_name, devlink_rate->name))
return devlink_rate;
}
@@ -44,7 +56,8 @@ devlink_rate_node_get_by_name(struct devlink *devlink, const char *node_name)
}
static struct devlink_rate *
-devlink_rate_node_get_from_attrs(struct devlink *devlink, struct nlattr **attrs)
+devlink_rate_node_get_from_attrs(struct devlink *rate_devlink,
+ struct devlink *devlink, struct nlattr **attrs)
{
const char *rate_node_name;
size_t len;
@@ -57,24 +70,30 @@ devlink_rate_node_get_from_attrs(struct devlink *devlink, struct nlattr **attrs)
if (!len || strspn(rate_node_name, "0123456789") == len)
return ERR_PTR(-EINVAL);
- return devlink_rate_node_get_by_name(devlink, rate_node_name);
+ return devlink_rate_node_get_by_name(rate_devlink, devlink,
+ rate_node_name);
}
static struct devlink_rate *
-devlink_rate_node_get_from_info(struct devlink *devlink, struct genl_info *info)
+devlink_rate_node_get_from_info(struct devlink *rate_devlink,
+ struct devlink *devlink,
+ struct genl_info *info)
{
- return devlink_rate_node_get_from_attrs(devlink, info->attrs);
+ return devlink_rate_node_get_from_attrs(rate_devlink, devlink,
+ info->attrs);
}
static struct devlink_rate *
-devlink_rate_get_from_info(struct devlink *devlink, struct genl_info *info)
+devlink_rate_get_from_info(struct devlink *rate_devlink,
+ struct devlink *devlink, struct genl_info *info)
{
struct nlattr **attrs = info->attrs;
if (attrs[DEVLINK_ATTR_PORT_INDEX])
return devlink_rate_leaf_get_from_info(devlink, info);
else if (attrs[DEVLINK_ATTR_RATE_NODE_NAME])
- return devlink_rate_node_get_from_info(devlink, info);
+ return devlink_rate_node_get_from_info(rate_devlink, devlink,
+ info);
else
return ERR_PTR(-EINVAL);
}
@@ -190,17 +209,25 @@ static void devlink_rate_notify(struct devlink_rate *devlink_rate,
void devlink_rates_notify_register(struct devlink *devlink)
{
struct devlink_rate *rate_node;
+ struct devlink *rate_devlink;
- list_for_each_entry(rate_node, &devlink->rate_list, list)
- devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
+ rate_devlink = devl_rate_lock(devlink);
+ list_for_each_entry(rate_node, &rate_devlink->rate_list, list)
+ if (rate_node->devlink == devlink)
+ devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
+ devl_rate_unlock(devlink, rate_devlink);
}
void devlink_rates_notify_unregister(struct devlink *devlink)
{
struct devlink_rate *rate_node;
+ struct devlink *rate_devlink;
- list_for_each_entry_reverse(rate_node, &devlink->rate_list, list)
- devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_DEL);
+ rate_devlink = devl_rate_lock(devlink);
+ list_for_each_entry_reverse(rate_node, &rate_devlink->rate_list, list)
+ if (rate_node->devlink == devlink)
+ devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_DEL);
+ devl_rate_unlock(devlink, rate_devlink);
}
static int
@@ -209,17 +236,20 @@ devlink_nl_rate_get_dump_one(struct sk_buff *msg, struct devlink *devlink,
{
struct devlink_nl_dump_state *state = devlink_dump_state(cb);
struct devlink_rate *devlink_rate;
+ struct devlink *rate_devlink;
int idx = 0;
int err = 0;
- list_for_each_entry(devlink_rate, &devlink->rate_list, list) {
+ rate_devlink = devl_rate_lock(devlink);
+ list_for_each_entry(devlink_rate, &rate_devlink->rate_list, list) {
enum devlink_command cmd = DEVLINK_CMD_RATE_NEW;
u32 id = NETLINK_CB(cb->skb).portid;
- if (idx < state->idx) {
+ if (idx < state->idx || devlink_rate->devlink != devlink) {
idx++;
continue;
}
+
err = devlink_nl_rate_fill(msg, devlink_rate, cmd, id,
cb->nlh->nlmsg_seq, flags, NULL);
if (err) {
@@ -228,6 +258,7 @@ devlink_nl_rate_get_dump_one(struct sk_buff *msg, struct devlink *devlink,
}
idx++;
}
+ devl_rate_unlock(devlink, rate_devlink);
return err;
}
@@ -239,28 +270,38 @@ int devlink_nl_rate_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
int devlink_nl_rate_get_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = devlink_nl_ctx(info)->devlink;
+ struct devlink *rate_devlink, *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_rate *devlink_rate;
struct sk_buff *msg;
int err;
- devlink_rate = devlink_rate_get_from_info(devlink, info);
- if (IS_ERR(devlink_rate))
- return PTR_ERR(devlink_rate);
+ rate_devlink = devl_rate_lock(devlink);
+ devlink_rate = devlink_rate_get_from_info(rate_devlink, devlink, info);
+ if (IS_ERR(devlink_rate)) {
+ err = PTR_ERR(devlink_rate);
+ goto unlock;
+ }
msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
- if (!msg)
- return -ENOMEM;
+ if (!msg) {
+ err = -ENOMEM;
+ goto unlock;
+ }
err = devlink_nl_rate_fill(msg, devlink_rate, DEVLINK_CMD_RATE_NEW,
info->snd_portid, info->snd_seq, 0,
info->extack);
- if (err) {
- nlmsg_free(msg);
- return err;
- }
+ if (err)
+ goto err_fill;
+ devl_rate_unlock(devlink, rate_devlink);
return genlmsg_reply(msg, info);
+
+err_fill:
+ nlmsg_free(msg);
+unlock:
+ devl_rate_unlock(devlink, rate_devlink);
+ return err;
}
static bool
@@ -277,6 +318,7 @@ devlink_rate_is_parent_node(struct devlink_rate *devlink_rate,
static int
devlink_nl_rate_parent_node_set(struct devlink_rate *devlink_rate,
+ struct devlink *rate_devlink,
struct genl_info *info,
struct nlattr *nla_parent)
{
@@ -304,7 +346,8 @@ devlink_nl_rate_parent_node_set(struct devlink_rate *devlink_rate,
refcount_dec(&parent->refcnt);
devlink_rate->parent = NULL;
} else if (len) {
- parent = devlink_rate_node_get_by_name(devlink, parent_name);
+ parent = devlink_rate_node_get_by_name(rate_devlink, devlink,
+ parent_name);
if (IS_ERR(parent))
return -ENODEV;
@@ -423,6 +466,7 @@ static int devlink_nl_rate_tc_bw_set(struct devlink_rate *devlink_rate,
}
static int devlink_nl_rate_set(struct devlink_rate *devlink_rate,
+ struct devlink *rate_devlink,
const struct devlink_ops *ops,
struct genl_info *info)
{
@@ -497,7 +541,8 @@ static int devlink_nl_rate_set(struct devlink_rate *devlink_rate,
*/
nla_parent = attrs[DEVLINK_ATTR_RATE_PARENT_NODE_NAME];
if (nla_parent) {
- err = devlink_nl_rate_parent_node_set(devlink_rate, info,
+ err = devlink_nl_rate_parent_node_set(devlink_rate,
+ rate_devlink, info,
nla_parent);
if (err)
return err;
@@ -588,29 +633,37 @@ static bool devlink_rate_set_ops_supported(const struct devlink_ops *ops,
int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = devlink_nl_ctx(info)->devlink;
+ struct devlink *rate_devlink, *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_rate *devlink_rate;
const struct devlink_ops *ops;
int err;
- devlink_rate = devlink_rate_get_from_info(devlink, info);
- if (IS_ERR(devlink_rate))
- return PTR_ERR(devlink_rate);
+ rate_devlink = devl_rate_lock(devlink);
+ devlink_rate = devlink_rate_get_from_info(rate_devlink, devlink, info);
+ if (IS_ERR(devlink_rate)) {
+ err = PTR_ERR(devlink_rate);
+ goto unlock;
+ }
ops = devlink->ops;
- if (!ops || !devlink_rate_set_ops_supported(ops, info, devlink_rate->type))
- return -EOPNOTSUPP;
+ if (!ops ||
+ !devlink_rate_set_ops_supported(ops, info, devlink_rate->type)) {
+ err = -EOPNOTSUPP;
+ goto unlock;
+ }
- err = devlink_nl_rate_set(devlink_rate, ops, info);
+ err = devlink_nl_rate_set(devlink_rate, rate_devlink, ops, info);
if (!err)
devlink_rate_notify(devlink_rate, DEVLINK_CMD_RATE_NEW);
+unlock:
+ devl_rate_unlock(devlink, rate_devlink);
return err;
}
int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = devlink_nl_ctx(info)->devlink;
+ struct devlink *rate_devlink, *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_rate *rate_node;
const struct devlink_ops *ops;
int err;
@@ -624,15 +677,22 @@ int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
if (!devlink_rate_set_ops_supported(ops, info, DEVLINK_RATE_TYPE_NODE))
return -EOPNOTSUPP;
- rate_node = devlink_rate_node_get_from_attrs(devlink, info->attrs);
- if (!IS_ERR(rate_node))
- return -EEXIST;
- else if (rate_node == ERR_PTR(-EINVAL))
- return -EINVAL;
+ rate_devlink = devl_rate_lock(devlink);
+ rate_node = devlink_rate_node_get_from_attrs(rate_devlink, devlink,
+ info->attrs);
+ if (!IS_ERR(rate_node)) {
+ err = -EEXIST;
+ goto unlock;
+ } else if (rate_node == ERR_PTR(-EINVAL)) {
+ err = -EINVAL;
+ goto unlock;
+ }
rate_node = kzalloc_obj(*rate_node);
- if (!rate_node)
- return -ENOMEM;
+ if (!rate_node) {
+ err = -ENOMEM;
+ goto unlock;
+ }
rate_node->devlink = devlink;
rate_node->type = DEVLINK_RATE_TYPE_NODE;
@@ -646,13 +706,14 @@ int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
if (err)
goto err_node_new;
- err = devlink_nl_rate_set(rate_node, ops, info);
+ err = devlink_nl_rate_set(rate_node, rate_devlink, ops, info);
if (err)
goto err_rate_set;
refcount_set(&rate_node->refcnt, 1);
- list_add(&rate_node->list, &devlink->rate_list);
+ list_add(&rate_node->list, &rate_devlink->rate_list);
devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
+ devl_rate_unlock(devlink, rate_devlink);
return 0;
err_rate_set:
@@ -661,22 +722,29 @@ int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
kfree(rate_node->name);
err_strdup:
kfree(rate_node);
+unlock:
+ devl_rate_unlock(devlink, rate_devlink);
return err;
}
int devlink_nl_rate_del_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *devlink = devlink_nl_ctx(info)->devlink;
+ struct devlink *rate_devlink, *devlink = devlink_nl_ctx(info)->devlink;
struct devlink_rate *rate_node;
int err;
- rate_node = devlink_rate_node_get_from_info(devlink, info);
- if (IS_ERR(rate_node))
- return PTR_ERR(rate_node);
+ rate_devlink = devl_rate_lock(devlink);
+ rate_node = devlink_rate_node_get_from_info(rate_devlink, devlink,
+ info);
+ if (IS_ERR(rate_node)) {
+ err = PTR_ERR(rate_node);
+ goto unlock;
+ }
if (refcount_read(&rate_node->refcnt) > 1) {
NL_SET_ERR_MSG(info->extack, "Node has children. Cannot delete node.");
- return -EBUSY;
+ err = -EBUSY;
+ goto unlock;
}
devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_DEL);
@@ -687,6 +755,8 @@ int devlink_nl_rate_del_doit(struct sk_buff *skb, struct genl_info *info)
list_del(&rate_node->list);
kfree(rate_node->name);
kfree(rate_node);
+unlock:
+ devl_rate_unlock(devlink, rate_devlink);
return err;
}
@@ -695,14 +765,20 @@ int devlink_rates_check(struct devlink *devlink,
struct netlink_ext_ack *extack)
{
struct devlink_rate *devlink_rate;
+ struct devlink *rate_devlink;
+ int err = 0;
- list_for_each_entry(devlink_rate, &devlink->rate_list, list)
- if (!rate_filter || rate_filter(devlink_rate)) {
+ rate_devlink = devl_rate_lock(devlink);
+ list_for_each_entry(devlink_rate, &rate_devlink->rate_list, list)
+ if (devlink_rate->devlink == devlink &&
+ (!rate_filter || rate_filter(devlink_rate))) {
if (extack)
NL_SET_ERR_MSG(extack, "Rate node(s) exists.");
- return -EBUSY;
+ err = -EBUSY;
+ break;
}
- return 0;
+ devl_rate_unlock(devlink, rate_devlink);
+ return err;
}
/**
@@ -719,14 +795,21 @@ devl_rate_node_create(struct devlink *devlink, void *priv, char *node_name,
struct devlink_rate *parent)
{
struct devlink_rate *rate_node;
-
- rate_node = devlink_rate_node_get_by_name(devlink, node_name);
- if (!IS_ERR(rate_node))
- return ERR_PTR(-EEXIST);
+ struct devlink *rate_devlink;
+
+ rate_devlink = devl_rate_lock(devlink);
+ rate_node = devlink_rate_node_get_by_name(rate_devlink, devlink,
+ node_name);
+ if (!IS_ERR(rate_node)) {
+ rate_node = ERR_PTR(-EEXIST);
+ goto unlock;
+ }
rate_node = kzalloc_obj(*rate_node);
- if (!rate_node)
- return ERR_PTR(-ENOMEM);
+ if (!rate_node) {
+ rate_node = ERR_PTR(-ENOMEM);
+ goto unlock;
+ }
rate_node->type = DEVLINK_RATE_TYPE_NODE;
rate_node->devlink = devlink;
@@ -735,7 +818,8 @@ devl_rate_node_create(struct devlink *devlink, void *priv, char *node_name,
rate_node->name = kstrdup(node_name, GFP_KERNEL);
if (!rate_node->name) {
kfree(rate_node);
- return ERR_PTR(-ENOMEM);
+ rate_node = ERR_PTR(-ENOMEM);
+ goto unlock;
}
if (parent) {
@@ -744,8 +828,10 @@ devl_rate_node_create(struct devlink *devlink, void *priv, char *node_name,
}
refcount_set(&rate_node->refcnt, 1);
- list_add(&rate_node->list, &devlink->rate_list);
+ list_add(&rate_node->list, &rate_devlink->rate_list);
devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
+unlock:
+ devl_rate_unlock(devlink, rate_devlink);
return rate_node;
}
EXPORT_SYMBOL_GPL(devl_rate_node_create);
@@ -761,10 +847,10 @@ EXPORT_SYMBOL_GPL(devl_rate_node_create);
int devl_rate_leaf_create(struct devlink_port *devlink_port, void *priv,
struct devlink_rate *parent)
{
- struct devlink *devlink = devlink_port->devlink;
+ struct devlink *rate_devlink, *devlink = devlink_port->devlink;
struct devlink_rate *devlink_rate;
- devl_assert_locked(devlink_port->devlink);
+ devl_assert_locked(devlink);
if (WARN_ON(devlink_port->devlink_rate))
return -EBUSY;
@@ -773,6 +859,7 @@ int devl_rate_leaf_create(struct devlink_port *devlink_port, void *priv,
if (!devlink_rate)
return -ENOMEM;
+ rate_devlink = devl_rate_lock(devlink);
if (parent) {
devlink_rate->parent = parent;
refcount_inc(&devlink_rate->parent->refcnt);
@@ -782,9 +869,10 @@ int devl_rate_leaf_create(struct devlink_port *devlink_port, void *priv,
devlink_rate->devlink = devlink;
devlink_rate->devlink_port = devlink_port;
devlink_rate->priv = priv;
- list_add_tail(&devlink_rate->list, &devlink->rate_list);
+ list_add_tail(&devlink_rate->list, &rate_devlink->rate_list);
devlink_port->devlink_rate = devlink_rate;
devlink_rate_notify(devlink_rate, DEVLINK_CMD_RATE_NEW);
+ devl_rate_unlock(devlink, rate_devlink);
return 0;
}
@@ -800,16 +888,19 @@ EXPORT_SYMBOL_GPL(devl_rate_leaf_create);
void devl_rate_leaf_destroy(struct devlink_port *devlink_port)
{
struct devlink_rate *devlink_rate = devlink_port->devlink_rate;
+ struct devlink *rate_devlink, *devlink = devlink_port->devlink;
- devl_assert_locked(devlink_port->devlink);
+ devl_assert_locked(devlink);
if (!devlink_rate)
return;
+ rate_devlink = devl_rate_lock(devlink);
devlink_rate_notify(devlink_rate, DEVLINK_CMD_RATE_DEL);
if (devlink_rate->parent)
refcount_dec(&devlink_rate->parent->refcnt);
list_del(&devlink_rate->list);
devlink_port->devlink_rate = NULL;
+ devl_rate_unlock(devlink, rate_devlink);
kfree(devlink_rate);
}
EXPORT_SYMBOL_GPL(devl_rate_leaf_destroy);
@@ -818,20 +909,30 @@ EXPORT_SYMBOL_GPL(devl_rate_leaf_destroy);
* devl_rate_nodes_destroy - destroy all devlink rate nodes on device
* @devlink: devlink instance
*
- * Unset parent for all rate objects and destroy all rate nodes
- * on specified device.
+ * Unset parent for all rate objects involving this device and destroy all rate
+ * nodes on it.
*/
void devl_rate_nodes_destroy(struct devlink *devlink)
{
- const struct devlink_ops *ops = devlink->ops;
struct devlink_rate *devlink_rate, *tmp;
+ const struct devlink_ops *ops;
+ struct devlink *rate_devlink;
devl_assert_locked(devlink);
+ rate_devlink = devl_rate_lock(devlink);
- list_for_each_entry(devlink_rate, &devlink->rate_list, list) {
- if (!devlink_rate->parent)
+ list_for_each_entry(devlink_rate, &rate_devlink->rate_list, list) {
+ if (!devlink_rate->parent ||
+ (devlink_rate->devlink != devlink &&
+ devlink_rate->parent->devlink != devlink))
continue;
+ /* This could destroy rate objects on other devlinks in the
+ * same hierarchy under 'rate_devlink'. This is safe because
+ * the shared common ancestor is locked so there can be no
+ * other concurrent rate operations on devlink_rate->devlink.
+ */
+ ops = devlink_rate->devlink->ops;
if (devlink_rate_is_leaf(devlink_rate))
ops->rate_leaf_parent_set(devlink_rate, NULL, devlink_rate->priv,
NULL, NULL);
@@ -842,13 +943,17 @@ void devl_rate_nodes_destroy(struct devlink *devlink)
refcount_dec(&devlink_rate->parent->refcnt);
devlink_rate->parent = NULL;
}
- list_for_each_entry_safe(devlink_rate, tmp, &devlink->rate_list, list) {
- if (devlink_rate_is_node(devlink_rate)) {
+ ops = devlink->ops;
+ list_for_each_entry_safe(devlink_rate, tmp, &rate_devlink->rate_list,
+ list) {
+ if (devlink_rate->devlink == devlink &&
+ devlink_rate_is_node(devlink_rate)) {
ops->rate_node_del(devlink_rate, devlink_rate->priv, NULL);
list_del(&devlink_rate->list);
kfree(devlink_rate->name);
kfree(devlink_rate);
}
}
+ devl_rate_unlock(devlink, rate_devlink);
}
EXPORT_SYMBOL_GPL(devl_rate_nodes_destroy);
--
2.44.0
* [PATCH net-next V10 05/14] devlink: Add parent dev to devlink API
From: Tariq Toukan @ 2026-07-01 7:32 UTC (permalink / raw)
From: Cosmin Ratiu <cratiu@nvidia.com>
Upcoming changes to the rate commands need the parent devlink specified.
This change adds a nested 'parent-dev' attribute to the API and helpers
to obtain and put a reference to the parent devlink instance in
info->ctx.
To avoid deadlocks, the parent devlink is unlocked before obtaining the
main devlink instance that is the target of the request.
A reference to the parent is kept until the end of the request to
prevent it from disappearing.
Because the parent is no longer locked at that point, this reference is
of limited use without additional protection.
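
The order of operations described above can be modelled with a small
userspace sketch. Everything in it is an assumption made for
illustration: 'struct inst', 'struct req_ctx' and the helpers are
hypothetical and use a plain flag and counter rather than the kernel's
locking and refcounting primitives. The point is only the sequence:
take the parent, remember it, drop its lock while keeping the
reference, then take the target, and release both references at the
end of the request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instance with a reference count and a non-recursive lock. */
struct inst {
	int refcnt;
	bool locked;
};

/* Take a reference and the lock; locking twice would be a deadlock. */
static void inst_get_lock(struct inst *i)
{
	i->refcnt++;
	assert(!i->locked);
	i->locked = true;
}

static void inst_unlock(struct inst *i)
{
	assert(i->locked);
	i->locked = false;
}

static void inst_put(struct inst *i)
{
	assert(i->refcnt > 0);
	i->refcnt--;
}

/* Hypothetical per-request context. */
struct req_ctx {
	struct inst *target;
	struct inst *parent;	/* referenced but not locked */
};

static void req_begin(struct req_ctx *ctx, struct inst *target,
		      struct inst *parent)
{
	ctx->parent = NULL;
	if (parent) {
		inst_get_lock(parent);
		ctx->parent = parent;
		/* Drop the lock, keep the reference until req_end(). */
		inst_unlock(parent);
	}
	inst_get_lock(target);
	ctx->target = target;
}

static void req_end(struct req_ctx *ctx)
{
	inst_unlock(ctx->target);
	inst_put(ctx->target);
	if (ctx->parent)
		inst_put(ctx->parent);
}
```

In this model, passing the same instance as both parent and target
still works, because the parent lock has already been dropped by the
time the target is locked.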
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
Documentation/netlink/specs/devlink.yaml | 20 +++++++++++++
include/uapi/linux/devlink.h | 2 ++
net/devlink/devl_internal.h | 3 ++
net/devlink/netlink.c | 36 ++++++++++++++++++++----
4 files changed, 56 insertions(+), 5 deletions(-)
diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml
index 52ad1e7805d1..13d960b3abb1 100644
--- a/Documentation/netlink/specs/devlink.yaml
+++ b/Documentation/netlink/specs/devlink.yaml
@@ -895,6 +895,16 @@ attribute-sets:
resource-dump response. Bit 0 (dev) selects device-level
resources; bit 1 (port) selects port-level resources.
When absent all classes are returned.
+ -
+ name: parent-dev
+ type: nest
+ nested-attributes: dl-parent-dev
+ doc: |
+ Identifies the devlink instance which owns the parent rate node.
+ Used with rate-set and rate-new to parent a rate object to a node on
+ a different devlink instance, enabling cross-device rate scheduling.
+ When absent, the parent node is resolved on the same instance.
+
-
name: dl-dev-stats
subset-of: devlink
@@ -1317,6 +1327,16 @@ attribute-sets:
Specifies the bandwidth share assigned to the Traffic Class.
The bandwidth for the traffic class is determined
in proportion to the sum of the shares of all configured classes.
+ -
+ name: dl-parent-dev
+ subset-of: devlink
+ attributes:
+ -
+ name: bus-name
+ -
+ name: dev-name
+ -
+ name: index
operations:
enum-model: directional
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index ca713bcc47b9..a6801feb7744 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -648,6 +648,8 @@ enum devlink_attr {
DEVLINK_ATTR_INDEX, /* uint */
DEVLINK_ATTR_RESOURCE_SCOPE_MASK, /* u32 */
+ DEVLINK_ATTR_PARENT_DEV, /* nested */
+
/* Add new attributes above here, update the spec in
* Documentation/netlink/specs/devlink.yaml and re-generate
* net/devlink/netlink_gen.c.
diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
index 52c8bf359dd4..cdf894ba5a9d 100644
--- a/net/devlink/devl_internal.h
+++ b/net/devlink/devl_internal.h
@@ -154,6 +154,7 @@ int devlink_rel_devlink_handle_put(struct sk_buff *msg, struct devlink *devlink,
struct devlink_nl_ctx {
struct devlink *devlink;
struct devlink_port *devlink_port;
+ struct devlink *parent_devlink;
};
static inline struct devlink_nl_ctx *
@@ -197,6 +198,8 @@ typedef int devlink_nl_dump_one_func_t(struct sk_buff *msg,
struct devlink *
devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
bool dev_lock);
+struct devlink *
+devlink_get_parent_from_attrs_lock(struct net *net, struct nlattr **attrs);
int devlink_nl_dumpit(struct sk_buff *msg, struct netlink_callback *cb,
devlink_nl_dump_one_func_t *dump_one);
diff --git a/net/devlink/netlink.c b/net/devlink/netlink.c
index f0a857e286bc..5a057dc86b0f 100644
--- a/net/devlink/netlink.c
+++ b/net/devlink/netlink.c
@@ -12,6 +12,7 @@
#define DEVLINK_NL_FLAG_NEED_PORT BIT(0)
#define DEVLINK_NL_FLAG_NEED_DEVLINK_OR_PORT BIT(1)
#define DEVLINK_NL_FLAG_NEED_DEV_LOCK BIT(2)
+#define DEVLINK_NL_FLAG_OPTIONAL_PARENT_DEV BIT(3)
static const struct genl_multicast_group devlink_nl_mcgrps[] = {
[DEVLINK_MCGRP_CONFIG] = { .name = DEVLINK_GENL_MCGRP_CONFIG_NAME },
@@ -239,19 +240,39 @@ devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
return ERR_PTR(-ENODEV);
}
+struct devlink *
+devlink_get_parent_from_attrs_lock(struct net *net, struct nlattr **attrs)
+{
+ return ERR_PTR(-EOPNOTSUPP);
+}
+
static int __devlink_nl_pre_doit(struct sk_buff *skb, struct genl_info *info,
u8 flags)
{
+ bool parent_dev = flags & DEVLINK_NL_FLAG_OPTIONAL_PARENT_DEV;
bool dev_lock = flags & DEVLINK_NL_FLAG_NEED_DEV_LOCK;
+ struct devlink *devlink, *parent_devlink = NULL;
+ struct net *net = genl_info_net(info);
+ struct nlattr **attrs = info->attrs;
struct devlink_port *devlink_port;
- struct devlink *devlink;
int err;
- devlink = devlink_get_from_attrs_lock(genl_info_net(info), info->attrs,
- dev_lock);
- if (IS_ERR(devlink))
- return PTR_ERR(devlink);
+ if (parent_dev && attrs[DEVLINK_ATTR_PARENT_DEV]) {
+ parent_devlink = devlink_get_parent_from_attrs_lock(net, attrs);
+ if (IS_ERR(parent_devlink))
+ return PTR_ERR(parent_devlink);
+ devlink_nl_ctx(info)->parent_devlink = parent_devlink;
+ /* Drop the parent devlink lock but don't release the reference.
+ * This will keep it alive until the end of the request.
+ */
+ devl_unlock(parent_devlink);
+ }
+ devlink = devlink_get_from_attrs_lock(net, attrs, dev_lock);
+ if (IS_ERR(devlink)) {
+ err = PTR_ERR(devlink);
+ goto parent_put;
+ }
devlink_nl_ctx(info)->devlink = devlink;
if (flags & DEVLINK_NL_FLAG_NEED_PORT) {
devlink_port = devlink_port_get_from_info(devlink, info);
@@ -270,6 +291,9 @@ static int __devlink_nl_pre_doit(struct sk_buff *skb, struct genl_info *info,
unlock:
devl_dev_unlock(devlink, dev_lock);
devlink_put(devlink);
+parent_put:
+ if (parent_dev && parent_devlink)
+ devlink_put(parent_devlink);
return err;
}
@@ -307,6 +331,8 @@ static void __devlink_nl_post_doit(struct sk_buff *skb, struct genl_info *info,
devlink = devlink_nl_ctx(info)->devlink;
devl_dev_unlock(devlink, dev_lock);
devlink_put(devlink);
+ if (devlink_nl_ctx(info)->parent_devlink)
+ devlink_put(devlink_nl_ctx(info)->parent_devlink);
}
void devlink_nl_post_doit(const struct genl_split_ops *ops,
--
2.44.0
* [PATCH net-next V10 06/14] devlink: Allow parent dev for rate-set and rate-new
From: Tariq Toukan @ 2026-07-01 7:32 UTC (permalink / raw)
From: Cosmin Ratiu <cratiu@nvidia.com>
Currently, a devlink rate's parent device is assumed to be the same as
the one where the devlink rate is created.
This patch changes that to allow rate commands to accept an additional
argument that specifies the parent dev. This will allow devlink rate
groups with leafs from other devices.
Example of the new usage with ynl:
Creating a group on pci/0000:08:00.1 whose parent is the already
existing pci/0000:08:00.1/group1:
./tools/net/ynl/pyynl/cli.py --spec \
Documentation/netlink/specs/devlink.yaml --do rate-new --json '{
"bus-name": "pci",
"dev-name": "0000:08:00.1",
"rate-node-name": "group2",
"rate-parent-node-name": "group1",
"parent-dev": {
"bus-name": "pci",
"dev-name": "0000:08:00.1"
}
}'
Setting the parent of leaf node pci/0000:08:00.1/65537 to
pci/0000:08:00.0/group1:
./tools/net/ynl/pyynl/cli.py --spec \
Documentation/netlink/specs/devlink.yaml --do rate-set --json '{
"bus-name": "pci",
"dev-name": "0000:08:00.1",
"port-index": 65537,
"parent-dev": {
"bus-name": "pci",
"dev-name": "0000:08:00.0"
},
"rate-parent-node-name": "group1"
}'
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
Documentation/netlink/specs/devlink.yaml | 10 +++---
net/devlink/netlink.c | 40 +++++++++++++++++++++++-
net/devlink/netlink_gen.c | 24 +++++++++-----
net/devlink/netlink_gen.h | 8 +++++
net/devlink/rate.c | 4 ++-
5 files changed, 72 insertions(+), 14 deletions(-)
diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml
index 13d960b3abb1..38b1190f3d26 100644
--- a/Documentation/netlink/specs/devlink.yaml
+++ b/Documentation/netlink/specs/devlink.yaml
@@ -2309,8 +2309,8 @@ operations:
dont-validate: [strict]
flags: [admin-perm]
do:
- pre: devlink-nl-pre-doit
- post: devlink-nl-post-doit
+ pre: devlink-nl-pre-doit-parent-dev-optional
+ post: devlink-nl-post-doit-parent-dev-optional
request:
attributes:
- bus-name
@@ -2323,6 +2323,7 @@ operations:
- rate-tx-weight
- rate-parent-node-name
- rate-tc-bws
+ - parent-dev
-
name: rate-new
@@ -2331,8 +2332,8 @@ operations:
dont-validate: [strict]
flags: [admin-perm]
do:
- pre: devlink-nl-pre-doit
- post: devlink-nl-post-doit
+ pre: devlink-nl-pre-doit-parent-dev-optional
+ post: devlink-nl-post-doit-parent-dev-optional
request:
attributes:
- bus-name
@@ -2345,6 +2346,7 @@ operations:
- rate-tx-weight
- rate-parent-node-name
- rate-tc-bws
+ - parent-dev
-
name: rate-del
diff --git a/net/devlink/netlink.c b/net/devlink/netlink.c
index 5a057dc86b0f..300580c1a217 100644
--- a/net/devlink/netlink.c
+++ b/net/devlink/netlink.c
@@ -243,7 +243,29 @@ devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
struct devlink *
devlink_get_parent_from_attrs_lock(struct net *net, struct nlattr **attrs)
{
- return ERR_PTR(-EOPNOTSUPP);
+ unsigned int maxtype = ARRAY_SIZE(devlink_dl_parent_dev_nl_policy) - 1;
+ struct devlink *devlink;
+ struct nlattr **tb;
+ int err;
+
+ if (!attrs[DEVLINK_ATTR_PARENT_DEV])
+ return ERR_PTR(-EINVAL);
+
+ tb = kcalloc(maxtype + 1, sizeof(*tb), GFP_KERNEL);
+ if (!tb)
+ return ERR_PTR(-ENOMEM);
+
+ err = nla_parse_nested(tb, maxtype, attrs[DEVLINK_ATTR_PARENT_DEV],
+ devlink_dl_parent_dev_nl_policy, NULL);
+ if (err)
+ goto out;
+
+ devlink = devlink_get_from_attrs_lock(net, tb, false);
+ kfree(tb);
+ return devlink;
+out:
+ kfree(tb);
+ return ERR_PTR(err);
}
static int __devlink_nl_pre_doit(struct sk_buff *skb, struct genl_info *info,
@@ -322,6 +344,14 @@ int devlink_nl_pre_doit_port_optional(const struct genl_split_ops *ops,
return __devlink_nl_pre_doit(skb, info, DEVLINK_NL_FLAG_NEED_DEVLINK_OR_PORT);
}
+int devlink_nl_pre_doit_parent_dev_optional(const struct genl_split_ops *ops,
+ struct sk_buff *skb,
+ struct genl_info *info)
+{
+ return __devlink_nl_pre_doit(skb, info,
+ DEVLINK_NL_FLAG_OPTIONAL_PARENT_DEV);
+}
+
static void __devlink_nl_post_doit(struct sk_buff *skb, struct genl_info *info,
u8 flags)
{
@@ -348,6 +378,14 @@ devlink_nl_post_doit_dev_lock(const struct genl_split_ops *ops,
__devlink_nl_post_doit(skb, info, DEVLINK_NL_FLAG_NEED_DEV_LOCK);
}
+void
+devlink_nl_post_doit_parent_dev_optional(const struct genl_split_ops *ops,
+ struct sk_buff *skb,
+ struct genl_info *info)
+{
+ __devlink_nl_post_doit(skb, info, DEVLINK_NL_FLAG_OPTIONAL_PARENT_DEV);
+}
+
static int devlink_nl_inst_single_dumpit(struct sk_buff *msg,
struct netlink_callback *cb, int flags,
devlink_nl_dump_one_func_t *dump_one,
diff --git a/net/devlink/netlink_gen.c b/net/devlink/netlink_gen.c
index f52b0c2b19ed..dec00133178d 100644
--- a/net/devlink/netlink_gen.c
+++ b/net/devlink/netlink_gen.c
@@ -46,6 +46,12 @@ devlink_attr_param_type_validate(const struct nlattr *attr,
}
/* Common nested types */
+const struct nla_policy devlink_dl_parent_dev_nl_policy[DEVLINK_ATTR_INDEX + 1] = {
+ [DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
+ [DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
+ [DEVLINK_ATTR_INDEX] = NLA_POLICY_FULL_RANGE(NLA_UINT, &devlink_attr_index_range),
+};
+
const struct nla_policy devlink_dl_port_function_nl_policy[DEVLINK_PORT_FN_ATTR_CAPS + 1] = {
[DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR] = { .type = NLA_BINARY, },
[DEVLINK_PORT_FN_ATTR_STATE] = NLA_POLICY_MAX(NLA_U8, 1),
@@ -608,7 +614,7 @@ static const struct nla_policy devlink_rate_get_dump_nl_policy[DEVLINK_ATTR_INDE
};
/* DEVLINK_CMD_RATE_SET - do */
-static const struct nla_policy devlink_rate_set_nl_policy[DEVLINK_ATTR_INDEX + 1] = {
+static const struct nla_policy devlink_rate_set_nl_policy[DEVLINK_ATTR_PARENT_DEV + 1] = {
[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_INDEX] = NLA_POLICY_FULL_RANGE(NLA_UINT, &devlink_attr_index_range),
@@ -619,10 +625,11 @@ static const struct nla_policy devlink_rate_set_nl_policy[DEVLINK_ATTR_INDEX + 1
[DEVLINK_ATTR_RATE_TX_WEIGHT] = { .type = NLA_U32, },
[DEVLINK_ATTR_RATE_PARENT_NODE_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_RATE_TC_BWS] = NLA_POLICY_NESTED(devlink_dl_rate_tc_bws_nl_policy),
+ [DEVLINK_ATTR_PARENT_DEV] = NLA_POLICY_NESTED(devlink_dl_parent_dev_nl_policy),
};
/* DEVLINK_CMD_RATE_NEW - do */
-static const struct nla_policy devlink_rate_new_nl_policy[DEVLINK_ATTR_INDEX + 1] = {
+static const struct nla_policy devlink_rate_new_nl_policy[DEVLINK_ATTR_PARENT_DEV + 1] = {
[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_INDEX] = NLA_POLICY_FULL_RANGE(NLA_UINT, &devlink_attr_index_range),
@@ -633,6 +640,7 @@ static const struct nla_policy devlink_rate_new_nl_policy[DEVLINK_ATTR_INDEX + 1
[DEVLINK_ATTR_RATE_TX_WEIGHT] = { .type = NLA_U32, },
[DEVLINK_ATTR_RATE_PARENT_NODE_NAME] = { .type = NLA_NUL_STRING, },
[DEVLINK_ATTR_RATE_TC_BWS] = NLA_POLICY_NESTED(devlink_dl_rate_tc_bws_nl_policy),
+ [DEVLINK_ATTR_PARENT_DEV] = NLA_POLICY_NESTED(devlink_dl_parent_dev_nl_policy),
};
/* DEVLINK_CMD_RATE_DEL - do */
@@ -1290,21 +1298,21 @@ const struct genl_split_ops devlink_nl_ops[75] = {
{
.cmd = DEVLINK_CMD_RATE_SET,
.validate = GENL_DONT_VALIDATE_STRICT,
- .pre_doit = devlink_nl_pre_doit,
+ .pre_doit = devlink_nl_pre_doit_parent_dev_optional,
.doit = devlink_nl_rate_set_doit,
- .post_doit = devlink_nl_post_doit,
+ .post_doit = devlink_nl_post_doit_parent_dev_optional,
.policy = devlink_rate_set_nl_policy,
- .maxattr = DEVLINK_ATTR_INDEX,
+ .maxattr = DEVLINK_ATTR_PARENT_DEV,
.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
},
{
.cmd = DEVLINK_CMD_RATE_NEW,
.validate = GENL_DONT_VALIDATE_STRICT,
- .pre_doit = devlink_nl_pre_doit,
+ .pre_doit = devlink_nl_pre_doit_parent_dev_optional,
.doit = devlink_nl_rate_new_doit,
- .post_doit = devlink_nl_post_doit,
+ .post_doit = devlink_nl_post_doit_parent_dev_optional,
.policy = devlink_rate_new_nl_policy,
- .maxattr = DEVLINK_ATTR_INDEX,
+ .maxattr = DEVLINK_ATTR_PARENT_DEV,
.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
},
{
diff --git a/net/devlink/netlink_gen.h b/net/devlink/netlink_gen.h
index 20034b0929a8..a70e0e4769aa 100644
--- a/net/devlink/netlink_gen.h
+++ b/net/devlink/netlink_gen.h
@@ -13,6 +13,7 @@
#include <uapi/linux/devlink.h>
/* Common nested types */
+extern const struct nla_policy devlink_dl_parent_dev_nl_policy[DEVLINK_ATTR_INDEX + 1];
extern const struct nla_policy devlink_dl_port_function_nl_policy[DEVLINK_PORT_FN_ATTR_CAPS + 1];
extern const struct nla_policy devlink_dl_rate_tc_bws_nl_policy[DEVLINK_RATE_TC_ATTR_BW + 1];
extern const struct nla_policy devlink_dl_selftest_id_nl_policy[DEVLINK_ATTR_SELFTEST_ID_FLASH + 1];
@@ -29,12 +30,19 @@ int devlink_nl_pre_doit_port_optional(const struct genl_split_ops *ops,
struct genl_info *info);
int devlink_nl_pre_doit_dev_lock(const struct genl_split_ops *ops,
struct sk_buff *skb, struct genl_info *info);
+int devlink_nl_pre_doit_parent_dev_optional(const struct genl_split_ops *ops,
+ struct sk_buff *skb,
+ struct genl_info *info);
void
devlink_nl_post_doit(const struct genl_split_ops *ops, struct sk_buff *skb,
struct genl_info *info);
void
devlink_nl_post_doit_dev_lock(const struct genl_split_ops *ops,
struct sk_buff *skb, struct genl_info *info);
+void
+devlink_nl_post_doit_parent_dev_optional(const struct genl_split_ops *ops,
+ struct sk_buff *skb,
+ struct genl_info *info);
int devlink_nl_get_doit(struct sk_buff *skb, struct genl_info *info);
int devlink_nl_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
diff --git a/net/devlink/rate.c b/net/devlink/rate.c
index 295f4185fdfd..78a59d79c2ea 100644
--- a/net/devlink/rate.c
+++ b/net/devlink/rate.c
@@ -663,9 +663,11 @@ int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)
int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *rate_devlink, *devlink = devlink_nl_ctx(info)->devlink;
+ struct devlink_nl_ctx *ctx = devlink_nl_ctx(info);
+ struct devlink *devlink = ctx->devlink;
struct devlink_rate *rate_node;
const struct devlink_ops *ops;
+ struct devlink *rate_devlink;
int err;
ops = devlink->ops;
--
2.44.0
* [PATCH net-next V10 07/14] devlink: Allow rate node parents from other devlinks
2026-07-01 7:32 [PATCH net-next V10 00/14] devlink and mlx5: Support cross-function rate scheduling Tariq Toukan
` (5 preceding siblings ...)
2026-07-01 7:32 ` [PATCH net-next V10 06/14] devlink: Allow parent dev for rate-set and rate-new Tariq Toukan
@ 2026-07-01 7:32 ` Tariq Toukan
2026-07-01 7:32 ` [PATCH net-next V10 08/14] net/mlx5: qos: Use mlx5_lag_query_bond_speed to query LAG speed Tariq Toukan
` (6 subsequent siblings)
13 siblings, 0 replies; 15+ messages in thread
From: Tariq Toukan @ 2026-07-01 7:32 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
netdev, Paolo Abeni
Cc: Adithya Jayachandran, Bobby Eshleman, Carolina Jubran,
Cosmin Ratiu, Daniel Borkmann, Daniel Jurgens, Daniel Zahka,
David Wei, Donald Hunter, Dragos Tatulea, Jiri Pirko, Jiri Pirko,
Joe Damato, Jonathan Corbet, Kees Cook, Leon Romanovsky,
linux-doc, linux-kernel, linux-kselftest, linux-rdma, Mark Bloch,
Moshe Shemesh, Or Har-Toov, Parav Pandit, Petr Machata,
Ratheesh Kannoth, Saeed Mahameed, Shahar Shitrit, Shay Drori,
Shuah Khan, Shuah Khan, Simon Horman, Stanislav Fomichev,
Tariq Toukan, Willem de Bruijn, Gal Pressman
From: Cosmin Ratiu <cratiu@nvidia.com>
This commit makes use of the building blocks previously added to
implement cross-device rate nodes.
A new 'supported_cross_device_rate_nodes' bool is added to devlink_ops
which lets drivers advertise support for cross-device rate objects.
If enabled and if there is a common shared devlink instance, then:
- all rate objects will be stored in the top-most common nested instance
and
- rate objects can have parents from other devices sharing the same
common instance.
Storing rates in the common shared ancestor is safe, because it is
reference counted by its nested devlink instances, so it's guaranteed to
outlive them. Furthermore, the shared devlink infra guarantees a given
nested devlink hierarchy is managed by the same driver.
The parent devlink from info->ctx is not locked, so none of its mutable
fields can be used. But parent setting only requires devlink pointer
comparisons. Additionally, since the shared devlink is locked, other
rate operations cannot happen concurrently.
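The selection of the instance that stores the rates can be sketched in
plain userspace C. This is only an illustration under stated
assumptions: 'struct toy_devlink', toy_rate_devlink() and
toy_check_parent() are hypothetical names, the boolean stands in for
ops->supported_cross_device_rate_nodes, and all locking and reference
counting is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a devlink instance; not the kernel's struct devlink. */
struct toy_devlink {
	struct toy_devlink *nested_in;	/* instance this one is nested in, or NULL */
	bool cross_device_rate_nodes;	/* stands in for the new devlink_ops bool */
};

/* Walk up the nesting chain while the current instance advertises
 * cross-device rate nodes and has a parent. The result is the instance
 * where rate objects would be stored.
 */
static struct toy_devlink *toy_rate_devlink(struct toy_devlink *devlink)
{
	struct toy_devlink *rate_devlink = devlink;

	assert(devlink);
	while (rate_devlink->cross_device_rate_nodes && rate_devlink->nested_in)
		rate_devlink = rate_devlink->nested_in;
	return rate_devlink;
}

/* A parent from another instance is rejected unless the child's
 * instance advertises cross-device rate nodes.
 */
static int toy_check_parent(const struct toy_devlink *devlink,
			    const struct toy_devlink *parent_devlink)
{
	if (parent_devlink && parent_devlink != devlink &&
	    !devlink->cross_device_rate_nodes)
		return -EOPNOTSUPP;
	return 0;
}
```

In this model two functions nested in the same shared instance resolve
to that shared instance, while an instance without the flag keeps its
rates locally.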
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../networking/devlink/devlink-port.rst | 2 +
include/net/devlink.h | 9 ++
net/devlink/core.c | 4 +-
net/devlink/rate.c | 86 +++++++++++++++++--
4 files changed, 92 insertions(+), 9 deletions(-)
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
index 9374ebe70f48..18aca77006d5 100644
--- a/Documentation/networking/devlink/devlink-port.rst
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -420,6 +420,8 @@ API allows to configure following rate object's parameters:
Parent node name. Parent node rate limits are considered as additional limits
to all node children limits. ``tx_max`` is an upper limit for children.
``tx_share`` is a total bandwidth distributed among children.
+ If the device supports cross-function scheduling, the parent can be from a
+ different function of the same underlying device.
``tc_bw``
Allow users to set the bandwidth allocation per traffic class on rate
diff --git a/include/net/devlink.h b/include/net/devlink.h
index dd546dbd57cf..ffe1ad5fb70b 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -1594,6 +1594,15 @@ struct devlink_ops {
struct devlink_rate *parent,
void *priv_child, void *priv_parent,
struct netlink_ext_ack *extack);
+ /* Indicates if cross-device rate nodes are supported.
+ * This also requires a shared common ancestor object all devices that
+ * could share rate nodes are nested in.
+ * If enabled, rate operations may be called on an instance with only
+ * the common ancestor lock held and *without that instance lock held*.
+ * It is the driver's responsibility to ensure proper serialization
+ * with other operations.
+ */
+ bool supported_cross_device_rate_nodes;
/**
* selftests_check() - queries if selftest is supported
* @devlink: devlink instance
diff --git a/net/devlink/core.c b/net/devlink/core.c
index ee26c50b4118..c53a42e17a58 100644
--- a/net/devlink/core.c
+++ b/net/devlink/core.c
@@ -534,6 +534,9 @@ void devlink_free(struct devlink *devlink)
{
ASSERT_DEVLINK_NOT_REGISTERED(devlink);
+ devl_lock(devlink);
+ WARN_ON(devlink_rates_check(devlink, NULL, NULL));
+ devl_unlock(devlink);
devlink_rel_put(devlink);
WARN_ON(!list_empty(&devlink->trap_policer_list));
@@ -544,7 +547,6 @@ void devlink_free(struct devlink *devlink)
WARN_ON(!list_empty(&devlink->resource_list));
WARN_ON(!list_empty(&devlink->dpipe_table_list));
WARN_ON(!list_empty(&devlink->sb_list));
- WARN_ON(devlink_rates_check(devlink, NULL, NULL));
WARN_ON(!list_empty(&devlink->linecard_list));
WARN_ON(!xa_empty(&devlink->ports));
diff --git a/net/devlink/rate.c b/net/devlink/rate.c
index 78a59d79c2ea..e727c8b8b33e 100644
--- a/net/devlink/rate.c
+++ b/net/devlink/rate.c
@@ -30,14 +30,42 @@ devlink_rate_leaf_get_from_info(struct devlink *devlink, struct genl_info *info)
return devlink_rate ?: ERR_PTR(-ENODEV);
}
+/* Repeatedly walks the nested devlink chain while cross device rate nodes are
+ * supported and finds the topmost instance where rates should be stored.
+ * That instance is locked, referenced and returned.
+ * When cross device rate nodes aren't supported the original devlink instance
+ * is returned.
+ */
static struct devlink *devl_rate_lock(struct devlink *devlink)
{
- return devlink;
+ struct devlink *rate_devlink = devlink, *parent;
+
+ devl_assert_locked(devlink);
+
+ while (rate_devlink->ops &&
+ rate_devlink->ops->supported_cross_device_rate_nodes) {
+ parent = devlink_nested_in_get_lock(rate_devlink);
+ if (!parent)
+ break;
+ if (rate_devlink != devlink) {
+ /* Unlock intermediate instances. */
+ devl_unlock(rate_devlink);
+ devlink_put(rate_devlink);
+ }
+ rate_devlink = parent;
+ }
+ return rate_devlink;
}
+/* Unlocks and puts 'rate devlink' if different than 'devlink'. */
static void devl_rate_unlock(struct devlink *devlink,
struct devlink *rate_devlink)
{
+ if (devlink == rate_devlink)
+ return;
+
+ devl_unlock(rate_devlink);
+ devlink_put(rate_devlink);
}
static struct devlink_rate *
@@ -121,6 +149,25 @@ static int devlink_rate_put_tc_bws(struct sk_buff *msg, u32 *tc_bw)
return -EMSGSIZE;
}
+static int devlink_nl_rate_parent_fill(struct sk_buff *msg,
+ struct devlink_rate *devlink_rate)
+{
+ struct devlink_rate *parent = devlink_rate->parent;
+ struct devlink *devlink = parent->devlink;
+
+ if (nla_put_string(msg, DEVLINK_ATTR_RATE_PARENT_NODE_NAME,
+ parent->name))
+ return -EMSGSIZE;
+
+ if (devlink != devlink_rate->devlink &&
+ devlink_nl_put_nested_handle(msg,
+ devlink_net(devlink_rate->devlink),
+ devlink, DEVLINK_ATTR_PARENT_DEV))
+ return -EMSGSIZE;
+
+ return 0;
+}
+
static int devlink_nl_rate_fill(struct sk_buff *msg,
struct devlink_rate *devlink_rate,
enum devlink_command cmd, u32 portid, u32 seq,
@@ -165,10 +212,9 @@ static int devlink_nl_rate_fill(struct sk_buff *msg,
devlink_rate->tx_weight))
goto nla_put_failure;
- if (devlink_rate->parent)
- if (nla_put_string(msg, DEVLINK_ATTR_RATE_PARENT_NODE_NAME,
- devlink_rate->parent->name))
- goto nla_put_failure;
+ if (devlink_rate->parent &&
+ devlink_nl_rate_parent_fill(msg, devlink_rate))
+ goto nla_put_failure;
if (devlink_rate_put_tc_bws(msg, devlink_rate->tc_bw))
goto nla_put_failure;
@@ -322,13 +368,14 @@ devlink_nl_rate_parent_node_set(struct devlink_rate *devlink_rate,
struct genl_info *info,
struct nlattr *nla_parent)
{
- struct devlink *devlink = devlink_rate->devlink;
+ struct devlink *devlink = devlink_rate->devlink, *parent_devlink;
const char *parent_name = nla_data(nla_parent);
const struct devlink_ops *ops = devlink->ops;
size_t len = strlen(parent_name);
struct devlink_rate *parent;
int err = -EOPNOTSUPP;
+ parent_devlink = devlink_nl_ctx(info)->parent_devlink ? : devlink;
parent = devlink_rate->parent;
if (parent && !len) {
@@ -346,7 +393,13 @@ devlink_nl_rate_parent_node_set(struct devlink_rate *devlink_rate,
refcount_dec(&parent->refcnt);
devlink_rate->parent = NULL;
} else if (len) {
- parent = devlink_rate_node_get_by_name(rate_devlink, devlink,
+ /* parent_devlink (when different than devlink) isn't locked,
+ * but the rate node devlink instance is, so nobody from the
+ * same group of devices sharing rates could change the used
+ * fields or unregister the parent.
+ */
+ parent = devlink_rate_node_get_by_name(rate_devlink,
+ parent_devlink,
parent_name);
if (IS_ERR(parent))
return -ENODEV;
@@ -633,9 +686,11 @@ static bool devlink_rate_set_ops_supported(const struct devlink_ops *ops,
int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)
{
- struct devlink *rate_devlink, *devlink = devlink_nl_ctx(info)->devlink;
+ struct devlink_nl_ctx *ctx = devlink_nl_ctx(info);
+ struct devlink *devlink = ctx->devlink;
struct devlink_rate *devlink_rate;
const struct devlink_ops *ops;
+ struct devlink *rate_devlink;
int err;
rate_devlink = devl_rate_lock(devlink);
@@ -652,6 +707,14 @@ int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)
goto unlock;
}
+ if (ctx->parent_devlink && ctx->parent_devlink != devlink &&
+ !ops->supported_cross_device_rate_nodes) {
+ NL_SET_ERR_MSG(info->extack,
+ "Cross-device rate parents aren't supported");
+ err = -EOPNOTSUPP;
+ goto unlock;
+ }
+
err = devlink_nl_rate_set(devlink_rate, rate_devlink, ops, info);
if (!err)
@@ -679,6 +742,13 @@ int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
if (!devlink_rate_set_ops_supported(ops, info, DEVLINK_RATE_TYPE_NODE))
return -EOPNOTSUPP;
+ if (ctx->parent_devlink && ctx->parent_devlink != devlink &&
+ !ops->supported_cross_device_rate_nodes) {
+ NL_SET_ERR_MSG(info->extack,
+ "Cross-device rate parents aren't supported");
+ return -EOPNOTSUPP;
+ }
+
rate_devlink = devl_rate_lock(devlink);
rate_node = devlink_rate_node_get_from_attrs(rate_devlink, devlink,
info->attrs);
--
2.44.0
* [PATCH net-next V10 08/14] net/mlx5: qos: Use mlx5_lag_query_bond_speed to query LAG speed
2026-07-01 7:32 [PATCH net-next V10 00/14] devlink and mlx5: Support cross-function rate scheduling Tariq Toukan
` (6 preceding siblings ...)
2026-07-01 7:32 ` [PATCH net-next V10 07/14] devlink: Allow rate node parents from other devlinks Tariq Toukan
@ 2026-07-01 7:32 ` Tariq Toukan
2026-07-01 7:32 ` [PATCH net-next V10 09/14] net/mlx5: qos: Refactor vport QoS cleanup Tariq Toukan
` (5 subsequent siblings)
13 siblings, 0 replies; 15+ messages in thread
From: Tariq Toukan @ 2026-07-01 7:32 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
netdev, Paolo Abeni
Cc: Adithya Jayachandran, Bobby Eshleman, Carolina Jubran,
Cosmin Ratiu, Daniel Borkmann, Daniel Jurgens, Daniel Zahka,
David Wei, Donald Hunter, Dragos Tatulea, Jiri Pirko, Jiri Pirko,
Joe Damato, Jonathan Corbet, Kees Cook, Leon Romanovsky,
linux-doc, linux-kernel, linux-kselftest, linux-rdma, Mark Bloch,
Moshe Shemesh, Or Har-Toov, Parav Pandit, Petr Machata,
Ratheesh Kannoth, Saeed Mahameed, Shahar Shitrit, Shay Drori,
Shuah Khan, Shuah Khan, Simon Horman, Stanislav Fomichev,
Tariq Toukan, Willem de Bruijn, Gal Pressman
From: Cosmin Ratiu <cratiu@nvidia.com>
Previously, the master device of the uplink netdev was queried for its
maximum link speed from the QoS layer, requiring the uplink_netdev mutex
and possibly the RTNL (if the call originated from the TC matchall
layer).
Acquiring these locks here is risky, as lock cycles could form. The
locking for the QoS layer is about to change, so to avoid issues,
replace the code querying the LAG's max link speed with the existing
infrastructure added in commit [1].
This simplifies this part and avoids potential lock cycles.
One caveat is a new edge case: when the bond device is not fully formed
to represent the LAG device, the speed isn't calculated and is left at
0. This is now handled explicitly.
[1] commit f0b2fde98065 ("net/mlx5: Add support for querying bond
speed")
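The resulting decision can be sketched as a small userspace model. The
names below ('struct toy_dev', toy_max_link_speed(), TOY_SPEED_UNKNOWN)
are hypothetical, and the fallback to the port's own maximum speed is
an assumption about what the skip_lag path does, since that part of the
function is not touched by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SPEED_UNKNOWN UINT32_MAX	/* stands in for (u32)SPEED_UNKNOWN */

/* Toy device state; the fields model the inputs of the decision only. */
struct toy_dev {
	bool lag_active;
	int bond_query_err;	/* 0 on success, negative on failure */
	uint32_t bond_speed;	/* Mbps; 0 when the bond is not fully formed */
	uint32_t port_max_speed;	/* Mbps; assumed fallback value */
};

static uint32_t toy_max_link_speed(const struct toy_dev *dev)
{
	assert(dev);
	/* Use the bond speed only when LAG is active, the query succeeded
	 * and the reported speed is neither 0 nor unknown.
	 */
	if (dev->lag_active && dev->bond_query_err >= 0 &&
	    dev->bond_speed != 0 && dev->bond_speed != TOY_SPEED_UNKNOWN)
		return dev->bond_speed;
	/* Assumed skip_lag behaviour: use the port's own maximum speed. */
	return dev->port_max_speed;
}
```

The point of the sketch is that a bond speed of 0 is treated the same
way as an inactive LAG or a failed query.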
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/esw/qos.c | 36 ++++---------------
1 file changed, 6 insertions(+), 30 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
index faccc60fc93a..d04fda4b3778 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
@@ -1489,41 +1489,16 @@ static int esw_qos_node_enable_tc_arbitration(struct mlx5_esw_sched_node *node,
return err;
}
-static u32 mlx5_esw_qos_lag_link_speed_get(struct mlx5_core_dev *mdev,
- bool take_rtnl)
-{
- struct ethtool_link_ksettings lksettings;
- struct net_device *slave, *master;
- u32 speed = SPEED_UNKNOWN;
-
- slave = mlx5_uplink_netdev_get(mdev);
- if (!slave)
- goto out;
-
- if (take_rtnl)
- rtnl_lock();
- master = netdev_master_upper_dev_get(slave);
- if (master && !__ethtool_get_link_ksettings(master, &lksettings))
- speed = lksettings.base.speed;
- if (take_rtnl)
- rtnl_unlock();
-
-out:
- mlx5_uplink_netdev_put(mdev, slave);
- return speed;
-}
-
static int mlx5_esw_qos_max_link_speed_get(struct mlx5_core_dev *mdev, u32 *link_speed_max,
- bool take_rtnl,
struct netlink_ext_ack *extack)
{
int err;
- if (!mlx5_lag_is_active(mdev))
+ if (!mlx5_lag_is_active(mdev) ||
+ mlx5_lag_query_bond_speed(mdev, link_speed_max) < 0 ||
+ *link_speed_max == 0)
goto skip_lag;
- *link_speed_max = mlx5_esw_qos_lag_link_speed_get(mdev, take_rtnl);
-
if (*link_speed_max != (u32)SPEED_UNKNOWN)
return 0;
@@ -1560,7 +1535,8 @@ int mlx5_esw_qos_modify_vport_rate(struct mlx5_eswitch *esw, u16 vport_num, u32
return PTR_ERR(vport);
if (rate_mbps) {
- err = mlx5_esw_qos_max_link_speed_get(esw->dev, &link_speed_max, false, NULL);
+ err = mlx5_esw_qos_max_link_speed_get(esw->dev, &link_speed_max,
+ NULL);
if (err)
return err;
@@ -1598,7 +1574,7 @@ static int esw_qos_devlink_rate_to_mbps(struct mlx5_core_dev *mdev, const char *
return -EINVAL;
}
- err = mlx5_esw_qos_max_link_speed_get(mdev, &link_speed_max, true, extack);
+ err = mlx5_esw_qos_max_link_speed_get(mdev, &link_speed_max, extack);
if (err)
return err;
--
2.44.0
* [PATCH net-next V10 09/14] net/mlx5: qos: Refactor vport QoS cleanup
2026-07-01 7:32 [PATCH net-next V10 00/14] devlink and mlx5: Support cross-function rate scheduling Tariq Toukan
` (7 preceding siblings ...)
2026-07-01 7:32 ` [PATCH net-next V10 08/14] net/mlx5: qos: Use mlx5_lag_query_bond_speed to query LAG speed Tariq Toukan
@ 2026-07-01 7:32 ` Tariq Toukan
2026-07-01 7:32 ` [PATCH net-next V10 10/14] net/mlx5: qos: Model the root node in the scheduling hierarchy Tariq Toukan
` (4 subsequent siblings)
13 siblings, 0 replies; 15+ messages in thread
From: Tariq Toukan @ 2026-07-01 7:32 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
netdev, Paolo Abeni
Cc: Adithya Jayachandran, Bobby Eshleman, Carolina Jubran,
Cosmin Ratiu, Daniel Borkmann, Daniel Jurgens, Daniel Zahka,
David Wei, Donald Hunter, Dragos Tatulea, Jiri Pirko, Jiri Pirko,
Joe Damato, Jonathan Corbet, Kees Cook, Leon Romanovsky,
linux-doc, linux-kernel, linux-kselftest, linux-rdma, Mark Bloch,
Moshe Shemesh, Or Har-Toov, Parav Pandit, Petr Machata,
Ratheesh Kannoth, Saeed Mahameed, Shahar Shitrit, Shay Drori,
Shuah Khan, Shuah Khan, Simon Horman, Stanislav Fomichev,
Tariq Toukan, Willem de Bruijn, Gal Pressman
From: Cosmin Ratiu <cratiu@nvidia.com>
QoS cleanup is a complex affair because of the two modes of operation
(legacy and switchdev).
Leaf QoS is removed:
1. In legacy mode by esw_vport_cleanup() -> mlx5_esw_qos_vport_disable()
2. In switchdev mode by mlx5_esw_offloads_devlink_port_unregister() ->
mlx5_esw_qos_vport_update_parent(). A little later in the same flow, the
calls in 1 happen but they are noops.
Zooming out a bit, from both mlx5_eswitch_disable_locked() and
mlx5_eswitch_disable_sriov() the leaves are destroyed before the nodes,
which is the reverse of the correct order.
For SFs there's no devl_rate_nodes_destroy() call to unparent the
affected leaf.
Sanitize all of this by:
1. Destroying nodes before leaves in both legacy and switchdev mode.
2. Only removing vport QoS from esw_vport_cleanup(), reachable from both
legacy and switchdev and also reachable by SF removal.
3. Unexposing mlx5_esw_qos_vport_update_parent(), which becomes internal
to QoS.
4. Removing the WARN in mlx5_esw_qos_vport_disable().
This also takes care of a theoretical corner case, when
mlx5_esw_qos_vport_update_parent() tried to reattach the vport to
the original parent on failure, which can fail as well, leaving the
vport in a broken state.
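The intended teardown order can be sketched with a toy model. All names
here ('struct toy_rate', toy_set_parent(), toy_nodes_destroy(),
toy_leaf_cleanup()) are hypothetical, and the model assumes that
destroying the nodes first detaches every child, which is what lets the
later leaf cleanup run without a parent to worry about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy rate object: either a node (can have children) or a leaf. */
struct toy_rate {
	struct toy_rate *parent;
	int refcnt;	/* number of children attached to a node */
	bool is_node;
	bool alive;
};

static void toy_set_parent(struct toy_rate *child, struct toy_rate *parent)
{
	assert(parent->is_node && !child->parent);
	child->parent = parent;
	parent->refcnt++;
}

/* Step 1: detach all children, then delete the nodes. */
static void toy_nodes_destroy(struct toy_rate *rates, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!rates[i].parent)
			continue;
		rates[i].parent->refcnt--;
		rates[i].parent = NULL;
	}
	for (i = 0; i < n; i++) {
		if (!rates[i].is_node)
			continue;
		assert(rates[i].refcnt == 0);
		rates[i].alive = false;
	}
}

/* Step 2: leaf cleanup. Returns true if the leaf was already detached. */
static bool toy_leaf_cleanup(struct toy_rate *leaf)
{
	bool was_detached = !leaf->parent;

	assert(!leaf->is_node);
	leaf->alive = false;
	return was_detached;
}
```

With nodes destroyed first, every leaf is already detached when its
cleanup runs, so no reattach-on-failure path is needed in this model.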
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../mellanox/mlx5/core/esw/devlink_port.c | 1 -
.../net/ethernet/mellanox/mlx5/core/esw/qos.c | 14 ++++----------
.../net/ethernet/mellanox/mlx5/core/eswitch.c | 19 ++++++++++---------
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 2 --
4 files changed, 14 insertions(+), 22 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
index 6e50311faa27..8c27a33f9d7b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
@@ -268,7 +268,6 @@ void mlx5_esw_offloads_devlink_port_unregister(struct mlx5_vport *vport)
dl_port = vport->dl_port;
mlx5_esw_devlink_port_res_unregister(&dl_port->dl_port);
- mlx5_esw_qos_vport_update_parent(vport, NULL, NULL);
devl_rate_leaf_destroy(&dl_port->dl_port);
devl_port_unregister(&dl_port->dl_port);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
index d04fda4b3778..204f47c99142 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
@@ -1139,18 +1139,10 @@ static void mlx5_esw_qos_vport_disable_locked(struct mlx5_vport *vport)
void mlx5_esw_qos_vport_disable(struct mlx5_vport *vport)
{
struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
- struct mlx5_esw_sched_node *parent;
lockdep_assert_held(&esw->state_lock);
esw_qos_lock(esw);
- if (!vport->qos.sched_node)
- goto unlock;
-
- parent = vport->qos.sched_node->parent;
- WARN(parent, "Disabling QoS on port before detaching it from node");
-
mlx5_esw_qos_vport_disable_locked(vport);
-unlock:
esw_qos_unlock(esw);
}
@@ -1866,8 +1858,10 @@ int mlx5_esw_devlink_rate_node_del(struct devlink_rate *rate_node, void *priv,
return 0;
}
-int mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport, struct mlx5_esw_sched_node *parent,
- struct netlink_ext_ack *extack)
+static int
+mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport,
+ struct mlx5_esw_sched_node *parent,
+ struct netlink_ext_ack *extack)
{
struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
int err = 0;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index a0e2ca87b8d8..b67f15a8f766 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1990,6 +1990,13 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw, bool clear_vf)
esw->esw_funcs.num_vfs, esw->esw_funcs.num_ec_vfs, esw->enabled_vports);
mlx5_eswitch_invalidate_wq(esw);
+
+ if (esw->mode == MLX5_ESWITCH_OFFLOADS) {
+ struct devlink *devlink = priv_to_devlink(esw->dev);
+
+ devl_rate_nodes_destroy(devlink);
+ }
+
mlx5_esw_reps_block(esw);
if (!mlx5_core_is_ecpf(esw->dev)) {
@@ -2003,12 +2010,6 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw, bool clear_vf)
}
mlx5_esw_reps_unblock(esw);
-
- if (esw->mode == MLX5_ESWITCH_OFFLOADS) {
- struct devlink *devlink = priv_to_devlink(esw->dev);
-
- devl_rate_nodes_destroy(devlink);
- }
/* Destroy legacy fdb when disabling sriov in legacy mode. */
if (esw->mode == MLX5_ESWITCH_LEGACY)
mlx5_eswitch_disable_locked(esw);
@@ -2039,6 +2040,9 @@ void mlx5_eswitch_disable_locked(struct mlx5_eswitch *esw)
esw->mode == MLX5_ESWITCH_LEGACY ? "LEGACY" : "OFFLOADS",
esw->esw_funcs.num_vfs, esw->esw_funcs.num_ec_vfs, esw->enabled_vports);
+ if (esw->mode == MLX5_ESWITCH_OFFLOADS)
+ devl_rate_nodes_destroy(devlink);
+
if (esw->fdb_table.flags & MLX5_ESW_FDB_CREATED) {
esw->fdb_table.flags &= ~MLX5_ESW_FDB_CREATED;
if (esw->mode == MLX5_ESWITCH_OFFLOADS)
@@ -2047,9 +2051,6 @@ void mlx5_eswitch_disable_locked(struct mlx5_eswitch *esw)
esw_legacy_disable(esw);
mlx5_esw_acls_ns_cleanup(esw);
}
-
- if (esw->mode == MLX5_ESWITCH_OFFLOADS)
- devl_rate_nodes_destroy(devlink);
}
void mlx5_eswitch_disable(struct mlx5_eswitch *esw)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index fea72b1dedab..140343f2b913 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -482,8 +482,6 @@ int mlx5_eswitch_set_vport_trust(struct mlx5_eswitch *esw,
u16 vport_num, bool setting);
int mlx5_eswitch_set_vport_rate(struct mlx5_eswitch *esw, u16 vport,
u32 max_rate, u32 min_rate);
-int mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport, struct mlx5_esw_sched_node *node,
- struct netlink_ext_ack *extack);
int mlx5_eswitch_set_vepa(struct mlx5_eswitch *esw, u8 setting);
int mlx5_eswitch_get_vepa(struct mlx5_eswitch *esw, u8 *setting);
int mlx5_eswitch_get_vport_config(struct mlx5_eswitch *esw,
--
2.44.0
* [PATCH net-next V10 10/14] net/mlx5: qos: Model the root node in the scheduling hierarchy
2026-07-01 7:32 [PATCH net-next V10 00/14] devlink and mlx5: Support cross-function rate scheduling Tariq Toukan
` (8 preceding siblings ...)
2026-07-01 7:32 ` [PATCH net-next V10 09/14] net/mlx5: qos: Refactor vport QoS cleanup Tariq Toukan
@ 2026-07-01 7:32 ` Tariq Toukan
2026-07-01 7:32 ` [PATCH net-next V10 11/14] net/mlx5: qos: Remove qos domains and use shd Tariq Toukan
` (3 subsequent siblings)
13 siblings, 0 replies; 15+ messages in thread
From: Tariq Toukan @ 2026-07-01 7:32 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
netdev, Paolo Abeni
Cc: Adithya Jayachandran, Bobby Eshleman, Carolina Jubran,
Cosmin Ratiu, Daniel Borkmann, Daniel Jurgens, Daniel Zahka,
David Wei, Donald Hunter, Dragos Tatulea, Jiri Pirko, Jiri Pirko,
Joe Damato, Jonathan Corbet, Kees Cook, Leon Romanovsky,
linux-doc, linux-kernel, linux-kselftest, linux-rdma, Mark Bloch,
Moshe Shemesh, Or Har-Toov, Parav Pandit, Petr Machata,
Ratheesh Kannoth, Saeed Mahameed, Shahar Shitrit, Shay Drori,
Shuah Khan, Shuah Khan, Simon Horman, Stanislav Fomichev,
Tariq Toukan, Willem de Bruijn, Gal Pressman
From: Cosmin Ratiu <cratiu@nvidia.com>
In commit [1] the concept of the root node in the qos hierarchy was
removed due to a bug with how tx_share worked. The side effect is that
in many places, there are now corner cases related to parent handling.
However, since that change, support for tc_bw was added and now, with
upcoming cross-esw support, the code is about to become even more
complicated, increasing the number of such corner cases.
Bring back the concept of the root node, to which all esw vports and
nodes are connected. This benefits multiple operations which can
assume there's always a valid parent and don't have to do ternary
gymnastics to determine the correct esw to talk to.
As a side effect, there's no longer a need to store the groups in the
qos domain, since normalization can simply iterate over all children of
the root node. Normalization gets simplified as a result.
There should be no functional changes as a result of this change.
[1] commit 330f0f6713a3 ("net/mlx5: Remove default QoS group and attach
vports directly to root TSAR")
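A minimal userspace sketch of why an always-present root simplifies
things: attaching never needs a NULL-parent branch, and the min-rate
divider only looks at the children of one node. 'struct toy_node',
toy_attach() and toy_min_rate_divider() are hypothetical names; the
fixed-size child array replaces the kernel list, the root level of 1 is
an assumption chosen so that root children end up at level 2, and
returning 0 when no child has a min_rate is a simplification of this
sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_CHILDREN 8

struct toy_node {
	struct toy_node *parent;
	struct toy_node *children[TOY_MAX_CHILDREN];
	size_t num_children;
	unsigned int level;
	uint32_t min_rate;
};

/* Attach only: with a root node there is always a valid parent. */
static void toy_attach(struct toy_node *node, struct toy_node *parent)
{
	assert(parent && parent->num_children < TOY_MAX_CHILDREN);
	node->parent = parent;
	node->level = parent->level + 1;
	parent->children[parent->num_children++] = node;
}

/* Largest child min_rate maps to fw_max_bw_share; divider is at least 1. */
static uint32_t toy_min_rate_divider(const struct toy_node *parent,
				     uint32_t fw_max_bw_share)
{
	uint32_t max_guarantee = 0, divider;
	size_t i;

	assert(parent && fw_max_bw_share);
	for (i = 0; i < parent->num_children; i++)
		if (parent->children[i]->min_rate > max_guarantee)
			max_guarantee = parent->children[i]->min_rate;
	if (!max_guarantee)
		return 0;	/* sketch simplification, see lead-in */
	divider = max_guarantee / fw_max_bw_share;
	return divider ? divider : 1;
}
```

Normalization in this model is the same call whether 'parent' is the
root or an inner node, which mirrors the removal of the esw argument
from the helpers in this patch.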
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/esw/qos.c | 206 ++++++++----------
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 3 +-
2 files changed, 89 insertions(+), 120 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
index 204f47c99142..49c8ec0dac9a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
@@ -15,8 +15,6 @@
struct mlx5_qos_domain {
/* Serializes access to all qos changes in the qos domain. */
struct mutex lock;
- /* List of all mlx5_esw_sched_nodes. */
- struct list_head nodes;
};
static void esw_qos_lock(struct mlx5_eswitch *esw)
@@ -43,7 +41,6 @@ static struct mlx5_qos_domain *esw_qos_domain_alloc(void)
return NULL;
mutex_init(&qos_domain->lock);
- INIT_LIST_HEAD(&qos_domain->nodes);
return qos_domain;
}
@@ -62,6 +59,7 @@ static void esw_qos_domain_release(struct mlx5_eswitch *esw)
}
enum sched_node_type {
+ SCHED_NODE_TYPE_ROOT,
SCHED_NODE_TYPE_VPORTS_TSAR,
SCHED_NODE_TYPE_VPORT,
SCHED_NODE_TYPE_TC_ARBITER_TSAR,
@@ -106,18 +104,6 @@ struct mlx5_esw_sched_node {
u32 tc_bw[DEVLINK_RATE_TCS_MAX];
};
-static void esw_qos_node_attach_to_parent(struct mlx5_esw_sched_node *node)
-{
- if (!node->parent) {
- /* Root children are assigned a depth level of 2. */
- node->level = 2;
- list_add_tail(&node->entry, &node->esw->qos.domain->nodes);
- } else {
- node->level = node->parent->level + 1;
- list_add_tail(&node->entry, &node->parent->children);
- }
-}
-
static int esw_qos_num_tcs(struct mlx5_core_dev *dev)
{
int num_tcs = mlx5_max_tc(dev) + 1;
@@ -125,14 +111,14 @@ static int esw_qos_num_tcs(struct mlx5_core_dev *dev)
return num_tcs < DEVLINK_RATE_TCS_MAX ? num_tcs : DEVLINK_RATE_TCS_MAX;
}
-static void
-esw_qos_node_set_parent(struct mlx5_esw_sched_node *node, struct mlx5_esw_sched_node *parent)
+static void esw_qos_node_set_parent(struct mlx5_esw_sched_node *node,
+ struct mlx5_esw_sched_node *parent)
{
- list_del_init(&node->entry);
node->parent = parent;
- if (parent)
- node->esw = parent->esw;
- esw_qos_node_attach_to_parent(node);
+ node->esw = parent->esw;
+ node->level = parent->level + 1;
+ list_del(&node->entry);
+ list_add_tail(&node->entry, &parent->children);
}
static void esw_qos_nodes_set_parent(struct list_head *nodes,
@@ -321,22 +307,19 @@ static int esw_qos_create_rate_limit_element(struct mlx5_esw_sched_node *node,
return esw_qos_node_create_sched_element(node, sched_ctx, extack);
}
-static u32 esw_qos_calculate_min_rate_divider(struct mlx5_eswitch *esw,
- struct mlx5_esw_sched_node *parent)
+static u32
+esw_qos_calculate_min_rate_divider(struct mlx5_esw_sched_node *parent)
{
- struct list_head *nodes = parent ? &parent->children : &esw->qos.domain->nodes;
- u32 fw_max_bw_share = MLX5_CAP_QOS(esw->dev, max_tsar_bw_share);
+ u32 fw_max_bw_share = MLX5_CAP_QOS(parent->esw->dev, max_tsar_bw_share);
struct mlx5_esw_sched_node *node;
u32 max_guarantee = 0;
/* Find max min_rate across all nodes.
* This will correspond to fw_max_bw_share in the final bw_share calculation.
*/
- list_for_each_entry(node, nodes, entry) {
- if (node->esw == esw && node->ix != esw->qos.root_tsar_ix &&
- node->min_rate > max_guarantee)
+ list_for_each_entry(node, &parent->children, entry)
+ if (node->min_rate > max_guarantee)
max_guarantee = node->min_rate;
- }
if (max_guarantee)
return max_t(u32, max_guarantee / fw_max_bw_share, 1);
@@ -368,18 +351,13 @@ static void esw_qos_update_sched_node_bw_share(struct mlx5_esw_sched_node *node,
esw_qos_sched_elem_config(node, node->max_rate, bw_share, extack);
}
-static void esw_qos_normalize_min_rate(struct mlx5_eswitch *esw,
- struct mlx5_esw_sched_node *parent,
+static void esw_qos_normalize_min_rate(struct mlx5_esw_sched_node *parent,
struct netlink_ext_ack *extack)
{
- struct list_head *nodes = parent ? &parent->children : &esw->qos.domain->nodes;
- u32 divider = esw_qos_calculate_min_rate_divider(esw, parent);
+ u32 divider = esw_qos_calculate_min_rate_divider(parent);
struct mlx5_esw_sched_node *node;
- list_for_each_entry(node, nodes, entry) {
- if (node->esw != esw || node->ix == esw->qos.root_tsar_ix)
- continue;
-
+ list_for_each_entry(node, &parent->children, entry) {
/* Vports TC TSARs don't have a minimum rate configured,
* so there's no need to update the bw_share on them.
*/
@@ -391,7 +369,7 @@ static void esw_qos_normalize_min_rate(struct mlx5_eswitch *esw,
if (list_empty(&node->children))
continue;
- esw_qos_normalize_min_rate(node->esw, node, extack);
+ esw_qos_normalize_min_rate(node, extack);
}
}
@@ -412,14 +390,11 @@ static u32 esw_qos_calculate_tc_bw_divider(u32 *tc_bw)
static int esw_qos_set_node_min_rate(struct mlx5_esw_sched_node *node,
u32 min_rate, struct netlink_ext_ack *extack)
{
- struct mlx5_eswitch *esw = node->esw;
-
if (min_rate == node->min_rate)
return 0;
node->min_rate = min_rate;
- esw_qos_normalize_min_rate(esw, node->parent, extack);
-
+ esw_qos_normalize_min_rate(node->parent, extack);
return 0;
}
@@ -472,8 +447,7 @@ esw_qos_vport_create_sched_element(struct mlx5_esw_sched_node *vport_node,
SCHEDULING_CONTEXT_ELEMENT_TYPE_VPORT);
attr = MLX5_ADDR_OF(scheduling_context, sched_ctx, element_attributes);
MLX5_SET(vport_element, attr, vport_number, vport_node->vport->vport);
- MLX5_SET(scheduling_context, sched_ctx, parent_element_id,
- parent ? parent->ix : vport_node->esw->qos.root_tsar_ix);
+ MLX5_SET(scheduling_context, sched_ctx, parent_element_id, parent->ix);
MLX5_SET(scheduling_context, sched_ctx, max_average_bw,
vport_node->max_rate);
@@ -513,7 +487,7 @@ esw_qos_vport_tc_create_sched_element(struct mlx5_esw_sched_node *vport_tc_node,
}
static struct mlx5_esw_sched_node *
-__esw_qos_alloc_node(struct mlx5_eswitch *esw, u32 tsar_ix, enum sched_node_type type,
+__esw_qos_alloc_node(u32 tsar_ix, enum sched_node_type type,
struct mlx5_esw_sched_node *parent)
{
struct mlx5_esw_sched_node *node;
@@ -522,20 +496,12 @@ __esw_qos_alloc_node(struct mlx5_eswitch *esw, u32 tsar_ix, enum sched_node_type
if (!node)
return NULL;
- node->esw = esw;
node->ix = tsar_ix;
node->type = type;
- node->parent = parent;
INIT_LIST_HEAD(&node->children);
- esw_qos_node_attach_to_parent(node);
- if (!parent) {
- /* The caller is responsible for inserting the node into the
- * parent list if necessary. This function can also be used with
- * a NULL parent, which doesn't necessarily indicate that it
- * refers to the root scheduling element.
- */
- list_del_init(&node->entry);
- }
+ INIT_LIST_HEAD(&node->entry);
+ if (parent)
+ esw_qos_node_set_parent(node, parent);
return node;
}
@@ -570,7 +536,7 @@ static int esw_qos_create_vports_tc_node(struct mlx5_esw_sched_node *parent,
SCHEDULING_HIERARCHY_E_SWITCH))
return -EOPNOTSUPP;
- vports_tc_node = __esw_qos_alloc_node(parent->esw, 0,
+ vports_tc_node = __esw_qos_alloc_node(0,
SCHED_NODE_TYPE_VPORTS_TC_TSAR,
parent);
if (!vports_tc_node) {
@@ -665,7 +631,6 @@ static int esw_qos_create_tc_arbiter_sched_elem(
struct netlink_ext_ack *extack)
{
u32 tsar_ctx[MLX5_ST_SZ_DW(scheduling_context)] = {};
- u32 tsar_parent_ix;
void *attr;
if (!mlx5_qos_tsar_type_supported(tc_arbiter_node->esw->dev,
@@ -678,10 +643,8 @@ static int esw_qos_create_tc_arbiter_sched_elem(
attr = MLX5_ADDR_OF(scheduling_context, tsar_ctx, element_attributes);
MLX5_SET(tsar_element, attr, tsar_type, TSAR_ELEMENT_TSAR_TYPE_TC_ARB);
- tsar_parent_ix = tc_arbiter_node->parent ? tc_arbiter_node->parent->ix :
- tc_arbiter_node->esw->qos.root_tsar_ix;
MLX5_SET(scheduling_context, tsar_ctx, parent_element_id,
- tsar_parent_ix);
+ tc_arbiter_node->parent->ix);
MLX5_SET(scheduling_context, tsar_ctx, element_type,
SCHEDULING_CONTEXT_ELEMENT_TYPE_TSAR);
MLX5_SET(scheduling_context, tsar_ctx, max_average_bw,
@@ -694,37 +657,36 @@ static int esw_qos_create_tc_arbiter_sched_elem(
}
static struct mlx5_esw_sched_node *
-__esw_qos_create_vports_sched_node(struct mlx5_eswitch *esw, struct mlx5_esw_sched_node *parent,
+__esw_qos_create_vports_sched_node(struct mlx5_esw_sched_node *parent,
struct netlink_ext_ack *extack)
{
struct mlx5_esw_sched_node *node;
- u32 tsar_ix;
int err;
+ u32 ix;
- err = esw_qos_create_node_sched_elem(esw->dev, esw->qos.root_tsar_ix, 0,
- 0, &tsar_ix);
+ err = esw_qos_create_node_sched_elem(parent->esw->dev, parent->ix, 0, 0,
+ &ix);
if (err) {
NL_SET_ERR_MSG_MOD(extack, "E-Switch create TSAR for node failed");
return ERR_PTR(err);
}
- node = __esw_qos_alloc_node(esw, tsar_ix, SCHED_NODE_TYPE_VPORTS_TSAR, parent);
+ node = __esw_qos_alloc_node(ix, SCHED_NODE_TYPE_VPORTS_TSAR, parent);
if (!node) {
NL_SET_ERR_MSG_MOD(extack, "E-Switch alloc node failed");
err = -ENOMEM;
goto err_alloc_node;
}
- list_add_tail(&node->entry, &esw->qos.domain->nodes);
- esw_qos_normalize_min_rate(esw, NULL, extack);
- trace_mlx5_esw_node_qos_create(esw->dev, node, node->ix);
+ esw_qos_normalize_min_rate(parent, extack);
+ trace_mlx5_esw_node_qos_create(parent->esw->dev, node, node->ix);
return node;
err_alloc_node:
- if (mlx5_destroy_scheduling_element_cmd(esw->dev,
+ if (mlx5_destroy_scheduling_element_cmd(parent->esw->dev,
SCHEDULING_HIERARCHY_E_SWITCH,
- tsar_ix))
+ ix))
NL_SET_ERR_MSG_MOD(extack, "E-Switch destroy TSAR for node failed");
return ERR_PTR(err);
}
@@ -746,7 +708,7 @@ esw_qos_create_vports_sched_node(struct mlx5_eswitch *esw, struct netlink_ext_ac
if (err)
return ERR_PTR(err);
- node = __esw_qos_create_vports_sched_node(esw, NULL, extack);
+ node = __esw_qos_create_vports_sched_node(esw->qos.root, extack);
if (IS_ERR(node))
esw_qos_put(esw);
@@ -762,38 +724,47 @@ static void __esw_qos_destroy_node(struct mlx5_esw_sched_node *node, struct netl
trace_mlx5_esw_node_qos_destroy(esw->dev, node, node->ix);
esw_qos_destroy_node(node, extack);
- esw_qos_normalize_min_rate(esw, NULL, extack);
+ esw_qos_normalize_min_rate(esw->qos.root, extack);
}
static int esw_qos_create(struct mlx5_eswitch *esw, struct netlink_ext_ack *extack)
{
struct mlx5_core_dev *dev = esw->dev;
+ struct mlx5_esw_sched_node *root;
+ u32 root_ix;
int err;
if (!MLX5_CAP_GEN(dev, qos) || !MLX5_CAP_QOS(dev, esw_scheduling))
return -EOPNOTSUPP;
- err = esw_qos_create_node_sched_elem(esw->dev, 0, 0, 0,
- &esw->qos.root_tsar_ix);
+ err = esw_qos_create_node_sched_elem(esw->dev, 0, 0, 0, &root_ix);
if (err) {
esw_warn(dev, "E-Switch create root TSAR failed (%d)\n", err);
return err;
}
+ root = __esw_qos_alloc_node(root_ix, SCHED_NODE_TYPE_ROOT, NULL);
+ if (!root) {
+ esw_warn(dev, "E-Switch create root node failed\n");
+ err = -ENOMEM;
+ goto out_err;
+ }
+ root->esw = esw;
+ root->level = 1;
+ esw->qos.root = root;
refcount_set(&esw->qos.refcnt, 1);
return 0;
+out_err:
+ mlx5_destroy_scheduling_element_cmd(dev, SCHEDULING_HIERARCHY_E_SWITCH,
+ root_ix);
+ return err;
}
static void esw_qos_destroy(struct mlx5_eswitch *esw)
{
- int err;
-
- err = mlx5_destroy_scheduling_element_cmd(esw->dev,
- SCHEDULING_HIERARCHY_E_SWITCH,
- esw->qos.root_tsar_ix);
- if (err)
- esw_warn(esw->dev, "E-Switch destroy root TSAR failed (%d)\n", err);
+ esw_qos_destroy_node(esw->qos.root, NULL);
+ esw->qos.root = NULL;
}
static int esw_qos_get(struct mlx5_eswitch *esw, struct netlink_ext_ack *extack)
@@ -866,8 +837,7 @@ esw_qos_create_vport_tc_sched_node(struct mlx5_vport *vport,
u8 tc = vports_tc_node->tc;
int err;
- vport_tc_node = __esw_qos_alloc_node(vport_node->esw, 0,
- SCHED_NODE_TYPE_VPORT_TC,
+ vport_tc_node = __esw_qos_alloc_node(0, SCHED_NODE_TYPE_VPORT_TC,
vports_tc_node);
if (!vport_tc_node)
return -ENOMEM;
@@ -959,7 +929,7 @@ esw_qos_vport_tc_enable(struct mlx5_vport *vport, enum sched_node_type type,
/* Increase the parent's level by 2 to account for both the
* TC arbiter and the vports TC scheduling element.
*/
- new_level = (parent ? parent->level : 2) + 2;
+ new_level = parent->level + 2;
max_level = 1 << MLX5_CAP_QOS(vport_node->esw->dev,
log_esw_max_sched_depth);
if (new_level > max_level) {
@@ -997,7 +967,7 @@ esw_qos_vport_tc_enable(struct mlx5_vport *vport, enum sched_node_type type,
err_sched_nodes:
if (type == SCHED_NODE_TYPE_RATE_LIMITER) {
esw_qos_node_destroy_sched_element(vport_node, NULL);
- esw_qos_node_attach_to_parent(vport_node);
+ esw_qos_node_set_parent(vport_node, vport_node->parent);
} else {
esw_qos_tc_arbiter_scheduling_teardown(vport_node, NULL);
}
@@ -1055,7 +1025,7 @@ static void esw_qos_vport_disable(struct mlx5_vport *vport, struct netlink_ext_a
vport_node->bw_share = 0;
memset(vport_node->tc_bw, 0, sizeof(vport_node->tc_bw));
list_del_init(&vport_node->entry);
- esw_qos_normalize_min_rate(vport_node->esw, vport_node->parent, extack);
+ esw_qos_normalize_min_rate(vport_node->parent, extack);
trace_mlx5_esw_vport_qos_destroy(vport_node->esw->dev, vport);
}
@@ -1068,7 +1038,7 @@ static int esw_qos_vport_enable(struct mlx5_vport *vport,
struct mlx5_esw_sched_node *vport_node = vport->qos.sched_node;
int err;
- esw_assert_qos_lock_held(vport->dev->priv.eswitch);
+ esw_assert_qos_lock_held(vport_node->esw);
esw_qos_node_set_parent(vport_node, parent);
if (type == SCHED_NODE_TYPE_VPORT)
@@ -1079,7 +1049,7 @@ static int esw_qos_vport_enable(struct mlx5_vport *vport,
return err;
vport_node->type = type;
- esw_qos_normalize_min_rate(vport_node->esw, parent, extack);
+ esw_qos_normalize_min_rate(parent, extack);
trace_mlx5_esw_vport_qos_create(vport->dev, vport, vport_node->max_rate,
vport_node->bw_share);
@@ -1092,7 +1062,6 @@ static int mlx5_esw_qos_vport_enable(struct mlx5_vport *vport, enum sched_node_t
{
struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
struct mlx5_esw_sched_node *sched_node;
- struct mlx5_eswitch *parent_esw;
int err;
esw_assert_qos_lock_held(esw);
@@ -1100,14 +1069,13 @@ static int mlx5_esw_qos_vport_enable(struct mlx5_vport *vport, enum sched_node_t
if (err)
return err;
- parent_esw = parent ? parent->esw : esw;
- sched_node = __esw_qos_alloc_node(parent_esw, 0, type, parent);
+ if (!parent)
+ parent = esw->qos.root;
+ sched_node = __esw_qos_alloc_node(0, type, parent);
if (!sched_node) {
esw_qos_put(esw);
return -ENOMEM;
}
- if (!parent)
- list_add_tail(&sched_node->entry, &esw->qos.domain->nodes);
sched_node->max_rate = max_rate;
sched_node->min_rate = min_rate;
@@ -1279,10 +1247,9 @@ static int esw_qos_vport_update_parent(struct mlx5_vport *vport, struct mlx5_esw
/* Set vport QoS type based on parent node type if different from
* default QoS; otherwise, use the vport's current QoS type.
*/
- if (parent && parent->type == SCHED_NODE_TYPE_TC_ARBITER_TSAR)
+ if (parent->type == SCHED_NODE_TYPE_TC_ARBITER_TSAR)
type = SCHED_NODE_TYPE_RATE_LIMITER;
- else if (curr_parent &&
- curr_parent->type == SCHED_NODE_TYPE_TC_ARBITER_TSAR)
+ else if (curr_parent->type == SCHED_NODE_TYPE_TC_ARBITER_TSAR)
type = SCHED_NODE_TYPE_VPORT;
else
type = vport->qos.sched_node->type;
@@ -1311,11 +1278,9 @@ static int esw_qos_switch_tc_arbiter_node_to_vports(
struct mlx5_esw_sched_node *node,
struct netlink_ext_ack *extack)
{
- u32 parent_tsar_ix = node->parent ?
- node->parent->ix : node->esw->qos.root_tsar_ix;
int err;
- err = esw_qos_create_node_sched_elem(node->esw->dev, parent_tsar_ix,
+ err = esw_qos_create_node_sched_elem(node->esw->dev, node->parent->ix,
node->max_rate, node->bw_share,
&node->ix);
if (err) {
@@ -1370,8 +1335,8 @@ esw_qos_move_node(struct mlx5_esw_sched_node *curr_node)
{
struct mlx5_esw_sched_node *new_node;
- new_node = __esw_qos_alloc_node(curr_node->esw, curr_node->ix,
- curr_node->type, NULL);
+ new_node = __esw_qos_alloc_node(curr_node->ix, curr_node->type,
+ curr_node->parent);
if (!new_node)
return ERR_PTR(-ENOMEM);
@@ -1595,9 +1560,8 @@ static bool esw_qos_vport_validate_unsupported_tc_bw(struct mlx5_vport *vport,
u32 *tc_bw)
{
struct mlx5_esw_sched_node *node = vport->qos.sched_node;
- struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
-
- esw = (node && node->parent) ? node->parent->esw : esw;
+ struct mlx5_eswitch *esw = node ?
+ node->parent->esw : vport->dev->priv.eswitch;
return esw_qos_validate_unsupported_tc_bw(esw, tc_bw);
}
@@ -1622,8 +1586,9 @@ static void esw_vport_qos_prune_empty(struct mlx5_vport *vport)
if (!vport_node)
return;
- if (vport_node->parent || vport_node->max_rate ||
- vport_node->min_rate || !esw_qos_tc_bw_disabled(vport_node->tc_bw))
+ if (vport_node->parent != vport_node->esw->qos.root ||
+ vport_node->max_rate || vport_node->min_rate ||
+ !esw_qos_tc_bw_disabled(vport_node->tc_bw))
return;
mlx5_esw_qos_vport_disable_locked(vport);
@@ -1880,7 +1845,9 @@ mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport,
err = mlx5_esw_qos_vport_enable(vport, type, parent, 0, 0,
extack);
} else if (vport->qos.sched_node) {
- err = esw_qos_vport_update_parent(vport, parent, extack);
+ err = esw_qos_vport_update_parent(vport,
+ parent ? : esw->qos.root,
+ extack);
}
esw_qos_unlock(esw);
return err;
@@ -1928,7 +1895,7 @@ mlx5_esw_qos_node_validate_set_parent(struct mlx5_esw_sched_node *node,
{
u8 new_level, max_level;
- if (parent && parent->esw != node->esw) {
+ if (parent->esw != node->esw) {
NL_SET_ERR_MSG_MOD(extack,
"Cannot assign node to another E-Switch");
return -EOPNOTSUPP;
@@ -1940,13 +1907,13 @@ mlx5_esw_qos_node_validate_set_parent(struct mlx5_esw_sched_node *node,
return -EOPNOTSUPP;
}
- if (parent && parent->type == SCHED_NODE_TYPE_TC_ARBITER_TSAR) {
+ if (parent->type == SCHED_NODE_TYPE_TC_ARBITER_TSAR) {
NL_SET_ERR_MSG_MOD(extack,
"Cannot attach a node to a parent with TC bandwidth configured");
return -EOPNOTSUPP;
}
- new_level = parent ? parent->level + 1 : 2;
+ new_level = parent->level + 1;
if (node->type == SCHED_NODE_TYPE_TC_ARBITER_TSAR) {
/* Increase by one to account for the vports TC scheduling
* element.
@@ -1997,14 +1964,12 @@ static int esw_qos_vports_node_update_parent(struct mlx5_esw_sched_node *node,
{
struct mlx5_esw_sched_node *curr_parent = node->parent;
struct mlx5_eswitch *esw = node->esw;
- u32 parent_ix;
int err;
- parent_ix = parent ? parent->ix : node->esw->qos.root_tsar_ix;
mlx5_destroy_scheduling_element_cmd(esw->dev,
SCHEDULING_HIERARCHY_E_SWITCH,
node->ix);
- err = esw_qos_create_node_sched_elem(esw->dev, parent_ix,
+ err = esw_qos_create_node_sched_elem(esw->dev, parent->ix,
node->max_rate, 0, &node->ix);
if (err) {
NL_SET_ERR_MSG_MOD(extack,
@@ -2031,12 +1996,15 @@ static int mlx5_esw_qos_node_update_parent(struct mlx5_esw_sched_node *node,
struct mlx5_eswitch *esw = node->esw;
int err;
+ esw_qos_lock(esw);
+ curr_parent = node->parent;
+ if (!parent)
+ parent = esw->qos.root;
+
err = mlx5_esw_qos_node_validate_set_parent(node, parent, extack);
if (err)
- return err;
+ goto out;
- esw_qos_lock(esw);
- curr_parent = node->parent;
if (node->type == SCHED_NODE_TYPE_TC_ARBITER_TSAR) {
err = esw_qos_tc_arbiter_node_update_parent(node, parent,
extack);
@@ -2047,8 +2015,8 @@ static int mlx5_esw_qos_node_update_parent(struct mlx5_esw_sched_node *node,
if (err)
goto out;
- esw_qos_normalize_min_rate(esw, curr_parent, extack);
- esw_qos_normalize_min_rate(esw, parent, extack);
+ esw_qos_normalize_min_rate(curr_parent, extack);
+ esw_qos_normalize_min_rate(parent, extack);
out:
esw_qos_unlock(esw);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 140343f2b913..10c4eacd43b4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -415,8 +415,9 @@ struct mlx5_eswitch {
struct {
/* Initially 0, meaning no QoS users and QoS is disabled. */
refcount_t refcnt;
- u32 root_tsar_ix;
struct mlx5_qos_domain *domain;
+ /* The root node of the hierarchy. */
+ struct mlx5_esw_sched_node *root;
} qos;
struct mlx5_esw_bridge_offloads *br_offloads;
--
2.44.0
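
For readers skimming the diff above, the effect of modeling the root as a real node can be sketched outside the kernel. This is a minimal illustration only, not the driver code: the toy_* names are hypothetical, and the zero-guarantee return value is an assumption because that branch is not visible in the hunks shown. It reflects two things the patch does show: a node's level is always parent->level + 1 once the root sits at level 1, and the min-rate divider is the largest child min_rate divided by fw_max_bw_share, clamped to at least 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for mlx5_esw_sched_node; only the fields used here. */
struct toy_node {
	struct toy_node *parent;
	unsigned int level;
	unsigned int min_rate;
};

/* With an explicit root there is no NULL-parent special case left. */
static void toy_set_parent(struct toy_node *node, struct toy_node *parent)
{
	node->parent = parent;
	node->level = parent->level + 1;
}

/* Divider over the min rates of one parent's children. */
static unsigned int toy_min_rate_divider(const unsigned int *min_rates,
					 size_t count,
					 unsigned int fw_max_bw_share)
{
	unsigned int max_guarantee = 0;
	unsigned int divider;
	size_t i;

	for (i = 0; i < count; i++)
		if (min_rates[i] > max_guarantee)
			max_guarantee = min_rates[i];

	/* Assumption: no guarantee configured is reported as 0 here. */
	if (!max_guarantee)
		return 0;

	divider = max_guarantee / fw_max_bw_share;
	return divider ? divider : 1;
}
```

With the root at level 1, a node attached to it lands at level 2, which matches the value the removed esw_qos_node_attach_to_parent() hard-coded for root children.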
* [PATCH net-next V10 11/14] net/mlx5: qos: Remove qos domains and use shd
From: Tariq Toukan @ 2026-07-01 7:32 UTC (permalink / raw)
From: Cosmin Ratiu <cratiu@nvidia.com>
E-Switch QoS domains were added with the intention of eventually
implementing shared QoS domains to support cross-esw scheduling in the
previous approach ([1]), but they are no longer necessary in the new
approach.
Remove QoS domains and switch to using the shd lock for protecting
against concurrent QoS modifications.
Enable the supported_cross_device_rate_nodes devlink ops attribute so
that all calls originating from devlink rate acquire the shd lock. With
that in place, only the remaining entry points into QoS need to acquire
the shd lock themselves.
The wrinkle is that since shd can be NULL (e.g. on older HW without
serial number available), there needs to be a fallback locking
mechanism. The devlink instance lock cannot be used, as some code paths
into QoS (get, set & modify vport rate) happen with RTNL held, and the
existing devlink -> RTNL order prevents devlink lock usage there.
The remaining options for a fallback when shd is NULL are
esw->state_lock or a new lock. This patch uses esw->state_lock, which
implies:
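- 3 lock/unlock helper pairs to acquire/release the missing lock,
  chosen by what the caller already holds:
  - esw_qos_{,un}lock: caller holds neither lock; acquire/release the
    shd lock, or esw->state_lock when shd is NULL.
  - esw_qos_shd_{,un}lock: caller already holds esw->state_lock;
    acquire/release the shd lock if it exists.
  - esw_qos_devlink_{,un}lock: caller already holds the shd lock if it
    exists; acquire/release esw->state_lock only when shd is NULL.
- esw_assert_qos_lock_held now asserts that the shd lock is held, or
  that esw->state_lock is held when shd is NULL.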
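
The helper selection described above can be sketched as a small user-space model. This is an illustration under stated assumptions, not the driver code: the toy_* names are hypothetical, and locks are modeled as plain flags (not real mutexes) so that a double acquire trips an assert. It only shows which lock each helper pair touches, depending on whether shd exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock: a flag instead of a real mutex, so misuse trips an assert. */
struct toy_lock {
	bool held;
};

/* Hypothetical stand-in for mlx5_core_dev; shd == NULL models older HW. */
struct toy_dev {
	struct toy_lock *shd;
	struct toy_lock state_lock;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* The lock that serializes QoS: shd if it exists, else state_lock. */
static struct toy_lock *toy_qos_serializer(struct toy_dev *dev)
{
	return dev->shd ? dev->shd : &dev->state_lock;
}

/* Caller holds neither lock. */
static void toy_qos_lock(struct toy_dev *dev)
{
	toy_lock_acquire(toy_qos_serializer(dev));
}

static void toy_qos_unlock(struct toy_dev *dev)
{
	toy_lock_release(toy_qos_serializer(dev));
}

/* Caller already holds state_lock, so only shd can be missing. */
static void toy_qos_shd_lock(struct toy_dev *dev)
{
	if (dev->shd)
		toy_lock_acquire(dev->shd);
}

static void toy_qos_shd_unlock(struct toy_dev *dev)
{
	if (dev->shd)
		toy_lock_release(dev->shd);
}

/* Caller (a devlink rate op) already holds shd when it exists. */
static void toy_qos_devlink_lock(struct toy_dev *dev)
{
	if (!dev->shd)
		toy_lock_acquire(&dev->state_lock);
}

static void toy_qos_devlink_unlock(struct toy_dev *dev)
{
	if (!dev->shd)
		toy_lock_release(&dev->state_lock);
}

/* Mirrors the assert helper: the serializing lock must be held. */
static bool toy_qos_lock_held(struct toy_dev *dev)
{
	return toy_qos_serializer(dev)->held;
}
```

In every combination the same invariant holds after the helper returns: the serializing lock (shd if present, otherwise state_lock) is held, which is what esw_assert_qos_lock_held checks.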
Use the corresponding lock/unlock function in all places where either
shd or state_lock would need to be acquired.
Document all of this trickery next to esw_assert_qos_lock_held.
Enabling supported_cross_device_rate_nodes now is safe, because
mlx5_esw_qos_vport_update_parent rejects cross-esw parent updates.
This will change in the next patch.
[1]
https://lore.kernel.org/netdev/20250213180134.323929-1-tariqt@nvidia.com/
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/devlink.c | 1 +
.../net/ethernet/mellanox/mlx5/core/esw/qos.c | 245 ++++++++----------
.../net/ethernet/mellanox/mlx5/core/esw/qos.h | 3 -
.../net/ethernet/mellanox/mlx5/core/eswitch.c | 8 -
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 13 +-
5 files changed, 120 insertions(+), 150 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index c31e05529fc4..b9026cc64383 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -383,6 +383,7 @@ static const struct devlink_ops mlx5_devlink_ops = {
.rate_node_del = mlx5_esw_devlink_rate_node_del,
.rate_leaf_parent_set = mlx5_esw_devlink_rate_leaf_parent_set,
.rate_node_parent_set = mlx5_esw_devlink_rate_node_parent_set,
+ .supported_cross_device_rate_nodes = true,
#endif
#ifdef CONFIG_MLX5_SF_MANAGER
.port_new = mlx5_devlink_sf_port_new,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
index 49c8ec0dac9a..80a28596349b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
@@ -11,53 +11,6 @@
/* Minimum supported BW share value by the HW is 1 Mbit/sec */
#define MLX5_MIN_BW_SHARE 1
-/* Holds rate nodes associated with an E-Switch. */
-struct mlx5_qos_domain {
- /* Serializes access to all qos changes in the qos domain. */
- struct mutex lock;
-};
-
-static void esw_qos_lock(struct mlx5_eswitch *esw)
-{
- mutex_lock(&esw->qos.domain->lock);
-}
-
-static void esw_qos_unlock(struct mlx5_eswitch *esw)
-{
- mutex_unlock(&esw->qos.domain->lock);
-}
-
-static void esw_assert_qos_lock_held(struct mlx5_eswitch *esw)
-{
- lockdep_assert_held(&esw->qos.domain->lock);
-}
-
-static struct mlx5_qos_domain *esw_qos_domain_alloc(void)
-{
- struct mlx5_qos_domain *qos_domain;
-
- qos_domain = kzalloc_obj(*qos_domain);
- if (!qos_domain)
- return NULL;
-
- mutex_init(&qos_domain->lock);
-
- return qos_domain;
-}
-
-static int esw_qos_domain_init(struct mlx5_eswitch *esw)
-{
- esw->qos.domain = esw_qos_domain_alloc();
-
- return esw->qos.domain ? 0 : -ENOMEM;
-}
-
-static void esw_qos_domain_release(struct mlx5_eswitch *esw)
-{
- kfree(esw->qos.domain);
- esw->qos.domain = NULL;
-}
-
enum sched_node_type {
SCHED_NODE_TYPE_ROOT,
SCHED_NODE_TYPE_VPORTS_TSAR,
@@ -104,6 +57,65 @@ struct mlx5_esw_sched_node {
u32 tc_bw[DEVLINK_RATE_TCS_MAX];
};
+/* Locking notes:
+ * QoS changes are normally protected by the shd lock. But on older HW shd
+ * might not be created at all, so there needs to be a fallback serialization
+ * mechanism. This is esw->state_lock.
+ * Callers into QoS hold a combination of RTNL, devlink instance lock and
+ * esw->state_lock. Devlink rate ops additionally hold the shd lock if it
+ * exists.
+ * - VF rate ops use esw_qos_lock/esw_qos_unlock.
+ * - callers with esw->state_lock held use esw_qos_shd_lock/esw_qos_shd_unlock.
+ * - devlink callers use esw_qos_devlink_lock/esw_qos_devlink_unlock.
+ */
+static void esw_assert_qos_lock_held(struct mlx5_core_dev *dev)
+{
+ if (dev->shd)
+ devl_assert_locked(dev->shd);
+ else
+ lockdep_assert_held(&dev->priv.eswitch->state_lock);
+}
+
+static void esw_qos_lock(struct mlx5_core_dev *dev)
+{
+ if (dev->shd)
+ devl_lock(dev->shd);
+ else
+ mutex_lock(&dev->priv.eswitch->state_lock);
+}
+
+static void esw_qos_unlock(struct mlx5_core_dev *dev)
+{
+ if (dev->shd)
+ devl_unlock(dev->shd);
+ else
+ mutex_unlock(&dev->priv.eswitch->state_lock);
+}
+
+static void esw_qos_shd_lock(struct mlx5_core_dev *dev)
+{
+ if (dev->shd)
+ devl_lock(dev->shd);
+}
+
+static void esw_qos_shd_unlock(struct mlx5_core_dev *dev)
+{
+ if (dev->shd)
+ devl_unlock(dev->shd);
+}
+
+static void esw_qos_devlink_lock(struct mlx5_core_dev *dev)
+{
+ if (!dev->shd)
+ mutex_lock(&dev->priv.eswitch->state_lock);
+}
+
+static void esw_qos_devlink_unlock(struct mlx5_core_dev *dev)
+{
+ if (!dev->shd)
+ mutex_unlock(&dev->priv.eswitch->state_lock);
+}
+
static int esw_qos_num_tcs(struct mlx5_core_dev *dev)
{
int num_tcs = mlx5_max_tc(dev) + 1;
@@ -700,7 +712,7 @@ esw_qos_create_vports_sched_node(struct mlx5_eswitch *esw, struct netlink_ext_ac
struct mlx5_esw_sched_node *node;
int err;
- esw_assert_qos_lock_held(esw);
+ esw_assert_qos_lock_held(esw->dev);
if (!MLX5_CAP_QOS(esw->dev, log_esw_max_sched_depth))
return ERR_PTR(-EOPNOTSUPP);
@@ -771,7 +783,7 @@ static int esw_qos_get(struct mlx5_eswitch *esw, struct netlink_ext_ack *extack)
{
int err = 0;
- esw_assert_qos_lock_held(esw);
+ esw_assert_qos_lock_held(esw->dev);
if (!refcount_inc_not_zero(&esw->qos.refcnt)) {
/* esw_qos_create() set refcount to 1 only on success.
* No need to decrement on failure.
@@ -784,7 +796,7 @@ static int esw_qos_get(struct mlx5_eswitch *esw, struct netlink_ext_ack *extack)
static void esw_qos_put(struct mlx5_eswitch *esw)
{
- esw_assert_qos_lock_held(esw);
+ esw_assert_qos_lock_held(esw->dev);
if (refcount_dec_and_test(&esw->qos.refcnt))
esw_qos_destroy(esw);
}
@@ -940,7 +952,7 @@ esw_qos_vport_tc_enable(struct mlx5_vport *vport, enum sched_node_type type,
}
}
- esw_assert_qos_lock_held(vport->dev->priv.eswitch);
+ esw_assert_qos_lock_held(vport->dev);
if (type == SCHED_NODE_TYPE_RATE_LIMITER)
err = esw_qos_create_rate_limit_element(vport_node, extack);
@@ -1038,7 +1050,7 @@ static int esw_qos_vport_enable(struct mlx5_vport *vport,
struct mlx5_esw_sched_node *vport_node = vport->qos.sched_node;
int err;
- esw_assert_qos_lock_held(vport_node->esw);
+ esw_assert_qos_lock_held(vport->dev);
esw_qos_node_set_parent(vport_node, parent);
if (type == SCHED_NODE_TYPE_VPORT)
@@ -1064,7 +1076,7 @@ static int mlx5_esw_qos_vport_enable(struct mlx5_vport *vport, enum sched_node_t
struct mlx5_esw_sched_node *sched_node;
int err;
- esw_assert_qos_lock_held(esw);
+ esw_assert_qos_lock_held(vport->dev);
err = esw_qos_get(esw, extack);
if (err)
return err;
@@ -1093,15 +1105,13 @@ static int mlx5_esw_qos_vport_enable(struct mlx5_vport *vport, enum sched_node_t
static void mlx5_esw_qos_vport_disable_locked(struct mlx5_vport *vport)
{
- struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
-
- esw_assert_qos_lock_held(esw);
+ esw_assert_qos_lock_held(vport->dev);
if (!vport->qos.sched_node)
return;
esw_qos_vport_disable(vport, NULL);
mlx5_esw_qos_vport_qos_free(vport);
- esw_qos_put(esw);
+ esw_qos_put(vport->dev->priv.eswitch);
}
void mlx5_esw_qos_vport_disable(struct mlx5_vport *vport)
@@ -1109,9 +1119,9 @@ void mlx5_esw_qos_vport_disable(struct mlx5_vport *vport)
struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
lockdep_assert_held(&esw->state_lock);
- esw_qos_lock(esw);
+ esw_qos_shd_lock(vport->dev);
mlx5_esw_qos_vport_disable_locked(vport);
- esw_qos_unlock(esw);
+ esw_qos_shd_unlock(vport->dev);
}
static int mlx5_esw_qos_set_vport_max_rate(struct mlx5_vport *vport, u32 max_rate,
@@ -1119,7 +1129,7 @@ static int mlx5_esw_qos_set_vport_max_rate(struct mlx5_vport *vport, u32 max_rat
{
struct mlx5_esw_sched_node *vport_node = vport->qos.sched_node;
- esw_assert_qos_lock_held(vport->dev->priv.eswitch);
+ esw_assert_qos_lock_held(vport->dev);
if (!vport_node)
return mlx5_esw_qos_vport_enable(vport, SCHED_NODE_TYPE_VPORT, NULL, max_rate, 0,
@@ -1134,7 +1144,7 @@ static int mlx5_esw_qos_set_vport_min_rate(struct mlx5_vport *vport, u32 min_rat
{
struct mlx5_esw_sched_node *vport_node = vport->qos.sched_node;
- esw_assert_qos_lock_held(vport->dev->priv.eswitch);
+ esw_assert_qos_lock_held(vport->dev);
if (!vport_node)
return mlx5_esw_qos_vport_enable(vport, SCHED_NODE_TYPE_VPORT, NULL, 0, min_rate,
@@ -1147,29 +1157,27 @@ static int mlx5_esw_qos_set_vport_min_rate(struct mlx5_vport *vport, u32 min_rat
int mlx5_esw_qos_set_vport_rate(struct mlx5_vport *vport, u32 max_rate, u32 min_rate)
{
- struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
int err;
- esw_qos_lock(esw);
+ esw_qos_lock(vport->dev);
err = mlx5_esw_qos_set_vport_min_rate(vport, min_rate, NULL);
if (!err)
err = mlx5_esw_qos_set_vport_max_rate(vport, max_rate, NULL);
- esw_qos_unlock(esw);
+ esw_qos_unlock(vport->dev);
return err;
}
bool mlx5_esw_qos_get_vport_rate(struct mlx5_vport *vport, u32 *max_rate, u32 *min_rate)
{
- struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
bool enabled;
- esw_qos_lock(esw);
+ esw_qos_shd_lock(vport->dev);
enabled = !!vport->qos.sched_node;
if (enabled) {
*max_rate = vport->qos.sched_node->max_rate;
*min_rate = vport->qos.sched_node->min_rate;
}
- esw_qos_unlock(esw);
+ esw_qos_shd_unlock(vport->dev);
return enabled;
}
@@ -1205,7 +1213,7 @@ static int esw_qos_vport_update(struct mlx5_vport *vport,
u32 curr_tc_bw[DEVLINK_RATE_TCS_MAX] = {0};
int err;
- esw_assert_qos_lock_held(vport->dev->priv.eswitch);
+ esw_assert_qos_lock_held(vport->dev);
if (curr_type == type && curr_parent == parent)
return 0;
@@ -1235,11 +1243,10 @@ static int esw_qos_vport_update(struct mlx5_vport *vport,
static int esw_qos_vport_update_parent(struct mlx5_vport *vport, struct mlx5_esw_sched_node *parent,
struct netlink_ext_ack *extack)
{
- struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
struct mlx5_esw_sched_node *curr_parent;
enum sched_node_type type;
- esw_assert_qos_lock_held(esw);
+ esw_assert_qos_lock_held(vport->dev);
curr_parent = vport->qos.sched_node->parent;
if (curr_parent == parent)
return 0;
@@ -1503,9 +1510,9 @@ int mlx5_esw_qos_modify_vport_rate(struct mlx5_eswitch *esw, u16 vport_num, u32
return err;
}
- esw_qos_lock(esw);
+ esw_qos_lock(vport->dev);
err = mlx5_esw_qos_set_vport_max_rate(vport, rate_mbps, NULL);
- esw_qos_unlock(esw);
+ esw_qos_unlock(vport->dev);
return err;
}
@@ -1582,7 +1589,7 @@ static void esw_vport_qos_prune_empty(struct mlx5_vport *vport)
{
struct mlx5_esw_sched_node *vport_node = vport->qos.sched_node;
- esw_assert_qos_lock_held(vport->dev->priv.eswitch);
+ esw_assert_qos_lock_held(vport->dev);
if (!vport_node)
return;
@@ -1594,44 +1601,26 @@ static void esw_vport_qos_prune_empty(struct mlx5_vport *vport)
mlx5_esw_qos_vport_disable_locked(vport);
}
-int mlx5_esw_qos_init(struct mlx5_eswitch *esw)
-{
- if (esw->qos.domain)
- return 0; /* Nothing to change. */
-
- return esw_qos_domain_init(esw);
-}
-
-void mlx5_esw_qos_cleanup(struct mlx5_eswitch *esw)
-{
- if (esw->qos.domain)
- esw_qos_domain_release(esw);
-}
-
/* Eswitch devlink rate API */
int mlx5_esw_devlink_rate_leaf_tx_share_set(struct devlink_rate *rate_leaf, void *priv,
u64 tx_share, struct netlink_ext_ack *extack)
{
struct mlx5_vport *vport = priv;
- struct mlx5_eswitch *esw;
int err;
- esw = vport->dev->priv.eswitch;
- if (!mlx5_esw_allowed(esw))
+ if (!mlx5_esw_allowed(vport->dev->priv.eswitch))
return -EPERM;
err = esw_qos_devlink_rate_to_mbps(vport->dev, "tx_share", &tx_share, extack);
if (err)
return err;
- esw_qos_lock(esw);
+ esw_qos_devlink_lock(vport->dev);
err = mlx5_esw_qos_set_vport_min_rate(vport, tx_share, extack);
- if (err)
- goto out;
- esw_vport_qos_prune_empty(vport);
-out:
- esw_qos_unlock(esw);
+ if (!err)
+ esw_vport_qos_prune_empty(vport);
+ esw_qos_devlink_unlock(vport->dev);
return err;
}
@@ -1639,24 +1628,20 @@ int mlx5_esw_devlink_rate_leaf_tx_max_set(struct devlink_rate *rate_leaf, void *
u64 tx_max, struct netlink_ext_ack *extack)
{
struct mlx5_vport *vport = priv;
- struct mlx5_eswitch *esw;
int err;
- esw = vport->dev->priv.eswitch;
- if (!mlx5_esw_allowed(esw))
+ if (!mlx5_esw_allowed(vport->dev->priv.eswitch))
return -EPERM;
err = esw_qos_devlink_rate_to_mbps(vport->dev, "tx_max", &tx_max, extack);
if (err)
return err;
- esw_qos_lock(esw);
+ esw_qos_devlink_lock(vport->dev);
err = mlx5_esw_qos_set_vport_max_rate(vport, tx_max, extack);
- if (err)
- goto out;
- esw_vport_qos_prune_empty(vport);
-out:
- esw_qos_unlock(esw);
+ if (!err)
+ esw_vport_qos_prune_empty(vport);
+ esw_qos_devlink_unlock(vport->dev);
return err;
}
@@ -1667,16 +1652,14 @@ int mlx5_esw_devlink_rate_leaf_tc_bw_set(struct devlink_rate *rate_leaf,
{
struct mlx5_esw_sched_node *vport_node;
struct mlx5_vport *vport = priv;
- struct mlx5_eswitch *esw;
bool disable;
int err = 0;
- esw = vport->dev->priv.eswitch;
- if (!mlx5_esw_allowed(esw))
+ if (!mlx5_esw_allowed(vport->dev->priv.eswitch))
return -EPERM;
disable = esw_qos_tc_bw_disabled(tc_bw);
- esw_qos_lock(esw);
+ esw_qos_devlink_lock(vport->dev);
if (!esw_qos_vport_validate_unsupported_tc_bw(vport, tc_bw)) {
NL_SET_ERR_MSG_MOD(extack,
@@ -1710,7 +1693,7 @@ int mlx5_esw_devlink_rate_leaf_tc_bw_set(struct devlink_rate *rate_leaf,
if (!err)
esw_qos_set_tc_arbiter_bw_shares(vport_node, tc_bw, extack);
unlock:
- esw_qos_unlock(esw);
+ esw_qos_devlink_unlock(vport->dev);
return err;
}
@@ -1720,18 +1703,17 @@ int mlx5_esw_devlink_rate_node_tc_bw_set(struct devlink_rate *rate_node,
struct netlink_ext_ack *extack)
{
struct mlx5_esw_sched_node *node = priv;
- struct mlx5_eswitch *esw = node->esw;
bool disable;
int err;
- if (!esw_qos_validate_unsupported_tc_bw(esw, tc_bw)) {
+ if (!esw_qos_validate_unsupported_tc_bw(node->esw, tc_bw)) {
NL_SET_ERR_MSG_MOD(extack,
"E-Switch traffic classes number is not supported");
return -EOPNOTSUPP;
}
disable = esw_qos_tc_bw_disabled(tc_bw);
- esw_qos_lock(esw);
+ esw_qos_devlink_lock(node->esw->dev);
if (disable) {
err = esw_qos_node_disable_tc_arbitration(node, extack);
goto unlock;
@@ -1741,7 +1723,7 @@ int mlx5_esw_devlink_rate_node_tc_bw_set(struct devlink_rate *rate_node,
if (!err)
esw_qos_set_tc_arbiter_bw_shares(node, tc_bw, extack);
unlock:
- esw_qos_unlock(esw);
+ esw_qos_devlink_unlock(node->esw->dev);
return err;
}
@@ -1756,9 +1738,9 @@ int mlx5_esw_devlink_rate_node_tx_share_set(struct devlink_rate *rate_node, void
if (err)
return err;
- esw_qos_lock(esw);
+ esw_qos_devlink_lock(esw->dev);
err = esw_qos_set_node_min_rate(node, tx_share, extack);
- esw_qos_unlock(esw);
+ esw_qos_devlink_unlock(esw->dev);
return err;
}
@@ -1773,9 +1755,9 @@ int mlx5_esw_devlink_rate_node_tx_max_set(struct devlink_rate *rate_node, void *
if (err)
return err;
- esw_qos_lock(esw);
+ esw_qos_devlink_lock(esw->dev);
err = esw_qos_sched_elem_config(node, tx_max, node->bw_share, extack);
- esw_qos_unlock(esw);
+ esw_qos_devlink_unlock(esw->dev);
return err;
}
@@ -1790,7 +1772,7 @@ int mlx5_esw_devlink_rate_node_new(struct devlink_rate *rate_node, void **priv,
if (IS_ERR(esw))
return PTR_ERR(esw);
- esw_qos_lock(esw);
+ esw_qos_devlink_lock(esw->dev);
if (esw->mode != MLX5_ESWITCH_OFFLOADS) {
NL_SET_ERR_MSG_MOD(extack,
"Rate node creation supported only in switchdev mode");
@@ -1803,10 +1785,9 @@ int mlx5_esw_devlink_rate_node_new(struct devlink_rate *rate_node, void **priv,
err = PTR_ERR(node);
goto unlock;
}
-
*priv = node;
unlock:
- esw_qos_unlock(esw);
+ esw_qos_devlink_unlock(esw->dev);
return err;
}
@@ -1816,10 +1797,11 @@ int mlx5_esw_devlink_rate_node_del(struct devlink_rate *rate_node, void *priv,
struct mlx5_esw_sched_node *node = priv;
struct mlx5_eswitch *esw = node->esw;
- esw_qos_lock(esw);
+ esw_qos_devlink_lock(esw->dev);
__esw_qos_destroy_node(node, extack);
esw_qos_put(esw);
- esw_qos_unlock(esw);
+ esw_qos_devlink_unlock(esw->dev);
+
return 0;
}
@@ -1836,7 +1818,6 @@ mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport,
return -EOPNOTSUPP;
}
- esw_qos_lock(esw);
if (!vport->qos.sched_node && parent) {
enum sched_node_type type;
@@ -1849,7 +1830,7 @@ mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport,
parent ? : esw->qos.root,
extack);
}
- esw_qos_unlock(esw);
+
return err;
}
@@ -1862,14 +1843,11 @@ int mlx5_esw_devlink_rate_leaf_parent_set(struct devlink_rate *devlink_rate,
struct mlx5_vport *vport = priv;
int err;
+ esw_qos_devlink_lock(vport->dev);
err = mlx5_esw_qos_vport_update_parent(vport, node, extack);
- if (!err) {
- struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
-
- esw_qos_lock(esw);
+ if (!err)
esw_vport_qos_prune_empty(vport);
- esw_qos_unlock(esw);
- }
+ esw_qos_devlink_unlock(vport->dev);
return err;
}
@@ -1996,7 +1974,7 @@ static int mlx5_esw_qos_node_update_parent(struct mlx5_esw_sched_node *node,
struct mlx5_eswitch *esw = node->esw;
int err;
- esw_qos_lock(esw);
+ esw_qos_devlink_lock(esw->dev);
curr_parent = node->parent;
if (!parent)
parent = esw->qos.root;
@@ -2019,8 +1997,7 @@ static int mlx5_esw_qos_node_update_parent(struct mlx5_esw_sched_node *node,
esw_qos_normalize_min_rate(parent, extack);
out:
- esw_qos_unlock(esw);
-
+ esw_qos_devlink_unlock(esw->dev);
return err;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.h b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.h
index 0a50982b0e27..f275e850d2c9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.h
@@ -6,9 +6,6 @@
#ifdef CONFIG_MLX5_ESWITCH
-int mlx5_esw_qos_init(struct mlx5_eswitch *esw);
-void mlx5_esw_qos_cleanup(struct mlx5_eswitch *esw);
-
int mlx5_esw_qos_set_vport_rate(struct mlx5_vport *evport, u32 max_rate, u32 min_rate);
bool mlx5_esw_qos_get_vport_rate(struct mlx5_vport *vport, u32 *max_rate, u32 *min_rate);
void mlx5_esw_qos_vport_disable(struct mlx5_vport *vport);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index b67f15a8f766..b6e2c153b4f7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1885,10 +1885,6 @@ int mlx5_eswitch_enable_locked(struct mlx5_eswitch *esw, int num_vfs)
MLX5_NB_INIT(&esw->nb, eswitch_vport_event, NIC_VPORT_CHANGE);
mlx5_eq_notifier_register(esw->dev, &esw->nb);
- err = mlx5_esw_qos_init(esw);
- if (err)
- goto err_esw_init;
-
if (esw->mode == MLX5_ESWITCH_LEGACY) {
err = esw_legacy_enable(esw);
} else {
@@ -2555,9 +2551,6 @@ int mlx5_eswitch_init(struct mlx5_core_dev *dev)
goto reps_err;
esw->mode = MLX5_ESWITCH_LEGACY;
- err = mlx5_esw_qos_init(esw);
- if (err)
- goto reps_err;
mutex_init(&esw->offloads.encap_tbl_lock);
hash_init(esw->offloads.encap_tbl);
@@ -2612,7 +2605,6 @@ void mlx5_eswitch_cleanup(struct mlx5_eswitch *esw)
mlx5_eswitch_invalidate_wq(esw);
destroy_workqueue(esw->work_queue);
- mlx5_esw_qos_cleanup(esw);
WARN_ON(refcount_read(&esw->qos.refcnt));
mutex_destroy(&esw->state_lock);
WARN_ON(!xa_empty(&esw->offloads.vhca_map));
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 10c4eacd43b4..c655f6e8da1c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -234,8 +234,10 @@ struct mlx5_vport {
struct mlx5_vport_info info;
- /* Protected with the E-Switch qos domain lock. The Vport QoS can
- * either be disabled (sched_node is NULL) or in one of three states:
+ /* Protected by either the shared devlink (dev->shd) lock or by
+ * esw->state_lock. See esw_assert_qos_lock_held() for more details.
+ * The Vport QoS can either be disabled (sched_node is NULL) or in one
+ * of three states:
* 1. Regular QoS (sched_node is a vport node).
* 2. TC QoS enabled on the vport (sched_node is a TC arbiter).
* 3. TC QoS enabled on the vport's parent node
@@ -382,7 +384,6 @@ enum {
};
struct dentry;
-struct mlx5_qos_domain;
struct mlx5_eswitch {
struct mlx5_core_dev *dev;
@@ -411,11 +412,13 @@ struct mlx5_eswitch {
atomic64_t user_count;
wait_queue_head_t work_queue_wait;
- /* Protected with the E-Switch qos domain lock. */
+ /* QoS changes are serialized by either the shared devlink (dev->shd)
+ * lock or by esw->state_lock. See esw_assert_qos_lock_held() for more
+ * details.
+ */
struct {
/* Initially 0, meaning no QoS users and QoS is disabled. */
refcount_t refcnt;
- struct mlx5_qos_domain *domain;
/* The root node of the hierarchy. */
struct mlx5_esw_sched_node *root;
} qos;
--
2.44.0
* [PATCH net-next V10 12/14] net/mlx5: qos: Support cross-device tx scheduling
From: Tariq Toukan @ 2026-07-01 7:32 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
netdev, Paolo Abeni
Cc: Adithya Jayachandran, Bobby Eshleman, Carolina Jubran,
Cosmin Ratiu, Daniel Borkmann, Daniel Jurgens, Daniel Zahka,
David Wei, Donald Hunter, Dragos Tatulea, Jiri Pirko, Jiri Pirko,
Joe Damato, Jonathan Corbet, Kees Cook, Leon Romanovsky,
linux-doc, linux-kernel, linux-kselftest, linux-rdma, Mark Bloch,
Moshe Shemesh, Or Har-Toov, Parav Pandit, Petr Machata,
Ratheesh Kannoth, Saeed Mahameed, Shahar Shitrit, Shay Drori,
Shuah Khan, Shuah Khan, Simon Horman, Stanislav Fomichev,
Tariq Toukan, Willem de Bruijn, Gal Pressman
From: Cosmin Ratiu <cratiu@nvidia.com>
Up to now, rate groups could only contain vports from the same E-Switch.
This patch relaxes that restriction if the device supports it
(HCA_CAP.esw_cross_esw_sched == true) and the right conditions are met:
- Link Aggregation (LAG) is enabled.
- The E-Switches are from the same shared devlink device.
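For readers who want to reason about the conditions above without reading
the C, here is a minimal Python model of the check order used by
mlx5_esw_validate_cross_esw_scheduling() in the diff below. The Esw class
and its field names are hypothetical stand-ins for driver state, not
driver API; only the order of the checks and the error strings are taken
from the patch.

```python
import errno
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Esw:
    """Hypothetical stand-in for the E-Switch state the checks consult."""
    name: str
    cross_esw_cap: bool   # models HCA_CAP.esw_cross_esw_sched
    shd: Optional[str]    # models the shared devlink device, None if absent
    lag_active: bool      # models whether LAG is currently active


def validate_cross_esw(esw: Esw, parent_esw: Optional[Esw]) -> Tuple[int, str]:
    """Return (0, "") if the parent is acceptable, else (-EOPNOTSUPP, reason)."""
    # No parent, or a parent on the same E-Switch: nothing to validate.
    if parent_esw is None or parent_esw is esw:
        return 0, ""
    if not esw.cross_esw_cap:
        return -errno.EOPNOTSUPP, "Cross E-Switch scheduling is not supported"
    if esw.shd is None or esw.shd != parent_esw.shd:
        return (-errno.EOPNOTSUPP,
                "Cannot add vport to a parent belonging to a different device")
    if not esw.lag_active:
        return (-errno.EOPNOTSUPP,
                "Cross E-Switch scheduling requires LAG to be activated")
    return 0, ""
```

Note that the capability is checked first, so a device without the
capability reports "not supported" regardless of LAG state.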
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/esw/qos.c | 120 +++++++++++++-----
1 file changed, 85 insertions(+), 35 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
index 80a28596349b..0d20f51b9702 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
@@ -45,7 +45,9 @@ struct mlx5_esw_sched_node {
enum sched_node_type type;
/* The eswitch this node belongs to. */
struct mlx5_eswitch *esw;
- /* The children nodes of this node, empty list for leaf nodes. */
+ /* The children nodes of this node, empty list for leaf nodes.
+ * Can be from multiple E-Switches.
+ */
struct list_head children;
/* Valid only if this node is associated with a vport. */
struct mlx5_vport *vport;
@@ -447,6 +449,7 @@ esw_qos_vport_create_sched_element(struct mlx5_esw_sched_node *vport_node,
struct mlx5_esw_sched_node *parent = vport_node->parent;
u32 sched_ctx[MLX5_ST_SZ_DW(scheduling_context)] = {};
struct mlx5_core_dev *dev = vport_node->esw->dev;
+ struct mlx5_vport *vport = vport_node->vport;
void *attr;
if (!mlx5_qos_element_type_supported(
@@ -458,10 +461,17 @@ esw_qos_vport_create_sched_element(struct mlx5_esw_sched_node *vport_node,
MLX5_SET(scheduling_context, sched_ctx, element_type,
SCHEDULING_CONTEXT_ELEMENT_TYPE_VPORT);
attr = MLX5_ADDR_OF(scheduling_context, sched_ctx, element_attributes);
- MLX5_SET(vport_element, attr, vport_number, vport_node->vport->vport);
+ MLX5_SET(vport_element, attr, vport_number, vport->vport);
MLX5_SET(scheduling_context, sched_ctx, parent_element_id, parent->ix);
MLX5_SET(scheduling_context, sched_ctx, max_average_bw,
vport_node->max_rate);
+ if (vport->dev != dev) {
+ /* The port is assigned to a node on another eswitch. */
+ MLX5_SET(vport_element, attr, eswitch_owner_vhca_id_valid,
+ true);
+ MLX5_SET(vport_element, attr, eswitch_owner_vhca_id,
+ MLX5_CAP_GEN(vport->dev, vhca_id));
+ }
return esw_qos_node_create_sched_element(vport_node, sched_ctx, extack);
}
@@ -473,6 +483,7 @@ esw_qos_vport_tc_create_sched_element(struct mlx5_esw_sched_node *vport_tc_node,
{
u32 sched_ctx[MLX5_ST_SZ_DW(scheduling_context)] = {};
struct mlx5_core_dev *dev = vport_tc_node->esw->dev;
+ struct mlx5_vport *vport = vport_tc_node->vport;
void *attr;
if (!mlx5_qos_element_type_supported(
@@ -484,8 +495,7 @@ esw_qos_vport_tc_create_sched_element(struct mlx5_esw_sched_node *vport_tc_node,
MLX5_SET(scheduling_context, sched_ctx, element_type,
SCHEDULING_CONTEXT_ELEMENT_TYPE_VPORT_TC);
attr = MLX5_ADDR_OF(scheduling_context, sched_ctx, element_attributes);
- MLX5_SET(vport_tc_element, attr, vport_number,
- vport_tc_node->vport->vport);
+ MLX5_SET(vport_tc_element, attr, vport_number, vport->vport);
MLX5_SET(vport_tc_element, attr, traffic_class, vport_tc_node->tc);
MLX5_SET(scheduling_context, sched_ctx, max_bw_obj_id,
rate_limit_elem_ix);
@@ -493,6 +503,13 @@ esw_qos_vport_tc_create_sched_element(struct mlx5_esw_sched_node *vport_tc_node,
vport_tc_node->parent->ix);
MLX5_SET(scheduling_context, sched_ctx, bw_share,
vport_tc_node->bw_share);
+ if (vport->dev != dev) {
+ /* The port is assigned to a node on another eswitch. */
+ MLX5_SET(vport_tc_element, attr, eswitch_owner_vhca_id_valid,
+ true);
+ MLX5_SET(vport_tc_element, attr, eswitch_owner_vhca_id,
+ MLX5_CAP_GEN(vport->dev, vhca_id));
+ }
return esw_qos_node_create_sched_element(vport_tc_node, sched_ctx,
extack);
@@ -1062,8 +1079,9 @@ static int esw_qos_vport_enable(struct mlx5_vport *vport,
vport_node->type = type;
esw_qos_normalize_min_rate(parent, extack);
- trace_mlx5_esw_vport_qos_create(vport->dev, vport, vport_node->max_rate,
- vport_node->bw_share);
+ trace_mlx5_esw_vport_qos_create(vport_node->esw->dev, vport,
+ vport_node->bw_share,
+ vport_node->max_rate);
return 0;
}
@@ -1202,6 +1220,28 @@ static int esw_qos_vport_tc_check_type(enum sched_node_type curr_type,
return 0;
}
+static bool esw_qos_validate_unsupported_tc_bw(struct mlx5_eswitch *esw,
+ u32 *tc_bw)
+{
+ int i, num_tcs = esw_qos_num_tcs(esw->dev);
+
+ for (i = num_tcs; i < DEVLINK_RATE_TCS_MAX; i++)
+ if (tc_bw[i])
+ return false;
+
+ return true;
+}
+
+static bool esw_qos_vport_validate_unsupported_tc_bw(struct mlx5_vport *vport,
+ u32 *tc_bw)
+{
+ struct mlx5_esw_sched_node *node = vport->qos.sched_node;
+ struct mlx5_eswitch *esw = node ?
+ node->parent->esw : vport->dev->priv.eswitch;
+
+ return esw_qos_validate_unsupported_tc_bw(esw, tc_bw);
+}
+
static int esw_qos_vport_update(struct mlx5_vport *vport,
enum sched_node_type type,
struct mlx5_esw_sched_node *parent,
@@ -1221,8 +1261,15 @@ static int esw_qos_vport_update(struct mlx5_vport *vport,
if (err)
return err;
- if (curr_type == SCHED_NODE_TYPE_TC_ARBITER_TSAR && curr_type == type)
+ if (curr_type == SCHED_NODE_TYPE_TC_ARBITER_TSAR && curr_type == type) {
esw_qos_tc_arbiter_get_bw_shares(vport_node, curr_tc_bw);
+ if (!esw_qos_validate_unsupported_tc_bw(parent->esw,
+ curr_tc_bw)) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Unsupported traffic classes on the new device");
+ return -EOPNOTSUPP;
+ }
+ }
esw_qos_vport_disable(vport, extack);
@@ -1550,29 +1597,6 @@ static int esw_qos_devlink_rate_to_mbps(struct mlx5_core_dev *mdev, const char *
return 0;
}
-static bool esw_qos_validate_unsupported_tc_bw(struct mlx5_eswitch *esw,
- u32 *tc_bw)
-{
- int i, num_tcs = esw_qos_num_tcs(esw->dev);
-
- for (i = num_tcs; i < DEVLINK_RATE_TCS_MAX; i++) {
- if (tc_bw[i])
- return false;
- }
-
- return true;
-}
-
-static bool esw_qos_vport_validate_unsupported_tc_bw(struct mlx5_vport *vport,
- u32 *tc_bw)
-{
- struct mlx5_esw_sched_node *node = vport->qos.sched_node;
- struct mlx5_eswitch *esw = node ?
- node->parent->esw : vport->dev->priv.eswitch;
-
- return esw_qos_validate_unsupported_tc_bw(esw, tc_bw);
-}
-
static bool esw_qos_tc_bw_disabled(u32 *tc_bw)
{
int i;
@@ -1805,18 +1829,44 @@ int mlx5_esw_devlink_rate_node_del(struct devlink_rate *rate_node, void *priv,
return 0;
}
+static int
+mlx5_esw_validate_cross_esw_scheduling(struct mlx5_eswitch *esw,
+ struct mlx5_esw_sched_node *parent,
+ struct netlink_ext_ack *extack)
+{
+ if (!parent || esw == parent->esw)
+ return 0;
+
+ if (!MLX5_CAP_QOS(esw->dev, esw_cross_esw_sched)) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Cross E-Switch scheduling is not supported");
+ return -EOPNOTSUPP;
+ }
+ if (!esw->dev->shd || esw->dev->shd != parent->esw->dev->shd) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Cannot add vport to a parent belonging to a different device");
+ return -EOPNOTSUPP;
+ }
+ if (!mlx5_lag_is_active(esw->dev)) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Cross E-Switch scheduling requires LAG to be activated");
+ return -EOPNOTSUPP;
+ }
+
+ return 0;
+}
+
static int
mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport,
struct mlx5_esw_sched_node *parent,
struct netlink_ext_ack *extack)
{
struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
- int err = 0;
+ int err;
- if (parent && parent->esw != esw) {
- NL_SET_ERR_MSG_MOD(extack, "Cross E-Switch scheduling is not supported");
- return -EOPNOTSUPP;
- }
+ err = mlx5_esw_validate_cross_esw_scheduling(esw, parent, extack);
+ if (err)
+ return err;
if (!vport->qos.sched_node && parent) {
enum sched_node_type type;
--
2.44.0
* [PATCH net-next V10 13/14] selftests: drv-net: Add test for cross-esw rate scheduling
From: Tariq Toukan @ 2026-07-01 7:32 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
netdev, Paolo Abeni
Cc: Adithya Jayachandran, Bobby Eshleman, Carolina Jubran,
Cosmin Ratiu, Daniel Borkmann, Daniel Jurgens, Daniel Zahka,
David Wei, Donald Hunter, Dragos Tatulea, Jiri Pirko, Jiri Pirko,
Joe Damato, Jonathan Corbet, Kees Cook, Leon Romanovsky,
linux-doc, linux-kernel, linux-kselftest, linux-rdma, Mark Bloch,
Moshe Shemesh, Or Har-Toov, Parav Pandit, Petr Machata,
Ratheesh Kannoth, Saeed Mahameed, Shahar Shitrit, Shay Drori,
Shuah Khan, Shuah Khan, Simon Horman, Stanislav Fomichev,
Tariq Toukan, Willem de Bruijn, Gal Pressman
From: Cosmin Ratiu <cratiu@nvidia.com>
Add a Python selftest using the YNL devlink API to verify the devlink
rate ops. The test requires a bond device given in the config as NETIF
containing two PFs. Test setup will then create 1 VF on each PF and
verify the various rate commands.
./devlink_rate_cross_esw.py
TAP version 13
1..3
ok 1 devlink_rate_cross_esw.test_same_esw_parent
ok 2 devlink_rate_cross_esw.test_cross_esw_parent
ok 3 devlink_rate_cross_esw.test_tx_rates_on_cross_esw
Tests will be skipped when the preconditions aren't met, when the
devlink API is too old or when the devices don't appear to support
cross-esw scheduling (detected via EOPNOTSUPP).
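The skip policy described above can be summarized as a small standalone
sketch. The classify() helper is hypothetical and exists only for
illustration; the test itself raises KsftSkipEx/KsftFailEx from its rate
helpers. The sketch assumes positive errno values, as compared against
e.error in the test.

```python
import errno


def classify(err: int, cross_device: bool = False) -> str:
    """Map an errno from a rate command to 'pass', 'skip' or 'fail'."""
    if err == 0:
        return "pass"
    # The driver or device does not implement the operation.
    if err == errno.EOPNOTSUPP:
        return "skip"
    # ENODEV counts as "devlink API too old" only when a cross-device
    # parent was requested; otherwise it is a real failure.
    if cross_device and err == errno.ENODEV:
        return "skip"
    return "fail"
```

ENODEV is only treated as a skip for cross-device requests, so a same-esw
command that cannot find its device still fails the test.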
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../testing/selftests/drivers/net/hw/Makefile | 1 +
.../drivers/net/hw/devlink_rate_cross_esw.py | 296 ++++++++++++++++++
2 files changed, 297 insertions(+)
create mode 100755 tools/testing/selftests/drivers/net/hw/devlink_rate_cross_esw.py
diff --git a/tools/testing/selftests/drivers/net/hw/Makefile b/tools/testing/selftests/drivers/net/hw/Makefile
index fd0535a96d84..234db5c2c90c 100644
--- a/tools/testing/selftests/drivers/net/hw/Makefile
+++ b/tools/testing/selftests/drivers/net/hw/Makefile
@@ -20,6 +20,7 @@ TEST_GEN_FILES := \
TEST_PROGS = \
csum.py \
devlink_port_split.py \
+ devlink_rate_cross_esw.py \
devlink_rate_tc_bw.py \
devmem.py \
ethtool.sh \
diff --git a/tools/testing/selftests/drivers/net/hw/devlink_rate_cross_esw.py b/tools/testing/selftests/drivers/net/hw/devlink_rate_cross_esw.py
new file mode 100755
index 000000000000..4416f024cb76
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/hw/devlink_rate_cross_esw.py
@@ -0,0 +1,296 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"""
+Devlink Rate Cross-eswitch Scheduling Test Suite
+==================================================
+
+Control-plane tests for cross-eswitch TX scheduling via devlink-rate.
+Validates that VFs from different PFs on the same chip can share
+rate groups using the cross-device parent-dev attribute.
+
+Preconditions:
+- NETIF points to a bond device with exactly two interfaces.
+- the interfaces must be two PFs from different devices sharing the same chip.
+- (for mlx5): the two interfaces are in switchdev mode and configured in a LAG:
+  - devlink dev eswitch set $DEV1 mode switchdev
+  - devlink dev eswitch set $DEV2 mode switchdev
+  - devlink dev param set $DEV1 name esw_multiport value 1 cmode runtime
+  - devlink dev param set $DEV2 name esw_multiport value 1 cmode runtime
+- test cases will be skipped if:
+  - the number of interfaces in the bond device is != 2.
+  - the kernel doesn't support devlink rates.
+  - the devlink API doesn't support cross-device parents (ENODEV).
+  - cross-esw rate scheduling returns EOPNOTSUPP.
+"""
+
+import errno
+import glob
+import os
+import time
+
+from lib.py import ksft_pr, ksft_eq, ksft_run, ksft_exit
+from lib.py import KsftSkipEx, KsftFailEx
+from lib.py import NetDrvEnv, DevlinkFamily
+from lib.py import NlError
+from lib.py import cmd, defer, ip, tool
+
+
+# --- Discovery and setup ---
+
+
+def get_bond_slaves(bond_ifname):
+    """Returns sorted list of slave netdev names for a bond."""
+    pattern = f"/sys/class/net/{bond_ifname}/lower_*"
+    lowers = glob.glob(pattern)
+    if not lowers:
+        raise KsftSkipEx(f"No bond slaves for {bond_ifname}")
+    slaves = []
+    for path in sorted(lowers):
+        name = os.path.basename(path)
+        if name.startswith("lower_"):
+            name = name[len("lower_"):]
+        slaves.append(name)
+    return slaves
+
+
+def discover_pfs(cfg):
+    """Discovers both PFs from bond slaves."""
+    slaves = get_bond_slaves(cfg.ifname)
+    if len(slaves) != 2:
+        raise KsftSkipEx(f"Need 2 bond slaves, found {len(slaves)}")
+
+    pf0, pf1 = slaves[0], slaves[1]
+    ksft_pr(f"PF0: {pf0} PF1: {pf1}")
+    return pf0, pf1
+
+
+def get_pci_addr(ifname):
+    """Resolves PCI address for a network interface."""
+    return os.path.basename(os.path.realpath(f"/sys/class/net/{ifname}/device"))
+
+
+def get_vf_port_index(pf_pci):
+    """Finds devlink port-index for vf0 under pf_pci."""
+    ports = tool("devlink", "port show", json=True)["port"]
+    for port_name, props in ports.items():
+        if port_name.startswith(f"pci/{pf_pci}/") and props.get("vfnum") == 0:
+            return int(port_name.split("/")[-1])
+    raise KsftSkipEx(f"VF port not found for {pf_pci}")
+
+
+def cleanup_esw(pf):
+    """Removes VFs if created by tests."""
+    cmd(f"echo 0 > /sys/class/net/{pf}/device/sriov_numvfs", shell=True, fail=False)
+
+
+def setup_esw(pf):
+    """Creates 1 VF on 'pf'."""
+    path = f"/sys/class/net/{pf}/device/sriov_numvfs"
+    cmd(f"echo 0 > {path}", shell=True)
+    cmd(f"echo 1 > {path}", shell=True)
+    defer(cleanup_esw, pf)
+    time.sleep(2)
+
+    vf_dir = f"/sys/class/net/{pf}/device/virtfn0/net"
+    entries = os.listdir(vf_dir) if os.path.isdir(vf_dir) else []
+    if not entries:
+        raise KsftSkipEx(f"VF not found for {pf}")
+    ip(f"link set dev {entries[0]} up")
+
+    pf_pci = get_pci_addr(pf)
+    vf_idx = get_vf_port_index(pf_pci)
+    ksft_pr(f"Created VF {vf_idx} on PF {pf} ({pf_pci})")
+    return pf_pci, vf_idx
+
+
+# --- Rate operation helpers ---
+
+
+def rate_new(devnl, dev_pci, node_name, **kwargs):
+    """Creates rate node."""
+    params = {
+        "bus-name": "pci",
+        "dev-name": dev_pci,
+        "rate-node-name": node_name,
+    }
+    params.update(kwargs)
+    try:
+        devnl.rate_new(params)
+    except NlError as e:
+        if e.error == errno.EOPNOTSUPP:
+            raise KsftSkipEx("rate_new not supported") from e
+        raise KsftFailEx("rate_new failed") from e
+
+
+def rate_get(devnl, dev_pci, node_name):
+    """Gets rate node."""
+    params = {
+        "bus-name": "pci",
+        "dev-name": dev_pci,
+        "rate-node-name": node_name,
+    }
+    return devnl.rate_get(params)
+
+
+def rate_get_leaf(devnl, dev_pci, port_index):
+    """Gets rate leaf (VF)."""
+    params = {
+        "bus-name": "pci",
+        "dev-name": dev_pci,
+        "port-index": port_index,
+    }
+    return devnl.rate_get(params)
+
+
+def rate_del(devnl, dev_pci, node_name):
+    """Deletes rate node."""
+    devnl.rate_del({
+        "bus-name": "pci",
+        "dev-name": dev_pci,
+        "rate-node-name": node_name,
+    })
+
+
+def rate_set_leaf(devnl, dev_pci, port_index, **kwargs):
+    """Sets rate attributes on a leaf (VF)."""
+    params = {
+        "bus-name": "pci",
+        "dev-name": dev_pci,
+        "port-index": port_index,
+    }
+    params.update(kwargs)
+    try:
+        devnl.rate_set(params)
+    except NlError as e:
+        if e.error == errno.EOPNOTSUPP:
+            raise KsftSkipEx("rate_set not supported") from e
+        raise KsftFailEx("rate_set failed") from e
+
+
+def rate_set_leaf_parent(devnl, dev_pci, port_index,
+                         parent_name, parent_dev_pci=None):
+    """Sets a leaf's parent, optionally cross-esw."""
+    params = {
+        "bus-name": "pci",
+        "dev-name": dev_pci,
+        "port-index": port_index,
+        "rate-parent-node-name": parent_name,
+    }
+    if parent_dev_pci:
+        params["parent-dev"] = {
+            "bus-name": "pci",
+            "dev-name": parent_dev_pci,
+        }
+    try:
+        devnl.rate_set(params)
+    except NlError as e:
+        if e.error == errno.EOPNOTSUPP:
+            raise KsftSkipEx("rate_set not supported") from e
+        if parent_dev_pci and e.error == errno.ENODEV:
+            raise KsftSkipEx("Cross-esw scheduling not supported") from e
+        raise KsftFailEx("rate_set failed") from e
+
+
+def rate_clear_leaf_parent(devnl, dev_pci, port_index):
+    """Clears a leaf's parent."""
+    rate_set_leaf_parent(devnl, dev_pci, port_index, "")
+
+
+def rate_set_node(devnl, dev_pci, node_name, **kwargs):
+    """Sets rate attributes on a node."""
+    params = {
+        "bus-name": "pci",
+        "dev-name": dev_pci,
+        "rate-node-name": node_name,
+    }
+    params.update(kwargs)
+    devnl.rate_set(params)
+
+
+# --- Test cases ---
+
+
+def test_same_esw_parent(cfg):
+    """Assigns PF0's VF to PF0's group (same esw baseline)."""
+    pf0, _ = discover_pfs(cfg)
+    pf0_pci, vf0_idx = setup_esw(pf0)
+
+    rate_new(cfg.devnl, pf0_pci, "group0")
+    defer(rate_del, cfg.devnl, pf0_pci, "group0")
+    ksft_pr("rate-new succeeded")
+
+    rate_set_leaf_parent(cfg.devnl, pf0_pci, vf0_idx, "group0")
+    defer(rate_clear_leaf_parent, cfg.devnl, pf0_pci, vf0_idx)
+
+    ksft_pr("Same-esw parent assignment succeeded")
+
+
+def test_cross_esw_parent(cfg):
+    """Sets cross-esw parent, then clears it."""
+    pf0, pf1 = discover_pfs(cfg)
+    pf0_pci, _ = setup_esw(pf0)
+    pf1_pci, vf1_idx = setup_esw(pf1)
+
+    rate_new(cfg.devnl, pf0_pci, "group1")
+    defer(rate_del, cfg.devnl, pf0_pci, "group1")
+    ksft_pr("rate-new succeeded")
+
+    rate_set_leaf_parent(cfg.devnl, pf1_pci, vf1_idx,
+                         "group1", parent_dev_pci=pf0_pci)
+    defer(rate_clear_leaf_parent, cfg.devnl, pf1_pci, vf1_idx)
+
+    ksft_pr("Cross-esw parent set and clear succeeded")
+
+
+def test_tx_rates_on_cross_esw(cfg):
+    """Sets tx_max on group and tx_share on leaves in a cross-esw setup."""
+    pf0, pf1 = discover_pfs(cfg)
+    pf0_pci, vf0_idx = setup_esw(pf0)
+    pf1_pci, vf1_idx = setup_esw(pf1)
+
+    rate_new(cfg.devnl, pf0_pci, "group2", **{"rate-tx-max": 10000000})
+    defer(rate_del, cfg.devnl, pf0_pci, "group2")
+    ksft_pr("rate-new succeeded")
+
+    rate_set_leaf_parent(cfg.devnl, pf1_pci, vf1_idx,
+                         "group2", parent_dev_pci=pf0_pci)
+    defer(rate_clear_leaf_parent, cfg.devnl, pf1_pci, vf1_idx)
+    ksft_pr("set parent cross-esw succeeded")
+
+    rate_set_leaf_parent(cfg.devnl, pf0_pci, vf0_idx, "group2")
+    defer(rate_clear_leaf_parent, cfg.devnl, pf0_pci, vf0_idx)
+    ksft_pr("set parent same esw succeeded")
+
+    rate_set_leaf(cfg.devnl, pf0_pci, vf0_idx, **{"rate-tx-share": 1000000})
+    rate = rate_get_leaf(cfg.devnl, pf0_pci, vf0_idx)
+    ksft_eq(rate["rate-tx-share"], 1000000)
+    rate_set_leaf(cfg.devnl, pf1_pci, vf1_idx, **{"rate-tx-share": 2000000})
+    rate = rate_get_leaf(cfg.devnl, pf1_pci, vf1_idx)
+    ksft_eq(rate["rate-tx-share"], 2000000)
+    rate_set_node(cfg.devnl, pf0_pci, "group2", **{"rate-tx-max": 250000000})
+    rate = rate_get(cfg.devnl, pf0_pci, "group2")
+    ksft_eq(rate["rate-tx-max"], 250000000)
+
+    ksft_pr("tx_max and tx_share set on cross-esw group")
+
+
+def main() -> None:
+    """Main function."""
+
+    with NetDrvEnv(__file__, nsim_test=False) as cfg:
+        cfg.devnl = DevlinkFamily()
+
+        ksft_run(
+            cases=[
+                test_same_esw_parent,
+                test_cross_esw_parent,
+                test_tx_rates_on_cross_esw,
+            ],
+            args=(cfg,),
+        )
+    ksft_exit()
+
+
+if __name__ == "__main__":
+    main()
--
2.44.0
* [PATCH net-next V10 14/14] net/mlx5: Document devlink rates
From: Tariq Toukan @ 2026-07-01 7:32 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
netdev, Paolo Abeni
Cc: Adithya Jayachandran, Bobby Eshleman, Carolina Jubran,
Cosmin Ratiu, Daniel Borkmann, Daniel Jurgens, Daniel Zahka,
David Wei, Donald Hunter, Dragos Tatulea, Jiri Pirko, Jiri Pirko,
Joe Damato, Jonathan Corbet, Kees Cook, Leon Romanovsky,
linux-doc, linux-kernel, linux-kselftest, linux-rdma, Mark Bloch,
Moshe Shemesh, Or Har-Toov, Parav Pandit, Petr Machata,
Ratheesh Kannoth, Saeed Mahameed, Shahar Shitrit, Shay Drori,
Shuah Khan, Shuah Khan, Simon Horman, Stanislav Fomichev,
Tariq Toukan, Willem de Bruijn, Gal Pressman
From: Cosmin Ratiu <cratiu@nvidia.com>
It seems rates were not documented in the mlx5-specific file, so add
examples on how to limit VFs and groups and also provide an example of
the intended way to achieve cross-esw scheduling.
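As a reading aid for the examples added below, here is a small Python
sketch of how the handles in those examples decompose and how a
cross-device parent differs from a local one. The bus/dev/name layout is
inferred only from the examples in this patch; parse_handle() and
is_cross_device() are hypothetical helpers, not part of devlink or
iproute2.

```python
from typing import NamedTuple


class RateHandle(NamedTuple):
    bus: str
    dev: str
    name: str  # port index for a leaf, node name for a group


def parse_handle(handle: str) -> RateHandle:
    """Split 'pci/0000:82:00.0/group1' into bus, device and name."""
    bus, dev, name = handle.split("/", 2)
    return RateHandle(bus, dev, name)


def is_cross_device(leaf: str, parent: str) -> bool:
    """True if 'parent' is fully qualified and names another device."""
    if "/" not in parent:
        # A bare node name refers to a group on the leaf's own device.
        return False
    l, p = parse_handle(leaf), parse_handle(parent)
    return (l.bus, l.dev) != (p.bus, p.dev)
```

With the documented commands, "parent group1" stays on the same device,
while "parent pci/0000:82:00.0/group1" set on a port of 0000:82:00.1 is
the cross-device case.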
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
Documentation/networking/devlink/mlx5.rst | 33 +++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
index 4bba4d780a4a..cf1dffa67669 100644
--- a/Documentation/networking/devlink/mlx5.rst
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -419,3 +419,36 @@ User commands examples:
.. note::
This command can run over all interfaces such as PF/VF and representor ports.
+
+Rates
+=====
+
+mlx5 devices can limit transmission of individual VFs or a group of them via
+the devlink-rate API in switchdev mode.
+
+User commands examples:
+
+- Print the existing rates::
+
+    $ devlink port function rate show
+
+- Set a max tx limit on traffic from VF0::
+
+    $ devlink port function rate set pci/0000:82:00.0/1 tx_max 10Gbit
+
+- Create a rate group with a max tx limit and add two VFs to it::
+
+    $ devlink port function rate add pci/0000:82:00.0/group1 tx_max 10Gbit
+    $ devlink port function rate set pci/0000:82:00.0/1 parent group1
+    $ devlink port function rate set pci/0000:82:00.0/2 parent group1
+
+- Same scenario, with a min guarantee of 20% of the bandwidth for the first VF::
+
+    $ devlink port function rate add pci/0000:82:00.0/group1 tx_max 10Gbit
+    $ devlink port function rate set pci/0000:82:00.0/1 parent group1 tx_share 2Gbit
+    $ devlink port function rate set pci/0000:82:00.0/2 parent group1
+
+- Cross-device scheduling::
+
+    $ devlink port function rate add pci/0000:82:00.0/group1 tx_max 10Gbit
+    $ devlink port function rate set pci/0000:82:00.1/32769 parent pci/0000:82:00.0/group1
--
2.44.0